BoW Example
In literature reviews, researchers read and summarize as many available texts about a subject as possible. Sometimes they end up reading duplicate articles, or summaries of articles they have already read. You have been given 20 articles about crude oil as an R object named crude_tibble. Instead of jumping straight to reading each article, you have decided to see what words are shared across these articles. To do so, you will start by building a bag-of-words representation of the text.
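Before diving in, here is a minimal sketch of what a bag-of-words representation looks like, using a tiny toy tibble rather than the exercise data (the toy object and its two sentences are invented purely for illustration):

library(dplyr)
library(tidytext)

# Two tiny "documents", invented purely for illustration
toy <- tibble(
  doc_id = c(1, 2),
  text = c("Oil prices rose sharply", "OPEC said oil prices may fall again")
)

# Tokenize into words and count by document: one row per document/word pair
toy %>%
  unnest_tokens(output = "word", input = text) %>%
  count(doc_id, word, sort = TRUE)

Each row of the result is a document/word combination with its count, which is exactly the structure you will build from crude_tibble below.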
This exercise is part of the course Introduction to Natural Language Processing in R.
Exercise instructions
- Create a BoW representation by counting the number of words by article using the column article_id.
- Use the output to determine how many unique article/word combinations were created.
- Filter the results to mentions of 'prices'.
- How many articles have the word 'prices' used in them?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# dplyr and tidytext provide %>%, count(), unnest_tokens() and the stop_words data
library(dplyr)
library(tidytext)

# Count occurrence by article_id and word
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(___, ___, sort = TRUE)

# How many different word/article combinations are there?
unique_combinations <- nrow(___)

# Filter to responses with the word "prices"
words_with_prices <- words %>%
  ___(word == "___")

# How many articles had the word "prices"?
number_of_price_articles <- nrow(___)
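For reference, here is one way the blanks could be filled in, a sketch assuming crude_tibble has the columns article_id and text as described above:

library(dplyr)
library(tidytext)

# Count occurrence by article_id and word, after removing stop words
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words, by = "word") %>%
  count(article_id, word, sort = TRUE)

# How many different word/article combinations are there?
unique_combinations <- nrow(words)

# Filter to rows with the word "prices"
words_with_prices <- words %>%
  filter(word == "prices")

# How many articles had the word "prices"?
number_of_price_articles <- nrow(words_with_prices)

Because each article/word pair appears at most once in words, counting the rows that match "prices" gives the number of articles that mention the word.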