BoW Example

In literature reviews, researchers read and summarize as many available texts about a subject as possible. Sometimes they end up reading duplicate articles, or summaries of articles they have already read. You have been given 20 articles about crude oil as an R object named crude_tibble. Instead of jumping straight to reading each article, you have decided to see what words are shared across these articles. To do so, you will start by building a bag-of-words representation of the text.
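
As a quick illustration before you start, the sketch below builds a bag-of-words from a two-row toy tibble. It assumes the tidytext and dplyr packages are installed; the toy data simply mirrors the article_id/text shape of crude_tibble and is otherwise hypothetical.

# A minimal BoW sketch on hypothetical toy data (not the exercise dataset)
library(dplyr)
library(tidytext)

toy <- tibble(
  article_id = c(1, 2),
  text = c("Oil prices rose sharply", "Analysts expect crude prices to fall")
)

toy %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%  # one row per token
  anti_join(stop_words, by = "word") %>%                             # drop common stop words
  count(article_id, word, sort = TRUE)                               # word counts per article

Each row of the result is an article/word pair with its count, which is the same bag-of-words shape you will build for crude_tibble below.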

This exercise is part of the course Introduction to Natural Language Processing in R.

Exercise instructions

  • Create a BoW representation by counting the number of words by article using the column article_id.
  • Use the output to determine how many unique article/word combinations were created.
  • Filter the results to mentions of 'prices'.
  • How many articles use the word 'prices'?

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load the packages used below (tidytext for tokenizing, dplyr for counting)
library(dplyr)
library(tidytext)

# Count occurrence by article_id and word
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(___, ___, sort = TRUE)

# How many different word/article combinations are there?
unique_combinations <- nrow(___)

# Filter to rows with the word "prices"
words_with_prices <- words %>%
  ___(word == "___")

# How many articles had the word "prices"?
number_of_price_articles <- nrow(___)
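
One way to fill in the blanks is sketched below. It assumes crude_tibble has the article_id and text columns described above and that the tidytext and dplyr packages are available; treat it as a possible solution rather than the official one.

library(dplyr)
library(tidytext)

# Count occurrence by article_id and word
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words, by = "word") %>%
  count(article_id, word, sort = TRUE)

# How many different word/article combinations are there?
unique_combinations <- nrow(words)

# Filter to rows with the word "prices"
words_with_prices <- words %>%
  filter(word == "prices")

# How many articles had the word "prices"?
# count() collapses the data to one row per article/word pair, so the number
# of rows here equals the number of articles that mention "prices"
number_of_price_articles <- nrow(words_with_prices)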