
Creating a tibble from a corpus

To further explore the corpus on crude oil data that you received from a coworker, you have decided to create a pipeline to clean the text contained in the documents. Instead of exploring how to do this with the tm package, you have decided to transform the corpus into a tibble so you can use the functions unnest_tokens(), count(), and anti_join() that you are already familiar with. The corpus crude contains both the metadata and the text of each document.
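Before converting anything, it can help to look at what a single document in the corpus holds. The sketch below assumes that `crude` is the same object as the `crude` dataset bundled with the tm package (20 Reuters articles about crude oil); the exercise does not state where the corpus comes from, so treat that as an assumption.

```r
library(tm)

# Assumption: the exercise's `crude` matches the dataset shipped with tm
data("crude")

# Each element of the corpus is one document with its own metadata and text
first_doc <- crude[[1]]

# Document-level metadata (author, heading, id, and so on)
meta(first_doc)

# First 80 characters of the document text
substr(content(first_doc)[1], 1, 80)
```

Seeing metadata and text stored side by side in each document explains why the tibble version ends up with several metadata columns in addition to the text column.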

This exercise is part of the course

Introduction to Natural Language Processing in R

Instructions

  • Convert the corpus into a tibble.
  • Use names() to print out the column names.
  • Tokenize (by word), count, and remove stop words from the text column of crude_tibble.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Create a tibble & Review
crude_tibble <- ___(crude)
___(crude_tibble)

crude_counts <- crude_tibble %>%
  # Tokenize by word 
  ___(___, text) %>%
  # Count by word
  ___(word, sort = TRUE) %>%
  # Remove stop words
  ___(stop_words)
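One possible way to fill in the blanks is sketched below. It is not necessarily the course's reference solution. It assumes that `crude` is the dataset bundled with tm, that `tidy()` from tidytext performs the corpus-to-tibble conversion, and that `stop_words` is the lexicon shipped with tidytext. The `by = "word"` argument is an addition that makes the join column explicit.

```r
library(tm)
library(dplyr)
library(tidytext)

# Assumption: `crude` is the corpus bundled with tm
data("crude")

# Create a tibble & review: one row per document, metadata plus a text column
crude_tibble <- tidy(crude)
names(crude_tibble)

crude_counts <- crude_tibble %>%
  # Tokenize by word: new column `word`, built from the `text` column
  unnest_tokens(word, text) %>%
  # Count by word, most frequent first
  count(word, sort = TRUE) %>%
  # Remove stop words by dropping rows whose word appears in stop_words
  anti_join(stop_words, by = "word")
```

Because `anti_join()` only drops rows, running it after `count()` gives the same counts as running it before; it just filters a much smaller table.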