Creating a tibble from a corpus
To further explore the corpus on crude oil data that you received from a coworker, you have decided to create a pipeline to clean the text contained in the documents. Instead of exploring how to do this with the tm package, you have decided to transform the corpus into a tibble so you can use the functions unnest_tokens(), count(), and anti_join() that you are already familiar with. The corpus crude contains both the metadata and the text of each document.
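To see what "both the metadata and the text" means in practice, the corpus can be inspected before converting it. The sketch below assumes that crude is the example corpus shipped with the tm package (loaded with data("crude")); the exercise environment may provide it differently.

```r
# Assumption: `crude` is the example corpus bundled with tm.
library(tm)

data("crude")

# Number of documents in the corpus
n_docs <- length(crude)
print(n_docs)

# Metadata attached to the first document (author, heading, id, ...)
first_meta <- meta(crude[[1]])
print(first_meta)

# Raw text of the first document
first_text <- content(crude[[1]])
print(substr(paste(first_text, collapse = " "), 1, 200))
```

Each document carries its own metadata record, which is why converting the corpus yields one row per document with metadata columns next to the text.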
This exercise is part of the course
Introduction to Natural Language Processing in R
Exercise instructions
- Convert the corpus into a tibble.
- Use names() to print out the column names.
- Tokenize (by word), count, and remove stop words from the text column of crude_tibble.
Interactive exercise
Try this exercise by completing this sample code.
# Create a tibble & Review
crude_tibble <- ___(crude)
___(crude_tibble)
crude_counts <- crude_tibble %>%
  # Tokenize by word
  ___(___, text) %>%
  # Count by word
  ___(word, sort = TRUE) %>%
  # Remove stop words
  ___(stop_words)
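One possible way to fill in the scaffold is sketched below. It is not the official solution: it assumes that crude is the example corpus from the tm package, and that the conversion step uses tidy(), which tidytext provides for tm corpora. The exercise text names unnest_tokens(), count(), anti_join(), and names(), but not tidy(), so that call is an assumption.

```r
# Assumptions: `crude` comes from tm, and tidy() is the intended
# conversion function (the exercise text does not name it).
library(tm)
library(tidytext)
library(dplyr)

data("crude")

# Create a tibble & review its columns
crude_tibble <- tidy(crude)
names(crude_tibble)

crude_counts <- crude_tibble %>%
  # Tokenize by word: one row per word, stored in a new column `word`
  unnest_tokens(word, text) %>%
  # Count by word, most frequent first
  count(word, sort = TRUE) %>%
  # Remove stop words; `by` is spelled out to avoid the join message
  anti_join(stop_words, by = "word")

head(crude_counts)
```

Counting before the anti_join() keeps the sorted order from count() and means the join runs against one row per distinct word rather than one row per token.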