Text preprocessing: Stemming
The roots of words are often more important than their endings, especially in text analysis. The book Animal Farm is obviously about animals. However, knowing that the book mentions animal's 248 times and animal 107 times, as two separate counts, might not be helpful for your analysis.
tidy_animal_farm contains a tibble of the words from Animal Farm, tokenized and without stop words. The next step is to stem the words and explore the results.
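To make the idea of stemming concrete, here is a minimal sketch of what a stemmer does to a handful of words. The word vector is a hypothetical example and is not taken from the exercise data; the sketch assumes the SnowballC package is installed and uses its wordStem() function with the default Porter stemmer.

```r
library(SnowballC)

# Hypothetical example words, not taken from tidy_animal_farm
words <- c("animal", "animals", "farming", "farms")

# wordStem() uses the Porter stemmer by default (language = "porter")
stems <- wordStem(words)

# Inflected forms of the same word should collapse to a shared stem
print(data.frame(word = words, stem = stems))
```

Note that a stem is not necessarily a dictionary word; it only needs to be the same for related forms so that their counts can be combined.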
This exercise is part of the course Introduction to Natural Language Processing in R
Exercise instructions
- Use dplyr and SnowballC to stem the words from tidy_animal_farm.
- Print the old word frequencies from tidy_animal_farm.
- Print the new word frequencies from stemmed_animal_farm.
Interactive exercise
Try this exercise by completing this sample code.
# Perform stemming on tidy_animal_farm
stemmed_animal_farm <- tidy_animal_farm %>%
  ___(word = ___(___))

# Print the old word frequencies
___ %>%
  ___(word, sort = ___)

# Print the new word frequencies
___ %>%
  ___(word, sort = ___)
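
One possible way to fill in the blanks is sketched below. It is not necessarily the course's reference solution. Because the real tidy_animal_farm is only available inside the exercise environment, the sketch builds a small hypothetical stand-in tibble with a single word column; the real tibble may contain additional columns. It assumes mutate() and count() from dplyr and wordStem() from SnowballC.

```r
library(dplyr)
library(SnowballC)

# Hypothetical stand-in for tidy_animal_farm: one token per row
tidy_animal_farm <- tibble(
  word = c("animal", "animals", "animals", "farm", "farms", "rebellion")
)

# Perform stemming on tidy_animal_farm by overwriting word with its stem
stemmed_animal_farm <- tidy_animal_farm %>%
  mutate(word = wordStem(word))

# Print the old word frequencies, most frequent first
tidy_animal_farm %>%
  count(word, sort = TRUE)

# Print the new word frequencies, most frequent first
stemmed_animal_farm %>%
  count(word, sort = TRUE)
```

Comparing the two printed tables shows the effect of stemming: variants that were counted separately before are merged into a single row, so the stemmed table has fewer distinct entries with larger counts.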