Intro to word stemming and stem completion
Another useful preprocessing step involves word stemming and stem completion. Word stemming reduces words to their root so that related terms are unified across documents. For example, the stem of "computational", "computers" and "computation" is "comput". But because "comput" isn't a real word, we want to reconstruct the stems so that "computational", "computers", and "computation" all map to a recognizable word, such as "computer". The reconstruction step is called stem completion.
The tm package provides the stemDocument() function to get to a word's root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.
For example, stemDocument(c("computational", "computers", "computation")) returns "comput" "comput" "comput".
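As a minimal sketch of both input types, the snippet below stems the same three words first as a character vector and then wrapped in a PlainTextDocument. It assumes the tm package and its stemming backend, SnowballC, are installed; the object names words, stems, doc and stemmed_doc are illustrative only.

```r
library(tm)  # assumes tm and SnowballC are installed

# Character vector in, character vector out
words <- c("computational", "computers", "computation")
stems <- stemDocument(words)
stems

# PlainTextDocument in, PlainTextDocument out
doc <- PlainTextDocument(words)
stemmed_doc <- stemDocument(doc)
class(stemmed_doc)
content(stemmed_doc)  # the stemmed text stored inside the document
```

In both cases the stemmed text should be the "comput" stem described above; only the container differs.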
You will use stemCompletion() to reconstruct these word roots back into a known term. stemCompletion() accepts a character vector and a completion dictionary. The completion dictionary can be a character vector or a Corpus object. Either way, the completion dictionary for our example would need to contain the word "computer", so all instances of "comput" can be reconstructed.
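A short sketch of stem completion on the "comput" example, using a character vector as the dictionary. It assumes the tm package is installed; the names stems, dict and completed are illustrative, and the extra dictionary entry "keyboard" is added here only to show that non-matching words are ignored.

```r
library(tm)  # assumes tm is installed

# Stems as produced by stemDocument() in the example above
stems <- c("comput", "comput", "comput")

# Completion dictionary: must contain "computer" for the stems to be completed
dict <- c("computer", "keyboard")

# Each stem is matched against dictionary words that start with it
completed <- stemCompletion(stems, dict)
completed
```

The result is a named character vector: the names are the original stems and the values are the completed words, so unname(completed) gives just the completions.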
This is a part of the course “Text Mining with Bag-of-Words in R”
Exercise instructions
- Create a vector called complicate consisting of the words "complicated", "complication", and "complicatedly" in that order.
- Store the stemmed version of complicate to an object called stem_doc.
- Create comp_dict that contains one word, "complicate".
- Create complete_text by applying stemCompletion() to stem_doc. Re-complete the words using comp_dict as the reference corpus.
- Print complete_text to the console.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create complicate
complicate <- ___
# Perform word stemming: stem_doc
stem_doc <- ___
# Create the completion dictionary: comp_dict
comp_dict <- ___
# Perform stem completion: complete_text
complete_text <- ___
# Print complete_text
complete_text
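For reference, one possible way to fill in the blanks is sketched below. It is not the only valid answer, and it assumes the tm package (with SnowballC) is installed and loaded, which the interactive exercise environment normally handles for you.

```r
library(tm)  # assumption: not needed inside the course environment

# Create complicate
complicate <- c("complicated", "complication", "complicatedly")

# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)

# Create the completion dictionary: comp_dict
comp_dict <- c("complicate")

# Perform stem completion: complete_text
complete_text <- stemCompletion(stem_doc, comp_dict)

# Print complete_text
complete_text
```

Because all three words share one stem and the dictionary holds a single matching word, every element of complete_text should come back as "complicate".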