Intro to word stemming and stem completion
Another useful preprocessing step involves word stemming and stem completion. Word stemming reduces words to a common root so that variants of the same term are unified across documents. For example, the stem of "computational", "computers", and "computation" is "comput". But because "comput" isn't a real word, we want to reconstruct the words so that "computational", "computers", and "computation" all refer to a recognizable word, such as "computer". The reconstruction step is called stem completion.
The tm package provides the stemDocument() function to get to a word's root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.
For example, stemDocument(c("computational", "computers", "computation")) returns "comput" "comput" "comput".
You will use stemCompletion() to reconstruct these word roots back into a known term. stemCompletion() accepts a character vector and a completion dictionary. The completion dictionary can be a character vector or a Corpus object. Either way, the completion dictionary for our example would need to contain the word "computer", so all instances of "comput" can be reconstructed.
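As a minimal sketch of the two steps described above, here is the "computer" example from the text. It assumes the tm package is installed, along with the SnowballC package it relies on for stemming. The object names word_vec, stems, dict, and completed are illustrative only and are not part of the exercise.

```r
library(tm)

# Illustrative input: the three words from the example in the text
word_vec <- c("computational", "computers", "computation")

# Step 1: reduce each word to its stem (per the text, each becomes "comput")
stems <- stemDocument(word_vec)

# Step 2: a one-word completion dictionary containing the target term
dict <- "computer"

# Step 3: match each stem against the dictionary to recover a readable word
completed <- stemCompletion(stems, dict)
completed
```

stemCompletion() returns a named character vector, with the stems as names and the completed words as values, so wrapping the result in unname() leaves just the completed words.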
This is a part of the course “Text Mining with Bag-of-Words in R”
Exercise instructions
- Create a vector called complicate consisting of the words "complicated", "complication", and "complicatedly" in that order.
- Store the stemmed version of complicate to an object called stem_doc.
- Create comp_dict that contains one word, "complicate".
- Create complete_text by applying stemCompletion() to stem_doc. Re-complete the words using comp_dict as the reference corpus.
- Print complete_text to the console.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create complicate
complicate <- ___
# Perform word stemming: stem_doc
stem_doc <- ___
# Create the completion dictionary: comp_dict
comp_dict <- ___
# Perform stem completion: complete_text
complete_text <- ___
# Print complete_text
complete_text