Session Ready
Exercise

Intro to word stemming and stem completion

Still, another useful preprocessing step involves word-stemming and stem completion. Word stemming reduces words to unify across documents. For example, the stem of "computational", "computers" and "computation" is "comput". But because "comput" isn't a real word, we want to reconstruct the words so that "computational", "computers", and "computation" all refer to a recognizable word, such as "computer". The reconstruction step is called stem completion.

The tm package provides the stemDocument() function to get to a word's root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.

For example,

stemDocument(c("computational", "computers", "computation"))

returns "comput" "comput" "comput".

You will use stemCompletion() to reconstruct these word roots back into a known term. stemCompletion() accepts a character vector and a completion dictionary. The completion dictionary can be a character vector or a Corpus object. Either way, the completion dictionary for our example would need to contain the word "computer," so all instances of "comput" can be reconstructed.

Instructions
100 XP
  • Create a vector called complicate consisting of the words "complicated", "complication", and "complicatedly" in that order.
  • Store the stemmed version of complicate to an object called stem_doc.
  • Create comp_dict that contains one word, "complicate".
  • Create complete_text by applying stemCompletion() to stem_doc. Re-complete the words using comp_dict as the reference corpus.
  • Print complete_text to the console.