Exercise

Word stemming and stem completion on a sentence

Let's consider the following sentence as our document for this exercise:

"In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

This sentence contains the same three forms of the word "complicate" that we saw in the previous exercise. The difference here is that if you call stemDocument() on this sentence, it returns the sentence without stemming any words. Take a moment to try it out in the console, and be sure to include the punctuation marks.

This happens because stemDocument() treats the whole sentence as one word. In other words, our document is a character vector of length 1 instead of length n, where n is the number of words in the document. To solve this problem, we first remove the punctuation marks with the removePunctuation() function you learned a few exercises back. We then use strsplit() to split this single string on spaces, which returns a list, and unlist() that list into a character vector of length n. Finally, we stem and re-complete each word.
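
To make the "length 1 versus length n" point concrete, here is a minimal sketch. It assumes the tm package is installed and that text_data holds the sentence quoted above; the variable names follow the ones used in this exercise.

```r
library(tm)

# Assumed stand-in for the exercise's text_data: the sentence quoted above.
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

# The whole sentence is a single string, so the vector has length 1.
length(text_data)   # 1

# Strip punctuation, split on single spaces, and flatten the resulting list.
rm_punc <- removePunctuation(text_data)
n_char_vec <- unlist(strsplit(rm_punc, split = " "))

# Now there is one element per word.
length(n_char_vec)  # 13
```

Without unlist(), strsplit() would hand back a list containing one character vector, which is why the two calls are nested.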

Don't worry if that was confusing. Let's go through the process step by step!

Instructions

The document text_data and the completion dictionary comp_dict are loaded in your workspace.

  • Remove the punctuation marks in text_data using removePunctuation(), assigning to rm_punc.
  • Call strsplit() on rm_punc with the split argument set equal to " ". Nest this inside unlist(), assigning to n_char_vec.
  • Use stemDocument() again to perform word stemming on n_char_vec, assigning to stem_doc.
  • Create complete_doc by re-completing your stemmed document with stemCompletion() and using comp_dict as your reference corpus.
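
One possible way to put the four steps together is sketched below. It assumes the tm and SnowballC packages are installed. In the exercise, text_data and comp_dict are already loaded; the definitions here are stand-ins so the snippet runs on its own, and the comp_dict shown is a hypothetical dictionary, not necessarily the one in your workspace.

```r
library(tm)

# Stand-in for the preloaded text_data: the sentence quoted in this exercise.
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

# Hypothetical completion dictionary; the real comp_dict may differ.
comp_dict <- c("In", "a", "complicate", "haste", "Tom", "rush",
               "to", "fix", "new", "too")

# Step 1: remove punctuation.
rm_punc <- removePunctuation(text_data)

# Step 2: split on spaces and flatten to a character vector of words.
n_char_vec <- unlist(strsplit(rm_punc, split = " "))

# Step 3: stem each word.
stem_doc <- stemDocument(n_char_vec)

# Step 4: re-complete the stems against the dictionary.
complete_doc <- stemCompletion(stem_doc, comp_dict)

# Inspect both results.
print(stem_doc)
print(complete_doc)
```

Printing both objects side by side makes it easy to compare each stem with the dictionary word it was completed to.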

Are stem_doc and complete_doc what you expected?