Word stemming and stem completion on a sentence
Let's consider the following sentence as our document for this exercise:
"In a complicated haste, Tom rushed to fix a new complication, too complicatedly."
This sentence contains the same three forms of the word "complicate" that we saw in the previous exercise. The difference here is that even if you called stemDocument() on this sentence, it would return the sentence without stemming any words. Take a moment and try it out in the console. Be sure to include the punctuation marks.
This happens because stemDocument() treats the whole sentence as one word. In other words, our document is a character vector of length 1, instead of length n, where n is the number of words in the document. To solve this problem, we first remove the punctuation marks with the removePunctuation() function you learned a few exercises back. We then use strsplit() to split this character vector of length 1 into its n words, unlist() the result to get a character vector of length n, and then proceed to stem and re-complete.
Don't worry if that was confusing. Let's go through the process step by step!
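To make the length 1 versus length n distinction concrete, here is a minimal base R sketch. It is not the exercise solution: gsub() with the "[[:punct:]]" pattern is used only as an assumed stand-in for removePunctuation(), and the names sentence, no_punct, and words are made up for illustration.

```r
# The sentence from this exercise, stored as a single string
sentence <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

# The whole sentence is one element of a character vector
length(sentence)   # 1

# Assumed base R stand-in for removePunctuation(): drop punctuation characters
no_punct <- gsub("[[:punct:]]", "", sentence)

# strsplit() returns a list with one entry per input string;
# unlist() flattens it into a plain character vector of words
words <- unlist(strsplit(no_punct, split = " "))

# Now there is one element per word
length(words)      # 13
words
```

Once the words are separate elements, a function that works element by element can handle each word on its own.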
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
The document text_data and the completion dictionary comp_dict are loaded in your workspace.
- Remove the punctuation marks in text_data using removePunctuation(), assigning to rm_punc.
- Call strsplit() on rm_punc with the split argument set equal to " ". Nest this inside unlist(), assigning to n_char_vec.
- Use stemDocument() again to perform word stemming on n_char_vec, assigning to stem_doc.
- Create complete_doc by re-completing your stemmed document with stemCompletion() and using comp_dict as your reference corpus.
Are stem_doc and complete_doc what you expected?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Remove punctuation: rm_punc
rm_punc <- ___
# Create character vector: n_char_vec
n_char_vec <- unlist(___(___, split = " "))
# Perform word stemming: stem_doc
stem_doc <- ___
# Print stem_doc
stem_doc
# Re-complete stemmed document: complete_doc
complete_doc <- ___
# Print complete_doc
complete_doc