Word stemming and stem completion on a sentence
Let's consider the following sentence as our document for this exercise:
"In a complicated haste, Tom rushed to fix a new complication, too complicatedly."
This sentence contains the same three forms of the word "complicate" that we saw in the previous exercise. The difference here is that even if you called stemDocument() on this sentence, it would return the sentence without stemming any words. Take a moment and try it out in the console. Be sure to include the punctuation marks.
This happens because stemDocument() treats the whole sentence as one word. In other words, our document is a character vector of length 1, instead of length n, where n is the number of words in the document. To solve this problem, we first remove the punctuation marks with the removePunctuation() function you learned a few exercises back. We then use strsplit() to split this length-1 character vector into its words, unlist() the result to get a character vector of length n, and then proceed to stem and re-complete.
Don't worry if that was confusing. Let's go through the process step by step!
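The difference between a character vector of length 1 and one of length n can be seen with base R alone. The snippet below is a minimal sketch: it rebuilds text_data from the sentence quoted above (an assumption, since the workspace object itself is not shown) and uses the illustrative names pieces and words_vec, which are not part of the exercise.

```r
# The sentence from the text, stored as a single string (assumed value of text_data)
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

# The whole sentence is one element of a character vector
length(text_data)

# strsplit() returns a list with one entry per input string
pieces <- strsplit(text_data, split = " ")
class(pieces)
length(pieces)

# unlist() flattens that list into a plain character vector, one element per word
words_vec <- unlist(pieces)
length(words_vec)
```

Punctuation has not been removed here, so some elements still carry a trailing comma or period, which is why the exercise removes punctuation before splitting.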
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
The document text_data and the completion dictionary comp_dict are loaded in your workspace.
- Remove the punctuation marks in text_data using removePunctuation(), assigning to rm_punc.
- Call strsplit() on rm_punc with the split argument set equal to " ". Nest this inside unlist(), assigning to n_char_vec.
- Use stemDocument() again to perform word stemming on n_char_vec, assigning to stem_doc.
- Create complete_doc by re-completing your stemmed document with stemCompletion() and using comp_dict as your reference corpus.
Are stem_doc and complete_doc what you expected?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Remove punctuation: rm_punc
rm_punc <- ____
# Create character vector: n_char_vec
n_char_vec <- unlist(___(___, split = " "))
# Perform word stemming: stem_doc
stem_doc <- ___
# Print stem_doc
stem_doc
# Re-complete stemmed document: complete_doc
complete_doc <- ___
# Print complete_doc
complete_doc
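
One possible way to fill in the blanks above is sketched here. It assumes the tm package (with SnowballC installed for stemming) is available. Because the workspace objects are not shown, text_data is rebuilt from the sentence quoted earlier and comp_dict is a hypothetical stand-in dictionary, so the printed results may differ from those in the course environment and across tm versions.

```r
library(tm)  # provides removePunctuation(), stemDocument(), stemCompletion()

# Assumed stand-ins for the objects preloaded in the exercise workspace
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."
comp_dict <- c("In", "a", "complicate", "haste", "Tom", "rush",
               "to", "fix", "new", "too")  # hypothetical dictionary

# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)

# Create character vector: n_char_vec (one element per word)
n_char_vec <- unlist(strsplit(rm_punc, split = " "))

# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)

# Print stem_doc
stem_doc

# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc, comp_dict)

# Print complete_doc
complete_doc
```

Stems that have no match in the dictionary cannot be completed, so the quality of complete_doc depends on what comp_dict contains.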