Word stemming and stem completion on a sentence
Let's consider the following sentence as our document for this exercise:
"In a complicated haste, Tom rushed to fix a new complication, too complicatedly."
This sentence contains the same three forms of the word "complicate" that we saw in the previous exercise. The difference here is that even if you called stemDocument() on this sentence, it would return the sentence without stemming any words. Take a moment and try it out in the console. Be sure to include the punctuation marks.
This happens because stemDocument() treats the whole sentence as one word. In other words, our document is a character vector of length 1, instead of length n, where n is the number of words in the document. To solve this problem, we first remove the punctuation marks with the removePunctuation() function you learned a few exercises back. We then use strsplit() to split this length-1 character vector into its n words, unlist() the result into a character vector of length n, and then proceed to stem and re-complete.
Don't worry if that was confusing. Let's go through the process step by step!
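To make the length-1 versus length-n distinction concrete, here is a minimal sketch in base R only. It assumes the sentence is stored in text_data as a single string, and it uses gsub() with a punctuation regex as a stand-in for removePunctuation(), so the exact behaviour of the tm function may differ in edge cases.

```r
# The example sentence stored as a single string.
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

# One string means a character vector of length 1.
length(text_data)

# Assumption: a base R regex stands in for tm's removePunctuation().
rm_punc <- gsub("[[:punct:]]+", "", text_data)

# strsplit() returns a list with one element per input string;
# unlist() flattens that list into a plain character vector.
n_char_vec <- unlist(strsplit(rm_punc, split = " "))

# Now there is one element per word.
length(n_char_vec)
n_char_vec
```

After the split, n_char_vec holds the 13 words of the sentence as separate elements, which is the shape a word-by-word stemmer needs.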
This exercise is part of the course
Text Mining with Bag-of-Words in R
Instructions
The document text_data and the completion dictionary comp_dict are loaded in your workspace.
- Remove the punctuation marks in text_data using removePunctuation(), assigning to rm_punc.
- Call strsplit() on rm_punc with the split argument set equal to " ". Nest this inside unlist(), assigning to n_char_vec.
- Use stemDocument() again to perform word stemming on n_char_vec, assigning to stem_doc.
- Create complete_doc by re-completing your stemmed document with stemCompletion() and using comp_dict as your reference corpus.
Are stem_doc and complete_doc what you expected?
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Remove punctuation: rm_punc
rm_punc <- ____
# Create character vector: n_char_vec
n_char_vec <- unlist(___(___, split = " "))
# Perform word stemming: stem_doc
stem_doc <- ___
# Print stem_doc
stem_doc
# Re-complete stemmed document: complete_doc
complete_doc <- ___
# Print complete_doc
complete_doc
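
For reference, here is one way the four steps might fit together outside the exercise environment. This is a sketch under assumptions: it needs the tm package (with SnowballC installed for stemming), and it defines text_data and comp_dict by hand because the workspace objects are not shown here. The contents of comp_dict below are hypothetical and chosen only for illustration.

```r
library(tm)  # stemDocument() relies on the SnowballC package being installed

# Assumption: the exercise's text_data holds the example sentence as one string.
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."

# Hypothetical completion dictionary; the real comp_dict is not shown in the exercise.
comp_dict <- c("In", "a", "complicate", "haste", "Tom", "rush", "to", "fix", "new", "too")

# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)

# Create character vector: n_char_vec (one element per word)
n_char_vec <- unlist(strsplit(rm_punc, split = " "))

# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)
stem_doc

# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc, comp_dict)
complete_doc
```

The completed words depend entirely on the entries in comp_dict, so a different dictionary can give different completions for the same stems.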