Apply preprocessing steps to a corpus

The tm package provides a function, tm_map(), that applies a cleaning function to every document in a corpus, so each cleaning step becomes a single call.

tm_map() takes two main arguments, a corpus and a cleaning function; any further arguments are passed on to the cleaning function. Here, removeNumbers() is from the tm package.

corpus <- tm_map(corpus, removeNumbers)

For compatibility, base R and qdap functions need to be wrapped in content_transformer().

corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
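As a minimal sketch of the two calling styles above, the following builds a tiny corpus and cleans it. The two sample sentences are made up for illustration, and base R's tolower() stands in for a qdap function so that only the tm package is assumed to be installed.

```r
library(tm)

# Hypothetical two-document corpus, used only for illustration
docs <- c("Order 2 Large Coffees", "The MUG holds 12 oz")
corpus <- VCorpus(VectorSource(docs))

# removeNumbers() is a tm transformation, so it is passed directly
corpus <- tm_map(corpus, removeNumbers)

# tolower() is base R, so it is wrapped in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))

# Pull the cleaned text back out of each document
first_doc <- content(corpus[[1]])
second_doc <- content(corpus[[2]])
print(first_doc)
print(second_doc)
```

Printing the document content after each step is a quick way to confirm that a transformation did what you expected.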

You may need to apply the same functions to multiple corpora; a custom function like the one displayed in the editor will save you time (and lines of code). clean_corpus() takes one argument, corpus, applies a series of cleaning functions to it in order, and returns the updated corpus.

The order of cleaning steps makes a difference. For example, if you call removeNumbers() and then replace_number(), the second function won't find anything to change, because the digits are already gone! Check, check, and re-check your results!
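The same ordering effect can be seen with a different pair of steps that needs only the tm package. This is a sketch, not the example from the text: it assumes a made-up sentence and compares removing stop words before and after lowercasing. Because the English stop word list is lowercase and removeWords() matches case-sensitively, capitalized stop words survive when lowercasing comes second.

```r
library(tm)

# Hypothetical sentence, used only for illustration
text <- "The mug is on THE table"

# Stop words removed first: "The" and "THE" do not match the lowercase list
removed_first <- tolower(removeWords(text, stopwords("en")))

# Lowercased first: every occurrence of "the" is now matched and removed
lowered_first <- removeWords(tolower(text), stopwords("en"))

print(removed_first)
print(lowered_first)
```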

Edit the custom function clean_corpus() in the sample code to apply (in order):
- tm's removePunctuation().
- Base R's tolower().
- tm's removeWords(), with "mug" appended to the stop words list.
- tm's stripWhitespace().
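One way the finished function might look is sketched below. It is not necessarily the course's reference solution: it assumes that stop words are removed with tm's removeWords() using the English list from stopwords("en"), and the sample sentence is made up for illustration.

```r
library(tm)

clean_corpus <- function(corpus) {
  # Strip punctuation characters
  corpus <- tm_map(corpus, removePunctuation)
  # Lowercase; base R functions are wrapped in content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove English stop words plus the extra word "mug" (assumed list)
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "mug"))
  # Collapse the runs of spaces left behind by the earlier steps
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Hypothetical one-document corpus to try the function on
sample_corpus <- VCorpus(VectorSource("The Mug, which is new,   keeps coffee hot!"))
cleaned <- clean_corpus(sample_corpus)
cleaned_text <- content(cleaned[[1]])
print(cleaned_text)
```

Lowercasing comes before removeWords() so that capitalized words such as "Mug" are matched, and stripWhitespace() comes last so it can tidy the gaps left by removed words.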
