Text Mining with Bag-of-Words in R

Exercise

Apply preprocessing steps to a corpus

The tm package provides a function tm_map() to apply cleaning functions to an entire corpus, making the cleaning steps easier.

tm_map() takes two main arguments, a corpus and a cleaning function; any further arguments are passed on to the cleaning function. Here, removeNumbers() is from the tm package.

corpus <- tm_map(corpus, removeNumbers)
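As a minimal sketch of the call above in a runnable setting: the two-document corpus below is a hypothetical example built only for illustration, and sapply() with content() is just one way to pull the cleaned text back out for inspection.

```r
library(tm)

# Hypothetical two-document corpus, used only for illustration
docs <- c("Ordered 2 coffee mugs in 2019", "No digits here")
corpus <- VCorpus(VectorSource(docs))

# Apply tm's removeNumbers() to every document in the corpus
corpus <- tm_map(corpus, removeNumbers)

# Extract the cleaned text of each document to inspect the result
cleaned <- sapply(corpus, content)
print(cleaned)
```

Note that removing digits leaves the surrounding spaces in place, which is one reason a whitespace-stripping step usually comes later in the pipeline.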

For compatibility, base R and qdap functions need to be wrapped in content_transformer(), because they operate on plain character vectors rather than on tm documents.

corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
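The same wrapping pattern can be sketched with base R's tolower(), which avoids the qdap dependency; the small corpus here is a hypothetical example, and a qdap function such as replace_abbreviation() would be wrapped in the same way.

```r
library(tm)

# Hypothetical corpus, used only for illustration
corpus <- VCorpus(VectorSource(c("Coffee MUG", "Tea Cup")))

# tolower() is base R and works on character vectors, not tm documents,
# so it is wrapped in content_transformer() before being passed to tm_map()
corpus <- tm_map(corpus, content_transformer(tolower))

# Extract the text of each document and drop the document-ID names
lowered <- unname(sapply(corpus, content))
print(lowered)
```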

You may be applying the same functions over multiple corpora; using a custom function like the one displayed in the editor will save you time (and lines of code). clean_corpus() takes one argument, corpus, and applies a series of cleaning functions to it in order, then returns the updated corpus.

The order of cleaning steps makes a difference. For example, if you apply tm's removeNumbers() and then qdap's replace_number(), the second function won't find anything to change, because the digits are already gone! Check, check, and re-check your results!
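The ordering problem can be sketched on a single string. To keep the example runnable without qdap, spell_out_digits() below is a hypothetical, deliberately simplistic stand-in for replace_number(): it only maps single digits to words and does not handle multi-digit numbers properly.

```r
library(tm)

# Hypothetical stand-in for qdap's replace_number(); it replaces each
# digit 0-9 with its word and is NOT a faithful reimplementation
spell_out_digits <- function(x) {
  words <- c("zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine")
  for (i in 0:9) {
    x <- gsub(as.character(i), words[i + 1], x, fixed = TRUE)
  }
  x
}

text <- "I bought 2 mugs"

# Order A: spell out digits first, then strip any digits that remain
order_a <- removeNumbers(spell_out_digits(text))

# Order B: strip digits first, so the second step has nothing to convert
order_b <- spell_out_digits(removeNumbers(text))

print(order_a)
print(order_b)
```

In order A the quantity survives as a word; in order B it is lost entirely, which is the situation the paragraph above warns about.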

Instructions 1/2

  • Edit the custom function clean_corpus() in the sample code to apply (in order):
    • tm's removePunctuation().
    • Base R's tolower().
    • Stop word removal, with "mug" appended to the stop words list.
    • tm's stripWhitespace().
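
One way the four steps might be assembled is sketched below. The sample code from the editor is not shown here, so this is an assumption about its shape rather than the course's own solution: in particular, using tm's removeWords() with stopwords("en") as the base stop words list is assumed, and the demo corpus is hypothetical.

```r
library(tm)

# Sketch of a cleaning function; the stop-word step (removeWords with
# stopwords("en")) is an assumption, since the editor code is not shown
clean_corpus <- function(corpus) {
  # 1. Strip punctuation with tm's removePunctuation()
  corpus <- tm_map(corpus, removePunctuation)
  # 2. Lowercase with base R's tolower(), wrapped for tm
  corpus <- tm_map(corpus, content_transformer(tolower))
  # 3. Remove stop words, with "mug" appended to the list
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "mug"))
  # 4. Collapse runs of whitespace with tm's stripWhitespace()
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Hypothetical one-document corpus to try the function on
demo <- VCorpus(VectorSource("The MUG, it is   great!"))
cleaned <- content(clean_corpus(demo)[[1]])
print(cleaned)
```

Lowercasing before stop word removal matters in this sketch because the stop words are supplied in lowercase, so "The" and "MUG" would otherwise slip through.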