
Apply preprocessing steps to a corpus

The tm package provides a function tm_map() to apply cleaning functions to an entire corpus, making the cleaning steps easier.

tm_map() takes a corpus and a cleaning function as its first two arguments; any further arguments are passed on to the cleaning function. Here, removeNumbers() is from the tm package.

corpus <- tm_map(corpus, removeNumbers)
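As a minimal sketch of the call above, the snippet below builds a tiny corpus and applies removeNumbers() to it. The two sentences are made up for illustration, and using VCorpus() with VectorSource() is an assumption about how the corpus was created.

```r
library(tm)

# Two toy documents; the text is invented for this illustration
docs <- c("I drank 3 cups of coffee in 2024.", "Room 101 is closed.")
corpus <- VCorpus(VectorSource(docs))

# removeNumbers() is a tm transformation, so it is passed to tm_map() directly
corpus <- tm_map(corpus, removeNumbers)

# content() extracts the text of a single document so the result can be checked
first_doc <- content(corpus[[1]])
second_doc <- content(corpus[[2]])
print(first_doc)
print(second_doc)
```

Printing a document or two after each step is a cheap way to confirm that the transformation did what you expected.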

For compatibility, base R and qdap functions need to be wrapped in content_transformer().

corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
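The wrapping step can be sketched with a base R function in place of the qdap one, so that the example only needs tm. Here tolower() stands in for replace_abbreviation(); the toy documents are made up for illustration.

```r
library(tm)

# Toy documents invented for this illustration
corpus <- VCorpus(VectorSource(c("Dr. Smith LOVES Coffee", "Tea Is FINE Too")))

# tolower() is a base R function that works on plain character vectors,
# so it is wrapped in content_transformer() before being handed to tm_map()
corpus <- tm_map(corpus, content_transformer(tolower))

# The wrapped function changes only the text; each element is still a tm document
first_doc <- corpus[[1]]
print(class(first_doc))
print(content(first_doc))
```

Because content_transformer() returns a function that modifies only the content of each document, the corpus keeps its structure and later tm steps can still operate on it.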

You may be applying the same functions over multiple corpora; a custom function like clean_corpus() in the sample code below will save you time (and lines of code). clean_corpus() takes one argument, corpus, applies a series of cleaning functions to it in order, and then returns the updated corpus.

The order of cleaning steps makes a difference. For example, if you call removeNumbers() and then replace_number(), the digits are already gone, so the second function won't find anything to change! Check, check, and re-check your results!
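The effect of ordering can also be seen with tm functions alone. The sketch below is a different example from the one in the text: it compares removing stopwords before and after lowercasing. The helper make_corpus() and the sample sentence are hypothetical, introduced only for this illustration.

```r
library(tm)

# Hypothetical helper that returns a fresh one-document corpus each time
make_corpus <- function() VCorpus(VectorSource("The Coffee is hot"))

# Order A: remove stopwords first, then lowercase.
# removeWords() is case sensitive, so the capitalized "The" is not matched.
a <- tm_map(make_corpus(), removeWords, stopwords("en"))
a <- tm_map(a, content_transformer(tolower))

# Order B: lowercase first, then remove stopwords, so "the" is matched
b <- tm_map(make_corpus(), content_transformer(tolower))
b <- tm_map(b, removeWords, stopwords("en"))

text_a <- content(a[[1]])
text_b <- content(b[[1]])
print(text_a)
print(text_b)
```

In order A the word "the" survives in the lowercased output, while in order B it is removed, which is why lowercasing usually comes before stopword removal.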

This exercise is part of the course

Text Mining with Bag-of-Words in R


Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Alter the function code to match the instructions
clean_corpus <- function(corpus) {
  # Remove punctuation
  corpus <- tm_map(corpus, ___)
  # Transform to lower case
  corpus <- tm_map(corpus, ___)
  # Add more stopwords
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", ___))
  # Strip whitespace
  ___
  return(corpus)
}