Apply preprocessing steps to a corpus
The tm package provides tm_map(), which applies a cleaning function to every document in a corpus, so each cleaning step becomes a single call.
tm_map() takes a corpus and a cleaning function, plus any extra arguments to pass on to that function. Here, removeNumbers() is from the tm package.
corpus <- tm_map(corpus, removeNumbers)
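As a minimal sketch of what that call does, the snippet below builds a tiny corpus and strips the digits from it. The sample sentences and the name toy_corpus are made up for illustration, and the sketch assumes the tm package is installed.

```r
library(tm)

# Hypothetical two-document corpus, for illustration only
toy_corpus <- VCorpus(VectorSource(c("I drank 3 cups of coffee",
                                     "Order 66 arrived")))

# Apply removeNumbers() to every document in the corpus
toy_corpus <- tm_map(toy_corpus, removeNumbers)

# Inspect the cleaned text of each document
print(content(toy_corpus[[1]]))
print(content(toy_corpus[[2]]))
```

Note that removing the digits leaves the surrounding spaces in place, which is one reason a whitespace-stripping step usually comes later.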
For compatibility, base R and qdap functions need to be wrapped in content_transformer().
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
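The same wrapping pattern can be tried without qdap by using a base R function. This is only a sketch: tolower() stands in for any base R text function, and shout_corpus is a made-up one-document corpus.

```r
library(tm)

# Hypothetical one-document corpus, for illustration only
shout_corpus <- VCorpus(VectorSource("Coffee IS Great"))

# tolower() is a base R function, so wrap it in content_transformer()
shout_corpus <- tm_map(shout_corpus, content_transformer(tolower))

print(content(shout_corpus[[1]]))
```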
You may be applying the same functions over multiple corpora; a custom function like the one in the sample code below will save you time (and lines of code). clean_corpus() takes one argument, corpus, applies a series of cleaning functions to it in order, and returns the updated corpus.
The order of cleaning steps makes a difference. For example, if you run removeNumbers() and then qdap's replace_number(), the second function won't find anything to change! Check, check, and re-check your results!
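The effect of ordering can be sketched on a plain string. To avoid depending on qdap, the helper spell_digits() below is a hypothetical stand-in for replace_number() that only knows the digits 1 to 3. It is not qdap's implementation, and the sample sentence is made up.

```r
library(tm)

# Hypothetical stand-in for qdap's replace_number(): spells out a few digits
spell_digits <- function(x) {
  digit_words <- c("1" = "one", "2" = "two", "3" = "three")
  for (d in names(digit_words)) {
    x <- gsub(d, digit_words[[d]], x, fixed = TRUE)
  }
  x
}

sentence <- "I drank 2 cups"

# Spell out first, then remove digits: the number survives as a word
words_first <- removeNumbers(spell_digits(sentence))

# Remove digits first: nothing is left for the second step to convert
digits_first <- spell_digits(removeNumbers(sentence))

print(words_first)
print(digits_first)
```

Comparing the two printed strings shows that the same pair of steps gives different results depending on which runs first.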
This exercise is part of the course Text Mining with Bag-of-Words in R.
Have a go at this exercise by completing this sample code.
# Alter the function code to match the instructions
clean_corpus <- function(corpus) {
  # Remove punctuation
  corpus <- tm_map(corpus, ___)
  # Transform to lower case
  corpus <- tm_map(corpus, ___)
  # Add more stopwords
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", ___))
  # Strip whitespace
  ___
  return(corpus)
}
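One possible way to fill in the blanks is sketched below, based only on the comments in the template. It is not necessarily the exercise's intended answer: the function name clean_corpus_sketch, the extra stopword "mug", and the demo sentence are all assumptions made for illustration.

```r
library(tm)

clean_corpus_sketch <- function(corpus) {
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # Transform to lower case (tolower() is base R, so it needs wrapping)
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Add more stopwords; "mug" is a hypothetical choice for the blank
  corpus <- tm_map(corpus, removeWords,
                   words = c(stopwords("en"), "coffee", "mug"))
  # Strip whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

# Hypothetical one-document corpus to try the function on
demo_corpus <- VCorpus(VectorSource("Coffee,   in a MUG, is great!"))
demo_corpus <- clean_corpus_sketch(demo_corpus)
print(content(demo_corpus[[1]]))
```

Lower-casing runs before removeWords() here because the stopword list is lower case, so "Coffee" and "MUG" would otherwise slip through.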