Apply preprocessing steps to a corpus
The tm package provides a function, tm_map(), to apply cleaning functions to an entire corpus, making the cleaning steps easier.
tm_map() takes two main arguments, a corpus and a cleaning function; any further arguments are passed on to the cleaning function. Here, removeNumbers() is from the tm package.
corpus <- tm_map(corpus, removeNumbers)
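To see the effect of this call, here is a minimal sketch on a toy corpus. The two sample sentences and the variable names docs and cleaned are made up for illustration; VCorpus(), VectorSource() and content() are from the tm package.

```r
library(tm)

# Toy documents (invented for this illustration)
docs <- c("I bought 3 coffees in 2019.", "Room 101 has 2 windows.")
corpus <- VCorpus(VectorSource(docs))

# Apply removeNumbers() to every document in the corpus
corpus <- tm_map(corpus, removeNumbers)

# Pull the cleaned text back out as a plain character vector
cleaned <- unname(sapply(corpus, content))
print(cleaned)
```

The digits are deleted but the spaces around them stay, which is one reason a whitespace-stripping step usually comes later in the pipeline.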
For compatibility, base R and qdap functions need to be wrapped in content_transformer().
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
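The same wrapping applies to base R functions. As a small sketch, assuming a made-up two-document corpus, base R's tolower() can be applied like this (the sample text and the variable name lowered are illustrative only):

```r
library(tm)

# Toy corpus (invented for this illustration)
corpus <- VCorpus(VectorSource(c("Coffee IS Great", "TEA is Fine")))

# tolower() is a base R function, so wrap it in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))

# Extract the text to check the result
lowered <- unname(sapply(corpus, content))
print(lowered)
```

Wrapping the function means it modifies the content of each document while the corpus keeps holding text documents.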
You may be applying the same functions over multiple corpora; using a custom function like the one displayed in the editor will save you time (and lines of code). clean_corpus() takes one argument, corpus, and applies a series of cleaning functions to it in order, then returns the updated corpus.
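The pattern of reusing one cleaning function across corpora might look like the sketch below. It is not the exercise's clean_corpus(): the helper name drop_numbers_and_stopwords(), its two steps, and the toy corpora are hypothetical and chosen only to show the shape of such a function.

```r
library(tm)

# Hypothetical helper: takes a corpus, applies steps in order, returns it
drop_numbers_and_stopwords <- function(corpus) {
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, words = stopwords("en"))
  return(corpus)
}

# Two toy corpora (invented for this illustration)
corpora <- list(
  a = VCorpus(VectorSource("the 2 cups of coffee")),
  b = VCorpus(VectorSource("a pot of tea at 4"))
)

# One call per corpus instead of repeating every tm_map() line
cleaned <- lapply(corpora, drop_numbers_and_stopwords)
texts <- lapply(cleaned, function(x) unname(sapply(x, content)))
print(texts)
```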
The order of cleaning steps makes a difference. For example, if you removeNumbers() and then replace_number(), the second function won't find anything to change! Check, check, and re-check your results!
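A different pair of steps shows the same ordering effect using only tm and base R. In this sketch (the sample sentence and the helper make_corpus() are made up), stop words are removed either after or before lower-casing. Because the stop word list is lower case and matching is case sensitive, the capitalized "The" survives when removal runs first.

```r
library(tm)

# Hypothetical helper that builds the same one-document toy corpus each time
make_corpus <- function() VCorpus(VectorSource("The coffee is hot"))

# Order A: lower-case first, then remove stop words
a <- tm_map(make_corpus(), content_transformer(tolower))
a <- tm_map(a, removeWords, words = stopwords("en"))

# Order B: remove stop words first, then lower-case
b <- tm_map(make_corpus(), removeWords, words = stopwords("en"))
b <- tm_map(b, content_transformer(tolower))

text_a <- content(a[[1]])
text_b <- content(b[[1]])
print(text_a)  # "the" has been removed
print(text_b)  # "the" is still present
```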
This exercise is part of the course Text Mining with Bag-of-Words in R.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Alter the function code to match the instructions
clean_corpus <- function(corpus) {
  # Remove punctuation
  corpus <- tm_map(corpus, ___)
  # Transform to lower case
  corpus <- tm_map(corpus, ___)
  # Add more stopwords
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", ___))
  # Strip whitespace
  ___
  return(corpus)
}