TM refresher (I)
In the Text Mining: Bag of Words course you learned that a corpus is a set of texts, and you studied some functions for preprocessing the text. To recap, one way to create and clean a corpus is with the functions below. Although this is a different course, sentiment analysis is part of text mining, so a refresher can be helpful.
- Turn a character vector into a text source using VectorSource().
- Turn a text source into a corpus using VCorpus().
- Remove unwanted characters from the corpus using cleaning functions like removePunctuation() and stripWhitespace() from tm, and replace_abbreviation() from qdap.
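The three steps above can be sketched as follows. This is a minimal illustration, assuming the tm package is installed; the sample text in docs is made up for this example and is not the course's data, and replace_abbreviation() from qdap is left out to keep the sketch dependent on tm only.

```r
library(tm)

# Hypothetical sample text, not the course's tm_define
docs <- c("Text mining  is fun!!!", "Sentiment analysis,   too.")

src <- VectorSource(docs)   # character vector -> text source
corpus <- VCorpus(src)      # text source -> volatile corpus

# tm_map() applies a cleaning function to every document in the corpus
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Inspect the cleaned text of the first document
content(corpus[[1]])
```

Punctuation is removed before whitespace is stripped so that any gaps left behind by deleted characters are collapsed in the final pass.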
In this exercise a custom clean_corpus() function has been created using standard preprocessing functions for easier application. clean_corpus() accepts the output of VCorpus() and applies cleaning functions. For example:
processed_corpus <- clean_corpus(my_corpus)
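The course does not show the body of clean_corpus() here, so the definition below is only a hypothetical stand-in that illustrates how such a wrapper might bundle tm cleaning functions. The chosen steps (lowercasing, punctuation removal, whitespace stripping) and the sample text are assumptions; the real function may apply other steps, such as replace_abbreviation() from qdap.

```r
library(tm)

# Hypothetical stand-in for the course's clean_corpus();
# the actual function may use different cleaning steps.
clean_corpus <- function(corpus) {
  # Base R functions such as tolower() need content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Made-up input corpus for illustration
my_corpus <- VCorpus(VectorSource(c("The cat SAT on the mat!",
                                    "Dogs, however,   ran.")))

processed_corpus <- clean_corpus(my_corpus)
content(processed_corpus[[1]])
```

Wrapping the steps in one function means the same cleaning can be applied consistently to any corpus produced by VCorpus().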
This exercise is part of the course Sentiment Analysis in R.
Exercise instructions
Your R session has a text vector, tm_define, containing two small documents, and the function clean_corpus().
- Create an object called tm_vector by applying VectorSource() to tm_define.
- Make tm_corpus using VCorpus() on tm_vector.
- Use content() to examine the contents of the first document in tm_corpus.
  - Documents in the corpus are accessed using list syntax, so use double square brackets, e.g. [[1]].
- Clean the corpus text using the custom function clean_corpus() on tm_corpus. Call this new object tm_clean.
- Examine the first document of the new tm_clean object again to see how the text changed after clean_corpus() was applied.
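As a side note on the list syntax mentioned in the instructions, the sketch below shows the difference between double and single square brackets on a corpus. It assumes the tm package is installed; example_text and example_corpus are hypothetical names used in place of the course's objects, so it does not complete the exercise itself.

```r
library(tm)

# Hypothetical two-document vector standing in for tm_define
example_text <- c("First document.", "Second document.")
example_corpus <- VCorpus(VectorSource(example_text))

length(example_corpus)         # number of documents in the corpus

# Double brackets extract a single document object
class(example_corpus[[1]])

# content() pulls the text out of that document as a character vector
content(example_corpus[[1]])

# Single brackets return a smaller corpus, not a document
class(example_corpus[1])
```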
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# clean_corpus(), tm_define are pre-defined
clean_corpus
tm_define
# Create a VectorSource
tm_vector <- ___
# Apply VCorpus
tm_corpus <- ___
# Examine the first document's contents
___(___[[___]])
# Clean the text
tm_clean <- ___
# Reexamine the contents of the first doc
___