TM refresher (I)
In the Text Mining: Bag of Words course you learned that a corpus is a set of texts, and you studied some functions for preprocessing the text. To recap, one way to create and clean a corpus is with the functions below. Even though this is a different course, sentiment analysis is part of text mining, so a refresher can be helpful.
- Turn a character vector into a text source using VectorSource().
- Turn a text source into a corpus using VCorpus().
- Remove unwanted characters from the corpus using cleaning functions like removePunctuation() and stripWhitespace() from tm, and replace_abbreviation() from qdap.
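The three steps above can be sketched as follows. This is a minimal illustration, not the course's own code: `example_text` is a made-up character vector, and `replace_abbreviation()` from qdap is left out so that the sketch depends only on tm.

```r
library(tm)

# Hypothetical two-document character vector (not the course's tm_define)
example_text <- c("Text mining is fun!!!   Really.",
                  "A corpus  is a set of texts.")

# Character vector -> text source -> volatile corpus
example_source <- VectorSource(example_text)
example_corpus <- VCorpus(example_source)

# Apply two tm cleaning functions to every document in the corpus
example_corpus <- tm_map(example_corpus, removePunctuation)
example_corpus <- tm_map(example_corpus, stripWhitespace)

# Inspect the cleaned text of the first document
content(example_corpus[[1]])
```

`tm_map()` applies a transformation to each document and returns a corpus, so the cleaning calls can be chained one after another.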
In this exercise a custom clean_corpus() function has been created using standard preprocessing functions for easier application.
clean_corpus() accepts the output of VCorpus() and applies cleaning functions. For example:
processed_corpus <- clean_corpus(my_corpus)
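The body of clean_corpus() is not shown in this exercise. As a rough idea of what such a wrapper might look like, here is a hypothetical stand-in named `clean_corpus_sketch()`. The choice and order of steps (lowercasing, punctuation removal, English stop word removal, whitespace stripping) are assumptions, and the course's actual function may differ.

```r
library(tm)

# Hypothetical stand-in for the course's clean_corpus(); the real one may differ
clean_corpus_sketch <- function(corpus) {
  # tolower is a base R function, so wrap it with content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  # Drop common English stop words (an assumed step)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  # Collapse the extra spaces left behind by the earlier steps
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Made-up one-document corpus for illustration
demo_corpus <- VCorpus(VectorSource("The CAT sat, quietly, on the mat."))
demo_clean <- clean_corpus_sketch(demo_corpus)
content(demo_clean[[1]])
```

Bundling the steps in one function means the same cleaning is applied consistently to every corpus it is given.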
This exercise is part of the course
Sentiment Analysis in R
Exercise instructions
Your R session has a text vector, tm_define, containing two small documents and the function clean_corpus().
- Create an object called tm_vector by applying VectorSource() to tm_define.
- Make tm_corpus using VCorpus() on tm_vector.
- Use content() to examine the contents of the first document in tm_corpus.
  - Documents in the corpus are accessed using list syntax, so use double square brackets, e.g. [[1]].
- Clean the corpus text using the custom function clean_corpus() on tm_corpus. Call this new object tm_clean.
- Examine the first document of the new tm_clean object again to see how the text changed after clean_corpus() was applied.
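To illustrate the list syntax mentioned above, the following sketch contrasts single and double square brackets on a small corpus. The `docs` corpus is made up for illustration and is not part of the exercise session.

```r
library(tm)

# Made-up two-document corpus
docs <- VCorpus(VectorSource(c("first document", "second document")))

# Single brackets return a corpus containing only the selected document
one_doc_corpus <- docs[1]
class(one_doc_corpus)

# Double brackets extract the document object itself
first_doc <- docs[[1]]
class(first_doc)

# content() returns the text stored in that document
content(first_doc)
```

Because content() is meant to be called on a single document here, the double-bracket form is the one to use when inspecting text.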
Hands-on interactive exercise
Try this exercise by completing this sample code.
# clean_corpus(), tm_define are pre-defined
clean_corpus
tm_define
# Create a VectorSource
tm_vector <- ___
# Apply VCorpus
tm_corpus <- ___
# Examine the first document's contents
___(___[[___]])
# Clean the text
tm_clean <- ___
# Reexamine the contents of the first doc
___