Common cleaning functions from tm
Now that you know two ways to make a corpus, you can focus on cleaning, or preprocessing, the text. First, you'll clean a small piece of text; then, you will move on to larger corpora.
In bag-of-words text mining, cleaning helps aggregate terms. For example, it might make sense for the words "miner", "mining", and "mine" to be considered one term. Specific preprocessing steps will vary based on the project. For example, the words used in tweets are vastly different from those used in legal documents, so the cleaning process can also be quite different.
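To see how cleaning aggregates terms, here is a minimal sketch in base R. The `tokens` vector is made up for illustration, and the cleaning uses only `tolower()` and a `gsub()` call rather than any `tm` function. It merges case and punctuation variants only; it does not merge distinct word forms such as "miner" and "mine".

```r
# Hypothetical tokens: the same words in several surface forms
tokens <- c("Mining", "mining,", "mining", "coffee", "Coffee!")

# Raw counts: every surface form is counted as its own term
raw_counts <- table(tokens)

# Light cleaning with base R only: lowercase, then drop punctuation
clean_tokens <- gsub("[[:punct:]]+", "", tolower(tokens))
clean_counts <- table(clean_tokens)

print(raw_counts)    # five distinct terms, one occurrence each
print(clean_counts)  # two terms: coffee (2) and mining (3)
```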
Common preprocessing functions include:
- `tolower()`: Make all characters lowercase
- `removePunctuation()`: Remove all punctuation marks
- `removeNumbers()`: Remove numbers
- `stripWhitespace()`: Remove excess whitespace
`tolower()` is part of base R, while the other three functions come from the `tm` package. Going forward, we'll load `tm` and `qdap` for you when they are needed. Every time we introduce a new package, we'll have you load it the first time.
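As a rough illustration of what each function does, the sketch below applies all four to a made-up string. `sample_text` is hypothetical and is not the exercise's `text`; the code assumes the `tm` package is installed. Each function is applied to the original string independently, so the results are not chained.

```r
library(tm)  # provides removePunctuation(), removeNumbers(), stripWhitespace()

# Hypothetical sample string, not the exercise's `text`
sample_text <- "Dr. Lee   bought 3 apples,  2 pears!"

lower    <- tolower(sample_text)            # base R: lowercase every character
no_punct <- removePunctuation(sample_text)  # drop punctuation marks
no_nums  <- removeNumbers(sample_text)      # drop digits
squeezed <- stripWhitespace(sample_text)    # collapse runs of whitespace to one space

print(lower)
print(no_punct)
print(no_nums)
print(squeezed)
```

Note that removing punctuation or numbers can leave extra spaces behind, which is one reason `stripWhitespace()` is often applied after the other steps.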
The variable `text`, containing a sentence, is shown in the script.
This exercise is part of the course Text Mining with Bag-of-Words in R.
Instructions
Apply each of the following functions to `text`, simply printing results to the console:
- `tolower()`
- `removePunctuation()`
- `removeNumbers()`
- `stripWhitespace()`
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Create the object: text
text <- "She woke up at 6 A.M. It\'s so early! She was only 10% awake and began drinking coffee in front of her computer."
# Make lowercase
___
# Remove punctuation
___
# Remove numbers
___
# Remove whitespace
___