
Common cleaning functions from tm

Now that you know two ways to make a corpus, you can focus on cleaning, or preprocessing, the text. First, you'll clean a small piece of text; then, you will move on to larger corpora.

In bag-of-words text mining, cleaning helps aggregate terms. For example, it might make sense for the words "miner", "mining", and "mine" to be considered one term. Specific preprocessing steps will vary based on the project. For example, the words used in tweets are vastly different from those used in legal documents, so the cleaning process can also be quite different.
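As a rough illustration of what "aggregating terms" means, the sketch below counts a few invented tokens before and after lowercasing, using only base R. The vector `words` is hypothetical and not part of the exercise. Note that lowercasing only merges capitalization variants; treating "miner", "mining", and "mine" as one term would need a further step such as stemming, which is not shown here.

```r
# Hypothetical tokens, invented for illustration only
words <- c("Mine", "mine", "MINE", "coffee", "Coffee")

# Raw counts treat each capitalization as a separate term
raw_counts <- table(words)

# After lowercasing, the variants collapse into shared terms
clean_counts <- table(tolower(words))

print(raw_counts)
print(clean_counts)
```

Before cleaning there are five distinct terms; after lowercasing there are two, which is the kind of aggregation a bag-of-words count benefits from.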

Common preprocessing functions include:

  • tolower(): Make all characters lowercase
  • removePunctuation(): Remove all punctuation marks
  • removeNumbers(): Remove numbers
  • stripWhitespace(): Remove excess whitespace

tolower() is part of base R, while the other three functions come from the tm package. Going forward, we'll load tm and qdap for you when they are needed. Every time we introduce a new package, we'll have you load it the first time.
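To see what each function does on its own, here is a minimal sketch that applies all four to a short string. It assumes the tm package is installed; `sample_text` and the result names are made up for this illustration and are not the exercise's `text` variable.

```r
library(tm)  # provides removePunctuation(), removeNumbers(), stripWhitespace()

# Hypothetical sample string, not the one used in the exercise
sample_text <- "Hello,   World! It is 9 o'clock."

# Each call works on the original string independently (no chaining)
lower_text   <- tolower(sample_text)            # base R: lowercase everything
no_punct     <- removePunctuation(sample_text)  # tm: drop punctuation marks
no_numbers   <- removeNumbers(sample_text)      # tm: drop digits
single_space <- stripWhitespace(sample_text)    # tm: collapse runs of whitespace

print(lower_text)
print(no_punct)
print(no_numbers)
print(single_space)
```

Because every call starts from the original string, each result shows the effect of one function in isolation. Removing a number can leave a double space behind, which is one reason whitespace stripping is often applied last when the steps are combined.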

The variable text, containing a sentence, is shown in the script.

This is a part of the course

“Text Mining with Bag-of-Words in R”


Exercise instructions

Apply each of the following functions to text, simply printing results to the console:

- `tolower()`
- `removePunctuation()`
- `removeNumbers()`
- `stripWhitespace()`

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create the object: text
text <- "She woke up at       6 A.M. It's so early!  She was only 10% awake and began drinking coffee in front of her computer."

# Make lowercase
___

# Remove punctuation
___

# Remove numbers
___

# Remove whitespace
___


Learn the bag of words technique for text mining with R.

In this chapter, you'll learn the basics of using the bag-of-words method for analyzing text data.

- Exercise 1: What is text mining?
- Exercise 2: Understanding text mining
- Exercise 3: Quick taste of text mining
- Exercise 4: Getting started
- Exercise 5: Load some text
- Exercise 6: Make the vector a VCorpus object (1)
- Exercise 7: Make the vector a VCorpus object (2)
- Exercise 8: Make a VCorpus from a data frame
- Exercise 9: Cleaning and preprocessing text
- Exercise 10: Common cleaning functions from tm
- Exercise 11: Cleaning with qdap
- Exercise 12: All about stop words
- Exercise 13: Intro to word stemming and stem completion
- Exercise 14: Word stemming and stem completion on a sentence
- Exercise 15: Apply preprocessing steps to a corpus
- Exercise 16: The TDM & DTM
- Exercise 17: Understanding TDM and DTM
- Exercise 18: Make a document-term matrix
- Exercise 19: Make a term-document matrix
