All about stop words

Often there are words that are frequent but provide little information. These are called stop words, and you may want to remove them from your analysis. Common English stop words include "I", "she'll", and "the". The tm package ships with a list of 174 common English stop words (you'll print them in this exercise!).

When you are doing an analysis, you will likely need to add to this list. In our coffee tweet example, all tweets contain "coffee", so it's important to remove that word in addition to the common stop words. Leaving "coffee" in adds no insight and causes it to be overemphasized in a frequency analysis.
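To see why a topic word dominates a frequency count, here is a minimal base-R sketch. The three "tweets" are invented for illustration and are not the course's data; the variable names are likewise hypothetical.

```r
# Hypothetical mini-corpus: every "tweet" mentions coffee
tweets <- c(
  "coffee keeps me awake",
  "love my morning coffee",
  "coffee and a good book"
)

# Split on whitespace and count how often each word occurs
words <- unlist(strsplit(tolower(tweets), "\\s+"))
freq <- sort(table(words), decreasing = TRUE)

# The topic word leads the table simply because every tweet contains it
top_word <- names(freq)[1]

# Dropping it leaves the words that actually distinguish the tweets
freq_no_topic <- freq[names(freq) != "coffee"]
```

In this toy corpus the topic word is the only one that appears more than once, so it crowds out everything else until it is removed.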

Using the c() function allows you to add new words to the stop words list. For example, the following would add "word1" and "word2" to the default list of English stop words:

all_stops <- c("word1", "word2", stopwords("en"))

Once you have a list of stop words that makes sense, you will use the removeWords() function on your text. removeWords() takes two arguments: the text object to which it's being applied and the list of words to remove.
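As a rough sketch of how the two pieces fit together, the snippet below builds a custom list and passes it to removeWords(). It assumes the tm package is installed; the sentence in example_text is made up for illustration and is not the preloaded text object from the exercise.

```r
library(tm)  # assumes the tm package is installed

# Hypothetical example sentence, lowercased because the stop word list is lowercase
example_text <- tolower("I love the smell of coffee and a fresh coffee bean")

# Custom list: two domain words plus tm's default English stop words
all_stops <- c("coffee", "bean", stopwords("en"))

# First argument is the text, second is the character vector of words to drop
cleaned <- removeWords(example_text, all_stops)

# Removed words leave their surrounding spaces behind, so collapse and trim them
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

The whitespace cleanup at the end is optional here; it is included only because the gaps left by removed words make the raw result harder to read.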

This is a part of the course “Text Mining with Bag-of-Words in R”.

Exercise instructions

  • Review standard stop words by calling stopwords("en").
  • Remove the "en" stop words from text.
  • Add "coffee" and "bean" to the standard stop words, assigning to new_stops.
  • Remove the customized stopwords, new_stops, from text.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

## text is preloaded into your workspace

# List standard English stop words
___

# Print text without standard stop words
removeWords(___, ___("___"))

# Add "coffee" and "bean" to the list: new_stops
new_stops <- c("___", "___", ___)

# Remove stop words from text
___