Text cleaning basics
1. Text cleaning basics
If you have ever heard the phrase "garbage in, garbage out" when building a model, the same applies to text analysis. We just learned how to tokenize, which can really expose potential garbage in our text. Let's take the next step after tokenization and create better input text, so we get better analysis out.

2. The Russian tweet data set
Before we look at some simple preprocessing steps to clean our data, I'd like to introduce a second dataset we will be exploring. FiveThirtyEight recently published a ton of public data. One of these datasets consists of almost 3 million Russian troll tweets. These are tweets from bots that tweeted during the 2016 US election cycle. We will explore the first 20,000 tweets, as well as use some of the metadata, such as the number of followers, the number following, the publishing date, and the account type, to aid in some of our analysis. This is a great dataset for topic modeling, classification tasks, named entity recognition, and other tasks.
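As a rough sketch of the setup (the file name and the use of read_csv here are my assumptions, not necessarily the course's exact code), loading the first 20,000 tweets might look like this:

    # A minimal sketch: read the tweets from a hypothetical CSV file
    # and keep only the first 20,000 rows.
    library(readr)
    library(dplyr)

    russian_tweets <- read_csv("russian_troll_tweets.csv") %>%
      slice(1:20000)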
3. Top occurring words

As you can imagine, tweets probably contain a lot of garbage. To show this, let's look at the most common words in the troll tweet dataset. First we tokenize by words, and then we count how often these words occur. The results are not that surprising. t.co is shorthand for Twitter's web address and was probably picked up when these tweets were scraped from the web. https has a similar story. None of the top four occurring words are helpful, so we need to remove them.
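In code, the tokenize-then-count step looks roughly like this (russian_tweets and its content column are assumed names, not confirmed by the transcript):

    library(tidytext)
    library(dplyr)

    # One word per row, then count how often each word occurs
    tidy_tweets <- russian_tweets %>%
      unnest_tokens(word, content)

    tidy_tweets %>%
      count(word, sort = TRUE)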
4. Remove stop words

Removing stop words with the tidytext package takes just one additional command. The anti_join function, from dplyr, removes a tibble of words from a column of text. The typical entry in this tibble is the word you want to remove and the lexicon, or source, that word came from. anti_join returns the original tibble with all stop words removed from the text column. Note that stop_words is a tibble of common words provided by the tidytext package. Let's look at the results a second time. OK, so t.co, https, and http are still problematic, but we finally have two interesting top words: blacklivesmatter and trump. We will not get political in this course, but these are still interesting results!
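Here is a sketch of that step, carrying over the tidy_tweets name from above; stop_words ships with tidytext, and anti_join drops every row whose word appears in it:

    library(tidytext)
    library(dplyr)

    data(stop_words)  # tibble with word and lexicon columns

    tidy_tweets %>%
      anti_join(stop_words, by = "word") %>%
      count(word, sort = TRUE)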
5. Custom stop words

We still need to deal with those top occurring words. We can add to our tibble of stop words, or create our own. Here I am adding three stop words to the stop_words tibble: https, http, and t.co. We can then run through the process of removing stop words and counting the word occurrences one last time.
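One way to build that custom list (a sketch; the course may construct it differently) is to bind three new rows onto stop_words:

    library(dplyr)
    library(tibble)

    # Three custom stop words, tagged with a "custom" lexicon label
    custom_stop_words <- tribble(
      ~word,   ~lexicon,
      "https", "custom",
      "http",  "custom",
      "t.co",  "custom"
    ) %>%
      bind_rows(stop_words)

    tidy_tweets %>%
      anti_join(custom_stop_words, by = "word") %>%
      count(word, sort = TRUE)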
6. Final results

We get some interesting results. Within the first 20,000 tweets, these seven words occurred the most often.
7. Stemming

One additional step I want to cover is called stemming. Stemming is the process of reducing words to their roots. For example, both enlisted and enlisting would be trimmed to their root, enlist. This is an important step when trying to really understand which words are being used. We will use the wordStem function from the SnowballC package, as it works extremely well with tidy principles. Consider this example: we take our tidy tweets and perform a mutation, stemming each word with the wordStem function.
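As described, the stemming step is a mutate that runs every word through wordStem (object names carried over from the earlier sketches):

    library(SnowballC)
    library(dplyr)

    # Stem each word, then recount to see how the rankings change
    stemmed_tweets <- tidy_tweets %>%
      anti_join(custom_stop_words, by = "word") %>%
      mutate(word = wordStem(word))

    stemmed_tweets %>%
      count(word, sort = TRUE)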
8. Stemming results

Notice here that blacklivesmatter was trimmed to blacklivesmatt: the stemmer reduced the trailing matter to matt, even though it was part of a much larger word. Also, cop was the 7th most common word before stemming, but now it has jumped to second.
9. Example time

Let's look at a few examples of text preprocessing.