
Processing Twitter text

1. Processing Twitter text

So far, we have analyzed the metadata around the tweets. Now it is time to process and analyze the tweet text itself, which gives us direct insight into user opinion.

2. Lesson overview

In this lesson, we will understand why tweet text must be processed before it is analyzed. We will also learn the steps involved in processing text including removing redundant information, converting text into a corpus, and removing common words called stop words.

3. Why process tweet text?

Why should we process tweet text? Tweet text is raw, unstructured, and noisy: it contains emoticons, URLs, and numbers that must be removed before analysis to get reliable results.

4. Steps in text processing

Let's look at the steps involved in text processing. First, redundant information is removed. This includes URLs, special characters, punctuation, and numbers.

5. Steps in text processing

The text is then converted into a corpus. A corpus is a list of text documents and is the starting point for various text processing functions.

6. Steps in text processing

Next, all letters in the text are converted to lower case.

7. Steps in text processing

Finally, common words called stop words are removed from the corpus.

8. Extract tweet text

First, let's extract 1000 tweets on "Obesity", excluding all retweets. The text column containing the tweet text is saved as a data frame.
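A minimal sketch of this extraction using the rtweet package; it assumes a configured Twitter API token, and the variable names are illustrative, not prescribed by the course:

```r
library(rtweet)

# Search for 1000 tweets on "Obesity", excluding retweets
twts_obesity <- search_tweets("Obesity", n = 1000, include_rts = FALSE)

# Save the text column containing the tweet text as a data frame
tweet_text <- data.frame(text = twts_obesity$text)
```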

9. Extract tweet text

Let's view the first few rows of tweets on "Obesity".
In R, `head(tweet_text)` displays the first six rows.

10. Removing URLs

To remove URLs from the text, we will use the rm_twitter_url() function from the qdapRegex package. This function takes the tweet text data frame as input.
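A short sketch of this step; the input and output variable names are assumptions carried over from the extraction step:

```r
library(qdapRegex)

# Remove Twitter URLs from the tweet text
twt_txt_url <- rm_twitter_url(tweet_text$text)
```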

11. Removing URLs

The URLs are now removed from the tweets.

12. Special characters, punctuation & numbers

The next step is to remove special characters, punctuation, and numbers using the gsub() function. gsub() takes 3 arguments: the pattern to search for, the replacement string, and the text source.
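One way to express this with gsub() is a single pattern that matches everything except letters; the variable names here are illustrative:

```r
# Replace anything that is not a letter with a space
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
```

Replacing with a space (rather than the empty string) keeps adjacent words from being glued together; the extra spaces are stripped in a later step.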

13. Special characters, punctuation & numbers

In the output, we can see that all content other than letters has been replaced with spaces.

14. Convert to text corpus

Now, we will convert the text to a corpus using the tm library. First, we input the tweet_text as an argument to the VectorSource() function. This function converts the tweet text to a vector of texts. A vector is a sequence of elements of the same data type. This vector is next converted to a corpus using the Corpus() function. Let's view the third element from the column content in the corpus.
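The conversion described above can be sketched as follows, assuming the cleaned text vector from the previous steps:

```r
library(tm)

# Convert the text vector to a source, then to a corpus
twt_corpus <- Corpus(VectorSource(twt_txt_chrs))

# View the third document's content
twt_corpus[[3]]$content
```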

15. Convert to lowercase

When we analyze text, we want to ensure that a word is not counted as two different words because the case is different in the two instances. Hence, we change all the words to lowercase using tm_map(). This function takes the tweet corpus and tolower() as arguments. We see here that all characters are now in lowercase.
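A sketch of the lowercase conversion; note that recent versions of tm recommend wrapping base functions like tolower() in content_transformer() so the result remains a corpus:

```r
# Convert all characters in the corpus to lowercase
twt_corpus_lwr <- tm_map(twt_corpus, content_transformer(tolower))
```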

16. What are stop words?

Stop words are commonly used words like a, an, and but. stopwords() from the tm library has a list of default English stop words. Some examples of stop words are shown here.
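The default English stop word list can be inspected directly:

```r
library(tm)

# View the first few default English stop words
head(stopwords("english"))
```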

17. Remove stop words

The stop words need to be removed to focus on the important words in the corpus. We will remove the default English stop words from our corpus using the removeWords() function within tm_map(). tm_map() takes 3 arguments: the corpus, removeWords(), and stopwords() with "english" as its value. The default stop words are now removed in the corpus.
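The removal step can be sketched like this, continuing with the illustrative variable names used above:

```r
# Remove the default English stop words from the corpus
twt_corpus_stpwd <- tm_map(twt_corpus_lwr, removeWords, stopwords("english"))
```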

18. Remove additional spaces

In the final step, the additional spaces created in the previous processing steps need to be removed to create a clean corpus. The additional spaces are removed using tm_map() which takes two arguments: the corpus and the function stripWhitespace() which collapses multiple spaces to a single space. The processed tweet corpus is now ready to be used for further analysis.
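The whitespace cleanup described above can be sketched as:

```r
# Collapse multiple spaces into a single space
twt_corpus_final <- tm_map(twt_corpus_stpwd, stripWhitespace)
```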

19. Let's practice!

We learned how to process Twitter text and create a clean corpus. Let's practice!
