
Counting words

1. Counting words

Great job! Now that we're able to analyze tweets at scale, it's time to begin processing tweet text. One of the most basic analyses we can perform is counting the number of times a word or a phrase has appeared in text, and how many times it has appeared compared to other keywords in the text.

2. Why count words?

Why should we count words? Counting words is the most basic step we can take in automating the analysis of text. It allows us to convert words into numbers. In practice, counting words can tell us how many times a company, product, or hashtag is mentioned. In the exercises, we'll look at how the R and Python hashtags compare to each other. While this won't settle the question of which programming language is better, it can give us an idea about whether there's more talk about one hashtag compared to the other.

3. Counting with str.contains

To count the frequency of a particular keyword, we'll use `str.contains`. This is a string method of the pandas `Series` object that tells us whether or not each row contains the keyword in question. In other words, `str.contains` returns a boolean `Series`, that is, one containing only True/False values. It also takes the keyword argument `case`. Setting `case=False` makes the match case-insensitive.
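As a minimal sketch of this behavior, here is `str.contains` applied to a small made-up `Series`. The tweet strings and the variable names are invented for illustration and are not part of the course data.

```python
import pandas as pd

# Toy tweets, invented for illustration only
tweets = pd.Series([
    "Learning #Python today",
    "I love #rstats",
    "python or R? Why not both",
])

# One True/False value per row; case=False ignores capitalization
mentions_python = tweets.str.contains("python", case=False)
print(mentions_python.tolist())  # [True, False, True]
```

Without `case=False`, the first tweet would not match, because `str.contains` is case-sensitive by default.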

4. Companies dataset

We'll first look in the `text` column of our Twitter data frame. Say we have a dataset of tweets mentioning three companies: Apple, Facebook, and Google. We want to know what proportion of tweets mention Apple. First, we'll flatten the tweets and load them into a data frame. Then, we'll use `str.contains` on the `text` column. We'll then use `numpy.sum` to add up all the True values, since each True is numerically equal to one. Lastly, we'll divide that sum by the total number of tweets to get the proportion of tweets which mention 'apple'. Here, Apple is mentioned 11.2 percent of the time.
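The counting step can be sketched as follows. This assumes the tweets are already in a data frame with a `text` column; the four rows below are made up, and the name `ds_tweets` is only a placeholder, so the resulting proportion is not the 11.2 percent from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data frame standing in for the flattened tweets (invented rows)
ds_tweets = pd.DataFrame({"text": [
    "New Apple laptop announced",
    "Google search is down",
    "Facebook changes its feed",
    "I switched from apple to Google",
]})

# Boolean Series: does each tweet mention the keyword?
apple = ds_tweets["text"].str.contains("apple", case=False)

# True counts as 1, so the sum is the number of matching tweets
proportion = np.sum(apple) / ds_tweets.shape[0]
print(proportion)  # 0.5
```

Dividing by `ds_tweets.shape[0]` uses the number of rows, so the result is a fraction between 0 and 1.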

5. Counting in multiple text fields

However, a single tweet contains multiple places where relevant keywords can appear. Remember that a tweet may contain a retweet, a quoted tweet, and text over 140 characters. Furthermore, we may also want to search in user locations and user descriptions. You can search these fields separately, or you can loop through them and use the logical `or` operator to connect the results together. Recall that a logical `or` evaluates to True if at least one of the values is True. In this example, we first evaluate the `text` field. We then loop through the extended tweet and retweet text fields to find instances of the keyword. We `or` the results with each other using the pipe operator (`|`). This searches the text and extended text of both the original tweet and the retweet. Lastly, we print out the proportion of tweets which contain the keyword. We see that the proportion has gone up from 11.2 percent to 12.8 percent.
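A rough sketch of the loop-and-`or` approach is shown below. The helper `check_word_in_tweet`, the column names `extended_text` and `retweet_text`, and the rows themselves are all assumptions made for illustration; the real flattened data frame will have its own column names. Passing `na=False` is also an added choice here, so that missing fields count as "no match".

```python
import pandas as pd

def check_word_in_tweet(word, data, columns):
    """Return a boolean Series: True where any listed column contains word."""
    # Start with all False, then OR in the result for each column
    contains = pd.Series(False, index=data.index)
    for col in columns:
        # na=False treats missing text as "does not contain the word"
        contains = contains | data[col].str.contains(word, case=False, na=False)
    return contains

# Toy data with hypothetical column names (invented rows)
tweets = pd.DataFrame({
    "text": ["Apple event today", "RT @news: big launch", "Nice weather", "Google I/O recap"],
    "extended_text": [None, None, "Long post that ends by praising apple pie", None],
    "retweet_text": [None, "Big launch from Apple this fall", None, None],
})

text_only = tweets["text"].str.contains("apple", case=False, na=False)
all_fields = check_word_in_tweet("apple", tweets, ["text", "extended_text", "retweet_text"])

# The mean of a boolean Series is the proportion of True values
print(text_only.mean(), all_fields.mean())  # 0.25 0.75
```

As in the example from the slides, searching more fields can only keep the proportion the same or raise it, since `or` never turns a True into a False.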

6. Let's practice!

Now it's your turn to count keywords in some Twitter data.