Processing Twitter text

1. Processing Twitter text

In the first chapter, we collected Twitter data. In this chapter, we're going to analyze the text of the tweets themselves.

2. Text in Twitter JSON

In the prior lesson, we saw that there are many parts to the Twitter JSON. Arguably, the most important part of the tweet is the text. Recall that the primary location of the text in the Twitter JSON can be accessed via the `text` key. However, that's not the only place where we can extract meaningful text-based elements.
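As a quick refresher, here is a minimal sketch of reading the `text` key from a parsed tweet. The payload below is a made-up, heavily trimmed example, not real Twitter data.

```python
import json

# Hypothetical, heavily trimmed tweet payload for illustration only.
raw_tweet = (
    '{"id": 1, "text": "Learning to parse tweets", '
    '"user": {"screen_name": "example_user"}}'
)

# json.loads converts the JSON string into a Python dictionary.
tweet = json.loads(raw_tweet)

# The primary tweet text lives under the top-level `text` key.
tweet_text = tweet["text"]
print(tweet_text)
```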

3. More than 140 characters

Twitter has made it possible to use more than 140 characters. While this is great for self-expression and fun emoji games, Twitter had to add a child JSON object to the tweet JSON to accommodate it. For tweets over 140 characters, we can access the text through the `extended_tweet` child JSON object and its `full_text` field.
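A small sketch of that lookup follows. The helper name `get_full_text` and both sample tweets are assumptions made for illustration; only the `text`, `extended_tweet`, and `full_text` keys come from the lesson.

```python
def get_full_text(tweet):
    """Return the extended text when present, otherwise the standard text.

    `get_full_text` is a hypothetical helper, not part of any library.
    """
    if "extended_tweet" in tweet:
        return tweet["extended_tweet"]["full_text"]
    return tweet["text"]


# Made-up sample tweets, already parsed into dictionaries.
short_tweet = {"text": "A short tweet"}
long_tweet = {
    "text": "A long tweet that is cut off",
    "extended_tweet": {
        "full_text": "A long tweet that is cut off in `text` but complete here"
    },
}

print(get_full_text(short_tweet))
print(get_full_text(long_tweet))
```

Falling back to `text` means the same call works for both short and long tweets.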

4. Retweets and quoted tweets

Also recall that to get these text fields in retweets and quoted tweets, we have to access the `retweeted_status` or `quoted_status` child JSON objects. Once we parse the JSON, we can access these elements by chaining dictionary lookups.
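Chained lookups might look like the sketch below. The two tweets are invented examples that have already been parsed into dictionaries; real payloads contain many more fields.

```python
# Made-up retweet: the original tweet sits under `retweeted_status`.
retweet = {
    "text": "RT @original_user: The original tweet",
    "retweeted_status": {
        "text": "The original tweet",
        "user": {"screen_name": "original_user"},
    },
}

# Made-up quote tweet: the quoted tweet sits under `quoted_status`.
quote_tweet = {
    "text": "My take on this",
    "quoted_status": {
        "text": "A quoted tweet that is cut off",
        "extended_tweet": {
            "full_text": "A quoted tweet that runs past 140 characters"
        },
        "user": {"screen_name": "quoted_user"},
    },
}

# Chain dictionary lookups to reach the nested values.
retweeted_text = retweet["retweeted_status"]["text"]
quoted_full_text = quote_tweet["quoted_status"]["extended_tweet"]["full_text"]

print(retweeted_text)
print(quoted_full_text)
```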

5. Textual user information

Parts of the user JSON object may have informative textual elements as well. We can gain insight about a user's communities, partisan behavior, or geographical location through their user profile. To extract these, we access the `description` and `location` fields in the user child JSON.
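For instance, with a parsed tweet in hand, the profile fields can be read as sketched here. The sample values are invented, and `.get` is used on the assumption that a profile field may be absent or empty.

```python
# Made-up tweet with a trimmed `user` child object.
tweet = {
    "text": "Good morning!",
    "user": {
        "screen_name": "example_user",
        "description": "Data enthusiast and coffee drinker",
        "location": "Boston, MA",
    },
}

user = tweet["user"]

# .get returns None instead of raising KeyError if a field is missing.
user_description = user.get("description")
user_location = user.get("location")

print(user_description)
print(user_location)
```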

6. Flattening Twitter JSON

So far we've only dealt with individual tweets. To analyze tweets at scale, it's helpful to put everything into a pandas DataFrame. This lets us apply analysis methods across all rows and multiple columns. However, values nested in child JSON objects aren't easy to access as columns. To fix this, we'll flatten the JSON by storing the child values in top-level keys. For convenience, we'll name each new key by joining the original keys with a dash. This is how we would flatten a single 280-character tweet, for instance.
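A minimal sketch of flattening one tweet, using a made-up payload. The new key `extended_tweet-full_text` follows the dash-separated naming described above.

```python
import json

# Hypothetical 280-character-style tweet, trimmed for illustration.
raw_tweet = json.dumps({
    "text": "A long tweet that is cut off",
    "extended_tweet": {"full_text": "A long tweet shown in full"},
})

tweet = json.loads(raw_tweet)

# Copy the nested value into a top-level key named with a dash.
if "extended_tweet" in tweet:
    tweet["extended_tweet-full_text"] = tweet["extended_tweet"]["full_text"]

print(tweet["extended_tweet-full_text"])
```

The original nested object is left in place; the flattened key is simply an extra top-level entry.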

7. Flattening Twitter JSON

To put these all into a DataFrame, we will have to loop through all of the tweets individually. We'll open a JSON file full of tweets, in this case `all_tweets.json`. We'll then split the file contents on the newline character. Then, for each tweet, we'll parse the JSON with `json.loads`, which converts JSON to Python. We'll then check if the tweet contains a field of interest, such as a 280-character tweet in `extended_tweet`. If it does, we'll create a top-level field for it in the tweet dictionary. Here, we are creating a new field, `extended_tweet-full_text`. We'll then add this tweet object to a list of tweets. This is now a list of dictionaries which represent tweets. Because this is a list of dictionaries, we can pass it as an argument to the pandas DataFrame constructor, which will convert this JSON file full of tweets to a DataFrame.
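The loop described above can be sketched as follows. To keep the example self-contained, an in-memory string stands in for the contents of `all_tweets.json`; the sample tweets and variable names are assumptions, and skipping blank lines is an added safeguard rather than something the lesson specifies.

```python
import json

import pandas as pd

# Stand-in for the contents of all_tweets.json: one JSON tweet per line.
# With a real file you might use: open("all_tweets.json").read()
tweets_json = "\n".join([
    json.dumps({"text": "A short tweet"}),
    json.dumps({
        "text": "A long tweet that is cut off",
        "extended_tweet": {"full_text": "A long tweet shown in full"},
    }),
])

tweets = []
for line in tweets_json.split("\n"):
    if not line.strip():
        continue  # skip blank lines, e.g. a trailing newline

    tweet_obj = json.loads(line)  # JSON string -> Python dictionary

    # Store the nested full text in a top-level, dash-named key.
    if "extended_tweet" in tweet_obj:
        tweet_obj["extended_tweet-full_text"] = (
            tweet_obj["extended_tweet"]["full_text"]
        )

    tweets.append(tweet_obj)

# A list of dictionaries can be passed straight to the constructor.
df_tweets = pd.DataFrame(tweets)
print(df_tweets.columns.tolist())
```

Tweets without an `extended_tweet` object end up with a missing value in the `extended_tweet-full_text` column.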

8. Let's practice!

In the exercises, we're going to write a function that flattens multiple fields, which we can use for the rest of this course.