1. Filtering tweets
Hi! I am Vivek Vijayaraghavan, a data science coach and consultant in analytics.
The high volume and variety of tweets, posted at high velocity, make it necessary to apply filters for extracting relevant tweets for analysis.
2. Lesson Overview
In this lesson, we will apply a few powerful filters to the tweet components to extract original tweets, filter tweets based on language, and filter popular tweets based on a minimum number of retweets and favorites.
3. Filtering for original tweets
Let's look at the first filter to extract original tweets.
What is an original tweet?
It is an original posting by a Twitter user and is not a retweet, quote, or reply.
A high share of original tweets suggests that your content is not repetitive, which helps retain user engagement.
4. Filtering for original tweets
The minus filter is used to extract original tweets.
This filter is combined with three operators: -filter:retweets to exclude all retweets, -filter:quote to exclude quoted tweets, and -filter:replies to exclude replies.
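As a minimal sketch of how these exclusions can be written, the snippet below assembles a search query in R. The operator spellings follow Twitter's standard search syntax; the variable names topic, exclusions, and query_original are illustrative and not taken from the lesson.

```r
# Base search phrase used in this lesson.
topic <- "digital marketing"

# A leading minus negates each filter operator.
exclusions <- c("-filter:retweets", "-filter:quote", "-filter:replies")

# Join the phrase and the exclusions into one space-separated query string.
query_original <- paste(c(topic, exclusions), collapse = " ")
print(query_original)
```

The resulting string is what we would pass as the search query.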
5. Extract tweets without filters
To understand how the minus filter works, let's first extract tweets on "digital marketing" without applying any filters.
We use the search_tweets() function to extract 100 tweets.
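A hedged sketch of that call might look like the following. It assumes the rtweet package is installed and that valid Twitter API credentials are already configured. The wrapper name fetch_unfiltered is hypothetical; the request sits inside a function so that nothing is sent until we call it.

```r
# Defining this function has no side effects; the API is only contacted
# when the function is called.
# Assumes rtweet is installed and API credentials are set up.
fetch_unfiltered <- function(query = "digital marketing", n = 100) {
  # n is the number of tweets requested
  rtweet::search_tweets(query, n = n)
}

# Usage (requires credentials):
# tweets_all <- fetch_unfiltered()
```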
6. Extract tweets without filters
In the tweets data frame, let's focus on the columns reply_to_screen_name, is_quote, and is_retweet and count the number of values for each level.
The count() function from the plyr library takes the column names as input and counts the values.
The presence of counts against screen names under reply_to_screen_name indicates that some tweets are replies.
7. Extract tweets without filters
The presence of "TRUE" values under is_quote confirms that some of the tweets are quotes.
8. Extract tweets without filters
Likewise, "TRUE" values under is_retweet confirm the presence of retweets.
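To make the counting step concrete without calling the API, here is a small sketch on a made-up data frame. The rows are invented for illustration, and base R's table() stands in for plyr's count() so that the snippet has no package dependencies; with plyr installed, count(tweets$is_quote) would give an equivalent tally.

```r
# Hypothetical stand-in for the extracted tweets (values are invented).
tweets <- data.frame(
  reply_to_screen_name = c(NA, "user_a", NA, "user_b", NA),
  is_quote             = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  is_retweet           = c(TRUE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors     = FALSE
)

# Tally each column; useNA keeps the NA level visible for replies.
reply_counts   <- table(tweets$reply_to_screen_name, useNA = "ifany")
quote_counts   <- table(tweets$is_quote)
retweet_counts <- table(tweets$is_retweet)

print(reply_counts)
print(quote_counts)
print(retweet_counts)
```

Any screen name in the first tally, or any "TRUE" count in the other two, tells us the data still contains replies, quotes, or retweets.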
9. Exclude retweets, quotes, and replies
Let's now introduce the minus filters to exclude retweets, quotes, and replies.
Under search_tweets(), we introduce the three minus filters within the search query.
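One possible way to express this is sketched below, again assuming rtweet is installed and credentials are configured. The helper names original_query and search_original are hypothetical and are used only to keep the query construction separate from the API request.

```r
# Append the three exclusion operators to any search phrase.
original_query <- function(topic) {
  paste(topic, "-filter:retweets -filter:quote -filter:replies")
}

# Send the filtered query; nothing runs until this function is called.
# Assumes rtweet is installed and API credentials are set up.
search_original <- function(topic, n = 100) {
  rtweet::search_tweets(original_query(topic), n = n)
}

print(original_query("digital marketing"))

# Usage (requires credentials):
# tweets_org <- search_original("digital marketing")
```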
10. Exclude retweets, quotes, and replies
In the output, we see only "NA" values under reply_to_screen_name, which confirms that replies have been excluded.
11. Exclude retweets, quotes, and replies
All values for is_quote and is_retweet are "FALSE", indicating that retweets and quotes have also been filtered out, leaving only the original tweets.
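These checks can also be written as a small helper, sketched here on invented data. The function name is_original_only is hypothetical, and the column names assume the rtweet version used in this lesson.

```r
# TRUE only if there are no replies, no quotes, and no retweets.
is_original_only <- function(tweets) {
  all(is.na(tweets$reply_to_screen_name)) &&
    !any(tweets$is_quote) &&
    !any(tweets$is_retweet)
}

# Invented example: what a fully filtered result would look like.
filtered <- data.frame(
  reply_to_screen_name = c(NA_character_, NA_character_),
  is_quote             = c(FALSE, FALSE),
  is_retweet           = c(FALSE, FALSE),
  stringsAsFactors     = FALSE
)

# Invented example: a result that still contains a reply and a retweet.
unfiltered <- data.frame(
  reply_to_screen_name = c(NA, "user_a"),
  is_quote             = c(FALSE, FALSE),
  is_retweet           = c(TRUE, FALSE),
  stringsAsFactors     = FALSE
)

print(is_original_only(filtered))
print(is_original_only(unfiltered))
```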
12. Filtering tweets on language
Another simple but powerful filter that we can use is the lang filter, which filters tweets based on language.
It matches tweets classified as being of a particular language.
The table shows the language code for a few languages.
13. Filtering tweets on language
Let's filter and extract tweets posted in Spanish.
search_tweets() takes two arguments here: the search query "brand marketing" and the lang argument, set to the language code "es".
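A minimal sketch of this call follows, assuming rtweet is installed and credentials are configured. The names fetch_spanish and all_in_language are hypothetical, and the two sample rows are invented so that the language check can run without the API.

```r
# Request tweets matching the query, restricted to Spanish.
# Assumes rtweet is installed and API credentials are set up.
fetch_spanish <- function(query = "brand marketing") {
  rtweet::search_tweets(query, lang = "es")
}

# Check that every row carries the expected language code.
all_in_language <- function(tweets, code) {
  all(tweets$lang == code)
}

# Invented rows standing in for an API result.
sample_es <- data.frame(
  text = c("Marketing de marca", "Estrategia de marca"),
  lang = c("es", "es"),
  stringsAsFactors = FALSE
)
print(all_in_language(sample_es, "es"))

# Usage (requires credentials):
# tweets_lang <- fetch_spanish()
```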
14. Filtering tweets on language
We see that the extracted tweets are in Spanish in the text column.
15. Filtering tweets on language
We can also look at the lang column which shows "es", the language identifier for Spanish.
16. Filter by retweet and favorite counts
We move on to a third filtering option that can be applied to extract tweets that have a minimum number of retweets and favorites.
The min_faves filter looks for tweets that have a minimum count of favorites applied to them.
Similarly, the min_retweets filter finds tweets that have received a minimum count of retweets.
To ensure both conditions are met, we use the AND operator between the two filters.
17. Filter by retweet and favorite counts
To illustrate the filter, let's extract tweets on "bitcoin" and include the min_faves and min_retweets filters in the search query, each set to a minimum count of 100.
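As a rough sketch, the query could be assembled as shown below. This assumes the two filters are written inside the query string and joined with AND; the helper names popular_query and fetch_popular are hypothetical, and the API request again assumes rtweet and valid credentials.

```r
# Build a query that requires minimum favorite and retweet counts.
popular_query <- function(topic, min_faves = 100, min_retweets = 100) {
  sprintf(
    "%s min_faves:%d AND min_retweets:%d",
    topic, as.integer(min_faves), as.integer(min_retweets)
  )
}

# Send the query; nothing runs until this function is called.
# Assumes rtweet is installed and API credentials are set up.
fetch_popular <- function(topic) {
  rtweet::search_tweets(popular_query(topic))
}

print(popular_query("bitcoin"))

# Usage (requires credentials):
# tweets_pop <- fetch_popular("bitcoin")
```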
18. Filter by retweet and favorite counts
To see the filtering, we extract the columns retweet_count and favorite_count and assign them to a new data frame.
The columns retweet_count and favorite_count store the number of retweets and favorites respectively.
We can see that the values in both these columns are at least 100 as a result of the filtering.
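The column selection and the check can be sketched on invented data as follows. The rows and the names tweets_pop, counts, and meets_minimum are illustrative only.

```r
# Invented rows standing in for the filtered API result.
tweets_pop <- data.frame(
  text           = c("tweet one", "tweet two", "tweet three"),
  retweet_count  = c(100L, 250L, 1300L),
  favorite_count = c(420L, 100L, 5600L),
  stringsAsFactors = FALSE
)

# Keep only the two count columns in a new data frame.
counts <- tweets_pop[, c("retweet_count", "favorite_count")]

# Confirm that every row meets both minimums.
meets_minimum <- all(counts$retweet_count >= 100 & counts$favorite_count >= 100)

print(counts)
print(meets_minimum)
```

Note that the comparison uses >= because a tweet with exactly 100 retweets or favorites still satisfies the minimum.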
19. Filter by retweet and favorite counts
You can look at the text column to view the popular tweets that were retweeted and favorited at least 100 times.
20. Let's practice!
Let's practice filtering tweets based on tweet components!