Identify profiles
We are still exploring our dataset of tweets. The tweets are stored in a nested list of 5055 sublists, which we are exploring with purrr.
In this exercise, we will answer a question about user behavior: how many users have only retweeted, without ever publishing any "original content"? A common rule of thumb on Twitter is that roughly 80% of people only retweet, while 20% publish content, in line with the Pareto principle. We will check whether this holds for our dataset.
To do so, we'll need to split our dataset in two (retweets and original tweets), then count how many users there are in total and how many users appear only in the "retweet only" group.
purrr has been loaded for you, and the rstudioconf list is still available in your workspace.
This exercise is part of the course Intermediate Functional Programming with purrr.
Exercise instructions
Create a sublist of retweets, extract the user_id element, and remove the duplicates with unique().
Create a sublist of original tweets, extract the user_id element, and remove the duplicates with unique().
Combine union() (from base R) and length() to find the total number of users.
Use the setdiff() function (from base R) to get the users that are only in the retweet sublist.
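To illustrate the two base R set functions named in the instructions, here is a small sketch on made-up user IDs. The vectors retweeters and posters are hypothetical and are not taken from rstudioconf.

```r
# Hypothetical user IDs, for illustration only
retweeters <- c("u1", "u2", "u3")  # users with at least one retweet
posters    <- c("u3", "u4")        # users with at least one original tweet

# union() returns every distinct ID that appears in either vector
all_users <- union(retweeters, posters)
n_users <- length(all_users)

# setdiff(x, y) returns the IDs in x that are absent from y,
# so the argument order matters
only_retweeters <- setdiff(retweeters, posters)
only_posters    <- setdiff(posters, retweeters)
```

Here all_users holds four IDs, only_retweeters holds "u1" and "u2", and only_posters holds "u4". Because setdiff() is not symmetric, the retweet vector has to come first when looking for users who only retweet.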
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Keep the retweets, extract the user_id, remove the duplicates
rt <- ___(___, "is_retweet") %>%
  ___("user_id") %>%
  ___()
# Remove the retweets, extract the user_id, remove the duplicates
non_rt <- ___(rstudioconf, "is_retweet") %>%
  ___("user_id") %>%
  ___()
# Determine the total number of users
___(rt, non_rt) %>% ___()
# Determine the number of users who have only retweeted
___(rt, non_rt) %>% ___()
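The same split-then-compare pattern can be tried on a tiny hand-made list before running it on the real data. The sketch below is one possible approach, not the exercise's reference solution: the toy_tweets list and all variable names are hypothetical, and map_chr() is used (an assumption) so that the IDs come back as plain character vectors.

```r
library(purrr)

# Hypothetical toy data: each sublist mimics one tweet
toy_tweets <- list(
  list(user_id = "a", is_retweet = TRUE),
  list(user_id = "a", is_retweet = TRUE),
  list(user_id = "b", is_retweet = FALSE),
  list(user_id = "c", is_retweet = TRUE),
  list(user_id = "c", is_retweet = FALSE),
  list(user_id = "d", is_retweet = TRUE)
)

# Keep the retweets, pull out the user IDs, drop duplicates
toy_rt <- keep(toy_tweets, "is_retweet") %>%
  map_chr("user_id") %>%
  unique()

# Drop the retweets, pull out the user IDs, drop duplicates
toy_non_rt <- discard(toy_tweets, "is_retweet") %>%
  map_chr("user_id") %>%
  unique()

# Total number of distinct users across both groups
n_total <- union(toy_rt, toy_non_rt) %>% length()

# Users who appear in the retweet group but never in the original group
n_rt_only <- setdiff(toy_rt, toy_non_rt) %>% length()

# Share of users who only retweet, to compare against the 80% rule of thumb
share_rt_only <- n_rt_only / n_total
```

In this toy list, users "a" and "d" only retweet, "b" only posts, and "c" does both, so n_total is 4, n_rt_only is 2, and share_rt_only is 0.5.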