Identify profiles

We are still exploring our dataset of tweets. The tweets are stored in a nested list of 5055 sublists, which we are working through with purrr.

In this exercise, we will answer a question about user behavior: how many users have only retweeted, without ever publishing any "original content"? A common rule of thumb on Twitter is that roughly 80% of people only retweet while 20% publish content, in line with the Pareto principle. We will check whether this holds for our dataset.

To do so, we'll split our dataset in two, then count how many users there are in total and how many appear only in the "retweet only" group.
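
The counting step boils down to set arithmetic on two vectors of user IDs. Here is a minimal sketch in base R, using made-up IDs (`retweeters` and `posters` are hypothetical names, not objects from the exercise workspace):

```r
# Hypothetical user IDs, for illustration only
retweeters <- c("u1", "u2", "u3")  # users with at least one retweet
posters    <- c("u3", "u4")        # users with at least one original tweet

# union() keeps each user once, whichever group they belong to
all_users <- union(retweeters, posters)

# setdiff() keeps the users found in the first vector but not in the second
retweet_only <- setdiff(retweeters, posters)

# Share of users who only retweet, to compare against the 80% rule of thumb
share_retweet_only <- length(retweet_only) / length(all_users)
share_retweet_only
```

Note that `setdiff()` is not symmetric: swapping the arguments would return the users who have only published original tweets.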

purrr has been loaded for you, and the rstudioconf list is still available in your workspace.

This exercise is part of the course

Intermediate Functional Programming with purrr

Exercise instructions

  • Create a sublist of retweets, extract the user_id element, and remove duplicates with unique().

  • Create a sublist of original tweets, extract the user_id element, and remove duplicates with unique().

  • Combine union() (from base R) and length() to find the total number of users.

  • Use the setdiff() function (from base R) to get the users that are only in the retweet sublist.
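
The steps above can be sketched on a tiny stand-in list. This is an illustration under stated assumptions, not the exercise solution: `tweets` is a hypothetical toy list, and it is assumed that each sublist carries a character `user_id` and a logical `is_retweet`, as the instructions suggest for rstudioconf.

```r
library(purrr)

# Hypothetical toy data standing in for the real rstudioconf list
tweets <- list(
  list(user_id = "a", is_retweet = TRUE),
  list(user_id = "b", is_retweet = FALSE),
  list(user_id = "a", is_retweet = TRUE),
  list(user_id = "c", is_retweet = TRUE),
  list(user_id = "c", is_retweet = FALSE)
)

# keep() retains the sublists whose "is_retweet" element is TRUE
rt_users <- keep(tweets, "is_retweet") %>%
  map_chr("user_id") %>%  # assumes user_id is a single string per tweet
  unique()

# discard() does the opposite and retains the original tweets
original_users <- discard(tweets, "is_retweet") %>%
  map_chr("user_id") %>%
  unique()

# Total number of distinct users across both groups
n_users <- union(rt_users, original_users) %>% length()

# Number of users who appear only among the retweeters
n_retweet_only <- setdiff(rt_users, original_users) %>% length()
```

In this toy list, user "c" has both retweeted and posted, so only user "a" counts as "retweet only".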

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Keep the retweets, extract the user_id, remove duplicates
rt <- ___(___, "is_retweet") %>%
  ___("user_id") %>% 
  ___()

# Remove the retweets, extract the user_id, remove duplicates
non_rt <- ___(rstudioconf, "is_retweet") %>%
  ___("user_id") %>% 
  ___()

# Determine the total number of users
___(rt, non_rt) %>% ___()

# Determine the number of users who have only retweeted
___(rt, non_rt) %>% ___()