Identify profiles

We are continuing our exploration of the tweet dataset. The tweets are stored in a nested list of 5055 sublists, which we work through with purrr.

In this exercise, we will answer a question about user behavior: how many users have only retweeted, without ever publishing any "original content"? A general rule of thumb on Twitter is that roughly 80% of people only retweet, while 20% publish content, in line with the Pareto principle. We will check whether this holds for our dataset.

To do so, we'll need to split our dataset in two (retweets and original tweets), then count how many users there are in total and how many of them appear only in the retweet group.
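The counting step described above comes down to two base R set operations. Here is a minimal sketch on made-up data: the user IDs in `rt` and `non_rt` are hypothetical and do not come from `rstudioconf`, and the variable names are chosen for illustration only.

```r
# Hypothetical user IDs, for illustration only (not taken from rstudioconf)
rt <- c("u1", "u2", "u3", "u4")   # users who retweeted at least once
non_rt <- c("u3", "u5")           # users who posted original content

# All distinct users across both groups
all_users <- union(rt, non_rt)

# Users found in the retweet group but not in the original-content group
rt_only <- setdiff(rt, non_rt)

n_total <- length(all_users)
n_rt_only <- length(rt_only)

# Share of users who only retweeted
share_rt_only <- n_rt_only / n_total
```

With these toy vectors, three of the five users only retweeted. Note that `setdiff()` is not symmetric: `setdiff(non_rt, rt)` would instead return the users who never retweeted.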

purrr has been loaded for you, and the rstudioconf list is still available in your workspace.

This exercise is part of the course

Intermediate Functional Programming with purrr

Exercise instructions

  • Create a sublist of retweets, extract the user_id element, and remove duplicates with unique().

  • Create a sublist of original tweets, extract the user_id element, and remove duplicates with unique().

  • Combine union() (from base R) and length() to get the total number of users.

  • Use the setdiff() function (from base R) to get the users who appear only in the retweet sublist.
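The first two steps can be tried out on a small stand-in list before working on the real data. In the sketch below, `tweets` is a hypothetical list that only assumes each sublist has a `user_id` and a logical `is_retweet` element. It uses an explicit predicate function and `map_chr()`, which is one possible approach and not necessarily the one the exercise expects.

```r
library(purrr)

# Hypothetical stand-in for rstudioconf: each sublist is one tweet
tweets <- list(
  list(user_id = "a", is_retweet = TRUE),
  list(user_id = "a", is_retweet = TRUE),
  list(user_id = "b", is_retweet = FALSE),
  list(user_id = "c", is_retweet = TRUE),
  list(user_id = "b", is_retweet = TRUE)
)

# Predicate: TRUE when the tweet is flagged as a retweet
is_rt <- function(tweet) isTRUE(tweet$is_retweet)

# Keep the retweets, pull out user_id as a character vector, drop duplicates
rt_users <- keep(tweets, is_rt) %>%
  map_chr("user_id") %>%
  unique()

# Drop the retweets, pull out user_id, drop duplicates
non_rt_users <- discard(tweets, is_rt) %>%
  map_chr("user_id") %>%
  unique()
```

Here user "b" ends up in both groups, having both retweeted and posted original content, so a set difference is needed to isolate the retweet-only users.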

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Keep the retweets, extract the user_id, remove duplicates
rt <- ___(___, "is_retweet") %>%
  ___("user_id") %>% 
  ___()

# Remove the retweets, extract the user_id, remove duplicates
non_rt <- ___(rstudioconf, "is_retweet") %>%
  ___("user_id") %>% 
  ___()

# Determine the total number of users
___(rt, non_rt) %>% ___()

# Determine the number of users who have only retweeted
___(rt, non_rt) %>% ___()