Lumping variables by proportion

Often you won't have specific levels that you want to recode as "Other" or collapse together. Instead, you want to keep the most common levels and put everything else into "Other". This is especially helpful for displaying your data when there are many levels and most of them are rare. Let's try this out using the question from the Kaggle survey about which machine learning method respondents wanted to try next year. multiple_choice_responses has been loaded for you. When you're counting, remember that sort = TRUE sorts the counts in descending order.
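
As a rough illustration of the idea, here is a minimal sketch on a made-up toy data frame. The names toy and method are hypothetical and not part of the Kaggle survey, and the threshold is 10% rather than the 5% used in the exercise. It assumes a version of forcats that provides fct_lump_prop().

```r
library(dplyr)
library(forcats)

# Hypothetical toy data (not the Kaggle survey): 12 A, 6 B, 1 C, 1 D, 2 missing
toy <- tibble(method = c(rep("A", 12), rep("B", 6), "C", "D", NA, NA))

result <- toy %>%
  # Drop rows where no method was selected
  filter(!is.na(method)) %>%
  # Keep levels that make up more than 10% of rows; lump the rest into "Other"
  mutate(ml_method = fct_lump_prop(method, prop = 0.1)) %>%
  # Count each level, most common first
  count(ml_method, sort = TRUE)

print(result)
```

In this toy data, C and D each account for 5% of the non-missing rows, so both fall below the 10% threshold and are combined into a single "Other" level with a count of 2.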

This exercise is part of the course

Categorical Data in the Tidyverse

Exercise instructions

  • Remove respondents who didn't select a method.
  • Create a new variable, ml_method, from MLMethodNextYearSelect that preserves methods selected by at least 5% of respondents and lumps the rest as "Other" (the default value).
  • Finally, count your new variable, sorted in descending order.

Interactive exercise

Try this exercise by completing the sample code.

multiple_choice_responses %>%
  # Remove NAs of MLMethodNextYearSelect
  filter(___) %>%
  # Create ml_method, which lumps all those with less than 5% of people into "Other"
  mutate(ml_method = ___(MLMethodNextYearSelect, ___)) %>%
  # Count the frequency of your new variable, sorted in descending order
  ___(___, ___)