1. Collapsing factor levels
When working with qualitative variables, sometimes you'll run into the problem that there are simply too many categories. When you have tons of categories, they clutter up your graphs and tables and make interpreting the data much harder. Sometimes there are so many you can't even graph them.
2. FiveThirtyEight heights
There are two main solutions to this problem. The first is that you can decide which levels to collapse together. For example, let’s take a look at 538's survey on flying etiquette.
One of the questions they asked was about people's heights. Let's make a bar chart of the responses. This is helpful, but not ideal. There isn't a lot of data for some of these heights - some have fewer than 20 people! And we might guess that the difference between someone under five feet and someone five feet one inch may not be large.
3. fct_collapse()
In this case, we can decide to collapse the tallest people and the shortest people into groups. We can see the survey has already done some of this for us - the original categories include "under 5 feet" and "above 6 feet 6 inches." We can use fct_collapse() to widen those groups to "under 5 feet 3 inches" and "over 6 feet 1 inch." fct_collapse() allows you to make a new level that is a combination of other levels.
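As a rough sketch of what that fct_collapse() call could look like, here is a version on a small made-up height vector. The data, the level labels, and the name height_collapsed are assumptions for illustration, not the survey itself.

```r
library(forcats)

# Toy responses, invented for illustration (not the FiveThirtyEight data)
height <- factor(
  c("Under 5 ft.", "5'0\"", "5'3\"", "5'8\"", "5'8\"", "6'1\"", "6'4\"",
    "6'6\" and above"),
  levels = c("Under 5 ft.", "5'0\"", "5'3\"", "5'8\"", "6'1\"", "6'4\"",
             "6'6\" and above")
)

# Each new level name is set equal to a vector of the old levels it absorbs
height_collapsed <- fct_collapse(
  height,
  "under 5 ft. 3 in." = c("Under 5 ft.", "5'0\""),
  "over 6 ft. 1 in." = c("6'4\"", "6'6\" and above")
)

# Levels that were not mentioned are left as they were
levels(height_collapsed)
table(height_collapsed)
```

With the real survey you would list every level that falls in each range, not just the two used in this toy example.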
4. fct_other(): keep
The second way to reduce the number of categories is to set some of them to be called "Other". You can do this in one of two ways: you can specify which categories should be "Other", or you can choose based on how common each category is.
If you know what categories you want to be renamed to “other”, you can use the forcats function fct_other(). The first argument is the factor variable and the second argument is either the levels you want to keep or the levels you want to drop. Let's take the height variable again and see what happens if we create a new variable, new_height, as the result of setting the fct_other() argument, keep, equal to 6'4" and 5'1".
We see that every level except those two becomes "Other", while NA values stay NA!
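Here is a minimal sketch of the keep version, again on invented data. Only the variable name new_height and the two kept levels come from the example above; the other values are assumptions.

```r
library(forcats)

# Toy responses, invented for illustration; one response is missing
height <- factor(c("5'1\"", "5'5\"", "5'9\"", "6'4\"", NA, "5'5\""))

# Keep two levels by name; every other level is renamed to "Other"
new_height <- fct_other(height, keep = c("6'4\"", "5'1\""))

levels(new_height)
# Missing values are not turned into "Other"; they remain NA
table(new_height, useNA = "ifany")
```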
5. fct_other(): drop
If we want to specify what categories to change into "Other," rather than which ones to keep, we can use "drop." Let's try dropping five levels: from under five feet to five feet three inches.
Indeed, we can see those levels have disappeared since we have collapsed them into "Other"!
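The drop version might look like the sketch below. The data is made up, and the level labels are assumed to follow the same style as the survey's.

```r
library(forcats)

# Toy responses, invented for illustration
height <- factor(c("Under 5 ft.", "5'0\"", "5'1\"", "5'2\"", "5'3\"",
                   "5'4\"", "5'10\""))

# Name the five levels to fold into "Other"; the rest are kept as they are
new_height <- fct_other(
  height,
  drop = c("Under 5 ft.", "5'0\"", "5'1\"", "5'2\"", "5'3\"")
)

levels(new_height)
table(new_height)
```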
6. fct_lump_prop()
On the other hand, sometimes you'll want to use "Other" to group the least common levels together. In this case, you can use one of the fct_lump() functions. You pass the variable you want to modify to either fct_lump_n(), which preserves the n most common levels, or fct_lump_prop(), which preserves the levels that make up more than a proportion prop of the observations. For example, if we wanted to keep only the heights reported by more than 8% of the survey respondents and lump the remaining categories into "Other", we would run fct_lump_prop() with height and the second argument equal to 0.08.
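To make the threshold concrete, here is a sketch with invented counts chosen so that three heights sit above 8% and two sit below it. The counts and the name height_lumped are assumptions.

```r
library(forcats)

# Toy data: 100 invented responses with shares of 50%, 30%, 12%, 5% and 3%
height <- factor(rep(
  c("5'4\"", "5'6\"", "5'8\"", "6'2\"", "6'5\""),
  times = c(50, 30, 12, 5, 3)
))

# Levels with a share above 0.08 are kept; the rarer ones become "Other"
height_lumped <- fct_lump_prop(height, prop = 0.08)

levels(height_lumped)
table(height_lumped)
```

Note that prop is a proportion between 0 and 1, so 8% is written as 0.08 rather than 8.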
7. fct_lump_n()
If you wanted to keep the top three categories instead and have everything else be "Other", you would run fct_lump_n() on height, with n equal to 3. Now we see there are only the top three levels, NA, and "Other".
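A small sketch of the same idea, using invented counts with no ties so the top three levels are unambiguous. The data and the name height_top3 are assumptions.

```r
library(forcats)

# Toy data: invented counts of 5, 4, 3, 2 and 1, plus one missing response
height <- factor(c(
  rep(c("5'4\"", "5'6\"", "5'8\"", "6'2\"", "6'5\""), times = c(5, 4, 3, 2, 1)),
  NA
))

# Keep the three most common levels; lump the remaining ones into "Other"
height_top3 <- fct_lump_n(height, n = 3)

levels(height_top3)
# NA is still reported separately from "Other"
table(height_top3, useNA = "ifany")
```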
8. Let's practice!
Now it's your turn.