
Categorical data problems

1. Categorical data problems

Now that we've discussed membership constraints, we'll take a deeper dive into categorical data and discuss other ways to address those pesky values that don't belong besides removing them.

2. Categorical data problems

There are two specific types of dirty categorical data that we'll discuss. The first is when there is inconsistency within a category. For example, "Lizard" with a capital "L" and "lizard" with a lowercase "l" should be mapped to the same category even though their capitalization differs. The other type is when there are too many categories. If we only have one data point with "Pug", one data point with "Lab", and one data point with "Boxer", those categories might not be so useful on their own. It will be easier to work with our data if we collapse them all into one category, "Dog".

3. Example: animal classification

Let's start out with an example. We have a data frame called animals that contains different characteristics of animals along with their type, such as mammal, fish, or bird.

4. Checking categories

We can explore all the different type categories in the data using count. At a glance, we can see if there are any categories that need correction. Right off the bat, we can see that there are multiple categories for mammal, including " mammal " with extra spaces before and after the word, "MAMMAL" in all caps, and "Mammal " with one capital letter and an extra space.
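As a rough sketch of this check, assuming a small made-up stand-in for the animals data frame (the animal_name column and all values below are hypothetical, since the real data isn't shown here), counting the type column might look like this:

```r
library(dplyr)

# Hypothetical stand-in for the animals data frame
animals <- tibble(
  animal_name = c("bear", "deer", "dolphin", "seal", "carp", "crow"),
  type = c("mammal", "mammal", " mammal ", "MAMMAL", "fish", "bird")
)

# One row per distinct value of type, with its frequency in column n
type_counts <- animals %>%
  count(type)

type_counts
```

Because count treats every distinct string as its own category, " mammal " and "MAMMAL" show up as separate rows next to "mammal".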

5. Case inconsistency

To fix the case inconsistency between the different mammal categories, we can use the str_to_lower function from the stringr package. Looking at row 3, we can see that the mammal with a capital M has been converted to all lowercase.
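Here's a minimal sketch of that step, again assuming a made-up animals data frame. The new column name type_lower follows the naming used in this lesson, and the values are hypothetical:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the animals data frame
animals <- tibble(
  type = c("mammal", "bird", "MAMMAL", "fish", "Mammal ")
)

# Add a lowercase version of type; whitespace is left untouched
animals <- animals %>%
  mutate(type_lower = str_to_lower(type))

animals
```

Note that the last value still carries its trailing space after this step, so it does not yet match "mammal" exactly.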

6. Case inconsistency

If we now count the type_lower column, we can see that there are nine categories instead of ten since the all caps "MAMMAL" was converted to lowercase "mammal".

7. Case inconsistency

We could also convert everything to uppercase using the str_to_upper function.
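The uppercase version works the same way. In this sketch, the column name type_upper and the data are assumptions made for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the animals data frame
animals <- tibble(
  type = c("mammal", "MAMMAL", "Mammal ", "bird")
)

# Add an uppercase version of type
animals <- animals %>%
  mutate(type_upper = str_to_upper(type))

animals
```

Either direction works, as long as the same case is applied to the whole column.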

8. Whitespace inconsistency

To address the inconsistency in the whitespace around the word "mammal", we can use the stringr function str_trim. This will remove any whitespace from the beginning of the string and the end of the string, but not the middle of the string. Now the category in row 3 matches the category in row 1.
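A small sketch of trimming, assuming a type_lower column already exists. The data is made up, and "sea mammal " is a hypothetical value included only to show that interior spaces survive:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the animals data frame after lowercasing
animals <- tibble(
  type_lower = c("mammal", "bird", " mammal ", "sea mammal ")
)

# str_trim removes leading and trailing whitespace only
animals <- animals %>%
  mutate(type_trimmed = str_trim(type_lower))

animals
```

By default str_trim trims both sides; its side argument can restrict trimming to "left" or "right" if that's ever needed.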

9. Whitespace inconsistency

If we now look at the different categories, there's only one mammal category, so all of the mammals in the dataset have been mapped to the same category.
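Putting the two fixes together, a sketch of the full cleanup followed by a recount might look like the following. The data is hypothetical, and chaining both functions in a single mutate is just one way to arrange it:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the animals data frame with messy type values
animals <- tibble(
  type = c("mammal", " mammal ", "MAMMAL", "Mammal ", "bird")
)

# Lowercase first, then trim, then count the cleaned categories
cleaned_counts <- animals %>%
  mutate(type_trimmed = str_trim(str_to_lower(type))) %>%
  count(type_trimmed)

cleaned_counts
```

All four spellings of mammal now fall under a single "mammal" row.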

10. Too many categories

Take a look at the categories we have now. Notice that most of our dataset is made up of mammals and birds. There are only two amphibians, two fish, one bug, one invertebrate, and one reptile. That's five extra categories for only seven data points, and summary statistics for these groups won't be very useful since they only contain one or two observations.

11. Collapsing categories

We can solve this problem by collapsing these categories into a new, broader category called "other". First, we'll create a vector called other_categories that stores the categories we want to collapse together. Next, we'll load the forcats package and add a new column to the animals data frame called type_collapsed. To create this column, we'll use the fct_collapse function. We pass type_trimmed to fct_collapse since this is the column we want to base our new column on, and then we set the other argument equal to other_categories. This tells the function that all the categories contained in the other_categories vector should be renamed to "other". Now, row 4 has the type "other" instead of "fish".
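These steps can be sketched as follows, assuming a made-up animals data frame with an already cleaned type_trimmed column. The names other_categories, type_trimmed, and type_collapsed follow the lesson, while the data itself is hypothetical:

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for the cleaned animals data frame
animals <- tibble(
  type_trimmed = c("mammal", "bird", "mammal", "fish", "amphibian",
                   "bug", "invertebrate", "reptile", "bird")
)

# Categories with too few observations to be useful on their own
other_categories <- c("amphibian", "fish", "bug", "invertebrate", "reptile")

# Rename every level listed in other_categories to "other"
animals <- animals %>%
  mutate(type_collapsed = fct_collapse(type_trimmed, other = other_categories))

# Count the collapsed categories
animals %>%
  count(type_collapsed)
```

fct_collapse returns a factor, so type_collapsed is a factor column even though type_trimmed is a character column in this sketch.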

12. Collapsing categories

If we count the categories in our data frame now, there are only three! This makes it easier to compare all other animals to birds and mammals.

13. Let's practice!

Time to solve some categorical data problems of your own!