1. Setting category variables
To get the most out of using the pandas categorical dtype, we need to understand how to set, add, and remove categories.
2. New dataset: adoptable dogs
Before we begin, let's check out another interesting dataset. The adoptable dogs dataset describes 2937 adoptable dogs and has a lot of great categorical columns for us to explore.
3. A dog's coat
Let's start by converting the coat variable to a category using the astype method, and then check the frequency distribution using the value-counts method. We set the dropna parameter to False so that missing entries are counted as well.
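A minimal sketch of these two steps might look like the following. The small `dogs` DataFrame here is a hypothetical stand-in for the real dataset, and its values are made up purely for illustration, so the counts will not match the ones described in this lesson.

```python
import pandas as pd

# Hypothetical stand-in for the adoptable dogs data (illustrative values only).
dogs = pd.DataFrame(
    {"coat": ["short", "medium", "short", "wirehaired", None, "long", "short"]}
)

# Convert the coat column to the categorical dtype.
dogs["coat"] = dogs["coat"].astype("category")

# dropna=False keeps missing entries in the frequency table.
coat_counts = dogs["coat"].value_counts(dropna=False)
print(coat_counts)
```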
We see that a short coat is the most common, while a long coat is the least common.
4. The .cat accessor object
We are going to use the dot-cat accessor object a lot in this chapter. This object lets us access and manipulate the categories of a categorical Series.
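As a quick illustration, the accessor can be used to inspect a categorical Series before changing anything. This is only a sketch on a hypothetical four-row Series, not the real coat column.

```python
import pandas as pd

# Hypothetical categorical Series (illustrative values only).
coat = pd.Series(["short", "medium", "long", "short"], dtype="category")

# .cat.categories lists the known categories; .cat.ordered says whether they are ordered.
categories = list(coat.cat.categories)
is_ordered = coat.cat.ordered
print(categories, is_ordered)
```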
Most of the methods we will introduce use the following parameters: new-categories, which is a list of new categories for the Series; inplace, which is a Boolean for whether or not the method should overwrite the current Series; and ordered, which is a Boolean for whether or not the new Series should be treated as an ordered categorical. Note that inplace was deprecated in pandas 1.3 and removed in pandas 2.0, so in current versions we assign the result back to the Series instead.
Our first example of using this object and these parameters will be setting new categories.
5. Setting Series categories
cat-dot-set-categories is used to set specific categories for a Series. Any existing category not listed in the new-categories list is dropped, and the values that belonged to it become NaN.
Checking the value counts of this Series again, we see that the wirehaired responses have been set to NaN. This happens because the wirehaired category is not listed in the new-categories parameter and is no longer recognized.
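The behavior described above can be sketched on a small hypothetical Series. The values are illustrative, and the result is assigned back rather than relying on inplace.

```python
import pandas as pd

# Hypothetical coat values (illustrative only).
coat = pd.Series(
    ["short", "medium", "wirehaired", "long", "short"], dtype="category"
)

# Keep only three categories; "wirehaired" is not listed, so its rows become NaN.
coat = coat.cat.set_categories(new_categories=["short", "medium", "long"])

print(coat.value_counts(dropna=False))
```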
6. Setting order
We can set the order of the categories using the ordered parameter. Checking the head of the pandas Series shows us that the Series now knows the categories have a specific order.
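A sketch of setting an order, again on hypothetical values, could look like this. The order short, then medium, then long is an assumption chosen for the example.

```python
import pandas as pd

# Hypothetical coat values (illustrative only).
coat = pd.Series(["short", "medium", "long", "short"], dtype="category")

# ordered=True makes the category order meaningful: short < medium < long.
coat = coat.cat.set_categories(
    new_categories=["short", "medium", "long"], ordered=True
)

# The dtype line at the bottom of the output lists the categories joined by "<".
print(coat.head())
```

Once the Series is ordered, comparisons and methods such as min and max follow the category order rather than alphabetical order.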
7. Missing categories
In the likes-people column, there are 938 rows without a response. Maybe the dog shelter did not check, or maybe they checked and could not tell. Let's add a couple of categories to clean this up.
8. Adding categories
We can add two categories using the cat-dot-add-categories method.
Here we have added two categories to help clarify what a missing value actually means. Notice that values in the existing categories, which are not listed in the new-categories parameter, are not replaced with NaN this time and are simply left alone. We can check the final categories using cat-dot-categories on our pandas Series. Awesome - both categories were added and can now be used in this Series.
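A minimal sketch of adding categories follows. The yes/no responses and the two new labels, "did not check" and "could not tell", are assumptions made for this example based on the scenario described earlier.

```python
import pandas as pd

# Hypothetical likes_people responses with missing entries (illustrative only).
likes_people = pd.Series(["yes", None, "no", "yes", None], dtype="category")

# Add two categories; existing categories and values are left untouched.
likes_people = likes_people.cat.add_categories(
    new_categories=["did not check", "could not tell"]
)

print(likes_people.cat.categories)
```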
9. New categories
Although we added categories, this doesn't mean any rows of our data were set to these categories. Checking the value counts one more time verifies this. We will learn how to update values in a different lesson.
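To see this on a hypothetical Series (values and category labels are assumptions for illustration), the newly added categories show up in the value counts with a count of zero, and the missing rows are still missing.

```python
import pandas as pd

# Hypothetical likes_people responses (illustrative only).
likes_people = pd.Series(["yes", None, "no", "yes", None], dtype="category")
likes_people = likes_people.cat.add_categories(
    new_categories=["did not check", "could not tell"]
)

# New categories exist but no row uses them yet, so their counts are 0.
counts = likes_people.value_counts(dropna=False)
print(counts)
```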
10. Removing categories
We can also remove categories using the cat-dot-remove-categories method. This method takes a list of categories to remove using the removals parameter. In this example, we remove the wirehaired category altogether. This also means that all wirehaired values will be set to NaN.
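A sketch of the removal on hypothetical values might look like this.

```python
import pandas as pd

# Hypothetical coat values (illustrative only).
coat = pd.Series(
    ["short", "wirehaired", "medium", "long", "short"], dtype="category"
)

# Remove the category; rows that held "wirehaired" become NaN.
coat = coat.cat.remove_categories(removals=["wirehaired"])

print(coat.value_counts(dropna=False))
```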
11. Methods recap
Let’s recap the methods covered in this lesson. We first learned how to set categories using the set-categories method, which sets values whose category is not specified to NaN. Add-categories can be used to add new categories, and categories not specified are left alone. Finally, remove-categories removes the listed categories and sets matching values to NaN.
12. Practice updating categories
Let's work through a few examples of setting, adding, and removing categories.