Categorical pitfalls

1. Categorical pitfalls

Hello again and welcome to the last chapter on working with categorical data in Python. Our first lesson covers common pitfalls when using categorical data.

2. Used cars: the final dataset

We briefly discussed memory savings in chapter one, but let's revisit them here. To do that, we need to introduce our last dataset. The used cars dataset contains information on over 38,000 used cars, including the manufacturer, model, and sale price. This dataset is commonly used to practice building predictive models.

3. Huge memory savings

In chapter 1, we discussed how using a categorical Series can save a lot of memory, although this isn't always the case. Consider the manufacturer name column. It has 55 unique entries and is currently stored as an object. If we use the nbytes attribute to compare the number of bytes this column takes as an object with the number it takes as a category, we see that converting reduces memory usage by almost 90 percent.
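To make the comparison concrete, here is a minimal sketch. It does not use the real used cars data: the manufacturer names and the row count are made up for illustration, so the exact savings will differ from the figure quoted above.

```python
import pandas as pd

# Hypothetical stand-in for the manufacturer name column:
# only a handful of unique values repeated across many rows.
names = ["Ford", "Toyota", "Volkswagen", "Renault", "Audi"]
as_object = pd.Series(names * 2000, dtype="object")
as_category = as_object.astype("category")

# nbytes reports the bytes used by the underlying values.
print("object:  ", as_object.nbytes)
print("category:", as_category.nbytes)

savings = 1 - as_category.nbytes / as_object.nbytes
print(f"memory reduction: {savings:.0%}")
```

Note that nbytes on an object column counts only the references to the strings, not the strings themselves, so the true savings can be larger than this number suggests.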

4. Little memory savings

However, if we convert a numerical column, or even an object column with many unique values, we will see smaller memory savings. Consider the odometer value Series. It has over 6,000 unique values. If we convert this to categorical, we still save memory, but this time it's only a 60% reduction. This is because a categorical column stores each category once, so the number of bytes it needs grows with the number of categories.
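The effect of the number of categories can be sketched with synthetic data. The two Series below are randomly generated stand-ins, not the real odometer column, and the helper name memory_reduction is hypothetical. The row count and the number of distinct values are chosen only to roughly mirror the sizes mentioned above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 38_000

def memory_reduction(series):
    """Fraction of bytes saved by converting a Series to category."""
    return 1 - series.astype("category").nbytes / series.nbytes

# Few distinct values (like a manufacturer code).
low_cardinality = pd.Series(rng.integers(0, 55, size=n_rows))

# Many distinct values (like odometer readings).
distinct_readings = np.arange(6_000) * 50
high_cardinality = pd.Series(rng.choice(distinct_readings, size=n_rows))

low_saving = memory_reduction(low_cardinality)
high_saving = memory_reduction(high_cardinality)

print(f"{low_cardinality.nunique()} categories: {low_saving:.0%} saved")
print(f"{high_cardinality.nunique()} categories: {high_saving:.0%} saved")
```

With more categories, pandas needs both a longer list of category values and wider integer codes for each row, so the savings shrink.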

5. Using categories can be frustrating

If you are not carefully thinking through each change you make to your data, you'll likely run into issues. Consider these potentially frustrating challenges. First, the dot-str accessor and the pandas apply method will often convert the Series back to an object, forcing you to convert it back to a category. Second, the common methods for updating and setting categories discussed in chapter two do not all handle missing categories the same way. And finally, a categorical Series is not a NumPy array. Using NumPy functions on a categorical Series usually produces errors. Let's quickly look at how to handle each problem.

6. Check and convert

For the last time in this course, we will check our output and convert it back to a category if necessary! If you make changes to a Series using dot-str or dot-apply, you may need to convert the Series back to categorical. Always check your column's dtype using dot-dtype, and convert it if necessary using dot-astype and specifying category.
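The check-and-convert habit might look like the sketch below. The body_type Series is a small made-up example rather than a column from the used cars data.

```python
import pandas as pd

# Hypothetical categorical column.
body_type = pd.Series(["sedan", "suv", "sedan", "hatchback"], dtype="category")
print(body_type.dtype)  # category

# A .str method returns a plain, non-categorical Series.
upper = body_type.str.upper()
print(upper.dtype)

# Check the dtype and convert back only if needed.
if not isinstance(upper.dtype, pd.CategoricalDtype):
    converted = upper.astype("category")
else:
    converted = upper

print(converted.dtype)  # category
```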

7. Look for missing values

Anytime you are updating categories, whether that is setting, adding, or removing, use value-counts to make sure the changes you made worked as intended. For the color Series, we have set the categories to only black, silver, and blue. Using the value-counts method with dropna set to False, we see that over 18,000 entries have become NaN values, because any value that is not in the new list of categories is replaced with NaN. If this was not intended, we may need to use a different method for updating the categories.
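Here is a small sketch of that check. The color values are invented for illustration, so the counts are tiny compared with the real dataset.

```python
import pandas as pd

# Hypothetical color column with five distinct colors.
color = pd.Series(
    ["black", "silver", "blue", "red", "white", "black"],
    dtype="category",
)

# Keep only three categories; values outside the list become NaN.
color = color.cat.set_categories(["black", "silver", "blue"])

# dropna=False makes the NaN entries show up in the counts.
counts = color.value_counts(dropna=False)
print(counts)

n_missing = int(color.isna().sum())
print("entries turned into NaN:", n_missing)
```

In this toy example, the red and white entries are the ones that become NaN.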

8. Using NumPy arrays

Although a categorical Series is not a NumPy array, that doesn't mean we can't turn it into one. Humor me while we let the number of photos of a used car be a categorical column. Using NumPy's sum function will give a type error, as NumPy doesn't understand the categorical dtype. However, we can quickly convert the Series to an integer and use the sum method. This is common when using a categorical column that is an integer, such as the number of stars for a hotel. Note that the dot-str accessor returns a non-categorical result, so string methods such as contains can be used on a categorical Series without converting it first.
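A minimal sketch of this workaround follows. The photo counts are made up, and the exact wording of the error message depends on the pandas version, so it is printed rather than assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical number-of-photos column stored as a category.
number_of_photos = pd.Series([4, 9, 12, 4, 7], dtype="category")

# Summing a categorical Series with NumPy raises a TypeError.
try:
    np.sum(number_of_photos)
except TypeError as err:
    print("TypeError:", err)

# Convert to integers first, then sum.
total = number_of_photos.astype(int).sum()
print(total)  # 36
```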

9. Pitfall practice

Let's work through some examples.
