1. Membership constraints
Fantastic work on Chapter 1! You're now equipped to treat more complex and more specific data cleaning problems.
2. In this chapter
In this chapter, we're going to take a look at common data problems with text and categorical data, so let's get started.
3. Categories and membership constraints
In this lesson, we'll focus on categorical variables. As discussed earlier in Chapter 1, categorical data represent variables that take values from a predefined, finite set of categories.
Examples include marriage status, household income categories, loan status, and others.
To run machine learning models on categorical data, the categories are often encoded as numbers.
Since categorical data represent a predefined set of categories, they can't have values that go beyond these predefined categories.
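Here's a minimal sketch of both ideas, assuming pandas. The loan status values and the list of allowed categories are made up for illustration. Declaring the allowed categories up front lets pandas encode each value as an integer code, and any value outside the set gets the code -1.

```python
import pandas as pd

# Hypothetical loan-status values; "defaultt" is a deliberate typo.
loan_status = pd.Series(["current", "late", "default", "defaultt"])

# The predefined, finite set of valid categories (assumed for this sketch).
allowed = ["current", "late", "default"]

# Encode the column as a categorical with a fixed category set.
coded = pd.Categorical(loan_status, categories=allowed)

# One integer code per allowed category; -1 marks a value outside the set.
codes = coded.codes.tolist()
print(codes)
```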
4. Why could we have these problems?
We can have inconsistencies in our categorical data for a variety of reasons. This could be due to data entry issues, such as collecting values through free-text fields instead of dropdown fields, data parsing errors, and other types of errors.
5. How do we treat these problems?
There's a variety of ways we can treat these, with increasingly specific solutions for different types of inconsistencies.
Most simply, we can drop the rows with incorrect categories. We can also attempt to remap incorrect categories to correct ones, and more.
We'll see a variety of ways of dealing with this throughout the chapter and the course, but for now we'll just focus on dropping data.
6. An example
Let's first look at an example. Here's a DataFrame named study_data containing a list of first names, birth dates, and blood types.
Additionally, a DataFrame named categories, containing the correct possible categories for the blood type column, has been created.
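As a rough sketch, the two DataFrames could look something like this in pandas. The rows, and the column names name and birthday, are hypothetical stand-ins; only the blood_type column name and the overall layout follow the description above. One row carries an invalid blood type on purpose.

```python
import pandas as pd

# Hypothetical study participants (values invented for illustration).
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer"],
    "birthday": pd.to_datetime(
        ["2019-10-20", "2020-07-08", "2019-08-12", "2019-03-17", "2019-12-17"]
    ),
    "blood_type": ["B-", "A-", "O+", "O-", "Z+"],
})

# Reference table listing every valid blood type.
categories = pd.DataFrame({
    "blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]
})

print(study_data)
print(categories)
```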
7. An example
Notice the inconsistency here? There's definitely no blood type named Z+. Luckily, the categories DataFrame will help us systematically spot all rows with these inconsistencies.
It's always good practice to keep a log of all possible values of your categorical data, as it will make dealing with these types of inconsistencies way easier.
8. A note on joins
Now before moving on to dealing with these inconsistent values, let's have a brief reminder on joins. The two main types of joins we care about here are anti joins and inner joins.
We join DataFrames on common columns between them.
Anti joins take in two DataFrames, A and B, and return data from one DataFrame that is not contained in the other.
In this example, we are performing a left anti join of A and B, which returns only the rows of A whose values in the common column being joined on are not found in B.
Inner joins return only the data that is contained in both DataFrames. For example, an inner join of A and B would return columns from both DataFrames for values found in both A and B in the common column being joined on.
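To make the two joins concrete, here is a small sketch in pandas. The DataFrames A and B, the column name key, and the values are all made up. The inner join uses merge directly. For the left anti join, this sketch uses one common workaround, a left merge with indicator=True followed by a filter on the rows marked left_only.

```python
import pandas as pd

# Two small hypothetical DataFrames sharing the column "key".
A = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
B = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Inner join: keep only keys present in both A and B.
inner = A.merge(B, on="key", how="inner")

# Left anti join: left-merge against B's keys, then keep the rows
# that had no match in B (flagged "left_only" by the indicator column).
merged = A.merge(B[["key"]], on="key", how="left", indicator=True)
left_anti = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(inner)
print(left_anti)
```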
9. A left anti join on blood types
In our example, a left anti join essentially returns all the data in study_data with inconsistent blood types,
10. An inner join on blood types
and an inner join returns all the rows containing consistent blood types.
11. Finding inconsistent categories
Now let's see how to do that in Python.
We first get all inconsistent categories in the blood_type column of the study_data DataFrame. We do that by creating a set out of the blood_type column, which stores its unique values, and calling its difference method with the blood_type column from the categories DataFrame as the argument. This returns all the categories in the study_data blood_type column that are not in categories.
We then find the inconsistent rows by using the isin method on the blood_type column to check which rows match one of the inconsistent categories. This returns a Series of boolean values that are True for inconsistent rows and False for consistent ones.
We then subset the study_data DataFrame based on these boolean values, and voila we have our inconsistent data.
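The three steps just described can be sketched as follows. The sample rows are hypothetical, and the variable names inconsistent_categories, inconsistent_rows, and inconsistent_data are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data: one row has a blood type that does not exist.
study_data = pd.DataFrame({
    "name": ["Beth", "Paul", "Jennifer"],
    "blood_type": ["B-", "O+", "Z+"],
})
categories = pd.DataFrame({
    "blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]
})

# Step 1: unique blood types in the data that are not valid categories.
inconsistent_categories = set(study_data["blood_type"]).difference(
    categories["blood_type"]
)

# Step 2: boolean Series, True where the row has an inconsistent blood type.
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)

# Step 3: subset the DataFrame with the boolean Series.
inconsistent_data = study_data[inconsistent_rows]
print(inconsistent_data)
```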
12. Dropping inconsistent categories
To drop inconsistent rows and keep only the consistent ones, we just use the tilde symbol while subsetting, which returns everything except the inconsistent rows.
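As a short sketch of the tilde in action, again on hypothetical sample rows, with consistent_data as an illustrative variable name:

```python
import pandas as pd

# Hypothetical data: one row has a blood type that does not exist.
study_data = pd.DataFrame({
    "name": ["Beth", "Paul", "Jennifer"],
    "blood_type": ["B-", "O+", "Z+"],
})
categories = pd.DataFrame({
    "blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]
})

# Flag the rows whose blood type is not a valid category.
inconsistent_categories = set(study_data["blood_type"]).difference(
    categories["blood_type"]
)
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)

# The tilde inverts the boolean Series, so only consistent rows are kept.
consistent_data = study_data[~inconsistent_rows]
print(consistent_data)
```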
13. Let's practice!
Now that we know about treating categorical data, let's practice!