1. Categorical variables
Awesome work on the last lesson. Now let's discuss other types of problems that could affect categorical variables.
2. What type of errors could we have?
In the last lesson, we saw how categorical data has a value membership constraint, where columns need to have a predefined set of values. However, this is not the only set of problems we may encounter.
When cleaning categorical data, some of the problems we may encounter include value inconsistency, the presence of too many categories that could be collapsed into fewer ones, and data that is not of the right type.
3. Value consistency
Let's start with making sure our categorical data is consistent. A common categorical data problem is having values that slightly differ because of capitalization.
Not treating this could lead to misleading results when we decide to analyze our data. For example, let's assume we're working with a demographics dataset, and we have a marriage status column with inconsistent capitalization.
Here's what counting the number of married people in the marriage_status Series would look like.
Note that the dot-value_counts() method works on Series only.
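As a minimal sketch of what this count could look like, here is a small made-up demographics DataFrame. The values are hypothetical and only stand in for the real dataset, which isn't shown here.

```python
import pandas as pd

# Hypothetical sample data standing in for the demographics dataset.
demographics = pd.DataFrame({
    "marriage_status": ["married", "Married", "UNMARRIED", "unmarried", "married"]
})

# Select the column as a Series and count each distinct value.
marriage_status = demographics["marriage_status"]
counts = marriage_status.value_counts()

# Each capitalization variant shows up as its own category.
print(counts)
```

With this sample, we get four categories instead of the two we would expect.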
4. Value consistency
For a DataFrame, we can group by the column and use the dot-count() method.
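Here is a rough sketch of the group-by approach, again on made-up data. The household_income column and its values are assumptions added only so there is something to count.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
demographics = pd.DataFrame({
    "household_income": [30_000, 85_000, 120_000, 45_000, 60_000],
    "marriage_status": ["married", "Married", "UNMARRIED", "unmarried", "married"],
})

# Group the rows by marriage status, then count non-missing values
# in every remaining column for each group.
grouped_counts = demographics.groupby("marriage_status").count()
print(grouped_counts)
```

The inconsistent capitalization still produces one group per variant.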
5. Value consistency
To deal with this, we can either uppercase or lowercase the marriage_status column. This can be done with the str-dot-upper() or str-dot-lower() methods, respectively.
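A minimal sketch of the lowercase option, using the same kind of hypothetical sample values as before:

```python
import pandas as pd

# Hypothetical sample data with inconsistent capitalization.
demographics = pd.DataFrame({
    "marriage_status": ["married", "Married", "UNMARRIED", "unmarried", "married"]
})

# Lowercase every value; .str.upper() would work the same way for uppercase.
demographics["marriage_status"] = demographics["marriage_status"].str.lower()

# The variants now fall into just two categories.
counts = demographics["marriage_status"].value_counts()
print(counts)
```

Either direction works, as long as we apply the same one to the whole column.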
6. Value consistency
Another common problem with categorical values is leading or trailing spaces. For example, imagine the same demographics DataFrame containing values with leading or trailing spaces. Here's what the counts of married vs unmarried people would look like.
Note that there is a married category with a trailing space, which makes it hard to spot in the output, as opposed to the leading space on unmarried.
7. Value consistency
To remove leading and trailing spaces, we can use the str-dot-strip() method, which, when given no input, strips all leading and trailing whitespace.
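To make this concrete, here is a small sketch on made-up values that contain stray spaces. The data is hypothetical and only meant to illustrate the before and after.

```python
import pandas as pd

# Hypothetical sample data: note the leading and trailing spaces.
demographics = pd.DataFrame({
    "marriage_status": [" unmarried", "married ", "married", "unmarried", "married "]
})

# Before stripping, the padded values count as separate categories.
before = demographics["marriage_status"].value_counts()

# With no argument, .str.strip() removes leading and trailing whitespace.
demographics["marriage_status"] = demographics["marriage_status"].str.strip()
after = demographics["marriage_status"].value_counts()

print(before)
print(after)
```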
8. Collapsing data into categories
Sometimes, we may want to create categories out of our data, such as creating household income groups from income data.
To create categories out of data, let's use the example of creating an income group column in the demographics DataFrame. We can do this in 2 ways.
The first method utilizes the qcut function from pandas, which automatically divides our data based on its distribution into the number of categories we set in the q argument. We created the category names in the group_names list and fed it to the labels argument, returning the following.
Notice that the first row misrepresents the actual income of the income group, as we didn't tell qcut where our ranges actually lie.
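The following is a sketch of the qcut approach under some assumptions: the household_income column name, the income values, and the group names are all made up for illustration. It shows how labels can end up not matching the incomes they are attached to.

```python
import pandas as pd

# Hypothetical incomes; the real data is not shown in this lesson.
demographics = pd.DataFrame({
    "household_income": [20_000, 90_000, 150_000, 180_000, 250_000, 600_000]
})

# Assumed label names for three income groups.
group_names = ["0-200K", "200K-500K", "500K+"]

# qcut splits the data into q groups of roughly equal size,
# based on quantiles rather than on the ranges the labels describe.
demographics["income_group"] = pd.qcut(
    demographics["household_income"], q=3, labels=group_names
)
print(demographics)
```

With this sample, an income of 150,000 lands in the 200K-500K group, because qcut only cares about putting a similar number of rows in each group.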
9. Collapsing data into categories
We can do this with the cut function instead, which lets us define category cutoff ranges with the bins argument. It takes in a list of cutoff points for each category, with the final one being infinity, represented with np-dot-inf. From the output, we can see this is much more accurate.
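Here is a sketch of the cut approach with explicit cutoffs. As before, the column name, incomes, cutoff points, and group names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; the real data is not shown in this lesson.
demographics = pd.DataFrame({
    "household_income": [20_000, 90_000, 150_000, 180_000, 250_000, 600_000]
})

# Assumed cutoff points; np.inf leaves the last range open-ended.
ranges = [0, 200_000, 500_000, np.inf]
group_names = ["0-200K", "200K-500K", "500K+"]

# By default each bin includes its right edge and excludes its left edge,
# so an income of exactly 0 would need include_lowest=True to be binned.
demographics["income_group"] = pd.cut(
    demographics["household_income"], bins=ranges, labels=group_names
)
print(demographics)
```

Now every label matches the income it describes.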
10. Collapsing data into categories
Sometimes, we may want to reduce the number of categories we have in our data. Let's move on to mapping categories to fewer ones.
For example, assume we have a column containing the operating system of different devices, and that it contains these unique values.
Say we want to collapse these categories into 2: DesktopOS and MobileOS. We can do this using the replace method. It takes in a dictionary that maps each existing category to the category name you desire. In this case, this is the mapping dictionary. A quick print of the unique values of operating system shows the mapping has been completed.
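A minimal sketch of this mapping, assuming a DataFrame named devices with an operating_system column. The DataFrame name, the column name, and the listed operating systems are all hypothetical.

```python
import pandas as pd

# Hypothetical device data; names and values are illustrative only.
devices = pd.DataFrame({
    "operating_system": ["Microsoft", "MacOS", "IOS", "Android", "Linux", "IOS"]
})

# Map each existing category to the broader category we want.
mapping = {
    "Microsoft": "DesktopOS",
    "MacOS": "DesktopOS",
    "Linux": "DesktopOS",
    "IOS": "MobileOS",
    "Android": "MobileOS",
}
devices["operating_system"] = devices["operating_system"].replace(mapping)

# Only the two collapsed categories should remain.
unique_values = devices["operating_system"].unique()
print(unique_values)
```

Any value missing from the dictionary would be left unchanged, so it is worth checking the unique values afterwards.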
11. Let's practice!
Now that we know about treating categorical data, let's practice!