Cleaning and accessing data
1. Cleaning and accessing data
For this next lesson, we will focus on cleaning categorical columns and accessing other data by filtering categorical data.
2. Possible issues with categorical data
Data is messy, especially when you are working with strings or categories. Let's focus on a few of the main issues that may arise. First, categories may be inconsistent, and although you may recognize similar values as the same category, Python does not. Capitalization and whitespace are common culprits here, and these issues may occur when appending different data sources or columns. Second, spelling issues can cause big problems. This occurs frequently in surveys or online forms when the field is left to the user to fill out. And finally, if we do make corrections, we need to make sure our column dtype remains category and is not changed to object.
3. Identifying issues
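The slide's code is not part of this transcript. As a minimal sketch of the two inspection tools discussed in this part of the lesson, assuming a small made-up `dogs` DataFrame with a hypothetical `get_along_cats` column (the real dataset's column name and values may differ):

```python
import pandas as pd

# Hypothetical stand-in for the adoptable dogs data; the column name and
# values are invented for illustration.
dogs = pd.DataFrame(
    {"get_along_cats": ["yes", "no", "No", " no", "Noo", "yes", "NO", "yes"]}
)
dogs["get_along_cats"] = dogs["get_along_cats"].astype("category")

# .cat.categories lists the distinct categories
categories = dogs["get_along_cats"].cat.categories
print(categories)

# value_counts() shows how many rows fall into each category
counts = dogs["get_along_cats"].value_counts()
print(counts)
```

In this toy data, the variants of "no" show up as separate categories, which is the kind of inconsistency to look for.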
The easiest way to identify issues in our categorical columns is to use either the cat dot categories attribute to view the categories, or the value counts method to see the counts of each category. Let's use the value counts method on the gets-along-with-cats column of the adoptable dogs dataset. We notice three issues: varying capitalization, leading whitespace, and misspellings.
4. Fixing issues: whitespace
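A rough sketch of the whitespace fix covered in this part of the lesson, again assuming a made-up `dogs` DataFrame and a hypothetical `get_along_cats` column:

```python
import pandas as pd

# Invented example values; two of them carry stray whitespace.
dogs = pd.DataFrame({"get_along_cats": [" no", "no", "yes", "yes "]})
dogs["get_along_cats"] = dogs["get_along_cats"].astype("category")

# .str.strip() removes leading and trailing whitespace from every value
stripped = dogs["get_along_cats"].str.strip()
print(stripped.value_counts())
```

Note that `stripped` is a new Series; assign it back to the column to keep the change.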
Fortunately, we have the same resources available for fixing categorical values that we do for fixing strings. We can remove whitespace by using the str accessor on the column and then calling the strip method, which removes leading and trailing whitespace. Check out the one category at the bottom that no longer has a leading whitespace.
5. Fixing issues: capitalization
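For the capitalization fix, a small sketch under the same assumptions (made-up values, hypothetical `get_along_cats` column) might look like this:

```python
import pandas as pd

# Invented example values with inconsistent capitalization.
dogs = pd.DataFrame({"get_along_cats": ["no", "No", "NO", "yes"]})
dogs["get_along_cats"] = dogs["get_along_cats"].astype("category")

# .str.title() capitalizes the first letter of each word;
# .str.upper() and .str.lower() are used the same way
titled = dogs["get_along_cats"].str.title()
print(titled.value_counts())
```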
We can fix capitalization issues with str again, but this time using either the title, upper, or lower methods, depending on what type of result we want. Here we have used title, and all of the responses have been switched to title case.
6. Fixing issues: misspelled words
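The mapping approach for typos can be sketched as below. The values and the `replace_map` name are invented for illustration, and the sketch works on plain strings because some pandas versions warn when `replace` is called directly on a category column:

```python
import pandas as pd

# Invented example values containing one misspelling ("Noo").
get_along_cats = pd.Series(["Yes", "No", "Noo", "Yes", "No"])

# Map each misspelled value to its corrected form
replace_map = {"Noo": "No"}
fixed = get_along_cats.replace(replace_map)
print(fixed.value_counts())
```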
Finally, to fix a typo, we can use the same mapping methods we learned in the renaming categories lesson. First we make a mapping, and then we use the replace method to replace the values. This leaves us with only two final categories.
7. Checking the data type
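A short sketch of checking and restoring the dtype, using invented values. The exact non-category dtype that a str method returns can depend on the pandas version, so the sketch only checks whether the result is categorical:

```python
import pandas as pd

# Invented example values stored as a category column.
get_along_cats = pd.Series(["yes", " no", "no"], dtype="category")

# A str method returns a Series that is no longer categorical
cleaned = get_along_cats.str.strip()
print(cleaned.dtype)

# astype("category") converts the cleaned Series back
restored = cleaned.astype("category")
print(restored.dtype)
```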
If you use one of the str methods, the result is no longer a category column; it typically comes back with the object dtype. Remember, to check the dtype we use the dtype attribute. As always, use the astype method to convert the Series back to categorical.
8. Using the str accessor object
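Filtering with the str accessor might look like the following sketch. The `name` and `breed` columns and their values are assumptions made for the example, not taken from the real dataset:

```python
import pandas as pd

# Invented example rows; column names are hypothetical.
dogs = pd.DataFrame(
    {
        "name": ["Rex", "Bella", "Luna", "Max"],
        "breed": ["German Shepherd", "Beagle", "Shepherd Mix", "Poodle"],
    }
)
dogs["breed"] = dogs["breed"].astype("category")

# regex=False treats "Shepherd" as a literal substring, not a pattern
is_shepherd = dogs["breed"].str.contains("Shepherd", regex=False)
shepherds = dogs[is_shepherd]
print(shepherds)
```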
We have already seen that we can update categories using the str accessor object, but we can also use str to filter data. One way to do this is to look for categories that contain a specific string, such as Shepherd. We can use this filter to see all of the dogs that have some sort of Shepherd in their breed name. We are setting the regex parameter to False in this example so that we use plain string matching and not a regular expression.
9. Accessing data with loc
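As a hedged illustration of loc combined with value counts, the sketch below uses invented rows and hypothetical `get_along_cats` and `size` columns, with `size` given an explicit small, medium, large order:

```python
import pandas as pd

# Invented example rows; column names and values are hypothetical.
dogs = pd.DataFrame(
    {
        "get_along_cats": ["yes", "no", "yes", "yes", "no"],
        "size": ["large", "small", "medium", "large", "small"],
    }
)
dogs["get_along_cats"] = dogs["get_along_cats"].astype("category")
dogs["size"] = pd.Categorical(
    dogs["size"], categories=["small", "medium", "large"], ordered=True
)

# Rows where the dog gets along with cats, keeping only the size column
sizes = dogs.loc[dogs["get_along_cats"] == "yes", "size"]

# Default: sorted by count, largest first
by_count = sizes.value_counts()
print(by_count)

# sort=False: rows follow the category order instead
by_category = sizes.value_counts(sort=False)
print(by_category)
```

Categories with no matching rows still appear in the output with a count of zero.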
One of the great things about using columns that are categories is that loc and iloc work like normal. We can access the size of the dogs that get along with cats by using loc, specifying that dogs get along with cats, and selecting the size column. Let's add the value counts method at the end of accessing this data. Note that the value counts method sorts by count by default, so it does not automatically use the categorical order when printing results. We can set the sort parameter to False so that the output follows the order of the categories.
10. Clean and access practice
Let's work through a few examples of the concepts covered in this lesson.