1. Data type constraints
Hi and welcome! My name is Adel, and I'll be your host as we learn how to clean data in Python.
2. Course outline
In this course, we're going to understand how to diagnose different problems in our data and how they can come up during our workflow.
3. Course outline
We will also understand the side effects of not treating our data correctly.
4. Course outline
Finally, we will look at various ways to address different types of dirty data.
5. Course outline
In this chapter, we're going to discuss the most common data problems you may encounter and how to address them. So let's get started!
6. Why do we need to clean data?
To understand why we need to clean data, let's remind ourselves of the data science workflow.
In a typical data science workflow, we usually access our raw data, explore and process it, develop insights using visualizations or predictive models, and finally report these insights with dashboards or reports.
7. Why do we need to clean data?
Dirty data can appear because of duplicate values, misspellings, data type parsing errors, and legacy systems.
8. Why do we need to clean data?
Without making sure that data is properly cleaned in the exploration and processing phase, we will surely compromise the insights and reports subsequently generated.
As the old adage says: garbage in, garbage out.
9. Data type constraints
When working with data, there are various types that we may encounter along the way. We could be working with text data, integers, decimals, dates, zip codes, and others.
Luckily, Python has dedicated objects for the data types that you're probably familiar with by now, which makes them much easier to manipulate.
As such, before preparing to analyze and extract insights from our data, we need to make sure our variables have the correct data types, otherwise we risk compromising our analysis.
10. Strings to integers
Let's take a look at the following example. Here's the head of a DataFrame containing revenue generated and quantity of items sold for a sales order. We want to calculate the total revenue generated by all sales orders. As you can see, the Revenue column has the dollar sign on the right-hand side.
A close inspection of the DataFrame's column data types using the dot-dtypes attribute returns object for the Revenue column, which is what pandas uses to store strings.
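As a minimal sketch of this inspection, the snippet below builds a small stand-in DataFrame. The name sales, the column names, and every value are assumptions made up for illustration, not the course dataset. Depending on the pandas version, the Revenue column may be reported as object or as a dedicated string dtype; either way it is not numeric.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; all values are made up.
sales = pd.DataFrame({
    "SalesOrderID": [43659, 43660, 43661],
    "Revenue": ["23153$", "1457$", "36865$"],
    "Quantity": [12, 2, 15],
})

# Look at the first rows, then at the data type of each column.
print(sales.head())
print(sales.dtypes)

# The trailing dollar sign keeps pandas from reading Revenue as a number.
revenue_is_numeric = pd.api.types.is_numeric_dtype(sales["Revenue"])
print(revenue_is_numeric)
```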
11. String to integers
We can also check the data types, as well as the number of non-null values per column in a DataFrame, by using the dot-info() method. Comparing that count with the total number of rows tells us how many values are missing.
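Here is a rough sketch of that check on made-up data. The DataFrame name, the column names, and the single missing Quantity value are all assumptions for illustration. The dot-isna-dot-sum() call is an extra step, not something shown in the course, that counts missing values directly.

```python
import io

import pandas as pd

# Hypothetical data with one missing Quantity value, for illustration only.
sales = pd.DataFrame({
    "Revenue": ["23153$", "1457$", "36865$"],
    "Quantity": [12, None, 15],
})

# info() writes a summary with each column's dtype and non-null count.
buffer = io.StringIO()
sales.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Counting missing values per column directly.
missing_per_column = sales.isna().sum()
print(missing_per_column)
```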
12. String to integers
Since the Revenue column is a string, summing across all sales orders returns one large concatenated string containing each row's string.
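To see the concatenation for yourself, here is a small sketch on made-up values. The Revenue column is explicitly stored with the object dtype, which is an assumption made so that the behavior matches what is described here regardless of pandas version.

```python
import pandas as pd

# Hypothetical values; Revenue is stored as object (Python strings).
sales = pd.DataFrame({
    "Revenue": pd.Series(["23153$", "1457$", "36865$"], dtype="object"),
    "Quantity": [12, 2, 15],
})

# Summing strings glues them together instead of adding numbers.
total = sales["Revenue"].sum()
print(total)
print(type(total))
```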
To fix this, we need to first remove the $ sign from the string so that pandas is able to convert the strings into numbers without error.
We do this with the dot-str-dot-strip() method, while specifying the string we want to strip as an argument, which is in this case the dollar sign.
Since our dollar values do not contain decimals, we then convert the Revenue column to an integer by using the dot-astype() method, specifying the desired data type as argument.
Had our revenue values been decimal, we would have converted the Revenue column to float.
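The cleaning steps just described can be sketched as follows. The data is made up, and the decimal_revenue series is a hypothetical variant added only to illustrate the float case.

```python
import pandas as pd

# Hypothetical whole-dollar revenue values stored as strings.
sales = pd.DataFrame({"Revenue": ["23153$", "1457$", "36865$"]})

# Remove the dollar sign, then convert the cleaned strings to integers.
sales["Revenue"] = sales["Revenue"].str.strip("$")
sales["Revenue"] = sales["Revenue"].astype("int64")

# The sum is now arithmetic rather than string concatenation.
total_revenue = sales["Revenue"].sum()
print(total_revenue)

# Hypothetical variant with decimal values: convert to float instead.
decimal_revenue = pd.Series(["23153.50$", "1457.25$"])
decimal_revenue = decimal_revenue.str.strip("$").astype("float64")
print(decimal_revenue.sum())
```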
We can make sure that the Revenue column is now an integer by using the assert statement, which takes a condition as input and returns nothing if that condition is met, and raises an error if it is not.
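One way to write that check is sketched below on made-up values. The explicit int64 target is an assumption chosen so the comparison behaves the same on every platform.

```python
import pandas as pd

# Hypothetical revenue strings, cleaned and converted in one chain.
revenue = pd.Series(["23153$", "1457$"]).str.strip("$").astype("int64")

# Passes silently when the conversion worked; raises AssertionError otherwise.
assert revenue.dtype == "int64"
print(revenue.dtype)
```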
13. The assert statement
For example, here we are testing the equality that 1+1 equals 2. Since it is the case, the assert statement returns nothing.
However, when testing the equality 1+1 equals 3, we receive an AssertionError.
You can test almost anything you can imagine by using assert, and we'll see more ways to utilize it as we go along the course.
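Both cases can be tried in a few lines. The try/except wrapper and the caught variable are additions for illustration, so that the failing assertion does not stop the script.

```python
# The condition is true, so nothing happens.
assert 1 + 1 == 2

# The condition is false, so Python raises AssertionError.
try:
    assert 1 + 1 == 3
except AssertionError:
    caught = True
else:
    caught = False

print(caught)
```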
14. Numeric or categorical?
A common type of data looks numeric but actually represents a finite set of categories. This is called categorical data. We will look more closely at categorical data in Chapter 2, but let's take a look at this example.
Here we have a marriage status column, which is represented by 0 for never married, 1 for married, 2 for separated, and 3 for divorced.
However, it will be imported as an integer type, which could lead to misleading results when trying to extract some statistical summaries.
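A small sketch shows why this is misleading. The DataFrame name df, the column name marriage_status, and the values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical codes: 0 never married, 1 married, 2 separated, 3 divorced.
df = pd.DataFrame({"marriage_status": [0, 1, 1, 2, 3, 1, 0]})

# On an integer column, describe() reports mean, std and quartiles,
# which carry no real meaning for category codes.
summary = df["marriage_status"].describe()
print(summary)
```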
15. Numeric or categorical?
We can solve this by using the same dot-astype() method seen earlier, but this time specifying the category data type.
When applying dot-describe() again, we see that the summary statistics are much more aligned with those of a categorical variable, showing the number of observations, the number of unique values, and the most frequent category instead of the mean and standard deviation.
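The conversion can be sketched like this, again on made-up data with the hypothetical names df and marriage_status.

```python
import pandas as pd

# Hypothetical codes: 0 never married, 1 married, 2 separated, 3 divorced.
df = pd.DataFrame({"marriage_status": [0, 1, 1, 2, 3, 1, 0]})

# Convert the integer codes to the category data type.
df["marriage_status"] = df["marriage_status"].astype("category")

# describe() now reports count, unique, top and freq.
summary = df["marriage_status"].describe()
print(summary)
```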
16. Let's practice!
Now that we have a solid understanding of data type constraints, let's get to practice!