1. Introduction to preprocessing
Welcome to the course! I'm James, and together we'll be learning the skills and best practices for preparing data for modeling. Let's jump right in!
2. What is data preprocessing?
Data preprocessing comes after we've explored and cleaned our dataset, so we understand its contents, structure, and quality.
Once we've explored our data, we'll probably have a good idea about how we'd like to model it. Having this idea early on will also help us decide how best to preprocess the data so it's ready for modeling. Think of preprocessing as a prerequisite for modeling.
Recall that most machine learning models in Python expect numerical features, so if our dataset contains categorical features, we'll need to transform them. This is a very common preprocessing step.
3. Why preprocess?
The goal of preprocessing is not only to transform our dataset into a form that is suitable for modeling, but also to improve the performance of our models and, in turn, produce more reliable results.
4. Recap: exploring data with pandas
The files we'll be working with in this course should be in familiar formats, and we can use common pandas functions for importing them, such as read_json and read_csv.
One of the first steps after importing data is to inspect it, which we can do with the dot-head method.
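As a rough sketch of that import-and-inspect step, here is what it might look like. The CSV text and the column names A, B, and C are made up for illustration, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "A,B,C\n1.0,,1.0\n4.0,7.0,3.0\n7.0,,\n,7.0,\n5.0,9.0,7.0\n"

# read_csv accepts a file path or a file-like object; empty fields become NaN.
df = pd.read_csv(io.StringIO(csv_text))

# head returns the first rows (five by default); here we ask for three.
first_rows = df.head(3)
print(first_rows)
```

With a real dataset, the buffer would simply be replaced by the path to the file.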
5. Recap: exploring data with pandas
It's also useful to know what features are present in the dataset and what their data types are. We can quickly find this information using the dot-info method, which provides other useful information including the number of rows and columns, and also the number of non-missing values in each column.
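Here is a minimal sketch of that call, assuming a small made-up DataFrame with one text column, one float column containing a missing value, and one integer column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names and values are for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "hours": [1.5, np.nan, 3.0, 4.5],
    "visits": [1, 2, 3, 4],
})

# info prints the row count, each column's dtype, and its non-null count.
df.info()

# The same non-missing counts can be computed directly if we need them as values.
non_missing = df.notna().sum()
print(non_missing)
```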
6. Recap: exploring data with pandas
Finally, we can quickly generate some summary statistics about a DataFrame's features, such as the mean, standard deviation, and quartiles using the dot-describe method.
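A short sketch of describe on a made-up numeric column might look like this. The column name A and its values are assumptions for the example.

```python
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

# describe returns a DataFrame of summary statistics for the numeric columns:
# count, mean, std, min, the 25%/50%/75% quartiles, and max.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label.
mean_a = summary.loc["mean", "A"]
```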
7. Removing missing data
One of the first steps we can take to preprocess our data is to remove missing data. There are many ways to deal with missing data, but here we're only going to cover ways to remove either columns or rows containing missing data.
The dropna method can be used to drop all rows containing missing values. This could be a good option if only a small number of rows contain missing data.
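To make that concrete, here is a small sketch using a made-up DataFrame with columns A, B, and C. The values are assumptions chosen so that only two rows are fully populated.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values scattered across rows.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [1.0, 3.0, np.nan, np.nan, 7.0],
})

# dropna with no arguments removes every row that has at least one missing value.
# It returns a new DataFrame; df itself is left unchanged.
complete_rows = df.dropna()
print(complete_rows)
```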
8. Removing missing data
We can drop specific rows by passing index labels to the drop method, which drops rows by default.
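As a sketch, assuming the same kind of made-up DataFrame with a default integer index, dropping rows by label could look like this.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the index labels are the default 0 through 4.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [1.0, 3.0, np.nan, np.nan, 7.0],
})

# Passing a list of index labels drops those rows (axis=0 is the default).
remaining = df.drop([1, 2, 3])
print(remaining)
```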
9. Removing missing data
Usually we'll want to focus on dropping a particular column, especially if all or most of its values are missing. We can use the drop method here as well, though the arguments are different. The first argument is the column name to drop, in this case, A. We have to specify axis=1 to designate that we want to drop a column rather than a row.
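Here is a minimal sketch of dropping column A, again on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [1.0, 3.0, np.nan, np.nan, 7.0],
})

# axis=1 tells drop that "A" is a column label rather than a row label.
without_a = df.drop("A", axis=1)
print(without_a)
```

Writing df.drop(columns="A") is an equivalent spelling that some people find easier to read.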
10. Removing missing data
What if we want to drop rows where data is missing in a particular column?
First, let's take a look at how many missing values we have in each column, using isna to identify NaN values, and then using sum to count them in each column.
To filter out rows with missing values in particular columns, such as column B, we can pass a list of column labels to the subset argument of dropna.
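Both steps can be sketched together. The DataFrame below is made up, with two missing values placed in column B.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [1.0, 3.0, np.nan, np.nan, 7.0],
})

# isna marks each missing cell as True; sum then counts the Trues per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Only rows where B is missing are dropped; missing values in A or C are kept.
b_present = df.dropna(subset=["B"])
print(b_present)
```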
11. Removing missing data
Finally, we can use the thresh argument of dropna to set the minimum number of non-missing values a row must have in order to be kept.
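A small sketch of thresh on made-up data, where the threshold of 2 is an arbitrary choice for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 2 and 3 each have only one non-missing value.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [1.0, 3.0, np.nan, np.nan, 7.0],
})

# Keep only rows that have at least 2 non-missing values.
at_least_two = df.dropna(thresh=2)
print(at_least_two)
```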
12. Let's practice!
OK, now it's your turn to tackle missing data!