
Data preprocessing

1. Data Preparation

Now that you've done some exploratory analysis and have a better understanding of your dataset, it's time to preprocess it in preparation for modeling.

2. Model assumptions

Recall from the previous lesson that many machine learning models make certain assumptions about how the data is distributed. If the features in your dataset do not meet these assumptions, then the results of your models won't be reliable. That's why the data preprocessing stage is so critical.

3. Data types

Many machine learning models only accept numerical data types. So if any of your features are categorical, they will need to first be encoded numerically.

4. Data types (Part 2)

You can look at the data types in the telco DataFrame using its dtypes attribute. Columns such as international calls and evening charges have numerical data types, int64 and float64, while any columns that include text, such as State, are encoded as "object".
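As a rough sketch, checking the column types might look like the following; the file name "telco.csv" is an assumption about where the data lives, not something given in the lesson.

```python
import pandas as pd

# Load the telco dataset (file name is assumed)
telco = pd.read_csv("telco.csv")

# int64/float64 columns are numerical; text columns such as "State" show up as "object"
print(telco.dtypes)
```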

5. Encoding binary features

Some features that have the object data type, such as International Plan, have two possible values: "no" and "yes". To represent these numerically,

6. Encoding binary features

you can encode "no" as 0 and "yes" as 1, using either the replace method or scikit-learn's LabelEncoder, as shown here.
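A minimal sketch of both approaches is below; the column name "Intl_Plan" is an assumption, so adjust it to match your DataFrame, and use one approach or the other.

```python
from sklearn.preprocessing import LabelEncoder

# Option 1: map the strings to integers with pandas' replace
telco["Intl_Plan"] = telco["Intl_Plan"].replace({"no": 0, "yes": 1})

# Option 2: let LabelEncoder assign the codes
# (labels are sorted alphabetically, so "no" -> 0 and "yes" -> 1)
le = LabelEncoder()
telco["Intl_Plan"] = le.fit_transform(telco["Intl_Plan"])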

7. Encoding state

The "State" feature is a bit more complex to represent numerically, because there are so many states. We could assign a number to each state - 0 for Kansas, 1 for Ohio, 2 for New Jersey, and so on. But assigning arbitrary numbers like this is dangerous, as it implies some form of ordering in the states. This would make sense for a feature that had categories like "low", "medium", or "high", but in this case, it doesn't make sense to order states, and doing so would make your model less effective.

8. One hot encoding

Instead, you can encode states by using what is known as one hot encoding.

9. One hot encoding

This creates new binary features corresponding to which state a given customer is from.

10. One hot encoding

Each row in the DataFrame will have a "1" in exactly one State column, and zeroes in all of the other State columns. In doing so, your model can use the information about which state a customer is from, without mistakenly thinking there is some form of ordering in the State feature.
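A minimal sketch using pandas' get_dummies, assuming the state codes live in a column named "State":

```python
import pandas as pd

# One hot encode the State column: one new binary column per state,
# and the original "State" column is dropped
telco = pd.get_dummies(telco, columns=["State"], dtype=int)

# Each row has a 1 in exactly one State_* column and 0 in all the others
print(telco.filter(like="State_").head())
```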

11. Feature scaling

Another important preprocessing step is feature scaling. Many models require features to be on the same scale, but this is rarely the case with real-world data.

12. Feature scaling

In our telco DataFrame, for example, the International Calls feature ranges from 0 to 20, while the Night Minutes feature ranges from 23 to 395. So we need to rescale our data and ensure all our features are on the same scale.
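You can see this mismatch by comparing the summary statistics of the two columns; the column names "Intl_Calls" and "Night_Mins" here are assumptions about how the features are labeled.

```python
# Compare the ranges of two features to see the scale mismatch
print(telco[["Intl_Calls", "Night_Mins"]].describe().loc[["min", "max"]])
```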

13. Standardization

We'll do this using a process known as standardization, which subtracts the mean from each value and divides by the standard deviation, so that each point is expressed as the number of standard deviations it lies from the mean. To standardize your data, you can use StandardScaler from sklearn dot preprocessing, as shown in these lines of code, where you first instantiate it, and then fit it to your data.
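A minimal sketch of that workflow, assuming every remaining column in telco is numeric after the encoding steps above:

```python
from sklearn.preprocessing import StandardScaler

# Instantiate the scaler, then fit it to the data and transform in one step:
# each value becomes its distance from the column mean in standard deviations
scaler = StandardScaler()
telco_scaled = scaler.fit_transform(telco)
```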

14. Let's practice!

In the exercises, you'll have the chance to encode categorical features numerically and scale your data. Let's practice!