Preparation for modeling
1. Preparation for modeling
You've done a great job! Now we will explore data preparation techniques for supervised learning models.
2. Data sample
The first thing you want to do is explore a sample of the data. The head method, called on a pandas DataFrame, prints the first five rows. Here we will use the telecom dataset.
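A minimal sketch of this step, assuming the telecom data lives in a CSV file named telco.csv (the file name is illustrative; telco_raw is the dataframe name used later in this lesson):

import pandas as pd

# Load the telecom dataset (file name is an assumption for illustration)
telco_raw = pd.read_csv('telco.csv')

# Print the first five rows to get a feel for the data
print(telco_raw.head())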
3. Data types
Next, it's always good practice to review the data types. As you can see, most of the columns are strings, marked as object. This means they contain text data that we will have to transform so our model can use it.
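A quick way to check this, assuming the telco_raw dataframe from the sketch above:

# Inspect the data type of each column; object usually means text
print(telco_raw.dtypes)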
4. Separate categorical and numerical columns
Before we start our data preparation steps, we need to separate the identifier, such as the customer ID, and the target variable, which in this case is the churn flag. We store them as separate lists that we will use later. Then, we separate the categorical column names using a rule that defines a column as categorical if it has fewer than 10 unique values. This number is arbitrary, and it is good practice to explore your data to see if there are variables with more unique values. We can explore them by running the telco_raw.nunique() command on the dataframe and examining the output. We will analyze this in the exercises for this lesson. The next step is to remove the target variable, called Churn, from this list so we don't apply any transformations to it. Finally, we store the remaining column names in a list called numerical. We use a list comprehension here, which is like a loop that fits in one line of code. It extracts all columns from telco_raw, excluding the ones already stored in the lists we defined earlier for the customer ID, the target, and the categorical variables.
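A sketch of these steps, assuming the identifier column is called customerID and the target is Churn (the exact names may differ in your data), and continuing with the telco_raw dataframe:

# Store the identifier and target column names (names are assumptions)
custid = ['customerID']
target = ['Churn']

# Columns with fewer than 10 unique values are treated as categorical
categorical = telco_raw.nunique()[telco_raw.nunique() < 10].keys().tolist()

# Remove the target from the categorical list so it is not transformed
categorical.remove(target[0])

# Everything that is not an ID, target or categorical column is numerical
numerical = [col for col in telco_raw.columns
             if col not in custid + target + categorical]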
5. One-hot encoding
Now we will convert these variables into binary columns of ones and zeros. This is called one-hot encoding. It transforms a categorical variable with string values, like product names, by building as many columns as there are unique values in that variable. This way, the machine learning model gets an integer value instead of text. Here's an example column with a color value for each row. When we transform it with one-hot encoding, we get a different table.
6. One-hot encoding result
Here's the result after the transformation. As you can see, we have three columns instead of one, and each column contains ones and zeros instead of text. There's a simple command in Python that makes this an easy task.
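To make the example concrete, here is a hypothetical color column and its one-hot encoded form (the column name and values are purely illustrative):

import pandas as pd

# A small illustrative categorical column
colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One-hot encode it: one binary indicator column per unique value
# (recent pandas versions show the indicators as booleans rather than 0/1)
print(pd.get_dummies(colors))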
7. One-hot encoding categorical variables
One-hot encoding is done using the get_dummies function from the pandas library. We use the drop_first=True argument. This ensures we remove the first encoded column, as it is redundant and can be inferred from the others, and most machine learning algorithms prefer it that way. For example, if someone asked you whether you own a car, there are two answers: yes and no. Without dropping the first category, we would end up with two columns, yes and no, with perfect negative correlation; that is, whenever one column is 1, the other is 0, because there is only one value per observation in the original column. We will explore this later in more depth.
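Applied to the lists built above, a sketch might look like this:

# One-hot encode the categorical columns, dropping the redundant first level
telco_raw = pd.get_dummies(telco_raw, columns=categorical, drop_first=True)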
8. Scaling numerical features
Another important thing to do before modeling is to scale the numerical features. A manual way to do this would be to subtract the average value of each column and divide by its standard deviation. This gives us columns with a mean of 0 and a standard deviation of 1. Some machine learning models require this pre-processing step, because otherwise the variables with larger mean and standard deviation values would have more influence as predictors. In Python, we do this by importing the StandardScaler class from scikit-learn, initializing an instance of it, and running the fit_transform method on the numerical columns of the dataset. The result is a NumPy array. We can create a scaled pandas dataframe from this NumPy array by calling the DataFrame function and feeding it the NumPy array and the column names from the original numerical columns.
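A sketch of this step, continuing with the numerical list from above:

from sklearn.preprocessing import StandardScaler

# Fit the scaler and transform the numerical columns; the result is a NumPy array
scaler = StandardScaler()
scaled_array = scaler.fit_transform(telco_raw[numerical])

# Rebuild a dataframe with the original numerical column names
scaled_numerical = pd.DataFrame(scaled_array, columns=numerical)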
9. Bringing it all together
Finally, we combine everything. First, we drop the non-scaled numerical columns from the telco_raw dataset. Then, we merge it with the scaled numerical data. This is done with the pandas merge function. We tell the function to perform a left merge, which is equivalent to a left join in SQL. This means that all rows are kept in the dataset on the left, which is telco_raw. Then we look for matching rows in the dataset on the right, which is scaled_numerical, and if there are missing matches, they are recorded as null values.
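One way to write this final step, assuming the row indices of the two dataframes still line up so we can merge on the index:

# Drop the unscaled numerical columns
telco_raw = telco_raw.drop(columns=numerical)

# Left merge on the index: keep all rows of telco_raw, bring in the scaled values
telco = telco_raw.merge(scaled_numerical, how='left',
                        left_index=True, right_index=True)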
10. Let's practice pre-processing data!
This is it! You are now ready to practice these pre-processing steps before we move on to building a machine learning model on this data!