Numeric predictors
1. Numeric predictors
In this section, we will focus on common preprocessing techniques for numeric predictor variables.
2. Correlated predictor variables
Correlation measures the linear relationship between two numeric variables. Highly correlated predictor variables have correlation values near -1 or 1 and provide redundant information. This phenomenon, known as multicollinearity, can cause instability in machine learning optimization algorithms and can lead to model fitting errors. The total_clicks and pages_per_visit columns in the leads_training data are an example of highly correlated predictors. We see that customers with large total click values tend to have large average page views per visit. Knowing the value of one variable gives us the likely value of the other, so we do not need both.
3. Finding correlated predictor variables
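As a rough sketch of the correlation matrix this section walks through: the data frame below is a small made-up stand-in for leads_training (its values and the purchased outcome column are assumptions, not the course data), so the correlations it prints will not match the ones quoted in the narration.

```r
library(dplyr)

# Made-up stand-in for leads_training; values are illustrative only
leads_training <- tibble(
  purchased       = factor(c("yes", "no", "no", "yes", "no", "yes", "no", "yes")),
  total_time      = c(300, 1200, 150, 900, 450, 1273, 200, 800),
  pages_per_visit = c(1.2, 2.4, 4.1, 5.3, 7.2, 8.1, 10.3, 11.2),
  total_clicks    = c(10, 25, 40, 55, 70, 85, 100, 115)
)

# Keep only the numeric columns, then compute all pairwise correlations
leads_cor <- leads_training %>%
  select_if(is.numeric) %>%
  cor()

leads_cor
```

The factor column is dropped by select_if(is.numeric), so the result is a 3 x 3 matrix with ones on the diagonal.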
We can discover correlated variables by creating a correlation matrix, which lists all pairwise correlations in a numeric dataset. To create one using the leads_training data, we pass it to the select_if() function, where we provide the is.numeric function as an argument. This selects only the numeric columns. Then we pass the results to the cor() function. As you can see, the pages_per_visit and total_clicks variables have a correlation of 0.96.
4. Processing correlated predictors
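A minimal sketch of the recipe this section describes, assuming a made-up stand-in for leads_training with a factor outcome named purchased (the outcome name and all values are assumptions, not taken from the course data):

```r
library(recipes)

# Made-up stand-in for leads_training; values are illustrative only
leads_training <- data.frame(
  purchased       = factor(c("yes", "no", "no", "yes", "no", "yes", "no", "yes")),
  total_time      = c(300, 1200, 150, 900, 450, 1273, 200, 800),
  pages_per_visit = c(1.2, 2.4, 4.1, 5.3, 7.2, 8.1, 10.3, 11.2),
  total_clicks    = c(10, 25, 40, 55, 70, 85, 100, 115)
)

# Specify the recipe, then add a correlation filter that names
# each numeric column explicitly and uses a 0.9 threshold
leads_cor_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(total_time, pages_per_visit, total_clicks, threshold = 0.9)

leads_cor_rec
```

At this point the recipe is only a specification; nothing is removed until it is trained.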
To preprocess correlated predictor variables, we begin by specifying a recipe. For the lead scoring data, we add the same model formula and data argument. Then we pass our recipe object to the step_corr() function (spelled with two Rs) and provide the names of all numeric columns in the leads_training dataset, separated by commas. We also pass a correlation threshold of 0.9 to the threshold argument. The threshold is applied to absolute values, meaning that a threshold of 0.9 flags variables for removal when their correlation is 0.9 or more, or -0.9 or less.
5. Selecting predictors by type
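The selector-based version described in this section might look like the sketch below. The data frame is a made-up stand-in for leads_training, and the second recipe treats total_time as a numeric outcome purely to illustrate -all_outcomes(); neither comes from the course data.

```r
library(recipes)

# Made-up stand-in for leads_training; values are illustrative only
leads_training <- data.frame(
  purchased       = factor(c("yes", "no", "no", "yes", "no", "yes", "no", "yes")),
  total_time      = c(300, 1200, 150, 900, 450, 1273, 200, 800),
  pages_per_visit = c(1.2, 2.4, 4.1, 5.3, 7.2, 8.1, 10.3, 11.2),
  total_clicks    = c(10, 25, 40, 55, 70, 85, 100, 115)
)

# Factor outcome: all_numeric() picks up only the numeric predictors
leads_cor_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9)

# Numeric outcome (hypothetical): exclude it so it is never filtered out
numeric_outcome_rec <- recipe(total_time ~ pages_per_visit + total_clicks,
                              data = leads_training) %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9)
```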
Instead of typing the names of all numeric columns in step_corr(), we can use special selector functions. The all_outcomes() function selects the outcome variable, while the all_numeric() selector selects all numeric variables. This will include the outcome variable if it is numeric. An equivalent way of specifying our recipe would be to pass all_numeric() to step_corr(). If we had a numeric outcome variable, we would also pass -all_outcomes() to exclude it from preprocessing.
6. Training and applying the recipe
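Training and applying the recipe could be sketched as follows. Both data frames are made-up stand-ins for leads_training and leads_test, so which member of the correlated pair gets dropped here is not guaranteed to match the course output.

```r
library(recipes)

# Made-up stand-ins for the course data; values are illustrative only
leads_training <- data.frame(
  purchased       = factor(c("yes", "no", "no", "yes", "no", "yes", "no", "yes")),
  total_time      = c(300, 1200, 150, 900, 450, 1273, 200, 800),
  pages_per_visit = c(1.2, 2.4, 4.1, 5.3, 7.2, 8.1, 10.3, 11.2),
  total_clicks    = c(10, 25, 40, 55, 70, 85, 100, 115)
)
leads_test <- data.frame(
  purchased       = factor(c("no", "yes", "no"), levels = c("no", "yes")),
  total_time      = c(600, 1000, 250),
  pages_per_visit = c(3.0, 6.5, 9.0),
  total_clicks    = c(30, 65, 90)
)

leads_cor_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9)

# Train: decide which columns to drop using the training data only
leads_cor_prep <- prep(leads_cor_rec, training = leads_training)

# Apply: the same columns are dropped from any new dataset
leads_test_baked <- bake(leads_cor_prep, new_data = leads_test)
names(leads_test_baked)
```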
After training and applying our recipe to the test data, we see that pages_per_visit was removed due to its high correlation value in the leads_training data. Whenever we apply the trained recipe, pages_per_visit will be removed from future datasets as well.
7. Normalization
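The arithmetic behind normalization can be sketched in base R. The total_time values below are made up for illustration, so the standardized value for 1,273 seconds will differ from the one quoted for the course data.

```r
# Made-up total_time values in seconds (illustrative, not the course data)
total_time <- c(300, 1200, 150, 900, 450, 1273, 200, 800)

# Center by the mean, then scale by the standard deviation
total_time_norm <- (total_time - mean(total_time)) / sd(total_time)

# Each value is now expressed in standard deviations from the mean
data.frame(total_time, total_time_norm)

# The normalized column has mean 0 and standard deviation 1 (up to rounding)
mean(total_time_norm)
sd(total_time_norm)
```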
Another common task is centering and scaling numeric variables, known as normalization. For each numeric column, we subtract the mean and divide by the standard deviation. This transforms numeric variables to standard deviation units, with a mean of 0 and a standard deviation of 1. Normalized values are easy to interpret. From the normalized total_time value, we see that spending 1,273 seconds on the website is 1.19 standard deviations greater than the average time spent by customers.
8. Combining data preprocessing steps
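A sketch of a two-step recipe like the leads_norm_rec object discussed in this section, trained on the training data and then applied to test data. The data frames and the purchased outcome are made-up stand-ins, not the course data.

```r
library(recipes)

# Made-up stand-ins for the course data; values are illustrative only
leads_training <- data.frame(
  purchased       = factor(c("yes", "no", "no", "yes", "no", "yes", "no", "yes")),
  total_time      = c(300, 1200, 150, 900, 450, 1273, 200, 800),
  pages_per_visit = c(1.2, 2.4, 4.1, 5.3, 7.2, 8.1, 10.3, 11.2),
  total_clicks    = c(10, 25, 40, 55, 70, 85, 100, 115)
)
leads_test <- data.frame(
  purchased       = factor(c("no", "yes", "no"), levels = c("no", "yes")),
  total_time      = c(600, 1000, 250),
  pages_per_visit = c(3.0, 6.5, 9.0),
  total_clicks    = c(30, 65, 90)
)

# Steps run in the order they are added:
# first the correlation filter, then normalization
leads_norm_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  step_normalize(all_numeric())

# Means and standard deviations are estimated from the training data
leads_norm_prep <- prep(leads_norm_rec, training = leads_training)

# The training statistics are reused when transforming the test data
leads_test_norm <- bake(leads_norm_prep, new_data = leads_test)
leads_test_norm
```

Because the test data is scaled with the training means and standard deviations, its normalized columns will generally not have a mean of exactly 0.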
To normalize variables, we add the step_normalize() function to our preprocessing steps. The means and standard deviations from the training data columns will be used to transform both existing and new data sources. Recipes can have multiple preprocessing steps. We simply chain multiple step functions in our recipe, and they are carried out in the order we enter them. In the leads_norm_rec recipe object, correlated predictors will be removed first, followed by normalization.
9. Transforming the test data
When we train and apply the leads_norm_rec recipe to the leads_test data, we see that the pages_per_visit column is removed and all numeric predictors are normalized.
10. Let's practice!
Let's practice transforming numeric predictor variables!