
Why transform existing features?

1. Why transform existing features?

Transforming features before modeling, for example by taking the logarithm or normalizing the data, can improve the performance of a machine-learning model by making the data easier for the model to learn from.

2. Making your model's life easier

Feature transformation can make the data more suitable for the chosen model and increase its accuracy. Have a look at the loans_num dataset: it has 614 observations and five predictor variables, all of them numeric. We want to predict loan status, which is a binary factor.
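As a quick structural check, a minimal sketch like the following (assuming a loans_num data frame with a Loan_Status outcome column is already loaded) inspects the dimensions and column types:

```r
library(dplyr)

# Inspect the dataset: expect 614 rows, five numeric
# predictors, and the binary factor outcome Loan_Status
glimpse(loans_num)

# Confirm the outcome is a two-level factor
levels(loans_num$Loan_Status)
```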

3. Log transformation

Log-transforming a variable can help reduce skewness in the data, lessen the impact of outliers, and convert multiplicative relationships into additive ones, thereby making the data more suitable for modeling. In our loans_num dataset, some features are skewed to the right, like LoanAmount, shown in blue. Taking the log of LoanAmount turns its distribution into a more symmetric one, shown in green. Note, however, that the logarithm is defined only for positive values, so we usually use log(variable + 1) to avoid errors when the variable equals zero.
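As an illustration (a sketch, assuming loans_num contains a numeric LoanAmount column), we can compare the raw and log-transformed distributions:

```r
library(ggplot2)

# Raw LoanAmount: right-skewed
ggplot(loans_num, aes(x = LoanAmount)) +
  geom_histogram(fill = "blue", bins = 30)

# log(LoanAmount + 1): more symmetric; the + 1 offset
# keeps zero values from producing -Inf
ggplot(loans_num, aes(x = log(LoanAmount + 1))) +
  geom_histogram(fill = "green", bins = 30)
```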

4. Normalization

Normalizing or scaling numerical features helps prevent one feature from dominating the others due to differences in scale. In our case, the loan amount term values vary significantly, with the shortest term below 100 days and the longest close to 500.
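A quick way to see this disparity in scale (a sketch under the same loans_num assumption):

```r
# Compare the ranges of two numeric predictors;
# Loan_Amount_Term spans a much wider range than LoanAmount
summary(loans_num$Loan_Amount_Term)
range(loans_num$LoanAmount, na.rm = TRUE)
```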

5. Normalization

Normalizing this feature will center it at zero with a standard deviation of one. This process is also helpful in handling outliers and making the data more suitable for modeling.
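Conceptually, this is the classic z-score standardization: subtract the mean and divide by the standard deviation. A minimal base-R sketch (again assuming the loans_num data frame):

```r
# Standardize: subtract the mean, divide by the standard deviation
term_scaled <- scale(loans_num$Loan_Amount_Term)

mean(term_scaled, na.rm = TRUE)  # approximately 0
sd(term_scaled, na.rm = TRUE)    # approximately 1
```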

6. Defining the model and the recipe

We impute missing values using k-nearest neighbors in the first step, followed by normalizing Loan_Amount_Term and applying a log transformation to all numeric predictors except Loan_Amount_Term. We do this by using the selector function all_numeric_predictors as the first argument of step_log, followed by the variable Loan_Amount_Term preceded by a minus sign to exclude it. The parameter offset = 1 adds one to the argument of the logarithm, so that zero values do not cause an error. As usual, printing the recipe object generates a handy summary of all the steps applied.
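Put together, the recipe described above might look like this sketch. The step and column names follow the description; the model specification and the loans_train training split are assumptions, since the transcript does not show them explicitly:

```r
library(tidymodels)

# Model spec: logistic regression is an assumption here;
# the outcome is a binary factor
logistic_model <- logistic_reg()

loans_recipe <- recipe(Loan_Status ~ ., data = loans_train) %>%
  # Step 1: impute missing values with k-nearest neighbors
  step_impute_knn(all_predictors()) %>%
  # Step 2: center and scale Loan_Amount_Term
  step_normalize(Loan_Amount_Term) %>%
  # Step 3: log-transform all numeric predictors except
  # Loan_Amount_Term; offset = 1 computes log(x + 1)
  step_log(all_numeric_predictors(), -Loan_Amount_Term, offset = 1)

# Printing the recipe summarizes the steps
loans_recipe
```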

7. Measuring performance efficiently

Building models is an iterative process, so it is a good idea to automate the performance-measurement part. We can accomplish this with the metric_set function from the yardstick package, included in the tidymodels metapackage: we pass our desired metrics as arguments and save the result to an object (the complete list of available metric functions is documented on the package website). In our case, we will call it class_evaluate. Once this object is defined, we can use it like any other function. Its arguments are the truth, that is, the column of observed outcomes in our test dataset, and the predicted class, which must be explicitly passed as the estimate. Since we want the area under the ROC curve, we also provide the column with our predicted probabilities. All of these are available in the augmented fit object. The result is a customized set of metrics we can rerun as we create alternative models.
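A sketch of that workflow is below. The specific metric choices, the fitted object loans_fit, the test split loans_test, and the probability column name .pred_Y are assumptions; the truth/estimate argument names follow yardstick conventions:

```r
library(tidymodels)

# Bundle the desired metrics into a reusable function
class_evaluate <- metric_set(accuracy, sens, spec, roc_auc)

# Augment the test set with predicted classes (.pred_class)
# and class probabilities (e.g. .pred_Y)
loans_aug <- augment(loans_fit, new_data = loans_test)

# Evaluate: truth column, estimate column, and the
# probability column required by roc_auc
class_evaluate(loans_aug,
               truth = Loan_Status,
               estimate = .pred_class,
               .pred_Y)
```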

8. Let's practice!

Let us code!