
Common feature transformations

1. Common feature transformations

Some transformations come up quite frequently. Let's take a look at two of the most common ones.

2. Two families of transformations

Box-Cox is a widely used transformation, typically applied to turn non-normal variables into approximately normal ones, which is a great advantage for certain models. It is really a family of transformations that includes the inverse, logarithm, square root, and cube root as special cases. One limitation is that it only works for strictly positive values. The specific form of the transformation is controlled by the parameter lambda, which, in tidymodels, is estimated behind the scenes by the recipe step. A more recent family of transformations was proposed by Yeo and Johnson in 2000. It shares many properties with Box-Cox, with the advantage that it can handle zero and negative values. Moreover, when values are positive, Yeo-Johnson is equivalent to Box-Cox applied to y + 1.
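As a sketch, the two families described above can be written out explicitly (these are the standard textbook forms, not shown in the transcript):

```latex
% Box-Cox (requires y > 0)
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[4pt]
\log y, & \lambda = 0
\end{cases}

% Yeo-Johnson (defined for any real y)
\psi(y, \lambda) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\; y \ge 0 \\[4pt]
\log(y + 1), & \lambda = 0,\; y \ge 0 \\[4pt]
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\; y < 0 \\[4pt]
-\log(1 - y), & \lambda = 2,\; y < 0
\end{cases}
```

Note that for y greater than or equal to zero, the Yeo-Johnson branches are exactly the Box-Cox transform of y + 1, which is where the equivalence mentioned above comes from.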

3. The loans_num dataset

We will be working with a version of the loans dataset that contains mostly numeric features and will try to predict Loan_Status. Note that the variable CoapplicantIncome has some zero values.
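A quick way to confirm the zero values mentioned above is to count them directly. This is a hedged sketch: the data frame name loans_num and the column CoapplicantIncome are taken from the lesson, but the exact inspection code is an assumption.

```r
# Sketch: inspect the dataset and count the zero values in CoapplicantIncome
# (loans_num is assumed to be loaded as in the course)
summary(loans_num$CoapplicantIncome)
sum(loans_num$CoapplicantIncome == 0)
```

Any zero (or negative) values here will matter shortly, because Box-Cox requires strictly positive inputs.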

4. Applying transformations

Let's start with a plain recipe that only defines the model's formula, to set a baseline. We have already created the test and train datasets and declared logistic regression as our model. Running our user-defined class_evaluate function, we get an accuracy of 0-point-817 and a ROC AUC of 0-point-641. Next, let's experiment with some transformations.
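The baseline step described above might look like the following sketch. The object names (loans_train, loans_test) and the user-defined class_evaluate helper are assumptions based on the lesson's description, not code shown in the transcript.

```r
library(tidymodels)

# Plain recipe: only the model formula, no transformation steps yet
plain_recipe <- recipe(Loan_Status ~ ., data = loans_train)

# Logistic regression specification with the glm engine
logistic_model <- logistic_reg() %>%
  set_engine("glm")

# Bundle recipe and model into a workflow and fit on the training data
plain_wf <- workflow() %>%
  add_recipe(plain_recipe) %>%
  add_model(logistic_model) %>%
  fit(data = loans_train)

# class_evaluate() is the course's user-defined helper that reports
# accuracy and ROC AUC on the test set
class_evaluate(plain_wf, loans_test)
```

Keeping the recipe this bare makes it easy to attribute any later change in accuracy or ROC AUC to the transformation steps we add.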

5. Applying transformations

We can modify our recipe by adding a step to apply the Box-Cox transformation to all the numeric variables. Running this script gives us a warning, as CoapplicantIncome has a few non-positive values.
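A minimal sketch of that modified recipe, assuming the same training data as before:

```r
# Apply Box-Cox to every numeric variable; because CoapplicantIncome
# contains zeros, prepping this recipe produces a warning and the
# step leaves that variable untransformed
boxcox_recipe <- recipe(Loan_Status ~ ., data = loans_train) %>%
  step_BoxCox(all_numeric())
```

The lambda for each transformed variable is estimated from the training data when the recipe is prepped, so no manual tuning is needed.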

6. Applying transformations

We can avoid the warning by de-selecting the offending variable in step_BoxCox: we add it after a comma, preceded by a minus sign. As a result, our recipe will not transform it. Evaluating the workflow, we obtain similar accuracy as before but a slight decrease in ROC AUC. One route would be to find a transformation that deals specifically with CoapplicantIncome, but let's try Yeo-Johnson instead.
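The de-selection described above uses tidyselect's minus-sign syntax inside the step. A sketch, with the surrounding workflow names assumed as before:

```r
# De-select CoapplicantIncome so Box-Cox skips it and no warning is raised
boxcox_recipe <- recipe(Loan_Status ~ ., data = loans_train) %>%
  step_BoxCox(all_numeric(), -CoapplicantIncome)

# Rebuild and fit the workflow with the new recipe, then evaluate
boxcox_wf <- workflow() %>%
  add_recipe(boxcox_recipe) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = loans_train)

class_evaluate(boxcox_wf, loans_test)
```

The trade-off is that CoapplicantIncome now reaches the model completely untransformed, which may explain the slight drop in ROC AUC.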

7. Applying transformations

The Yeo-Johnson transformation has no restrictions on the sign of the variables, and when they are strictly positive, it is equivalent to Box-Cox applied to y + 1. We can apply it to all the numeric variables in our recipe without getting any warnings or errors. When we evaluate the complete workflow using our test data, we obtain a similar accuracy value but a better ROC AUC, indicating an improvement in the quality of our model.
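Swapping the step is a one-line change; step_YeoJohnson is the recipes counterpart to step_BoxCox and needs no de-selection here. A sketch, again assuming the earlier object names:

```r
# Yeo-Johnson handles zeros and negatives, so every numeric variable
# (including CoapplicantIncome) can be transformed without warnings
yj_recipe <- recipe(Loan_Status ~ ., data = loans_train) %>%
  step_YeoJohnson(all_numeric())

yj_wf <- workflow() %>%
  add_recipe(yj_recipe) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = loans_train)

class_evaluate(yj_wf, loans_test)
```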

8. Let's practice!

It is time to practice what we learned.