
Multiple preprocessing methods

1. Multiple preprocessing methods

The preProcess argument to train can do a lot more than missing value imputation.

2. The wide world of preProcess

It exposes a wide range of preprocessing steps that can have a large impact on your models' results, and you can chain multiple steps together. For example, you can use median imputation, then center and scale your data, then fit a glm model. This is a common "recipe" for preprocessing data prior to fitting a linear model. Note that there is an "order of operations" to the preprocessing steps: for example, centering and scaling always happen prior to median imputation, and principal component analysis always happens after centering and scaling. You can read the help file for the preProcess function for much more detail on this.

3. Example: preprocessing mtcars

Let's load the mtcars dataset, and add some missing at random data. Now let's use our linear model recipe: center and scale, then median imputation, then fit a glm model. This is as simple as passing a character vector of instructions to the preProcess argument for train. Centering and scaling will happen first, then imputation, then fitting the glm model.
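A minimal sketch of this recipe in caret; the missing-at-random step, the choice of mpg as the outcome, and the 5-fold cross-validation setup are illustrative assumptions, not part of the original example:

```r
library(caret)

# Load mtcars and knock out some values at random (illustrative)
data(mtcars)
set.seed(42)
mtcars[sample(nrow(mtcars), 10), "hp"] <- NA

# Chain preprocessing steps via the preProcess argument.
# caret applies them in its own fixed order (center/scale
# happen before medianImpute), regardless of the order in
# this character vector.
model <- train(
  mpg ~ ., data = mtcars,
  method = "glm",
  preProcess = c("medianImpute", "center", "scale"),
  trControl = trainControl(method = "cv", number = 5),
  na.action = na.pass  # keep rows with NAs so imputation can handle them
)
print(model)
```

Note the na.action = na.pass: without it, train drops incomplete rows before preprocessing ever sees them, and the imputation step has nothing to do.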

4. Example: preprocessing mtcars

We can add additional transformations to our model as well, for example principal components analysis after imputation. This yields a slightly more accurate model, in terms of RMSE.
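A sketch of adding PCA to the recipe, under the same illustrative setup as before (whether it actually lowers RMSE on your run depends on the random missingness and the cross-validation folds):

```r
library(caret)

data(mtcars)
set.seed(42)
mtcars[sample(nrow(mtcars), 10), "hp"] <- NA

# Append "pca" to the preprocessing vector; caret runs it last,
# after imputation, centering, and scaling.
model_pca <- train(
  mpg ~ ., data = mtcars,
  method = "glm",
  preProcess = c("medianImpute", "center", "scale", "pca"),
  trControl = trainControl(method = "cv", number = 5),
  na.action = na.pass
)
min(model_pca$results$RMSE)
```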

5. Example: preprocessing mtcars

There are many other cool transformations we can use, for example the spatial sign transformation. This transformation projects your data onto a unit sphere, and is very useful for datasets with lots of outliers or particularly high dimensionality, but in this case it doesn't improve our model.
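A sketch of swapping in the spatial sign transformation, again under the same illustrative setup:

```r
library(caret)

data(mtcars)
set.seed(42)
mtcars[sample(nrow(mtcars), 10), "hp"] <- NA

# "spatialSign" projects the predictors onto a unit sphere;
# caret applies it after centering and scaling.
model_ss <- train(
  mpg ~ ., data = mtcars,
  method = "glm",
  preProcess = c("medianImpute", "center", "scale", "spatialSign"),
  trControl = trainControl(method = "cv", number = 5),
  na.action = na.pass
)
model_ss$results$RMSE
```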

6. Preprocessing cheat sheet

The number of preprocessing steps caret provides can be a little overwhelming, so I'll leave you with this cheat sheet:

- First of all, always start with median imputation. This will save you all kinds of weird issues with messy datasets. Just remember to also try knn imputation if you suspect your data might have values missing not-at-random.
- Second, for linear models like lm, glm, and glmnet, always center and scale. You just get better results.
- Third, it's worth trying PCA and the spatial sign transformation for your linear models. Sometimes these methods can yield better results.
- Finally, tree-based models such as random forests or GBMs typically don't need much preprocessing. You can usually get away with just median imputation.

7. Let’s practice!

Let's try these transformations on some other datasets.
