1. Multiple preprocessing methods
The preProcess argument to train can do a lot more than missing value imputation.

2. The wide world of preProcess
It exposes a very wide range of preprocessing steps that can have a large impact on the results of your models, and you can chain multiple preprocessing steps together. For example, you can use median imputation, then center and scale your data, then fit a glm model. This is a common "recipe" for preprocessing data prior to fitting a linear model. Note that there is an "order of operations" to the preprocessing steps: caret applies them in a fixed sequence, regardless of the order in which you list them. For example, centering and scaling always happen prior to median imputation, and principal components analysis always happens after centering and scaling. You can read the help file for the preProcess function for much more detail on this.

3. Example: preprocessing mtcars
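The chaining described above can be sketched as follows, assuming the caret package is installed. You pass a character vector of step names to train()'s preProcess argument; caret then applies them in its fixed internal order, not the order you list them in.

```r
# Sketch: chaining preprocessing steps via train()'s preProcess argument.
# caret applies these in a fixed order (center/scale before imputation,
# pca after center/scale), regardless of the order listed here.
library(caret)

model <- train(
  x = mtcars[, 2:11], y = mtcars$mpg,
  method = "glm",
  preProcess = c("medianImpute", "center", "scale")
)
```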
Let's load the mtcars dataset and add some data that is missing at random. Now let's use our linear model recipe: center and scale, then median imputation, then fit a glm model. This is as simple as passing a character vector of instructions to the preProcess argument of train. Centering and scaling will happen first, then imputation, then fitting the glm model.

4. Example: preprocessing mtcars
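A minimal sketch of this step, assuming caret is installed. Missing-at-random values are injected into one column of mtcars; the x/y interface to train() is used so that rows with NAs are not dropped before imputation (the formula interface omits incomplete rows by default). The seed and the choice of the hp column are illustrative assumptions, not taken from the original exercise.

```r
# Sketch: inject missing-at-random values, then impute + center + scale
# before fitting a glm.
library(caret)

set.seed(42)                                    # illustrative seed
data(mtcars)
mtcars[sample(nrow(mtcars), 10), "hp"] <- NA    # missing at random

model <- train(
  x = mtcars[, 2:11], y = mtcars$mpg,           # x/y interface keeps NA rows
  method = "glm",
  preProcess = c("medianImpute", "center", "scale")
)
print(model)
```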
We can add additional transformations to our model as well, for example principal components analysis after imputation. This yields a slightly more accurate model, in terms of RMSE.

5. Example: preprocessing mtcars
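Adding PCA is just one more entry in the preProcess vector; a sketch, assuming the same mtcars setup with injected NAs as above:

```r
# Sketch: append "pca" to the recipe; caret runs it after center/scale.
library(caret)

model_pca <- train(
  x = mtcars[, 2:11], y = mtcars$mpg,
  method = "glm",
  preProcess = c("medianImpute", "center", "scale", "pca")
)
min(model_pca$results$RMSE)   # compare RMSE against the non-PCA model
```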
There are many other cool transformations we can use, for example the spatial sign transformation. This transformation projects your data onto a sphere, and is very useful for datasets with lots of outliers or particularly high dimensionality, but in this case it doesn't improve our model.

6. Preprocessing cheat sheet
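The spatial sign transformation is requested the same way; a sketch, again assuming the mtcars setup with injected NAs:

```r
# Sketch: "spatialSign" projects the (centered, scaled) predictors onto
# a unit sphere, which can help with outliers or high dimensionality.
library(caret)

model_ss <- train(
  x = mtcars[, 2:11], y = mtcars$mpg,
  method = "glm",
  preProcess = c("medianImpute", "center", "scale", "spatialSign")
)
min(model_ss$results$RMSE)
```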
The number of preprocessing steps caret provides can be a little overwhelming, so I'll leave you with this cheat sheet. First of all, always start with median imputation. This will save you all kinds of weird issues with messy datasets. Just remember to also try knn imputation if you suspect your data might have values missing not-at-random. Second, for linear models like lm, glm, and glmnet, always center and scale. You just get better results. Third, it's worth trying PCA and the spatial sign transformation for your linear models. Sometimes these methods can yield better results. Finally, tree-based models such as random forests or GBMs typically don't need much preprocessing. You can usually get away with just median imputation.

7. Let's practice!
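The knn-imputation alternative mentioned in the cheat sheet is a one-word swap in the preProcess vector; a sketch, assuming the same mtcars data with injected NAs:

```r
# Sketch: "knnImpute" instead of "medianImpute", for data you suspect
# is missing not-at-random. Note caret's knn imputation works on the
# centered and scaled predictors.
library(caret)

model_knn <- train(
  x = mtcars[, 2:11], y = mtcars$mpg,
  method = "glm",
  preProcess = c("knnImpute", "center", "scale")
)
```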
Let's try these transformations on some other datasets.