1. Handling low-information predictors
In the real world, the data we're using for predictive modeling is often messy.
2. No (or low) variance variables
Some variables in our dataset might not contain much information. For example, variables that are constant, or very close to constant, carry little useful signal, and it is often worth removing them prior to modeling.
Nearly constant variables are particularly tricky, because it is easy for one fold of cross-validation to end up with a completely constant column. Constant columns can break a lot of models, and should be avoided. Furthermore, nearly constant columns contain so little information that they tend to have little or no impact on the results of your model.
In general, I remove extremely low-variance variables from my datasets prior to modeling. This speeds up my models, makes them less error-prone, and generally has little impact on their accuracy.
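As a quick sketch of how you might spot these variables, caret's nearZeroVar() function reports variance diagnostics for every column. The const column here is invented for illustration:

```r
library(caret)

data(mtcars)
mtcars$const <- 1  # hypothetical constant column, added for illustration

# saveMetrics = TRUE returns per-column diagnostics: zeroVar flags
# constant columns, nzv flags nearly constant ones
nearZeroVar(mtcars, saveMetrics = TRUE)
```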
3. Example: constant column in mtcars
Let's have a look at the mtcars dataset from the last video. We'll add a constant-valued column to this dataset, and then try to fit our linear regression "recipe."
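The code itself isn't shown on the slide, but it looks something like the sketch below. The exact recipe from the last video is assumed here to be a linear model with centering, scaling, and PCA, and the const column name is made up:

```r
library(caret)

data(mtcars)
mtcars$const <- 1  # the constant-valued column we just added

set.seed(42)
model <- train(
  mpg ~ ., data = mtcars,
  method = "lm",
  preProcess = c("center", "scale", "pca")  # assumed recipe from the last video
)
model
```

Depending on your version of caret, this either reports missing values for every metric or fails outright when it tries to rescale the constant column.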
4. Example: constant column in mtcars
As you can see, something has gone horribly wrong with this model, but it's hard to tell what. All of the metrics are missing.
This error is due to the constant-valued column, which has a standard deviation of 0. Therefore, when we try to scale the column by dividing by the standard deviation, we end up with a whole bunch of missing values, which throw off the subsequent stages of modeling.
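You can see the problem in miniature with base R: the standard deviation of a constant vector is 0, and dividing by it produces NaN.

```r
const <- rep(5, 10)

sd(const)  # 0: a constant column has zero standard deviation

# Centering and scaling divides 0 by 0, which yields NaN
(const - mean(const)) / sd(const)
```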
5. caret to the rescue (again)
Fortunately, caret again saves us a lot of work. We can add "zv" to the preProcess argument to remove constant-valued columns, or "nzv" to remove nearly constant columns. By adding "zv" to our PCA and regression recipe, we fix the error and get useful results out of our caret model.
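Here's a sketch of that fix, continuing from the earlier example and using the same assumed recipe. caret applies preprocessing steps in a fixed internal order, with the "zv" filter running before centering and scaling, so the constant column is gone by the time any standard deviations are computed:

```r
set.seed(42)
model <- train(
  mpg ~ ., data = mtcars,
  method = "lm",
  preProcess = c("zv", "center", "scale", "pca")  # "nzv" would also drop nearly constant columns
)
model  # RMSE and R-squared are now reported
```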
6. Let’s practice!
Let's explore nearly constant, or low-variance columns in more detail.