
Reducing the model's features

1. Reducing the model's features

When we get our hands on a new dataset, it is not immediately clear which predictors are relevant to our problem and which are not.

2. Reasons to reduce the number of features

While it is tempting to keep all features in our model out of fear of losing valuable information, eliminating irrelevant or low-information variables often pays off. It can reduce variance without significantly increasing bias, which helps the model perform better on unseen data. It also keeps computation time in check: uninformative variables won't significantly improve our model's predictive capacity, but they will still consume memory and processing resources. Finally, a less complex model is not only easier to run but also easier to interpret.

3. Sifting data through variable importance

Domain knowledge is, of course, our first line of defense when choosing relevant variables. But even with a good understanding of the problem context, there might be relationships that are far from evident, or others that seem obvious but turn out to be less significant than expected. A valuable barometer to sift through the data is variable importance. For example, consider the full set of variables in the loans dataset. After fitting a logistic regression model using all available predictors, we can explore variable importance with the vip() function from the vip package to identify the features with the highest predictive power. To use vip(), we first call extract_fit_parsnip() to pull out the model object that vip() needs to estimate variable importance. The top three features are, first, Credit_History; second, Property_Area; and third, LoanAmount, so we will build a model using only these three variables.
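
A minimal sketch of this step might look like the following, assuming the data sit in a data frame called loans with a binary factor outcome Loan_Status, a logistic regression model with the glm engine, and the tidymodels and vip packages installed; the object names are illustrative rather than taken from the exercise.

```r
library(tidymodels)
library(vip)

# Fit a logistic regression workflow on all available predictors
full_workflow <- workflow() %>%
  add_formula(Loan_Status ~ .) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

full_fit <- full_workflow %>% fit(data = loans)

# extract_fit_parsnip() pulls out the underlying parsnip model object,
# which is what vip() needs to compute variable importance
full_fit %>%
  extract_fit_parsnip() %>%
  vip()
```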

4. Build a reduced model using the formula syntax

There are a few ways to build a reduced model that includes only a subset of the variables. Let's look at two approaches: Using the formula syntax and creating a features vector. The formula syntax provides a very intuitive and concise way to build our model, but it can become tedious if our "reduced" set consists of many features.
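
As a hedged illustration of the formula approach, continuing from the sketch above with the assumed loans data and Loan_Status outcome, we simply list the three chosen predictors on the right-hand side of the formula.

```r
# Reduced model via the formula syntax
formula_workflow <- workflow() %>%
  add_formula(Loan_Status ~ Credit_History + Property_Area + LoanAmount) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

formula_fit <- formula_workflow %>% fit(data = loans)
```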

5. Build a reduced model by creating a features vector

Creating a features vector is a bit more involved, but for a large number of variables we can take advantage of the selection methods available in the tidyverse, such as select() along with its helper functions. You can learn more about advanced selection methods on DataCamp.
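
One possible sketch of the features-vector approach, under the same assumptions as the previous snippets; the helper objects (features, vector_workflow) are illustrative names, not necessarily what the course uses.

```r
# Build a vector of predictor names with select(); tidyselect helpers such as
# starts_with() could be used here as well for larger selections
features <- loans %>%
  select(Credit_History, Property_Area, LoanAmount) %>%
  colnames()

# Assemble the reduced formula from the features vector
vector_workflow <- workflow() %>%
  add_formula(
    as.formula(paste("Loan_Status ~", paste(features, collapse = " + ")))
  ) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

vector_fit <- vector_workflow %>% fit(data = loans)
```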

6. Creating the augmented objects

We can create the augmented objects for both approaches as usual, by fitting each workflow and piping it into augment(). To verify that both outcomes are the same, we can use the all_equal() function, which takes two data frames and compares them. Note that with the features-vector approach the augmented object contains additional columns, so we need to explicitly select the columns that start with ".pred". As expected, both methods yield the same results. But how does our reduced version compare with the full model?
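
A sketch of how those two augmented objects could be built and compared, continuing from the snippets above; augmenting on the training data here is purely for illustration, since the course may well use a separate test split.

```r
# Create the augmented data frames for both reduced fits
formula_aug <- formula_fit %>% augment(new_data = loans)
vector_aug  <- vector_fit  %>% augment(new_data = loans)

# The features-vector object can carry extra columns, so keep only the outcome
# and the prediction columns (those starting with ".pred") before comparing
all_equal(
  formula_aug %>% select(Loan_Status, starts_with(".pred")),
  vector_aug  %>% select(Loan_Status, starts_with(".pred"))
)
```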

7. Comparing the full and reduced models

Since both of our reduced versions are really the same, we only need to compare one of them with the full model, so let's use the one built with the formula syntax. Using the user-defined class_evaluate() function that we have been using throughout the course, we assess both models, the one built on the top three features and the one using all features, and find that the three-feature model performs almost as well as the full model, matching its accuracy with only a slightly lower ROC AUC!
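
The class_evaluate() function is defined earlier in the course and is not reproduced here. As a rough stand-in, the same comparison can be made directly with yardstick metrics; the .pred_Y column name below assumes "Y" is one of the outcome's factor levels, which is an assumption, not something stated in this video.

```r
# Accuracy and ROC AUC for both models, collected in one table
class_metrics <- metric_set(accuracy, roc_auc)

full_aug <- full_fit %>% augment(new_data = loans)

bind_rows(
  full    = class_metrics(full_aug,    truth = Loan_Status,
                          estimate = .pred_class, .pred_Y),
  reduced = class_metrics(formula_aug, truth = Loan_Status,
                          estimate = .pred_class, .pred_Y),
  .id = "model"
)
```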

8. Let's practice!

Let's go and select some feature subsets!