
Shrinkage methods

1. Shrinkage methods

Regularization, or shrinkage, is a technique used to prevent overfitting and improve the generalization performance of models. When a model is too complex, it is prone to overfitting, as it learns the noise or random variations in the training data instead of the underlying patterns.

2. Two common regularization techniques

Regularization adds a penalty term to the loss function that the model minimizes during training. This term makes large weights costly, encouraging the model to use smaller weights and simpler fits that generalize better. We will focus on Lasso and Ridge regularization. Lasso adds a penalty proportional to the absolute value of the model weights. It can be used for feature selection, as it effectively eliminates unimportant features. Ridge adds a penalty proportional to the square of the model weights. It does not shrink coefficients all the way to zero like Lasso, but it can effectively reduce overfitting. These techniques apply beyond linear and logistic regression.
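As a rough sketch in symbols (lambda and the beta weights are generic notation, not from the slides), the penalized loss looks like loss(beta) + lambda * sum(|beta_j|) for Lasso and loss(beta) + lambda * sum(beta_j^2) for Ridge, where lambda is the penalty strength we will tune later.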

3. A first look at Lasso

We set up a standard recipe with a logistic regression model and set the engine to "glmnet," which supports regularization and tuning. We declare mixture equal to one to indicate Lasso, and an arbitrary penalty of zero-point-two. We can inspect the resulting weights by "tidying" the fit object.
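In code, this step looks roughly like the sketch below; loan_recipe and loan_train are placeholder names for the course's recipe and training data, not objects defined here.

library(tidymodels)

# mixture = 1 selects the Lasso penalty; penalty = 0.2 is the arbitrary value from the slide
lasso_spec <- logistic_reg(penalty = 0.2, mixture = 1) %>%
  set_engine("glmnet")

lasso_fit <- workflow() %>%
  add_recipe(loan_recipe) %>%   # assumed recipe object
  add_model(lasso_spec) %>%
  fit(data = loan_train)        # assumed training data

tidy(lasso_fit)                 # inspect the fitted coefficients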

4. Simple logistic regression vs. Lasso

Regularizing the model with a penalty of zero-point-two effectively shrank all coefficients to zero, except for the intercept and Credit_History. But is this a good model?

5. Hyperparameter tuning

The penalty is a hyperparameter, which means it needs to be supplied by us. We can run the model many times with different values by setting the penalty equal to "tune" and defining a grid to search a range of possible penalty values. We evaluate these values using a technique called cross-validation, meaning we split the training set into, say, five subsets and choose one for validation and the rest for training. We repeat that process, selecting each of the other splits in turn, and average the results according to a chosen metric. The chart shows how ROC_AUC varies with the penalty.
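A minimal sketch of the tuning step, reusing the assumed loan_recipe and loan_train objects; the grid size is an illustrative choice, not the course's exact setting.

tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

folds <- vfold_cv(loan_train, v = 5)                  # five-fold cross-validation
penalty_grid <- grid_regular(penalty(), levels = 30)  # range of penalty values to try

lasso_tuned <- workflow() %>%
  add_recipe(loan_recipe) %>%
  add_model(tune_spec) %>%
  tune_grid(resamples = folds,
            grid = penalty_grid,
            metrics = metric_set(roc_auc))

autoplot(lasso_tuned)   # ROC_AUC as a function of the penalty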

6. Exploring the results

A loss function measures how well a machine learning model can predict the correct output. A penalty is a term added to the loss function to encourage or discourage certain behaviors in the model. To select the best penalty value, we explore a host of possibilities and measure them against some metric. Let's set ROC_AUC as our criterion using the select_by_one_std_err() function and fit the model using the best penalty. finalize_workflow() will incorporate the best penalty value into the tuned workflow.
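A sketch of the selection and finalization step, assuming the lasso_tuned results and tune_spec from the previous sketch.

best_penalty <- select_by_one_std_err(lasso_tuned,
                                      desc(penalty),
                                      metric = "roc_auc")

final_wf <- workflow() %>%
  add_recipe(loan_recipe) %>%
  add_model(tune_spec) %>%
  finalize_workflow(best_penalty)   # plugs the chosen penalty into the workflow

final_fit <- fit(final_wf, data = loan_train)
tidy(final_fit)                     # coefficients of the tuned Lasso model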

7. Simple logistic regression vs. tuned Lasso

We see that the regularized model set zero-point-452 as the penalty and shrank all coefficients to zero, except the intercept, Credit_History, Married, and Property_Area, while simple logistic regression assigns values to all feature coefficients.

8. Ridge regularization

Setting mixture to zero, we get Ridge regularization. After tuning, we get a different view of the regularization parameter: ROC_AUC seems to improve as the penalty grows across the range we searched, suggesting Ridge is a better model for this dataset.
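Only the mixture changes for Ridge; this sketch reuses the assumed folds and penalty_grid objects from the Lasso tuning step.

ridge_spec <- logistic_reg(penalty = tune(), mixture = 0) %>%   # mixture = 0 selects Ridge
  set_engine("glmnet")

ridge_tuned <- workflow() %>%
  add_recipe(loan_recipe) %>%
  add_model(ridge_spec) %>%
  tune_grid(resamples = folds,
            grid = penalty_grid,
            metrics = metric_set(roc_auc))

autoplot(ridge_tuned)   # ROC_AUC versus the penalty for Ridge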

9. Ridge regularization

After Ridge, we get a model with coefficients for all features. Unlike Lasso, Ridge does not set coefficients to zero, but some become quite small. With many features, Lasso is generally preferable, whereas Ridge performs better with a few features. We can also combine the two penalties by setting the mixture to a value between zero and one. This is known as the elastic net.
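For reference, an elastic net specification is a one-line change; the mixture value of zero-point-five below is an arbitrary illustration, and mixture itself can also be tuned.

elastic_spec <- logistic_reg(penalty = tune(), mixture = 0.5) %>%   # 0 < mixture < 1 blends Ridge and Lasso
  set_engine("glmnet")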

10. Ridge vs. Lasso

From the chart, we can see how Lasso effectively chooses three features and sets everything else to zero, while Ridge assigns small but non-zero values to all the features.

11. Let's practice!

That was a mouthful of concepts. Let's get to work!