
Regression: regularization

1. Regression: regularization

In the last video, we looked at feature selection methods, one of which was embedded methods. I promised we'd cover them in this lesson. So let's get going!

2. Regularization algorithms

In this video, we'll cover ridge regression, lasso regression, and their hybrid, known as elastic net.

3. Ordinary least squares

Ridge, lasso, and elastic net are forms of regularization: simple techniques designed to reduce model complexity and help prevent overfitting. They do so by adding a penalty term to the ordinary least squares, or OLS, formula. OLS seeks to minimize the sum of the squared residuals: given a regression line through the data, we calculate the distance from each data point to the line, square it, and sum all of those squared errors together. This is the quantity that ordinary least squares seeks to minimize.
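
As a point of reference, here's a minimal sketch of the OLS objective in generic notation (the symbols are illustrative, not pulled from the slides):

```latex
% Ordinary least squares: minimize the residual sum of squares (RSS)
% y_i is the observed value, \hat{y}_i the fitted value from the regression line
\text{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
           = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2
```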

4. Ridge loss function

With ridge, the penalty term is formed by multiplying the penalty parameter, lambda, by the sum of the squared coefficient values, the beta j's. This shrinks the coefficients toward, but not to, zero, and is called L2 regularization or the L2-norm. The left image shows a plot of four coefficient estimates as lambda goes from near zero on the left, which essentially gives back the OLS estimates, to over 10,000 on the right, a rather large lambda that likely adds too much shrinkage, pushing values toward zero and leading to underfitting. The grey lines are additional coefficients that start near zero and so are penalized little, if at all, by the value of lambda.
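
In the same notation, a sketch of the ridge loss, with the L2 penalty added to the squared residuals:

```latex
% Ridge (L2) loss: RSS plus lambda times the sum of squared coefficients
L_{\text{ridge}}(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
                        + \lambda \sum_{j=1}^{p} \beta_j^2
```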

5. Lasso loss function

Lasso, also called L1 regularization or the L1-norm, is similar to ridge except that it takes the absolute value of the coefficients instead of their squares. This shrinks the less important feature coefficients all the way to zero, giving a form of feature selection, which is demonstrated in the left image as coefficient estimates shrink to zero while lambda increases. Large values of lambda will force more coefficients to zero than should be and cause model underfitting.
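
And a corresponding sketch of the lasso loss, with absolute values replacing the squares in the penalty:

```latex
% Lasso (L1) loss: RSS plus lambda times the sum of absolute coefficient values
L_{\text{lasso}}(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
                        + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
```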

6. Ridge vs lasso

Ridge and lasso oppose each other in most of their characteristics. While lasso has multiple sparse solutions, sparsity meaning that some coefficients are exactly zero, ridge has only one, non-sparse solution. While lasso provides a form of feature selection by shrinking coefficients to zero and is robust to outliers, ridge does neither. Finally, ridge is able to learn complex data patterns, while lasso, although it generates a simple and interpretable model, cannot.

7. ElasticNet

That brings us to elastic net, a hybrid of lasso and ridge that uses what is called an L1-ratio. Intuitively, elastic net combines the two penalization methods, with the penalty becoming an L2 when the L1-ratio is 0 and an L1 when it is 1. This allows for much more flexible regularization anywhere between lasso and ridge. Lambda is a shared penalization parameter, while alpha sets the ratio between L1 and L2 regularization. And although you won't be using it in the exercises that follow, it's good to note that the elastic net function sets alpha automatically.
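
One common way to write the combined penalty, using this lesson's convention where alpha is the L1-ratio (sklearn's ElasticNet parameterizes it slightly differently):

```latex
% Elastic net loss: a mix of the L1 and L2 penalties, controlled by the L1-ratio alpha
% alpha = 1 recovers lasso, alpha = 0 recovers ridge
L_{\text{enet}}(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
  + \lambda \left( \alpha \sum_{j=1}^{p} \left| \beta_j \right|
  + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right)
```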

8. Regularization with Boston housing data

The Boston housing data is used to build a model that predicts the sale prices of homes from several features. Here are just a few of the estimated beta coefficients. With the original data, whether or not a house is on the Charles River and the nitric oxide concentration seem as important as, or more important than, the number of rooms. After regularization, however, the coefficients for the river and nitric oxide are zero, indicating their actual lack of importance in predicting housing prices, while rooms remains relatively far from zero. Removing these unimportant features results in less noise during model training and higher accuracy.
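
To illustrate the effect described above, here's a minimal sketch (using synthetic data rather than the Boston housing data shown on the slide) of fitting a lasso model and checking which coefficients get driven to zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data standing in for the housing features:
# only 3 of the 8 features actually carry signal
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=5.0)   # alpha is sklearn's name for the penalty (lambda)
lasso.fit(X, y)

# Coefficients of uninformative features shrink to exactly zero
print(lasso.coef_)
```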

9. Regularization functions

A few regularization functions are, from sklearn dot linear_model, Lasso and LassoCV, Ridge and RidgeCV, and ElasticNet and ElasticNetCV, all of which return their respective estimators with or without cross-validation. You've seen train_test_split and MSE before, of course. Although you'll see the penalty parameter written as lambda more often than not, in the sklearn ridge and lasso functions it's the alpha keyword argument, something to keep in mind. From a trained CV estimator, the best regularization parameter is retrieved by calling the name of the trained model object dot alpha underscore. Finally, np dot logspace, given a start, stop, and number of values, builds an array of penalty values that can be assigned to the alpha keyword argument and tried during model training.
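
Putting those pieces together, here's a minimal sketch of the workflow (the variable names and alpha grid are illustrative, not the exercise code):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data and a train/test split
X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# np.logspace(start, stop, num) builds the grid of penalty values to try
alphas = np.logspace(-3, 3, 20)

# Cross-validated lasso picks the best penalty from the grid
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)

print(lasso_cv.alpha_)                                       # best penalty found by CV
print(mean_squared_error(y_test, lasso_cv.predict(X_test)))  # test-set MSE
```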

10. Let's practice!

Let's get to regularizing!
