Flight duration model: Regularization!
In the previous exercise you added more predictors to the flight duration model. The model performed well on testing data, but with so many coefficients it was difficult to interpret.
In this exercise you'll use Lasso regression (regularized with an L1 penalty) to create a more parsimonious model. Many of the coefficients in the resulting model will be set to zero, which means that only a subset of the predictors actually contributes to the model. Despite being simpler, the model still produces a good RMSE on the testing data.
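One way to build intuition for why an L1 penalty yields coefficients that are exactly zero is the soft-thresholding operator, which shrinks a value toward zero and clips it to zero once it falls inside the threshold. The sketch below is only an illustration: the function name `soft_threshold` and the coefficient values are hypothetical, and it is not a description of how Spark's solver works internally.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if |value| <= threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


# Hypothetical unregularized coefficients, made up for illustration.
raw_coefficients = [3.5, -0.4, 0.9, -2.0, 0.1]

# Apply a threshold of 1.0: small coefficients become exactly zero,
# large ones are shrunk but survive.
shrunk = [soft_threshold(beta, 1.0) for beta in raw_coefficients]
print(shrunk)
```

Coefficients whose magnitude is below the threshold drop out entirely, which is the behavior that makes a Lasso model easier to interpret.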
You'll use a specific value for the regularization strength. Later you'll learn how to find the best value using cross-validation.
The data (same as previous exercise) are available as flights, randomly split into flights_train and flights_test.
There are two parameters for this model, λ (regParam) and α (elasticNetParam), where α determines the type of regularization and λ gives the strength of regularization.
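To make the roles of λ and α concrete, the sketch below computes the elastic net penalty in plain Python, following the form given in the Spark MLlib documentation: λ · (α · ‖β‖₁ + (1 − α)/2 · ‖β‖₂²). The function name `elastic_net_penalty` and the example coefficients are hypothetical; this shows only the penalty term, not the full objective the model minimizes.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Penalty term: reg_param * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1_norm = sum(abs(beta) for beta in coefficients)
    l2_norm_squared = sum(beta * beta for beta in coefficients)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1 - elastic_net_param) / 2 * l2_norm_squared
    )


# Hypothetical coefficients, made up for illustration.
betas = [2.0, -1.0, 0.0]

# alpha = 1 gives a pure L1 (Lasso) penalty; alpha = 0 gives a pure L2 (Ridge) penalty.
lasso_penalty = elastic_net_penalty(betas, reg_param=1.0, elastic_net_param=1.0)
ridge_penalty = elastic_net_penalty(betas, reg_param=1.0, elastic_net_param=0.0)
print("Lasso penalty:", lasso_penalty)
print("Ridge penalty:", ridge_penalty)
```

Values of α between 0 and 1 blend the two penalties, while λ scales the whole term up or down.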
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Fit a linear regression model to the training data. Set the regularization strength to 1.
- Calculate the RMSE on the testing data.
- Look at the model coefficients.
- How many of the coefficients are equal to zero?
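The two quantities the instructions ask for, the RMSE and the number of zero coefficients, can be sketched in plain Python without Spark. Everything below is illustrative: the helper `rmse` and all numbers are hypothetical stand-ins, whereas in the exercise the RMSE comes from an evaluator and the coefficients come from the fitted model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical flight durations (minutes) and model predictions.
actual_durations = [100.0, 150.0, 200.0]
predicted_durations = [110.0, 140.0, 200.0]
test_rmse = rmse(actual_durations, predicted_durations)
print("The test RMSE is", test_rmse)

# Hypothetical coefficient vector from a regularized model.
coefficients = [0.07, 0.0, 0.0, 1.5, 0.0]

# Count the coefficients that are exactly zero.
zero_count = sum(1 for beta in coefficients if beta == 0)
print("Number of coefficients equal to 0:", zero_count)
```

The RMSE is in the same units as the target, so it can be read directly as a typical prediction error in minutes.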
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Fit Lasso model (λ = 1, α = 1) to training data
regression = ____(____, ____, elasticNetParam=1).____(____)
# Calculate the RMSE on testing data
rmse = ____(____).____(____)
print("The test RMSE is", rmse)
# Look at the model coefficients
coeffs = regression.____
print(coeffs)
# Number of zero coefficients
zero_coeff = sum([____ for beta in regression.coefficients])
print("Number of coefficients equal to 0:", zero_coeff)