Dissecting the best flight duration model
You just set up a CrossValidator to find good parameters for the linear regression model predicting flight duration.
The model pipeline has multiple stages (objects of type StringIndexer, OneHotEncoder, VectorAssembler and LinearRegression), which operate in sequence. The stages are available as the stages attribute on the pipeline object. They are stored in a list and are executed in the order in which they appear in that list.
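To make the idea of an ordered list of stages concrete, here is a minimal pure-Python sketch. The classes AddOne, Double and ToyPipeline are hypothetical stand-ins invented for illustration; they are not part of PySpark and only mimic the idea that each stage receives the output of the previous one.

```python
# Toy illustration only: these classes are hypothetical, not PySpark's API.
class AddOne:
    def transform(self, values):
        return [v + 1 for v in values]


class Double:
    def transform(self, values):
        return [v * 2 for v in values]


class ToyPipeline:
    def __init__(self, stages):
        # Ordered list of stages, analogous to the stages attribute described above.
        self.stages = stages

    def transform(self, values):
        # Each stage consumes the output of the stage before it.
        for stage in self.stages:
            values = stage.transform(values)
        return values


pipeline = ToyPipeline(stages=[AddOne(), Double()])
result = pipeline.transform([1, 2, 3])  # add one first, then double

# Reversing the list changes the result, because order matters.
reversed_result = ToyPipeline(stages=[Double(), AddOne()]).transform([1, 2, 3])

# Individual stages can be inspected or picked out by index.
stage_names = [type(stage).__name__ for stage in pipeline.stages]
```

Because the stages live in an ordinary list, a single stage can be isolated by indexing into that list.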
Now you're going to take a closer look at the pipeline, split out its stages and use the pipeline to make predictions on the testing data.
The following objects have already been created:
- cv: a trained CrossValidatorModel object and
- evaluator: a RegressionEvaluator object.
The flights data have been randomly split into flights_train and flights_test.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Retrieve the best model.
- Look at the stages in the best model.
- Isolate the linear regression stage and extract its parameters.
- Use the best model to generate predictions on the testing data and calculate the RMSE.
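For reference, the RMSE (root mean squared error) in the last step is the square root of the average squared difference between predicted and observed values. The sketch below computes it in plain Python on made-up numbers; the function name rmse and the sample values are assumptions for illustration, and this is not how Spark evaluates it internally.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up flight durations in minutes: errors are -3 and +4.
example = rmse([100.0, 150.0], [103.0, 146.0])
print("RMSE =", example)
```

A lower RMSE means the predictions are, on average, closer to the observed durations, and the value is in the same units as the target.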
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Get the best model from cross validation
best_model = cv.____
# Look at the stages in the best model
print(best_model.____)
# Get the parameters for the LinearRegression object in the best model
best_model.____.extractParamMap()
# Generate predictions on testing data using the best model then calculate RMSE
predictions = ____.____(____)
print("RMSE =", ____.____(____))
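One possible way to complete the steps above is sketched here as a helper function. The name dissect_best_model is hypothetical. The sketch assumes that cv is a fitted CrossValidatorModel wrapping a pipeline, that the regression is the last stage of that pipeline, and that evaluator is set to the RMSE metric (the default for RegressionEvaluator). Treat it as a hedged outline rather than the course's official solution.

```python
def dissect_best_model(cv, evaluator, test_data):
    """Return the stages, regression parameters and test RMSE of the best model.

    Assumes cv.bestModel is a fitted pipeline whose last stage is the
    regression model, and that evaluator reports RMSE.
    """
    # Best model found by cross validation
    best_model = cv.bestModel

    # The fitted stages, in execution order
    stages = best_model.stages

    # Assumption: the regression is the final stage, so index -1 selects it
    lr_params = stages[-1].extractParamMap()

    # Predictions on held-out data, then the evaluation metric
    predictions = best_model.transform(test_data)
    metric = evaluator.evaluate(predictions)

    return stages, lr_params, metric


# In the exercise environment this could be called as:
#   stages, lr_params, test_rmse = dissect_best_model(cv, evaluator, flights_test)
#   print(stages)
#   print("RMSE =", test_rmse)
```

Indexing with -1 rather than a fixed position keeps the sketch valid if preprocessing stages are added or removed ahead of the regression.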