Dissecting the best flight duration model
You just set up a CrossValidator to find good parameters for the linear regression model predicting flight duration.
The model pipeline has multiple stages (objects of type StringIndexer, OneHotEncoder, VectorAssembler and LinearRegression), which operate in sequence. The stages are available as a list in the stages attribute of the pipeline object, and they are executed in the order in which they appear in that list.
Now you're going to take a closer look at the pipeline, split out its stages and use the best model to make predictions on the testing data.
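To make the idea of ordered stages concrete, here is a toy sketch in plain Python. The classes AddOne, Double and ToyPipelineModel are hypothetical stand-ins invented for this illustration; they are not part of pyspark.ml and only mimic the idea that each stage transforms the output of the previous one.

```python
# Toy illustration only: hypothetical stand-ins, not the pyspark.ml classes.
class AddOne:
    def transform(self, values):
        return [v + 1 for v in values]


class Double:
    def transform(self, values):
        return [v * 2 for v in values]


class ToyPipelineModel:
    def __init__(self, stages):
        # Ordered list of stages, analogous to the stages attribute
        self.stages = stages

    def transform(self, values):
        # Each stage receives the output of the stage before it
        for stage in self.stages:
            values = stage.transform(values)
        return values


model = ToyPipelineModel(stages=[AddOne(), Double()])
result = model.transform([1, 2, 3])  # computes (v + 1) * 2

# Reversing the list changes the result, so order matters
reversed_model = ToyPipelineModel(stages=[Double(), AddOne()])
reversed_result = reversed_model.transform([1, 2, 3])  # computes v * 2 + 1
```

Because the stages live in an ordinary list, an individual stage can be picked out by position, for example model.stages[1] in this toy case.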
The following objects have already been created:
- cv: a trained CrossValidatorModel object
- evaluator: a RegressionEvaluator object
The flights data have been randomly split into flights_train and flights_test.
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Retrieve the best model.
- Look at the stages in the best model.
- Isolate the linear regression stage and extract its parameters.
- Use the best model to generate predictions on the testing data and calculate the RMSE.
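For reference, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values. The sketch below computes it in plain Python on made-up numbers; the function name rmse and the sample durations are assumptions for illustration, not values from the flights data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up flight durations in minutes, for illustration only
actual_minutes = [100.0, 150.0, 200.0]
predicted_minutes = [110.0, 140.0, 200.0]
error = rmse(actual_minutes, predicted_minutes)
```

Since the errors are squared before averaging, a few large misses raise RMSE more than many small ones. The result is in the same units as the target, here minutes.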
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Get the best model from cross validation
best_model = cv.____
# Look at the stages in the best model
print(best_model.____)
# Get the parameters for the LinearRegression object in the best model
best_model.____.extractParamMap()
# Generate predictions on testing data using the best model then calculate RMSE
predictions = ____.____(____)
print("RMSE =", ____.____(____))
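One possible way to carry out the four steps is sketched below as a small helper. It assumes cv is a trained CrossValidatorModel whose bestModel is a fitted pipeline, that the regression stage sits at index 3 (the fourth of the stages listed above), and that evaluator uses RMSE as its metric, which is the default for RegressionEvaluator. The helper name dissect_best_model and its regression_index parameter are hypothetical.

```python
def dissect_best_model(cv, evaluator, test_data, regression_index=3):
    """Sketch of the exercise steps; assumes PySpark-style objects are passed in."""
    # Best model found by cross validation (a fitted pipeline)
    best_model = cv.bestModel

    # Ordered list of fitted stages in that pipeline
    stages = best_model.stages

    # Parameters of the regression stage (index 3 is an assumption based on
    # the stage order StringIndexer, OneHotEncoder, VectorAssembler, LinearRegression)
    regression_params = stages[regression_index].extractParamMap()

    # Predictions on the testing data, then the evaluator's metric on them
    predictions = best_model.transform(test_data)
    score = evaluator.evaluate(predictions)

    return stages, regression_params, score


# Intended use in the exercise environment, where these objects already exist:
# stages, params, rmse_value = dissect_best_model(cv, evaluator, flights_test)
# print("RMSE =", rmse_value)
```

The helper relies only on attribute and method access, so it does not import pyspark itself. If the pipeline is built with a different stage order, pass a different regression_index.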