
Dissecting the best flight duration model

You just set up a CrossValidator to find good parameters for the linear regression model predicting flight duration.

The model pipeline has multiple stages (objects of type StringIndexer, OneHotEncoder, VectorAssembler and LinearRegression), which operate in sequence. The stages are available through the stages attribute of the pipeline object. That attribute is a list, and the stages are executed in the order in which they appear in it.
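
To make the "list of stages executed in order" idea concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not the PySpark API: the classes ToyStage and ToyPipelineModel, the column names org and km, and the coefficients are all hypothetical and chosen only for illustration.

```python
# Toy stand-in for a fitted pipeline. Plain Python only, NOT the PySpark API.
# All names, columns and coefficients below are hypothetical.

class ToyStage:
    """A named step that transforms each row (a dict) into a new row."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def transform(self, rows):
        return [self.func(row) for row in rows]


class ToyPipelineModel:
    """Holds its stages in an ordinary list and applies them in list order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:  # earlier stages feed later ones
            rows = stage.transform(rows)
        return rows


toy_model = ToyPipelineModel([
    # Map a string category to a numeric index.
    ToyStage("indexer", lambda row: {**row, "org_idx": {"SFO": 0, "ORD": 1}[row["org"]]}),
    # Gather the numeric columns into one feature list.
    ToyStage("assembler", lambda row: {**row, "features": [row["km"], row["org_idx"]]}),
    # Apply a made-up linear formula to the first feature.
    ToyStage("regression", lambda row: {**row, "prediction": 20.0 + 0.125 * row["features"][0]}),
])

# Because stages is a list, it can be inspected and indexed like any other list.
stage_names = [stage.name for stage in toy_model.stages]
last_stage = toy_model.stages[-1]

result = toy_model.transform([{"org": "SFO", "km": 1000.0}])
print(stage_names)
print(result[0]["prediction"])
```

Reordering the list would break this toy model, because the assembler step needs the column that the indexer step creates. The same dependency between stages is why order matters in the real pipeline.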

Now you're going to take a closer look at the pipeline, pick out its individual stages, and use the best model to make predictions on the testing data.

The following objects have already been created:

  • cv — a trained CrossValidatorModel object and
  • evaluator — a RegressionEvaluator object.

The flights data have been randomly split into flights_train and flights_test.

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Retrieve the best model.
  • Look at the stages in the best model.
  • Isolate the linear regression stage and extract its parameters.
  • Use the best model to generate predictions on the testing data and calculate the RMSE.
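
As a reminder of what the last step measures, RMSE is the square root of the mean squared difference between predicted and observed values. The sketch below computes it in plain Python rather than with the evaluator object; the function name rmse and the sample durations are hypothetical and not taken from the flights data.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up flight durations and predictions, for illustration only.
durations = [100.0, 150.0, 200.0]
predicted = [103.0, 146.0, 200.0]

toy_rmse = rmse(durations, predicted)
print("RMSE =", toy_rmse)
```

RMSE is expressed in the same units as the target, so for a duration model it can be read as a typical prediction error in units of time.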

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Get the best model from cross validation
best_model = cv.____

# Look at the stages in the best model
print(best_model.____)

# Get the parameters for the LinearRegression object in the best model
best_model.____.extractParamMap()

# Generate predictions on testing data using the best model then calculate RMSE
predictions = ____.____(____)
print("RMSE =", ____.____(____))