Get startedGet started for free

Cross validating simple flight duration model

You've already built a few models for predicting flight duration and evaluated them with a simple train/test split. However, cross-validation provides a much better way to evaluate model performance.

In this exercise you're going to train a simple model for flight duration using cross-validation. Travel time is usually strongly correlated with distance, so using the km column alone should give a decent model.

The data have been randomly split into flights_train and flights_test.

The following classes have already been imported: LinearRegression, RegressionEvaluator, ParamGridBuilder and CrossValidator.

This exercise is part of the course

Machine Learning with PySpark

View Course

Exercise instructions

  • Create an empty parameter grid.
  • Create objects for building and evaluating a linear regression model. The model should predict the "duration" field.
  • Create a cross-validator object. Provide values for the estimator, estimatorParamMaps and evaluator arguments. Choose 5-fold cross validation.
  • Train and test the model across multiple folds of the training data.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create an empty parameter grid
params = ____().____()

# Create objects for building and evaluating a regression model
regression = ____(____)
evaluator = ____(____)

# Create a cross validator
cv = ____(estimator=____, estimatorParamMaps=____, evaluator=____, ____)

# Train and test model on multiple folds of the training data
cv = cv.____(____)

# NOTE: Since cross-valdiation builds multiple models, the fit() method can take a little while to complete.
Edit and Run Code