Cross validating a simple flight duration model
You've already built a few models for predicting flight duration and evaluated them with a simple train/test split. However, cross-validation gives a more robust estimate of model performance, because it averages the results over several train/validation splits instead of relying on a single one.
In this exercise you're going to train a simple model for flight duration using cross-validation. Travel time is usually strongly correlated with distance, so using the km column alone should give a decent model.
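To make the idea concrete, here is a minimal pure-Python sketch of what k-fold cross-validation does for a one-feature linear model. It is not the PySpark implementation: the helper names (fit_line, rmse, cross_validate) are hypothetical, and the km and duration values are synthetic numbers invented for illustration, not the course's flight data.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root mean squared error of a fitted line on the given points."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared) / len(squared))


def cross_validate(xs, ys, k=5, seed=42):
    """Return one held-out RMSE per fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the other k - 1 folds, score on the held-out fold.
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse(model, [xs[i] for i in held_out],
                           [ys[i] for i in held_out]))
    return scores


# Synthetic data (assumption): duration in minutes grows linearly with km.
rng = random.Random(0)
km = [rng.uniform(200, 4000) for _ in range(200)]
duration = [0.075 * d + 45 + rng.gauss(0, 10) for d in km]

fold_rmse = cross_validate(km, duration, k=5)
avg_rmse = sum(fold_rmse) / len(fold_rmse)
print("RMSE per fold:", [round(r, 2) for r in fold_rmse])
print("Average RMSE:", round(avg_rmse, 2))
```

Each observation is held out exactly once, so the averaged RMSE depends less on one lucky or unlucky split than a single train/test evaluation would.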
The data have been randomly split into flights_train and flights_test.
The following classes have already been imported: LinearRegression, RegressionEvaluator, ParamGridBuilder and CrossValidator.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Create an empty parameter grid.
- Create objects for building and evaluating a linear regression model. The model should predict the "duration" field.
- Create a cross-validator object. Provide values for the estimator, estimatorParamMaps and evaluator arguments. Choose 5-fold cross validation.
- Train and test the model across multiple folds of the training data.
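It may help to see why an "empty" parameter grid still leads to models being trained. The sketch below is a plain-Python stand-in for a grid builder, not PySpark's ParamGridBuilder itself; build_grid is a hypothetical helper, and the regParam and elasticNetParam values are illustrative only. As far as I understand, an empty PySpark grid behaves analogously: it contains a single empty parameter map, so the estimator's default settings are used in every fold.

```python
from itertools import product


def build_grid(options):
    """Expand {name: [values]} into a list of parameter maps (Cartesian product)."""
    names = list(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# No parameters to vary: the product of zero lists has exactly one
# (empty) combination, so the grid holds a single empty map.
empty_grid = build_grid({})

# Illustrative values only (assumption): two settings for each of two parameters.
small_grid = build_grid({"regParam": [0.01, 0.1],
                         "elasticNetParam": [0.0, 1.0]})

num_folds = 5
print("Empty grid:", empty_grid)
print("Fits with empty grid:", num_folds * len(empty_grid))
print("Fits with small grid:", num_folds * len(small_grid))
```

With an empty grid and 5 folds, cross-validation fits one model per fold; every extra parameter combination multiplies that count.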
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create an empty parameter grid
params = ____().____()
# Create objects for building and evaluating a regression model
regression = ____(____)
evaluator = ____(____)
# Create a cross validator
cv = ____(estimator=____, estimatorParamMaps=____, evaluator=____, ____)
# Train and test model on multiple folds of the training data
cv = cv.____(____)
# NOTE: Since cross-validation builds multiple models, the fit() method can take a little while to complete.