Cross validating flight duration model pipeline
The cross-validated model that you just built was simple, using km alone to predict duration.
Another important predictor of flight duration is the origin airport. Flights generally take longer to get into the air from busy airports. Let's see if adding this predictor improves the model!
In this exercise you'll add the org field to the model. However, since org is categorical, there's more work to be done before it can be included: it must first be transformed to an index and then one-hot encoded before being assembled with km and used to build the regression model. We'll wrap these operations up in a pipeline.
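To make the index, one-hot and assemble steps concrete, here is a minimal pure-Python sketch of what they do to each row. It is not the Spark implementation. It assumes Spark's default behaviour (the most frequent category gets index 0, and the last category is dropped from the dummy vector), and the airport codes and distances are made-up sample values.

```python
from collections import Counter

def string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: idx for idx, category in enumerate(ordered)}

def one_hot(index, n_categories):
    """Encode an index as a dummy vector, dropping the last category."""
    vector = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vector[index] = 1.0
    return vector

# Made-up sample rows: (org, km)
rows = [("ORD", 1200.0), ("SFO", 2500.0), ("ORD", 400.0), ("JFK", 900.0)]

# Step 1: index the categorical org field.
mapping = string_index(org for org, _ in rows)

# Steps 2 and 3: one-hot encode the index, then assemble it with km.
features = [[km] + one_hot(mapping[org], len(mapping)) for org, km in rows]

for (org, km), vec in zip(rows, features):
    print(org, km, vec)
```

The last category is represented by an all-zero dummy vector, which avoids a redundant column in the regression.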
The following objects have already been created:
- params: an empty parameter grid
- evaluator: a regression evaluator
- regression: a LinearRegression object with labelCol='duration'.
The StringIndexer, OneHotEncoder, VectorAssembler and CrossValidator classes have already been imported.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Create a string indexer. Specify the input and output fields as org and org_idx.
- Create a one-hot encoder. Name the output field org_dummy.
- Assemble the km and org_dummy fields into a single field called features.
- Create a pipeline using the following operations: string indexer, one-hot encoder, assembler and linear regression. Use this to create a cross-validator.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create an indexer for the org field
indexer = ____(____, ____)
# Create a one-hot encoder for the indexed org field
onehot = ____(____, ____)
# Assemble the km and one-hot encoded fields
assembler = ____(____, ____)
# Create a pipeline and cross-validator.
pipeline = ____(stages=[____, ____, ____, ____])
cv = ____(estimator=____,
          estimatorParamMaps=____,
          evaluator=____)