Flight duration model: Pipeline stages
You're going to create the stages for the flight duration model pipeline. You will use these in the next exercise to build a pipeline and to create a regression model.
The StringIndexer, OneHotEncoder, VectorAssembler and LinearRegression classes are already imported.
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Create an indexer to convert the 'org' column into an indexed column called 'org_idx'.
- Create a one-hot encoder to convert the 'org_idx' and 'dow' columns into dummy variable columns called 'org_dummy' and 'dow_dummy'.
- Create an assembler which will combine the 'km' column with the two dummy variable columns. The output column should be called 'features'.
- Create a linear regression object to predict flight duration.
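To get a feel for what the first three stages do to the data, here is a minimal plain-Python sketch. It is not the PySpark API and does not complete the exercise. The toy rows and the helper names `fit_string_index` and `one_hot` are hypothetical. The sketch assumes the documented defaults of the Spark classes: StringIndexer assigns indices by descending frequency (ties broken alphabetically), and OneHotEncoder drops the last category.

```python
from collections import Counter

# Hypothetical toy rows; the values are made up for illustration only.
flights = [
    {"org": "ORD", "dow": 0, "km": 1000.0},
    {"org": "SFO", "dow": 2, "km": 2500.0},
    {"org": "ORD", "dow": 1, "km": 400.0},
]

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a list of 0/1 floats; with drop_last the final category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

# Stage 1 (indexer): 'org' strings become integer indices.
org_index = fit_string_index(row["org"] for row in flights)
n_org = len(org_index)
# 'dow' is already numeric, so it only needs a category count.
n_dow = max(row["dow"] for row in flights) + 1

# Stages 2 and 3 (one-hot encoder, assembler): build one flat feature list per row,
# in the order km, org dummies, dow dummies.
features = [
    [row["km"]] + one_hot(org_index[row["org"]], n_org) + one_hot(row["dow"], n_dow)
    for row in flights
]

for row, vec in zip(flights, features):
    print(row["org"], row["dow"], vec)
```

In the real pipeline the assembled column is typically a Spark vector rather than a Python list, and the regression stage then reads that single 'features' column.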
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Convert categorical strings to index values
indexer = ____(____)
# One-hot encode index values
onehot = ____(
    inputCols=____,
    outputCols=____
)
# Assemble predictors into a single column
assembler = ____(inputCols=____, outputCol=____)
# A linear regression object
regression = ____(labelCol=____)