Flight duration model: Pipeline stages

You're going to create the stages for the flight duration model pipeline. In the next exercise you will combine these stages into a pipeline and use it to fit a regression model.

The StringIndexer, OneHotEncoder, VectorAssembler and LinearRegression classes are already imported.

This exercise is part of the course Machine Learning with PySpark.

Exercise instructions

  • Create an indexer to convert the 'org' column into an indexed column called 'org_idx'.
  • Create a one-hot encoder to convert the 'org_idx' and 'dow' columns into dummy variable columns called 'org_dummy' and 'dow_dummy'.
  • Create an assembler which will combine the 'km' column with the two dummy variable columns. The output column should be called 'features'.
  • Create a linear regression object to predict flight duration.
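
To make the first three steps concrete, here is a minimal plain-Python sketch of what indexing, one-hot encoding, and assembling do to a single row. It does not use Spark and is not the exercise solution. The helper names (fit_string_index, one_hot, assemble) and the sample rows are hypothetical, and the sketch assumes the Spark defaults as I understand them: indices ordered by descending label frequency, and the last category dropped from the one-hot vector. It also assumes 'dow' is already an integer from 0 to 6, which would explain why only 'org' needs an indexer.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Return a dummy-variable vector; with drop_last the final category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature list."""
    features = []
    for part in parts:
        if isinstance(part, list):
            features.extend(part)
        else:
            features.append(float(part))
    return features


# Hypothetical sample rows, for illustration only.
flights = [
    {"org": "ORD", "dow": 0, "km": 2570},
    {"org": "SFO", "dow": 2, "km": 1188},
    {"org": "ORD", "dow": 6, "km": 3618},
]

# Step 1: 'org' -> 'org_idx'
org_index = fit_string_index(row["org"] for row in flights)

for row in flights:
    # Step 2: 'org_idx' and 'dow' -> dummy variable vectors
    org_dummy = one_hot(org_index[row["org"]], len(org_index))
    dow_dummy = one_hot(row["dow"], 7)  # assumes seven days of the week
    # Step 3: 'km' plus the two dummy vectors -> 'features'
    row["features"] = assemble(row["km"], org_dummy, dow_dummy)
```

In this sketch each row ends up with one 'features' list holding the distance followed by the two dummy vectors, which is the shape of input a regression stage consumes.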

You might find it useful to revisit the slides from the lessons in the Slides panel next to the IPython Shell.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Convert categorical strings to index values
indexer = ____(____)

# One-hot encode index values
onehot = ____(
    inputCols=____,
    outputCols=____
)

# Assemble predictors into a single column
assembler = ____(inputCols=____, outputCol=____)

# A linear regression object
regression = ____(labelCol=____)