
Flight duration model: Adding origin airport

Some airports are busier than others, and some are bigger than others too. Flights departing from large or busy airports are likely to spend more time taxiing or waiting for their takeoff slot. So it stands to reason that the duration of a flight might depend not only on the distance being covered but also on the airport from which the flight departs.

You are going to make the regression model a little more sophisticated by including the departure airport as a predictor.

These data have been split into training and testing sets and are available as flights_train and flights_test. The origin airport, stored in the org column, has been indexed into org_idx, which in turn has been one-hot encoded into org_dummy. The first few records are displayed in the terminal.
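
To make the indexing and one-hot encoding steps concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation (in PySpark these steps are typically handled by StringIndexer and OneHotEncoder), and the airport codes and helper names (index_labels, one_hot) are hypothetical. It assumes the most frequent label receives index 0 and that the last level is dropped from the dummy vector, so that level is represented by all zeros.

```python
from collections import Counter


def index_labels(values):
    """Map each distinct label to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (an assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, n_levels, drop_last=True):
    """Return a dummy vector for `index`; with drop_last the final level is all zeros."""
    size = n_levels - 1 if drop_last else n_levels
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Hypothetical origin airports, standing in for the org column
org = ["ORD", "SFO", "ORD", "JFK", "SFO", "ORD"]

org_idx_map = index_labels(org)  # plays the role of org_idx
org_dummy = [one_hot(org_idx_map[code], len(org_idx_map)) for code in org]

for code, dummy in zip(org, org_dummy):
    print(code, org_idx_map[code], dummy)
```

Dropping the last level keeps the dummy columns from being redundant: if a flight does not depart from any of the other airports, it must depart from the remaining one.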

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Fit a linear regression model to the training data.
  • Make predictions for the testing data.
  • Calculate the RMSE for predictions on the testing data.
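
The RMSE (root mean squared error) in the last step is the square root of the average squared difference between observed and predicted values. A minimal plain-Python sketch of that calculation follows; the durations and predictions are made-up numbers, and the rmse helper is hypothetical rather than part of PySpark.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up flight durations (minutes) and model predictions
durations = [100.0, 150.0, 200.0]
predicted_durations = [110.0, 140.0, 200.0]

error = rmse(durations, predicted_durations)
print(round(error, 3))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same units as the duration itself.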

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = ____(____).____(____)

# Create predictions for the testing data
predictions = ____.____(____)

# Calculate the RMSE on testing data
____(____).____(____)
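
One way the three steps might fit together is sketched below. This is a sketch under stated assumptions, not the exercise's reference solution: the label column name "duration" and the features column name "features" are assumptions (the exercise text names neither), and fit_and_score is a hypothetical helper. It relies only on the standard LinearRegression and RegressionEvaluator classes from pyspark.ml.

```python
def fit_and_score(train, test, label_col="duration", features_col="features"):
    """Fit a linear regression on `train`; return (predictions, RMSE on `test`).

    `label_col` and `features_col` are assumed column names, not taken
    from the exercise text.
    """
    # Imported inside the function so that defining it does not require Spark
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    # Create the estimator and fit it to the training data
    regression = LinearRegression(
        featuresCol=features_col, labelCol=label_col
    ).fit(train)

    # transform() appends a "prediction" column to the testing data
    predictions = regression.transform(test)

    # RMSE is the evaluator's default metric; it is set explicitly for clarity
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse"
    )
    return predictions, evaluator.evaluate(predictions)
```

In a session where flights_train and flights_test exist and contain the assumed columns, calling fit_and_score(flights_train, flights_test) would return the predictions DataFrame together with the test RMSE.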