Aan de slagGa gratis aan de slag

Flight duration model: Adding origin airport

Some airports are busier than others. Some airports are bigger than others too. Flights departing from large or busy airports are likely to spend more time taxiing or waiting for their takeoff slot. So it stands to reason that the duration of a flight might depend not only on the distance being covered but also the airport from which the flight departs.

You are going to make the regression model a little more sophisticated by including the departure airport as a predictor.

These data have been split into training and testing sets and are available as flights_train and flights_test. The origin airport, stored in the org column, has been indexed into org_idx, which in turn has been one-hot encoded into org_dummy. The first few records are displayed in the terminal.

Deze oefening maakt deel uit van de cursus

Machine Learning with PySpark

Cursus bekijken

Oefeninstructies

  • Fit a linear regression model to the training data.
  • Make predictions for the testing data.
  • Calculate the RMSE for predictions on the testing data.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = ____(____).____(____)

# Create predictions for the testing data
predictions = ____.____(____)

# Calculate the RMSE on testing data
____(____).____(____)
Code bewerken en uitvoeren