Flight duration model: More features!

Let's add more features to our model. This will not necessarily result in a better model: some features might improve it, while others might make it worse.

More features will always make the model more complicated and harder to interpret.

These are the features you'll include in the next model:

  • km
  • org (origin airport, one-hot encoded, 8 levels)
  • depart (departure time, binned in 3-hour intervals, one-hot encoded, 8 levels)
  • dow (departure day of week, one-hot encoded, 7 levels) and
  • mon (departure month, one-hot encoded, 12 levels).

These have been assembled into the features column, which is a sparse representation of 32 columns (remember that one-hot encoding produces one fewer column than the number of levels).
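
The count of 32 columns can be checked with a little arithmetic. The sketch below is only an illustration: the dictionary name `levels` and the list `numeric` are made up for this example, and the level counts are copied from the feature list above.

```python
# Number of levels for each one-hot encoded feature (from the list above).
levels = {"org": 8, "depart": 8, "dow": 7, "mon": 12}

# Features used as-is, contributing one column each.
numeric = ["km"]

# One-hot encoding contributes (levels - 1) columns per categorical feature.
n_columns = len(numeric) + sum(n - 1 for n in levels.values())

print(n_columns)  # 1 + 7 + 7 + 6 + 11 = 32
```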

The data are available as flights, randomly split into flights_train and flights_test.

This exercise is based on a small subset of the flights data.

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Fit a linear regression model to the training data.
  • Generate predictions for the testing data.
  • Calculate the RMSE on the testing data.
  • Look at the model coefficients. Are any of them zero?
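
As a reminder of what the RMSE measures, here is a minimal plain-Python sketch of the calculation. It is not how Spark computes the metric internally; the helper name `rmse_of` and the sample durations are invented purely for illustration.

```python
import math


def rmse_of(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up flight durations in minutes (not taken from the flights data).
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 200.0]

print(rmse_of(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.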

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.ml.regression import ____
from pyspark.ml.evaluation import ____

# Fit linear regression model to training data
regression = ____(____).____(____)

# Make predictions on testing data
predictions = regression.____(____)

# Calculate the RMSE on testing data
rmse = ____(____).____(____)
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.____
print(coeffs)
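
One possible way to fill in the blanks is sketched below. It rests on assumptions: the label column is assumed to be called `duration` (the exercise text does not name it), the wrapper function `fit_and_evaluate` is hypothetical, and the features are assumed to live in the default `features` column described above.

```python
def fit_and_evaluate(train, test, label_col="duration"):
    """Fit a linear regression and report the test RMSE and coefficients.

    `label_col` is an assumed column name; adjust it to match the data.
    """
    # Imported here so the sketch can be loaded without Spark installed.
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    # Fit linear regression model to training data
    # (uses the default featuresCol, "features").
    model = LinearRegression(labelCol=label_col).fit(train)

    # Make predictions on testing data.
    predictions = model.transform(test)

    # Calculate the RMSE on testing data.
    evaluator = RegressionEvaluator(labelCol=label_col, metricName="rmse")
    rmse = evaluator.evaluate(predictions)

    # Look at the model coefficients and count how many are exactly zero.
    coeffs = model.coefficients
    n_zero = sum(1 for c in coeffs.toArray() if c == 0)

    return rmse, coeffs, n_zero
```

In the exercise environment this could be called as `rmse, coeffs, n_zero = fit_and_evaluate(flights_train, flights_test)`, after which `n_zero` answers the question about zero coefficients.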