Flight duration model: Adding departure time
In the previous exercise the departure time was bucketed and converted to dummy variables. Now you're going to include those dummy variables in a regression model for flight duration.
The data are in flights. The km, org_dummy and depart_dummy columns have been assembled into a single features column, where km is index 0, org_dummy runs from index 1 to 7 and depart_dummy from index 8 to 14.
The data have been split into training and testing sets, and a linear regression model, regression, has been built on the training data. Predictions have been made on the testing data and are available as predictions.
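To make the index layout concrete, here is a minimal plain-Python sketch of a 15-slot feature vector. It rests on assumptions that are not stated in the exercise: that departure times fall into eight three-hour buckets, that each categorical column has eight levels, and that the last level of each is the dropped reference level. The helper names `make_features` and `bucket_for_hour` are hypothetical.

```python
# Layout from the text: 1 slot for km, 7 origin dummies, 7 departure dummies.
N_FEATURES = 15
ORG_OFFSET = 1      # org_dummy occupies indices 1..7
DEPART_OFFSET = 8   # depart_dummy occupies indices 8..14
N_DUMMIES = 7       # assumed: 8 levels per category, last level dropped


def bucket_for_hour(hour):
    """Map an hour of day (0..23) to an assumed three-hour bucket 0..7."""
    return int(hour) // 3


def make_features(km, org_level, depart_level):
    """Build a dense feature vector from category indices 0..7.

    Level 7 is assumed to be the reference level, so it sets no dummy.
    """
    features = [0.0] * N_FEATURES
    features[0] = float(km)
    if org_level < N_DUMMIES:
        features[ORG_OFFSET + org_level] = 1.0
    if depart_level < N_DUMMIES:
        features[DEPART_OFFSET + depart_level] = 1.0
    return features


# A flight from the reference airport departing at 04:00 and at 22:00.
night = make_features(km=0, org_level=7, depart_level=bucket_for_hour(4))
evening = make_features(km=0, org_level=7, depart_level=bucket_for_hour(22))
print(night.index(1.0))  # position of the only active dummy
print(sum(evening))      # reference levels set no dummy at all
```

Under these assumptions the 03:00 to 06:00 bucket lands on index 9, which matches the index used in the sample code, and a 21:00 to 24:00 departure from the reference airport activates no dummy.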
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Find the RMSE for predictions on the testing data.
- Find the average time spent on the ground for flights departing from OGG between 21:00 and 24:00.
- Find the average time spent on the ground for flights departing from OGG between 03:00 and 06:00.
- Find the average time spent on the ground for flights departing from JFK between 03:00 and 06:00.
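As a reminder of what the first instruction measures, the sketch below computes RMSE by hand in plain Python on made-up numbers. It is not the exercise solution: in PySpark the same metric is available through `RegressionEvaluator` in `pyspark.ml.evaluation`, whose default metric is RMSE. The function name `rmse` and the sample values are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up flight durations in minutes, for illustration only.
actual_minutes = [100.0, 150.0, 200.0]
predicted_minutes = [110.0, 140.0, 200.0]
print(rmse(actual_minutes, predicted_minutes))
```

Because the errors are squared before averaging, RMSE is in the same units as the target (minutes here) and penalises large misses more heavily than small ones.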
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Find the RMSE on testing data
from pyspark.ml.____ import ____
rmse = ____(____).____(____)
print("The test RMSE is", rmse)
# Average minutes on ground at OGG for flights departing between 21:00 and 24:00
avg_eve_ogg = regression.____
print(avg_eve_ogg)
# Average minutes on ground at OGG for flights departing between 03:00 and 06:00
avg_night_ogg = regression.____ + regression.____[9]
print(avg_night_ogg)
# Average minutes on ground at JFK for flights departing between 03:00 and 06:00
avg_night_jfk = regression.____ + regression.____[____] + regression.____[____]
print(avg_night_jfk)
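The three quantities in the sample code are all predictions for a flight of 0 km, which is why they can be read as time spent on the ground: the distance term vanishes and only the intercept and the active dummy coefficients remain. The sketch below illustrates that arithmetic in plain Python. Every number in it is invented, the names `intercept`, `coefficients` and `predict` are stand-ins for the fitted model, and `OTHER_AIRPORT_INDEX = 2` is an arbitrary placeholder rather than the index of any particular airport.

```python
# Hypothetical fitted values; the real ones come from the trained model.
intercept = 10.0
coefficients = [0.0] * 15
coefficients[0] = 0.075        # made-up minutes per km

NIGHT_INDEX = 9                # 03:00-06:00 departure dummy, as in the sample code
OTHER_AIRPORT_INDEX = 2        # arbitrary placeholder origin dummy

coefficients[NIGHT_INDEX] = -4.0
coefficients[OTHER_AIRPORT_INDEX] = 5.0


def predict(features):
    """Linear model: intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


def zero_km_features(active_indices):
    """Feature vector with km = 0 and the given dummy indices switched on."""
    features = [0.0] * 15
    for index in active_indices:
        features[index] = 1.0
    return features


# Reference airport, reference departure bucket: no dummy is active.
print(predict(zero_km_features([])))
# Reference airport, night departure: one dummy is active.
print(predict(zero_km_features([NIGHT_INDEX])))
# Non-reference airport, night departure: two dummies are active.
print(predict(zero_km_features([OTHER_AIRPORT_INDEX, NIGHT_INDEX])))
```

Each additional active dummy adds exactly one coefficient to the intercept, which mirrors the shape of the three expressions in the sample code.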