Bucketing departure time
Time-of-day data are a challenge for regression models, since the effect of the hour on the response is rarely linear. They are also a great candidate for bucketing.
In this lesson you will convert the flight departure times from numeric values between 0 (corresponding to 00:00) and 24 (corresponding to 24:00) to binned values. You'll then take those binned values and one-hot encode them.
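To make the two steps concrete before turning to Spark, here is a minimal plain-Python sketch of what bucketing and one-hot encoding do to a single departure time. The helper names `bucket_index` and `one_hot` are hypothetical and used only for illustration. The sketch assumes each bucket includes its lower boundary and excludes its upper one, with the final bucket also including 24.

```python
from bisect import bisect_right

# Bin boundaries at 3 hour intervals: [0, 3, 6, ..., 24] gives 8 buckets.
SPLITS = list(range(0, 25, 3))


def bucket_index(hour, splits=SPLITS):
    """Return the index of the bucket that contains `hour` (hypothetical helper)."""
    if not splits[0] <= hour <= splits[-1]:
        raise ValueError(f"hour {hour} lies outside [{splits[0]}, {splits[-1]}]")
    # bisect_right counts the boundaries that are <= hour; subtracting 1 gives
    # the bucket index. The min() keeps hour == 24 inside the last bucket.
    return min(bisect_right(splits, hour) - 1, len(splits) - 2)


def one_hot(index, n_buckets=len(SPLITS) - 1):
    """Return a list with a 1 at position `index` and 0 elsewhere (hypothetical helper)."""
    return [1 if i == index else 0 for i in range(n_buckets)]


depart = 9.48  # roughly 09:29
idx = bucket_index(depart)
print(idx, one_hot(idx))
```

A departure at 9.48 falls in the 09:00 to 12:00 bucket, which is index 3 when counting from zero. This sketch keeps all eight positions in the encoded list for readability.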
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Create a bucketizer object with bin boundaries at 0, 3, 6, …, 24, which correspond to times 00:00, 03:00, 06:00, …, 24:00. Specify the input column as `depart` and the output column as `depart_bucket`.
- Bucket the departure times in the `flights` data. Show the first five values for `depart` and `depart_bucket`.
- Create a one-hot encoder object. Specify the output column as `depart_dummy`.
- Train the encoder on the data and then use it to convert the bucketed departure times to dummy variables. Show the first five values for `depart`, `depart_bucket` and `depart_dummy`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.feature import Bucketizer, OneHotEncoder
# Create buckets at 3 hour intervals through the day
buckets = ____(splits=____, ____, ____)
# Bucket the departure times
bucketed = buckets.____(____)
bucketed.____(____).____(____)
# Create a one-hot encoder
onehot = ____(inputCols=[____], ____)
# One-hot encode the bucketed departure times
flights_onehot = ____.____(____).____(____)
flights_onehot.____(____).____(____)