Bucketing departure time

Time-of-day data are a challenge for regression models, because the effect of the departure hour on the target is rarely linear. That makes them a great candidate for bucketing.

In this exercise you will convert the flight departure times from numeric values between 0 (corresponding to 00:00) and 24 (corresponding to 24:00) to binned values. You'll then one-hot encode those binned values.
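To make the binning concrete, here is a minimal plain-Python sketch (not PySpark) of how a departure time maps to a 3-hour bucket index. It follows the convention documented for Spark's Bucketizer, where each bucket is the half-open interval [lower, upper) and the last bucket also includes its upper boundary. The names SPLITS and bucket_index are made up for this illustration, and rejecting out-of-range values is a choice made for the sketch.

```python
from bisect import bisect_right

# Bin boundaries every 3 hours: 0, 3, 6, ..., 24 (nine splits, eight buckets)
SPLITS = list(range(0, 25, 3))

def bucket_index(depart, splits=SPLITS):
    """Return the index of the bucket [splits[i], splits[i+1]) that holds depart.

    The final bucket also includes its upper boundary, so 24.0 lands in bucket 7.
    """
    if not splits[0] <= depart <= splits[-1]:
        raise ValueError(f"{depart} is outside [{splits[0]}, {splits[-1]}]")
    if depart == splits[-1]:
        return len(splits) - 2
    # bisect_right counts the boundaries that are <= depart
    return bisect_right(splits, depart) - 1

for depart in (0.0, 2.99, 3.0, 9.48, 23.5, 24.0):
    print(depart, "->", bucket_index(depart))
```

A departure at 9.48 (roughly 09:29) falls between the boundaries 9 and 12, so it lands in bucket 3.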

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Create a bucketizer object with bin boundaries at 0, 3, 6, …, 24, which correspond to the times 00:00, 03:00, 06:00, …, 24:00. Specify the input column as depart and the output column as depart_bucket.
  • Bucket the departure times in the flights data. Show the first five values for depart and depart_bucket.
  • Create a one-hot encoder object. Specify the input column as depart_bucket and the output column as depart_dummy.
  • Train the encoder on the data and then use it to convert the bucketed departure times to dummy variables. Show the first five values for depart, depart_bucket and depart_dummy.
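The dummy variables produced in the last step can be pictured with a small plain-Python sketch (not PySpark). It assumes all eight buckets occur in the data and mirrors the default dropLast=True behaviour of Spark's OneHotEncoder, under which the last category is represented by an all-zero vector, so eight buckets need only seven indicator columns. The function name one_hot and its parameters are hypothetical, and Spark itself stores the result as a sparse vector rather than a list.

```python
def one_hot(bucket, n_buckets=8, drop_last=True):
    """Encode a bucket index as a list of 0/1 indicator values."""
    bucket = int(bucket)  # bucket indices may arrive as floats such as 3.0
    if not 0 <= bucket < n_buckets:
        raise ValueError(f"bucket {bucket} is outside 0..{n_buckets - 1}")
    # With drop_last, the final category gets no column of its own
    size = n_buckets - 1 if drop_last else n_buckets
    vector = [0] * size
    if bucket < size:
        vector[bucket] = 1
    return vector

print(one_hot(3.0))  # indicator set at position 3
print(one_hot(7.0))  # last bucket: all zeros when drop_last is True
```

Dropping the last column avoids a redundant indicator: if the first seven are all zero, the departure must fall in the eighth bucket.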

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.ml.feature import Bucketizer, OneHotEncoder

# Create buckets at 3 hour intervals through the day
buckets = ____(splits=____, ____, ____)

# Bucket the departure times
bucketed = buckets.____(____)
bucketed.____(____).____(____)

# Create a one-hot encoder
onehot = ____(inputCols=[____], ____)

# One-hot encode the bucketed departure times
flights_onehot = ____.____(____).____(____)
flights_onehot.____(____).____(____)