Exercise

Bucketing departure time

Time-of-day data are a challenge for regression models, because the effect of the hour on the target is often far from linear. That makes them a great candidate for bucketing.

In this exercise you will convert the flight departure times from numeric values between 0 (corresponding to 00:00) and 24 (corresponding to 24:00) to binned values. You'll then one-hot encode those binned values.
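To make the idea concrete, here is a minimal plain-Python sketch of what binning and one-hot encoding do to a single departure time. It is an illustration only, not how Spark implements these steps; the helper names bucket_index and one_hot are hypothetical, and the 3-hour boundaries are the ones used in the instructions below.

```python
from bisect import bisect_right

# Bin boundaries in hours: 00:00, 03:00, ..., 24:00 (8 bins of 3 hours each).
splits = [0, 3, 6, 9, 12, 15, 18, 21, 24]


def bucket_index(hour):
    """Return the zero-based bin for a departure time given in hours.

    Hypothetical helper. Each bin is half-open, [lower, upper), except the
    last one, which also includes its upper boundary (24).
    """
    if not splits[0] <= hour <= splits[-1]:
        raise ValueError(f"hour {hour} is outside [{splits[0]}, {splits[-1]}]")
    # bisect_right counts the boundaries <= hour; subtract 1 for a zero-based
    # index, and clamp so that exactly 24 lands in the last bin.
    return min(bisect_right(splits, hour) - 1, len(splits) - 2)


def one_hot(index, n_buckets=len(splits) - 1):
    """Return a full-length dummy vector with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(n_buckets)]


bin_for_0929 = bucket_index(9.48)      # 9.48 h is roughly 09:29, in [9, 12)
dummy_for_0929 = one_hot(bin_for_0929)
print(bin_for_0929, dummy_for_0929)
```

Note that this sketch keeps one dummy column per bin. Spark's one-hot encoder drops the last category by default, so its output vector has one element fewer.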

Instructions

  • Create a bucketizer object with bin boundaries at 0, 3, 6, …, 24, which correspond to times 00:00, 03:00, 06:00, …, 24:00. Specify the input column as depart and the output column as depart_bucket.
  • Bucket the departure times in the flights data. Show the first five values for depart and depart_bucket.
  • Create a one-hot encoder object, specifying depart_bucket as the input column and depart_dummy as the output column.
  • Fit the encoder to the bucketed data and then use it to transform this data to dummy variables. Show the first five values for depart, depart_bucket and depart_dummy.