Column manipulation

The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.

The next step of preparing the flight data has two parts:

  1. convert the units of distance, replacing the mile column with a km column; and
  2. create a Boolean column indicating whether or not a flight was delayed.

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Import a function which will allow you to round a number to a specific number of decimal places.
  • Derive a new km column from the mile column, rounding to zero decimal places. One mile is 1.60934 km.
  • Remove the mile column.
  • Create a label column with a value of 1 indicating the delay was 15 minutes or more and 0 otherwise. Think carefully about the logical condition.
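The arithmetic and the boundary condition behind these instructions can be checked in plain Python before writing any Spark code. This is only a sketch: the helper names `to_km` and `delay_label`, the `delay_minutes` argument, and the sample rows are hypothetical and not part of the exercise. Rounding is done half-up here because that is the mode Spark's `round` function uses, whereas Python's built-in `round` rounds ties to even.

```python
from decimal import Decimal, ROUND_HALF_UP

KM_PER_MILE = 1.60934      # conversion factor given in the exercise
DELAY_THRESHOLD_MIN = 15   # FAA threshold: 15 minutes or more counts as delayed


def to_km(miles):
    """Convert miles to km, rounded half-up to zero decimal places."""
    km = Decimal(str(miles * KM_PER_MILE))
    return float(km.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def delay_label(delay_minutes):
    """Return 1 if the delay is 15 minutes or more, otherwise 0."""
    # '>=' rather than '>': a delay of exactly 15 minutes counts as delayed.
    return int(delay_minutes >= DELAY_THRESHOLD_MIN)


# Hypothetical sample rows: (distance in miles, delay in minutes)
sample = [(100, 14), (100, 15), (2565, 27), (316, -8)]
rows = [(to_km(mile), delay_label(delay)) for mile, delay in sample]

for (mile, delay), (km, label) in zip(sample, rows):
    print(f"mile={mile:>5}  km={km:>7}  delay={delay:>3}  label={label}")
```

The first two rows differ only in the delay (14 versus 15 minutes), which shows why the comparison must include equality: a strict "greater than" would label the 15-minute flight as on time.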

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the required function
from pyspark.sql.functions import ____

# Convert 'mile' to 'km' and drop 'mile' column (1 mile is equivalent to 1.60934 km)
flights_km = flights.____('km', ____(____ * ____, 0)) \
                    .____('mile')

# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.____('label', (____).cast('integer'))

# Check first five records
flights_km.show(5)