Column manipulation
The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.
The next step of preparing the flight data has two parts:
- convert the units of distance, replacing the mile column with a km column; and
- create a Boolean column indicating whether or not a flight was delayed.
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Import a function which will allow you to round a number to a specific number of decimal places.
- Derive a new km column from the mile column, rounding to zero decimal places. One mile is 1.60934 km.
- Remove the mile column.
- Create a label column with a value of 1 indicating the delay was 15 minutes or more and 0 otherwise. Think carefully about the logical condition.
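Before filling in the Spark code, it can help to check the arithmetic and the boundary condition on a few values by hand. The following is a minimal plain-Python sketch of the row-level logic only, not the PySpark solution. The helper names to_km and delay_label and the sample records are hypothetical, made up for illustration. It assumes the delay is measured in minutes.

```python
MILE_TO_KM = 1.60934       # conversion factor given in the instructions
DELAY_THRESHOLD_MIN = 15   # "15 minutes or more" counts as delayed


def to_km(miles):
    """Convert a distance in miles to km, rounded to zero decimal places."""
    return round(miles * MILE_TO_KM, 0)


def delay_label(delay_minutes):
    """Return 1 if the delay is 15 minutes or more, otherwise 0."""
    # The comparison is >= rather than >, so a delay of exactly 15 counts.
    return int(delay_minutes >= DELAY_THRESHOLD_MIN)


# Hypothetical records: (distance in miles, arrival delay in minutes)
sample = [(100, 14), (250, 15), (1000, 47), (500, -5)]

converted = [(to_km(mile), delay_label(delay)) for mile, delay in sample]
print(converted)
```

The second record is the case to think about: a delay of exactly 15 minutes should be labelled 1. Note that Python's built-in round rounds exact ties to the nearest even number, whereas the round function in pyspark.sql.functions rounds halves up, so the two can differ on values that fall exactly on a .5 boundary.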
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the required function
from pyspark.sql.functions import ____
# Convert 'mile' to 'km' and drop 'mile' column (1 mile is equivalent to 1.60934 km)
flights_km = flights.____('km', ____(____ * ____, 0)) \
.____('mile')
# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.____('label', (____).cast('integer'))
# Check first five records
flights_km.show(5)