Making a Boolean
Suppose you're modeling a yes-or-no question: is the flight late? Your data, however, contains only the arrival delay in minutes for each flight, so you'll need to create a Boolean column that indicates whether each flight was late.
This exercise is part of the course
Foundations of PySpark
Exercise instructions
- Use the `.withColumn()` method to create the column `is_late`. This column is equal to `model_data.arr_delay > 0`.
- Convert this column to an integer column so that you can use it in your model and name it `label` (this is the default name for the response variable in Spark's machine learning routines).
- Filter out missing values (this has been done for you).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create is_late
model_data = model_data.withColumn("is_late", ____)
# Convert to an integer
model_data = model_data.withColumn("label", ____)
# Remove missing values
model_data = model_data.filter("arr_delay is not NULL and dep_delay is not NULL and air_time is not NULL and plane_year is not NULL")