Join the DataFrames
In the next two chapters you'll be working to build a model that predicts whether or not a flight will be delayed based on the flights data we've been working with. This model will also include information about the plane that flew the route, so the first step is to join the two tables: flights and planes!
This exercise is part of the course Foundations of PySpark.
Exercise instructions
- First, rename the year column of planes to plane_year to avoid duplicate column names.
- Create a new DataFrame called model_data by joining the flights table with planes using the tailnum column as the key.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Rename year column
planes = planes.withColumnRenamed(____)
# Join the DataFrames
model_data = flights.join(____, on=____, how="leftouter")
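
To see why the rename comes first and what a left outer join keeps, here is a minimal plain-Python sketch of the same two steps. It is not PySpark and does not fill in the blanks above: the rows, the tail numbers, the dep_delay field, and the helper functions rename_key and left_outer_join are all made up for illustration.

```python
# Plain-Python illustration (NOT the PySpark API); all rows below are hypothetical.
flights = [
    {"year": 2014, "tailnum": "N100AA", "dep_delay": 5},
    {"year": 2014, "tailnum": "N200BB", "dep_delay": -2},
    {"year": 2014, "tailnum": "N999ZZ", "dep_delay": 30},  # no matching plane
]
planes = [
    {"tailnum": "N100AA", "year": 1998},  # here "year" is the plane's build year
    {"tailnum": "N200BB", "year": 2006},
]


def rename_key(rows, old, new):
    """Return copies of the rows with the key `old` renamed to `new`."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def left_outer_join(left, right, on):
    """Keep every left row; attach right-side columns, or None when nothing matches."""
    right_cols = sorted({k for row in right for k in row if k != on})
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)

    joined = []
    for lrow in left:
        matches = index.get(lrow[on])
        if not matches:
            # Unmatched left rows survive, with right-side columns set to None.
            joined.append({**lrow, **{c: None for c in right_cols}})
        else:
            for rrow in matches:
                joined.append({**lrow, **{c: rrow.get(c) for c in right_cols}})
    return joined


# Without the rename, both tables would contribute a "year" column.
planes_renamed = rename_key(planes, "year", "plane_year")
model_data = left_outer_join(flights, planes_renamed, on="tailnum")

for row in model_data:
    print(row)
```

Each joined row carries both year (from flights) and plane_year (from planes), and the flight whose tail number has no entry in planes is kept with plane_year set to None, which is the behavior a left outer join is meant to give.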