Joining II
In PySpark, joins are performed using the DataFrame method .join(). This method takes three arguments. The first is the second DataFrame that you want to join with the first one. The second argument, on, is the name of the key column(s) as a string. The names of the key column(s) must be the same in each table. The third argument, how, specifies the kind of join to perform. In this course we'll always use the value how="leftouter".
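To make the meaning of how="leftouter" concrete, here is a minimal plain-Python sketch of left outer join semantics. It is not how Spark implements joins; the helper left_outer_join and the toy rows are hypothetical and exist only for illustration. Every row of the left table is kept, and columns from the right table are filled with None when no key matches.

```python
def left_outer_join(left, right, on):
    """Illustrative left outer join over lists of dicts (hypothetical helper)."""
    # Index the right-hand rows by their key value for quick lookup.
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)

    # Non-key columns of the right table, used to pad unmatched rows with None.
    right_cols = [c for c in (right[0] if right else {}) if c != on]

    result = []
    for lrow in left:
        matches = index.get(lrow[on])
        if matches:
            # One output row per matching right-hand row.
            for rrow in matches:
                result.append({**lrow, **{c: rrow[c] for c in right_cols}})
        else:
            # No match: keep the left row and fill right-hand columns with None.
            result.append({**lrow, **{c: None for c in right_cols}})
    return result


# Toy rows, assumed for illustration only (not the real datasets).
flights_rows = [
    {"dest": "SEA", "air_time": 132},
    {"dest": "ZZZ", "air_time": 75},
]
airports_rows = [{"dest": "SEA", "name": "Seattle Tacoma Intl"}]

joined = left_outer_join(flights_rows, airports_rows, on="dest")
for row in joined:
    print(row)
```

The flight to the made-up code "ZZZ" has no matching airport, so it stays in the result with name set to None. An inner join would drop that row instead.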
The flights dataset and a new dataset called airports are already in your workspace.
This exercise is part of the course Foundations of PySpark.
Exercise instructions
- Examine the airports DataFrame by calling .show(). Note which key column will let you join airports to the flights table.
- Rename the faa column in airports to dest by re-assigning the result of airports.withColumnRenamed("faa", "dest") to airports.
- Join the flights with the airports DataFrame on the dest column by calling the .join() method on flights. Save the result as flights_with_airports.
  - The first argument should be the other DataFrame, airports.
  - The argument on should be the key column.
  - The argument how should be "leftouter".
- Call .show() on flights_with_airports to examine the data again. Note the new information that has been added.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Examine the data
print(____)
# Rename the faa column
airports = ____
# Join the DataFrames
flights_with_airports = ____
# Examine the new DataFrame
print(____)