Joining II
In PySpark, joins are performed using the DataFrame method .join(). This method takes three arguments. The first is the other DataFrame, the one you want to join to the DataFrame you call the method on. The second argument, on, is the name of the key column as a string (or a list of names when there are several key columns). The names of the key column(s) must be the same in each table. The third argument, how, specifies the kind of join to perform. In this course we'll always use the value how="leftouter".
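To make the meaning of how="leftouter" concrete, here is a minimal plain-Python sketch of left outer join semantics. It is not how Spark implements joins, and the helper name left_outer_join and the sample rows are hypothetical, chosen only for illustration. Every row of the left table is kept, and columns from the right table are filled with None when no matching key exists.

```python
def left_outer_join(left_rows, right_rows, key):
    """Illustrative left outer join over lists of dicts (not Spark's implementation)."""
    # Index the right-hand rows by their key value for quick lookup
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    # Columns that only the right-hand table contributes
    right_cols = {col for row in right_rows for col in row if col != key}

    joined = []
    for row in left_rows:
        matches = index.get(row[key])
        if matches:
            # One output row per matching right-hand row
            for match in matches:
                joined.append({**row, **match})
        else:
            # No match: keep the left row and fill right-hand columns with None
            joined.append({**row, **{col: None for col in right_cols}})
    return joined


# Hypothetical sample rows standing in for flights and airports
flights_rows = [{"dest": "LAX", "air_time": 132}, {"dest": "XXX", "air_time": 95}]
airports_rows = [{"dest": "LAX", "name": "Los Angeles Intl"}]

joined = left_outer_join(flights_rows, airports_rows, "dest")
for row in joined:
    print(row)
```

The unmatched flight to "XXX" still appears in the result, with name set to None, which is the behavior that distinguishes a left outer join from an inner join.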
The flights dataset and a new dataset called airports are already in your workspace.
This exercise is part of the course
Foundations of PySpark
Exercise instructions
- Examine the airports DataFrame by calling .show(). Note which key column will let you join airports to the flights table.
- Rename the faa column in airports to dest by re-assigning the result of airports.withColumnRenamed("faa", "dest") to airports.
- Join the flights with the airports DataFrame on the dest column by calling the .join() method on flights. Save the result as flights_with_airports.
  - The first argument should be the other DataFrame, airports.
  - The argument on should be the key column.
  - The argument how should be "leftouter".
- Call .show() on flights_with_airports to examine the data again. Note the new information that has been added.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Examine the data
print(____)
# Rename the faa column
airports = ____
# Join the DataFrames
flights_with_airports = ____
# Examine the new DataFrame
print(____)