Pandafy a Spark DataFrame
Suppose you've run a query on your huge dataset and aggregated it down to something a little more manageable.
Sometimes it makes sense to then take that table and work with it locally using a tool like pandas. Spark DataFrames make that easy with the .toPandas() method. Calling this method on a Spark DataFrame returns the corresponding pandas DataFrame. It's as simple as that!
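The pattern described above can be sketched as a small helper. This is only an illustration, not part of the exercise: the function name query_to_pandas is hypothetical, and it assumes the session object you pass in behaves like a SparkSession (it has a .sql() method whose result has a .toPandas() method).

```python
def query_to_pandas(spark, query):
    """Run a SQL query on a SparkSession and return the result as a pandas DataFrame.

    `spark` is assumed to be an existing SparkSession; `query_to_pandas` is a
    hypothetical helper name used only for this sketch.
    """
    # .sql() returns a Spark DataFrame; the data still lives in Spark.
    spark_df = spark.sql(query)
    # .toPandas() collects every row to the local (driver) process,
    # so the result should already be small enough to fit in local memory.
    return spark_df.toPandas()
```

Because .toPandas() brings all rows into local memory, it is best called after the aggregation has reduced the data, as in the scenario above.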
This time the query counts the number of flights to each airport from SEA and PDX.
Remember, there's already a SparkSession called spark in your workspace!
This exercise is part of the course Foundations of PySpark.
Exercise instructions
- Run the query using the .sql() method. Save the result in flight_counts.
- Use the .toPandas() method on flight_counts to create a pandas DataFrame called pd_counts.
- Print the .head() of pd_counts to the console.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Don't change this query
query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"
# Run the query
flight_counts = ____
# Convert the results to a pandas DataFrame
pd_counts = ____
# Print the head of pd_counts
print(____)