Aggregating
All of the common aggregation methods, like .min(), .max(), and .count(), are GroupedData methods. GroupedData objects are created by calling the .groupBy() DataFrame method. You'll learn exactly what that means in a few exercises. For now, all you have to do to use these functions is call that method on your DataFrame. For example, to find the minimum value of a column, col, in a DataFrame, df, you could do:
df.groupBy().min("col").show()
This creates a GroupedData object (so you can use the .min() method), then finds the minimum value in col, and returns it as a DataFrame.
Now you're ready to do some aggregating of your own!
A SparkSession called spark is already in your workspace, along with the Spark DataFrame flights.
This exercise is part of the course Foundations of PySpark
Exercise instructions
- Find the length of the shortest (in terms of distance) flight that left PDX by first filtering with .filter() and then using the .min() method. Perform the filtering by referencing the column directly, not passing a SQL string.
- Find the length of the longest (in terms of time) flight that left SEA by filtering with .filter() and then using the .max() method. Perform the filtering by referencing the column directly, not passing a SQL string.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Find the shortest flight from PDX in terms of distance
flights.filter(____.____ == ____).groupBy().____(____).show()
# Find the longest flight from SEA in terms of air time
flights.filter(____).groupBy().____.show()