Grouping and Aggregating II
In addition to the GroupedData methods you've already seen, there is also the .agg() method. This method lets you pass an aggregate column expression that uses any of the aggregate functions from the pyspark.sql.functions submodule.

This submodule contains many useful functions for computing things like standard deviations. All the aggregation functions in this submodule take the name of a column in a GroupedData table.

Remember, a SparkSession called spark is already in your workspace, along with the Spark DataFrame flights. The grouped DataFrames you created in the last exercise are also in your workspace.
This exercise is part of the course Foundations of PySpark.

Exercise instructions
- Import the submodule pyspark.sql.functions as F.
- Create a GroupedData table called by_month_dest that's grouped by both the month and dest columns. Refer to the two columns by passing both strings as separate arguments.
- Use the .avg() method on the by_month_dest DataFrame to get the average dep_delay in each month for each destination.
- Find the standard deviation of dep_delay by using the .agg() method with the function F.stddev().
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import pyspark.sql.functions as F
import ____ as F
# Group by month and dest
by_month_dest = flights.groupBy(____)
# Average departure delay by month and destination
by_month_dest.____.show()
# Standard deviation of departure delay
by_month_dest.agg(F.____(____)).show()