
Exercise

Grouping and aggregating data

Often you will want to compute a metric over a set of values that share a common characteristic, like the average price of a house in a certain region. To achieve this, you would need to group the data by region and compute an aggregate metric on that subgroup of data.

We’ve already seen a couple of these aggregation metrics in the video, applied to _landing/prices.csv_. We’ll inspect a few more now and apply them to _~/workspace/mnt/data_lake/landing/purchased.csv_. In particular, you’ll use the pyspark.sql.functions aggregation functions avg() to compute the average value of a column in a group, stddev_samp() to compute the sample standard deviation, and max() (which we alias as sfmax so as not to shadow Python’s built-in max()) to retrieve the largest value of a column in a group.
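Before the exercise itself, here is a minimal sketch of the setup it assumes: importing the three aggregation functions (with max aliased as sfmax) and loading the purchased.csv file into a DataFrame. The SparkSession creation, the read options, and the variable name purchased_df are assumptions for illustration, not part of the exercise scaffold.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, stddev_samp, max as sfmax

# Assumed setup: in the exercise environment a SparkSession may already exist.
spark = SparkSession.builder.getOrCreate()

# Path as given in the exercise; Spark does not expand "~", so in practice
# you may need to pass the absolute path instead.
purchased_df = spark.read.csv(
    "~/workspace/mnt/data_lake/landing/purchased.csv",
    header=True,        # assumed: first row holds column names
    inferSchema=True,   # assumed: so "Salary" is parsed as a numeric type
)
```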

Instructions

  • Use the .groupBy() method to group the data by the “Country” column.
  • In these groups, compute the average of the “Salary” column and name the resulting column “average_salary”.
  • Compute the standard deviation of the “Salary” column in each group in the same aggregation.
  • Retrieve the largest “Salary” in each group, in the same aggregation, and name the resulting column “highest_salary” (a sketch putting these steps together follows below).
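The following sketch shows how these steps might fit together in a single .groupBy()/.agg() call, assuming the imports and the purchased_df DataFrame from the setup sketch above. The name given to the standard-deviation column is an assumption, since the instructions do not specify one.

```python
aggregated_df = (
    purchased_df
    .groupBy("Country")  # one group per distinct value of "Country"
    .agg(
        avg("Salary").alias("average_salary"),
        stddev_samp("Salary").alias("stddev_salary"),  # column name assumed
        sfmax("Salary").alias("highest_salary"),
    )
)

aggregated_df.show()
```

Computing all three metrics in one .agg() call, rather than three separate aggregations, lets Spark produce them in a single pass over each group.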