1. Data Aggregation
In this video, we’ll talk about aggregating data in Fabric using SQL, Spark and Dataflows.
2. When should you aggregate data?
Aggregation reduces the number of rows in a dataset by grouping data based on some criteria and using an aggregate function to calculate summary values.
The most common aggregation functions are count, sum, average, maximum and minimum.
3. When should you aggregate data?
In the example, the table is aggregated by grouping on state. Two aggregations are calculated: the number of records per state and the sum of the order amounts.
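The grouping logic just described can be sketched in plain Python. The state names and order amounts below are made-up illustration data, not the figures from the slide:

```python
from collections import defaultdict

# Hypothetical order rows: (state, order_amount)
orders = [
    ("CA", 100.0),
    ("CA", 250.0),
    ("NY", 80.0),
    ("NY", 120.0),
    ("TX", 40.0),
]

# Group by state, computing a record count and a sum of order amounts
counts = defaultdict(int)
totals = defaultdict(float)
for state, amount in orders:
    counts[state] += 1
    totals[state] += amount

for state in sorted(counts):
    print(state, counts[state], totals[state])
```

Each distinct state becomes one output row, so the five input rows collapse to three aggregated rows.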
4. Tools for implementing aggregation of data
There are various tools and languages you can use to aggregate and disaggregate your data. These include SQL scripts, Spark Notebooks and Dataflows.
5. Aggregating data with SQL
The most commonly used SQL aggregate functions are SUM, COUNT, AVG, MIN and MAX.
SUM calculates the sum of the values in a set.
COUNT calculates the number of rows in a set.
These functions are used in combination with the GROUP BY clause, which groups the rows by the columns that are not part of an aggregate function.
SQL also includes statistical functions such as STDEV for the standard deviation and VAR for the variance.
6. Aggregating data with SQL
In this example, the query uses COUNT to return the number of orders and SUM to calculate the total amount per State.
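As a runnable sketch of that query pattern, the same shape can be tried with Python's built-in sqlite3 module. The table and column names here are invented for illustration, not taken from the slide:

```python
import sqlite3

# In-memory database with a hypothetical Orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (State TEXT, OrderAmount REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("CA", 100.0), ("CA", 250.0), ("NY", 80.0), ("NY", 120.0), ("TX", 40.0)],
)

# COUNT returns the number of orders and SUM the total amount, per state;
# GROUP BY collapses the rows into one row per state
rows = conn.execute(
    """
    SELECT State, COUNT(*) AS NumOrders, SUM(OrderAmount) AS TotalAmount
    FROM Orders
    GROUP BY State
    ORDER BY State
    """
).fetchall()

for state, num_orders, total in rows:
    print(state, num_orders, total)
```

The same SELECT would run against a Fabric warehouse; only the connection setup differs.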
7. Aggregating data with Spark
PySpark has equivalents of the most commonly used SQL aggregate functions: sum, count, avg, min and max.
Other functions include first and last, which return the first or last value in a group, as well as statistical functions like standard deviation and variance.
These functions are used in combination with the DataFrame operations groupBy and agg to group columns and specify the aggregation function.
8. Aggregating data with Spark
In this example, the query uses the aggregate functions count and sum to return the number of orders and the total amount, grouped by State.
9. Aggregating data with Spark
The aggregation functions must be imported from the pyspark.sql.functions module.
Some of these functions have the same name as Python built-in functions, so it is common to import the module with an alias.
For example, in this case, our alias was the letter F, so we would use F dot sum to call the sum function.
10. Aggregating data with Dataflows
In Dataflows, the Group by transform allows you to aggregate and group data.
It supports Sum, Average, Median, Minimum, Maximum, Percentile and Count aggregations.
11. Aggregating data with Dataflows
In this example, the Group by transform has been configured to group data by State and calculate two aggregations.
12. Let's practice!
Now, let's do a couple of exercises to practice data aggregation.