1. Advanced data analysis patterns
Hey there! Welcome back. In this lesson, we will be going over some more advanced techniques to analyze data with Databricks SQL.
2. Motivation
As a reminder, with Databricks SQL, we typically think of our analyses as happening from the end of the lakehouse architecture through our Gold layer. From this curated data, we can create whatever analysis we require, and can build some pretty sophisticated end-user results.
3. Sub queries
One technique that many analyts use within their analyses are sub-queries. Basically, a sub-query is a way to nest another SQL query within your direct query. In many ways, this sub-query can feel very similar to a separate table or view, without having to write the query results out.
Sub-queries are great for a large number of use cases. Commonly, sub-queries are used to retrieve the results of another query, to simplify large complex queries, and to have an aggregation within a query. In this example code, we have a sub query that is calculating aggregations on some sales data. Then, we have a query that is extracting specific fields from that sub-query. While this is a fairly basic example, it shows the power of what we can do with a sub-query.
4. Window functions
Next up, we can take a look at window functions. Window functions are a way in which we can apply some kind of calculation across a specific range of rows of our data. These data rows are related to each other in some way, and that is how we define the window of our function.
Most commonly, window functions are thought of in a few main use cases: calculating a metric across a period of time, calculating across neighbor rows, and calculating across dimensions.
Here we have a SQL query that is solving this last use case. The query is ranking each employee based on their salary within their department. In the results of that query, we can see that each person now has a rank column that corresponds to how much money they make, relative to their peers within their department.
5. Advanced Databricks SQL functions
There are a number of different functions that we can use within a window function. Often times, we can use standard functions like SUM() or AVG() with the window syntax. Other times, we need functions that are specifically created to work over a window function.
Rank() is one of the most common ones, and what we saw in the last slide. Rank will compare all rows within a given partition, and will calculate a rank value for each row. In the example SQL code, we can see the rank of each row from highest to lowest value in column b.
Lag and Lead are other common functions that we see with window functions. Lag and lead will return either the preceding or subsequent row value respectively. We can see this with our example code and resulting dataset provided.
6. Let's practice!
Now, let us go practice our advanced analysis techniques in Databricks SQL.