1. Understanding query performance
Once we've done our best to build an efficient query, Redshift provides some additional tools to help us find the optimizations we missed.
2. Query optimization process
We apply those tools by first checking the STL_ALERT_EVENT_LOG table to see if our query triggered any Redshift alerts. Next, we run EXPLAIN on our query to see which steps might be taking the longest or returning more rows than we expected. Finally, we check SVL_QUERY_SUMMARY and SVL_QUERY_REPORT for any long-running steps or data distribution issues with our query. This final step is beyond the scope of this course, but I'd encourage you to dig into it more yourself.
3. STL_ALERT_EVENT_LOG table
When attempting to improve queries, we can use the STL_ALERT_EVENT_LOG table to identify and correct potential performance issues. This table highlights problems such as missing table statistics, nested loops, and poorly chosen distribution and sort keys, and it offers suggestions on how to fix them. Here, we query the table for a recent query and get an alert that the table statistics are missing, along with a suggestion to run an ANALYZE operation on that table.
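A query like the one described might look as follows; this is a minimal sketch, and the query ID 12345 is a hypothetical placeholder for your own query's ID:

```sql
-- Look up alerts and suggested fixes for a specific query
-- (query ID 12345 is hypothetical)
SELECT query, event, solution, event_time
FROM stl_alert_event_log
WHERE query = 12345
ORDER BY event_time DESC;
```

The `event` column describes the detected problem, and `solution` holds Redshift's suggested fix, such as running ANALYZE.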
4. EXPLAINing the query execution plan
The EXPLAIN command is a powerful tool in database management. It provides insights into how the database engine executes a query by breaking it down into steps, displaying their relative costs, and estimating the number of rows processed at each step. The EXPLAIN statement can be used with any data manipulation language (DML) statement, such as SELECT, INSERT, UPDATE, or DELETE.
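Using it is as simple as prefixing a statement with the keyword; here's a minimal sketch, where the `sales` table and its columns are hypothetical:

```sql
-- Prefix a DML statement with EXPLAIN to see its execution plan
-- without actually running the query (table name is hypothetical)
EXPLAIN
SELECT division_id, SUM(revenue)
FROM sales
GROUP BY division_id;
```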
5. Explain example
Here's an example of how to use the EXPLAIN command to analyze a query. You might remember this query from a previous exercise on CTEs. In this query, we select the top ten divisions by revenue and filter for divisions with revenue greater than 100,000.
By running EXPLAIN before this query, we can obtain a detailed execution plan that breaks down each step, including any subqueries or joins, and provides cost estimates for each step. This information helps us understand how the database engine processes our query and where potential bottlenecks or optimizations might exist.
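As a sketch of what this might look like, here is a hypothetical reconstruction of the query described above; the `sales` table and its columns are assumptions, while `division_names` and `division_id` come from the plan discussed next:

```sql
-- Hypothetical reconstruction of the CTE query being explained
EXPLAIN
WITH division_revenue AS (
    SELECT division_id, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY division_id
)
SELECT d.division_name, r.total_revenue
FROM division_revenue r
JOIN division_names d ON d.division_id = r.division_id
WHERE r.total_revenue > 100000   -- filter for revenue above 100,000
ORDER BY r.total_revenue DESC
LIMIT 10;                        -- top ten divisions by revenue
```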
6. Explain results
The EXPLAIN command's output provides a detailed query execution plan. It shows various steps in the plan, each with its associated cost and details. The first step is a Hash Join, indicating that the query performs a join operation. It sequentially scans the division_names table, which is a low-cost operation. The query also utilizes a Subquery Scan and Limit operation to filter for the top ten divisions by revenue. A Sort operation arranges the results by revenue in descending order. Finally, a HashAggregate operation groups the data by division_id. Each line also has a cost; these costs are in an arbitrary unit that doesn't directly equate to time or currency, but that is perfect for comparing the relative size of operations. The first value in the cost is the startup cost, or the cost to return the first row of data, and the value after the two dots is the total cost for returning all the rows of data. Understanding this plan helps database administrators and developers identify performance bottlenecks, such as iterating over too many rows or selecting too many columns, and optimize queries for better execution times. The PostgreSQL documentation is a great reference for understanding EXPLAIN output.
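To make the cost notation concrete, here is what a single plan line looks like; the numbers are illustrative, not real output:

```
XN Seq Scan on division_names  (cost=0.00..12.50 rows=1250 width=40)
                                     ^^^^  ^^^^^
                                  startup   total cost for all rows
```

Here `rows` is the planner's estimate of how many rows the step returns, and `width` is the estimated average row size in bytes.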
7. Let's practice!
Let's go tackle some problems using what we just learned!