
Writing optimal queries

1. Writing optimal queries

Now that we've seen several different query types that take advantage of Redshift's features and how tables are constructed, let's talk about writing optimal queries for Redshift.

2. Limit columns

The first optimization we should apply is to select only the columns we need in our output. Since Redshift is a columnar database, every column we add increases the resources and time needed to read and process the data.
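As a minimal sketch, assuming a hypothetical `receipts` table:

```sql
-- Avoid SELECT *; name only the columns the output needs.
-- Redshift stores each column separately, so every column
-- referenced means more blocks to read and decompress.
SELECT receipt_id, receipt_date, total
FROM receipts;
```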

3. Use DISTKEY and SORTKEYs

Use dist and sort keys in join, where, and group by clauses whenever possible. Leveraging these keys helps distribute the work across multiple nodes and reduces data movement between Redshift nodes, especially when providing filters for each table. Additionally, when using sort keys in an order by clause, put them in the same order as they were set in the table's SORTKEY for a highly optimized statement.
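A sketch of the idea, assuming a hypothetical `receipts` table created with `DISTKEY(cookie_id)` and a compound `SORTKEY(receipt_date, cookie_id)`:

```sql
SELECT receipt_date, cookie_id, COUNT(*) AS num_sales
FROM receipts
WHERE receipt_date >= '2024-01-01'   -- filter on the leading SORTKEY column
GROUP BY receipt_date, cookie_id     -- grouping on the DISTKEY keeps work node-local
ORDER BY receipt_date, cookie_id;    -- same order as the table's SORTKEY
```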

4. Building good predicates

When writing predicates, try to use the table's DISTKEY and SORTKEY when you can. If you are joining tables, put any predicates that target a joined table in the join clause, and avoid wrapping predicate columns in functions. Notice in this example how we added predicates to the cookies join to limit it to just the cookies that were available before our receipt date predicate and are still available. We could let the receipt's cookie ID restrict it to the proper cookie rows via the first ON predicate. Still, by using these additional predicates, the distributed query can filter down the data we're interested in before merging the tables for the join, reducing the amount of data copied and loaded by our query. By putting these in the join condition's ON clause instead of the where clause, we allow these predicates to be distributed with the join request, not just the overall query.
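A hedged reconstruction of what a query like this might look like; the table and column names (`receipts`, `cookies`, `available_from`, `discontinued_date`) are hypothetical stand-ins for the slide's example:

```sql
SELECT r.receipt_id, c.name, r.total
FROM receipts r
JOIN cookies c
  ON c.id = r.cookie_id
  -- Extra predicates in the ON clause let each node filter the
  -- cookies table down before the join merges the tables:
  AND c.available_from < '2024-01-01'   -- available before our receipt date filter
  AND c.discontinued_date IS NULL       -- and still available
WHERE r.receipt_date >= '2024-01-01';
```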

5. Be consistent with column ordering

To prevent the query engine from needing to rearrange data multiple times and consuming extra resources, maintain column order between the group by and order by clauses. In our example here, we are ordering by col two, three, and then one; however, our group by doesn't match that order. If we adjust our group by to match our order by, we'll enable the engine to immediately work on the order without the need to rearrange the columnar data for processing.
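A short sketch of the corrected version, with hypothetical column names standing in for the slide's col one, two, and three:

```sql
-- GROUP BY lists the columns in the same order as ORDER BY,
-- so the engine can reuse the grouped arrangement for ordering
-- instead of rearranging the data a second time.
SELECT col_two, col_three, col_one, COUNT(*) AS row_count
FROM sales
GROUP BY col_two, col_three, col_one
ORDER BY col_two, col_three, col_one;
```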

6. Use subqueries wisely

Subqueries can cause Redshift to do extra work that we didn't intend, so we must use them wisely. Use a proper join instead of a subquery if you can. If you are using a subquery to check whether matching rows exist, Redshift's EXISTS condition is more performant than the traditional IN statement. If you use a subquery in multiple steps or parts of your solution, consider making it a temp table or a CTE to take advantage of Redshift's powerful caching.
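Two quick sketches of these ideas, again using the hypothetical `receipts` and `cookies` tables:

```sql
-- EXISTS stops at the first matching row, rather than
-- materializing the full list an IN subquery would build:
SELECT r.receipt_id
FROM receipts r
WHERE EXISTS (
    SELECT 1
    FROM cookies c
    WHERE c.id = r.cookie_id
      AND c.discontinued_date IS NULL
);

-- A subquery needed in several places becomes a CTE,
-- defined once and referenced by name:
WITH active_cookies AS (
    SELECT id, name
    FROM cookies
    WHERE discontinued_date IS NULL
)
SELECT ac.name, COUNT(*) AS num_sales
FROM receipts r
JOIN active_cookies ac ON ac.id = r.cookie_id
GROUP BY ac.name;
```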

7. Let's practice!

Time to write some optimized queries.