Query optimization strategies

1. Query optimization strategies

Welcome! Now, it is time to learn how to optimize all the queries we have learned in the course.

2. Three rules of thumb

When optimizing a query, we should ask ourselves three questions. First, have we reduced the amount of data to be processed as much as possible? Second, have we optimized all the operations within the query? Third, have we reduced the query output as much as possible? Checking our query against these steps will help ensure we write optimal queries.

3. Reducing the amount of data

We can use some tried and true methods to create optimal queries. The first is generally a good rule when using SQL - to select only the columns we need. Another way to reduce the amount of data is by using CTEs as filters to pass down only the required data to the final query. This means we should perform operations like WHERE filters and any operation that can reduce the number of rows early, so our final query will need to scan less data.

4. Optimizing joins

Similarly, when performing a join, any measures we can take to limit the amount of data in the final step where the JOIN will take place will improve performance. While this is not possible in all cases, joining data using an integer, specifically the INT64 data type, is recommended. Since there are fewer possible combinations to join on, this also speeds up the join.

5. Optimizing the WHERE clause

BigQuery recommends using the BOOLEAN, INTEGER, FLOAT, and DATE data types with the WHERE clause. These data types will perform faster when filtering our data. STRING and BYTE data types are not optimal since they require more bytes to store and potentially have more options to search across. In these examples, we can see two queries. The first is looking for a product named "shoes". Imagine that in the same table, we also have product_id, a unique integer identifier for each product. The latter will get the same result but run using a more efficient query.

6. ORDER BY optimizations

In BigQuery, ORDER BY can be costly, depending on the data size. Think about it for a minute. This mistake happens often, especially when using ORDER BY within a CTE or before the end of our query.

7. ORDER BY without optimization

Consider this query. As we can see, the ORDER BY clause is contained within our CTE called order total. Many individual rows have details on the order, meaning that the ORDER BY will need to review all of them to order the data.

8. ORDER BY with optimization

Here, the ORDER BY clause to the outermost query. This requires adding the last-purchase-date column to the order-total CTE, which adds more data, but ultimately, this is a less costly addition than using ORDER BY over a large number of rows.

9. Using EXISTS vs. COUNT

One final optimization we can perform is for a specific use case. We need to determine if a value exists in a table; a standard query might aggregate that column and count the number of instances. BigQuery provides an alternate solution called EXISTS; it will query over the column, and if it finds the matching value, it will stop once that condition has been met. The COUNT and aggregation method would continue to query even if that condition is met. As on the right, we only need to wrap our query in parentheses and then use the statement SELECT EXISTS at the beginning.

10. Other optimization methods

Other methods that can optimize queries include using approximate aggregate functions like those seen here and filtering the query by date. Many large tables in BigQuery will require a partition by date, improving query performance.

11. Let's practice!

Now that we have reviewed the three rules for optimizing your queries let's put it into practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.