Query Optimization

Adapted from 15 Best Practices for SQL Optimization.

No optimization technique is universally true, these recommendations should improve performance in most cases. As with all optimization, test in your unique set of circumstances before assuming it to be true.

1. Avoid using `SELECT *`

Selecting only the fields you need to be returned improves query performance by reducing the amount of data that is processed internally.

A principle the Query Optimizer uses is to eliminate rows and columns to process as early as possible, SELECT * removes the option to remove columns from the data being processed.

2. Prune Early

Where available, use temporal filters (FOR DATE) to limit the date range which will limit the number of partitions that need to be read.

Not reading the record is faster than reading and working out if it needs to be filtered out of the result set.

3. `GROUP BY` field selection

VARCHAR Grouping by VARCHAR columns is usually slower than grouping by NUMERIC columns, if you have an option of grouping by a username or a numeric user id, prefer the user id.

cardinality Grouping by columns with high cardinality (mostly unique) is generally slower than grouping where there is a lot of duplication in the groups.

4. Avoid `CROSS JOIN`

Cross join will very likely create a lot of records that are not required - if you then filter these records from the two source tables using a WHERE clause, it's likely you should use an INNER JOIN instead.

5. Small table drives big table

Most JOINs require iterating over two relations, the left relation, which is the one in the FROM clause, and the right relation which is the one in the JOIN clause (SELECT * FROM left JOIN right). It is generally faster to put the smaller relation to the left.

6. Use `LIKE` when comparing strings

LIKE can be used for pattern matching but it can also be used for comparisons without wildcards and generally performs faster than = comparisons.

7. Use the correct `JOIN`

A CROSS JOIN can quickly generate millions of records to be filtered, if you can use any join other than the CROSS JOIN, do that.

8. Use `LIMIT`

LIMIT stops a query when it has returned the desired number of results; if you do not want the full dataset, using LIMIT can reduce the time taken to process a statement.

However, some operations are 'greedy', that is, they need all of the data for their operation (for example ORDER BY, and GROUP BY) - LIMIT does not have the same impact on these queries.

9. Use `WHERE` to filter before `GROUP BY`

Only use HAVING to filter the aggregation results of GROUP BY. GROUP BY is a relatively expensive operation in terms of memory and compute, filter as much before the GROUP BY by using the WHERE clause and only use HAVING to filter by the aggregation function (e.g. COUNT, SUM).

10. `IS` filters are generally faster than `=`

IS comparisons are optimized for a specific check and perform up to twice as fast as = comparisons. However, they are only available for a limited set of checks:

IS (NOT) NONE
IS (NOT) TRUE
IS (NOT) FALSE

Query Optimization

1. Avoid using SELECT *

2. Prune Early

3. GROUP BY field selection

4. Avoid CROSS JOIN

5. Small table drives big table

6. Use LIKE when comparing strings

7. Use the correct JOIN

8. Use LIMIT

9. Use WHERE to filter before GROUP BY

10. IS filters are generally faster than =