Optimization Strategies

Strategy	Type	Status
Split Conjunctive Predicates	Heuristic	Implemented
Predicate Pushdown	Schema-Aware	Implemented
Projection Pushdown	Schema-Aware	Implemented
Constant Folding	Heuristic	Implemented
Predicate Rewriter	Heuristic	Implemented
Morsel Defragmentation	Heuristic	Attempted
Aggregate Pushdown	Schema-Aware	Considered
IN (literal) to JOIN	Schema-Aware	Considered
Use Heap Sort	Heuristic	Implemented
Use pass-thru DISTINCT	Heuristic	Considered
Limit Pushdown	Heuristic	Considered
IN (subquery) to JOIN	Schema-Aware	Considered
CTE rewrite	Heuristic	Considered
Subquery flattening	Schema-Aware	Considered
JOIN ordering	Cost-Based	Considered
Predicate Ordering	Cost-Based	Attempted
Predicate Flattening	Schema-Aware	Attempted
Predicate Compaction	Schema-Aware	Designed
Correlated Predicates	Schema-Aware	Considered
Predicate Elimination	Heuristic	Implemented
JOIN Elimination	Schema-Aware	Considered

Split Conjunctive Predicates

status implemented
goal prepare
description Split ANDed (conjunctions) predicates into separate steps in the query plan

This optimization step is preparing for later optimizations, it splits filter conditions at ANDs to try to create smaller, simpler individual conditions (predicates). These predicates are then handled in subsequent strategies, for example Predicate Pushdown and Predicate Ordering.

To split conditions, no information is needed about the schema or about costs, this strategy is executed before the binder.

Predicate Pushdown

status implemented
goal reduce rows
description Move filters earlier in the query plan

This optimization aims to reduce the number of records as quickly as possible running through the execution engine by executing filters as soon as possible. This can include pushing filters into the actual data read (e.g. for SQL and parquet), straight after reading, or into JOIN conditions.

This optimization needs to know which relation identifiers are from to be able to push toward the correct scan/JOIN, this strategy is run after the binder.

improvements - Push complex predicates (e.g. ORed) to SQL sources - Where filters can be inferred from statistics and JOINs, create and push these

Projection Pushdown

status implemented
goal reduce columns
description Trim unwanted columns at read

This optimization aims to increase the number of records per morsel by removing unwanted columns from relations. This is generally done during the actual read, or straight after reading (e.g. for JSONL and CSV).

This optimization needs to know which relation identifiers are from to be able to push toward the correct scan, this strategy is run after the binder.

improvements - Push common functions and transforms to SQL sources

Constant Folding

status implemented
goal reduce calculations
description Pre-evaluate expressions

This optimization aims to reduce the work done to evaluate expressions by pre-evaluating expressions which don't rely on data from relations. This includes fixed constants (e.g. literals, PI()), variable constants (current_time) and variables.

This optimization requires variables to be resolved so is run after the binder.

improvements - filters which evaluate to FALSE (when ANDed) should prune the scan(s) below it - filters which evaluate to TRUE (when ORed) should remove the entire filter

Morsel Defragmentation

status designed
goal reduce morsels
description Combine small morsels together

This optimization aims to reduce the number of times the execution engine executes each step by combining small morsels together. Morsels are up to 64Mb in size, but as filters and other operations are run to eliminate records, the engine may be processing 100 morsels of 0.5Mb each. By combining, we have fewer function calls and steps which benefit from seeing as much data as possible (e.g. JOINs) execute faster.

This optimization does not require any information about schemas, but operates on a nearly finished plan so is run after the binder.

notes - This should be implemented to run after filters (including when pushed into readers)

Predicate Rewriter

status implemented
goal faster implementations
description replace predicates with faster versions

Some predicates can support complex filtering but are used to perform trivial filtering, where a complex predicate is used to perform a trivial check, replace the check with an simpler function call which is faster.

demorgans laws
negative filter reduction
LIKE to STARTS_WITH, ENDS_WITH, SEARCH
Single element IN to equals
No wildcard LIKE to equals

improvements Identify other functions which have faster versions.

Aggregate Pushdown

into SQL sources

IN (literal) to JOIN

Use Heap Sort

status implemented
goal reduce memory usage & sort complexity
description incremental sort

When performing an ORDER BY and a LIMIT, use a heap sort in batches to avoid loading the entire dataset into memory.

This works by acquiring tuples to sort in batches, sorting the batch, keeping the number of tuples to satisfy the limit and then fetching the next batch.

Use pass-thru LIMIT

When dealing with large number of records, rather than load them all into memory to to the distinct, use a pass-thru limit approach

Limit Pushdown

Push limits to the SQL reader

IN (subquery) to JOIN

CTE rewrite

Subquery flattening

JOIN ordering

Predicate Ordering

Predicate Flattening

Predicate Elimination

Where logical expressions contain boolean literals, these can be removed sometimes resulting in the entire expression being removed. E.g. 1 = 1 OR name = 'Alice' can be completely removed avoiding any further handling of the check against the name column.

Correlated Predicates

If a column participating in a JOIN has a known value range, push down corresponding filters to both sides of the JOIN, reducing the size of the datasets that need to be joined.

JOIN Elimination

Value range information for fields participating in JOINs could lead to the identification of redundant JOINs. Where the value ranges of the joining columns do not overlap, the JOIN will not produce any results, the entire sub-tree below the JOIN could be pruned.