Storage Formats
This document primarily applies to the Blob and File stores, such as GCS, S3 and local disk.
Supported Data Files
Parquet
Parquet is the preferred file format for Opteryx and use of Parquet offers optimizations not available with other formats. If a datasource has query performance issues or is hot in terms of query use, converting to Parquet is likely to improve performance. Performance testing suggests Parquet with zStandard compression provides best balance of IO to read the files and CPU to to the files.
As will all guidance on performance tuning - this appears to be generally correct but test for your specific circumstances.
ORC & Feather
Opteryx support ORC and Feather, but not all optimizations implemented for Parquet are implemented for these formats. These will still provide better performance than traditional data formats.
ORC files have limited support on Windows and PyPy environments.
JSONL
JSONL and zStandard compressed JSONL files.
Files don't need an explicit schema, but each partition must have the same columns in the same order in every row of every file.
Data types are inferred from the records, where data types are not consistent, the read will fail.
Opteryx supports zStandard Compressed JSONL files as created by Mabel, these perform approximately 20% faster than raw JSONL files primarily due to reduced IO.
CSV & TSV
Comma-separated and tab-delimited files can be used with Opteryx, however this is only provided to meet the base expectation that the system can support these formats. It is not recommended and limited regression tests are written relating to CSV handling.
Avro
Avro formatted files are supported, however require an additional library to be installed (pip install avro
) and performance is considered poor.
Storage Layout
For Blob/File stores, the path of the data is used as the name of the relation in the query. There are currently two built in data schemas, none (or flat) and Mabel.
Flat
The flat schema is where the data files are stored in the folder which names the relation, such as:
This would be available to query with a query such as:
Which would read the three files to return the query.
Mabel
The Mabel schema is where data is structured in date labelled folders
The date components of the folder structure are part of the temporal processing, and are not directly referenced as part of the query, instead they form part of the temporal clause (FOR
)
This approach enables data to be partitioned by date and pruned using temporal filters.