Deployment Guide
Requirements
Host Specifications
Minimum: 1 GB RAM, 1 CPU (x86)
Recommended: 8 GB RAM, 4 CPUs (x86)
Opteryx balances memory consumption against performance; processing larger datasets requires more memory than processing smaller ones. The reference implementation of Opteryx regularly processes 100GB of data in a container with 4 CPUs and 8GB of memory allocated.
Note
This is a general recommendation and a good place to start; your environment and specific workload may require, or perform significantly better with, a different configuration.
Note
Opteryx contains no specific optimizations to make use of multiple CPUs, although multiple CPUs may still be beneficial because some of the libraries Opteryx is built on can use them.
Warning
Non-x86 environments, such as Raspberry Pi or Apple Silicon Macs (e.g. M1), may require additional setup steps.
Operating System Support
Recommended Operating System: Ubuntu 20.04 (64-bit)
Opteryx can be installed and deployed on a number of different platforms. It depends heavily on Apache Arrow and cannot run on systems which do not support Arrow.
The full regression suite is run on Ubuntu 20.04 for Python versions 3.9, 3.10, 3.11, and 3.12. The table below shows regression suite coverage:
OS | Python 3.9 | Python 3.10 | Python 3.11 | Python 3.12 |
---|---|---|---|---|
MacOS (x86/Intel) | Partial | Partial | Partial | Partial |
Windows (x86) | Partial | Partial | Partial | Partial |
Ubuntu (x86) | Full | Full | Full | Full |
Debian (ARM) | Partial | None | None | None |
Full - no tests are excluded from the test suite - coverage statistics are from Ubuntu Python 3.10 tests.
Partial - some tests are excluded from the test suite or some tests fail.
None - there is no automated test for this configuration.
Note
- Python 3.8 was last supported in version 0.11.0
- PyPy regression suite fails due to issues with Apache Arrow.
- MacOS (M1/M2/M3) is not included in the regression suite due to lack of support on the test platform; however, there is known usage on this chipset with Python 3.11.
- Windows (ARM) is not included in the regression suite due to lack of support on the test platform.
- Partial coverage is primarily due to testing platform constraints, not core-compatibility issues.
Python Environment
Recommended Version: 3.11
Opteryx supports Python versions 3.9, 3.10, 3.11, and 3.12. Due to variations in support for parts of Opteryx, and in the environments available to reliably test and build, not all platforms support every version of Python; 3.11 has the broadest compatibility.
Environment | Python Versions Supported |
---|---|
Linux 64bit x86 | 3.9, 3.10, 3.11, 3.12 |
Linux ARM | build from source |
MacOS Intel | 3.9, 3.10, 3.11, 3.12 |
MacOS Apple (M) | 3.11, 3.12 |
Windows 64bit x86 | 3.9, 3.10, 3.11, 3.12 |
Opteryx is primarily developed on workstations running Python 3.11 (Debian x86, Raspbian, and MacOS M2) and is known to be deployed in production environments running Python 3.9 and Python 3.11 on Debian.
Python 3.11 has the greatest test coverage because it is supported on more platforms.
Jupyter Notebooks
Opteryx can run in Jupyter Notebooks to access data locally or, if configured, remotely on systems like GCS and S3.
It is recommended that the Notebook host is located close to the data being queried - such as running Vertex AI Notebooks if the data sources are primarily on GCP, or querying local files if running Jupyter on a local device. Other configurations will work, but are less optimal.
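A minimal sketch of running a query from a notebook cell is below; $planets is one of the sample datasets that ships with Opteryx, so no remote store needs to be registered:
import opteryx
connection = opteryx.connect()
cursor = connection.cursor()
cursor.execute("SELECT * FROM $planets;")
print(cursor.fetchone())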
Docker & Kubernetes
There is no Docker image for Opteryx because Opteryx is an embedded Python library; however, systems built using Opteryx can be deployed via Docker or Kubernetes.
Google Cloud
Cloud Run
Opteryx is well-suited for running data manipulation tasks in Cloud Run as this was the target platform for the initial development.
Running in the Generation 2 container environment is likely to result in faster query processing, but with a slower start-up time. Opteryx also runs in the Generation 1 container environment, where queries take approximately 10% longer to execute.
Data Storage
Connectors
Built-In Connectors
The following connectors are part of the base installation of Opteryx and are tested as part of each deployment to ensure they operate as expected.
Platform | Connector Name | Implementation |
---|---|---|
AWS S3 | AwsS3Connector | Blob/File Store |
Google Cloud Storage | GcpCloudStorageConnector | Blob/File Store |
Google FireStore | GcpFireStoreConnector | Document Store |
Local Disk | DiskConnector | Blob/File Store |
MinIO | AwsS3Connector | Blob/File Store |
MongoDB * | MongoDbConnector | Document Store |
BigQuery | SqlConnector | SQL Store |
Cockroach DB | SqlConnector | SQL Store |
DuckDB | SqlConnector | SQL Store |
MySQL | SqlConnector | SQL Store |
Postgres | SqlConnector | SQL Store |
SQLite | SqlConnector | SQL Store |
Cassandra | CqlConnector | CQL Store |
Datastax Astra | CqlConnector | CQL Store |
Connectors are registered with the storage engine using the register_store method. Multiple prefixes can be added, each using a different connector, so multiple storage types can be combined into a single query.
Note
Other data sources with SQLAlchemy connectors, or which support the Cassandra Driver, are likely to be supported; however, they are not part of the automated test suite.
A more complete example, using the register_store method to set up a connector to Google Cloud Storage (GCS) and then query data on GCS, is below:
import opteryx
from opteryx.connectors import GcpCloudStorageConnector
# Tell the storage engine that datasets with the prefix 'your_bucket'
# are to be read using the GcpCloudStorageConnector connector.
# Multiple prefixes can be added and do not need to be the same
# connector.
opteryx.register_store("your_bucket", GcpCloudStorageConnector)
connection = opteryx.connect()
cursor = connection.cursor()
cursor.execute("SELECT * FROM your_bucket.folder;")
print(cursor.fetchone())
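As a further sketch, the example below registers a local folder alongside the GCS bucket above and joins across them in a single query; the dataset and column names are illustrative only:
import opteryx
from opteryx.connectors import DiskConnector, GcpCloudStorageConnector
# Datasets prefixed 'your_bucket' are read from GCS; datasets prefixed
# 'local_data' are read from the local disk.
opteryx.register_store("your_bucket", GcpCloudStorageConnector)
opteryx.register_store("local_data", DiskConnector)
connection = opteryx.connect()
cursor = connection.cursor()
# A single query combining both stores (illustrative dataset and column names).
cursor.execute(
    "SELECT * "
    "FROM your_bucket.folder AS cloud "
    "INNER JOIN local_data.reference AS ref "
    "ON cloud.id = ref.id;"
)
print(cursor.fetchone())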
Blob/File Stores
Datasets
Opteryx references datasets using their relative path as the table name. For example, the following folder structure:
/
├─ products/
├─ customers/
│  ├─ profiles/
│  └─ preferences/
│     ├─ marketing/
│     └─ site/
└─ purchases/
would make the following datasets available (assuming the leaf folders have data files within them):
- products
- customers.profiles
- customers.preferences.marketing
- customers.preferences.site
- purchases
These are queryable like this:
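-- a sketch; the dataset names follow the folder structure above
SELECT * FROM products;
SELECT * FROM customers.profiles;
SELECT * FROM customers.preferences.marketing;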
Temporal Structures
To enable temporal queries, data must be structured into date hierarchy folders below the dataset folder. Using just the products dataset from the above example, the products folder must contain year, month, and day folders.
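A sketch of the expected layout, assuming the Mabel convention of year_YYYY, month_MM and day_DD folder names (showing the 1st May 2022 partition):
/
└─ products/
   └─ year_2022/
      └─ month_05/
         └─ day_01/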
To query the data for today with this structure, you can execute:
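-- a sketch, assuming the FOR TODAY temporal clause
SELECT * FROM products FOR TODAY;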
To query just the folder shown in the example (1st May 2022), you can execute:
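-- a sketch, assuming a date literal can be supplied to the FOR clause
SELECT * FROM products FOR '2022-05-01';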
This is the default structure created by Mabel; within Opteryx it is called Mabel Partitioning.
File Types
Opteryx is primarily designed for use with Parquet to store data. Parquet is fast to process and offers optimizations not available for other formats; however, in some benchmarks ORC outperforms Parquet.
Opteryx supports:
- Parquet formatted files (.parquet)
- CSV formatted files (.csv)
- Tab delimited files (.tsv)
- JSONL formatted files (.jsonl)
- JSONL formatted files which have been Zstandard compressed (.zstd)
- ORC formatted files (.orc)
- Feather (Arrow) formatted files (.arrow)
- Arrow IPC format (.ipc)
- Avro formatted files (.avro)
Note
- ORC is not supported on Windows environments
- CSV and TSV support is limited and is not recommended beyond trivial use cases
- Avro is not recommended for use in performance-sensitive contexts
File Sizes
Opteryx usually loads an entire file into memory at a time, so the following should be considered:
- Reading one record from a file loads the entire blob. If you regularly only read a few records, prefer smaller blobs.
- Reading each blob, particularly from Cloud Storage (S3/GCS), incurs a per-read overhead. If you have large datasets, prefer larger blobs.
If you are unsure where to start, 64MB (pre-compression) is a recommended general-purpose blob size; these blobs should then be compressed (Snappy or Zstandard are recommended).
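As a sketch of producing blobs in this shape, the snippet below uses PyArrow (not Opteryx itself) to write a Parquet blob with Zstandard compression; the table contents and file name are placeholders:
import pyarrow as pa
import pyarrow.parquet as pq
# Placeholder data; in practice, accumulate rows until the table is roughly
# 64MB in memory before writing it out as a single blob.
table = pa.table({"id": [1, 2, 3], "name": ["ada", "grace", "alan"]})
# Write the blob with Zstandard compression (Snappy is also a good choice).
pq.write_table(table, "part-0000.parquet", compression="zstd")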