Integration Guides

Opteryx provides flexible integration options to query data across multiple sources. These guides will help you connect Opteryx to various databases, cloud storage systems, and data tools.

Getting Started

New to Opteryx? Start here:

Quickstart Guide - Get Opteryx up and running in minutes
Python API - Learn the Python API basics
Using with Jupyter Notebooks - Interactive data analysis

Data Sources

Query data from various storage systems without moving or copying it.

Cloud Storage

AWS S3 - Query Parquet, CSV, and JSONL files directly from Amazon S3
Google Cloud Storage - Access data files in GCS buckets
Apache Iceberg - Query Iceberg tables with catalog support

File Formats

Execute SQL on CSV Files - Query CSV files directly without loading into a database
Convert CSV to Parquet - Transform file formats using SQL

SQL Databases

Connect to relational databases and run federated queries across them.

PostgreSQL - Connect to Postgres databases using SQLAlchemy
MySQL - Query MySQL databases and join with other sources
BigQuery - Access Google BigQuery tables
DuckDB - Query DuckDB databases for analytics
SQLite - Work with SQLite databases

Data Analysis Tools

Integrate with popular Python data analysis libraries.

DataFrames

Pandas - Query Pandas DataFrames and output results as Pandas
Polars - High-performance DataFrame integration with Polars

Notebooks

Jupyter Notebooks - Interactive data exploration and visualization

Common Use Cases

Federated Queries

Combine data from multiple sources in a single query:

import opteryx
from opteryx.connectors import SqlConnector, AwsS3Connector
from sqlalchemy import create_engine

# Register S3 connector
opteryx.register_store("my-bucket", AwsS3Connector)

# Register PostgreSQL
postgres_engine = create_engine("postgresql+psycopg2://user:pass@host/")
opteryx.register_store("pg", SqlConnector, remove_prefix=True, engine=postgres_engine)

# Query across S3 and Postgres
result = opteryx.query("""
    SELECT 
        s3_data.customer_id,
        s3_data.purchase_amount,
        pg_data.customer_name
    FROM my-bucket.sales AS s3_data
    JOIN pg.customers AS pg_data
    ON s3_data.customer_id = pg_data.id
""")

Data Pipeline Integration

Use Opteryx in data pipelines:

import opteryx
from opteryx.connectors import GcpCloudStorageConnector

# Register GCS connector
opteryx.register_store("data", GcpCloudStorageConnector)

# Read from GCS
df = opteryx.query("""
    SELECT 
        date,
        product_id,
        SUM(amount) as total_sales
    FROM data.sales
    WHERE date >= '2024-01-01'
    GROUP BY date, product_id
""").pandas()

# Continue processing with your preferred tool
df.to_csv('processed_sales.csv', index=False)

Need Help?

Discord Community - Get help from the community
GitHub Issues - Report bugs or request features
Documentation - Explore the full documentation

Contributing

Found an issue or want to contribute a guide? Check out our Contributing Guide.