Skip to content

Integration Guides

Opteryx provides flexible integration options to query data across multiple sources. These guides will help you connect Opteryx to various databases, cloud storage systems, and data tools.

Getting Started

New to Opteryx? Start here:

Data Sources

Query data from various storage systems without moving or copying it.

Cloud Storage

File Formats

SQL Databases

Connect to relational databases and run federated queries across them.

  • PostgreSQL - Connect to Postgres databases using SQLAlchemy
  • MySQL - Query MySQL databases and join with other sources
  • BigQuery - Access Google BigQuery tables
  • DuckDB - Query DuckDB databases for analytics
  • SQLite - Work with SQLite databases

Data Analysis Tools

Integrate with popular Python data analysis libraries.

DataFrames

  • Pandas - Query Pandas DataFrames and output results as Pandas
  • Polars - High-performance DataFrame integration with Polars

Notebooks

Common Use Cases

Federated Queries

Combine data from multiple sources in a single query:

import opteryx
from opteryx.connectors import SqlConnector, AwsS3Connector
from sqlalchemy import create_engine

# Register S3 connector
opteryx.register_store("my-bucket", AwsS3Connector)

# Register PostgreSQL
postgres_engine = create_engine("postgresql+psycopg2://user:pass@host/")
opteryx.register_store("pg", SqlConnector, remove_prefix=True, engine=postgres_engine)

# Query across S3 and Postgres
result = opteryx.query("""
    SELECT 
        s3_data.customer_id,
        s3_data.purchase_amount,
        pg_data.customer_name
    FROM my-bucket.sales AS s3_data
    JOIN pg.customers AS pg_data
    ON s3_data.customer_id = pg_data.id
""")

Data Pipeline Integration

Use Opteryx in data pipelines:

import opteryx
from opteryx.connectors import GcpCloudStorageConnector

# Register GCS connector
opteryx.register_store("data", GcpCloudStorageConnector)

# Read from GCS
df = opteryx.query("""
    SELECT 
        date,
        product_id,
        SUM(amount) as total_sales
    FROM data.sales
    WHERE date >= '2024-01-01'
    GROUP BY date, product_id
""").pandas()

# Continue processing with your preferred tool
df.to_csv('processed_sales.csv', index=False)

Need Help?

Contributing

Found an issue or want to contribute a guide? Check out our Contributing Guide.