Querying Google Cloud Storage with Opteryx
This guide demonstrates how to query data stored in Google Cloud Storage (GCS) buckets using Opteryx. Opteryx can directly read various file formats (Parquet, CSV, JSONL, ORC, Avro) from GCS without needing to download them first.
Installation
Install Opteryx with GCS support.
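Opteryx's optional dependencies can vary between releases, so treat the following as a starting point: install Opteryx alongside Google's Cloud Storage client library (both are standard PyPI packages), and check the Opteryx documentation for the exact extras your version supports.

```bash
pip install opteryx google-cloud-storage
```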
Authentication
Opteryx uses standard Google Cloud credential mechanisms. Ensure you have GCP credentials configured through one of:
- Application Default Credentials (ADC)
- Service account key file (set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable)
- OAuth 2.0 token (when running on Google Cloud infrastructure)
For local development, authenticate using:
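```bash
gcloud auth application-default login
```

This stores Application Default Credentials locally, which Opteryx picks up automatically.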
Querying GCS Files
To query files in GCS, you need to register the GCS connector with a prefix that maps to your bucket.
Basic Setup
```python
import opteryx
from opteryx.connectors import GcpCloudStorageConnector

# Register the GCS connector for your bucket,
# using the bucket name as the prefix
opteryx.register_store("my-bucket", GcpCloudStorageConnector)

# Now you can query files using the registered prefix
result = opteryx.query("""
    SELECT *
    FROM my-bucket.data.planets
    LIMIT 10
""")

# Display results
result.head()
```
Note
The dataset path `my-bucket.data.planets` refers to files in the GCS path `gs://my-bucket/data/planets/`. Opteryx uses dot notation instead of GCS URIs.
Query Multiple Files
```python
import opteryx
from opteryx.connectors import GcpCloudStorageConnector

# Register the GCS connector
opteryx.register_store("my-bucket", GcpCloudStorageConnector)

# Query all files in a dataset directory
result = opteryx.query("""
    SELECT *
    FROM my-bucket.data.planets
    WHERE name LIKE 'M%'
""")

result.head()
```
Note
When querying a dataset like `my-bucket.data.planets`, Opteryx reads all compatible files in that directory (e.g., all `.parquet` files in `gs://my-bucket/data/planets/`).
Supported File Formats
Opteryx can query the following formats directly from GCS:
- Parquet (`.parquet`)
- CSV (`.csv`)
- JSONL (`.jsonl`)
- ORC (`.orc`)
- Avro (`.avro`)
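Because the format is determined by the files in the dataset, the query itself looks the same for any of these formats. A minimal sketch, where the `my-bucket.logs.events` dataset is a hypothetical placeholder assumed to contain JSONL files:

```python
import opteryx
from opteryx.connectors import GcpCloudStorageConnector

opteryx.register_store("my-bucket", GcpCloudStorageConnector)

# The SQL is identical whether the underlying files are
# Parquet, CSV, JSONL, ORC, or Avro
result = opteryx.query("SELECT * FROM my-bucket.logs.events LIMIT 5")
result.head()
```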
Performance Tips
- Use Parquet for better performance with columnar data
- Partition your data by date or category to enable partition pruning
- Structure data in dataset directories - Opteryx will automatically read all compatible files in a dataset folder
- Apply filters to reduce data transfer from GCS (see the sketch below)
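As an illustration of the last tip, the sketch below selects only the needed columns and filters early; the `my-bucket.sales.orders` dataset and its columns are hypothetical placeholders:

```python
import opteryx
from opteryx.connectors import GcpCloudStorageConnector

opteryx.register_store("my-bucket", GcpCloudStorageConnector)

# Project only the columns you need and filter as early as possible;
# with a columnar format like Parquet, projection avoids decoding
# columns the query never uses
result = opteryx.query("""
    SELECT order_id, amount
    FROM my-bucket.sales.orders
    WHERE order_date >= '2024-01-01'
""")
result.head()
```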
Example: Joining GCS Data with Local Data
```python
import opteryx
from opteryx.connectors import GcpCloudStorageConnector

# Register the GCS connector
opteryx.register_store("my-bucket", GcpCloudStorageConnector)

# Join data from GCS with a local file
result = opteryx.query("""
    SELECT
        gcs_data.customer_id,
        gcs_data.purchase_amount,
        local_data.customer_name
    FROM my-bucket.sales.transactions AS gcs_data
    JOIN 'local_customers.csv' AS local_data
        ON gcs_data.customer_id = local_data.id
""")

result.head()
```
Example: Combining GCS and BigQuery
```python
import opteryx
from opteryx.connectors import SqlConnector, GcpCloudStorageConnector
from sqlalchemy import create_engine

# Register GCS
opteryx.register_store("my-bucket", GcpCloudStorageConnector)

# Register BigQuery via SQLAlchemy; the "bigquery://" URL requires a
# BigQuery SQLAlchemy dialect (e.g. the sqlalchemy-bigquery package)
GCP_PROJECT = "your-gcp-project"
bq_engine = create_engine(f"bigquery://{GCP_PROJECT}")
opteryx.register_store(
    prefix="bq",
    connector=SqlConnector,
    remove_prefix=True,  # strip "bq." before passing table names to BigQuery
    engine=bq_engine,
)

# Query joining GCS files with BigQuery tables
result = opteryx.query("""
    SELECT
        gcs.transaction_id,
        gcs.amount,
        bq.customer_name
    FROM my-bucket.transactions AS gcs
    JOIN bq.customers AS bq
        ON gcs.customer_id = bq.id
""")

result.head()
```
Troubleshooting
Permission Denied Errors
- Verify your GCP credentials are configured correctly
- Check that your service account has the `storage.objects.get` permission for the bucket
- Run `gcloud auth application-default login` for local development
File Not Found

- Ensure the GCS path is correct
- Verify the bucket name and object path exist
- Check bucket permissions
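To confirm an object path outside Opteryx, you can list the prefix with the gsutil CLI (the bucket and path below are placeholders):

```bash
gsutil ls gs://my-bucket/data/planets/
```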
Slow Queries

- Consider using Parquet instead of CSV for better performance
- Apply WHERE clauses to filter data early
- Check your network connection to Google Cloud