Skip to content

Changelog

All notable changes to this project will be documented in this file, where appropriate the GitHub issue reference will be noted along with the change. Breaking changes will be clearly indicated with the icon.

The format is based on Keep a Changelog.

[0.18.0] - 2024-10-15

Added

Fixed

Changed

  • [#1990] Flag LIST_CONTAINS_ANY and LIST_CONTAINS_ALL as Deprecated @joocer
  • [#2013] MemoryPool based on Python's multiprocessing buffer for storage @joocer
  • [#2021] Error raised when subqueries expose columns with duplicate names. @joocer
  • [#2066] Reduce materialization of ARRAY columns. @joocer

[0.17.0] - 2024-09-05

Added

Fixed

Changed

[0.16.0] - 2024-07-20

Added

Fixed

[0.15.9] - 2024-06-29

Added

  • [#1778] Specialized INNER JOIN operator for INTEGERS and VARCHARS @joocer
  • [#1768] Date calculation predicate rewriter (optimizer) @joocer

Fixed

  • [#1788] Errors reading files reported as SQL syntax errors. @joocer

[0.15.8] - 2024-06-29

Added

Fixed

  • [#1700] Mac versions not building and publishing to PyPI @joocer

[0.15.5] - 2024-06-18

Changed

  • [#1759] -> and ->> operators able to parse JSON strings and bytes @joocer

[0.15.4] - 2024-06-17

Added

  • [#1746] Support legacy Mabel LZMA compressed JSONL files @joocer
  • [#1748] S3/MinIO Connector supports async reads @joocer

Changed

Fixed

  • [#1756] Sort incorrectly returning zero records when first morsel is empty @joocer

[0.15.3] - 2024-06-06

Changed

  • [#1715] More optimizations AND, OR and XOR aware @joocer
  • [#1717] Memcahed notified when resource read from local buffers @joocer

[0.15.1] - 2024-05-31

Added

  • [#1685] [ClickBench] Include BLOB as column types for predicate pushdowns @joocer
  • [#1697] [ClickBench] Optimizer removes redundant operators @joocer
  • [#1698] [ClickBench] Replace LIKE conditions with STARTS_WITH and ENDS_WITH functions @joocer
  • [#1581] ANY operator supports literal lists @joocer
  • [#1690] Support REGEXP_REPLACE function @joocer

[0.15.0] - 2024-05-26

Added

Changed

Fixed

[0.14.1] - 2024-04-13

Added

  • [#1571] Initial COSINE_SIMILARITY support (ad hoc). @joocer

Fixed

  • [#1573] Parquet COUNT(*) returns unfiltered row count. @joocer

[0.14.0] - 2024-04-07

Added

Changed

Fixed

  • [#1462] Unhandled exception on empty datasets. @joocer
  • [#1465] Clashing column names not aliased correctly. @joocer
  • [#1445] SHOW EXTENDED COLUMNS not working as expected - changes made to data profile format. @joocer
  • [#1473] INNER JOIN statistics not correct. @joocer
  • [#1474] Unhelpful error message when SELECT * is mixed with column references @joocer
  • [#1487] Filters are not applied to scans on specific conditions involving JOINs @joocer
  • [#1480] INTERVAL cannot be compared to durations @joocer
  • [#1485] Improved handling of COUNT(*) when pushing to parquet and SQL @joocer
  • [#1513] LIST_CONTAINS_ANY performance @joocer

[0.13.3] - 2024-02-13

Added

  • [#1427] Boolean simplification optimization strategy. @joocer

Fixed

Changes

  • [#1421] Improve CROSS JOIN UNNEST performance. @joocer
  • [#1410] Whilst working on JOIN improvements, improvements to DISTINCT made. @joocer
  • Refactored SQL Fuzzer regression test script. @joocer

[0.13.0] - 2024-02-03

Added

  • [#1404] Optimize single element literal IN conditions @joocer

Fixed

Changed

[0.12.2] - 2024-01-10

Added

Fixed

Changed

  • [#1366] Subscript refactor for performance (~3x faster). @joocer

[0.12.0] - 2024-01-02

Fixed

  • [#1080] Windows regression test failures. @joocer
  • Soundex incorrectly evaluated empty strings as '0000'. @joocer

Changed

Added

Removed

  • Python 3.8 is no longer supported. @joocer

[0.11.0] - 2023-06-16

Fixed

  • [#1069] Minor improvements identified during code review of code to generate numeric series. @joocer
  • [#1072] Minor improvements identified during code review of code to handle dates and intervals. @joocer
  • [#1026] Removed pin to version 0.11 of PyArrow dependabot
  • [#1077] Removed pin to version 0.7.1 of DuckDB dependabot

Changed

Added

  • [#1034] Schemas added for the internal sample datasets. @joocer
  • [#1038] Able to pass SqlAlchemy Engine to the SQL Connectors, allowing for more complex authentication scenarios. @joocer
  • [#1065] Support integer division operator DIV. @joocer

[0.10.0] - 2023-05-03

Warnings

  • .to_df() will be replaced with .pandas() in version 0.11.

Fixed

  • [#929] Improved error messages for malformed temporal clauses. @joocer
  • [#735] (correction) Cursor fetchone and fetchmany step over the record set. @joocer
  • [#994] LIMIT didn't prevent additional files from being read after limit was met. @joocer
  • [#996] Performance issues with LIMIT and serialization steps. @joocer
  • [#1008] JOIN on a literal fails when attempting to find good match. @joocer
  • [#1006] Errors handling filenames with multiple dots in the name. @joocer
  • [#1010] Predicates not pushed for ZSTD compressed files. @joocer
  • [#1007] Wildcards not interpretted correctly in some projection pushdowns @joocer
  • [#1015] Column comparisons not working as expected in predicate pushdowns @joocer

Changed

Added

[0.9.3] - 2023-03-04

Fixed

  • [#916] Profile error on morsel with all nulls in column @joocer
  • Correctness of LRU-K algorithm @joocer
  • [#917] Comparisons failed on very long and skinny tables @joocer

[0.9.2] - 2023-02-28

Fixed

  • [#909] Divide by Zero error handling empty pages @joocer
  • [#912] Literal expressioned which evaluate to a boolean were ignored @joocer

Changed

  • [#901] Generate Series no longer accepts single numbers or IP ranges, provide explicit start or use | to test IP address containment @joocer
  • [#848] Collection and SQL Connectors dynamically size reads to fill target morsel size @joocer

[0.9.1] - 2023-02-23

Fixed

[0.9.0] - 2023-02-19

Fixed

  • [#797] Name collisons with aliases cause issues in ORDER BY. @joocer
  • [#833] Unhelpful error when no statement is provided @joocer
  • [#870] Repeated columns in GROUP BY not processed @joocer
  • [#873] 2 x CodeQL security issues @joocer

Changed

  • [#799] Chunk large blob reads. @joocer
  • [#812] Abstract the tree structure that plans are built from. @joocer
  • [#808] Split Logical and Physical planning (partial). @joocer
  • [#825] Remove HyperLogLog from profiling. @joocer
  • [#750] More CLI improvements. @joocer
  • [#589] Moved conditional imports out of program initialization @joocer
  • [#836] Use PyArrow 11s exposure of underlying date values in profiler @joocer
  • [#853] CaskDB replaces RocksDB as default KV store @joocer
  • [#855] Caches have been renamed and separated from KV Stores to disencourage incorrect use; The Memcache Cache is now imported using from opteryx.managers.cache import MemcachedCache @joocer
  • [#857] Removed PyYAML install @joocer
  • [#865] Replaced third-party DATE_TRUNC implementation with a first-party implementation @joocer
  • [#861] Replaced third-party bitarray library with a first-party implementation @joocer
  • [#871] Consistently name internal variables relating to chunks of data to 'morsel' (technically breaking, but no user impact expected) @joocer
  • [#880] Minor performance improvements @joocer

Added

[0.8.3] - 2023-01-10

Fixed

  • [#782] Support literal predicates in JOIN conditions. @joocer

Changed

Added

[0.8.2] - 2023-01-06

Fixed

  • [#757] Multiple bugs in config manager. @joocer
  • [#769] ARRAY_AGG couldn't be nested. @joocer
  • [#775] Connection function .arrow() materializes before applying limit. @joocer

Changed

Added

[0.8.1] - 2022-12-30

Fixed

[0.8.0] - 2022-12-27

Fixed

  • [#703] ORDER BY columns not in SELECT clause. @joocer
  • [#712] Aggregates on literals when combined with a GROUP BY clause. @joocer
  • [#710] SEARCH mishandles pages with empty values in first row. @joocer
  • [#711] DATE_TRUNC is case sensitive. @joocer

Changed

  • [#707] First try to estimate unique values using the Distogram for SHOW EXTENDED COLUMNS. @joocer
  • [#707] SHOW EXTENDED COLUMNS creates histograms of 20 bins. @joocer
  • [#707] Distogram (data profiler) significant performance improvements. @joocer
  • [#722] Allow temporal FOR after alias AS clauses. @joocer
  • [#743] 'Did you mean' prompt for columns better suggestions when casing is different. @joocer

Added

[0.7.0] - 2022-12-02

Fixed

  • [#653] LIKE and FOR clauses cannot coexist in SHOW queries. @joocer
  • [#669] COUNT(*) cannot be mixed with other aggregates. @joocer
  • [#518] SELECT * and GROUP BY can't be used together. @joocer
  • [#689] IS comparisons cannot be combined with other comparisons when optimization is off. @joocer

Changed

Added

[0.6.0] - 2022-11-08

Fixed

  • [#568] Unable to perform aggregates on literals. @joocer
  • [#592] Dates not always handled correctly. @joocer
  • [#600] Parameterization when used on query batches fails. @joocer
  • [#580] Empty result sets have no column information. @joocer
  • [#548] 'did you mean' message restored for dataset WITH hints. @joocer
  • [#640] COUNT(*) shortcut only used when in uppercase. @joocer
  • [#645] (correction) null values not handled correctly in comparisions. @joocer
  • Problem installing on M1 Mac. @joocer
  • Support AND, OR, and XOR in SELECT statement. @joocer
  • [#646] Temporal clauses in incorrect place were ignored @joocer

Changed

  • [#566] Change from using SQLite3 to DuckDB for SQL comparision tests in Wrenchy-Bench. @joocer
  • [#584] (clarity) enable_page_management configuration and parameter renamed enable_page_defragmentation with some minor refactoring of approach to defragmentation. @joocer
  • (alignment) TIMESTAMP casting no longer supports casting from a number. @joocer
  • [#588] Integrate sqloxide into Opteryx to reduce lag with sqlparser-rs updates. @joocer
  • [#619] Page defragmentation moved to an Operator and positioned by the Optimizer. @joocer
  • (correction) cursor 'fetch*' methods return Python tuple, rather than Python lists. @joocer

Added

  • [#533] Support LIKE on SHOW FUNCTIONS, see sqlparser-rs/#620. @joocer
  • [#570] Query Optimizer rule to reduce steps in expression evaluation by partial elimination of negatives. @joocer
  • [#129] Support FOR clauses for all datasets. @joocer
  • [#543] Support 'type string' notation for casting values. @joocer
  • [#596] Optimizer replaces ORDER BY and LIMIT plan steps with a single 'HeapSort' plan step. @joocer
  • [#515] NULLIF function. @joocer
  • [#581] New SQL Battery test that tests results, and initial set of tests. @joocer
  • [#577] Hierarchical buffer pool and configuration. @joocer

[0.5.0] - 2022-10-02

Fixed

  • [#528] .shape() and .count() not working as expected. @joocer
  • Numbers expressed in the form +n not parsed correctly. @joocer

Changed

  • (alignment) .as_arrow() renamed to .arrow() to align to DuckDB naming. @joocer
  • (consistency) SHOW COLUMNS returns the column name in the name column, previously column_name @joocer
  • (correction) cursor 'fetch*' methods returns tuples rather than dictionaries as defaults, this is correcting a bug in PEP249 compatibility. @joocer
  • [#517] (security) Placeholder changed from '%s' to '?'. @joocer
  • [#522] Implementation of LRU-K(2) for cache evictions. @joocer
  • [#537] Significant refactor of Query Planner. @joocer

Added

  • [#397] Time Travel with '$planets' dataset. @joocer
  • [#519] Introduce a size limit on .as_arrow(). @joocer
  • [#324] Support IN UNNEST(). @joocer
  • [#386] Support SET statements. @joocer
  • [#531] Support SHOW VARIABLES and SHOW PARAMETERS. @joocer
  • [#464] Support LEFT JOIN <relation> USING @joocer
  • [#402] INNER JOIN ON supports multiple conditions @joocer
  • [#551] Document stores (MongoDb + FireStore) return '_id' column holding string version of document ID. @joocer
  • [#532] Runtime parameters are able to be altered using the SET statement. @joocer
  • [#524] Query Optimizer - conjunctive predicate splitter. @joocer

[0.4.1] - 2022-09-12

Fixed

  • Fixed missing __init__ file. @joocer

[0.4.0] - 2022-09-12

Added

Changed

  • [#457] (correction) null values are removed before performing INNER JOIN USING. @joocer

Fixed

  • [#448] VERSION() failed and missing from regression suite. @joocer
  • [#404] COALESCE fails for NaN values. @joocer
  • [#453] PyArrow bug with long lists creating new columns. @joocer
  • [#444] Very low cardinality INNER JOINS exceed memory allocation. @joocer
  • [#459] Functions lose some detail on non-first page. @joocer
  • [#465] Pages aren't matched to schema for simple queries. @joocer
  • [#468] Parquet reader shows some fields as "item". @joocer
  • [#471] Column aliases not correctly applied when the relation has an alias. @joocer
  • [#489] Intermittent behaviour on hash JOIN algorithm. @joocer

[0.3.0] - 2022-08-28

Added

  • [#196] Partial implementation of projection pushdown (Parquet Only). @joocer
  • [#41] Enable the results of functions to be used as parameters for other functions. @joocer
  • [#42] Enable inline operations. @joocer
  • [#330] Support SIMILAR TO alias for RegEx match. @joocer
  • [#331] Support SAFE_CAST alias for TRY_CAST. @joocer
  • [#419] Various simple functions (SIGN, SQRT, TITLE, REVERSE). @joocer
  • [#364] Support SOUNDEX function. @joocer
  • [#401] Support SHA-based hash algorithm functions. @joocer

Changed

  • (alignment) Paths to storage adapters has been updated to reflect 'connector' terminology.
  • (sensible defaults) Default behaviour changed from Mabel partitioning to no partitioning.
  • (correction) - Use of aliases defined in the SELECT clause can no longer be used in WHERE and GROUP BY clauses - this is a correction to align to standard SQL behaviour.
  • (correction) - Use of 'None' as an alias for null is no longer supported - this is a correction to align to standard SQL behaviour.
  • [#326] Prefer pyarrow's 'promote' over manually handling missing fields. @joocer
  • [#39] Rewrite Aggregation Node to use Pyarrow group_by(). @joocer
  • [#338] Remove Evaluation Node. @joocer
  • [#58] Performance of ORDER BY RAND() improved. @joocer

Fixed

  • [#334] All lists should be cast to lists of strings. (@joocer)
  • [#382] INNER JOIN on UNNEST relation. (@joocer)
  • [#320] Can't execute functions on results of GROUP BY. (@joocer)
  • [#399] Strings in double quotes aren't parsed. (@joocer)

[0.2.0] - 2022-07-31

Added

Changed

  • [#35] Table scan planning done during query planning. @joocer
  • [#173] Data not found raises different errors under different scenarios. (@joocer)
  • Implementation of LEFT and RIGHT functions to reduce execution time. (@joocer)
  • [#258] Code release approach. (@joocer)
  • [#295] Removed redundant projection when SELECT *. (@joocer)
  • [#297] Filters on SHOW COLUMNS execute before profiling. (@joocer)

Fixed

  • [#252] Planner should gracefully convert byte strings to ascii strings. (@joocer)
  • [#184] Schema changes cause unexpected and unhelpful failures. (@joocer)
  • [#261] Read fails if buffer cache is unavailable. (@joocer)
  • [#277] Cache errors should be transparent. (@joocer)
  • [#285] DISTINCT on nulls throws error. (@joocer)
  • [#281] SELECT on empty aggregates reports missing columns. (@joocer)
  • [#312] Invalid dates in FOR clauses treated as TODAY. (@joocer)

[0.1.0] - 2022-07-02

Added

  • [#165] Support S3/MinIO data stores for blobs. (@joocer)
  • FAKE dataset constructor (part of #179). (@joocer)
  • [#177] Support SHOW FULL COLUMNS to read entire datasets rather than just the first blob. (@joocer)
  • [#194] Functions that are abbreviations, should have the full name as an alias. (@joocer)
  • [#201] generate_series() supports CIDR expansion. (@joocer)
  • [#175] Support WITH (NO_CACHE) hint to disable using cache. (@joocer)
  • [#203] When reporting that a column doesn't exist, it should suggest likely correct columns. (@joocer)
  • 'Not' Regular Expression match operator, !~ added to supported set of operators. (@joocer)
  • [#226] Implement DATE_TRUNC function. (@joocer)
  • [#230] Allow addressing fields as numbers. (@joocer)
  • [#234] Implement SEARCH function. (@joocer)
  • [#237] Implement COALESCE function. (@joocer)

Changed

  • Blob-based readers (disk & GCS) moved from 'local' and 'network' paths to a new 'blob' path. (@joocer)
  • Query Execution rewritten. (@joocer)
  • [#20] Split query planner and query plan into different modules. (@joocer)
  • [#164] Split dataset reader into specific types. (@joocer)
  • Expression evaluation short-cuts execution when executing evaluations against an array of null. (@joocer)
  • [#244] Improve performance of IN test against literal lists. (@joocer)

Fixed

  • [#172] LIKE on non string column gives confusing error (@joocer)
  • [#179] Aggregate Node creates new metadata for each chunk (@joocer)
  • [#183] NOT doesn't display in plan correctly (@joocer)
  • [#182] Unable to evaluate valid filters (@joocer)
  • [#178] SHOW COLUMNS returns type OTHER when it can probably work out the type (@joocer)
  • [#128] JOIN fails, using PyArrow .join() (@joocer)
  • [#189] Explicit JOIN algorithm exceeds memory (@joocer)
  • [#199] SHOW EXTENDED COLUMNS blows memory allocations on large tables (@joocer)
  • [#169] Selection nodes in EXPLAIN have nested parentheses. (@joocer)
  • [#220] LIKE clause fails for columns that contain nulls. (@joocer)
  • [#222] Column of NULL detects as VARCHAR. (@joocer)
  • [#225] UNNEST does not assign a type to the column when all of the values are NULL. (@joocer)

[0.0.2] - 2022-06-03

Added

  • [#72] Configuration is now read from opteryx.yaml rather than the environment. (@joocer)
  • [#139] Gather statistics on planning reading of segements. (@joocer)
  • [#151] Implement SELECT table.*. (@joocer)
  • [#137] GENERATE_SERIES function. (@joocer)

Fixed

  • [#106] ORDER BY on qualified fields fails (@joocer)
  • [#103] ORDER BY after JOIN errors (@joocer)
  • [#110] SubQueries AS statement ignored (@joocer)
  • [#112] SHOW COLUMNS doesn't work for non sample datasets (@joocer)
  • [#113] Sample data has "NaN" as a string, rather than the value NaN (@joocer)
  • [#111] CROSS JOIN UNNEST should return a NONE when the list is empty (or NONE) (@joocer)
  • [#119] 'NoneType' object is not iterable error on UNNEST (@joocer)
  • [#127] Reading from segments appears to only read the first segment (@joocer)
  • [#132] Multiprocessing regressed Caching functionality (@joocer)
  • [#140] Appears to have read both frames rather than the latest frame (@joocer)
  • [#144] Multiple JOINS in one query aren't recognized (@joocer)

[0.0.1] - 2022-05-09

Added

  • Additional statistics recording the time taken to scan partitions (@joocer)
  • Support for FULL JOIN and RIGHT JOIN (@joocer)

Changed

  • Use PyArrow implementation for INNER JOIN and LEFT JOIN (@joocer)

Fixed

  • [#99] Grouping by a list gives an unhelpful error message (@joocer)
  • [#100] Projection ignores field qualifications (@joocer)

[0.0.0]

  • Initial Version