Releases: apache/datafusion-comet
0.4.0
DataFusion Comet 0.4.0 Changelog
This release consists of 51 commits from 10 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: Use the number of rows from underlying arrays instead of logical row count from RecordBatch #972 (viirya)
- fix: The spilled_bytes metric of CometSortExec should be size instead of time #984 (Kontinuation)
- fix: Properly handle Java exceptions without error messages; fix loading of comet native library from java.library.path #982 (Kontinuation)
- fix: Fallback to Spark if scan has meta columns #997 (viirya)
- fix: Fallback to Spark if named_struct contains duplicate field names #1016 (viirya)
- fix: Make comet-git-info.properties optional #1027 (andygrove)
- fix: TopK operator should return correct results on dictionary column with nulls #1033 (viirya)
- fix: need default value for getSizeAsMb(EXECUTOR_MEMORY.key) #1046 (neyama)
Performance related:
- perf: Remove one redundant CopyExec for SMJ #962 (andygrove)
- perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin #1007 (andygrove)
- perf: Cache jstrings during metrics collection #1029 (mbutrovich)
Implemented enhancements:
- feat: Support
GetArrayStructFields
expression #993 (Kimahriman) - feat: Implement bloom_filter_agg #987 (mbutrovich)
- feat: Support more types with BloomFilterAgg #1039 (mbutrovich)
- feat: Implement CAST from struct to string #1066 (andygrove)
- feat: Use official DataFusion 43 release #1070 (andygrove)
- feat: Implement CAST between struct types #1074 (andygrove)
- feat: support array_append #1072 (NoeB)
- feat: Require offHeap memory to be enabled (always use unified memory) #1062 (andygrove)
Documentation updates:
- doc: add documentation interlinks #975 (comphead)
- docs: Add IntelliJ documentation for generated source code #985 (mbutrovich)
- docs: Update tuning guide #995 (andygrove)
- docs: Various documentation improvements #1005 (andygrove)
- docs: clarify that Maven central only has jars for Linux #1009 (andygrove)
- doc: fix K8s links and doc #1058 (comphead)
- docs: Update benchmarking.md #1085 (rluvaton-flarion)
Other:
- chore: Generate changelog for 0.3.0 release #964 (andygrove)
- chore: fix publish-to-maven script #966 (andygrove)
- chore: Update benchmarks results based on 0.3.0-rc1 #969 (andygrove)
- chore: update rem expression guide #976 (kazuyukitanimura)
- chore: Enable additional CreateArray tests #928 (Kimahriman)
- chore: fix compatibility guide #978 (kazuyukitanimura)
- chore: Update for 0.3.0 release, prepare for 0.4.0 development #970 (andygrove)
- chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled #991 (viirya)
- chore: Make parquet reader options Comet options instead of Hadoop options #968 (parthchandra)
- chore: remove legacy comet-spark-shell #1013 (andygrove)
- chore: Reserve memory for native shuffle writer per partition #988 (viirya)
- chore: Bump arrow-rs to 53.1.0 and datafusion #1001 (kazuyukitanimura)
- chore: Revert "chore: Reserve memory for native shuffle writer per partition (#988)" #1020 (viirya)
- minor: Remove hard-coded version number from Dockerfile #1025 (andygrove)
- chore: Reserve memory for native shuffle writer per partition #1022 (viirya)
- chore: Improve error handling when native lib fails to load #1000 (andygrove)
- chore: Use twox-hash 2.0 xxhash64 oneshot api instead of custom implementation #1041 (NoeB)
- chore: Refactor Arrow Array and Schema allocation in ColumnReader and MetadataColumnReader #1047 (viirya)
- minor: Refactor binary expr serde to reduce code duplication #1053 (andygrove)
- chore: Upgrade to DataFusion 43.0.0-rc1 #1057 (andygrove)
- chore: Refactor UnaryExpr and MathExpr in protobuf #1056 (andygrove)
- minor: use defaults instead of hard-coding values #1060 (andygrove)
- minor: refactor UnaryExpr handling to make code more concise #1065 (andygrove)
- chore: Refactor binary and math expression serde code #1069 (andygrove)
- chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator #1063 (viirya)
- test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config #1087 (viirya)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
19 Andy Grove
13 Matt Butrovich
8 Liang-Chi Hsieh
3 KAZUYUKI TANIMURA
2 Adam Binford
2 Kristin Cowalcijk
1 NoeB
1 Oleks V
1 Parth Chandra
1 neyama
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
0.3.0
DataFusion Comet 0.3.0 Changelog
This release consists of 57 commits from 12 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: Support type coercion for ScalarUDFs #865 (Kimahriman)
- fix: CometTakeOrderedAndProjectExec native scan node should use child operator's output #896 (viirya)
- fix: Fix various memory leaks problems #890 (Kontinuation)
- fix: Add output to Comet operators equal and hashCode #902 (viirya)
- fix: Fallback to Spark when cannot resolve AttributeReference #926 (viirya)
- fix: Fix memory bloat caused by holding too many unclosed
ArrowReaderIterator
s #929 (Kontinuation) - fix: Normalize NaN and zeros for floating number comparison #953 (viirya)
- fix: window function range offset should be long instead of int #733 (huaxingao)
- fix: CometScanExec on Spark 3.5.2 #915 (Kimahriman)
- fix: div and rem by negative zero #960 (kazuyukitanimura)
Performance related:
- perf: Optimize CometSparkToColumnar for columnar input #892 (mbutrovich)
- perf: Fall back to Spark if query uses DPP with v1 data sources #897 (andygrove)
- perf: Report accurate total time for scans #916 (andygrove)
- perf: Add metric for time spent casting in native scan #919 (andygrove)
- perf: Add criterion benchmark for aggregate expressions #948 (andygrove)
- perf: Add metric for time spent in CometSparkToColumnarExec #931 (mbutrovich)
- perf: Optimize decimal precision check in decimal aggregates (sum and avg) #952 (andygrove)
Implemented enhancements:
- feat: Add config option to enable converting CSV to columnar #871 (andygrove)
- feat: Implement basic version of string to float/double/decimal #870 (andygrove)
- feat: Implement to_json for subset of types #805 (andygrove)
- feat: Add ShuffleQueryStageExec to direct child node for CometBroadcastExchangeExec #880 (viirya)
- feat: Support sort merge join with a join condition #553 (viirya)
- feat: Array element extraction #899 (Kimahriman)
- feat: date_add and date_sub functions #910 (mbutrovich)
- feat: implement scripts for binary release build #932 (parthchandra)
- feat: Publish artifacts to maven #946 (parthchandra)
Documentation updates:
- doc: Documenting Helm chart for Comet Kube execution #874 (comphead)
- doc: Update native code path in development #921 (viirya)
- docs: Add more detailed architecture documentation #922 (andygrove)
Other:
- chore: Update installation.md #869 (haoxins)
- chore: Update versions to 0.3.0 / 0.3.0-SNAPSHOT #868 (andygrove)
- chore: Add documentation on running benchmarks with Microk8s #848 (andygrove)
- chore: Improve CometExchange metrics #873 (viirya)
- chore: Add spilling metrics of SortMergeJoin #878 (viirya)
- chore: change shuffle mode default from jvm to auto #877 (andygrove)
- chore: Enable shuffle by default #881 (andygrove)
- chore: print Comet native version to logs after Comet is initialized #900 (SemyonSinchenko)
- chore: Revise batch pull approach to more follow C Data interface semantics #893 (viirya)
- chore: Close dictionary provider when iterator is closed #904 (andygrove)
- chore: Remove unused function #906 (viirya)
- chore: Upgrade to latest DataFusion revision #909 (andygrove)
- build: fix build #917 (andygrove)
- chore: Revise array import to more follow C Data Interface semantics #905 (viirya)
- chore: Address reviews #920 (viirya)
- chore: Enable Comet shuffle for Spark core-1 test #924 (viirya)
- build: Add maven-compiler-plugin for java cross-build #911 (viirya)
- build: Disable upload-test-reports for macos-13 runner #933 (viirya)
- minor: cast timestamp test #468 #923 (himadripal)
- build: Set Java version arg for scala-maven-plugin #934 (viirya)
- chore: Remove redundant RowToColumnar from CometShuffleExchangeExec for columnar shuffle #944 (viirya)
- minor: rename CometMetricNode
add
toset
and update documentation #940 (andygrove) - chore: Add config for enabling SMJ with join condition #937 (andygrove)
- chore: Change maven group ID to
org.apache.datafusion
#941 (andygrove) - chore: Upgrade to DataFusion 42.0.0 #945 (andygrove)
- build: Fix regression in jar packaging #950 (andygrove)
- chore: Show reason for falling back to Spark when SMJ with join condition is not enabled #956 (andygrove)
- chore: clarify tarball installation #959 (comphead)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
22 Andy Grove
18 Liang-Chi Hsieh
3 Adam Binford
3 Matt Butrovich
2 Kristin Cowalcijk
2 Oleks V
2 Parth Chandra
1 Himadri Pal
1 Huaxin Gao
1 KAZUYUKI TANIMURA
1 Semyon
1 Xin Hao
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
0.2.0
DataFusion Comet 0.2.0 Changelog
This release consists of 87 commits from 14 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: dictionary decimal vector optimization #705 (kazuyukitanimura)
- fix: Unsupported window expression should fall back to Spark #710 (viirya)
- fix: ReusedExchangeExec can be child operator of CometBroadcastExchangeExec #713 (viirya)
- fix: Fallback to Spark for window expression with range frame #719 (viirya)
- fix: Remove
skip.surefire.tests
mvn property #739 (wForget) - fix: subquery execution under CometTakeOrderedAndProjectExec should not fail #748 (viirya)
- fix: skip negative scale checks for creating decimals #723 (kazuyukitanimura)
- fix: Fallback to Spark for unsupported partitioning #759 (viirya)
- fix: Unsupported types for SinglePartition should fallback to Spark #765 (viirya)
- fix: unwrap dictionaries in CreateNamedStruct #754 (andygrove)
- fix: Fallback to Spark for unsupported input besides ordering #768 (viirya)
- fix: Native window operator should be CometUnaryExec #774 (viirya)
- fix: Fallback to Spark when shuffling on struct with duplicate field name #776 (viirya)
- fix: withInfo was overwriting information in some cases #780 (andygrove)
- fix: Improve support for nested structs #800 (eejbyfeldt)
- fix: Sort on single struct should fallback to Spark #811 (viirya)
- fix: Check sort order of SortExec instead of child output #821 (viirya)
- fix: Fix panic in
avg
aggregate and disablestddev
by default #819 (andygrove) - fix: Supported nested types in HashJoin #735 (eejbyfeldt)
Performance related:
- perf: Improve performance of CASE .. WHEN expressions #703 (andygrove)
- perf: Optimize IfExpr by delegating to CaseExpr #681 (andygrove)
- fix: optimize isNullAt #732 (kazuyukitanimura)
- perf: decimal decode improvements #727 (parthchandra)
- fix: Remove castting on decimals with a small precision to decimal256 #741 (kazuyukitanimura)
- fix: optimize some bit functions #718 (kazuyukitanimura)
- fix: Optimize getDecimal for small precision #758 (kazuyukitanimura)
- perf: add metrics to CopyExec and ScanExec #778 (andygrove)
- fix: Optimize decimal creation macros #764 (kazuyukitanimura)
- perf: Improve count aggregate performance #784 (andygrove)
- fix: Optimize read_side_padding #772 (kazuyukitanimura)
- perf: Remove some redundant copying of batches #816 (andygrove)
- perf: Remove redundant copying of batches after FilterExec #835 (andygrove)
- fix: Optimize CheckOverflow #852 (kazuyukitanimura)
- perf: Add benchmarks for Spark Scan + Comet Exec #863 (andygrove)
Implemented enhancements:
- feat: Add support for time-zone, 3 & 5 digit years: Cast from string to timestamp. #704 (akhilss99)
- feat: Support count AggregateUDF for window function #736 (huaxingao)
- feat: Implement basic version of RLIKE #734 (andygrove)
- feat: show executed native plan with metrics when in debug mode #746 (andygrove)
- feat: Add GetStructField expression #731 (Kimahriman)
- feat: Add config to enable native upper and lower string conversion #767 (andygrove)
- feat: Improve native explain #795 (andygrove)
- feat: Add support for null literal with struct type #797 (eejbyfeldt)
- feat: Optimze CreateNamedStruct preserve dictionaries #789 (eejbyfeldt)
- feat:
CreateArray
support #793 (Kimahriman) - feat: Add native thread configs #828 (viirya)
- feat: Add specific configs for converting Spark Parquet and JSON data to Arrow #832 (andygrove)
- feat: Support sum in window function #802 (huaxingao)
- feat: Simplify configs for enabling/disabling operators #855 (andygrove)
- feat: Enable
clippy::clone_on_ref_ptr
onproto
andspark_exprs
crates #859 (comphead) - feat: Enable
clippy::clone_on_ref_ptr
oncore
crate #860 (comphead) - feat: Use CometPlugin as main entrypoint #853 (andygrove)
Documentation updates:
- doc: Update outdated spark.comet.columnar.shuffle.enabled configuration doc #738 (wForget)
- docs: Add explicit configs for enabling operators #801 (andygrove)
- doc: Document CometPlugin to start Comet in cluster mode #836 (comphead)
Other:
- chore: Make rust clippy happy #701 (Xuanwo)
- chore: Update version to 0.2.0 and add 0.1.0 changelog #696 (andygrove)
- chore: Use rust-toolchain.toml for better toolchain support #699 (Xuanwo)
- chore(native): Make sure all targets in workspace been covered by clippy #702 (Xuanwo)
- Apache DataFusion Comet Logo #697 (aocsa)
- chore: Add logo to rat exclude list #709 (andygrove)
- chore: Use new logo in README and website #724 (andygrove)
- build: Add Comet logo files into exclude list #726 (viirya)
- chore: Remove TPC-DS benchmark results #728 (andygrove)
- chore: make Cast's logic reusable for other projects #716 (Blizzara)
- chore: move scalar_funcs into spark-expr #712 (Blizzara)
- chore: Bump DataFusion to rev 35c2e7e #740 (andygrove)
- chore: add more aggregate functions to benchmark test #706 (huaxingao)
- chore: Add criterion benchmark for decimal_div #743 (andygrove)
- build: Re-enable TPCDS q72 for broadcast and hash join configs #781 (viirya)
- chore: bump DataFusion to rev f4e519f #783 (huaxingao)
- chore: Upgrade to DataFusion rev bddb641 and disable "skip partial aggregates" feature [#788](https://github.com/apa...
0.1.0
DataFusion Comet 0.1.0 Changelog
This release consists of 343 commits from 41 contributors. See credits at the end of this changelog for more information.
Implemented enhancements:
- feat: Add native shuffle and columnar shuffle #30 (viirya)
- feat: Support Emit::First for SumDecimalGroupsAccumulator #47 (viirya)
- feat: Nested map support for columnar shuffle #51 (viirya)
- feat: Support Count(Distinct) and similar aggregation functions #42 (huaxingao)
- feat: Upgrade to
jni-rs
0.21 #50 (sunchao) - feat: Handle exception thrown from native side #61 (sunchao)
- feat: Support InSet expression in Comet #59 (viirya)
- feat: Add
CometNativeException
for exceptions thrown from the native side #62 (sunchao) - feat: Add cause to native exception #63 (viirya)
- feat: Pull based native execution #69 (viirya)
- feat: Add executeColumnarCollectIterator to CometExec to collect Comet operator result #71 (viirya)
- feat: Add CometBroadcastExchangeExec to support broadcasting the result of Comet native operator #80 (viirya)
- feat: Reduce memory consumption when writing sorted shuffle files #82 (sunchao)
- feat: Add struct/map as unsupported map key/value for columnar shuffle #84 (viirya)
- feat: Support multiple input sources for CometNativeExec #87 (viirya)
- feat: Date and timestamp trunc with format array #94 (parthchandra)
- feat: Support
First
/Last
aggregate functions #97 (huaxingao) - feat: Add support of TakeOrderedAndProjectExec in Comet #88 (viirya)
- feat: Support Binary in shuffle writer #106 (advancedxy)
- feat: Add license header by spotless:apply automatically #110 (advancedxy)
- feat: Add dictionary binary to shuffle writer #111 (viirya)
- feat: Minimize number of connections used by parallel reader #126 (parthchandra)
- feat: Support CollectLimit operator #100 (advancedxy)
- feat: Enable min/max for boolean type #165 (huaxingao)
- feat: Introduce
CometTaskMemoryManager
and native side memory pool #83 (sunchao) - feat: Fix old style names #201 (comphead)
- feat: enable comet shuffle manager for comet shell #204 (zuston)
- feat: Support bitwise aggregate functions #197 (huaxingao)
- feat: Support BloomFilterMightContain expr #179 (advancedxy)
- feat: Support sort merge join #178 (viirya)
- feat: Support HashJoin operator #194 (viirya)
- feat: Remove use of nightly int_roundings feature #228 (psvri)
- feat: Support Broadcast HashJoin #211 (viirya)
- feat: Enable Comet broadcast by default #213 (viirya)
- feat: Add CometRowToColumnar operator #206 (advancedxy)
- feat: Document the class path / classloader issue with the shuffle manager #256 (holdenk)
- feat: Port Datafusion Covariance to Comet #234 (huaxingao)
- feat: Add manual test to calculate spark builtin functions coverage #263 (comphead)
- feat: Support ANSI mode in CAST from String to Bool #290 (andygrove)
- feat: Add extended explain info to Comet plan #255 (parthchandra)
- feat: Improve CometSortMergeJoin statistics #304 (planga82)
- feat: Add compatibility guide #316 (andygrove)
- feat: Improve CometHashJoin statistics #309 (planga82)
- feat: Support Variance #297 (huaxingao)
- feat: Support murmur3_hash and sha2 family hash functions #226 (advancedxy)
- feat: Disable cast string to timestamp by default #337 (andygrove)
- feat: Improve CometBroadcastHashJoin statistics #339 (planga82)
- feat: Implement Spark-compatible CAST from string to integral types #307 (andygrove)
- feat: Implement Spark-compatible CAST from string to timestamp types #335 (vaibhawvipul)
- feat: Implement Spark-compatible CAST float/double to string #346 (mattharder91)
- feat: Only allow incompatible cast expressions to run in comet if a config is enabled #362 (andygrove)
- feat: Implement Spark-compatible CAST between integer types #340 (ganeshkumar269)
- feat: Supports Stddev #348 (huaxingao)
- feat: Improve cast compatibility tests and docs #379 (andygrove)
- feat: Implement Spark-compatible CAST from non-integral numeric types to integral types #399 (rohitrastogi)
- feat: Implement Spark unhex #342 (tshauck)
- feat: Enable columnar shuffle by default #250 (viirya)
- feat: Implement Spark-compatible CAST from floating-point/double to decimal #384 (vaibhawvipul)
- feat: Add logging to explain reasons for Comet not being able to run a query stage natively #397 (andygrove)
- feat: Add support for TryCast expression in Spark 3.2 and 3.3 #416 (vaibhawvipul)
- feat: Supports UUID column #395 (huaxingao)
- feat: correlation support #456 (huaxingao)
- feat: Implement Spark-compatible CAST from String to Date #383 (vidyasankarv)
- feat: Add COMET_SHUFFLE_MODE config to control Comet shuffle mode #460 (viirya)
- feat: Add random row generator in data generator #451 (advancedxy)
- feat: Add xxhash64 function support #424 (advancedxy)
- feat: add hex scalar function #449 (tshauck)
- feat: Add "Comet Fuzz" fuzz-testing utility #472 (andygrove)
- feat: Use enum to represent CAST eval_mode in expr.proto #415 (prashantksharma)
- feat: Implement ANSI support for UnaryMinus #471 (vaibhawvipul)
- feat: Add specific fuzz tests for cast and try_cast and fix NPE found during fuzz testing #514 (andygrove)
- feat: Add fuzz testing for arithmetic expressions #519 (andygr...