[SPARK-50080][SQL][TESTS] Add benchmark cases for parquet adaptive bloom filter in BloomFilterBenchmark

### What changes were proposed in this pull request?

Parquet's AdaptiveBlockSplitBloomFilter generates a bloom filter whose bit size is chosen according to the number of distinct values actually present in the data. This is not necessarily free: it maintains multiple BloomFilter candidates at runtime, which could increase CPU usage or write time.
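
To illustrate why several candidates can cost more at write time, here is a toy sketch in Python. It is not Parquet's implementation: the class names, the SHA-256 based hashing, the exact distinct-value tracking with a set, and the `bits_per_value` sizing rule are all assumptions made for the example. It only shows the general shape of the idea, where every insert touches every candidate and the smallest adequate candidate is kept at the end.

```python
import hashlib


class ToyBloom:
    """A tiny bloom filter backed by a Python int used as a bitset."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions from one SHA-256 digest (toy hashing).
        digest = hashlib.sha256(repr(value).encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


class ToyAdaptiveBloom:
    """Keeps several candidate filters and picks one when writing finishes."""

    def __init__(self, candidate_bits, bits_per_value=10):
        self.candidates = [ToyBloom(b) for b in sorted(candidate_bits)]
        self.bits_per_value = bits_per_value  # hypothetical sizing rule
        self.distinct = set()                 # simplification: exact distinct count
        self.inserts_performed = 0

    def add(self, value):
        self.distinct.add(value)
        # Every insert is repeated for every candidate: this is the extra cost.
        for candidate in self.candidates:
            candidate.add(value)
            self.inserts_performed += 1

    def finalize(self):
        # Keep the smallest candidate that is large enough for the observed data.
        needed_bits = len(self.distinct) * self.bits_per_value
        for candidate in self.candidates:
            if candidate.num_bits >= needed_bits:
                return candidate
        return self.candidates[-1]


adaptive = ToyAdaptiveBloom([64, 1024, 16384])
for v in range(50):
    adaptive.add(v)
chosen = adaptive.finalize()
```

With three candidates, 50 inserted values lead to 150 filter updates, which is the kind of overhead the new benchmark cases are meant to measure.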

This pull request adds benchmark cases that compare the adaptive bloom filter against the default BloomFilter size.
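
For readers who want to try the feature outside the benchmark, the sketch below builds the writer options that would enable it. It assumes the parquet-hadoop property names `parquet.bloom.filter.enabled`, `parquet.bloom.filter.adaptive.enabled`, and `parquet.bloom.filter.candidates.number`, and that they are passed as PySpark write options. Both helper names are hypothetical, and this is not the code added by this commit.

```python
def adaptive_bloom_options(candidates: int) -> dict:
    """Build Parquet writer options that turn on the adaptive bloom filter.

    The key names are assumed to match parquet-hadoop's writer properties;
    check them against the Parquet version you actually run.
    """
    if candidates < 1:
        raise ValueError("candidates must be positive")
    return {
        "parquet.bloom.filter.enabled": "true",
        "parquet.bloom.filter.adaptive.enabled": "true",
        "parquet.bloom.filter.candidates.number": str(candidates),
    }


def write_with_adaptive_bloom(df, path: str, candidates: int = 5) -> None:
    """Write a Spark DataFrame `df` to `path` (hypothetical helper)."""
    options = adaptive_bloom_options(candidates)
    df.write.options(**options).mode("overwrite").parquet(path)
```

The candidate counts used in the benchmark cases (3, 5, 9, 15) would map to the `candidates` argument here.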

### Why are the changes needed?

Improve benchmark coverage for common user-oriented features of the Parquet data source.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Benchmark golden files are attached.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#48609 from yaooqinn/SPARK-50080.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
yaooqinn authored and HyukjinKwon committed Oct 23, 2024
1 parent 51e915a commit 2cb7a16
Showing 3 changed files with 118 additions and 100 deletions.
104 changes: 54 additions & 50 deletions sql/core/benchmarks/BloomFilterBenchmark-jdk21-results.txt
@@ -2,191 +2,195 @@
ORC Write
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Write 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter 7996 8147 214 12.5 80.0 1.0X
-With bloom filter 9835 9843 13 10.2 98.3 0.8X
+Without bloom filter 8070 8132 88 12.4 80.7 1.0X
+With bloom filter 10025 10082 81 10.0 100.2 0.8X


================================================================================================
ORC Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 2097152 857 882 27 116.7 8.6 1.0X
-With bloom filter, blocksize: 2097152 578 599 18 173.1 5.8 1.5X
+Without bloom filter, blocksize: 2097152 882 890 7 113.4 8.8 1.0X
+With bloom filter, blocksize: 2097152 567 577 10 176.4 5.7 1.6X


================================================================================================
ORC Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 4194304 844 851 9 118.5 8.4 1.0X
-With bloom filter, blocksize: 4194304 551 588 27 181.4 5.5 1.5X
+Without bloom filter, blocksize: 4194304 810 836 22 123.4 8.1 1.0X
+With bloom filter, blocksize: 4194304 550 568 22 181.8 5.5 1.5X


================================================================================================
ORC Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 6291456 837 861 23 119.4 8.4 1.0X
-With bloom filter, blocksize: 6291456 555 591 54 180.2 5.5 1.5X
+Without bloom filter, blocksize: 6291456 823 836 11 121.5 8.2 1.0X
+With bloom filter, blocksize: 6291456 540 563 17 185.3 5.4 1.5X


================================================================================================
ORC Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 8388608 828 847 16 120.7 8.3 1.0X
-With bloom filter, blocksize: 8388608 529 560 39 189.0 5.3 1.6X
+Without bloom filter, blocksize: 8388608 797 821 21 125.5 8.0 1.0X
+With bloom filter, blocksize: 8388608 533 553 23 187.5 5.3 1.5X


================================================================================================
ORC Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 12582912 845 851 7 118.4 8.4 1.0X
-With bloom filter, blocksize: 12582912 547 578 44 182.7 5.5 1.5X
+Without bloom filter, blocksize: 12582912 859 876 15 116.4 8.6 1.0X
+With bloom filter, blocksize: 12582912 545 576 22 183.4 5.5 1.6X


================================================================================================
ORC Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 16777216 815 832 15 122.7 8.1 1.0X
-With bloom filter, blocksize: 16777216 534 559 26 187.1 5.3 1.5X
+Without bloom filter, blocksize: 16777216 810 841 26 123.4 8.1 1.0X
+With bloom filter, blocksize: 16777216 554 575 15 180.5 5.5 1.5X


================================================================================================
ORC Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 33554432 801 817 23 124.8 8.0 1.0X
-With bloom filter, blocksize: 33554432 528 538 11 189.4 5.3 1.5X
+Without bloom filter, blocksize: 33554432 845 852 7 118.4 8.4 1.0X
+With bloom filter, blocksize: 33554432 545 564 16 183.4 5.5 1.5X


================================================================================================
Parquet Write
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
-Write 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Without bloom filter 12129 12161 46 8.2 121.3 1.0X
-With bloom filter 20231 20267 50 4.9 202.3 0.6X
+Write 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
+---------------------------------------------------------------------------------------------------------------------------
+Without bloom filter 12141 12156 21 8.2 121.4 1.0X
+With bloom filter 21175 21296 172 4.7 211.7 0.6X
+With adaptive bloom filter & 3 candidates 20846 20897 71 4.8 208.5 0.6X
+With adaptive bloom filter & 5 candidates 20731 20989 365 4.8 207.3 0.6X
+With adaptive bloom filter & 9 candidates 23208 23264 79 4.3 232.1 0.5X
+With adaptive bloom filter & 15 candidates 23293 23349 78 4.3 232.9 0.5X


================================================================================================
Parquet Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 2097152 422 461 41 237.1 4.2 1.0X
-With bloom filter, blocksize: 2097152 170 179 6 589.5 1.7 2.5X
+Without bloom filter, blocksize: 2097152 451 502 37 221.9 4.5 1.0X
+With bloom filter, blocksize: 2097152 174 186 12 573.8 1.7 2.6X


================================================================================================
Parquet Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 4194304 397 421 17 251.6 4.0 1.0X
-With bloom filter, blocksize: 4194304 126 140 11 791.4 1.3 3.1X
+Without bloom filter, blocksize: 4194304 404 409 4 247.6 4.0 1.0X
+With bloom filter, blocksize: 4194304 139 150 7 719.2 1.4 2.9X


================================================================================================
Parquet Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 6291456 388 397 5 257.8 3.9 1.0X
-With bloom filter, blocksize: 6291456 150 159 9 667.1 1.5 2.6X
+Without bloom filter, blocksize: 6291456 416 423 7 240.5 4.2 1.0X
+With bloom filter, blocksize: 6291456 141 152 10 709.9 1.4 3.0X


================================================================================================
Parquet Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 8388608 380 387 5 263.1 3.8 1.0X
-With bloom filter, blocksize: 8388608 170 183 9 587.9 1.7 2.2X
+Without bloom filter, blocksize: 8388608 419 432 10 238.6 4.2 1.0X
+With bloom filter, blocksize: 8388608 210 223 7 476.2 2.1 2.0X


================================================================================================
Parquet Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 12582912 396 401 5 252.2 4.0 1.0X
-With bloom filter, blocksize: 12582912 301 335 20 332.0 3.0 1.3X
+Without bloom filter, blocksize: 12582912 422 430 9 236.8 4.2 1.0X
+With bloom filter, blocksize: 12582912 325 330 4 307.2 3.3 1.3X


================================================================================================
Parquet Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 16777216 404 414 7 247.7 4.0 1.0X
-With bloom filter, blocksize: 16777216 357 361 5 280.1 3.6 1.1X
+Without bloom filter, blocksize: 16777216 420 436 22 238.3 4.2 1.0X
+With bloom filter, blocksize: 16777216 398 428 29 251.2 4.0 1.1X


================================================================================================
Parquet Read
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Read a row from 100M rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Without bloom filter, blocksize: 33554432 408 419 19 244.8 4.1 1.0X
-With bloom filter, blocksize: 33554432 410 419 9 244.1 4.1 1.0X
+Without bloom filter, blocksize: 33554432 428 439 9 233.5 4.3 1.0X
+With bloom filter, blocksize: 33554432 430 441 15 232.4 4.3 1.0X
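
The derived columns in the Parquet Write table above can be recomputed from the best times. The sketch below copies the newer (21.0.5) best times from that table and assumes, based only on the printed values, that Rate is rows divided by best time, Per Row is best time divided by rows, and Relative is the first case's best time divided by each case's best time. These formulas are inferred from the output, not taken from the benchmark source, and because the inputs are rounded to whole milliseconds the last digit may differ from the table. The extra "vs default" column is not in the original output.

```python
ROWS = 100_000_000  # "Write 100M rows"

# (case name, best time in ms), copied from the newer Parquet Write results
parquet_write_best_ms = [
    ("Without bloom filter", 12141),
    ("With bloom filter", 21175),
    ("With adaptive bloom filter & 3 candidates", 20846),
    ("With adaptive bloom filter & 5 candidates", 20731),
    ("With adaptive bloom filter & 9 candidates", 23208),
    ("With adaptive bloom filter & 15 candidates", 23293),
]


def derive(best_ms, baseline_ms, rows=ROWS):
    """Return (rate in M rows/s, ns per row, relative speed vs baseline)."""
    rate_m_per_s = rows / (best_ms / 1000) / 1e6
    per_row_ns = best_ms * 1e6 / rows
    relative = baseline_ms / best_ms
    return rate_m_per_s, per_row_ns, relative


baseline_ms = parquet_write_best_ms[0][1]    # no bloom filter
default_bf_ms = parquet_write_best_ms[1][1]  # default-sized bloom filter

summary = {}
for name, best_ms in parquet_write_best_ms:
    rate, per_row, relative = derive(best_ms, baseline_ms)
    vs_default = best_ms / default_bf_ms - 1  # extra time vs default bloom filter
    summary[name] = (rate, per_row, relative, vs_default)
    print(f"{name:<45} {rate:5.1f} M/s {per_row:7.1f} ns/row "
          f"{relative:4.1f}X {vs_default:+6.1%} vs default")
```

On these numbers, the 3 and 5 candidate cases land within about 2% of the default bloom filter, which is within the reported run-to-run spread, while the 9 and 15 candidate cases take roughly 10% longer.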

