Add H2O.ai Database-like Ops benchmark to dfbench #7209

Closed
Tracked by #13548
alamb opened this issue Aug 5, 2023 · 15 comments · Fixed by #13996
@alamb
Contributor

alamb commented Aug 5, 2023

Is your feature request related to a problem or challenge?

Follow on to #7052
There is an interesting database benchmark called the "H2O.ai database-like ops benchmark" that DuckDB seems to have revived (perhaps because the original went dormant with very old / slow DuckDB results). More background here: https://duckdb.org/2023/04/14/h2oai.html#results

@Dandandan added a new solution for datafusion here: duckdblabs/db-benchmark#18

However, there is no easy way to run the h2o benchmark within the datafusion repo. There is an old version of some of these benchmarks in the code: https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/h2o.rs

Describe the solution you'd like

I would like someone to make it easy to run the h2o.ai benchmark in the datafusion repo.

Ideally this would look like

# generate data
./benchmarks/bench.sh data h2o.ai
# run
./benchmarks/bench.sh run h2o.ai

I would expect to be able to run the individual queries like this

cargo run  --bin dfbench -- h2o.ai --query=3

Some steps might be

  1. port the existing benchmark script to dfbench following the model in Add parquet-filter and sort benchmarks to dfbench #7120
  2. update bench.sh, following the model of existing benchmarks (see the sketch after this list)
  3. Update the documentation
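
For step 2, a rough sketch of what the bench.sh hookup could look like, modeled on the existing run_* helpers for the other benchmarks; the function name and the exact dfbench flags here are assumptions, not the final implementation:

run_h2o() {
    RESULTS_FILE="${RESULTS_DIR}/h2o.json"
    echo "RESULTS_FILE: ${RESULTS_FILE}"
    echo "Running h2o benchmark..."
    # assumed dfbench subcommand and flags, following the pattern of the other run_* functions
    $CARGO_COMMAND --bin dfbench -- h2o \
        --iterations 3 \
        --path "${DATA_DIR}/h2o/G1_1e7_1e7_100_0.parquet" \
        --queries-path "${SCRIPT_DIR}/queries/h2o/groupby.sql" \
        -o "${RESULTS_FILE}"
}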

Describe alternatives you've considered

We could also simply remove the h2o.ai benchmark script, as it is not clear how important it will be long term.

Additional context

I think this is a good first issue as the task is clear, and there are existing patterns to follow in bench.sh and dfbench.

alamb added the enhancement and good first issue labels on Aug 5, 2023
@palash25
Contributor

palash25 commented Aug 7, 2023

I would like to work on this

@alamb
Contributor Author

alamb commented Aug 7, 2023

Thank you @palash25

@palash25
Contributor

palash25 commented Sep 9, 2023

Sorry for the inactivity on this. My RSI came back, so I was taking a break from typing; I will try to submit the PR in a day or two.

@alamb
Contributor Author

alamb commented Sep 10, 2023

No problem -- I hope you feel better soon

@drewhayward
Contributor

Is this something that's still wanted? I took a look at doing this but it looks like the data isn't hosted on the benchmark repo, just data gen scripts in R.

@alamb
Contributor Author

alamb commented Jul 31, 2024

> Is this something that's still wanted? I took a look at doing this but it looks like the data isn't hosted on the benchmark repo, just data gen scripts in R.

I think it would be useful. Thank you

I think figuring out how to generate the data locally would be super valuable -- perhaps we can use a Docker-based approach as we do for tpch:

FILE="${TPCH_DIR}/supplier.tbl"
if test -f "${FILE}"; then
    echo " tbl files exist ($FILE exists)."
else
    echo " creating tbl files with tpch_dbgen..."
    docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s ${SCALE_FACTOR}
fi

So it would run like

./bench.sh data h2o

Which would leave data in datafusion/benchmarks/data/h2o

🤔
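
A rough sketch of how the same guard-and-generate pattern could look for the h2o data (illustrative only; no datagen image is named in this issue, so the docker image below is a placeholder):

H2O_DIR="${DATA_DIR}/h2o"
FILE="${H2O_DIR}/G1_1e7_1e7_100_0.parquet"
if test -f "${FILE}"; then
    echo " h2o data files exist (${FILE} exists)."
else
    echo " creating h2o data files..."
    mkdir -p "${H2O_DIR}"
    # placeholder: run the upstream R datagen (or another generator) in a container
    # that mounts ${H2O_DIR} as /data; the image name here is hypothetical
    docker run -v "${H2O_DIR}":/data -it --rm <h2o-datagen-image>
fi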

@alamb
Contributor Author

alamb commented Dec 13, 2024

Now that @Rachelint and @2010YOUY01 and others have started working on

I think this issue is more important than ever

I think the hardest part of this task is actually generating the benchmark data
Thankfully @MrPowers has created falsa to generate the dataset (so we don't need R installed):

Here are the instructions for generating data: https://github.com/MrPowers/mrpowers-benchmarks?tab=readme-ov-file#running-the-benchmarks-on-your-machine
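
If falsa works out, the generation step might look roughly like this (the Python >= 3.10 requirement is from the falsa docs; the exact subcommand and flag names are assumptions and should be checked against the falsa README):

# assumed: falsa is published on PyPI and requires Python >= 3.10
python3 -m venv h2o-venv && source h2o-venv/bin/activate
pip install falsa

# generate the group-by dataset into benchmarks/data/h2o
# (subcommand and flag names are assumptions from the falsa README)
falsa groupby --path-prefix=benchmarks/data/h2o --size SMALL --data-format PARQUET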

Here are the queries:

@drewhayward
Contributor

I don't currently have bandwidth to take this across the finish line but I did get the datagen working via Docker

Commit with data gen

@alamb
Contributor Author

alamb commented Dec 19, 2024

Thanks @drewhayward -- maybe someone else can pick it up from there

@zhuqi-lucas
Contributor

take

@zhuqi-lucas
Contributor

Hi @alamb, the draft PR works well and I tested the group by h2o benchmark, but the join seems to have some problems. I also submitted a question:

MrPowers/mrpowers-benchmarks#10 (comment)

@2010YOUY01
Contributor

> Hi @alamb, the draft PR works well and I tested the group by h2o benchmark, but the join seems to have some problems. I also submitted a question:
>
> MrPowers/mrpowers-benchmarks#10 (comment)

Looks like it's a todo tracked by mrpowers-io/falsa#21, perhaps we can skip join queries for now

@alamb
Contributor Author

alamb commented Jan 4, 2025

> Looks like it's a todo tracked by mrpowers-io/falsa#21, perhaps we can skip join queries for now

I think it is a good idea to skip the join queries for now and link to the todo found by @2010YOUY01

@zhuqi-lucas
Contributor

Sure @alamb @2010YOUY01, thanks. Let me change the PR to only support groupby and add a TODO for join.

@zhuqi-lucas
Contributor

zhuqi-lucas commented Jan 5, 2025

The PR testing results:
Data generation example; we can use a small, medium, or big dataset:

./benchmarks/bench.sh data h2o_small
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: h2o_small
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Python version 3.9 found, but version 3.10 or higher is required.
Using Python command: python3.12
Installing falsa...
Generating h2o test data in /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=SMALL and format=PARQUET
10000000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet

An output data schema is the following:
id1: string
id2: string
id3: string
id4: int64
id5: int64
id6: int64
v1: int64 not null
v2: int64 not null
v3: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04

Running example; we can use either ./benchmarks/bench.sh run or cargo run:

./benchmarks/bench.sh run h2o_small
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: h2o_small
DATAFUSION_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/..
BRANCH_NAME: issue_7209
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
RESULTS_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209/h2o.json
Running h2o benchmark...
   Compiling datafusion-benchmarks v44.0.0 (/Users/zhuqi/arrow-datafusion/benchmarks)
    Building [=======================> ] 337/338: dfbench(bin)

    Finished `release` profile [optimized] target(s) in 4m 41s
     Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench h2o --iterations 3 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/groupby.sql -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209/h2o.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, debug: false }, queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/groupby.sql", path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209/h2o.json") }
Q1: SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1;
Query 1 iteration 1 took 58.3 ms and returned 100 rows
Query 1 iteration 2 took 18.8 ms and returned 100 rows
Query 1 iteration 3 took 19.1 ms and returned 100 rows
Q2: SELECT id1, id2, SUM(v1) AS v1 FROM x GROUP BY id1, id2;
Query 2 iteration 1 took 196.0 ms and returned 6321413 rows
Query 2 iteration 2 took 148.5 ms and returned 6321413 rows
Query 2 iteration 3 took 142.1 ms and returned 6321413 rows
Q3: SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3;
Query 3 iteration 1 took 113.4 ms and returned 100000 rows
Query 3 iteration 2 took 113.1 ms and returned 100000 rows
Query 3 iteration 3 took 107.0 ms and returned 100000 rows
Q4: SELECT id4, AVG(v1) AS v1, AVG(v2) AS v2, AVG(v3) AS v3 FROM x GROUP BY id4;
Query 4 iteration 1 took 28.0 ms and returned 100 rows
Query 4 iteration 2 took 41.5 ms and returned 100 rows
Query 4 iteration 3 took 44.1 ms and returned 100 rows
Q5: SELECT id6, SUM(v1) AS v1, SUM(v2) AS v2, SUM(v3) AS v3 FROM x GROUP BY id6;
Query 5 iteration 1 took 64.1 ms and returned 100000 rows
Query 5 iteration 2 took 52.1 ms and returned 100000 rows
Query 5 iteration 3 took 50.0 ms and returned 100000 rows
Q6: SELECT id4, id5, MEDIAN(v3) AS median_v3, STDDEV(v3) AS sd_v3 FROM x GROUP BY id4, id5;
Query 6 iteration 1 took 225.0 ms and returned 10000 rows
Query 6 iteration 2 took 245.5 ms and returned 10000 rows
Query 6 iteration 3 took 224.8 ms and returned 10000 rows
Q7: SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 FROM x GROUP BY id3;
Query 7 iteration 1 took 111.0 ms and returned 100000 rows
Query 7 iteration 2 took 97.4 ms and returned 100000 rows
Query 7 iteration 3 took 95.1 ms and returned 100000 rows
Q8: SELECT id6, largest2_v3 FROM (SELECT id6, v3 AS largest2_v3, ROW_NUMBER() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2;
Query 8 iteration 1 took 386.7 ms and returned 200000 rows
Query 8 iteration 2 took 309.7 ms and returned 200000 rows
Query 8 iteration 3 took 301.9 ms and returned 200000 rows
Q9: SELECT id2, id4, POWER(CORR(v1, v2), 2) AS r2 FROM x GROUP BY id2, id4;
Query 9 iteration 1 took 614.5 ms and returned 6320797 rows
Query 9 iteration 2 took 572.8 ms and returned 6320797 rows
Query 9 iteration 3 took 591.2 ms and returned 6320797 rows
Q10: SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;
Query 10 iteration 1 took 492.9 ms and returned 10000000 rows
Query 10 iteration 2 took 332.5 ms and returned 10000000 rows
Query 10 iteration 3 took 375.3 ms and returned 10000000 rows
Done
cargo run --release --bin dfbench -- h2o --query 3 --debug
    Finished `release` profile [optimized] target(s) in 0.22s
     Running `target/release/dfbench h2o --query 3 --debug`
Running benchmarks with the following options: RunOpt { query: Some(3), common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, debug: true }, queries_path: "benchmarks/queries/h2o/groupby.sql", path: "benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet", output_path: None }
Q3: SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3;
Query 3 iteration 1 took 165.0 ms and returned 100000 rows
Query 3 iteration 2 took 112.6 ms and returned 100000 rows
Query 3 iteration 3 took 114.8 ms and returned 100000 rows
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: x.id3, sum(x.v1) AS v1, avg(x.v3) AS v3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|               |   Aggregate: groupBy=[[x.id3]], aggr=[[sum(x.v1), avg(x.v3)]]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|               |     TableScan: x projection=[id3, v1, v3]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| physical_plan | ProjectionExec: expr=[id3@0 as id3, sum(x.v1)@1 as v1, avg(x.v3)@2 as v3]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|               |   AggregateExec: mode=FinalPartitioned, gby=[id3@0 as id3], aggr=[sum(x.v1), avg(x.v3)]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|               |       RepartitionExec: partitioning=Hash([id3@0], 14), input_partitions=14                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|               |         AggregateExec: mode=Partial, gby=[id3@0 as id3], aggr=[sum(x.v1), avg(x.v3)]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|               |           ParquetExec: file_groups={14 groups: [[Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet:0..18252411], [Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet:18252411..36504822], [Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet:36504822..54757233], [Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet:54757233..73009644], [Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet:73009644..91262055], ...]}, projection=[id3, v1, v3] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
