If these benchmarks are being run on a single node, we should probably set the shuffle partitions to something like 1-4 instead of the default of 200.
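For reference, a minimal sketch of what that override could look like when the session is built via `SparkSession.builder` (the app name and the value of 4 are illustrative, not taken from the repo):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tpch-local")  # hypothetical app name
    # spark.sql.shuffle.partitions defaults to 200; a handful of
    # partitions is plenty for a single-node run at small scale factors.
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)
```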
These are default values that will let you run scale factor 1 locally without any problems. If we use PySpark defaults, certain queries fail due to memory issues.
The benchmark blog has the actual values used during the benchmark:
> For PySpark, driver memory and executor memory were set to 20g and 10g respectively.
I have tried a few different settings, and these seemed to work best for scale factor 10.
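For context, a hedged sketch of how those memory settings could be applied when building the session (the values come from the blog quote above; the app name is made up, and `spark.driver.memory` only takes effect if it is set before the driver JVM starts, i.e. on a fresh session in a new process):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tpch-sf10")  # hypothetical app name
    # Values quoted from the benchmark blog for the scale factor 10 run.
    .config("spark.driver.memory", "20g")
    .config("spark.executor.memory", "10g")
    .getOrCreate()
)
```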
> If these benchmarks are being run on a single node, we should probably set the shuffle partitions to something like 1-4 instead of the default of 200.
Could be - I am by no means a PySpark optimization expert. Perhaps PySpark should ship better/dynamic defaults.
Here's the line: https://github.com/pola-rs/tpch/blob/6c5bbe93a04cfcd25678dd860bab5ad61ad66edb/queries/pyspark/utils.py#L24
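A rough sketch of what a "dynamic default" could look like, sizing shuffle partitions to the local core count instead of a fixed 200 (purely illustrative; this is not what utils.py currently does):

```python
import os

from pyspark.sql import SparkSession

# Match the number of shuffle partitions to the machine's core count
# rather than using the global default of 200.
cores = os.cpu_count() or 4

spark = (
    SparkSession.builder.appName("tpch-local")  # hypothetical app name
    .config("spark.sql.shuffle.partitions", str(cores))
    .getOrCreate()
)
```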