If these benchmarks are being run on a single node, we should probably set the shuffle partitions to something like 1-4 instead of the default of 200.
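For reference, a minimal sketch of what that override could look like when the session is built via `SparkSession.builder` (the app name and the value of 4 are illustrative, not taken from the repo):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tpch-local")  # hypothetical app name
    # spark.sql.shuffle.partitions defaults to 200; a handful of
    # partitions is plenty for a single-node run at small scale factors.
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)
```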
These are default values that will let you run scale factor 1 locally without any problems. If we use PySpark defaults, certain queries fail due to memory issues.
The benchmark blog has the actual values used during the benchmark:
> For PySpark, driver memory and executor memory were set to 20g and 10g respectively.
I have tried a few different settings, and these seemed to work best for scale factor 10.
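For context, a hedged sketch of how those memory settings could be applied when building the session (the values come from the blog quote above; the app name is made up, and `spark.driver.memory` only takes effect if it is set before the driver JVM starts, i.e. on a fresh session in a new process):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tpch-sf10")  # hypothetical app name
    # Values quoted from the benchmark blog for the scale factor 10 run.
    .config("spark.driver.memory", "20g")
    .config("spark.executor.memory", "10g")
    .getOrCreate()
)
```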
> If these benchmarks are being run on a single node, we should probably set the shuffle partitions to something like 1-4 instead of the default of 200.
Could be - I am by no means a PySpark optimization expert. Perhaps PySpark should ship better/dynamic defaults.
Here's the line: https://github.com/pola-rs/tpch/blob/6c5bbe93a04cfcd25678dd860bab5ad61ad66edb/queries/pyspark/utils.py#L24
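A rough sketch of what a "dynamic default" could look like, sizing shuffle partitions to the local core count instead of a fixed 200 (purely illustrative; this is not what utils.py currently does):

```python
import os

from pyspark.sql import SparkSession

# Match the number of shuffle partitions to the machine's core count
# rather than using the global default of 200.
cores = os.cpu_count() or 4

spark = (
    SparkSession.builder.appName("tpch-local")  # hypothetical app name
    .config("spark.sql.shuffle.partitions", str(cores))
    .getOrCreate()
)
```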