group_by + explode more than 3x faster than over
#18556
This is just an excuse to show altair through your

I got slightly worse with 4.16sec and 36.34sec. On the

    def q1_polars_with_over_struct(df):
        return df.select(
            TARGET,
            pl.struct(
                pl.col(TARGET).shift(lag).alias(f"{TARGET}_lag_{lag}") for lag in LAG_DAYS
            )
            .over("id")
            .alias("struct"),
        ).unnest("struct")

and this did a bit better at 27.87sec, but not better enough to get back to the group_by performance. Another difference is that the results aren't in the same order. Even if I do
Interesting - although the Does it also happen with
chatted about this earlier:
From a previous discussion, it was mentioned that it was due to parallelization:
This is expected, as a window function in Polars by default has the constraint that it has to return data in the order of the input frame. This is a costly operation. If you write

A way this can be written is as such:

    df.select(
        pl.all().over('id', mapping_strategy='explode'),
        pl.col(..).shift(l).over('id', mapping_strategy='explode')
    )

I will close this as there isn't a bug and the perf differences are expected.
Here you imply group_by("id"), I think.
Checks
Reproducible example
The following two queries produce the same results
But, the second one is more than 3x faster
Perhaps this is a query optimisation opportunity?
I came across this while investigating pola-rs/polars-benchmark#136, and noticing that the pandas vs Polars difference wasn't as large as I was expecting
Log output
No response
Issue description
Is there a chance here to do more sharing of over statements?
Complete reproducible example (data is the input file to https://www.kaggle.com/code/marcogorelli/over-vs-group-by-explode/notebook?scriptVersionId=195325043)
Then, compare:
res1 = q1_polars_with_explode(pl.scan_parquet(PATH)).collect()
res2 = q1_polars_with_over(pl.scan_parquet(PATH)).collect()
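To put numbers on the comparison, a small timing helper along these lines could be used. This is only a sketch: `time_query` is a hypothetical helper, not part of Polars or of the notebook, and the workload below is a stand-in so that the snippet runs on its own.

```python
import time


def time_query(run, repeats=3):
    """Return (best wall-clock seconds, last result) for a zero-argument callable."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = run()
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload; replace it with a lambda that collects one of the queries.
elapsed, value = time_query(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.4f}s")
```

With the queries above, each call would be wrapped as, for example, `lambda: q1_polars_with_explode(pl.scan_parquet(PATH)).collect()`. Since the two results can come back in different row orders, sorting both before comparing them avoids a false mismatch.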
Expected behavior
Ideally I think they should perform similarly?
Installed versions
--------Version info---------
Polars: 1.6.0
Index type: UInt32
Platform: Linux-5.15.154+-x86_64-with-glibc2.31
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager
altair 5.4.0
cloudpickle 3.0.0
connectorx
deltalake
fastexcel
fsspec 2024.6.1
gevent
great_tables
matplotlib 3.7.5
nest_asyncio 1.6.0
numpy 1.26.4
openpyxl 3.1.5
pandas 2.2.2
pyarrow 17.0.0
pydantic 2.8.2
pyiceberg
sqlalchemy 2.0.30
torch 2.4.0+cpu
xlsx2csv
xlsxwriter