
group_by+explode more than 3x faster than over #18556

Closed
2 tasks done
MarcoGorelli opened this issue Sep 4, 2024 · 6 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer performance Performance issues or improvements python Related to Python Polars

Comments

@MarcoGorelli
Collaborator

MarcoGorelli commented Sep 4, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

The following two queries produce the same results:

def q1_polars_with_over(df):
    return df.with_columns(
        pl.col(TARGET).shift(lag).over("id").alias(f"{TARGET}_lag_{lag}")
        for lag in LAG_DAYS
    )
def q1_polars_with_explode(df):
    return df.group_by(TARGET).agg(
        pl.col(TARGET).shift(l).alias(f"{TARGET}_lag_{l}")
        for l in LAG_DAYS
    ).explode([f"{TARGET}_lag_{l}" for l in LAG_DAYS])

But the second one is more than 3x faster.

Perhaps this is a query optimisation opportunity?

I came across this while investigating pola-rs/polars-benchmark#136, and noticed that the pandas vs Polars difference wasn't as large as I was expecting.

Log output

No response

Issue description

Is there a chance here to do more sharing of over statements?

Complete reproducible example (data is the input file to https://www.kaggle.com/code/marcogorelli/over-vs-group-by-explode/notebook?scriptVersionId=195325043)

import pandas as pd
import polars as pl
import time
import numpy as np
import os
import pyarrow

print('polars version', pl.__version__)

pl.enable_string_cache()

PROCESSED_DATA_DIR = (
    "/kaggle/input/m5-forecasting-data"
)

TARGET = "sales"
SHIFT_DAY = 28

PATH = os.path.join(PROCESSED_DATA_DIR, 'grid_part_1.parquet')



LAG_DAYS = [col for col in range(SHIFT_DAY, SHIFT_DAY + 15)]


def q1_polars_with_explode(df):
    return df.group_by('id').agg(
        'd',
        *[pl.col(TARGET).shift(l).alias(f"{TARGET}_lag_{l}")
        for l in LAG_DAYS]
    ).explode('d', *[f"{TARGET}_lag_{l}" for l in LAG_DAYS])

def q1_polars_with_over(df):
    return df.select(
        'id',
        'd',
        *[pl.col(TARGET).shift(lag).over("id").alias(f"{TARGET}_lag_{lag}")
        for lag in LAG_DAYS]
    )

Then, compare:

  • res1 = q1_polars_with_explode(pl.scan_parquet(PATH)).collect()
  • res2 = q1_polars_with_over(pl.scan_parquet(PATH)).collect()

Expected behavior

Ideally I think they should perform similarly?

Installed versions

```

--------Version info---------
Polars: 1.6.0
Index type: UInt32
Platform: Linux-5.15.154+-x86_64-with-glibc2.31
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager
altair 5.4.0
cloudpickle 3.0.0
connectorx
deltalake
fastexcel
fsspec 2024.6.1
gevent
great_tables
matplotlib 3.7.5
nest_asyncio 1.6.0
numpy 1.26.4
openpyxl 3.1.5
pandas 2.2.2
pyarrow 17.0.0
pydantic 2.8.2
pyiceberg
sqlalchemy 2.0.30
torch 2.4.0+cpu
xlsx2csv
xlsxwriter

```
@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer labels Sep 4, 2024
@MarcoGorelli MarcoGorelli changed the title group_by+explode more than 3x slower than over group_by+explode more than 3x faster than over Sep 4, 2024
@MarcoGorelli MarcoGorelli added the performance Performance issues or improvements label Sep 4, 2024
@deanm0000
Collaborator

This is just an excuse to show off altair in your pl.show_versions(), isn't it? j/k

I got slightly worse results: 4.16 sec and 36.34 sec.

With over, I was thinking it was having to do the grouping for each column, so I tried it like this:

def q1_polars_with_over_struct(df):
    return df.select(
        TARGET,
        pl.struct(
            pl.col(TARGET).shift(lag).alias(f"{TARGET}_lag_{lag}") for lag in LAG_DAYS
        )
        .over("id")
        .alias("struct"),
    ).unnest("struct")

and this did a bit better at 27.87 sec, but not enough to get back to the group_by performance.

Another difference is that the results aren't in the same order. Even if I set maintain_order=True in the group_by, they still aren't in the same order. I'm taking a shot in the dark that the over approach is doing a bunch of joins and/or sorts that aren't necessary.

@barak1412
Contributor

Interesting - although the over happens only once in @deanm0000's example, we still get a worse result.

Does it also happen with LAG_DAYS=1?
If not, it will be difficult to add it to the optimizer.

@MarcoGorelli
Collaborator Author

MarcoGorelli commented Sep 5, 2024

We chatted about this earlier: over has to do extra work to preserve order. You can reduce the gap by using mapping_strategy='explode' in all arguments to the select.

@cmdlineluser
Contributor

From a previous discussion, it was mentioned that it was due to parallelization:

#10063 (comment)

The assumptions are different, therefore the aggregation can parallelize over all aggregation functions.

@ritchie46
Member

This is expected, as a window function in Polars by default has the constraint that it has to return data in the order of the input frame. This is a costly operation.

If you write over(.., mapping_strategy='explode'), the window function doesn't have this constraint, but this should only be used in a select context, as the return order will not be tied to the input frame.

A way this can be written is as such:

df.select(
    pl.all().over('id', mapping_strategy='explode'),
    pl.col(..).shift(l).over('id', mapping_strategy='explode'),
)

I will close this as there isn't a bug and the perf differences are expected.

@krds00

krds00 commented Oct 4, 2024

def q1_polars_with_explode(df):
    return df.group_by(TARGET).agg(
        pl.col(TARGET).shift(l).alias(f"{TARGET}_lag_{l}")
        for l in LAG_DAYS
    ).explode([f"{TARGET}_lag_{l}" for l in LAG_DAYS])

I think you mean group_by("id") here.


6 participants