
group_by+explode more than 3x faster than over #18556

Closed
2 tasks done
MarcoGorelli opened this issue Sep 4, 2024 · 6 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer performance Performance issues or improvements python Related to Python Polars

Comments

@MarcoGorelli
Collaborator

MarcoGorelli commented Sep 4, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

The following two queries produce the same results:

def q1_polars_with_over(df):
    return df.with_columns(
        pl.col(TARGET).shift(lag).over("id").alias(f"{TARGET}_lag_{lag}")
        for lag in LAG_DAYS
    )
def q1_polars_with_explode(df):
    return df.group_by(TARGET).agg(
        pl.col(TARGET).shift(l).alias(f"{TARGET}_lag_{l}")
        for l in LAG_DAYS
    ).explode([f"{TARGET}_lag_{l}" for l in LAG_DAYS])

But the second one is more than 3x faster.

Perhaps this is a query optimisation opportunity?

I came across this while investigating pola-rs/polars-benchmark#136, and noticed that the pandas vs Polars difference wasn't as large as I was expecting.

Log output

No response

Issue description

Is there a chance here to do more sharing of over statements?

Complete reproducible example (data is the input file to https://www.kaggle.com/code/marcogorelli/over-vs-group-by-explode/notebook?scriptVersionId=195325043)

import pandas as pd
import polars as pl
import time
import numpy as np
import os
import pyarrow

print('polars version', pl.__version__)

pl.enable_string_cache()

PROCESSED_DATA_DIR = (
    "/kaggle/input/m5-forecasting-data"
)

TARGET = "sales"
SHIFT_DAY = 28

PATH = os.path.join(PROCESSED_DATA_DIR, 'grid_part_1.parquet')



LAG_DAYS = [col for col in range(SHIFT_DAY, SHIFT_DAY + 15)]


def q1_polars_with_explode(df):
    return df.group_by('id').agg(
        'd',
        *[pl.col(TARGET).shift(l).alias(f"{TARGET}_lag_{l}")
        for l in LAG_DAYS]
    ).explode('d', *[f"{TARGET}_lag_{l}" for l in LAG_DAYS])

def q1_polars_with_over(df):
    return df.select(
        'id',
        'd',
        *[pl.col(TARGET).shift(lag).over("id").alias(f"{TARGET}_lag_{lag}")
        for lag in LAG_DAYS]
    )

Then, compare:

  • res1 = q1_polars_with_explode(pl.scan_parquet(PATH)).collect()
  • res2 = q1_polars_with_over(pl.scan_parquet(PATH)).collect()

Expected behavior

Ideally I think they should perform similarly?

Installed versions

```

--------Version info---------
Polars: 1.6.0
Index type: UInt32
Platform: Linux-5.15.154+-x86_64-with-glibc2.31
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager
altair 5.4.0
cloudpickle 3.0.0
connectorx
deltalake
fastexcel
fsspec 2024.6.1
gevent
great_tables
matplotlib 3.7.5
nest_asyncio 1.6.0
numpy 1.26.4
openpyxl 3.1.5
pandas 2.2.2
pyarrow 17.0.0
pydantic 2.8.2
pyiceberg
sqlalchemy 2.0.30
torch 2.4.0+cpu
xlsx2csv
xlsxwriter

```
@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer labels Sep 4, 2024
@MarcoGorelli MarcoGorelli changed the title group_by+explode more than 3x slower than over group_by+explode more than 3x faster than over Sep 4, 2024
@MarcoGorelli MarcoGorelli added the performance Performance issues or improvements label Sep 4, 2024
@deanm0000
Collaborator

This is just an excuse to show off altair in your pl.show_versions(), isn't it? j/k

I got slightly worse results: 4.16 sec and 36.34 sec.

With over, I was thinking it was having to do the grouping for each column, so I tried it like this:

def q1_polars_with_over_struct(df):
    return df.select(
        TARGET,
        pl.struct(
            pl.col(TARGET).shift(lag).alias(f"{TARGET}_lag_{lag}") for lag in LAG_DAYS
        )
        .over("id")
        .alias("struct"),
    ).unnest("struct")

and this did a bit better at 27.87 sec, but not enough to get back to the group_by performance.

Another difference is that the results aren't in the same order. Even if I set maintain_order=True in the group_by, they still aren't in the same order. I'm taking a shot in the dark that the over approach is doing a bunch of joins and/or sorts that aren't necessary.

@barak1412
Contributor

Interesting - although the over happens only once in @deanm0000's example, we still get a worse result.

Does it also happen with LAG_DAYS=1?
If not, it will be difficult to add it to the optimizer.

@MarcoGorelli
Collaborator Author

MarcoGorelli commented Sep 5, 2024

We chatted about this earlier: over has to do extra work to preserve order. You can reduce the gap by using mapping_strategy='explode' in all arguments to the select.

@cmdlineluser
Contributor

From a previous discussion, it was mentioned that it was due to parallelization:

#10063 (comment)

The assumptions are different, therefore the aggregation can parallelize over all aggregation functions.

@ritchie46
Member

This is expected, as a window function in Polars by default has the constraint that it has to return data in the order of the input frame. This is a costly operation.

If you write over(.., mapping_strategy='explode'), the window function doesn't have this constraint, but this should only be used in a select context, as the return order will not be tied to the input frame.

A way this can be written is as such:

df.select(
    pl.all().over('id', mapping_strategy='explode'),
    pl.col(..).shift(l).over('id', mapping_strategy='explode'),
)

I will close this as there isn't a bug and the perf differences are expected.

@krds00

krds00 commented Oct 4, 2024

def q1_polars_with_explode(df):
    return df.group_by(TARGET).agg(
        pl.col(TARGET).shift(l).alias(f"{TARGET}_lag_{l}")
        for l in LAG_DAYS
    ).explode([f"{TARGET}_lag_{l}" for l in LAG_DAYS])

I think you mean group_by("id") here.


6 participants