Pyarrow polars.exceptions.ComputeError: ArrowInvalid: offset overflow while concatenating arrays when scanning large Delta tables #19689

Hhernan6 opened this issue Nov 7, 2024 · 0 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Creating the Delta table:
import pandas as pd
import pyarrow as pa
from deltalake import write_deltalake
import numpy as np

# NOTE: Can run once with 300,000,000 rows or several times to get to that number

# Define the number of rows
num_rows_1 = 50_000_000 
id_pool = [f"id_{i}" for i in range(1, 10000000)]  # id_1 to id_9999999

# Generate synthetic data
def generate_data(num_rows):
    # Generate random data
    data = {
        "id": np.random.choice(id_pool, num_rows),
        "properties": [
            {
                "groupingUid": f"aud_{np.random.randint(1, 1500)}",
                "otherField1": np.random.randint(100, 200),
                "otherField2": f"prop_{np.random.randint(1, 5)}"
            } 
            for _ in range(num_rows)
        ],
        "date": ["2024-10-26"] * num_rows,
    }

    return pd.DataFrame(data)

print("generating dataframe")
df = generate_data(num_rows_1)

df['date'] = df['date'].astype(str)

schema = pa.schema([
    ("id", pa.string()),
    ("properties", pa.struct([
        ("groupingUid", pa.string()),
        ("otherField1", pa.int32()),
        ("otherField2", pa.string()),
    ])),
    ("date", pa.string()),
])
table1 = pa.Table.from_pandas(df, schema=schema)

# Specify the path to save the Delta table
delta_table_path1 = "./table1"

# Write the data to the Delta table
print("writing to tables...")
write_deltalake(delta_table_path1, table1, partition_by=["date"], mode="append")
print(f"Delta table with {num_rows_1} rows created at {delta_table_path1}")

Reading the table:

import polars as pl

# pl.Config.set_streaming_chunk_size(5)  # setting this makes no difference

# Load df
df = (
    pl.scan_delta(
        "table1"
    )
    .select(
        "id",
        pl.col("properties").struct["groupingUid"].alias("groupingUid"),
        "date",
    )
)

try:
    collected_df = df.collect(streaming=True)
    print("DF:", collected_df)
except Exception as e:
    raise Exception(f"error creating audience size dataframe: {e}")

Log output

Traceback (most recent call last):
  File "/Users/hhernandez/Documents/rvo-redplatform/polars-testing/read.py", line 18, in <module>
    collected_df = df.collect(streaming=True)
  File "/Users/hhernandez/Library/Python/3.9/lib/python/site-packages/polars/lazyframe/frame.py", line 2055, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: ArrowInvalid: offset overflow while concatenating arrays

Issue description

To populate the table, I ran the write script above 6 times (50 million rows per run) to reach 300 million rows, since I run into issues writing 300 million rows in a single pass locally. The error only seems to occur when reading from the table once it contains roughly 300 million rows or more; a sketch of the repeated-append loop is shown below.
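
For reference, a minimal sketch of that repeated-append loop, assuming the generate_data function, schema, delta_table_path1, and num_rows_1 definitions from the write script above:

# Sketch only: append six 50M-row batches to reach roughly 300M rows in total.
# Reuses generate_data, schema, delta_table_path1, and num_rows_1 defined above.
for i in range(6):
    print(f"writing batch {i + 1} of 6")
    batch_df = generate_data(num_rows_1)
    batch_df["date"] = batch_df["date"].astype(str)
    batch_table = pa.Table.from_pandas(batch_df, schema=schema)
    write_deltalake(delta_table_path1, batch_table, partition_by=["date"], mode="append")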

Expected behavior

The expected behavior is that the large table can be collected into a DataFrame without errors.

Installed versions

--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            macOS-14.5-arm64-arm-64bit
Python:              3.9.6 (default, Feb  3 2024, 15:58:27) 
[Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            0.21.0
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                1.24.3
openpyxl             <not installed>
pandas               2.0.2
pyarrow              18.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>