Pyarrow polars.exceptions.ComputeError: ArrowInvalid: offset overflow while concatenating arrays when scanning large Delta tables #19689

Hhernan6 opened this issue Nov 7, 2024 · 0 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Creating the Delta table:
import pandas as pd
import pyarrow as pa
from deltalake import write_deltalake
import numpy as np

# NOTE: Can run once with 300,000,000 rows or several times to get to that number

# Define the number of rows
num_rows_1 = 50_000_000 
id_pool = [f"id_{i}" for i in range(1, 10000000)]  # id_1 to id_9999999

# Generate synthetic data
def generate_data(num_rows):
    # Generate random data
    data = {
        "id": np.random.choice(id_pool, num_rows),
        "properties": [
            {
                "groupingUid": f"aud_{np.random.randint(1, 1500)}",
                "otherField1": np.random.randint(100, 200),
                "otherField2": f"prop_{np.random.randint(1, 5)}"
            } 
            for _ in range(num_rows)
        ],
        "date": ["2024-10-26"] * num_rows,
    }

    return pd.DataFrame(data)

print("generating dataframe")
df = generate_data(num_rows_1)

df['date'] = df['date'].astype(str)

schema = pa.schema([
    ("id", pa.string()),
    ("properties", pa.struct([
        ("groupingUid", pa.string()),
        ("otherField1", pa.int32()),
        ("otherField2", pa.string()),
    ])),
    ("date", pa.string()),
])
table1 = pa.Table.from_pandas(df, schema=schema)

# Specify the path to save the Delta table
delta_table_path1 = "./table1"

# Write the data to the Delta table
print("writing to tables...")
write_deltalake(delta_table_path1, table1, partition_by=["date"], mode="append")
print(f"Delta table with {num_rows_1} rows created at {delta_table_path1}")

Reading the table:

import polars as pl

# pl.Config.set_streaming_chunk_size(5)  # setting this makes no difference

# Load df
df = (
    pl.scan_delta(
        "table1"
    )
    .select(
        "id",
        pl.col("properties").struct["groupingUid"].alias("groupingUid"),
        "date",
    )
)

try:
    collected_df = df.collect(streaming=True)
    print("DF:", collected_df)
except Exception as e:
    raise Exception(f"error creating audience size dataframe: {e}")

Log output

Traceback (most recent call last):
  File "/Users/hhernandez/Documents/rvo-redplatform/polars-testing/read.py", line 18, in <module>
    collected_df = df.collect(streaming=True)
  File "/Users/hhernandez/Library/Python/3.9/lib/python/site-packages/polars/lazyframe/frame.py", line 2055, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: ArrowInvalid: offset overflow while concatenating arrays

Issue description

To populate the table, I ran the write script above 6 times (50 million rows per run) to reach 300 million rows, since I run into issues writing 300 million rows in a single pass locally. The error only seems to occur when reading from the table once it contains roughly 300 million rows or more; a sketch of the repeated-append loop is shown below.
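
For reference, a minimal sketch of that repeated-append loop, assuming the generate_data function, schema, delta_table_path1, and num_rows_1 definitions from the write script above:

# Sketch only: append six 50M-row batches to reach roughly 300M rows in total.
# Reuses generate_data, schema, delta_table_path1, and num_rows_1 defined above.
for i in range(6):
    print(f"writing batch {i + 1} of 6")
    batch_df = generate_data(num_rows_1)
    batch_df["date"] = batch_df["date"].astype(str)
    batch_table = pa.Table.from_pandas(batch_df, schema=schema)
    write_deltalake(delta_table_path1, batch_table, partition_by=["date"], mode="append")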

Expected behavior

The expected behavior is that the large table can be collected into a DataFrame without errors.

Installed versions

--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            macOS-14.5-arm64-arm-64bit
Python:              3.9.6 (default, Feb  3 2024, 15:58:27) 
[Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            0.21.0
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                1.24.3
openpyxl             <not installed>
pandas               2.0.2
pyarrow              18.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>