
1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by AWS Redshift #17929

Closed
2 tasks done
TinoSM opened this issue Jul 29, 2024 · 6 comments · Fixed by #17941
Assignees: coastalwhite
Labels: A-io-parquet (reading/writing Parquet files), A-panic (code that results in panic exceptions), accepted (Ready for implementation), bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@TinoSM

TinoSM commented Jul 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

broken_example.parquet.zip

  import polars as pl

  # file is https://github.com/user-attachments/files/16416053/broken_example.parquet.zip
  df_eager = pl.scan_parquet("broken_example.parquet")  # also tried with read_parquet
  for col in df_eager.columns:
      print(col)
      # Only fails with DECIMAL(19,0) columns exported by Redshift. I tried
      # replicating with pl.DataFrame([None]*500).cast(...) and it does not
      # reproduce the issue (see the sketch below). Reading the file with
      # Polars 1.2.1, writing it again, then reading it with 1.3.0 also works.
      df_eager.select(col).collect()
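
For reference, a minimal sketch of the in-memory replication attempt mentioned in the comment above (the column name and row count are assumptions; per the report, a file written by Polars itself reads back fine, so this does not trigger the panic):

  import io

  import polars as pl

  # Hypothetical reconstruction of the attempt: an all-null column cast to
  # Decimal(19, 0), round-tripped through a Parquet file written by Polars.
  df = pl.DataFrame({"test_column": [None] * 500}).cast({"test_column": pl.Decimal(19, 0)})
  f = io.BytesIO()
  df.write_parquet(f)
  f.seek(0)
  print(pl.read_parquet(f))  # reads back fine; the panic needs the Redshift file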

Log output

thread 'polars-0' panicked at crates/polars-parquet/src/arrow/read/deserialize/fixed_size_binary.rs:250:31:
range end index 8 out of range for slice of length 0 <-- In our real data it says "index 9"
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: <polars_parquet::arrow::read::deserialize::fixed_size_binary::StateTranslation as polars_parquet::arrow::read::deserialize::utils::StateTranslation<polars_parquet::arrow::read::deserialize::fixed_size_binary::BinaryDecoder>>::extend_from_state
   4: polars_parquet::arrow::read::deserialize::utils::PageDecoder<I,D>::collect_n
   5: polars_parquet::arrow::read::deserialize::simple::page_iter_to_array
   6: polars_io::parquet::read::read_impl::column_idx_to_series
   7: polars_io::parquet::read::read_impl::rg_to_dfs
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::thread_pool::ThreadPool::install::{{closure}}
  10: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute

Issue description

Since 1.3.0 (we upgraded from 1.2.1, where the same files read fine) we are unable to read any of our Parquet exports.

It turns out the issue occurs when reading a DECIMAL(19,0) column with (many? all?) values set to null.

It works fine with use_pyarrow=True (but that forces read_parquet instead of scan_parquet as well...).
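
A minimal sketch of that workaround, assuming the attached file name; use_pyarrow is only available on the eager reader:

  import polars as pl

  # Route the read through pyarrow instead of Polars' native Parquet
  # reader; this avoids the panic but gives up lazy scanning.
  df = pl.read_parquet("broken_example.parquet", use_pyarrow=True)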

Expected behavior

The Parquet file should be read without errors.

I can read it correctly with Polars 1.2.1, DuckDB, or the "pyarrow" engine.

Installed versions

--------Version info---------
Polars:               1.3.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            0.18.2
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>

great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             2.8.2
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@TinoSM TinoSM added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 29, 2024
@TinoSM
Author

TinoSM commented Jul 29, 2024

This is how the file was generated (Redshift UNLOAD query):

unload ('select null + MY_DECIMAL_19_0_COLUMN as test_column from table')
to 's3://XXX'
PARQUET
REGION 'eu-west-1'
CLEANPATH
MAXFILESIZE 128MB
CREDENTIALS 'xxxx'
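
For anyone without Redshift access, a hypothetical local stand-in that should produce a comparable file (assuming Redshift encodes DECIMAL(19,0) as a Parquet FIXED_LEN_BYTE_ARRAY, which matches the fixed_size_binary.rs panic above; the column name and row count are made up):

  import pyarrow as pa
  import pyarrow.parquet as pq

  # All-null decimal128(19, 0) column; pyarrow writes decimals as
  # FIXED_LEN_BYTE_ARRAY by default, the code path that panics.
  table = pa.table({"test_column": pa.array([None] * 500, type=pa.decimal128(19, 0))})
  pq.write_table(table, "all_null_decimal.parquet")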

@TinoSM TinoSM changed the title 1.3.0 regression when reading DECIMAL(19,0) parquet files exported by Redshift 1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by Redshift Jul 29, 2024
@TinoSM TinoSM changed the title 1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by Redshift 1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by AWS Redshift Jul 29, 2024
@coastalwhite coastalwhite added accepted Ready for implementation P-medium Priority: medium A-io-parquet Area: reading/writing Parquet files A-panic Area: code that results in panic exceptions and removed needs triage Awaiting prioritization by a maintainer labels Jul 29, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jul 29, 2024
@coastalwhite coastalwhite self-assigned this Jul 29, 2024
@ritchie46
Member

Can you share a small file showing the issue?

@TinoSM
Author

TinoSM commented Jul 29, 2024

@ritchie46 it is attached to the original ticket; it's the .zip in the reproducible example (Ctrl-F for broken_example.parquet.zip will find it).

Adding it as well to this comment:
broken_example.parquet.zip

@ritchie46
Member

Check. Thanks!

@coastalwhite
Collaborator

Minimal repro for this issue:

import io

import polars as pl
from polars.testing import assert_frame_equal

# Single all-null Decimal column.
df = pl.DataFrame({"a": [None]}, schema={"a": pl.Decimal(precision=18, scale=0)})

# Round-trip through an in-memory Parquet file written via pyarrow.
f = io.BytesIO()
df.write_parquet(f, use_pyarrow=True)
f.seek(0)
assert_frame_equal(pl.read_parquet(f), df)  # read panics on Polars 1.3.0

coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 30, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 30, 2024
@coastalwhite coastalwhite linked a pull request Jul 31, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jul 31, 2024
@TinoSM
Author

TinoSM commented Jul 31, 2024

Thanks @coastalwhite @ritchie46!
