
1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by AWS Redshift #17929

Closed
2 tasks done
TinoSM opened this issue Jul 29, 2024 · 6 comments · Fixed by #17941
Assignees: coastalwhite
Labels: A-io-parquet (reading/writing Parquet files), A-panic (code that results in panic exceptions), accepted (Ready for implementation), bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@TinoSM

TinoSM commented Jul 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

broken_example.parquet.zip

  import polars as pl

  # file is https://github.com/user-attachments/files/16416053/broken_example.parquet.zip
  df_eager = pl.scan_parquet("broken_example.parquet")  # also tried with read_parquet
  for col in df_eager.columns:
      print(col)
      # Only fails with DECIMAL(19,0) columns exported by Redshift. I tried
      # replicating with pl.DataFrame([None]*500).cast(...) and it does not
      # reproduce the issue (see the sketch below). Reading the file with
      # Polars 1.2.1, writing it again, then reading it with 1.3.0 also works.
      df_eager.select(col).collect()
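
For reference, a minimal sketch of the in-memory replication attempt mentioned in the comment above (the column name and row count are assumptions; per the report, a file written by Polars itself reads back fine, so this does not trigger the panic):

  import io

  import polars as pl

  # Hypothetical reconstruction of the attempt: an all-null column cast to
  # Decimal(19, 0), round-tripped through a Parquet file written by Polars.
  df = pl.DataFrame({"test_column": [None] * 500}).cast({"test_column": pl.Decimal(19, 0)})
  f = io.BytesIO()
  df.write_parquet(f)
  f.seek(0)
  print(pl.read_parquet(f))  # reads back fine; the panic needs the Redshift file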

Log output

thread 'polars-0' panicked at crates/polars-parquet/src/arrow/read/deserialize/fixed_size_binary.rs:250:31:
range end index 8 out of range for slice of length 0 <-- In our real data it says "index 9"
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: <polars_parquet::arrow::read::deserialize::fixed_size_binary::StateTranslation as polars_parquet::arrow::read::deserialize::utils::StateTranslation<polars_parquet::arrow::read::deserialize::fixed_size_binary::BinaryDecoder>>::extend_from_state
   4: polars_parquet::arrow::read::deserialize::utils::PageDecoder<I,D>::collect_n
   5: polars_parquet::arrow::read::deserialize::simple::page_iter_to_array
   6: polars_io::parquet::read::read_impl::column_idx_to_series
   7: polars_io::parquet::read::read_impl::rg_to_dfs
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::thread_pool::ThreadPool::install::{{closure}}
  10: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute

Issue description

Since 1.3.0 (we upgraded from 1.2.1, where the same files read fine) we are unable to read any of our Parquet exports.

It turns out the issue occurs when reading a DECIMAL(19,0) column with (many? all?) values set to null.

It works fine with use_pyarrow=True (but that forces read_parquet instead of scan_parquet as well...).
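
A minimal sketch of that workaround, assuming the attached file name; use_pyarrow is only available on the eager reader:

  import polars as pl

  # Route the read through pyarrow instead of Polars' native Parquet
  # reader; this avoids the panic but gives up lazy scanning.
  df = pl.read_parquet("broken_example.parquet", use_pyarrow=True)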

Expected behavior

The Parquet file should be read without errors.

I can read it correctly with Polars 1.2.1, DuckDB, or the "pyarrow" engine.

Installed versions

--------Version info---------
Polars:               1.3.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            0.18.2
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>

great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             2.8.2
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@TinoSM TinoSM added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 29, 2024
@TinoSM
Author

TinoSM commented Jul 29, 2024

This is how the file was generated (Redshift UNLOAD query):

unload ('select null + MY_DECIMAL_19_0_COLUMN as test_column from table')
to 's3://XXX'
PARQUET
REGION 'eu-west-1'
CLEANPATH
MAXFILESIZE 128MB
CREDENTIALS 'xxxx'
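
For anyone without Redshift access, a hypothetical local stand-in that should produce a comparable file (assuming Redshift encodes DECIMAL(19,0) as a Parquet FIXED_LEN_BYTE_ARRAY, which matches the fixed_size_binary.rs panic above; the column name and row count are made up):

  import pyarrow as pa
  import pyarrow.parquet as pq

  # All-null decimal128(19, 0) column; pyarrow writes decimals as
  # FIXED_LEN_BYTE_ARRAY by default, the code path that panics.
  table = pa.table({"test_column": pa.array([None] * 500, type=pa.decimal128(19, 0))})
  pq.write_table(table, "all_null_decimal.parquet")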

@TinoSM TinoSM changed the title 1.3.0 regression when reading DECIMAL(19,0) parquet files exported by Redshift 1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by Redshift Jul 29, 2024
@TinoSM TinoSM changed the title 1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by Redshift 1.3.0 regression when reading all-null DECIMAL(19,0) column @ parquet file exported by AWS Redshift Jul 29, 2024
@coastalwhite coastalwhite added accepted Ready for implementation P-medium Priority: medium A-io-parquet Area: reading/writing Parquet files A-panic Area: code that results in panic exceptions and removed needs triage Awaiting prioritization by a maintainer labels Jul 29, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jul 29, 2024
@coastalwhite coastalwhite self-assigned this Jul 29, 2024
@ritchie46
Member

Can you share a small file showing the issue?

@TinoSM
Author

TinoSM commented Jul 29, 2024

@ritchie46 it is attached to the original ticket; it's the .zip in the reproducible example (Ctrl-F for broken_example.parquet.zip will find it).

Adding it as well to this comment:
broken_example.parquet.zip

@ritchie46
Member

Check. Thanks!

@coastalwhite
Collaborator

Minimal repro for this issue:

import io

import polars as pl
from polars.testing import assert_frame_equal

# Single all-null Decimal column.
df = pl.DataFrame({"a": [None]}, schema={"a": pl.Decimal(precision=18, scale=0)})

# Round-trip through an in-memory Parquet file written via pyarrow.
f = io.BytesIO()
df.write_parquet(f, use_pyarrow=True)
f.seek(0)
assert_frame_equal(pl.read_parquet(f), df)  # read panics on Polars 1.3.0

coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 30, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 30, 2024
@coastalwhite coastalwhite linked a pull request Jul 31, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jul 31, 2024
@TinoSM
Author

TinoSM commented Jul 31, 2024

Thanks @coastalwhite @ritchie46!
