scan_parquet + filter on S3 with Hive schema pl.Date breaks #21526

Open
2 tasks done
mbronckers opened this issue Feb 28, 2025 · 0 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

TL;DR: Since version 1.22.0, the following breaks on S3. Given a Hive-partitioned directory on S3 written as shown in the setup below, scan_parquet followed by a filter on a datetime.date value panics.

# Set up the example data
import datetime as dt
import polars as pl

file_directory = "some_s3_location"
df = pl.DataFrame(
    {
        "DATE": [dt.date(2025, 2, 25), dt.date(2025, 2, 26), dt.date(2025, 2, 27)],
        "VALUE": [10, 20, 30],
    }
)
# Writes one sub-directory per date, e.g. .../DATE=2025-02-26/...
df.write_parquet(file_directory, partition_by=["DATE"])

Reproducer

Specifying the hive schema as {'DATE': pl.Date} no longer allows filtering on datetime.date objects.

import polars as pl
import datetime as dt

# works, because data is laid out as .../DATE=2025-02-26/...
a = (
    pl.scan_parquet(file_directory, hive_partitioning=True)
    .filter(DATE="2025-02-26")
    .collect()
    .head()
)
print(a)

# does not work
b = (
    pl.scan_parquet(
        file_directory,
        hive_partitioning=True,
        hive_schema={"DATE": pl.Date},
    )
    .filter(DATE=dt.date(2025, 2, 26))
    .collect()
    .head()
)
print(b)

# also does not work
c = (pl.scan_parquet(
    file_directory,
    hive_partitioning=True,
    hive_schema={'DATE': pl.Date}
).filter(DATE='2025-02-26').collect().head())
print(c)

Log output

reading of 1/1 file...

thread '<unnamed>' panicked at crates/polars-core/src/scalar/mod.rs:64:92:
called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Date; found value of type String: \"2025-02-26\""))
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: polars_core::scalar::Scalar::into_series
   4: polars_core::frame::column::scalar::ScalarColumn::_to_series
   5: polars_core::frame::column::scalar::ScalarColumn::as_n_values_series
   6: polars_core::frame::column::compare::<impl polars_core::chunked_array::ops::ChunkCompareIneq<&polars_core::frame::column::Column> for polars_core::frame::column::Column>::gt
   7: polars_expr::expressions::binary::apply_operator
   8: polars_expr::expressions::binary::apply_operator_owned
   9: <polars_expr::expressions::binary::BinaryExpr as polars_expr::expressions::PhysicalExpr>::evaluate
  10: <polars_expr::expressions::binary::BinaryExpr as polars_expr::expressions::PhysicalExpr>::evaluate
  11: <polars_expr::expressions::binary::BinaryExpr as polars_expr::expressions::PhysicalExpr>::evaluate
  12: <polars_mem_engine::predicate::SkipBatchPredicateHelper as polars_io::predicates::SkipBatchPredicate>::can_skip_batch
  13: polars_io::parquet::read::predicates::read_this_row_group
  14: <core::iter::adapters::filter_map::FilterMap<I,F> as core::iter::traits::iterator::Iterator>::next
  15: polars_io::parquet::read::async_impl::FetchRowGroupsFromObjectStore::new
  16: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
  17: polars_mem_engine::executors::scan::parquet::ParquetExec::read_async::{{closure}}
  18: polars_io::pl_async::RuntimeManager::block_on_potential_spawn::{{closure}}
  19: polars_mem_engine::executors::scan::parquet::ParquetExec::read_impl
  20: <polars_mem_engine::executors::scan::parquet::ParquetExec as polars_mem_engine::executors::executor::Executor>::execute
  21: polars_lazy::frame::LazyFrame::collect
  22: polars_python::lazyframe::general::<impl polars_python::lazyframe::PyLazyFrame>::__pymethod_collect__
  23: pyo3::impl_::trampoline::trampoline
  24: polars_python::lazyframe::general::_::__INVENTORY::trampoline
  25: _method_vectorcall_VARARGS_KEYWORDS
  26: _call_function
  27: __PyEval_EvalFrameDefault
  28: __PyEval_Vector
  29: _call_function
  30: __PyEval_EvalFrameDefault
  31: __PyEval_Vector
  32: _PyEval_EvalCode
  33: _run_eval_code_obj
  34: _run_mod
  35: _pyrun_file
  36: __PyRun_SimpleFileObject
  37: __PyRun_AnyFileObject
  38: _pymain_run_file_obj
  39: _pymain_run_file
  40: _Py_RunMain
  41: _Py_BytesMain
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "/Users/mbroncke/prj/work/eueo/webgui/test.py", line 43, in <module>
    ).filter(DATE=dt.date(2025,2,26)).collect().head())
  File "/Users/mbroncke/prj/work/eueo/webgui/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2066, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Date; found value of type String: \"2025-02-26\""))

Issue description

NB: try_parse_hive_dates does not appear to have any effect here. Locally (reading from the local filesystem instead of S3), the behavior is as expected.

Expected behavior

Allow filtering on datetime.date objects on S3 when the hive schema specifies the partition dtype as pl.Date.
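
To make the expectation concrete, here is a minimal plain-Python sketch of the typing involved. It is not Polars' implementation; parse_hive_partitions and the file name 0.parquet are hypothetical, introduced only for illustration. The idea is that a partition value read from a path such as .../DATE=2025-02-26/... starts out as a string, and it only compares correctly against a datetime.date once it has been cast according to the declared schema, which is what hive_schema={"DATE": pl.Date} is expected to guarantee before the filter is evaluated.

```python
import datetime as dt
from pathlib import PurePosixPath


def parse_hive_partitions(path, schema):
    """Hypothetical helper: read key=value path segments and cast each value via `schema`."""
    values = {}
    for segment in PurePosixPath(path).parts:
        key, sep, raw = segment.partition("=")
        # Only segments of the form key=value with a known key are partition columns.
        if sep and key in schema:
            values[key] = schema[key](raw)
    return values


# Hypothetical file path following the layout described in this issue.
path = "some_s3_location/DATE=2025-02-26/0.parquet"

# No cast: the partition value stays a string, so it never equals a date object.
raw = parse_hive_partitions(path, {"DATE": str})

# Date cast (the plain-Python analogue of declaring the column as a date type).
typed = parse_hive_partitions(path, {"DATE": dt.date.fromisoformat})

print(raw["DATE"] == dt.date(2025, 2, 26))    # False
print(typed["DATE"] == dt.date(2025, 2, 26))  # True
```

The panic in the log output ("found value of type String" while building a Series of type Date) looks like the first case above: a string partition value reaching a comparison that expects a date.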

Installed versions

>>> import polars as pl
>>> pl.show_versions()
--------Version info---------
Polars:              1.22.0
Index type:          UInt32
Platform:            macOS-14.3-arm64-arm-64bit
Python:              3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  1.4.0
altair               5.5.0
azure.identity       <not installed>
boto3                1.35.0
cloudpickle          <not installed>
connectorx           0.4.2
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2025.2.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.10.1
numpy                2.2.3
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.1
pydantic             2.10.6
pyiceberg            <not installed>
sqlalchemy           2.0.38
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0