`scan_parquet` + `filter` on `S3` with Hive schema `pl.Date` breaks #21526

mbronckers · 2025-02-28T12:22:59Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

TL;DR: Since version 1.22.0, the following breaks on S3. Given a Hive-partitioned directory on S3 laid out as below, scan_parquet + filter on a datetime.date value panics when the hive schema declares the partition column as pl.Date.

# Setup: write a small Hive-partitioned dataset to S3
import datetime as dt
import polars as pl

file_directory = "some_s3_location"
df = pl.DataFrame(
    {
        "DATE": [dt.date(2025, 2, 25), dt.date(2025, 2, 26), dt.date(2025, 2, 27)],
        "VALUE": [10, 20, 30],
    }
)
df.write_parquet(file_directory, partition_by=["DATE"])

Reproducer

Specifying the hive schema as {'DATE': pl.Date} does not allow you to filter on datetime.date objects anymore.

import polars as pl
import datetime as dt

# works, because data is laid out as .../DATE=2025-02-26/...
a = (
    pl.scan_parquet(file_directory, hive_partitioning=True)
    .filter(DATE="2025-02-26")
    .collect()
    .head()
)
print(a)

# does not work
b = (
    pl.scan_parquet(
        file_directory,
        hive_partitioning=True,
        hive_schema={"DATE": pl.Date},
    )
    .filter(DATE=dt.date(2025, 2, 26))
    .collect()
    .head()
)
print(b)

# also does not work
c = (
    pl.scan_parquet(
        file_directory,
        hive_partitioning=True,
        hive_schema={"DATE": pl.Date},
    )
    .filter(DATE="2025-02-26")
    .collect()
    .head()
)
print(c)

Log output

reading of 1/1 file...

thread '<unnamed>' panicked at crates/polars-core/src/scalar/mod.rs:64:92:
called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Date; found value of type String: \"2025-02-26\""))
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: polars_core::scalar::Scalar::into_series
   4: polars_core::frame::column::scalar::ScalarColumn::_to_series
   5: polars_core::frame::column::scalar::ScalarColumn::as_n_values_series
   6: polars_core::frame::column::compare::<impl polars_core::chunked_array::ops::ChunkCompareIneq<&polars_core::frame::column::Column> for polars_core::frame::column::Column>::gt
   7: polars_expr::expressions::binary::apply_operator
   8: polars_expr::expressions::binary::apply_operator_owned
   9: <polars_expr::expressions::binary::BinaryExpr as polars_expr::expressions::PhysicalExpr>::evaluate
  10: <polars_expr::expressions::binary::BinaryExpr as polars_expr::expressions::PhysicalExpr>::evaluate
  11: <polars_expr::expressions::binary::BinaryExpr as polars_expr::expressions::PhysicalExpr>::evaluate
  12: <polars_mem_engine::predicate::SkipBatchPredicateHelper as polars_io::predicates::SkipBatchPredicate>::can_skip_batch
  13: polars_io::parquet::read::predicates::read_this_row_group
  14: <core::iter::adapters::filter_map::FilterMap<I,F> as core::iter::traits::iterator::Iterator>::next
  15: polars_io::parquet::read::async_impl::FetchRowGroupsFromObjectStore::new
  16: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
  17: polars_mem_engine::executors::scan::parquet::ParquetExec::read_async::{{closure}}
  18: polars_io::pl_async::RuntimeManager::block_on_potential_spawn::{{closure}}
  19: polars_mem_engine::executors::scan::parquet::ParquetExec::read_impl
  20: <polars_mem_engine::executors::scan::parquet::ParquetExec as polars_mem_engine::executors::executor::Executor>::execute
  21: polars_lazy::frame::LazyFrame::collect
  22: polars_python::lazyframe::general::<impl polars_python::lazyframe::PyLazyFrame>::__pymethod_collect__
  23: pyo3::impl_::trampoline::trampoline
  24: polars_python::lazyframe::general::_::__INVENTORY::trampoline
  25: _method_vectorcall_VARARGS_KEYWORDS
  26: _call_function
  27: __PyEval_EvalFrameDefault
  28: __PyEval_Vector
  29: _call_function
  30: __PyEval_EvalFrameDefault
  31: __PyEval_Vector
  32: _PyEval_EvalCode
  33: _run_eval_code_obj
  34: _run_mod
  35: _pyrun_file
  36: __PyRun_SimpleFileObject
  37: __PyRun_AnyFileObject
  38: _pymain_run_file_obj
  39: _pymain_run_file
  40: _Py_RunMain
  41: _Py_BytesMain
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "/Users/mbroncke/prj/work/eueo/webgui/test.py", line 43, in <module>
    ).filter(DATE=dt.date(2025,2,26)).collect().head())
  File "/Users/mbroncke/prj/work/eueo/webgui/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2066, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Date; found value of type String: \"2025-02-26\""))

Issue description

NB: try_parse_hive_dates does not appear to have any effect here. When the same dataset is read from the local filesystem instead of S3, the behavior is as expected.

Expected behavior

Allow filtering on datetime.date objects on S3 when hive schema specifies the partition dtype as pl.Date.

Installed versions

>>> import polars as pl
>>> pl.show_versions()
--------Version info---------
Polars:              1.22.0
Index type:          UInt32
Platform:            macOS-14.3-arm64-arm-64bit
Python:              3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  1.4.0
altair               5.5.0
azure.identity       <not installed>
boto3                1.35.0
cloudpickle          <not installed>
connectorx           0.4.2
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2025.2.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.10.1
numpy                2.2.3
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.1
pydantic             2.10.6
pyiceberg            <not installed>
sqlalchemy           2.0.38
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0

mbronckers added the labels bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) on Feb 28, 2025