Error reading ods file with read_ods #14053

Open
2 tasks done
archqt opened this issue Jan 28, 2024 · 8 comments
Labels
A-io-spreadsheet Area: reading/writing Excel/ODS files bug Something isn't working python Related to Python Polars

Comments

archqt commented Jan 28, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

data = pl.read_ods(
    source = "test.ods",
    schema_overrides = {"dt":pl.String},
    raise_if_empty = False,
)

⬇️ test.ods

Log output

Traceback (most recent call last):
  File "/home/moi/Cours/Planning/planning.py", line 15, in <module>
    data=pl.read_ods(source="test.ods",
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 387, in read_ods
    return _read_spreadsheet(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 428, in _read_spreadsheet
    parsed_sheets = {
                    ^
  File "/usr/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 429, in <dictcomp>
    name: reader_fn(
          ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 682, in _read_spreadsheet_ods
    df = pl.DataFrame(
         ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/polars/dataframe/frame.py", line 377, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/polars/utils/_construction.py", line 989, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/polars/utils/_construction.py", line 1133, in _sequence_of_sequence_to_pydf
    pydf = PyDataFrame.read_rows(
           ^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not append value: "Z" of type: str to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity
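
The error message above suggests a column whose cells mix types: most values are numeric and one cell holds the string "Z". A minimal, pure-Python sketch for spotting such columns in row data before building a DataFrame follows. The helper `find_mixed_type_columns` and the sample rows are hypothetical illustrations, not part of Polars and not the contents of test.ods.

```python
from collections import defaultdict
from typing import Any, Dict, Sequence, Set


def find_mixed_type_columns(rows: Sequence[Sequence[Any]]) -> Dict[int, Set[str]]:
    """Return {column index: type names} for columns holding more than one type.

    Empty cells (None) are ignored.
    """
    seen: Dict[int, Set[str]] = defaultdict(set)
    for row in rows:
        for idx, value in enumerate(row):
            if value is not None:
                seen[idx].add(type(value).__name__)
    return {idx: names for idx, names in seen.items() if len(names) > 1}


# Hypothetical sheet contents: column 1 is numeric except for a stray "Z".
rows = [
    ["a", 1.0],
    ["b", 2.0],
    ["c", "Z"],
    ["d", None],
]
mixed = find_mixed_type_columns(rows)
print(mixed)  # column 1 is reported with both float and str
```

Columns reported this way are candidates for a string override or for cleaning in the source file.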

Issue description

I removed a lot of content from the file to isolate this bug. Even if I remove the "Z" cell, I still get a "duplicate name" error. For now I will keep using pandas, and I will test again with the next Polars version.
Thanks for everything.

Expected behavior

No error if I remove the "Z" from the cell.

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Linux-6.7.1-arch1-1-x86_64-with-glibc2.38
Python:               3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.9.2
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              <not installed>
pydantic:             2.5.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             0.8.1
xlsxwriter:           <not installed>
@archqt archqt added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 28, 2024
@stinodego stinodego added the A-io-spreadsheet Area: reading/writing Excel/ODS files label Jan 29, 2024
@alexander-beedie alexander-beedie removed the needs triage Awaiting prioritization by a maintainer label Aug 7, 2024
Collaborator

alexander-beedie commented Aug 7, 2024

FYI: we updated the default engine for both Excel and ODS files to "calamine" somewhat recently, and this uses fastexcel to load the data instead. However, there is still a (slightly different) error.

I've taken your file above and created an even-more minimal test case for them, to demonstrate the issue (see: ToucanToco/fastexcel#275). @lukapeschke & @PrettyWood, if you can take a look that would be much appreciated! 😎

@archqt: In case the reformulating of your original file is partially responsible for uncovering the calamine error, can you try loading it again? (I'm also going to expose a "has_header" param for both read_ods and read_excel shortly, which may be useful for you as your original file doesn't appear to have table headers). Note that your "schema_overrides" parameter won't do anything as there doesn't seem to be a column called "dt".
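
The point about "schema_overrides" having no effect when the named column does not exist can be checked up front. The sketch below is a hypothetical helper (`split_overrides` is not a Polars function) that splits an overrides mapping into entries that match a real column and entries that do not. The column names and the dtype label are placeholders; with Polars one would pass `df.columns` and real dtypes such as `pl.String`.

```python
from typing import Any, Dict, Iterable, List, Mapping, Tuple


def split_overrides(
    columns: Iterable[str], overrides: Mapping[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Split overrides into those matching a real column and those that do not."""
    known = set(columns)
    applicable = {name: dtype for name, dtype in overrides.items() if name in known}
    unmatched = [name for name in overrides if name not in known]
    return applicable, unmatched


# Placeholder column names, assumed for illustration only.
columns = ["column_1", "column_2"]
applicable, unmatched = split_overrides(columns, {"dt": "String"})
print(applicable, unmatched)  # {} ['dt']
```

A non-empty `unmatched` list is a hint that the override targets a column name the sheet does not have.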

@PrettyWood

Hey! I'll look into it over the next few days 👍

Author

archqt commented Aug 10, 2024

I updated to polars 1.4.1-1. It works with the file sucess.ods, but it fails with the file failure.ods:

Traceback (most recent call last):

  File /usr/lib/python3.12/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File ~/Cours/Planning/planning.py:18
    data=pl.read_ods(source=lf[0])

  File /usr/lib/python3.12/site-packages/polars/io/spreadsheet/functions.py:439 in read_ods
    return _read_spreadsheet(

  File /usr/lib/python3.12/site-packages/polars/io/spreadsheet/functions.py:552 in _read_spreadsheet
    name: reader_fn(

  File /usr/lib/python3.12/site-packages/polars/io/spreadsheet/functions.py:890 in _read_spreadsheet_calamine
    ws_arrow = parser.load_sheet_eager(sheet_name, **read_options)

  File /usr/lib/python3.12/site-packages/fastexcel/__init__.py:203 in load_sheet_eager
    return self._reader.load_sheet(

CannotRetrieveCellDataError: cannot retrieve cell data at (8, 0)
Context:
    0: could not determine dtype for column StringCol

PrettyWood commented Aug 10, 2024

Yes, I made a fix on the calamine side this morning.

@alexander-beedie
Collaborator

> Yes, I made a fix on the calamine side this morning.

Many thanks!

Author

archqt commented Sep 14, 2024

It still doesn't work; I now have polars 1.7.1.

@PrettyWood

We still need calamine to merge my fix, and then we need to bump fastexcel. We are thinking about forking calamine so we can move faster.

PrettyWood commented Oct 14, 2024

It's now fixed in fastexcel 0.12.0 (released today).
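
To confirm that an environment has picked up the fix, one can compare the installed fastexcel version against 0.12.0. The sketch below uses only the standard library; the helper names (`parse_release`, `is_at_least`, `fastexcel_has_fix`) are hypothetical, and the parser deliberately ignores pre-release suffixes, so it is a rough check rather than a full version comparison.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(text: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '0.12.0' -> (0, 12, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, required: str) -> bool:
    """Compare two versions after zero-padding to the same length."""
    a, b = parse_release(installed), parse_release(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def fastexcel_has_fix(required: str = "0.12.0") -> Optional[bool]:
    """Return True/False if fastexcel is installed, None if it is not."""
    try:
        return is_at_least(version("fastexcel"), required)
    except PackageNotFoundError:
        return None


print(fastexcel_has_fix())
```

A result of `False` means fastexcel is installed but older than the release mentioned above; `None` means it is not installed at all.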
