pl.scan_csv().select('col1').collect() silently does not return all rows if one column contains the quote_char #21519

Open
2 tasks done
jmuecke opened this issue Feb 28, 2025 · 2 comments
Labels
A-io-csv Area: reading/writing CSV files bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

jmuecke commented Feb 28, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

csv = b"""col1,col2
val1,abc"
val2,random
"""
pl.scan_csv(csv).select("col1").collect()

# ┌──────┐
# │ col1 │
# │ ---  │
# │ str  │
# ╞══════╡
# │ val1 │
# └──────┘

Log output

_init_credential_provider_builder(): credential_provider_init = None
read files in parallel

Issue description

Selecting a column other than the one that contains the quote_char (here col2, where the default quote_char " appears but not as the first character of the field) returns only the rows up to and including the line with the quote_char. All subsequent rows are silently dropped. The same happens with any other character used as quote_char (e.g. q).

This also happens with POLARS_MAX_THREADS=1

If we select the column that contains the quote_char, the CSV is parsed correctly:

pl.scan_csv(csv).select("col2").collect()

# shape: (2, 1)
# ┌────────┐
# │ col2   │
# │ ---    │
# │ str    │
# ╞════════╡
# │ abc"   │
# │ random │
# └────────┘

Expected behavior

All rows from the CSV should be returned by scan_csv().select().collect().

Alternatively, an exception could be raised.
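
Until the reader itself reports such input, a user-side check can approximate the "raise an exception" behavior. The sketch below is a heuristic only: `find_unbalanced_quote_lines` is a hypothetical helper (not part of Polars), and it assumes quoted fields never span multiple lines, so a legitimate multi-line field would be flagged as well.

```python
def find_unbalanced_quote_lines(data: bytes, quote_char: bytes = b'"') -> list[int]:
    """Return 1-based line numbers that contain an odd number of quote_char."""
    bad = []
    for lineno, line in enumerate(data.splitlines(), start=1):
        # A well-formed single-line record has quotes in pairs
        # (opening/closing, or doubled as an escape).
        if line.count(quote_char) % 2 == 1:
            bad.append(lineno)
    return bad


csv = b"""col1,col2
val1,abc"
val2,random
"""

bad_lines = find_unbalanced_quote_lines(csv)
try:
    if bad_lines:
        raise ValueError(f"unbalanced quote_char on line(s): {bad_lines}")
except ValueError as exc:
    print(exc)
```

Running the check before scan_csv turns the silent truncation into an explicit error for this kind of input.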

Installed versions

I tested this with polars 1.0.0, 1.22.0, 1.23.0. All versions are affected by this bug.

--------Version info---------
Polars:              1.23.0
Index type:          UInt32
Platform:            Linux-6.12.9-200.fc41.x86_64-x86_64-with-glibc2.40
Python:              3.13.2 (main, Feb  4 2025, 00:00:00) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                1.36.17
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.12.0
gevent               <not installed>
google.auth          2.38.0
great_tables         <not installed>
matplotlib           3.9.4
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.1
polars_cloud         <not installed>
pyarrow              16.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.37
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0
@jmuecke jmuecke added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 28, 2025
@ritchie46
Member

Your csv is malformed. The quote char should be escaped.
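
For reference, the usual CSV convention (RFC 4180 style) is to wrap the field in quotes and double the embedded quote. A minimal sketch using Python's standard csv module (not Polars) to produce the escaped form of the example data; the variable names are illustrative:

```python
import csv
import io

rows = [["col1", "col2"], ["val1", 'abc"'], ["val2", "random"]]

# The default writer quotes any field containing the quote character
# and doubles the quote inside it.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
escaped = buf.getvalue()
print(escaped)
# col1,col2
# val1,"abc"""
# val2,random

# Reading the escaped text back recovers the original values.
parsed = list(csv.reader(io.StringIO(escaped)))
```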

@jmuecke
Author

jmuecke commented Feb 28, 2025

> Your csv is malformed. The quote char should be escaped.

True, but the parsing should be consistent between LazyFrames and normal DataFrames.

Example with DataFrame (collect first, then select)

csv = b"""col1,col2
val1,abc"
val2,random
"""
pl.scan_csv(csv).collect().select("col1")

# ┌──────┐
# │ col1 │
# │ ---  │
# │ str  │
# ╞══════╡
# │ val1 │
# │ val2 │
# └──────┘

Example with LazyFrame, but with the affected column selected

pl.scan_csv(csv).select(pl.all()).collect()

# ┌──────┬────────┐
# │ col1 ┆ col2   │
# │ ---  ┆ ---    │
# │ str  ┆ str    │
# ╞══════╪════════╡
# │ val1 ┆ abc"   │
# │ val2 ┆ random │
# └──────┴────────┘

@alexander-beedie alexander-beedie added the A-io-csv Area: reading/writing CSV files label Feb 28, 2025