read a parquet file error #108

Open
l1t1 opened this issue Mar 7, 2024 · 4 comments

Comments

@l1t1

l1t1 commented Mar 7, 2024

When running count(*):

> select count(*) from external('c:/t/t.parquet') ;
Error executing SQL: Error while reading parquet file: Error in c:/t/t.parquet
        Error while decoding row group 327 column chunk for column 'id' of type 'INT32' at offset 1419196279 of size 201495
        Decompressing compressed page of type 'DATA_PAGE_V2' at offset 1419196279 with codec 'ZSTD' failed
        (compressed region offset: 1419204506, compressed size: 193268, expected uncompressed size: 270339)
        Actual uncompressed size (262144 bytes) of ZSTD compressed data is less than expected (270339 bytes)
Hint: The file is probably corrupt.
Context: 0xfa6b0e2f

When fetching the first few rows:

> select * from external('c:/t/t.parquet') limit 5;
Error executing SQL: Error while reading parquet file: Error in c:/t/t.parquet
        Error while decoding row group 15 column chunk for column 'id' of type 'INT32' at offset 63856848 of size 180917
        Decompressing compressed page of type 'DATA_PAGE_V2' at offset 63856848 with codec 'ZSTD' failed
        (compressed region offset: 63864694, compressed size: 173071, expected uncompressed size: 257803)
        Actual uncompressed size (249988 bytes) of ZSTD compressed data is less than expected (257803 bytes)
Hint: The file is probably corrupt.
Context: 0xfa6b0e2f
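
The figures in the two error messages above can be cross-checked with a little arithmetic. The following is a minimal sketch and not part of the original report: the names `PATTERN`, `summarize`, `ERROR_ROW_GROUP_327` and `ERROR_ROW_GROUP_15` are hypothetical, and the message strings are the quoted errors with the terminal line wrapping removed. It checks whether the compressed region ends where the column chunk ends and how many bytes the decompressed data falls short of the expected size.

```python
import re

# Hypothetical helper names; the strings are the errors quoted in this issue,
# joined back into full lines.
ERROR_ROW_GROUP_327 = (
    "Error while decoding row group 327 column chunk for column 'id' of type 'INT32' "
    "at offset 1419196279 of size 201495\n"
    "Decompressing compressed page of type 'DATA_PAGE_V2' at offset 1419196279 "
    "with codec 'ZSTD' failed\n"
    "(compressed region offset: 1419204506, compressed size: 193268, "
    "expected uncompressed size: 270339)\n"
    "Actual uncompressed size (262144 bytes) of ZSTD compressed data is less "
    "than expected (270339 bytes)"
)
ERROR_ROW_GROUP_15 = (
    "Error while decoding row group 15 column chunk for column 'id' of type 'INT32' "
    "at offset 63856848 of size 180917\n"
    "Decompressing compressed page of type 'DATA_PAGE_V2' at offset 63856848 "
    "with codec 'ZSTD' failed\n"
    "(compressed region offset: 63864694, compressed size: 173071, "
    "expected uncompressed size: 257803)\n"
    "Actual uncompressed size (249988 bytes) of ZSTD compressed data is less "
    "than expected (257803 bytes)"
)

PATTERN = re.compile(
    r"at offset (?P<chunk_offset>\d+) of size (?P<chunk_size>\d+).*?"
    r"compressed region offset: (?P<region_offset>\d+), "
    r"compressed size: (?P<region_size>\d+), "
    r"expected uncompressed size: (?P<expected>\d+).*?"
    r"Actual uncompressed size \((?P<actual>\d+) bytes\)",
    re.DOTALL,
)


def summarize(message):
    """Extract the byte counts from one error message and derive three checks."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like the error quoted in this issue")
    v = {name: int(text) for name, text in match.groupdict().items()}
    return {
        # Bytes between the start of the page and the start of the compressed region.
        "prefix_bytes": v["region_offset"] - v["chunk_offset"],
        # True if the compressed region ends exactly where the column chunk ends.
        "ends_match": v["chunk_offset"] + v["chunk_size"]
        == v["region_offset"] + v["region_size"],
        # How many bytes fewer were decompressed than the reader expected.
        "shortfall": v["expected"] - v["actual"],
    }


for label, message in (("row group 327", ERROR_ROW_GROUP_327),
                       ("row group 15", ERROR_ROW_GROUP_15)):
    print(label, summarize(message))
```

For both row groups the compressed region ends exactly at the end of the column chunk, and the shortfall (8195 and 7815 bytes) is close to the number of bytes that precede the compressed region in the page (8227 and 7846 bytes). That pattern might indicate a disagreement about what the uncompressed size of a DATA_PAGE_V2 page covers rather than damaged bytes, but the numbers alone do not establish which side is at fault.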
@l1t1
Author

l1t1 commented Mar 7, 2024

Both DuckDB and Polars can read the same file.

sql.execute('SELECT max(id) a,min(id) b FROM t',eager=True)
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ i32      ┆ i32 │
╞══════════╪═════╡
│ 30000000 ┆ 1   │
└──────────┴─────┘

@jkammerer
Collaborator

Interesting find. Can you share the file with us?

@l1t1
Author

l1t1 commented Mar 12, 2024

It's too big (2.3 GB), so I am trying to find a smaller example.

@l1t1
Author

l1t1 commented Mar 13, 2024

I got some metadata from the bad file:

>>> import pyarrow.parquet as pq
>>> parquet_file = 'c:/t/t.parquet'
>>> metadata = pq.ParquetFile(parquet_file).metadata
>>> print(metadata)
<pyarrow._parquet.FileMetaData object at 0x0000000004FE0AE0>
  created_by: Arrow2 - Native Rust implementation of Arrow
  num_columns: 83
  num_rows: 5000000
  num_row_groups: 95
  format_version: 2.6
  serialized_size: 518729

For comparison, the metadata of a good file:

>>> metadata = pq.ParquetFile(parquet_file).metadata
>>> print(metadata)
<pyarrow._parquet.FileMetaData object at 0x000000000500AAE0>
  created_by: DuckDB
  num_columns: 6
  num_rows: 50000000
  num_row_groups: 499
  format_version: 1.0
  serialized_size: 275300
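
The file-level metadata above does not show which pages are affected. As a possible next step, the sketch below drills down to the column chunk of a single row group, using pyarrow's `FileMetaData.row_group()` and `RowGroupMetaData.column()` accessors. The helper names `column_chunk_summary` and `load_metadata` are hypothetical. The error messages mention row groups 15 and 327, while pyarrow reports 95 row groups for the bad file, so the helper returns `None` for an index that is out of range instead of raising.

```python
def column_chunk_summary(metadata, row_group, column_name):
    """Describe one column chunk, or return None if it is not present.

    `metadata` is expected to behave like pyarrow's FileMetaData.
    """
    if not 0 <= row_group < metadata.num_row_groups:
        return None
    group = metadata.row_group(row_group)
    for index in range(group.num_columns):
        chunk = group.column(index)
        if chunk.path_in_schema == column_name:
            return {
                "row_group": row_group,
                "column": chunk.path_in_schema,
                "physical_type": chunk.physical_type,
                "compression": chunk.compression,
                "encodings": tuple(chunk.encodings),
                "data_page_offset": chunk.data_page_offset,
                "total_compressed_size": chunk.total_compressed_size,
                "total_uncompressed_size": chunk.total_uncompressed_size,
            }
    return None


def load_metadata(path):
    """Read only the footer metadata of a parquet file."""
    # Imported lazily so column_chunk_summary can be used without pyarrow.
    import pyarrow.parquet as pq
    return pq.ParquetFile(path).metadata


# Hypothetical usage with the path from this issue:
#   metadata = load_metadata('c:/t/t.parquet')
#   for row_group in (15, 327):
#       print(column_chunk_summary(metadata, row_group, 'id'))
```

Comparing `data_page_offset` and `total_compressed_size` with the offsets and sizes in the error messages would show whether both readers agree on where the 'id' column chunk sits in the file.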
