[BUG] Multi-source batched JSON reader: error due to reordered columns in partial tables constructed from each batch #17689

shrshi · 2025-01-07T18:02:58Z

Describe the bug
Before concatenating the partial tables generated from each batch, the reader errors out if the schemas of the tables don't match. However, the column ordering of a partial table can change depending on the nulls in its columns, so two batches can hold the same columns in a different order. We should not error out in this case.

Steps/Code to reproduce bug
With draft PR #17688, run:
./build/latest/benchmarks/JSON_READER_NVBENCH --benchmark json_read_compressed_io --axis compression_type GZIP --axis data_size[pow2]=28 --axis num_sources=4 --device 0

Expected behavior
Take the column ordering from the partial table of the first batch and enforce it on the partial tables of all later batches.
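
The expected behavior can be sketched with a toy model. This is only an illustration, not the cuDF implementation: a "partial table" is modeled here as a plain Python dict mapping column name to a list of values, and the helper names `reorder_like_first` and `concatenate` are hypothetical.

```python
# Toy model (an assumption for illustration): a partial table is an
# insertion-ordered dict of column name -> list of values.

def reorder_like_first(first, later):
    """Return `later` with its columns arranged in the order used by `first`."""
    if set(first) != set(later):
        # Reordering only helps when both tables hold the same columns.
        raise ValueError("column sets differ; reordering alone cannot reconcile them")
    return {name: later[name] for name in first}

def concatenate(tables):
    """Concatenate partial tables row-wise, using the first table's column order."""
    first = tables[0]
    aligned = [first] + [reorder_like_first(first, t) for t in tables[1:]]
    return {name: [v for t in aligned for v in t[name]] for name in first}

batch0 = {"a": [1, 2], "b": ["x", None]}
batch1 = {"b": ["y", "z"], "a": [None, 4]}  # same columns, different order

# An order-sensitive schema comparison sees these two tables as different.
strict_match = list(batch0) == list(batch1)

# Aligning every later batch to the first batch's order lets them concatenate.
result = concatenate([batch0, batch1])
print(strict_match)
print(result)
```

The sketch deliberately raises when the column sets differ, since that situation needs more than a reorder.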


shrshi · 2025-01-07T23:36:48Z

"Enforce column ordering based on the partial table in the first batch in all later batches."

From offline discussions with @karthikeyann, the proposed solution has two pitfalls:

1. Data type mismatch for the same column between partial tables. For example, a column may be int8 in one partial table, but in the next chunk it might be inferred as int16 or float.
2. A column is present in the second batch but not in the first batch. In this case, we will prune that column out and the final table will be missing it. Note that the converse case, where a column present in the first batch is missing from some following batch, is handled by the JSON tree algorithms: the missing column is included and filled with nulls.
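
Both pitfalls can be made concrete with a small check that compares a later batch's schema against the first batch's. This is a hedged sketch only: schemas are modeled as dicts of column name to a dtype string, and `compare_to_first` is a hypothetical helper, not part of the JSON reader.

```python
# Toy schema (an assumption for illustration): an insertion-ordered dict of
# column name -> dtype name as a string.

def compare_to_first(first_schema, later_schema):
    """Classify how a later batch's schema differs from the first batch's."""
    report = {"dtype_mismatch": {}, "only_in_later": [], "only_in_first": []}
    for name, dtype in later_schema.items():
        if name not in first_schema:
            # Pitfall 2: enforcing the first batch's schema would prune this column.
            report["only_in_later"].append(name)
        elif first_schema[name] != dtype:
            # Pitfall 1: same column, different inferred type.
            report["dtype_mismatch"][name] = (first_schema[name], dtype)
    # Converse case: per the discussion above, this one is already handled
    # by filling the missing column with nulls.
    report["only_in_first"] = [n for n in first_schema if n not in later_schema]
    return report

first = {"a": "int8", "b": "str"}
later = {"b": "str", "a": "int16", "c": "float64"}

report = compare_to_first(first, later)
print(report)
```

Column ordering plays no role in this comparison, so a reordered batch with identical columns and types produces an empty report.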

shrshi added the bug label Jan 7, 2025

shrshi mentioned this issue Jan 7, 2025

[DNR] Add multi-source reading to JSON reader benchmarks #17688 (Draft)

shrshi mentioned this issue Jan 9, 2025

Enforce schema for partial tables in multi-source multi-batch JSON reader #17708 (Open)
