
Reading multi-source compressed JSONL files #17161

Open · wants to merge 19 commits into base: branch-24.12

Conversation

@shrshi (Contributor) commented Oct 23, 2024

Description

Fixes #17068
Fixes #12299

This PR introduces a new datasource for compressed inputs, which enables batching and byte-range reading of multi-source JSONL files using a reallocate-and-retry policy. Moreover, instead of relying on a 4:1 compression-ratio heuristic, the device buffer size is estimated accurately for the GZIP, ZIP, and SNAPPY compression types. For the remaining types, the files are first decompressed and then batched.
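For context on the size estimation, here is a minimal sketch (illustrative only, not the PR's actual code in cpp/src/io/comp/uncomp.cpp) of how a GZIP stream's uncompressed size can be read from its footer rather than guessed from a fixed ratio:

```cpp
// Minimal sketch: per RFC 1952, the last four bytes of a GZIP member hold
// ISIZE, the uncompressed size modulo 2^32, stored little-endian.
#include <cstddef>
#include <cstdint>

std::size_t gzip_uncompressed_size_estimate(std::uint8_t const* data, std::size_t size)
{
  if (size < 4) { return 0; }  // too short to be a valid GZIP stream
  std::uint8_t const* footer = data + size - 4;
  return static_cast<std::size_t>(footer[0]) |
         (static_cast<std::size_t>(footer[1]) << 8) |
         (static_cast<std::size_t>(footer[2]) << 16) |
         (static_cast<std::size_t>(footer[3]) << 24);
}
```

Note that ISIZE wraps at 4 GiB and a multi-member archive's footer covers only the last member, so the estimate can still be low; this is one reason a reallocate-and-retry fallback remains necessary.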

TODO: Reuse the existing JSON tests, with an additional compression parameter, to verify correctness.
Handled by #17219, which implements the compressed JSON writer required for the tests above.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf label Oct 23, 2024
@shrshi added the cuIO, improvement, and non-breaking labels Oct 24, 2024
@shrshi marked this pull request as ready for review October 25, 2024 18:21
@shrshi requested a review from a team as a code owner October 25, 2024 18:21
Review thread on cpp/src/io/comp/uncomp.cpp (outdated, resolved)
@vuule (Contributor) left a comment

I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.
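(For illustration, one way to read the "memoization" suggestion: have the compressed source decompress lazily and cache the result, so repeated byte-range reads pay for decompression only once. A hypothetical sketch; the names and interface below are not cudf's actual datasource API.)

```cpp
// Hypothetical memoizing source: the first host_read decompresses and caches;
// later reads slice the cached buffer. Illustrative only.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using buffer = std::vector<std::uint8_t>;

class compressed_source {
 public:
  compressed_source(buffer compressed, std::function<buffer(buffer const&)> decompress)
    : _compressed{std::move(compressed)}, _decompress{std::move(decompress)}
  {
  }

  // Returns a byte range of the decompressed stream.
  buffer host_read(std::size_t offset, std::size_t size)
  {
    if (!_cache) { _cache = _decompress(_compressed); }  // decompress at most once
    auto const first = std::min(offset, _cache->size());
    auto const last  = std::min(offset + size, _cache->size());
    return buffer(_cache->begin() + first, _cache->begin() + last);
  }

 private:
  buffer _compressed;
  std::function<buffer(buffer const&)> _decompress;
  std::optional<buffer> _cache;  // memoized decompressed bytes
};
```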

Four review threads on cpp/src/io/json/read_json.cu (outdated, resolved)
@shrshi (Contributor, Author) commented Oct 31, 2024

Status update: implementing the benchmark for compressed JSONL files in #17219.

@shrshi (Contributor, Author) commented Nov 6, 2024

Plot of read_json throughput for GZIP-compressed JSONL inputs against uncompressed input size, generated using the benchmark from #17219:
[figure: read_json throughput vs. uncompressed input size for GZIP-compressed JSONL inputs]

Benchmark raw output:

| compression_type | io | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|
| GZIP | FILEPATH | 2^20 = 1048576 | 768x | 18.205 ms | 1.88% | 18.196 ms | 1.88% | 2.95E+10 | 13.458 MiB | 386.724 KiB |
| GZIP | FILEPATH | 2^21 = 2097152 | 617x | 24.215 ms | 3.04% | 24.206 ms | 3.04% | 2.22E+10 | 26.707 MiB | 768.517 KiB |
| GZIP | FILEPATH | 2^22 = 4194304 | 418x | 35.741 ms | 2.15% | 35.732 ms | 2.15% | 1.5E+10 | 53.015 MiB | 1.492 MiB |
| GZIP | FILEPATH | 2^23 = 8388608 | 251x | 59.739 ms | 1.60% | 59.731 ms | 1.60% | 8.99E+09 | 106.061 MiB | 2.984 MiB |
| GZIP | FILEPATH | 2^24 = 16777216 | 120x | 125.038 ms | 0.96% | 125.029 ms | 0.96% | 4.29E+09 | 212.568 MiB | 5.981 MiB |
| GZIP | FILEPATH | 2^25 = 33554432 | 5x | 244.733 ms | 0.37% | 244.726 ms | 0.37% | 2.19E+09 | 425.201 MiB | 11.965 MiB |
| GZIP | FILEPATH | 2^26 = 67108864 | 5x | 481.785 ms | 0.39% | 481.781 ms | 0.39% | 1.11E+09 | 849.693 MiB | 23.925 MiB |
| GZIP | FILEPATH | 2^27 = 134217728 | 5x | 939.229 ms | 0.23% | 939.231 ms | 0.23% | 5.72E+08 | 1.658 GiB | 47.797 MiB |
| GZIP | FILEPATH | 2^28 = 268435456 | 5x | 1.923 s | 0.28% | 1.923 s | 0.28% | 2.79E+08 | 3.318 GiB | 95.642 MiB |
| GZIP | FILEPATH | 2^29 = 536870912 | 4x | 3.885 s | inf% | 3.885 s | inf% | 1.38E+08 | 6.638 GiB | 191.359 MiB |

Without the custom datasource for compressed inputs, the parser errors out at all of the above data sizes. I suspect that the compressed-input heuristic in the previous implementation falls short without the retry policy, but more investigation is required to understand the failure, especially at the smaller data sizes.
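(For readers following along, the reallocate-and-retry policy mentioned in the description can be pictured roughly as below. This is a hypothetical sketch with made-up names, passing the decompressor in as a callback; the PR's real logic lives in the JSON reader.)

```cpp
// Hypothetical reallocate-and-retry loop: start from the estimated
// decompressed size and grow the output buffer until the decompressor no
// longer reports that it is too small. All names here are illustrative.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

enum class decomp_status { success, output_too_small };

using try_decompress_fn = std::function<decomp_status(
  std::vector<std::uint8_t> const&, std::vector<std::uint8_t>&, std::size_t&)>;

std::vector<std::uint8_t> decompress_with_retry(std::vector<std::uint8_t> const& input,
                                                std::size_t estimated_size,
                                                try_decompress_fn try_decompress)
{
  std::vector<std::uint8_t> output(std::max<std::size_t>(estimated_size, 1));
  std::size_t written = 0;
  while (try_decompress(input, output, written) == decomp_status::output_too_small) {
    output.resize(output.size() * 2);  // reallocate larger and retry
  }
  output.resize(written);  // trim to the actual decompressed size
  return output;
}
```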

@shrshi requested a review from vuule November 6, 2024 13:43
@shrshi (Contributor, Author) commented Nov 7, 2024

Profiling the benchmark with nsys shows that the total runtime is dominated by host-side compression and decompression (in get_record_range_raw_input in read_json).
[figure: nsys profile for data sizes 2^27, 2^28, and 2^29]

@vuule (Contributor) left a comment

I haven't reviewed the new decompress_snappy as I'm not sure it's needed. Otherwise looks good.

Review thread on cpp/src/io/comp/uncomp.cpp (outdated, resolved)
Review thread on cpp/src/io/json/read_json.hpp (outdated, resolved)
Labels
cuIO · improvement · libcudf · non-breaking
Projects
Status: In Progress
Status: Burndown
3 participants