
Reading multi-source compressed JSONL files #17161

Open · wants to merge 19 commits into base: branch-24.12

Conversation

@shrshi (Contributor) commented Oct 23, 2024

Description

Fixes #17068
Fixes #12299

This PR introduces a new datasource for compressed inputs, which enables batching and byte-range reading of multi-source JSONL files using a reallocate-and-retry policy. Moreover, instead of relying on a 4:1 compression-ratio heuristic, the device buffer size is estimated accurately for the GZIP, ZIP, and SNAPPY compression types. For the remaining types, the files are first decompressed and then batched.
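For context on the size estimation, here is a minimal sketch (illustrative only, not the PR's actual code in cpp/src/io/comp/uncomp.cpp) of how a GZIP stream's uncompressed size can be read from its footer rather than guessed from a fixed ratio:

```cpp
// Minimal sketch: per RFC 1952, the last four bytes of a GZIP member hold
// ISIZE, the uncompressed size modulo 2^32, stored little-endian.
#include <cstddef>
#include <cstdint>

std::size_t gzip_uncompressed_size_estimate(std::uint8_t const* data, std::size_t size)
{
  if (size < 4) { return 0; }  // too short to be a valid GZIP stream
  std::uint8_t const* footer = data + size - 4;
  return static_cast<std::size_t>(footer[0]) |
         (static_cast<std::size_t>(footer[1]) << 8) |
         (static_cast<std::size_t>(footer[2]) << 16) |
         (static_cast<std::size_t>(footer[3]) << 24);
}
```

Note that ISIZE wraps at 4 GiB and a multi-member archive's footer covers only the last member, so the estimate can still be low; this is one reason a reallocate-and-retry fallback remains necessary.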

TODO: Reuse the existing JSON tests, with an additional compression parameter, to verify correctness.
Handled by #17219, which implements the compressed JSON writer required for the tests above.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf label Oct 23, 2024
@shrshi added the cuIO, improvement, and non-breaking labels Oct 24, 2024
@shrshi marked this pull request as ready for review October 25, 2024 18:21
@shrshi requested a review from a team as a code owner October 25, 2024 18:21
Review thread on cpp/src/io/comp/uncomp.cpp (outdated, resolved)
@vuule (Contributor) left a comment

I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.
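(For illustration, one way to read the "memoization" suggestion: have the compressed source decompress lazily and cache the result, so repeated byte-range reads pay for decompression only once. A hypothetical sketch; the names and interface below are not cudf's actual datasource API.)

```cpp
// Hypothetical memoizing source: the first host_read decompresses and caches;
// later reads slice the cached buffer. Illustrative only.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using buffer = std::vector<std::uint8_t>;

class compressed_source {
 public:
  compressed_source(buffer compressed, std::function<buffer(buffer const&)> decompress)
    : _compressed{std::move(compressed)}, _decompress{std::move(decompress)}
  {
  }

  // Returns a byte range of the decompressed stream.
  buffer host_read(std::size_t offset, std::size_t size)
  {
    if (!_cache) { _cache = _decompress(_compressed); }  // decompress at most once
    auto const first = std::min(offset, _cache->size());
    auto const last  = std::min(offset + size, _cache->size());
    return buffer(_cache->begin() + first, _cache->begin() + last);
  }

 private:
  buffer _compressed;
  std::function<buffer(buffer const&)> _decompress;
  std::optional<buffer> _cache;  // memoized decompressed bytes
};
```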

Four review threads on cpp/src/io/json/read_json.cu (outdated, resolved)
@shrshi (Contributor, Author) commented Oct 31, 2024

Status update: implementing the benchmark for compressed JSONL files in #17219.

@shrshi (Contributor, Author) commented Nov 6, 2024

Plot of read_json throughput for GZIP-compressed JSONL inputs against uncompressed input size, generated using the benchmark from #17219:
[figure: read_json throughput vs. uncompressed input size for GZIP-compressed JSONL inputs]

Benchmark raw output:

| compression_type | io | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|
| GZIP | FILEPATH | 2^20 = 1048576 | 768x | 18.205 ms | 1.88% | 18.196 ms | 1.88% | 2.95E+10 | 13.458 MiB | 386.724 KiB |
| GZIP | FILEPATH | 2^21 = 2097152 | 617x | 24.215 ms | 3.04% | 24.206 ms | 3.04% | 2.22E+10 | 26.707 MiB | 768.517 KiB |
| GZIP | FILEPATH | 2^22 = 4194304 | 418x | 35.741 ms | 2.15% | 35.732 ms | 2.15% | 1.5E+10 | 53.015 MiB | 1.492 MiB |
| GZIP | FILEPATH | 2^23 = 8388608 | 251x | 59.739 ms | 1.60% | 59.731 ms | 1.60% | 8.99E+09 | 106.061 MiB | 2.984 MiB |
| GZIP | FILEPATH | 2^24 = 16777216 | 120x | 125.038 ms | 0.96% | 125.029 ms | 0.96% | 4.29E+09 | 212.568 MiB | 5.981 MiB |
| GZIP | FILEPATH | 2^25 = 33554432 | 5x | 244.733 ms | 0.37% | 244.726 ms | 0.37% | 2.19E+09 | 425.201 MiB | 11.965 MiB |
| GZIP | FILEPATH | 2^26 = 67108864 | 5x | 481.785 ms | 0.39% | 481.781 ms | 0.39% | 1.11E+09 | 849.693 MiB | 23.925 MiB |
| GZIP | FILEPATH | 2^27 = 134217728 | 5x | 939.229 ms | 0.23% | 939.231 ms | 0.23% | 5.72E+08 | 1.658 GiB | 47.797 MiB |
| GZIP | FILEPATH | 2^28 = 268435456 | 5x | 1.923 s | 0.28% | 1.923 s | 0.28% | 2.79E+08 | 3.318 GiB | 95.642 MiB |
| GZIP | FILEPATH | 2^29 = 536870912 | 4x | 3.885 s | inf% | 3.885 s | inf% | 1.38E+08 | 6.638 GiB | 191.359 MiB |

Without the custom datasource for compressed inputs, the parser errors out at all of the above data sizes. I suspect that the compressed-input heuristic in the previous implementation falls short without the retry policy, but more investigation is required to understand the failure, especially at the smaller data sizes.
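(For readers following along, the reallocate-and-retry policy mentioned in the description can be pictured roughly as below. This is a hypothetical sketch with made-up names, passing the decompressor in as a callback; the PR's real logic lives in the JSON reader.)

```cpp
// Hypothetical reallocate-and-retry loop: start from the estimated
// decompressed size and grow the output buffer until the decompressor no
// longer reports that it is too small. All names here are illustrative.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

enum class decomp_status { success, output_too_small };

using try_decompress_fn = std::function<decomp_status(
  std::vector<std::uint8_t> const&, std::vector<std::uint8_t>&, std::size_t&)>;

std::vector<std::uint8_t> decompress_with_retry(std::vector<std::uint8_t> const& input,
                                                std::size_t estimated_size,
                                                try_decompress_fn try_decompress)
{
  std::vector<std::uint8_t> output(std::max<std::size_t>(estimated_size, 1));
  std::size_t written = 0;
  while (try_decompress(input, output, written) == decomp_status::output_too_small) {
    output.resize(output.size() * 2);  // reallocate larger and retry
  }
  output.resize(written);  // trim to the actual decompressed size
  return output;
}
```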

@shrshi requested a review from vuule November 6, 2024 13:43
@shrshi (Contributor, Author) commented Nov 7, 2024

Profiling the benchmark with nsys shows that the total runtime is dominated by host-side compression and decompression (in get_record_range_raw_input in read_json).
[figure: nsys profile for data sizes 2^27, 2^28, and 2^29]

@vuule (Contributor) left a comment

I haven't reviewed the new decompress_snappy as I'm not sure it's needed. Otherwise looks good.

Review thread on cpp/src/io/comp/uncomp.cpp (outdated, resolved)
Review thread on cpp/src/io/json/read_json.hpp (outdated, resolved)
Labels
cuIO · improvement · libcudf · non-breaking
Projects
Status: In Progress
Status: Burndown
3 participants