
GH-41640: [Go] Implement BYTE_STREAM_SPLIT Parquet Encoding #43066

Merged: 25 commits merged into apache:main on Jul 9, 2024

Conversation

@joellubi (Member) commented Jun 26, 2024

Rationale for this change

This encoding is defined by the Parquet spec but does not currently have a Go implementation.
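
For context, BYTE_STREAM_SPLIT stores the i-th byte of every fixed-width value in the i-th of K byte streams and concatenates the streams; the encoding itself does not shrink the data, but grouping bytes of equal significance typically compresses much better for floating-point columns. A minimal sketch of the transform (illustrative names, not this PR's API):

```go
// byteStreamSplitEncode scatters byte b of value v into stream b.
// The `width` streams are laid out back to back in the output.
func byteStreamSplitEncode(data []byte, width int) []byte {
	numValues := len(data) / width
	out := make([]byte, len(data))
	for v := 0; v < numValues; v++ {
		for b := 0; b < width; b++ {
			out[b*numValues+v] = data[v*width+b]
		}
	}
	return out
}
```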

What changes are included in this PR?

Implement BYTE_STREAM_SPLIT encoder/decoder for:

  • FIXED_LEN_BYTE_ARRAY
  • FLOAT
  • DOUBLE
  • INT32
  • INT64

Are these changes tested?

Yes. See unit tests, file read conformance tests, and benchmarks.

Benchmark results on my machine

➜  go git:(impl-pq-bytestreamsplit) go test ./parquet/internal/encoding -run=^$ -bench=BenchmarkByteStreamSplit -benchmem 
goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding
BenchmarkByteStreamSplitEncodingInt32/len_1024-14                 502117              2005 ns/op        2043.37 MB/s        5267 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_2048-14                 328921              3718 ns/op        2203.54 MB/s        9879 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_4096-14                 169642              7083 ns/op        2313.14 MB/s       18852 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_8192-14                  82503             14094 ns/op        2324.99 MB/s       41425 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_16384-14                 45006             26841 ns/op        2441.68 MB/s       74286 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_32768-14                 23433             51233 ns/op        2558.33 MB/s      140093 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_65536-14                 12019             99001 ns/op        2647.90 MB/s      271417 B/op          3 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_1024-14                 996573              1199 ns/op        3417.00 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_2048-14                 503200              2380 ns/op        3442.18 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_4096-14                 252038              4748 ns/op        3450.90 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_8192-14                 122419              9793 ns/op        3346.08 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_16384-14                 63321             19040 ns/op        3442.00 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_32768-14                 31051             38677 ns/op        3388.89 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_65536-14                 15792             77931 ns/op        3363.80 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_1024-14                  981043              1221 ns/op        3354.53 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_2048-14                  492319              2424 ns/op        3379.34 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_4096-14                  248062              4850 ns/op        3378.20 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_8192-14                  123064              9903 ns/op        3308.87 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_16384-14                  61845             19567 ns/op        3349.29 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_32768-14                  30568             39456 ns/op        3321.96 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_65536-14                  15172             78762 ns/op        3328.30 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_1024-14                         319006              3690 ns/op        2220.13 MB/s        9880 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_2048-14                         161006              7132 ns/op        2297.30 MB/s       18853 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_4096-14                          85783             13925 ns/op        2353.12 MB/s       41421 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_8192-14                          45015             26943 ns/op        2432.43 MB/s       74312 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_16384-14                         20352             59259 ns/op        2211.84 MB/s      139940 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_32768-14                         10000            111143 ns/op        2358.61 MB/s      271642 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_65536-14                          5529            212652 ns/op        2465.47 MB/s      534805 B/op          3 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_1024-14                         528987              2355 ns/op        3478.32 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_2048-14                         262707              4701 ns/op        3485.08 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_4096-14                         129212              9313 ns/op        3518.63 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_8192-14                          53746             23315 ns/op        2810.90 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_16384-14                         28782             41054 ns/op        3192.65 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_32768-14                         14803             80157 ns/op        3270.39 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_65536-14                          7484            164111 ns/op        3194.72 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14             291716              4107 ns/op         997.43 MB/s        5276 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14             148888              7975 ns/op        1027.18 MB/s        9914 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14              76587             15677 ns/op        1045.11 MB/s       18955 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14              39758             30277 ns/op        1082.26 MB/s       41752 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14             20306             59506 ns/op        1101.33 MB/s       74937 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14             10000            116043 ns/op        1129.52 MB/s      141290 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14              4770            236887 ns/op        1106.62 MB/s      277583 B/op          3 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14             601875              1723 ns/op        2376.70 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14             363206              3422 ns/op        2394.18 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14             173041              6906 ns/op        2372.45 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14              81810             14307 ns/op        2290.40 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14             40518             29101 ns/op        2252.04 MB/s           1 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14             21338             56678 ns/op        2312.58 MB/s           6 B/op          1 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14             10000            111433 ns/op        2352.49 MB/s          26 B/op          6 allocs/op
PASS
ok      github.com/apache/arrow/go/v17/parquet/internal/encoding        69.109s

Are there any user-facing changes?

New ByteStreamSplit encoding option available. Godoc updated to reflect this.
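
As a usage sketch, assuming the option is exposed through the existing writer-property helpers (the constant and option names below reflect my reading of the package and may differ):

```go
package main

import "github.com/apache/arrow/go/v17/parquet"

func main() {
	// BYTE_STREAM_SPLIT applies only to columns that are not
	// dictionary-encoded, so dictionary encoding is disabled here.
	props := parquet.NewWriterProperties(
		parquet.WithDictionaryDefault(false),
		parquet.WithEncoding(parquet.Encodings.ByteStreamSplit),
	)
	_ = props // pass props to a parquet/pqarrow file writer as usual
}
```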


⚠️ GitHub issue #41640 has been automatically assigned in GitHub to PR creator.

@mapleFU (Member) commented Jun 26, 2024

Feel free to ping me if the PR is ready

@joellubi joellubi marked this pull request as ready for review June 28, 2024 17:54
@joellubi joellubi requested a review from zeroshade June 28, 2024 17:54
@joellubi joellubi requested a review from mapleFU June 28, 2024 18:26
@joellubi (Member, Author) commented Jul 8, 2024

@mapleFU @zeroshade I pushed up some changes to the decoders which align them more closely with the current C++ implementation. I also added a new benchmark for batched decoding. All benchmarks are updated in the PR description.

Overall, the batched approach improves decoding performance slightly across the board. This is most likely because an intermediate buffer is no longer needed: batches can be decoded directly into the output buffer. The new benchmark demonstrates that there is not much performance difference between one-batch-per-page and many-batches-per-page decoding. There may be bigger differences for extremely small batch sizes, but I did my best to pick a realistic number. Memory usage is also lower with the batched approach, since we write directly into the output buffer and don't have to allocate pageSize bytes per column reader to decode everything at once.
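
To illustrate the batched idea (a sketch with made-up names, not the code in this PR): because each of the `width` streams in a page is `totalValues` bytes long, any sub-range of the page can be reassembled straight into the caller's output slice, with no page-sized intermediate buffer.

```go
// decodeBatch reassembles batchSize values starting at `offset` within the
// page directly into `out` (len(out) == batchSize*width). `streams` is the
// whole encoded page: width streams of totalValues bytes each.
func decodeBatch(streams []byte, width, totalValues, offset int, out []byte) {
	batchSize := len(out) / width
	for v := 0; v < batchSize; v++ {
		for b := 0; b < width; b++ {
			out[v*width+b] = streams[b*totalValues+offset+v]
		}
	}
}
```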

@zeroshade (Member) left a comment

@joellubi Sounds great! Glad to hear that it is overall better performing; the tests look good to me.

My final nitpicks! :)

Comment on lines 297 to 300
type ByteStreamSplitFloat32Decoder = ByteStreamSplitDecoder[float32]
type ByteStreamSplitFloat64Decoder = ByteStreamSplitDecoder[float64]
type ByteStreamSplitInt32Decoder = ByteStreamSplitDecoder[int32]
type ByteStreamSplitInt64Decoder = ByteStreamSplitDecoder[int64]
Reviewer (Member) commented:

Should we do the same approach for the encoders too?

Should probably also add godoc comments on these

@joellubi (Member, Author) replied:

Just added the godoc comments.

I did like how the generic decoders came out and looked at what it would take to do the same for the encoders. It's a little trickier with the encoders because they all embed their respective "Plain" encoders, and it's awkward to make this generic at the moment because the Plain encoders are not generic themselves. I think this all gets a lot simpler if/when the overall refactor of parquet to use generics is done, since then ByteStreamSplitEncoder[T] could just embed PlainEncoder[T] once it exists.
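
Purely as an illustration of the shape described above (the generic PlainEncoder[T] below is hypothetical and does not exist in the package today):

```go
// Hypothetical generic plain encoder; a stand-in for a future refactor.
type PlainEncoder[T float32 | float64 | int32 | int64] struct{}

// With that in place, the encoder could mirror the generic decoders by
// simply embedding the plain encoder for its element type.
type ByteStreamSplitEncoder[T float32 | float64 | int32 | int64] struct {
	PlainEncoder[T]
}
```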

@mapleFU (Member) commented Jul 9, 2024

ByteStreamSplit Part LGTM

> Overall, the batched approach improves performance slightly across the board for decoding. This is most likely because an intermediary buffer is no longer needed with this approach, and batches can be directly decoded into the output buffer. The new benchmark demonstrates that there's not much of a difference in performance between one-batch-per-page and many-batches-per-page decoding.

Nice to hear that

@zeroshade merged commit 89fd566 into apache:main on Jul 9, 2024. 26 checks passed.

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 89fd566.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
