
GH-41640: [Go] Implement BYTE_STREAM_SPLIT Parquet Encoding #43066

Merged: 25 commits merged into apache:main on Jul 9, 2024

Conversation

@joellubi (Member) commented Jun 26, 2024

Rationale for this change

This encoding is defined by the Parquet spec but does not currently have a Go implementation.
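
For context, BYTE_STREAM_SPLIT stores the i-th byte of every fixed-width value in the i-th of K byte streams and concatenates the streams; the encoding itself does not shrink the data, but grouping bytes of equal significance typically compresses much better for floating-point columns. A minimal sketch of the transform (illustrative names, not this PR's API):

```go
// byteStreamSplitEncode scatters byte b of value v into stream b.
// The `width` streams are laid out back to back in the output.
func byteStreamSplitEncode(data []byte, width int) []byte {
	numValues := len(data) / width
	out := make([]byte, len(data))
	for v := 0; v < numValues; v++ {
		for b := 0; b < width; b++ {
			out[b*numValues+v] = data[v*width+b]
		}
	}
	return out
}
```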

What changes are included in this PR?

Implement BYTE_STREAM_SPLIT encoder/decoder for:

  • FIXED_LEN_BYTE_ARRAY
  • FLOAT
  • DOUBLE
  • INT32
  • INT64

Are these changes tested?

Yes. See unit tests, file read conformance tests, and benchmarks.

Benchmark results on my machine

➜  go git:(impl-pq-bytestreamsplit) go test ./parquet/internal/encoding -run=^$ -bench=BenchmarkByteStreamSplit -benchmem 
goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding
BenchmarkByteStreamSplitEncodingInt32/len_1024-14                 502117              2005 ns/op        2043.37 MB/s        5267 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_2048-14                 328921              3718 ns/op        2203.54 MB/s        9879 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_4096-14                 169642              7083 ns/op        2313.14 MB/s       18852 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_8192-14                  82503             14094 ns/op        2324.99 MB/s       41425 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_16384-14                 45006             26841 ns/op        2441.68 MB/s       74286 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_32768-14                 23433             51233 ns/op        2558.33 MB/s      140093 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_65536-14                 12019             99001 ns/op        2647.90 MB/s      271417 B/op          3 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_1024-14                 996573              1199 ns/op        3417.00 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_2048-14                 503200              2380 ns/op        3442.18 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_4096-14                 252038              4748 ns/op        3450.90 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_8192-14                 122419              9793 ns/op        3346.08 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_16384-14                 63321             19040 ns/op        3442.00 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_32768-14                 31051             38677 ns/op        3388.89 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_65536-14                 15792             77931 ns/op        3363.80 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_1024-14                  981043              1221 ns/op        3354.53 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_2048-14                  492319              2424 ns/op        3379.34 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_4096-14                  248062              4850 ns/op        3378.20 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_8192-14                  123064              9903 ns/op        3308.87 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_16384-14                  61845             19567 ns/op        3349.29 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_32768-14                  30568             39456 ns/op        3321.96 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_65536-14                  15172             78762 ns/op        3328.30 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_1024-14                         319006              3690 ns/op        2220.13 MB/s        9880 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_2048-14                         161006              7132 ns/op        2297.30 MB/s       18853 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_4096-14                          85783             13925 ns/op        2353.12 MB/s       41421 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_8192-14                          45015             26943 ns/op        2432.43 MB/s       74312 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_16384-14                         20352             59259 ns/op        2211.84 MB/s      139940 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_32768-14                         10000            111143 ns/op        2358.61 MB/s      271642 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_65536-14                          5529            212652 ns/op        2465.47 MB/s      534805 B/op          3 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_1024-14                         528987              2355 ns/op        3478.32 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_2048-14                         262707              4701 ns/op        3485.08 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_4096-14                         129212              9313 ns/op        3518.63 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_8192-14                          53746             23315 ns/op        2810.90 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_16384-14                         28782             41054 ns/op        3192.65 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_32768-14                         14803             80157 ns/op        3270.39 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_65536-14                          7484            164111 ns/op        3194.72 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14             291716              4107 ns/op         997.43 MB/s        5276 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14             148888              7975 ns/op        1027.18 MB/s        9914 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14              76587             15677 ns/op        1045.11 MB/s       18955 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14              39758             30277 ns/op        1082.26 MB/s       41752 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14             20306             59506 ns/op        1101.33 MB/s       74937 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14             10000            116043 ns/op        1129.52 MB/s      141290 B/op          3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14              4770            236887 ns/op        1106.62 MB/s      277583 B/op          3 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14             601875              1723 ns/op        2376.70 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14             363206              3422 ns/op        2394.18 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14             173041              6906 ns/op        2372.45 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14              81810             14307 ns/op        2290.40 MB/s           0 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14             40518             29101 ns/op        2252.04 MB/s           1 B/op          0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14             21338             56678 ns/op        2312.58 MB/s           6 B/op          1 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14             10000            111433 ns/op        2352.49 MB/s          26 B/op          6 allocs/op
PASS
ok      github.com/apache/arrow/go/v17/parquet/internal/encoding        69.109s

Are there any user-facing changes?

New ByteStreamSplit encoding option available. Godoc updated to reflect this.
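
As a usage sketch, assuming the option is exposed through the existing writer-property helpers (the constant and option names below reflect my reading of the package and may differ):

```go
package main

import "github.com/apache/arrow/go/v17/parquet"

func main() {
	// BYTE_STREAM_SPLIT applies only to columns that are not
	// dictionary-encoded, so dictionary encoding is disabled here.
	props := parquet.NewWriterProperties(
		parquet.WithDictionaryDefault(false),
		parquet.WithEncoding(parquet.Encodings.ByteStreamSplit),
	)
	_ = props // pass props to a parquet/pqarrow file writer as usual
}
```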


⚠️ GitHub issue #41640 has been automatically assigned in GitHub to PR creator.

@mapleFU (Member) commented Jun 26, 2024

Feel free to ping me if the PR is ready

@joellubi joellubi marked this pull request as ready for review June 28, 2024 17:54
@joellubi joellubi requested a review from zeroshade June 28, 2024 17:54
@joellubi joellubi requested a review from mapleFU June 28, 2024 18:26
@joellubi (Member, Author) commented Jul 8, 2024

@mapleFU @zeroshade I pushed up some changes to the decoders which align them more closely with the current C++ implementation. I also added a new benchmark for batched decoding. All benchmarks are updated in the PR description.

Overall, the batched approach improves decoding performance slightly across the board. This is most likely because an intermediate buffer is no longer needed: batches can be decoded directly into the output buffer. The new benchmark demonstrates that there is not much performance difference between one-batch-per-page and many-batches-per-page decoding. There may be bigger differences for extremely small batch sizes, but I did my best to pick a realistic number. Memory usage is also lower with the batched approach, since we write directly into the output buffer and don't have to allocate pageSize bytes per column reader to decode everything at once.
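
To illustrate the batched idea (a sketch with made-up names, not the code in this PR): because each of the `width` streams in a page is `totalValues` bytes long, any sub-range of the page can be reassembled straight into the caller's output slice, with no page-sized intermediate buffer.

```go
// decodeBatch reassembles batchSize values starting at `offset` within the
// page directly into `out` (len(out) == batchSize*width). `streams` is the
// whole encoded page: width streams of totalValues bytes each.
func decodeBatch(streams []byte, width, totalValues, offset int, out []byte) {
	batchSize := len(out) / width
	for v := 0; v < batchSize; v++ {
		for b := 0; b < width; b++ {
			out[v*width+b] = streams[b*totalValues+offset+v]
		}
	}
}
```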

@zeroshade (Member) left a comment

@joellubi Sounds great! Glad to hear that it is overall better performing; the tests look good to me.

My final nitpicks! :)

Comment on lines 297 to 300
type ByteStreamSplitFloat32Decoder = ByteStreamSplitDecoder[float32]
type ByteStreamSplitFloat64Decoder = ByteStreamSplitDecoder[float64]
type ByteStreamSplitInt32Decoder = ByteStreamSplitDecoder[int32]
type ByteStreamSplitInt64Decoder = ByteStreamSplitDecoder[int64]
Reviewer (Member) commented:

Should we do the same approach for the encoders too?

Should probably also add godoc comments on these

@joellubi (Member, Author) replied:

Just added the godoc comments.

I did like how the generic decoders came out and looked at what it would take to do the same for the encoders. It's a little trickier with the encoders because they all embed their respective "Plain" encoders, and it's awkward to make this generic at the moment because the Plain encoders are not generic themselves. I think this all gets a lot simpler if/when the overall refactor of parquet to use generics is done, since then ByteStreamSplitEncoder[T] could just embed PlainEncoder[T] once it exists.
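
Purely as an illustration of the shape described above (the generic PlainEncoder[T] below is hypothetical and does not exist in the package today):

```go
// Hypothetical generic plain encoder; a stand-in for a future refactor.
type PlainEncoder[T float32 | float64 | int32 | int64] struct{}

// With that in place, the encoder could mirror the generic decoders by
// simply embedding the plain encoder for its element type.
type ByteStreamSplitEncoder[T float32 | float64 | int32 | int64] struct {
	PlainEncoder[T]
}
```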

@mapleFU (Member) commented Jul 9, 2024

ByteStreamSplit Part LGTM

> Overall, the batched approach improves performance slightly across the board for decoding. This is most likely because an intermediary buffer is no longer needed with this approach, and batches can be directly decoded into the output buffer. The new benchmark demonstrates that there's not much of a difference in performance between one-batch-per-page and many-batches-per-page decoding.

Nice to hear that

@zeroshade merged commit 89fd566 into apache:main on Jul 9, 2024. 26 checks passed.

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 89fd566.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
