GH-41640: [Go] Implement BYTE_STREAM_SPLIT Parquet Encoding #43066
Conversation
Feel free to ping me when the PR is ready.
@mapleFU @zeroshade I pushed up some changes to the decoders which align them more closely with the current C++ implementation. I also added a new benchmark for batched decoding. All benchmarks are updated in the PR description. Overall, the batched approach improves decoding performance slightly across the board, most likely because an intermediary buffer is no longer needed: batches can be decoded directly into the output buffer. The new benchmark shows that there is not much difference in performance between one-batch-per-page and many-batches-per-page decoding. There may be bigger differences for extremely small batch sizes, but I did my best to pick a realistic number. Memory usage is of course lower with the batched approach, since we write directly into the output buffer and don't have to allocate pageSize bytes per column reader to decode the whole page at once.
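To illustrate the idea (this is a minimal sketch, not the PR's actual code; the function name and the stride/offset bookkeeping are hypothetical), here is how a batch of byte-stream-split values can be decoded straight into the caller's output slice without a page-sized scratch buffer:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeByteStreamSplitBatch reassembles n float32 values whose bytes were
// split into 4 contiguous streams of length stride (the number of values in
// the page). offset is how many values of this page were already consumed,
// so repeated calls with small batches write straight into out with no
// intermediate buffer.
func decodeByteStreamSplitBatch(data []byte, stride, offset, n int, out []float32) {
	const width = 4 // bytes per float32
	for i := 0; i < n; i++ {
		var scratch [width]byte
		for b := 0; b < width; b++ {
			// Byte b of value (offset+i) lives in stream b at position offset+i.
			scratch[b] = data[b*stride+offset+i]
		}
		out[i] = math.Float32frombits(binary.LittleEndian.Uint32(scratch[:]))
	}
}

func main() {
	// Encode a tiny page of 3 values: stream 0 holds every value's byte 0, etc.
	vals := []float32{1.5, -2.25, 100}
	stride := len(vals)
	page := make([]byte, 4*stride)
	for i, v := range vals {
		bits := math.Float32bits(v)
		for b := 0; b < 4; b++ {
			page[b*stride+i] = byte(bits >> (8 * b))
		}
	}
	// Decode values 1..2 of the page as one batch, straight into out.
	out := make([]float32, 2)
	decodeByteStreamSplitBatch(page, stride, 1, 2, out)
	fmt.Println(out) // [-2.25 100]
}
```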
@joellubi Sounds great! Glad to hear that it is overall better performing; the tests look good to me.
My final nitpicks! :)
type ByteStreamSplitFloat32Decoder = ByteStreamSplitDecoder[float32]
type ByteStreamSplitFloat64Decoder = ByteStreamSplitDecoder[float64]
type ByteStreamSplitInt32Decoder = ByteStreamSplitDecoder[int32]
type ByteStreamSplitInt64Decoder = ByteStreamSplitDecoder[int64]
Should we do the same approach for the encoders too?
Should probably also add godoc comments on these
Just added the godoc comments.
I did like how the generic decoders came out and looked at what it would take to do the same for the encoders. It's a little trickier with the encoders because they all embed their respective "Plain" encoders, and it's awkward to make this generic at the moment because the Plain encoders are not generic themselves. I think this all gets a lot simpler if/when the overall refactor of parquet to use generics is done, since then ByteStreamSplitEncoder[T] could just embed PlainEncoder[T] once it exists.
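For context, a hypothetical sketch of the shape the encoders could take after such a generics refactor follows; PlainEncoder[T] and these encoder aliases do not exist in the current codebase and are assumptions for illustration only:

```go
package encoding

// PlainEncoder is a hypothetical generic plain encoder that buffers values of
// type T; today the Plain encoders are separate, non-generic types.
type PlainEncoder[T float32 | float64 | int32 | int64] struct {
	values []T
}

// Put appends values to the encoder's buffer.
func (e *PlainEncoder[T]) Put(in []T) { e.values = append(e.values, in...) }

// ByteStreamSplitEncoder could then simply embed the generic plain encoder and
// apply the byte-stream-split transform when flushing, mirroring how the
// generic ByteStreamSplitDecoder[T] works today.
type ByteStreamSplitEncoder[T float32 | float64 | int32 | int64] struct {
	PlainEncoder[T]
}

// Concrete aliases mirroring the decoder aliases quoted above.
type (
	ByteStreamSplitFloat32Encoder = ByteStreamSplitEncoder[float32]
	ByteStreamSplitFloat64Encoder = ByteStreamSplitEncoder[float64]
	ByteStreamSplitInt32Encoder   = ByteStreamSplitEncoder[int32]
	ByteStreamSplitInt64Encoder   = ByteStreamSplitEncoder[int64]
)
```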
ByteStreamSplit part LGTM.
Nice to hear that.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 89fd566. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
This encoding is defined by the Parquet spec but does not currently have a Go implementation.
What changes are included in this PR?
Implement the BYTE_STREAM_SPLIT encoder and decoder for the FLOAT, DOUBLE, INT32, and INT64 physical types.
Are these changes tested?
Yes. See unit tests, file read conformance tests, and benchmarks.
Benchmark results on my machine
Are there any user-facing changes?
A new ByteStreamSplit encoding option is available. Godoc has been updated to reflect this.
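A hedged usage sketch follows; it assumes the option is selected through the existing writer-properties API (parquet.NewWriterProperties, parquet.WithEncoding, and the parquet.Encodings.ByteStreamSplit member), and the module version in the import path may differ:

```go
package main

import (
	"github.com/apache/arrow/go/v17/parquet"
)

func main() {
	// Assumed API: WithEncoding sets the default column encoding, and
	// Encodings.ByteStreamSplit names the new encoding. Dictionary encoding
	// takes precedence when enabled, so disable it for columns that should
	// actually be written with BYTE_STREAM_SPLIT.
	props := parquet.NewWriterProperties(
		parquet.WithEncoding(parquet.Encodings.ByteStreamSplit),
		parquet.WithDictionaryDefault(false),
	)
	_ = props // pass these properties to the file / pqarrow writer as usual
}
```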