[Go][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 #41640
Comments
zeroshade pushed a commit that referenced this issue on Jul 9, 2024
### Rationale for this change

This encoding is defined by the [Parquet spec](https://github.com/apache/parquet-format/blob/master/Encodings.md#byte-stream-split-byte_stream_split--9) but does not currently have a Go implementation.

### What changes are included in this PR?

Implement BYTE_STREAM_SPLIT encoder/decoder for:
- FIXED_LEN_BYTE_ARRAY
- FLOAT
- DOUBLE
- INT32
- INT64

### Are these changes tested?

Yes. See unit tests, file read conformance tests, and benchmarks.

**Benchmark results on my machine**

```
➜ go git:(impl-pq-bytestreamsplit) go test ./parquet/internal/encoding -run=^$ -bench=BenchmarkByteStreamSplit -benchmem
goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding
BenchmarkByteStreamSplitEncodingInt32/len_1024-14 502117 2005 ns/op 2043.37 MB/s 5267 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_2048-14 328921 3718 ns/op 2203.54 MB/s 9879 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_4096-14 169642 7083 ns/op 2313.14 MB/s 18852 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_8192-14 82503 14094 ns/op 2324.99 MB/s 41425 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_16384-14 45006 26841 ns/op 2441.68 MB/s 74286 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_32768-14 23433 51233 ns/op 2558.33 MB/s 140093 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_65536-14 12019 99001 ns/op 2647.90 MB/s 271417 B/op 3 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_1024-14 996573 1199 ns/op 3417.00 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_2048-14 503200 2380 ns/op 3442.18 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_4096-14 252038 4748 ns/op 3450.90 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_8192-14 122419 9793 ns/op 3346.08 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_16384-14 63321 19040 ns/op 3442.00 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_32768-14 31051 38677 ns/op 3388.89 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_65536-14 15792 77931 ns/op 3363.80 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_1024-14 981043 1221 ns/op 3354.53 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_2048-14 492319 2424 ns/op 3379.34 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_4096-14 248062 4850 ns/op 3378.20 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_8192-14 123064 9903 ns/op 3308.87 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_16384-14 61845 19567 ns/op 3349.29 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_32768-14 30568 39456 ns/op 3321.96 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32Batched/len_65536-14 15172 78762 ns/op 3328.30 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_1024-14 319006 3690 ns/op 2220.13 MB/s 9880 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_2048-14 161006 7132 ns/op 2297.30 MB/s 18853 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_4096-14 85783 13925 ns/op 2353.12 MB/s 41421 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_8192-14 45015 26943 ns/op 2432.43 MB/s 74312 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_16384-14 20352 59259 ns/op 2211.84 MB/s 139940 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_32768-14 10000 111143 ns/op 2358.61 MB/s 271642 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_65536-14 5529 212652 ns/op 2465.47 MB/s 534805 B/op 3 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_1024-14 528987 2355 ns/op 3478.32 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_2048-14 262707 4701 ns/op 3485.08 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_4096-14 129212 9313 ns/op 3518.63 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_8192-14 53746 23315 ns/op 2810.90 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_16384-14 28782 41054 ns/op 3192.65 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_32768-14 14803 80157 ns/op 3270.39 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_65536-14 7484 164111 ns/op 3194.72 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14 291716 4107 ns/op 997.43 MB/s 5276 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14 148888 7975 ns/op 1027.18 MB/s 9914 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14 76587 15677 ns/op 1045.11 MB/s 18955 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14 39758 30277 ns/op 1082.26 MB/s 41752 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14 20306 59506 ns/op 1101.33 MB/s 74937 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14 10000 116043 ns/op 1129.52 MB/s 141290 B/op 3 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14 4770 236887 ns/op 1106.62 MB/s 277583 B/op 3 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14 601875 1723 ns/op 2376.70 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14 363206 3422 ns/op 2394.18 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14 173041 6906 ns/op 2372.45 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14 81810 14307 ns/op 2290.40 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14 40518 29101 ns/op 2252.04 MB/s 1 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14 21338 56678 ns/op 2312.58 MB/s 6 B/op 1 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14 10000 111433 ns/op 2352.49 MB/s 26 B/op 6 allocs/op
PASS
ok github.com/apache/arrow/go/v17/parquet/internal/encoding 69.109s
```

### Are there any user-facing changes?

New ByteStreamSplit encoding option available. Godoc updated to reflect this.

* GitHub Issue: #41640

Authored-by: Joel Lubinitsky <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
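
For readers unfamiliar with the encoding, the transformation the commit implements can be illustrated with a minimal sketch. This is not the code from the pull request or the Arrow Go API: the function names `byteStreamSplitEncode` and `byteStreamSplitDecode` are hypothetical, and the sketch assumes the input is a flat buffer of fixed-width values (4 bytes for INT32/FLOAT, 8 for INT64/DOUBLE, the declared type length for FIXED_LEN_BYTE_ARRAY). As the Parquet spec describes it, byte `j` of every value is gathered into stream `j`, and the streams are written back to back.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// byteStreamSplitEncode is a hypothetical helper (not the Arrow Go API).
// It treats data as n values of `width` bytes each and writes byte j of
// value i to position j*n+i, so the output holds `width` streams of n bytes.
func byteStreamSplitEncode(data []byte, width int) []byte {
	n := len(data) / width
	out := make([]byte, n*width)
	for i := 0; i < n; i++ {
		for j := 0; j < width; j++ {
			out[j*n+i] = data[i*width+j]
		}
	}
	return out
}

// byteStreamSplitDecode is the inverse: it reads byte j of value i from
// stream j and reassembles the original fixed-width values.
func byteStreamSplitDecode(data []byte, width int) []byte {
	n := len(data) / width
	out := make([]byte, n*width)
	for i := 0; i < n; i++ {
		for j := 0; j < width; j++ {
			out[i*width+j] = data[j*n+i]
		}
	}
	return out
}

func main() {
	// Three INT32 values serialized little-endian, as Parquet PLAIN would.
	values := []int32{1, 256, 65536}
	raw := make([]byte, 4*len(values))
	for i, v := range values {
		binary.LittleEndian.PutUint32(raw[i*4:], uint32(v))
	}

	enc := byteStreamSplitEncode(raw, 4)
	fmt.Println(enc) // [1 0 0 0 1 0 0 0 1 0 0 0]

	dec := byteStreamSplitDecode(enc, 4)
	fmt.Println(bytes.Equal(dec, raw)) // true
}
```

The encoding does not shrink the data by itself; grouping bytes of equal significance tends to produce longer runs of similar bytes, which a subsequent compression codec can exploit.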
Issue resolved by pull request 43066
Describe the enhancement requested
Implement the format addition described in https://issues.apache.org/jira/browse/PARQUET-2414. An equivalent C++ implementation has already been done in #40094.
Component(s)
Go, Parquet