Parquet V2: AttributeError: 'NoneType' object has no attribute 'num_values' #493
How did you produce the data file?
Sorry, I did not find any is_compressed flag in the public methods.
But it returned this error:
According to the apache parquet-format project, the first fix is easy, but I don't know how to solve that encoding problem.
Yes, it looks like there are a few other differences, particularly that the encodings of the definition/repetition arrays are fixed. As for RLE_DICTIONARY, both are supported, but not together, so that also requires a little code in fastparquet/core.py:

```diff
--- a/fastparquet/core.py
+++ b/fastparquet/core.py
@@ -117,9 +117,11 @@ def read_data_page(f, helper, header, metadata, skip_nulls=False,
                               int(daph.num_values - num_nulls),
                               width=width)
     elif daph.encoding in [parquet_thrift.Encoding.PLAIN_DICTIONARY,
-                           parquet_thrift.Encoding.RLE]:
+                           parquet_thrift.Encoding.RLE,
+                           parquet_thrift.Encoding.RLE_DICTIONARY]:
         # bit_width is stored as single byte.
-        if daph.encoding == parquet_thrift.Encoding.RLE:
+        if daph.encoding in [parquet_thrift.Encoding.RLE,
+                             parquet_thrift.Encoding.RLE_DICTIONARY]:
             bit_width = helper.schema_element(
                 metadata.path_in_schema).type_length
```

(assuming RLE_DICTIONARY means the RLE hybrid).
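For reference, a simplified sketch of the RLE/bit-packed hybrid that RLE_DICTIONARY refers to, following the parquet-format description rather than fastparquet's implementation; it assumes the single bit-width byte has already been consumed, so `data` starts at the first run header:

```python
import io

def read_uvarint(stream):
    """Read an unsigned LEB128 varint (the run header format)."""
    result = shift = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7

def decode_rle_bitpacked_hybrid(data, bit_width, num_values):
    """Decode `num_values` dictionary indices from the hybrid encoding."""
    stream = io.BytesIO(data)
    mask = (1 << bit_width) - 1
    out = []
    while len(out) < num_values:
        header = read_uvarint(stream)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values,
            # packed little-endian, least-significant bits first.
            count = (header >> 1) * 8
            bits = int.from_bytes(stream.read(count * bit_width // 8), "little")
            out.extend((bits >> (i * bit_width)) & mask for i in range(count))
        else:
            # RLE run: one fixed-width value repeated (header >> 1) times.
            value = int.from_bytes(stream.read((bit_width + 7) // 8), "little")
            out.extend([value] * (header >> 1))
    return out[:num_values]
```

PLAIN_DICTIONARY pages carry the same hybrid after the bit-width byte, which is why treating RLE_DICTIONARY like the existing branches is plausible.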
I see DELTA_LENGTH_BYTE_ARRAY is also "preferred"; I wonder whether it ends up being used in practice. Again, the code is essentially there, but it would need some delving to make it work.
Obviously this will take a little bit of work... (cc @jpivarski - seems ideal for strings-in-awkward)
@jpivarski could I test anything with that Parquet V2 file for you?
@bgbraga Even reading back in this thread, I don't know what the question is. What would you like?
@jpivarski: I think this refers to my comment about the new-style string encoding available in parquet (actually it was always in the spec, but it has now appeared in the wild), where the lengths are stored first, and then all the strings. This is a much nicer encoding from awkward's point of view (and arrow's too).
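A minimal sketch of that layout, assuming the length block has already been decoded from its DELTA_BINARY_PACKED form; the helper name and signature are illustrative, not part of fastparquet:

```python
import numpy as np

def split_by_lengths(lengths, payload):
    """Split a DELTA_LENGTH_BYTE_ARRAY-style payload, where all string
    bytes are concatenated after the (already decoded) length block."""
    offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])  # offsets bracket each string
    return [payload[offsets[i]:offsets[i + 1]].decode("utf-8")
            for i in range(len(lengths))]

# split_by_lengths([1, 5, 2], b"Abunchof") -> ['A', 'bunch', 'of']
```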
@bgbraga @martindurant Okay, I can see a possible demonstration. Awkward's string encoding is like this:

```python
>>> import numpy as np
>>> import awkward1 as ak
>>> array = ak.Array(["A", "bunch", "of", "strings", "like", "this."])
>>> array
<Array ['A', 'bunch', ... 'like', 'this.'] type='6 * string'>
>>> array.layout
<ListOffsetArray64>
    <parameters>
        <param key="__array__">"string"</param>
    </parameters>
    <offsets><Index64 i="[0 1 6 8 15 19 24]" offset="0" length="7"/></offsets>
    <content><NumpyArray format="B" shape="24"
        data="0x 4162756e 63686f66 73747269 6e67736c 696b6574 6869732e">
        <parameters>
            <param key="__array__">"char"</param>
        </parameters>
    </NumpyArray></content>
</ListOffsetArray64>
>>> np.asarray(array.layout.offsets)
array([ 0,  1,  6,  8, 15, 19, 24], dtype=int64)
>>> np.asarray(array.layout.content)
array([ 65,  98, 117, 110,  99, 104, 111, 102, 115, 116, 114, 105, 110,
       103, 115, 108, 105, 107, 101, 116, 104, 105, 115,  46], dtype=uint8)
>>> np.asarray(array.layout.content).tostring()
b'Abunchofstringslikethis.'
```

So if you have an array of lengths, the offsets are just its prefix sum:

```python
>>> lengths = np.array([1, 5, 2, 7, 4, 5])
>>> prefix_sum = np.empty(len(lengths) + 1, dtype=np.int64)
>>> prefix_sum[0] = 0
>>> np.cumsum(lengths, out=prefix_sum[1:])
>>> characters = np.frombuffer(b"Abunchofstringslikethis.", dtype=np.int8)
>>> content = ak.layout.NumpyArray(characters, parameters={"__array__": "char"})
>>> offsets = ak.layout.Index64(prefix_sum)
>>> listarray = ak.layout.ListOffsetArray64(offsets, content,
...                                         parameters={"__array__": "string"})
>>> array = ak.Array(listarray)
>>> array
<Array ['A', 'bunch', ... 'like', 'this.'] type='6 * string'>
>>> ak.to_list(array)
['A', 'bunch', 'of', 'strings', 'like', 'this.']
```

If you want to take advantage of Parquet's dictionary encoding, it can be done like this:

```python
>>> index = ak.layout.Index64(np.tile([0, 1, 2, 3, 4, 5], 10))  # sample dictionary
>>> indexedarray = ak.layout.IndexedArray64(index, listarray)
>>> array2 = ak.Array(indexedarray)
>>> array2
<Array ['A', 'bunch', ... 'like', 'this.'] type='60 * string'>
>>> ak.to_list(array2)
['A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.']
```

A demo could be worked up pretty quickly, though the more useful thing, a complete conversion of Parquet files into and out of Awkward Arrays, would be more work. I'm not sure if/who/when that might be attempted.
Any progress? |
Not by me, sorry.
Code:

Error:

Reason:
That method defines daph as header.data_page_header, but in this case it should be header.data_page_header_v2. When daph.num_values runs, it crashes.
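A minimal sketch of that fix, i.e. picking whichever page header is actually present; the attribute names follow the parquet-format Thrift definitions, but the helper itself is hypothetical rather than fastparquet's actual patch:

```python
def data_page_header_of(header):
    """Return the V1 or V2 data page header, whichever the Thrift
    PageHeader carries, so that `daph` is never None."""
    if header.data_page_header is not None:
        return header.data_page_header        # V1 data page
    if header.data_page_header_v2 is not None:
        return header.data_page_header_v2     # V2 data page
    raise ValueError("PageHeader carries no data page header")
```

Both header variants expose num_values, so downstream code like int(daph.num_values - num_nulls) could stay unchanged.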