
Parquet V2: AttributeError: 'NoneType' object has no attribute 'num_values' #493

Open

bgbraga opened this issue Mar 29, 2020 · 12 comments

@bgbraga commented Mar 29, 2020

Code:

    from fastparquet import ParquetFile

    pf = ParquetFile('/path/file.parquet')
    df = pf.to_pandas()

Error:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 112, in read_data_page
    nval = daph.num_values - num_nulls
AttributeError: 'NoneType' object has no attribute 'num_values'

Reason:
read_data_page defines daph as header.data_page_header, but for a V2 data page it should fall back to header.data_page_header_v2.

def read_data_page(f, helper, header, metadata, skip_nulls=False,
                   selfmade=False):
    """Read a data page: definitions, repetitions, values (in order)

    Only values are guaranteed to exist, e.g., for a top-level, required
    field.
    """
    daph = header.data_page_header

When daph.num_values is accessed, it crashes because daph is None.

@martindurant (Member)

How did you produce the data file?
You are very welcome to submit a PR to look for the V2 header first and then fall back to the current behaviour. It seems that in the v2 case, you need to heed the is_compressed flag for the definition/repetition levels.
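For reference, in a V2 data page the repetition/definition levels are written first and are never compressed; only the values section honours is_compressed, and the level byte lengths are carried in the header. A minimal sketch of splitting such a page, assuming the DataPageHeaderV2 field names from parquet.thrift and a hypothetical decompress helper:

    def split_v2_page(raw, daph2, decompress, codec):
        """Sketch only (not fastparquet's code) of handling a V2 data page.

        daph2 is a DataPageHeaderV2; decompress(codec, data) is a
        hypothetical helper standing in for the real codec call.
        """
        rl = daph2.repetition_levels_byte_length
        dl = daph2.definition_levels_byte_length
        rep_bytes = raw[:rl]          # repetition levels, never compressed
        def_bytes = raw[rl:rl + dl]   # definition levels, never compressed
        values = raw[rl + dl:]        # only this part honours is_compressed
        if daph2.is_compressed:
            values = decompress(codec, values)
        return rep_bytes, def_bytes, values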

@bgbraga (Author) commented Mar 31, 2020

> How did you produce the data file?

It is generated by AWS CUR (Cost and Usage Report), a service that lets customers export all their AWS cost data for import into a database and custom reporting.

Sorry, I did not find any is_compressed flag in the public methods.
But I tried a simple fix to test it. It gets past that point, and daph.num_values works as well ...

    daph = header.data_page_header
    if daph is None:
        daph = header.data_page_header_v2

But it then failed with this error:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 142, in read_data_page
    raise NotImplementedError('Encoding %s' % daph.encoding)
NotImplementedError: Encoding 8

According to the Apache parquet-format project
https://github.com/apache/parquet-format/blob/master/Encodings.md
it seems Encoding 8 is RLE_DICTIONARY:
"Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files."

The first fix is easy, but I don't know how to solve that encoding problem.
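For reference, one way to check up front which encodings the file declares is to look at the column-chunk metadata fastparquet already parses (attribute names follow parquet.thrift); a quick diagnostic sketch, not part of the fix:

    from fastparquet import ParquetFile

    pf = ParquetFile('/path/file.parquet')
    # each column chunk lists the encodings used in its pages, as integers
    # matching the Encoding enum in parquet.thrift (e.g. 8 = RLE_DICTIONARY)
    for rg in pf.row_groups:
        for chunk in rg.columns:
            md = chunk.meta_data
            print(md.path_in_schema, md.encodings)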

@martindurant (Member)

Yes, it looks like there are a few other differences, particularly that the encoding of the definition/repetition arrays is fixed.
https://github.com/dask/fastparquet/blob/master/fastparquet/parquet.thrift#L427

As for RLE_DICTIONARY, both are supported, but not together, so that also requires a little code in read_data_page. It looks like the data part is actually identical to RLE as far as the data page is concerned. Perhaps the following would work

--- a/fastparquet/core.py
+++ b/fastparquet/core.py
@@ -117,9 +117,11 @@ def read_data_page(f, helper, header, metadata, skip_nulls=False,
                                      int(daph.num_values - num_nulls),
                                      width=width)
     elif daph.encoding in [parquet_thrift.Encoding.PLAIN_DICTIONARY,
-                           parquet_thrift.Encoding.RLE]:
+                           parquet_thrift.Encoding.RLE,
+                           parquet_thrift.Encoding.RLE_DICTIONARY]:
         # bit_width is stored as single byte.
-        if daph.encoding == parquet_thrift.Encoding.RLE:
+        if daph.encoding in [parquet_thrift.Encoding.RLE,
+                             parquet_thrift.Encoding.RLE_DICTIONARY]:
             bit_width = helper.schema_element(
                     metadata.path_in_schema).type_length

(assuming RLE_DICTIONARY means RLE hybrid).
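For what it's worth, Encodings.md describes dictionary-encoded data pages (PLAIN_DICTIONARY and RLE_DICTIONARY alike) as a single bit-width byte followed by an RLE/bit-packed hybrid run of dictionary indices. A minimal illustration of that layout; read_hybrid is a hypothetical stand-in for the hybrid decoder, not a fastparquet API:

    import io
    import numpy as np

    def read_dict_indices(page_bytes, num_values, read_hybrid):
        # illustration of the layout in Encodings.md, not fastparquet code;
        # read_hybrid(buf, bit_width, count) is a hypothetical hybrid decoder
        buf = io.BytesIO(page_bytes)
        bit_width = buf.read(1)[0]   # single byte preceding the index data
        return np.asarray(read_hybrid(buf, bit_width, num_values))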

@martindurant (Member)

I see DELTA_LENGTH_BYTE_ARRAY is also "preferred"; I wonder if it is finally being used in practice. Again, the code is essentially there, but it would need some delving to make it work.
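For reference, DELTA_LENGTH_BYTE_ARRAY stores all the value lengths first (DELTA_BINARY_PACKED) and then the concatenated bytes, so once the lengths are decoded the split is just a cumulative sum. A sketch, not fastparquet's implementation:

    import numpy as np

    def split_byte_arrays(lengths, payload):
        # lengths: already decoded from the DELTA_BINARY_PACKED prefix
        # payload: the concatenated bytes that follow the lengths
        offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
        np.cumsum(lengths, out=offsets[1:])
        return [payload[offsets[i]:offsets[i + 1]] for i in range(len(lengths))]

    # split_byte_arrays([1, 5, 2], b"Abunchof") -> [b'A', b'bunch', b'of']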

@bgbraga (Author) commented Apr 1, 2020

OK. I changed that, and also one more place that was not referencing data_page_header_v2.

If I run it as is, it breaks in a .pyx file that I can't debug:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/converted_types.py", line 105, in convert
    return array_decode_utf8(data)
  File "fastparquet/speedups.pyx", line 84, in fastparquet.speedups.array_decode_utf8
TypeError: expected array of bytes

"Expected array of bytes" is a confusing message, since the variable data is in fact an array of bytes.

But I see in def read_col another check hard-coded to Encoding.PLAIN_DICTIONARY:

        d = daph.encoding == parquet_thrift.Encoding.PLAIN_DICTIONARY
        if use_cat and not d:

I decided to change that encoding check from PLAIN_DICTIONARY alone to PLAIN_DICTIONARY, RLE, or RLE_DICTIONARY:

        d = daph.encoding in [parquet_thrift.Encoding.PLAIN_DICTIONARY,
                              parquet_thrift.Encoding.RLE,
                              parquet_thrift.Encoding.RLE_DICTIONARY]
        if use_cat and not d:

Result: a new error referring to Encoding 7:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 144, in read_data_page
    raise NotImplementedError('Encoding %s' % daph.encoding)

NotImplementedError: Encoding 7

One more encoding...
So I decided to apply the same logic and include DELTA_BYTE_ARRAY in the header encoding check, like this:

    daph.encoding in [parquet_thrift.Encoding.PLAIN_DICTIONARY,
                      parquet_thrift.Encoding.RLE,
                      parquet_thrift.Encoding.RLE_DICTIONARY,
                      parquet_thrift.Encoding.DELTA_BYTE_ARRAY]

Results:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 140, in read_data_page
    io_obj, bit_width, io_obj.len-io_obj.loc, o=values)
ValueError: cannot assign slice from input of different size

@martindurant (Member)

Obviously this will take a little bit of work...
Glad to see encoding 7 appear; if it really is that, it would allow a lot of speedup in theory.

(cc @jpivarski - seems ideal for strings-in-awkward)

@bgbraga (Author) commented Apr 15, 2020

@jpivarski could I test anything with that Parquet V2 file for you?

@jpivarski

@bgbraga Even reading back in this thread, I don't know what the question is. What would you like?

@martindurant (Member)

@jpivarski : I think this refers to my comment about the new-style strings encoding available in parquet (actually it was always in the spec, but has now appeared in the wild), where the lengths are stored first and then all the strings. This is a much nicer encoding from awkward's point of view (and arrow's too).
So, when you get to the point of caring about parquet loading and benchmarks, you would probably prefer to deal with string columns using this encoding.

@jpivarski

@bgbraga @martindurant Okay, I can see a possible demonstration. Awkward's string encoding is like this:

>>> import numpy as np
>>> import awkward1 as ak
>>> array = ak.Array(["A", "bunch", "of", "strings", "like", "this."])
>>> array
<Array ['A', 'bunch', ... 'like', 'this.'] type='6 * string'>
>>> array.layout
<ListOffsetArray64>
    <parameters>
        <param key="__array__">"string"</param>
    </parameters>
    <offsets><Index64 i="[0 1 6 8 15 19 24]" offset="0" length="7"/></offsets>
    <content><NumpyArray format="B" shape="24"
              data="0x 4162756e 63686f66 73747269 6e67736c 696b6574 6869732e">
        <parameters>
            <param key="__array__">"char"</param>
        </parameters>
    </NumpyArray></content>
</ListOffsetArray64>
>>> np.asarray(array.layout.offsets)
array([ 0,  1,  6,  8, 15, 19, 24], dtype=int64)
>>> np.asarray(array.layout.content)
array([ 65,  98, 117, 110,  99, 104, 111, 102, 115, 116, 114, 105, 110,
       103, 115, 108, 105, 107, 101, 116, 104, 105, 115,  46], dtype=uint8)
>>> np.asarray(array.layout.content).tostring()
b'Abunchofstringslikethis.'

So if you have an array of lengths (lengths) and an array of contiguous characters (characters), you could build it in the other direction:

>>> lengths = np.array([1, 5, 2, 7, 4, 5])
>>> prefix_sum = np.empty(len(lengths) + 1, dtype=np.int64)
>>> prefix_sum[0] = 0
>>> np.cumsum(lengths, out=prefix_sum[1:])
>>> characters = np.frombuffer(b"Abunchofstringslikethis.", dtype=np.int8)
>>> content = ak.layout.NumpyArray(characters, parameters={"__array__": "char"})
>>> offsets = ak.layout.Index64(prefix_sum)
>>> listarray = ak.layout.ListOffsetArray64(offsets, content,
...                                         parameters={"__array__": "string"})
>>> array = ak.Array(listarray)
>>> array
<Array ['A', 'bunch', ... 'like', 'this.'] type='6 * string'>
>>> ak.to_list(array)
['A', 'bunch', 'of', 'strings', 'like', 'this.']

If you want to take advantage of Parquet's dictionary encoding, it can be done like this:

>>> index = ak.layout.Index64(np.tile([0, 1, 2, 3, 4, 5], 10))   # sample dictionary
>>> indexedarray = ak.layout.IndexedArray64(index, listarray)
>>> array2 = ak.Array(indexedarray)
>>> array2
<Array ['A', 'bunch', ... 'like', 'this.'] type='60 * string'>
>>> ak.to_list(array2)
['A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.']

A demo could be worked up pretty quickly, though the more useful thing, a complete conversion of Parquet files into and out of Awkward Arrays, would be more work. I'm not sure if/who/when that might be attempted.

@deephbz commented Jul 10, 2020

Any progress?

@martindurant (Member)

Not by me, sorry.
