
Parquet V2: AttributeError: 'NoneType' object has no attribute 'num_values' #493

Open

bgbraga opened this issue Mar 29, 2020 · 12 comments

@bgbraga commented Mar 29, 2020

Code:

    from fastparquet import ParquetFile

    pf = ParquetFile('/path/file.parquet')
    df = pf.to_pandas()

Error:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 112, in read_data_page
    nval = daph.num_values - num_nulls
AttributeError: 'NoneType' object has no attribute 'num_values'

Reason:
read_data_page defines daph as header.data_page_header, but for a V2 data page it should fall back to header.data_page_header_v2.

def read_data_page(f, helper, header, metadata, skip_nulls=False,
                   selfmade=False):
    """Read a data page: definitions, repetitions, values (in order)

    Only values are guaranteed to exist, e.g., for a top-level, required
    field.
    """
    daph = header.data_page_header

When daph.num_values is accessed, it crashes because daph is None.

@martindurant (Member)

How did you produce the data file?
You are very welcome to submit a PR to look for the V2 header first and then fall back to the current behaviour. It seems that in the v2 case, you need to heed the is_compressed flag for the definition/repetition levels.
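For reference, in a V2 data page the repetition/definition levels are written first and are never compressed; only the values section honours is_compressed, and the level byte lengths are carried in the header. A minimal sketch of splitting such a page, assuming the DataPageHeaderV2 field names from parquet.thrift and a hypothetical decompress helper:

    def split_v2_page(raw, daph2, decompress, codec):
        """Sketch only (not fastparquet's code) of handling a V2 data page.

        daph2 is a DataPageHeaderV2; decompress(codec, data) is a
        hypothetical helper standing in for the real codec call.
        """
        rl = daph2.repetition_levels_byte_length
        dl = daph2.definition_levels_byte_length
        rep_bytes = raw[:rl]          # repetition levels, never compressed
        def_bytes = raw[rl:rl + dl]   # definition levels, never compressed
        values = raw[rl + dl:]        # only this part honours is_compressed
        if daph2.is_compressed:
            values = decompress(codec, values)
        return rep_bytes, def_bytes, values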

@bgbraga (Author) commented Mar 31, 2020

> How did you produce the data file?

It is generated by AWS CUR (Cost and Usage Report), a service that lets customers export all their AWS cost data for import into a database and custom reporting.

Sorry, I did not find any is_compressed flag in the public methods.
But I tried a simple fix to test it. It gets past that point, and daph.num_values works as well ...

    daph = header.data_page_header
    if daph is None:
        daph = header.data_page_header_v2

But it then failed with this error:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 142, in read_data_page
    raise NotImplementedError('Encoding %s' % daph.encoding)
NotImplementedError: Encoding 8

According to the Apache parquet-format project
https://github.com/apache/parquet-format/blob/master/Encodings.md
it seems Encoding 8 is RLE_DICTIONARY:
"Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files."

The first fix is easy, but I don't know how to solve that encoding problem.
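For reference, one way to check up front which encodings the file declares is to look at the column-chunk metadata fastparquet already parses (attribute names follow parquet.thrift); a quick diagnostic sketch, not part of the fix:

    from fastparquet import ParquetFile

    pf = ParquetFile('/path/file.parquet')
    # each column chunk lists the encodings used in its pages, as integers
    # matching the Encoding enum in parquet.thrift (e.g. 8 = RLE_DICTIONARY)
    for rg in pf.row_groups:
        for chunk in rg.columns:
            md = chunk.meta_data
            print(md.path_in_schema, md.encodings)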

@martindurant (Member)

Yes, it looks like there are a few other differences, particularly that the encoding of the definition/repetition arrays is fixed.
https://github.com/dask/fastparquet/blob/master/fastparquet/parquet.thrift#L427

As for RLE_DICTIONARY, both are supported, but not together, so that also requires a little code in read_data_page. It looks like the data part is actually identical to RLE as far as the data page is concerned. Perhaps the following would work

--- a/fastparquet/core.py
+++ b/fastparquet/core.py
@@ -117,9 +117,11 @@ def read_data_page(f, helper, header, metadata, skip_nulls=False,
                                      int(daph.num_values - num_nulls),
                                      width=width)
     elif daph.encoding in [parquet_thrift.Encoding.PLAIN_DICTIONARY,
-                           parquet_thrift.Encoding.RLE]:
+                           parquet_thrift.Encoding.RLE,
+                           parquet_thrift.Encoding.RLE_DICTIONARY]:
         # bit_width is stored as single byte.
-        if daph.encoding == parquet_thrift.Encoding.RLE:
+        if daph.encoding in [parquet_thrift.Encoding.RLE,
+                             parquet_thrift.Encoding.RLE_DICTIONARY]:
             bit_width = helper.schema_element(
                     metadata.path_in_schema).type_length

(assuming RLE_DICTIONARY means RLE hybrid).
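For what it's worth, Encodings.md describes dictionary-encoded data pages (PLAIN_DICTIONARY and RLE_DICTIONARY alike) as a single bit-width byte followed by an RLE/bit-packed hybrid run of dictionary indices. A minimal illustration of that layout; read_hybrid is a hypothetical stand-in for the hybrid decoder, not a fastparquet API:

    import io
    import numpy as np

    def read_dict_indices(page_bytes, num_values, read_hybrid):
        # illustration of the layout in Encodings.md, not fastparquet code;
        # read_hybrid(buf, bit_width, count) is a hypothetical hybrid decoder
        buf = io.BytesIO(page_bytes)
        bit_width = buf.read(1)[0]   # single byte preceding the index data
        return np.asarray(read_hybrid(buf, bit_width, num_values))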

@martindurant (Member)

I see DELTA_LENGTH_BYTE_ARRAY is also "preferred"; I wonder if it is finally being used in practice. Again, the code is essentially there, but it would need some delving to make it work.
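For reference, DELTA_LENGTH_BYTE_ARRAY stores all the value lengths first (DELTA_BINARY_PACKED) and then the concatenated bytes, so once the lengths are decoded the split is just a cumulative sum. A sketch, not fastparquet's implementation:

    import numpy as np

    def split_byte_arrays(lengths, payload):
        # lengths: already decoded from the DELTA_BINARY_PACKED prefix
        # payload: the concatenated bytes that follow the lengths
        offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
        np.cumsum(lengths, out=offsets[1:])
        return [payload[offsets[i]:offsets[i + 1]] for i in range(len(lengths))]

    # split_byte_arrays([1, 5, 2], b"Abunchof") -> [b'A', b'bunch', b'of']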

@bgbraga (Author) commented Apr 1, 2020

OK. I changed that, and also one more place that was not referencing data_page_header_v2.

If I run it as is, it breaks in a .pyx file that I can't debug:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/converted_types.py", line 105, in convert
    return array_decode_utf8(data)
  File "fastparquet/speedups.pyx", line 84, in fastparquet.speedups.array_decode_utf8
TypeError: expected array of bytes

"Expected array of bytes" is a confusing message, since the variable data is in fact an array of bytes.

But I see in def read_col another check hard-coded to Encoding.PLAIN_DICTIONARY:

        d = daph.encoding == parquet_thrift.Encoding.PLAIN_DICTIONARY
        if use_cat and not d:

I decided to change that encoding check from PLAIN_DICTIONARY alone to PLAIN_DICTIONARY, RLE, or RLE_DICTIONARY:

        d = daph.encoding in [parquet_thrift.Encoding.PLAIN_DICTIONARY,
                              parquet_thrift.Encoding.RLE,
                              parquet_thrift.Encoding.RLE_DICTIONARY]
        if use_cat and not d:

Result: a new error referring to Encoding 7:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 144, in read_data_page
    raise NotImplementedError('Encoding %s' % daph.encoding)

NotImplementedError: Encoding 7

One more encoding...
So I decided to apply the same logic and include DELTA_BYTE_ARRAY in the header encoding check, like this:

    daph.encoding in [parquet_thrift.Encoding.PLAIN_DICTIONARY,
                      parquet_thrift.Encoding.RLE,
                      parquet_thrift.Encoding.RLE_DICTIONARY,
                      parquet_thrift.Encoding.DELTA_BYTE_ARRAY]

Results:

  File "/home/.../venv/lib64/python3.7/site-packages/fastparquet/core.py", line 140, in read_data_page
    io_obj, bit_width, io_obj.len-io_obj.loc, o=values)
ValueError: cannot assign slice from input of different size

@martindurant (Member)

Obviously this will take a little bit of work...
Glad to see encoding 7 appear; if it really is that, it would allow a lot of speedup in theory.

(cc @jpivarski - seems ideal for strings-in-awkward)

@bgbraga (Author) commented Apr 15, 2020

@jpivarski could I test anything with that Parquet V2 file for you?

@jpivarski

@bgbraga Even reading back in this thread, I don't know what the question is. What would you like?

@martindurant (Member)

@jpivarski : I think this refers to my comment about the new-style strings encoding available in parquet (actually it was always in the spec, but has now appeared in the wild), where the lengths are stored first and then all the strings. This is a much nicer encoding from awkward's point of view (and arrow's too).
So, when you get to the point of caring about parquet loading and benchmarks, you would probably prefer to deal with string columns using this encoding.

@jpivarski

@bgbraga @martindurant Okay, I can see a possible demonstration. Awkward's string encoding is like this:

>>> import numpy as np
>>> import awkward1 as ak
>>> array = ak.Array(["A", "bunch", "of", "strings", "like", "this."])
>>> array
<Array ['A', 'bunch', ... 'like', 'this.'] type='6 * string'>
>>> array.layout
<ListOffsetArray64>
    <parameters>
        <param key="__array__">"string"</param>
    </parameters>
    <offsets><Index64 i="[0 1 6 8 15 19 24]" offset="0" length="7"/></offsets>
    <content><NumpyArray format="B" shape="24"
              data="0x 4162756e 63686f66 73747269 6e67736c 696b6574 6869732e">
        <parameters>
            <param key="__array__">"char"</param>
        </parameters>
    </NumpyArray></content>
</ListOffsetArray64>
>>> np.asarray(array.layout.offsets)
array([ 0,  1,  6,  8, 15, 19, 24], dtype=int64)
>>> np.asarray(array.layout.content)
array([ 65,  98, 117, 110,  99, 104, 111, 102, 115, 116, 114, 105, 110,
       103, 115, 108, 105, 107, 101, 116, 104, 105, 115,  46], dtype=uint8)
>>> np.asarray(array.layout.content).tostring()
b'Abunchofstringslikethis.'

So if you have an array of lengths (lengths) and an array of contiguous characters (characters), you could build it in the other direction:

>>> lengths = np.array([1, 5, 2, 7, 4, 5])
>>> prefix_sum = np.empty(len(lengths) + 1, dtype=np.int64)
>>> prefix_sum[0] = 0
>>> np.cumsum(lengths, out=prefix_sum[1:])
>>> characters = np.frombuffer(b"Abunchofstringslikethis.", dtype=np.int8)
>>> content = ak.layout.NumpyArray(characters, parameters={"__array__": "char"})
>>> offsets = ak.layout.Index64(prefix_sum)
>>> listarray = ak.layout.ListOffsetArray64(offsets, content,
...                                         parameters={"__array__": "string"})
>>> array = ak.Array(listarray)
>>> array
<Array ['A', 'bunch', ... 'like', 'this.'] type='6 * string'>
>>> ak.to_list(array)
['A', 'bunch', 'of', 'strings', 'like', 'this.']

If you want to take advantage of Parquet's dictionary encoding, it can be done like this:

>>> index = ak.layout.Index64(np.tile([0, 1, 2, 3, 4, 5], 10))   # sample dictionary
>>> indexedarray = ak.layout.IndexedArray64(index, listarray)
>>> array2 = ak.Array(indexedarray)
>>> array2
<Array ['A', 'bunch', ... 'like', 'this.'] type='60 * string'>
>>> ak.to_list(array2)
['A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.',
 'A', 'bunch', 'of', 'strings', 'like', 'this.']

A demo could be worked up pretty quickly, though the more useful thing, a complete conversion of Parquet files into and out of Awkward Arrays, would be more work. I'm not sure if/who/when that might be attempted.

@deephbz commented Jul 10, 2020

Any progress?

@martindurant (Member)

Not by me, sorry.
