I am currently working on the implementation of the dataframe interchange protocol for PyArrow. After testing the current PyArrow implementation for producing a `__dataframe__` object against the pandas implementation for consuming, I have noticed that columns that use bit/bytemask null representation, but do not have missing values, raise an error. The reason is that Apache Arrow does not create a mask buffer when there are no missing values present. Therefore the result of calling `.get_buffers()["validity"]` on the PyArrow `__dataframe__` object without missing values is `None`, which is currently not handled by the protocol specification. See: https://github.com/pandas-dev/pandas/blob/5c66e65d7b9fef47ccb585ce2fd0b3ea18dc82ea/pandas/core/interchange/from_dataframe.py#L502

For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemask null representation to return `None` instead of a buffer.
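A minimal consumer-side sketch of the proposed behaviour, assuming the protocol's `Column` API (`size()`, `describe_null`, `get_buffers()`) and a byte-mask null representation; the function name is hypothetical, not part of pandas or pyarrow:

```python
import ctypes
import numpy as np

def bytemask_is_valid(col):
    """Boolean is-valid array for a byte-masked interchange column,
    treating a missing validity buffer as "no nulls anywhere"."""
    validity = col.get_buffers()["validity"]
    n = col.size()
    if validity is None:
        # No mask buffer was allocated (e.g. pyarrow with null_count == 0):
        # every element is valid.
        return np.ones(n, dtype=bool)
    buf, _dtype = validity
    # Reinterpret the raw buffer as one uint8 per element.
    raw = (ctypes.c_uint8 * buf.bufsize).from_address(buf.ptr)
    mask = np.frombuffer(raw, dtype=np.uint8)[:n]
    _kind, null_value = col.describe_null  # mask value that marks nulls
    return mask != null_value
```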
> For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemask null representation to return `None` instead of a buffer.
If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? It may even be incorrect to describe an interchange column as having a bit/byte mask when it doesn't have one.
For onlookers, see the relevant docs for what `buf, dtype = Column.get_buffers()["validity"]` should currently contain.
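Sketching that contract as a type (mirroring the `ColumnBuffers` TypedDict in pandas' `dataframe_protocol.py`; the `Buffer`/`Dtype` aliases here are illustrative stand-ins, not importable names):

```python
from typing import Optional, Tuple, TypedDict

Buffer = object  # protocol Buffer: exposes .ptr, .bufsize, ...
Dtype = Tuple[int, int, str, str]  # (kind, bit width, format string, endianness)

class ColumnBuffers(TypedDict):
    # The data buffer itself, always present.
    data: Tuple[Buffer, Dtype]
    # Mask buffer; None if the null representation is not a bit/byte mask.
    # The question in this issue: may it also be None when the column *is*
    # bit/byte-masked but no mask buffer was allocated (null_count == 0)?
    validity: Optional[Tuple[Buffer, Dtype]]
    # Offsets buffer; only present for variable-size binary/string columns.
    offsets: Optional[Tuple[Buffer, Dtype]]
```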
> If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? It may even be incorrect to describe an interchange column as having a bit/byte mask when it doesn't have one.
That's certainly a possible solution, but I personally find that it feels a bit wrong. The column is nullable, in the sense that it *can* have nulls (that's typically how "nullable" is interpreted, I think). The null count just happens to be 0, in which case Arrow can optimize by not allocating the bitmask.
Also, for a datetime64 column you probably wouldn't change the null representation from USE_SENTINEL to NON_NULLABLE if there are no nulls (NaT) present (although of course here it has no impact on the memory layout).
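For reference, these are the null-representation kinds defined by the protocol (as encoded in pandas' `ColumnNullType` enum); the argument above is that a bit/byte-masked column with zero nulls should keep its kind rather than being re-described as `NON_NULLABLE`:

```python
import enum

class ColumnNullType(enum.IntEnum):
    NON_NULLABLE = 0  # column cannot hold missing values
    USE_NAN = 1       # NaN/NaT built into the dtype
    USE_SENTINEL = 2  # dtype-specific sentinel value
    USE_BITMASK = 3   # separate mask buffer, one bit per value
    USE_BYTEMASK = 4  # separate mask buffer, one byte per value
```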
One corner case where this fallback to non-nullable doesn't necessarily work well is that a column can have multiple chunks, and in pyarrow one chunk might have a null bitmap while the next chunk might not.
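A small pyarrow sketch of that corner case, using the stable `pa.chunked_array` / `Array.buffers()` APIs:

```python
import pyarrow as pa

# One chunk with a null, one without: only the first allocates a bitmap.
col = pa.chunked_array([
    pa.array([1, None, 3]),  # validity bitmap allocated
    pa.array([4, 5, 6]),     # no bitmap: null_count == 0
])

for i, chunk in enumerate(col.chunks):
    # buffers()[0] is the validity bitmap; None when the chunk has no nulls.
    print(i, chunk.null_count, chunk.buffers()[0])
```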