This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

Modifying nested column will have no effect #207

Open
wenleix opened this issue Feb 18, 2022 · 0 comments
Labels
bug Something isn't working

wenleix commented Feb 18, 2022

To reproduce:

import torcharrow as ta
import torcharrow.dtypes as dt
dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field("dense_features", dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)])),
    ]
)
df = ta.DataFrame(
    [
        (1, (0, 1)),
        (0, (10, 11)),
        (1, (20, 21)),
    ],
    dtype=dtype)

Now df looks like:

>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

Trying to change df["dense_features"]["int_1"] has no effect:

>>> df["dense_features"]["int_1"] = df["dense_features"]["int_1"] + 1
>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

For now, the workaround is to first take the nested DataFrame out, apply the transformation, and then put it back:

https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/test/integration/test_criteo.py#L149-L157

The problem is that DataFrameCpu._set_field_data generates a new RowVector and copies the column vector pointers. For a nested RowVector, it only updates the leaf-level struct and doesn't propagate the change upwards: https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/velox_rt/dataframe_cpu.py#L310-L329
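
To make the mechanism concrete, here is a toy model in plain Python. It is not TorchArrow or Velox code: `RowVector` and `ToyFrame` are hypothetical stand-ins that only mimic the behaviour described above (a setter that builds a new RowVector and rebinds it on the wrapper it was called on). It shows why the nested assignment is lost, and why the take-out/put-back workaround succeeds.

```python
class RowVector:
    """Hypothetical stand-in for a Velox RowVector: named child columns."""

    def __init__(self, names, children):
        self.names = list(names)
        self.children = list(children)


class ToyFrame:
    """Hypothetical stand-in for DataFrameCpu; wraps a RowVector in self._data."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        child = self._data.children[self._data.names.index(name)]
        # A nested struct comes back wrapped in a brand-new ToyFrame.
        return ToyFrame(child) if isinstance(child, RowVector) else child

    def __setitem__(self, name, column):
        # Build a new RowVector, copy the child pointers, swap one child,
        # and rebind only this wrapper. The parent is never told.
        children = list(self._data.children)
        new_child = column._data if isinstance(column, ToyFrame) else column
        children[self._data.names.index(name)] = new_child
        self._data = RowVector(self._data.names, children)


dense = RowVector(["int_1", "int_2"], [[0, 10, 20], [1, 11, 21]])
df = ToyFrame(RowVector(["labels", "dense_features"], [[1, 0, 1], dense]))

# Nested assignment: only a temporary wrapper is updated, then discarded.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
after_nested = df["dense_features"]["int_1"]

# Workaround: take the nested frame out, modify it, and put it back.
nested = df["dense_features"]
nested["int_1"] = [v + 1 for v in nested["int_1"]]
df["dense_features"] = nested
after_workaround = df["dense_features"]["int_1"]

print(after_nested)      # [0, 10, 20]
print(after_workaround)  # [1, 11, 21]
```

In this model the workaround works because the final assignment goes through the top-level wrapper, which is the only object whose `_data` the caller still holds.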

Creating a new RowVector seems necessary, since assigning a column to a DataFrame may change the type of its child columns. One idea would be to allow the wrapped RowColumn to change its delegated RowVector (e.g. something like self._data._reset_data(new_delegate)). Basically, DataFrame would be a thin wrapper and everything would live in RowColumn.

For this to work, DataFrame.dtype should always use the underlying Velox Vector's type as the ground truth.
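
A rough sketch of that idea, again as a plain-Python toy rather than actual TorchArrow code. `RowVector`, `RowColumn`, `ToyFrame` and `describe` are hypothetical names; only `_reset_data` is taken from the suggestion above, and its real signature is an assumption. The nested wrapper shares the same `RowColumn` handle as the parent, so resetting the delegate is visible from the top level, and `dtype` is computed from the delegate instead of being cached.

```python
class RowVector:
    """Hypothetical stand-in for a Velox RowVector: named child columns."""

    def __init__(self, names, children):
        self.names = list(names)
        self.children = list(children)


class RowColumn:
    """Hypothetical mutable handle whose RowVector delegate can be swapped."""

    def __init__(self, delegate):
        self._delegate = delegate

    def _reset_data(self, new_delegate):
        self._delegate = new_delegate


def describe(child):
    """Derive a toy dtype from the data itself (the 'ground truth')."""
    if isinstance(child, RowColumn):
        d = child._delegate
        return {n: describe(c) for n, c in zip(d.names, d.children)}
    return type(child[0]).__name__


class ToyFrame:
    """Thin wrapper: all state lives in the shared RowColumn handle."""

    def __init__(self, column):
        self._data = column

    @property
    def dtype(self):
        # Never cached, always read from the current delegate.
        return describe(self._data)

    def __getitem__(self, name):
        d = self._data._delegate
        child = d.children[d.names.index(name)]
        # Nested structs share the parent's RowColumn handle.
        return ToyFrame(child) if isinstance(child, RowColumn) else child

    def __setitem__(self, name, column):
        d = self._data._delegate
        children = list(d.children)
        new_child = column._data if isinstance(column, ToyFrame) else column
        children[d.names.index(name)] = new_child
        # Still a new RowVector, but swapped in behind the shared handle.
        self._data._reset_data(RowVector(d.names, children))


dense = RowColumn(RowVector(["int_1", "int_2"], [[0, 10, 20], [1, 11, 21]]))
df = ToyFrame(RowColumn(RowVector(["labels", "dense_features"], [[1, 0, 1], dense])))

# The nested assignment now propagates, even though it changes the child type.
df["dense_features"]["int_1"] = [v + 0.5 for v in df["dense_features"]["int_1"]]
result = df["dense_features"]["int_1"]
dtype = df.dtype

print(result)  # [0.5, 10.5, 20.5]
print(dtype)   # {'labels': 'int', 'dense_features': {'int_1': 'float', 'int_2': 'int'}}
```

One trade-off of sharing handles this way is aliasing: any other wrapper that holds the same `RowColumn` would observe the change too, which may or may not be the desired semantics.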

@wenleix wenleix added the bug Something isn't working label Feb 18, 2022