BUG: Writes to DataFrame.attrs are not preserved #7401

Open
3 tasks done
noloerino opened this issue Sep 24, 2024 · 1 comment · May be fixed by #7402
Labels
bug 🦗 Something isn't working P2 Minor bugs or low-priority feature requests pandas concordance 🐼 Functionality that does not match pandas

Comments

@noloerino
Collaborator

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # any frame reproduces the problem
df.attrs["x"] = 1
df.attrs  # attrs dict is still empty

Issue Description

DataFrame.attrs lets users attach metadata to a frame; pandas deep-copies that metadata to the new dataframes produced when operations are performed. In Modin, attrs defaults to pandas, which means that writes to it are not reflected in the original frame, much less propagated to the results of other operations.

When a write to attrs is attempted, it only modifies the attrs field of the native pandas.DataFrame that's produced within DataFrame._default_to_pandas, and the modin.pandas.DataFrame has no knowledge of this operation.
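The mechanism can be modeled without Modin or pandas. The sketch below is a toy illustration only: `NativeFrame`, `ToyFrame`, and `_to_native` are hypothetical names, not Modin internals. It shows why a write made through a property that materializes a fresh native object on every access is lost.

```python
class NativeFrame:
    """Hypothetical stand-in for a native pandas.DataFrame."""

    def __init__(self):
        self.attrs = {}


class ToyFrame:
    """Hypothetical stand-in for a frame with no attrs storage of its own."""

    def _to_native(self):
        # Each call materializes a brand-new native object.
        return NativeFrame()

    @property
    def attrs(self):
        # Defaulting to the native implementation: the returned dict belongs
        # to a temporary object that nothing keeps a reference to.
        return self._to_native().attrs


df = ToyFrame()
df.attrs["x"] = 1  # mutates the temporary's dict only
lost = df.attrs    # a fresh temporary, so the earlier write is gone
print(lost)
```

The write succeeds without error, which is what makes the bug easy to miss: the failure only shows up on the next read.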

Expected Behavior

Writes to attrs are reflected in subsequent read operations, and propagated across operations.

Error Logs

Replace this line with the error backtrace (if applicable).

Installed Versions

INSTALLED VERSIONS

commit : 1c4d173
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.32.0+6.g1c4d173d
ray : 2.34.0
dask : 2024.8.1
distributed : 2024.8.1

pandas dependencies

pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : 8.3.2
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.3.0
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : 8.17.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.2
numba : None
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.23.1
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.6.1
scipy : 1.14.1
sqlalchemy : 2.0.32
tables : 3.10.1
tabulate : None
xarray : 2024.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@noloerino noloerino added bug 🦗 Something isn't working Triage 🩹 Issues that need triage pandas concordance 🐼 Functionality that does not match pandas P2 Minor bugs or low-priority feature requests and removed Triage 🩹 Issues that need triage labels Sep 24, 2024
@noloerino
Collaborator Author

See pandas discussion: pandas-dev/pandas#52166

Though attrs is not fully mature, it seems to be used pretty frequently in downstream libraries to track metadata for use cases like plot generation, and the feature seems to be here to stay.

pandas supports propagation of attrs through __finalize__, which Modin vacuously defaults to pandas. I think the least intrusive approach for us would be to keep attrs as a non-distributed, regular Python dict and track it at the query compiler level. It may be better to track attrs through __finalize__ as native pandas does, but that would require changing almost every frontend method to call it before returning.
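A minimal sketch of the first option, assuming a toy query compiler: `ToyQueryCompiler`, `ToyFrame`, `map`, and `add` are hypothetical names for illustration and do not reflect Modin's actual classes or the linked pull request. The attrs dict lives on the compiler as a plain Python dict and is deep-copied whenever an operation produces a new compiler.

```python
import copy


class ToyQueryCompiler:
    """Hypothetical stand-in for a query compiler that owns the attrs dict."""

    def __init__(self, data, attrs=None):
        self.data = list(data)
        # attrs is a plain, non-distributed Python dict.
        self.attrs = {} if attrs is None else attrs

    def map(self, func):
        # Every operation builds a new compiler and deep-copies attrs onto it,
        # so the result cannot alias the metadata of its source.
        return ToyQueryCompiler(
            (func(v) for v in self.data), attrs=copy.deepcopy(self.attrs)
        )


class ToyFrame:
    """Hypothetical frontend frame that delegates attrs to its compiler."""

    def __init__(self, query_compiler):
        self._query_compiler = query_compiler

    @property
    def attrs(self):
        # Return the stored dict itself so in-place writes persist.
        return self._query_compiler.attrs

    @attrs.setter
    def attrs(self, value):
        self._query_compiler.attrs = dict(value)

    def add(self, n):
        return ToyFrame(self._query_compiler.map(lambda v: v + n))


df = ToyFrame(ToyQueryCompiler([1, 2, 3]))
df.attrs["tags"] = ["raw"]               # the write is now preserved
result = df.add(1)                       # attrs travel with the new frame
result.attrs["tags"].append("shifted")   # does not leak back into df
```

Because propagation happens where new compilers are created, the frontend methods need no per-method changes in this sketch; the trade-off is that the compiler layer has to carry frontend metadata.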

noloerino added a commit to noloerino/modin that referenced this issue Sep 24, 2024
@noloerino noloerino linked a pull request Sep 24, 2024 that will close this issue
7 tasks