Modifying a subset of AnnData using the .iloc/.loc method does not make a new copy, and the original object is modified #1840

Open
2 of 3 tasks
crazyxiaoj opened this issue Jan 27, 2025 · 7 comments

crazyxiaoj commented Jan 27, 2025

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

When using the .iloc or .loc methods to modify a subset of an AnnData object, it seems that no new copy is created; instead, the original AnnData object is directly modified.

Code:

from anndata import AnnData
import numpy as np

a = AnnData(X=np.arange(16).reshape(4,4), var=list('ABCD'), obs=list('abcd'))
b = a[:2,:2]
b.obs.iloc[:,:] = 0  # the same results using the .loc method.
b
# View of AnnData object with n_obs × n_vars = 2 × 2
#     obs: 0
#     var: 0
a.obs
#    0
# 0  0
# 1  0
# 2  c
# 3  d

As a beginner, I'm not sure if this behavior is a bug or by design. Could someone clarify whether this is intentional, and if so, could you please explain why it functions this way? Thanks for your assistance!

Versions

| Package | Version |
| ------- | ------- |
| pandas  | 2.2.3   |
| anndata | 0.11.3  |
| numpy   | 2.1.3   |

| Dependency         | Version     |
| ------------------ | ----------- |
| Pygments           | 2.18.0      |
| matplotlib         | 3.9.3       |
| defusedxml         | 0.7.1       |
| traitlets          | 5.14.3      |
| stack_data         | 0.6.3       |
| decorator          | 5.1.1       |
| jaraco.text        | 3.12.1      |
| six                | 1.17.0      |
| charset-normalizer | 3.4.0       |
| scipy              | 1.14.1      |
| pillow             | 11.0.0      |
| pyparsing          | 3.2.0       |
| session-info2      | 0.1.2       |
| platformdirs       | 4.3.6       |
| packaging          | 24.2        |
| h5py               | 3.12.1      |
| jaraco.collections | 5.1.0       |
| jaraco.context     | 5.3.0       |
| setuptools         | 75.6.0      |
| natsort            | 8.4.0       |
| cycler             | 0.12.1      |
| asttokens          | 3.0.0       |
| parso              | 0.8.4       |
| python-dateutil    | 2.9.0.post0 |
| kiwisolver         | 1.4.7       |
| jedi               | 0.19.2      |
| prompt_toolkit     | 3.0.48      |
| ipython            | 8.30.0      |
| pytz               | 2024.1      |
| pure_eval          | 0.2.3       |
| more-itertools     | 10.3.0      |
| pickleshare        | 0.7.5       |
| jaraco.functools   | 4.0.1       |
| wcwidth            | 0.2.13      |
| executing          | 2.1.0       |

| Component | Info                                                                            |
| --------- | ------------------------------------------------------------------------------- |
| Python    | 3.13.1 \| packaged by conda-forge \| (main, Dec  5 2024, 21:23:54) [GCC 13.3.0] |
| OS        | Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31                   |
| Updated   | 2025-01-27 11:47                                                                |

@AlessiaLeclercq

Hello,
I am also trying to subset an AnnData object using some obs values.
The dataset is the "Peaks_RNA.loom" found here.
Specifically, I have an AnnData object called data and I want to subset it according to the obs columns "Method" and "Tissue".
Here is the code:

import os
import scanpy as sc
path = ... #path to loom file 
data = sc.read_loom(path) 
print(data.shape) #returns  526094 × 59480
subset_data = data[data.obs["Method"]=="rnaXatac"]
subset_data = data[data.obs["Tissue"].isin(["Cerebellum", "Brain"])]
print(subset_data) # View of AnnData object with n_obs × n_vars = 44333 × 59480 ... 

However, I want it to be a proper AnnData object so that I can save it to an h5ad file.
How can I do it? I am using python 3.9.6.
Here follows the description of the environment:

anndata==0.10.8
annoy==1.17.3
array_api_compat==1.9.1
bbknn==1.6.0
cellrank==2.0.6
click==8.1.7
contourpy==1.3.0
cycler==0.12.1
Cython==3.0.11
dnspython==2.7.0
docrep==0.3.2
et_xmlfile==2.0.0
exceptiongroup==1.2.2
fcsparser==0.2.8
filelock==3.16.1
fonttools==4.54.1
fsspec==2024.12.0
future==1.0.0
get-annotations==0.1.2
h5py==3.12.1
harmonypy==0.0.10
hyperopt==0.1.2
igraph==0.11.8
importlib_metadata==8.5.0
importlib_resources==6.4.5
jax==0.4.30
jaxlib==0.4.30
jaxopt==0.8.3
Jinja2==3.0.3
joblib==1.4.2
kiwisolver==1.4.7
legacy-api-wrap==1.4
leidenalg==0.10.2
llvmlite==0.43.0
loompy==3.0.7
louvain==0.8.2
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
mellon==1.5.0
ml_dtypes==0.5.1
mofapy2==0.7.2
mpmath==1.3.0
mudata==0.2.4
muon==0.1.6
natsort==8.4.0
networkx==3.2.1
numba==0.60.0
numpy==1.26.4
numpy-groupies==0.11.2
openpyxl==3.1.5
opt_einsum==3.4.0
packaging==24.1
palantir==1.3.6
pandas==2.2.3
patsy==0.5.6
petsc==3.22.0
petsc4py==3.22.0
pillow==11.0.0
progressbar2==4.5.0
protobuf==5.29.0
pygam==0.9.1
Pygments==2.19.1
pygpcca==1.0.4
pymongo==4.10.1
pynndescent==0.5.13
pyparsing==3.2.0
pysam==0.22.1
python-dateutil==2.9.0.post0
python-utils==3.9.0
pytz==2024.2
rich==13.9.4
scanpy==1.10.3
scikit-learn==1.5.2
scikit-misc==0.3.1
scipy==1.11.4
scvelo @ git+https://github.com/theislab/scvelo@22b6e7e6cdb3c321c5a1be4ab2f29486ba01ab4f
scvi==0.6.8
scvi-colab==0.12.0
seaborn==0.13.2
session-info==1.0.0
six==1.16.0
slepc==3.22.1
slepc4py==3.22.1
statsmodels==0.14.4
stdlib-list==0.11.0
sympy==1.13.1
texttable==1.7.0
threadpoolctl==3.5.0
torch==2.5.1
tqdm==4.66.6
typing_extensions==4.12.2
tzdata==2024.2
umap-learn==0.5.7
wrapt==1.16.0
xlrd==2.0.1
zipp==3.20.2

@ilan-gold
Contributor

> When using the .iloc or .loc methods to modify a subset of an AnnData object, it seems that no new copy is created; instead, the original AnnData object is directly modified.

@crazyxiaoj as far as I can tell, this behavior is totally expected. A view is just that, a view. So if you edit the view, you'll edit the actual object. It might be worth disallowing this completely, but there are probably cases where the behavior is desirable.

> However, I want it to be a proper AnnData object so that I can save it to an h5ad file.

@AlessiaLeclercq Writing the view directly should be possible. If it does not work with the object you have, you can certainly create a copy first via adata.copy(): https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.copy.html

import anndata as ad
import numpy as np

adata = ad.AnnData(X=np.array([[1, 2], [3, 4]]))
adata[:1,:].write_h5ad("foo.h5ad") # works, but also `.copy` is fine

@crazyxiaoj
Author

> > When using the .iloc or .loc methods to modify a subset of an AnnData object, it seems that no new copy is created; instead, the original AnnData object is directly modified.
>
> @crazyxiaoj as far as I can tell, this behavior is totally expected. A view is just that, a view. So if you edit the view, you'll edit the actual object. It might be worth disallowing this completely, but there are probably cases where the behavior is desirable.

Your explanation is a bit unclear to me. I referred to the content on the following webpage: https://anndata.readthedocs.io/en/stable/generated/anndata.AnnData.html.

Here’s the relevant excerpt:

> Copying a view causes an equivalent “real” AnnData object to be generated. Attempting to modify a view (at any attribute except X) is handled in a copy-on-modify manner, meaning the object is initialized in place.

Based on the paragraph above, it appears that modifying properties like obs results in the creation of a new AnnData object. Additionally, I noticed that performing an assignment directly using [], rather than the iloc method, also triggers the creation of a new object.

@ilan-gold
Contributor

> Based on the paragraph above, it appears that modifying properties like obs results in the creation of a new AnnData object. Additionally, I noticed that performing an assignment directly using [], rather than the iloc method, also triggers the creation of a new object.

Thanks for sharing this. The issue here would be wrapping every single dataframe method. I'm not sure why this wasn't done initially, since only drop was wrapped. I was aware of the "copy-on-write" paradigm, but I thought the promise was shallower than this, i.e., affecting only things like columns or keys. We should compile a list of things here, I suppose:

  1. set_index (although this one is very bad for other reasons)
  2. loc
  3. iloc
  4. insert
  5. pop
  6. drop_duplicates
  7. rename_axis

and many more. This might be why this wasn't done. So it's possible we should carve out an exception for pandas.

@crazyxiaoj
Author

Thank you for your clarification. I think I'm beginning to understand.

Do you still believe it's necessary to keep this issue open? If you feel it is no longer needed, we can consider closing it.

@ilan-gold
Contributor

> Do you still believe it's necessary to keep this issue open? If you feel it is no longer needed, we can consider closing it.

Well, it is certainly an inconsistency, so it seems we should either edit the docs or add the feature set. I've asked @ivirshup to weigh in.

@AlessiaLeclercq

Thank you!
