add normalization method to scanpy? #1
Hi, Giovanni, thanks for reaching out! We are planning to submit a PR to scanpy, perhaps even next week. Are you one of the scanpy developers? I have some questions:
Hi @dkobak, thank you for the prompt reply! All valid questions, and it's indeed best to discuss them upfront. Pinging @ivirshup @flying-sheep and @LuckyMD to join the discussion.

A standard normalization workflow in scanpy currently looks like this:

```python
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)
```

As you pointed out, the Pearson residuals could come in before or after that, e.g.:

```python
sc.pp.pearson_residual(adata)
sc.pp.highly_variable_genes(adata)
# or
# some normalization
sc.pp.highly_variable_genes(adata)
sc.pp.pearson_residual(adata)
```

It should also be pointed out that […]. For these reasons, my naive view on this is to have a separate function that would give more flexibility to the users, as long as they know what they are doing. So:

- yes
- no it does not, but it's flexible enough to do it with or without normalized counts. RE the dense matrix, it's OK as long as there is a warning to the user
- for the reasons above, I would implement it in a separate function.

Looking forward to hearing what you and the others think!
Thanks @giovp. This is very helpful.

What is the meaning of "total" in `normalize_total`?

For highly variable genes I would suggest adding a […].

Despite what you said, I am a bit concerned about Pearson residuals resulting in a dense matrix, so I would suggest that […]. I am not sure if it would make sense to allow […]. In addition, what do you think about having […]?
I think pearson residuals would have to be a separate function, and it would require a bit of separate preprocessing infrastructure, as it wouldn't fit into the mix-and-match workflow scanpy currently has for pre-processing (which is currently not great). This is mainly because pearson residuals wouldn't really work with scanpy's […].

One thing I've been wondering about models that output residuals: […]

Regarding filtering to highly variable genes... this is something we have been removing gradually from Scanpy. When Scanpy initially filtered out highly variable genes for further processing, we stored the full feature set in `.raw`. How about implementing this as two scanpy functions (for normalization and HVG selection) without filtering, but adding a warning about scaling in the docs? We would also need a test to check whether genes have mean 0 expression before the standard HVG functionality is used within scanpy.

Regarding alternative forms of variance stabilization... I'm not sure what the best way forward would be there. I believe you showed sqrt-transform not to be the best idea, and it's typically not used. Maybe there's such a thing as providing too many options? On the other hand, it's easy to implement and some people might find it useful... I haven't really seen it used outside of pre-processing papers, tbh.
Thanks for joining in @LuckyMD. So if I understood correctly, you agree with having two new functions […], right?

Interesting question. The problem is that we do the seq depth normalization within the computation of the residuals, so there is no fixed "intercept" for each gene. Instead, the "intercept" is different for each cell (because it depends on the cell's seq depth). I guess one could add some intercept, e.g. […].

I see what you are saying. But when dealing with, say, 1M cells, a dense data matrix of 1M × 25k genes takes A LOT of space. I would definitely like to have the possibility to run the analysis on my laptop, which would require selecting, let's say, 1k highly variable genes before computing dense residuals. Can we have it at least as an option (turned off by default), without breaking scanpy's logic? In this case, it'd be nice to have […].

We do argue it's not the best idea, but it's not worse than log1p (rather better). I am not sure about this one, but my preference would be to have such a function... As you say, "it's easy to implement and some people might find it useful". It's not used very often, that's true, but that's because everybody is following tradition and using `log1p`.
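For concreteness, the analytic Pearson residuals under discussion can be sketched in a few lines of NumPy. This follows the preprint's description as I read it; the function name, the dense input, the default θ = 100 and the clipping at ±√(n_cells) are illustrative assumptions here, not an agreed scanpy API.

```python
import numpy as np


def pearson_residuals_sketch(counts, theta=100.0):
    """Analytic Pearson residuals for a dense cells x genes UMI count matrix (sketch)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    # mu_ij = (depth of cell i) * (total of gene j) / (grand total): depends on cell AND gene
    mu = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
    # Pearson residuals under a negative binomial model with fixed overdispersion theta
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    # clip to +/- sqrt(n_cells) to keep rare outliers from dominating
    n_cells = counts.shape[0]
    return np.clip(residuals, -np.sqrt(n_cells), np.sqrt(n_cells))
```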
Partly... I guess the second function would be an addition to `sc.pp.highly_variable_genes`.
It would be interesting to compare your suggestion with expression levels per gene that come out of other normalization methods that retain expression magnitude (like scran pooling).
The second issue is that you'd have to use a non-Pearson-residuals normalization for the first HVG selection, and then run normalization on this HVG set with Pearson residuals, to then know which HVG selection you might want to do based on the variance of residuals, no?

That's probably true. I guess I use […].
Hey @dkobak. I haven't actually had a chance to fully read the preprint yet, so I might be missing out on some details of the method.
So, what are the uses of the dense residual matrix? Do you still want it around after you've computed a PCA or other decomposition on top of it? To me, it seems like there are more downstream uses of the full feature set than of a reduced one.

A possible solution here would be to not actually create a dense matrix. This is currently what we do with our default PCA for sparse data by using a `LinearOperator`. LinearOperators let you implicitly center the matrix. Here's an example of how we mean-center the expression matrix for PCA:

```python
import numpy as np
from scipy.sparse import spmatrix
from scipy.sparse.linalg import LinearOperator


def _implicitly_center(X: spmatrix, mu: np.ndarray) -> LinearOperator:
    # mu holds the per-gene (column) means; X is never densified
    mu = mu[None, :]
    XH = X.T.conj(copy=False)
    _ones = np.ones(X.shape[0])[None, :].dot
    return LinearOperator(
        matvec=lambda x: X.dot(x) - mu.dot(x),
        dtype=X.dtype,
        matmat=lambda x: X.dot(x) - mu.dot(x),
        shape=X.shape,
        rmatvec=lambda x: XH.dot(x) - mu.T.dot(_ones(x)),
        rmatmat=lambda x: XH.dot(x) - mu.T.dot(_ones(x)),
    )
```

Another option would be to just store the residuals for the variable genes. A simple encoding would be a […].
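A quick usage sketch for the operator above (the test data, shapes and `k` are arbitrary): because `svds` accepts a `LinearOperator`, the truncated SVD can be computed on the implicitly centered matrix while `X` stays sparse.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

X = sparse.random(5000, 2000, density=0.1, format="csr")
mu = np.asarray(X.mean(axis=0)).ravel()   # per-gene (column) means
Z = _implicitly_center(X, mu)             # behaves like the dense X - mu
u, s, vt = svds(Z, k=30)                  # leading components, X is never densified
```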
Replying to both comments above:
That's up to you guys. We can make […].

Actually no! HVG selection with Pearson residuals does not require a lot of memory. Residuals can be computed per gene (or in batches, e.g. 1000 genes at a time); then we find their variance and don't need to store the residuals. So one can easily find the variances of Pearson residuals without forming the entire Pearson residual matrix in memory. So one would be able to run […].

To be honest, I think sqrt makes more sense than log1p for any downstream use, at least for UMI data.

Hmm. Maybe one does not want to have it around after PCA. When you run t-SNE/UMAP/clustering/etc. in scanpy, are you using the PCA object? If yes, then one approach would be to actually combine […].

This won't work, I think. The reason the Pearson residual matrix is dense is not that the gene means are subtracted. The "mean" that is subtracted from each UMI count depends on both the cell and the gene, so the matrix of means is itself dense.

You mean store the Pearson residuals not in […]?
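To make the batched computation concrete, here is a rough sketch of accumulating per-gene residual variances chunk by chunk, so the full residual matrix never exists in memory. The function name is hypothetical and it reuses the residual definition sketched earlier in this thread.

```python
import numpy as np
from scipy import sparse


def residual_variances_by_chunk(counts, theta=100.0, chunk_size=1000):
    """Per-gene variance of analytic Pearson residuals, computed in gene chunks."""
    counts = sparse.csc_matrix(counts)                    # cheap column slicing
    n_cells, n_genes = counts.shape
    cell_sums = np.asarray(counts.sum(axis=1)).ravel()    # sequencing depth per cell
    gene_sums = np.asarray(counts.sum(axis=0)).ravel()    # total counts per gene
    total = cell_sums.sum()
    variances = np.empty(n_genes)
    for start in range(0, n_genes, chunk_size):
        stop = min(start + chunk_size, n_genes)
        X = counts[:, start:stop].toarray()               # only this chunk is dense
        mu = cell_sums[:, None] * gene_sums[None, start:stop] / total
        res = (X - mu) / np.sqrt(mu + mu**2 / theta)
        res = np.clip(res, -np.sqrt(n_cells), np.sqrt(n_cells))
        variances[start:stop] = res.var(axis=0)
    return variances  # rank genes by this to pick highly variable ones
```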
Great! Then I would vote for this as an extension to […].

Any particular reason for this?
I'd say the recommended workflow is UMAP on the pca, then clustering on the neighbor network from UMAP. But this is pretty commonly the case in single cell analysis.
👍, but you'd know better if there were more downstream applications for the residuals.
So, I think the issue is not that the matrix of means is dense. This is fairly easy to deal with while never making a dense matrix. The […]

Full code:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds, aslinearoperator
from sklearn.utils import check_random_state
from sklearn.utils.extmath import svd_flip
from scanpy.pp._utils import _get_mean_var


def operator_from_vector_product(col, row):
    # Implicit rank-1 matrix col @ row, represented only through its products
    assert len(col.shape) == 2 and col.shape[1] == 1
    assert len(row.shape) == 2 and row.shape[0] == 1
    return LinearOperator(
        matvec=lambda x: col @ (row @ x),
        matmat=lambda x: col @ (row @ x),
        rmatvec=lambda x: row.T @ (col.T @ x),
        rmatmat=lambda x: row.T @ (col.T @ x),
        dtype=col.dtype,
        shape=(col.shape[0], row.shape[1]),
    )


# In practice this probably should have another centering step for the residuals
def pca_from_centered(X, X_centered, npcs=50, random_state=None):
    random_state = check_random_state(random_state)
    np.random.set_state(random_state.get_state())
    random_init = np.random.rand(np.min(X.shape))
    u, s, v = svds(X_centered, solver="arpack", k=npcs, v0=random_init)
    u, v = svd_flip(u, v)
    idx = np.argsort(-s)
    v = v[idx, :]
    X_pca = (u * s)[:, idx]
    ev = s[idx] ** 2 / (X.shape[0] - 1)
    total_var = _get_mean_var(X)[1].sum()
    ev_ratio = ev / total_var
    output = {
        'X_pca': X_pca,
        'variance': ev,
        'variance_ratio': ev_ratio,
        'components': v,
    }
    return output
```

Example:

```python
# Test data
X = sparse.random(10000, 1000, format="csr", density=0.2)
mu = operator_from_vector_product(X.mean(axis=1).A, X.mean(axis=0).A)
Z = aslinearoperator(X) - mu

%memit
# peak memory: 342.61 MiB, increment: 0.07 MiB

%%memit
pca_from_centered(X, Z, random_state=42)
# peak memory: 376.68 MiB, increment: 34.05 MiB

%%memit
pca_from_centered(X, X.toarray() - (X.mean(axis=1) @ X.mean(axis=0)), random_state=42)
# peak memory: 589.89 MiB, increment: 213.20 MiB
```

These results are very similar. You can compare the top few components to check. There's some weird behaviour with `svds` and dense arrays involving the number of singular values, but I forget exactly what it is. I just don't know if […]

Yeah, as something like […]. Here I was thinking downstream tools with special behavior for pearson residuals would look for some […]. You could look at the output of […].
Very good point about the rank-1 μ matrix being representable this way in […]. Okay, based on all of the above, my suggestion would be: […]

BTW, what does […]?

Actually this way […].

(@jlause We should implement some checks so that if the data look like read counts and not like UMIs, e.g. the values are too large, then you get a warning. We are currently working on Pearson residuals for read counts, but it's work in progress.)

@ivirshup What do you think?
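To spell out the rank-1 point using the `operator_from_vector_product` helper from the "Full code" block above (variable names here are illustrative): for UMI counts the mean matrix is an outer product of per-cell depths and per-gene totals, so the centering can stay implicit; it is the entry-wise denominator √(μ + μ²/θ) that does not reduce to such a low-rank correction.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import aslinearoperator

X = sparse.random(10000, 1000, format="csr", density=0.2)   # toy UMI-like counts
cell_sums = np.asarray(X.sum(axis=1))    # shape (n_cells, 1)
gene_sums = np.asarray(X.sum(axis=0))    # shape (1, n_genes)
total = cell_sums.sum()

# mu_ij = cell_sums_i * gene_sums_j / total is rank-1, so X - mu can be kept implicit,
# exactly as in the operator_from_vector_product example above
mu_op = operator_from_vector_product(cell_sums / total, gene_sums)
X_minus_mu = aslinearoperator(X) - mu_op
```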
Just a quick comment... your suggested pipeline would omit a separate normalization function that users might expect to run before […].
@LuckyMD Hmm. We could simply call this function […]. Then the application pipeline would be: […]

It may still be confusing for some because […]. We could make […].

(@jlause One thing we should do is to check that everything passed to the Pearson residuals functions consists of integers. If not, fire a warning.)
Hey all, thanks for all this input!

Regarding @LuckyMD's point about the 'one-stop preprocessing': I like the idea of having a complete workflow like […]. Additionally, we could implement the two core functionalities we propose as independent functions for flexible usage, as @dkobak suggested above: […]

Within the recipe, we would then just call these functions in the appropriate order and add PCA on top. @ivirshup @LuckyMD @giovp What do you think about this setup?
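As one purely illustrative reading of this proposal, the two independent functions could be thin wrappers around the sketches given earlier in this thread. Every name, argument and default below is a placeholder for the discussion, not a decided scanpy API.

```python
import numpy as np
from scipy import sparse


def normalize_pearson_residuals(adata, theta=100.0):
    """Hypothetical: replace adata.X with dense analytic Pearson residuals."""
    X = adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X)
    adata.X = pearson_residuals_sketch(X, theta=theta)


def highly_variable_genes_pearson(adata, n_top_genes=2000, theta=100.0):
    """Hypothetical: flag the n_top_genes genes with the highest residual variance."""
    variances = residual_variances_by_chunk(adata.X, theta=theta)
    cutoff = np.sort(variances)[-n_top_genes]
    adata.var["highly_variable"] = variances >= cutoff   # ties may flag a few extra genes
    adata.var["pearson_residual_variance"] = variances
```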
Regarding the practical implementation of […]:

From earlier posts I understood that filtering to HVGs can be applied in two ways: […]

I think what @dkobak proposed is a mixture: […]

What is the preferred way? To me, it seems the easiest could be to not integrate any […].
@jlause Clarification:
What I meant is that this would ONLY happen when the user calls […]. But you may be right that the 2nd way is enough and we don't need to implement […].

One thing, however: there is a slight difference between computing Pearson residuals on the subsetted matrix vs. on the full matrix but outputting only a subset of genes, because the seq depths would be different... Not sure it matters much.
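A toy illustration of the difference described here, reusing the hypothetical `pearson_residuals_sketch` helper from earlier in the thread (`counts` and `hvg_idx` stand in for a raw count matrix and an HVG index):

```python
# Residuals computed on the already-subsetted counts: the per-cell depths
# are sums over the selected genes only
res_on_subset = pearson_residuals_sketch(counts[:, hvg_idx])

# Residuals computed on all genes and subsetted afterwards: the per-cell depths
# are sums over the full gene set, so the two results differ slightly
res_then_subset = pearson_residuals_sketch(counts)[:, hvg_idx]
```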
I agree fully with @jlause. The proposed 3 functions ([…]) […]. I would also prefer normalization functions that don't change the dimensions of the input object as a design principle.

This is the only aspect that might be problematic. We could get around this by adding a […].
@jlause @LuckyMD Okay, so I am on board with […]. I am less sure about

[…]

vs.

[…]

compared to the standard pipeline

[…]

Right? I think the first option is actually closer to what the default scanpy approach is. Otherwise you could combine […].
@dkobak @LuckyMD Regarding […]:

First of all, I thought having the pipeline as a recipe might be appropriate because it would ship the 'standard approach' we suggest in the paper, including all steps from raw to PCA, providing only a few degrees of freedom (I saw other recipes are doing something similar). The pseudocode for […] would look like this:

[…]

The version suggested by @dkobak would look like this:

[…]

and it would allow the user to choose the initial filtering and HVG selection. In both bundles, one could additionally make sure that the […]

I would use something like […] in either bundle.
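Since the pseudocode itself did not survive in this copy of the thread, here is one possible shape for such a bundle, written with the hypothetical helper names used above. The ordering of HVG selection, subsetting, normalization and PCA is exactly what is being debated here, so treat this as a sketch of the @dkobak-style variant rather than an agreed recipe.

```python
import scanpy as sc


def recipe_pearson_residuals_sketch(adata, n_top_genes=2000, theta=100.0, n_comps=50):
    # 1) rank genes by Pearson-residual variance (computed gene-wise, low memory)
    highly_variable_genes_pearson(adata, n_top_genes=n_top_genes, theta=theta)
    # 2) keep the full object around, then subset to the selected genes
    adata.raw = adata
    adata = adata[:, adata.var["highly_variable"]].copy()
    # 3) dense Pearson residuals only for the selected genes
    normalize_pearson_residuals(adata, theta=theta)
    # 4) PCA on the residuals
    sc.pp.pca(adata, n_comps=n_comps)
    return adata
```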
@dkobak @jlause Happy new year! Just got back to this discussion.
I guess the analogy is to the old […].

For me, both function concepts are otherwise completely fine. The only minor issue with expecting a prior HVG selection is that any HVG selection approach in Scanpy, apart from the one proposed here, would expect normalized (and logged... I think) data. Thus, I assume most people that use this recipe would also perform the HVG selection beforehand, as there is currently not really an alternative. I would, however, agree with keeping cell and gene filtering out of the recipe. I wonder if keeping a subset of the […]

I think we should make […]. Looking forward to @ivirshup's thoughts on this. Happy New Year to all!
hi!
I recently read the paper and found the method really convincing and effective!

Would you be interested in submitting a PR to scanpy? I had a look at the code and it seems pretty straightforward to re-implement there. This is essentially the function needed, right?

umi-normalization/tools.py, line 128 in d15bd59

If you are interested and willing, I would suggest you loosely follow `sc.pp.normalize_total` for the implementation: https://github.com/theislab/scanpy/blob/5533b644e796379fd146bf8e659fd49f92f718cd/scanpy/preprocessing/_normalization.py#L28-L202

You can also have a look at the docs. I would be very happy to assist/help out in case you are interested!
Thank you!
Giovanni