Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Labels in Memory T Cell Atlas missing #14

Closed
mihem opened this issue Oct 12, 2021 · 8 comments
Closed

Labels in Memory T Cell Atlas missing #14

mihem opened this issue Oct 12, 2021 · 8 comments

Comments

@mihem
Copy link

mihem commented Oct 12, 2021

Thanks for this great package that is really helpful in single cell analysis.

I struggle with the memory T cell atlas reference, though. Following the tutorial and downloading the annotations from GEO results in 260 missing labels.

ref <- readRDS('../pre-built_references/tbru_ref.rds')
ref_tbru <- readRDS("tbru_ref.rds")
ref_tbru$save_uwot_path <- "tbru_uwot_model"
ref_tbru$meta_data <- read_delim("GSE158769_meta_data.txt")

sum(is.na(ref_tbru$meta_data$cluster_ids))
sum(is.na(ref_tbru$meta_data$cluster_name))

> [1] 260
> [1] 260

This wouldn't be a problem in this huge dataset I think, but after mapQuery, knnPredict.Seurat (I use a Seurat query object) fails with

Error in class::knn(t(ref_obj$Z_corr), Embeddings(query_obj, "harmony"), : no missing values are allowed

Traceback:

3: stop("no missing values are allowed") 2: class::knn(t(ref_obj$Z_corr), Embeddings(query_obj, "harmony"), ref_obj$meta_data[[label_transfer]], k = k) at utils_seurat_symphony.R#452 1: knnPredict.Seurat(tcells, ref_tbru, "cluster_name")

@joycekang Can you maybe provide all labels (same lab, right? ;)) or provide a workaround (filter out cells with missing label?).

Thank you,
Mischko

@joycekang
Copy link
Collaborator

Hi @mihem,

Thanks for the question:

The reference object currently published on Zenodo was built before the publication of the Nathan et al. dataset. Therefore, it does not contain cell type labels or protein expression. These can now be obtained from GEO (GSE158769): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158769

You can update the reference metadata directly, such as ref$meta_data = new_meta_data. Let me know if that helps!

@mihem
Copy link
Author

mihem commented Oct 12, 2021

Hi Joyce,

Sorry I think you misunderstood me. What you said is well explained in your vignette. I downloaded the ref object from zenodo and the meta data from geo and added them exactly as you explained. However there are several missing labels (260). Therefore the knn.Predict function fails.

So my question is not: "where to get the meta data"... but "could you please provide the labels without missing values or help me subsetting the ref object so that it removes the few cells with missing values".

Thank you

@joycekang
Copy link
Collaborator

joycekang commented Oct 12, 2021

Ah, apologies - I read it too fast. I think removing the NA cells is a good way to go.

To do this, here is some code (haven't tried running it myself but this should work):

idx_na = which(is.na(ref_tbru$meta_data$cluster_ids))
ref_tbru$meta_data = ref_tbru$meta_data[-idx_na, ]
ref_tbru$Z_corr = ref_tbru$Z_corr[,-idx_na]
ref_tbru$Z_orig = ref_tbru$Z_orig[,-idx_na] # not essential
ref_tbru$R = ref_tbru$R[,-idx_na]

Can you give that a try? If that doesn't work I can take another look tonight. Thanks!

@mihem
Copy link
Author

mihem commented Oct 14, 2021

Hi Joyce,

that was an easy solution and worked well, thanks!

the class::knn part of the knnPredict function is a performance bottleneck though, took me ~5 min to run on this large dataset, in comparison mapQuery only takes a few second. In general symphony/harmony is super fast compared to the integration function of Seurat so it would be nice if this function would be equally fast I think.

@joycekang
Copy link
Collaborator

hi @mihem,

Glad that worked. For the k-NN function, I agree that it can be suboptimally slow for large reference. What I will say is that Symphony embeddings are agnostic to the downstream inference function. That is, you can train any sort of model (multinomial logistic regression using the glmnet package, SVM, etc) using the Harmonized PCs from the reference cells as input, and cell type labels as output. In our paper, we found that k-NN (marginally) outperformed the other methods we tried, but these other methods would be a lot faster than k-NN once the models are trained.

As example, multinomial LR would be something like this:

mod.glmnet <- glmnet::cv.glmnet(
x = t(reference$Z_corr), y = as.factor(reference$meta_data$cell_type),
family = "multinomial",
alpha = 0,
type.multinomial = "grouped"
)

predictions = mod.glmnet %>% predict(t(query$Z), type = 'class', s = 'lambda.min')

@ilyakorsunsky has implemented a slightly more sophisticated version of this for the Fibroblast Atlas from our lab, where each soft cluster learns it's own model, then for a given query cell it incorporates predictions from all soft clusters.

@mihem
Copy link
Author

mihem commented Oct 14, 2021

Thanks for the insights.
Okay, but training the multinomial LR took extremely long and that's also part of the pipeline. I mean it's not super important, but isn't there a faster C++ implemented knn library that you can use?

@joycekang
Copy link
Collaborator

Fair point! I have it on the todo list - do you happen to have a suggestion for a faster C++ implemented library?

@mihem
Copy link
Author

mihem commented Oct 21, 2021

Cool. Sorry, no real expertise on that. Some googing led me to:
https://github.com/eddelbuettel/rcppannoy which is a well-maintained package and is used by Seurat, based on the python Annoy package from Spotify
https://github.com/jlmelville/rcpphnsw also well maintained
Again no expert, but I guess if those have a similar accuracy in your benchmarking, they are a great choice because they should lead to an enormous (>100x?) speed increase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants