-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simulate() error when running Power.Rmd script #2
Comments
Thanks for taking a look at things! Happy to help. It looks like you have not built the "simulate" function in earlier steps. The steps in the "Simulation Example" under "Parameter Estimation" must be run first before running the "Power" script. Please, let me know if you encounter any additional errors. What are your goals for proceeding? Are you just trying to simulate data or compute power curves? If so, try running using the hierarchicell package (https://github.com/kdzimm/hierarchicell - recently released and is more recently updated than this code). I'd recommend following these steps to get to simulated data: library(hierarchicell) |
Hi @kdzimm, sorry for the late response. Running the simulation example first worked - thanks! I looked at this to get a sense of how you set up the objects for running the zlmCond analysis in the Power.rmd script. I have single-cell RNA-seq data from 24 samples that I want to analyze with MAST, incorporating a random effect for sample ID. I've been using Seurat for data processing, but the Seurat implementation of MAST in the FindMarkers() function for differential expression does not enable random effect modeling. When I ran your script with my data, setting up my sca object exactly as you did, I got several errors and many warnings:
Looking at a recent post about the first error (lme4 issue #573), I'm concerned the negative values in my matrix are causing problems. In Seurat, I used SCTransform to normalize data from each sample - this function generates Pearson residuals from a regularized negative binomial regression, where cellular sequencing depth is utilized as a covariate in a generalized linear model. The Seurat team recommended using these residuals (stored in the scale.data slot) for differential expression analysis - see this vignette under "Where are normalized values stored for sctransform?". Another option would be to use corrected log-normalized counts in the data slot of the SCTransform assay in the Seurat object - I plan to try this next. Do you have any insight for how to address these errors or would you recommend contacting the MAST and/or Seurat teams directly? I appreciate your help! |
No need to apologize for the delay! This is not an error I have seen before, but I think you may be right that the error is being caused by the negative values from the SCTransfom normalization. In my experience with MAST, I am usually providing log2(x + 1) transformed values (either raw counts or TPMs) directly to MAST without doing normalization (just transformation) first. It is my understanding the "ngeneson" parameter in the model will account for "cell size" and act as a sort of normalization. I also noticed you are filtering based on mean > 0 here too. After a normalization that provides negative values, I am not sure this is doing the same thing. I included this in my code to remove genes that were zero in all cells. Personally, I would try using the data that are not normalized with SCTransform and see if that helps. If that doesn't do it, then it is either a lme4 or MAST issue (or a combination of the two) in which case you would probably be better off asking the MAST team about first (and then it may lead you to needing help from Ben Bolker with lme4). The MAST group has been very helpful to me in the past. I would start there if my first suggestion does not work. Keep in mind that you are very likely to always get errors when fitting mixed models with these highly zero-inflated data. Finak discusses this very clearly here: RGLab/MAST#148 I'll be honest, I am less familiar with Seurat. I have only used it to do cell clustering and make tSNEs. I don't know how it incorporates MAST into its software - but it appears the wrapper for it does not allow for random effects. I noticed @esrasefik was having some issues with this here: satijalab/seurat#3712 . it appears you have already had some interaction with her. Do you know if she was able to solve her problems? Finally, I will note the following (from our manuscript): "While we recommend computing differential expression analysis using MAST with RE, alternative methods include using MAST with fixed-effects for individual, a tweedie GLMM, or permutation testing. Accounting for the within-sample correlation with a fixed-effect term for individual, will have a slight difference in interpretation, but should be considered an alternative option to random effects — particularly when the number of independent experimental units is modest. To not violate the exchangeability assumption, permutation methods must randomize at the independent experimental unit (e.g., individual) and properly account for covariates (i.e., conditional permutation). The tweedie GLMM method, which was selected for its distributional flexibility that can account for zero inflation, could be implemented using the “glmmTMB” R-package34, but other mixed models could also be applied using a more appropriate distribution" In addition, I will note that if you have reasonable balance across samples (that is the number of cells within your cell type of interest is reasonably balanced across individuals) then a mean-of-means approach (pseudo-bulk - where you take the mean of expression in all cells within an individual) will perform comparably. This is not a bad option if MAST isn't working for you and you can't figure out any other methods. I hope this helps!! |
@kdzimm, thank you so much for your help and advice! I changed a few things according to your suggestions and it seems to be working now. I took the corrected count data from SCTransform and then log2(x + 1) transformed those values. I still got an error (grouping factors must have > 1 sampled level), so I filtered out genes expressed in <10% of cells, following recommendations here and here. I also got convergence warnings so I added
I still got a convergence warning, but I'm guessing that is due to the 2 genes that were filtered out using na.omit. For interpreting the results in fcHurdle, are these definitions of the variables correct? (I was not able to find definitions in the MAST vignette.) Lastly, would positive logFC values indicate higher expression in the orig.ident Normal group? I did not set a reference level for this variable before running Thanks again! |
Looks good! Great job figuring it out! What you did is not easy. Hopefully, as single-cell continues to develop this will get easier for users to run mixed models with single-cell data. Yes, it is likely you will always have a set of genes that do not converge - particularly with sparse single-cell data and with small sample sizes. ci.hi is the high end of the 95% confidence interval for your logFC. ci.low is the low end. I believe the Normal group is your reference level based on your doLRT='origin.identNormal' command, but I always like to double check this on one or two genes by hand with the raw data after the fact - just to be certain I am interpreting direction correctly. Take good care! I am going to close this for now, but please let me know if you have any other questions! |
Hi @kdzimm, I have 2 follow-up questions.
Thanks! |
These are really good questions.
"Although our focus here is on hypothesis testing for finding differentially expressed genes within a cell type across conditions, the concept is applicable when comparing expression patterns between cell types and is broadly appropriate for all single-cell sequencing technologies such as proteomics, metabolomics, and epigenetics." Thanks for thinking through your analysis carefully and asking meaningful questions! I believe your research will be much better for it. |
Thank you so much for your thorough responses and for pointing me to these resources - I really appreciate your time and help! |
Hi @kdzimm! I've been using this pipeline on other clusters, and sometimes I get the following error after running zlm():
It will still run though, and then I'll get a similar error when generating a results summary with LRT:
This will also still run and I still get results, but I'm not sure if I can trust them. I'm just wondering if you've come across this error before? One of the few places I found the error referenced is lme4 issue #573. Do you recommend trying the workaround posted by bbolker? |
Hello @kdzimm et al.,
Thank you for providing this excellent resource for incorporating a random effect for individual when running MAST for single-cell differential expression analyses. After loading all of the necessary packages, I started building the simulation as in 'Power.Rmd' but got an error at line 27. Here is my full script in R version 4.0.3:
It seems that the format of the simulate() argument is not correct, and I am having trouble determining what was intended here.
Thanks for your help!
The text was updated successfully, but these errors were encountered: