DecontX: Heuristic clustering or z parameter? How do the cluster labels specified with the z parameter affect the decontamination? #379
Dear Campbell Lab,

I am using your tool DecontX to decontaminate my droplet scRNA-seq dataset, and I have run into some doubts.

"decontX will perform heuristic clustering to quickly define major cell clusters. However if you have your own cell cluster labels, they can be specified with the z parameter."

First, which strategy would you recommend? With the heuristic clustering, I obtain a much higher contamination fraction than when I specify the clusters with the z parameter. But I am unsure about using the z parameter, because the contamination fraction then depends on the number of clusters that I supply to the function (the contamination fraction decreases as the number of clusters increases). How can I determine the best strategy and, if using the z parameter, the best number of clusters?

Thank you very much in advance, and thank you for this great tool!

Best regards,
Marta
Hi @MartaSanchezCarbonell, thanks for trying out our tool! Your question is a good one. DecontX assumes that each cell is a mixture of counts from its true cell population and from ambient RNA. It uses the cluster labels to estimate what the true cell population should look like. Our quick/heuristic clustering procedure will tend to identify "broader" cell populations. For example, all T-cells may be grouped into a single cluster, whereas if you used a different clustering workflow such as Seurat, the T-cells may get split up into multiple clusters (e.g. CD8, CD4, etc.).

There are also two ways in which we can estimate the distribution of ambient RNA. The first is when the raw/background matrix is directly supplied via the […]

Now getting back to your question. If you supply more granular cluster labels via the […]

Note that using the raw/background matrix may also have a similar effect for some of the more predominant cell populations, as they are the populations contributing the most RNA to the ambient profiles in the empty drops.

The other challenging scenario is when a dataset has a transitional or intermediate cell population that is "in-between" two other cell populations. This cluster can be flagged as having high contamination when it is clustered with other cell populations using DecontX's default clustering method. It may be better to give this group of cells its own cluster label rather than lumping it in with the other cells. But this is situational and may depend on how likely you believe it is that this is a real cluster and not just a group of highly contaminated cells.

I hope that helps explain the algorithm a bit more. It may be hard to give you more guidance without seeing the specifics of your dataset. As this question may come up a lot, I am going to move this thread to a Discussion instead of an Issue.
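To make the mixture assumption above a bit more concrete, here is a toy Python sketch. It is not DecontX's actual implementation: the function names `simulate_cell` and `estimate_contamination`, the four-gene profiles, and the 20% contamination level are all made up for illustration, and both the cell-population profile and the ambient profile are assumed to be known rather than estimated from cluster labels. It only shows how, once those two profiles are fixed, the ambient fraction of a single cell can be recovered from its counts.

```python
import random


def simulate_cell(native, ambient, contamination, n_counts, rng):
    """Draw gene counts for one cell from a two-component mixture.

    Each count comes from the ambient profile with probability
    `contamination`, otherwise from the cell population's own profile.
    """
    mixed = [(1 - contamination) * p + contamination * q
             for p, q in zip(native, ambient)]
    genes = rng.choices(range(len(mixed)), weights=mixed, k=n_counts)
    counts = [0] * len(mixed)
    for g in genes:
        counts[g] += 1
    return counts


def estimate_contamination(counts, native, ambient, n_iter=200):
    """EM estimate of the ambient fraction for one cell.

    Both profiles are held fixed (a simplifying assumption of this toy).
    """
    theta = 0.5  # initial guess for the ambient fraction
    total = sum(counts)
    for _ in range(n_iter):
        # E-step: expected number of counts that came from ambient RNA.
        ambient_counts = 0.0
        for c, p, q in zip(counts, native, ambient):
            denom = (1 - theta) * p + theta * q
            if c and denom > 0:
                ambient_counts += c * theta * q / denom
        # M-step: the new fraction is the expected ambient share.
        theta = ambient_counts / total
    return theta


# Hypothetical profiles over four genes (each sums to 1).
native_profile = [0.70, 0.25, 0.05, 0.00]
ambient_profile = [0.25, 0.25, 0.25, 0.25]

demo_rng = random.Random(0)
demo_counts = simulate_cell(native_profile, ambient_profile, 0.2, 5000, demo_rng)
print("estimated contamination:",
      round(estimate_contamination(demo_counts, native_profile, ambient_profile), 3))
```

In this toy, the estimate depends entirely on which profile is treated as the cell's "true" population, which is the role the cluster labels play in the explanation above.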