Frequently asked questions
What types of data can be used as reference?
clustifyr uses gene (row)-by-cell type (column) expression matrices. This means bulk RNA-seq and microarray data can be used directly. For scRNA-seq data, average_clusters() converts matrix data and metadata into such a reference. For Seurat and SCE objects, we provide the wrapper function object_ref(). Summarized/pseudobulk transcriptomes (for instance Gene Expression by Cluster, median from the Allen Brain Institute) can also be used, but set pseudobulk_method = "median" in clustify() to match the reference data if the summarization/pseudobulk was not done by averaging.
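To make the gene-by-cell-type reference format concrete, here is a minimal base R sketch of per-cluster summarization. It is not the internals of average_clusters(); the helper summarize_clusters() and the toy data are hypothetical, and the real function may treat log-normalized values differently. It only illustrates why a median-summarized reference differs from an averaged one.

```r
# Toy gene-by-cell matrix: 6 genes x 8 cells (simulated counts).
set.seed(1)
expr <- matrix(
  rpois(6 * 8, lambda = 5),
  nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("cell", 1:8))
)
# Cluster label for each cell (column), in column order.
cluster_of_cell <- rep(c("B", "T"), each = 4)

# Hypothetical helper: collapse cells to one column per cluster.
summarize_clusters <- function(mat, clusters, fun = mean) {
  cells_by_cluster <- split(seq_len(ncol(mat)), clusters)
  sapply(cells_by_cluster, function(idx) {
    apply(mat[, idx, drop = FALSE], 1, fun)
  })
}

ref_mean <- summarize_clusters(expr, cluster_of_cell, mean)
ref_median <- summarize_clusters(expr, cluster_of_cell, median)

dim(ref_mean)  # genes in rows, cell types in columns
```

Both results have the same shape, but the values generally differ, which is why the summarization used for the query should match the one used for the reference.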
Can I directly make references from online scRNA-seq datasets?
Yes, with the caveat that metadata containing cell type assignments must be available, which is frustratingly uncommon (see our quantification/monitoring of the issue here). We now have a Shiny app, run_clustifyr_app(), that can directly preview and use GEO files, and get_ucsc_reference() to build a reference from a https://cells.ucsc.edu/ link.
Should the input/reference data be normalized?
The default metric for clustifyr is rank correlation, so it tolerates mixed raw/normalized expression fairly well. Still, we recommend matching the input and ref matrices to the same normalization method if possible. The object wrappers take log-normalized data for downstream steps. Note that the data slot from SCTransform obfuscates the original gene expression ranking and is not ideal for clustifyr; in this case we recommend going directly from raw counts.
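The tolerance of rank correlation to mixed raw/normalized input can be checked with a small base R sketch. It assumes the rank-based metric is Spearman correlation as computed by stats::cor(), and all data below are simulated. A strictly increasing transform such as log1p() leaves the gene ranking, and therefore the Spearman coefficient, unchanged, while a transform that reorders genes would not.

```r
set.seed(42)
# Simulated raw counts for 200 genes in one query cluster.
raw_counts <- rpois(200, lambda = 10)
# Simulated reference profile loosely related to the query.
ref_profile <- raw_counts + rnorm(200, sd = 3)

# Log-normalization is strictly increasing, so gene ranks are preserved.
log_counts <- log1p(raw_counts)

rho_raw <- cor(raw_counts, ref_profile, method = "spearman")
rho_log <- cor(log_counts, ref_profile, method = "spearman")
all.equal(rho_raw, rho_log)

# Pearson correlation, by contrast, is sensitive to the transform.
r_raw <- cor(raw_counts, ref_profile, method = "pearson")
r_log <- cor(log_counts, ref_profile, method = "pearson")
c(pearson_raw = r_raw, pearson_log = r_log)
```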
How should I determine parameters?
Please see our published manuscript for discussion of parameters and usage. In general, the default settings have been satisfactory in our own internal usage/testing. However, you may want to inspect the correlation matrix and the calls rather than just the final result (use obj_out = FALSE in clustify()).
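As an illustration of what inspecting a correlation matrix can reveal, the base R sketch below works on a hand-made cluster-by-cell-type matrix. The values are invented and this is not clustifyr's calling logic; it only shows one way to look at the best match per cluster together with its margin over the runner-up.

```r
# Invented correlation matrix: query clusters in rows, reference types in columns.
cor_mat <- matrix(
  c(0.91, 0.42, 0.38,
    0.40, 0.88, 0.85),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("cluster_0", "cluster_1"), c("B", "NK", "T"))
)

# Best-matching reference type and its correlation for each cluster.
best_type <- colnames(cor_mat)[apply(cor_mat, 1, which.max)]
best_score <- apply(cor_mat, 1, max)

# Gap between the best and second-best correlation.
margin <- apply(cor_mat, 1, function(r) {
  s <- sort(r, decreasing = TRUE)
  unname(s[1] - s[2])
})

calls <- data.frame(
  cluster = rownames(cor_mat),
  type = best_type,
  r = best_score,
  margin = margin,
  stringsAsFactors = FALSE
)
calls
```

A small margin, as for cluster_1 here, suggests the call is less clear-cut than the final label alone would indicate.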
How many variable genes should I provide?
While this of course depends greatly on the datasets in question, we generally have good results with ~500-1000 variable genes. This is why we recommend running M3Drop for this step. Note that Seurat V3 onwards stores 2000 variable genes by default, which may be too many (if the resulting correlation matrix shows high and similar values for too many cell types). Currently, by default, clustify() on Seurat objects uses the top 1000 genes.
I have "CLASH" in many of my final calls, why is that?
"CLASH" indicates ties in the correlation values. In practice, this should be very rare unless the number of query genes is very (dangerously) low (use verbose = TRUE in clustify() for more information). The query genes are the intersection of the provided gene list (or the genes autodetected from Seurat objects) and the genes in the reference.
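The gene intersection described above can be checked by hand before running anything. A minimal base R sketch, with made-up gene vectors standing in for the provided gene list and the reference row names:

```r
# Hypothetical variable genes supplied for the query.
provided_genes <- c("CD79A", "MS4A1", "GNLY", "NKG7", "LYZ", "PPBP")
# Hypothetical row names of the reference matrix.
reference_genes <- c("CD79A", "MS4A1", "GNLY", "CD3E", "CD14")

# Genes usable for correlation: present in both sets.
query_genes <- intersect(provided_genes, reference_genes)
length(query_genes)

# Provided genes that are dropped because the reference lacks them.
setdiff(provided_genes, reference_genes)
```

If the intersection is much smaller than the provided list, mismatched gene identifiers (for example symbols versus Ensembl IDs, or species differences) are worth ruling out.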
I need help troubleshooting unknown errors in my reference building/clustifying.
While we work to provide better error messages, note that the most error-prone step in general is designating the metadata column that contains the clustering information. This is generally the cluster_col argument.
What if I only have marker gene lists instead of full transcriptome references?
Please see clustify_lists(), which implements several simple methods. In particular, if both positive and negative markers are available, set the argument metric = "posneg".
Do I need to have an equal number of marker genes per cell type?
Better support will be our next focus. Currently, metric = "posneg" works with uneven numbers of markers. An alternative workflow can be used with the argument input_markers = TRUE:
# pbmc_markers as FindAllMarkers output gene list
pbmc_input <- split(pbmc_markers$gene, pbmc_markers$cluster)
# reference gene list that is uneven length
pbmc_ref_mm <- pos_neg_marker(
  list(
    B = c("CD79A", "CD79B", "MS4A1"),
    NK = c("GZMB", "GNLY")
  )
)
# reverse input and reference
res <- clustify_lists(
  pbmc_ref_mm,
  pbmc_input,
  metric = "jaccard",
  input_markers = TRUE
)
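For intuition about why a Jaccard-style score copes with marker lists of uneven length, here is a base R sketch of the Jaccard index (intersection size divided by union size). The helper jaccard_index() and the query lists are hypothetical, and clustify_lists() may compute or scale its score differently.

```r
# Jaccard index between two gene sets: |A and B| / |A or B|.
jaccard_index <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Reference markers of uneven length (same genes as the example above).
ref_markers <- list(
  B = c("CD79A", "CD79B", "MS4A1"),
  NK = c("GZMB", "GNLY")
)
# Hypothetical per-cluster marker lists from the query data.
query_markers <- list(
  cluster_0 = c("CD79A", "MS4A1", "CD74", "HLA-DRA"),
  cluster_1 = c("GNLY", "GZMB", "NKG7")
)

# Rows are query clusters, columns are reference cell types.
scores <- sapply(ref_markers, function(ref) {
  sapply(query_markers, function(qry) jaccard_index(qry, ref))
})
scores
```

Because the score is normalized by the union, a cell type with only two markers is not automatically penalized relative to one with three.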
Can I extract the overlapping genes from clustify_lists()?
This was a requested and added feature. details_out = TRUE will output 2 matrices: the normal score matrix and a matrix containing all genes that overlap in the corresponding ref vs query pairs.
res <- clustify_lists(
  pbmc_matrix_small,
  metadata = pbmc_meta,
  cluster_col = "classified",
  marker = cbmc_m,
  details_out = TRUE
)
# list the elements returned when details_out = TRUE
names(res)
Why is the default setting per_cell = FALSE?
While classification at the per-cell level is available, it is slow and not very accurate. The default settings are also not optimized for per-cell classification. clustifyr is mainly focused on leveraging results from clustering techniques. As other aspects of scRNA-seq analysis are often focused on clusters, we have set our focus on this resolution as well. This does mean that improper clustering of either the query or ref datasets will lead to issues, as will cases of continuous cellular transitions where discrete clusters are not present. In benchmarking, clusters of even 15 cells still perform well, and in our internal usage we intentionally overcluster the data and check whether clustify() results are stable (see also overcluster_test()).
Can I use multiple references in the same clustify run?
Yes, simply adding columns to a reference matrix expands it. We also provide build_atlas(), which can be run along the lines of build_atlas(matrix_objs = list(reference1, reference2, reference3, ...), genes_fn = clustifyr::human_genes_10x).
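The "adding columns" route can be sketched in base R with two toy reference matrices. Restricting to the genes shared by both references is an assumption made here so that cbind() lines up rows correctly; build_atlas() may handle gene sets differently, and the matrices below are invented.

```r
# Two invented gene-by-cell-type references with partly different genes.
ref_a <- matrix(
  c(5, 0, 2, 1,
    0, 4, 3, 1),
  nrow = 4,
  dimnames = list(c("CD79A", "GNLY", "LYZ", "ACTB"), c("B", "NK"))
)
ref_b <- matrix(
  c(0, 6, 2,
    1, 1, 7),
  nrow = 3,
  dimnames = list(c("GNLY", "LYZ", "CD3E"), c("Mono", "T"))
)

# Keep only genes present in both references, in a common order.
shared_genes <- intersect(rownames(ref_a), rownames(ref_b))

# Expand the reference by binding the cell-type columns side by side.
combined_ref <- cbind(
  ref_a[shared_genes, , drop = FALSE],
  ref_b[shared_genes, , drop = FALSE]
)
combined_ref
```

When combining references from different sources, it is worth confirming that they share a normalization and that cell-type column names do not collide.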
Does clustifyr work for spatial scRNA-seq data?
It works decently on the Seurat tutorial data. See the short example for both clustify() (correlation) and clustify_lists() (gene list enrichment) approaches. (Note: as mentioned above, we recommend avoiding SCTransform data and using raw data directly instead. This can now be handled directly by the Seurat wrapper.)
Can I pull out additional information on which gene signatures don't match the reference clusters?
Please add the arguments organism = "hsapiens", plot_name = "rank_diffs" to clustify(). This saves a "rank_diffs.pdf" comparing gene expression of the queried clusters against the assigned reference cell gene signature. Genes expressed (ranked) higher in the query data are highlighted in red, and genes expressed (ranked) lower than in the reference are highlighted in blue. The top 10 GO-BP terms are also included. See the function assess_rank_bias() for step-by-step generation of the plot outside of the clustify() wrapper.
How do I cite clustifyr?
citation("clustifyr")