Frequently asked questions

What types of data can be used as reference? clustifyr uses gene (row) by cell type (column) expression matrices. This means bulk RNA-seq and microarray data can be used directly. For scRNA-seq data, average_clusters() converts matrix data and metadata into this form. For Seurat and SCE objects, we provide the wrapper function object_ref(). Summarized/pseudobulk transcriptomes (for instance, Gene Expression by Cluster, median, from the Allen Brain Institute) can also be used. If the reference was summarized by something other than averaging, set pseudobulk_method in clustify() to match it (for example, pseudobulk_method = "median").
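To make the gene-by-celltype layout concrete, here is a minimal base R sketch that collapses a toy gene-by-cell matrix into a reference, once by mean and once by median. It is not the implementation of average_clusters(); the helper summarize_clusters() and all values are hypothetical and only illustrate the shape of the result.

```r
# Toy gene-by-cell expression matrix (hypothetical values): 3 genes x 6 cells.
expr <- matrix(
  c(1, 3, 11, 10, 12, 20,
    5, 7,  6,  0,  2,  1,
    2, 2,  8,  4,  8,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6))
)
# Cluster assignment for each cell, in column order.
clusters <- c("B", "B", "B", "NK", "NK", "NK")

# Hypothetical helper: summarize the cells of each cluster with `fun`.
summarize_clusters <- function(mat, groups, fun = mean) {
  sapply(split(seq_len(ncol(mat)), groups), function(idx) {
    apply(mat[, idx, drop = FALSE], 1, fun)
  })
}

ref_mean <- summarize_clusters(expr, clusters, mean)
ref_median <- summarize_clusters(expr, clusters, median)
```

Both results have genes as rows and cell types as columns. The mean and median references differ (geneA in cluster B, for example), which is why the summarization used for the query should match the one used for the reference.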
Can I directly make references from online scRNA-seq datasets? Yes, with the caveat that metadata containing cell type assignments must be available, which is frustratingly uncommon (see our quantification/monitoring of the issue here). We now have a Shiny app, run_clustifyr_app(), that can directly preview and use GEO files, and get_ucsc_reference() to build a reference from a https://cells.ucsc.edu/ link.
Should the input/reference data be normalized? The default metric for clustifyr is rank correlation, so it tolerates mixed raw/normalized expression fairly well. Still, we recommend matching the input and ref matrices to the same normalization method if possible. The object wrappers take log-normalized data for downstream steps. Note that the data slot from SCTransform obfuscates the original gene expression ranking and is not ideal for clustifyr. In this case we recommend going directly from raw counts.
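The tolerance to mixed normalization follows from how rank correlation works: any strictly increasing transform of one profile, such as log1p(), leaves its gene ranks unchanged. The base R sketch below shows this on simulated counts; the data are hypothetical and no clustifyr functions are used.

```r
set.seed(1)
# Hypothetical raw counts for 50 genes in a query cluster and a reference cell type.
query_raw <- rpois(50, lambda = 20)
ref_raw <- query_raw + rpois(50, lambda = 5)

# Spearman correlation on raw counts.
rho_raw <- cor(query_raw, ref_raw, method = "spearman")
# Log-transforming only the query keeps its gene ranks, and hence rho, unchanged.
rho_mixed <- cor(log1p(query_raw), ref_raw, method = "spearman")

# Pearson correlation, by contrast, generally changes under the same transform.
r_raw <- cor(query_raw, ref_raw)
r_mixed <- cor(log1p(query_raw), ref_raw)
```

In real data the invariance is only approximate, because per-cell normalization followed by summarizing across cells is not a single monotonic transform of the cluster profile. That is one reason matching normalization is still recommended.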
How should I determine parameters? Please see our published manuscript, which discusses parameters and usage. In our own usage and testing, the default settings have been satisfactory. However, you may want to inspect the correlation matrix and the calls made from it rather than only the final result (use obj_out = FALSE in clustify()).
How many variable genes should I provide? While this of course depends greatly on the datasets in question, we generally get good results with ~500-1000 variable genes, which is why we recommend running M3Drop for this step. Note that Seurat V3 onwards stores 2000 variable genes by default, which may be too many (a sign of this is a result correlation matrix with high and similar values across many cell types). Currently, clustify() on Seurat objects uses the top 1000 genes by default.
I have "CLASH" in many of my final calls, why is that? "CLASH" indicates ties in the correlation values. In practice, this should be very rare unless the number of query genes is very (dangerously) low (use verbose = TRUE in clustify() for more information). The query genes are the intersection of the provided gene list (or the list autodetected from Seurat objects) and the genes in the reference.
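To see why very few query genes produce ties, consider that with three genes and no tied values Spearman's rho can only take four distinct values (-1, -0.5, 0.5, 1). The base R sketch below uses hypothetical profiles for two reference cell types that order the three shared genes identically, so both receive the same score. It illustrates the cause of a tie, not how clustifyr labels it.

```r
# Query cluster profile restricted to only 3 shared genes (hypothetical values).
query <- c(g1 = 9, g2 = 4, g3 = 1)
# Two different reference cell types that happen to rank these genes identically.
ref <- cbind(
  typeA = c(g1 = 50, g2 = 20, g3 = 3),
  typeB = c(g1 = 7, g2 = 6, g3 = 5)
)

# Spearman correlation of the query against each reference column.
scores <- apply(ref, 2, function(r) cor(query, r, method = "spearman"))
# TRUE when more than one reference shares the top score.
tied <- sum(scores == max(scores)) > 1
```

With hundreds of shared genes, two cell types producing exactly the same rank correlation becomes very unlikely, which is why a large number of ties is worth investigating.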
I need help troubleshooting unknown errors in my reference building/clustifying. While we are working on better error messages, the most error-prone step is generally designating the metadata column that contains the clustering information. This is usually set through the cluster_col argument.
What if I only have marker gene lists instead of full transcriptome references? Please see clustify_lists(), which implements several simple methods. In particular, if both positive and negative markers are available, set argument metric = "posneg".
Do I need to have an equal number of marker genes per cell type? Better support will be our next focus. Currently, metric = "posneg" works with uneven numbers of markers. An alternative workflow can be used with the argument input_markers = TRUE:

# pbmc_markers as FindAllMarkers output gene list
pbmc_input <- split(pbmc_markers$gene, pbmc_markers$cluster)
# reference gene list that is uneven length
pbmc_ref_mm <- pos_neg_marker(
  list(B = c("CD79A", "CD79B", "MS4A1"), 
       NK = c("GZMB", "GNLY"))
)

# reverse input and reference
res <- clustify_lists(
  pbmc_ref_mm,
  pbmc_input,
  metric = "jaccard",
  input_markers = TRUE
)
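The call above uses metric = "jaccard". As a reminder of what that index measures, here is a minimal base R sketch that scores hypothetical marker lists of uneven length against each other. The jaccard() helper and the gene lists are illustrative assumptions; this is not how clustify_lists() computes or post-processes its scores.

```r
# Jaccard index between two gene sets: |intersection| / |union|.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical marker lists of uneven length.
query_markers <- list(
  cluster0 = c("CD79A", "MS4A1", "CD74", "HLA-DRA"),
  cluster1 = c("GNLY", "NKG7")
)
ref_markers <- list(
  B = c("CD79A", "CD79B", "MS4A1"),
  NK = c("GZMB", "GNLY")
)

# Score every query cluster (rows) against every reference cell type (columns).
scores <- sapply(ref_markers, function(r) {
  sapply(query_markers, function(q) jaccard(q, r))
})
```

Because the index divides by the size of the union, it is defined for lists of any length, although very short lists can only produce a few distinct score values.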

Can I extract the overlapping genes from clustify_lists()? This was a requested feature that has since been added. Setting details_out = TRUE outputs 2 matrices: the normal score matrix, and a matrix containing all genes that overlap in each ref vs query pair.

res <- clustify_lists(
  pbmc_matrix_small,
  metadata = pbmc_meta,
  cluster_col = "classified",
  marker = cbmc_m,
  details_out = TRUE
)

names(res)

Why is the default setting per_cell = FALSE? While classification at the per-cell level is available, it is slow and not very accurate. Default settings are also not optimized for per-cell classification. clustifyr is mainly focused on leveraging results from clustering techniques. As other aspects of scRNA-seq analysis are often focused on clusters, we have set our focus on this resolution as well. This does mean that improper clustering of either the query or ref datasets will lead to issues, as will cases of continuous cellular transitions where discrete clusters are not present. In benchmarking, clusters with as few as 15 cells still perform well, and in our internal usage we intentionally overcluster the data and check whether clustify() results are stable (see also overcluster_test()).
Can I use multiple references in the same clustify run? Yes, simply adding columns to a reference matrix works to expand it. We also provide build_atlas(), which can be run along the lines of build_atlas(matrix_objs = list(reference1, reference2, reference3, ...), genes_fn = clustifyr::human_genes_10x).
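As a sketch of the "adding columns" approach, the base R example below combines two hypothetical gene-by-celltype references. Restricting both to their shared genes before binding is an assumption of this sketch, and build_atlas() may select genes differently through genes_fn.

```r
# Two hypothetical gene-by-celltype references with partly different genes.
ref1 <- matrix(
  c(5, 1, 0, 3, 2, 8), nrow = 3,
  dimnames = list(c("CD79A", "GNLY", "CD3E"), c("B", "NK"))
)
ref2 <- matrix(
  c(0, 9, 4), nrow = 3,
  dimnames = list(c("GNLY", "CD3E", "LYZ"), "T")
)

# Keep only genes present in both references, in the same order, then bind columns.
shared <- intersect(rownames(ref1), rownames(ref2))
combined <- cbind(ref1[shared, , drop = FALSE], ref2[shared, , drop = FALSE])
```

Indexing both matrices by the same gene vector guarantees that rows line up before cbind(). As noted for normalization above, the combined references should ideally be on comparable scales.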
Does clustifyr work for spatial scRNA-seq data? It works decently on the Seurat tutorial data. See the short example for both the clustify() (correlation) and clustify_lists() (gene list enrichment) approaches. (Note that, as mentioned above, we recommend avoiding SCTransform data and using raw data directly instead. The Seurat wrapper can now handle this directly.)
Can I pull out additional information on what gene signatures don't match the reference clusters? Please add the arguments organism = "hsapiens", plot_name = "rank_diffs" to clustify(). This saves a "rank_diffs.pdf" comparing gene expression of the queried clusters against the gene signature of the assigned reference cell type. Genes expressed (ranked) higher in the query data than in the reference are highlighted in red, and genes expressed (ranked) lower are highlighted in blue. The top 10 GO-BP terms are also included. See the function assess_rank_bias() for step-by-step generation of the plot outside of the clustify() wrapper.
How do I cite clustifyr?

citation("clustifyr")
