Skip to content

The Read Z-score (ReadZS) is a metric that summarizes the transcriptional state of a gene in a single cell.

License

Notifications You must be signed in to change notification settings

salzman-lab/ReadZS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction

salzmanlab/readzs is a bioinformatics best-practice analysis pipeline for The Read Z-score (ReadZS), a metric that summarizes the transcriptional state of a gene in a single cell. About the ReadZS

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Pipeline summary

nf-core/readzs

  1. Filter and quantify read enrichment from BAM files
  2. Calculate the ReadZS (read z-score) for single cells across genomic bins
  3. Downstream analyses
    1. Aggregate the ReadZS along cell type annotations and identify windows with significant cell type-specific regulation
    2. Plot read distributions for cell types in windows of interest
    3. Perform GMM-based subclustering to identify peaks in read distributions
    4. Annotate peaks with distances to annotated features

Quick Start

  1. Install nextflow (>=20.04.0)

  2. Depending on your use case, install conda, docker, or singularity. By using the docker or singularity nextflow profile, the pipeline can be run within the ReadZS docker container (also available on dockerhub, which contains all the required dependencies.

  3. Run the pipeline on test data.

    nextflow run salzmanlab/readzs \
        -r master \
        -latest \
        -profile small_test_data,<docker,singularity,conda>

    Stanford Sherlock users should use the sherlock profile:

     nextflow run salzmanlab/readzs \
         -r master \
         -latest \
         -profile small_test_data,sherlock
    
  4. To run on other datasets, modify a config file with data-specific parameters, using conf/test.config as a template. Note: do not include dashes in the run names or channel names. You may need to modify the executor scope in the config file, in accordance to your compute needs.

Input Arguments

Argument Description Example Usage
runName Descriptive name for ReadZS run, used in the final output files. Note: the run name should not contain any dashes (-). Tumor_5
input Input samplesheet in csv format, format described below Tumor_5_samplesheet.csv
useChannels true if the same samples were split across multiple sequencer lanes, with barcode overlap between different samples true, false
isSICILIAN If the input bam files are output from SICILIAN true, false
isCellranger true if input data is output from Cellranger true, false
ontologyCols Double-encapsulated list string describing the metadata columns that will create the "ontology" variable, e.g. cell type "'tissue, compartment, annotation'"
metadata Path to metadata (e.g. cell type) annotation file, described below metadata_Tumor5.tsv
chr_lengths Two-column, tab-delimited file containing chromosome names in the first column and chromosome lengths in the second column. Chromosome names must match those in BAM files. /home/refs/human.chrs
gff Location of genome GFF file, used for plotting; can be obtained from GENCODE /home/refs/humanv37.gff
annotation_bed BED-formatted file, used to annotate the windows, e.g. "refFlat" table of genes in BED format obtained from UCSC Table Browser /home/refs/hg38_genes.bed

Default Parameters

These default values can be modified to suit the needs of your data.

Argument Description Default Value
binSize Size of genomic bins ("windows"), used to calculate z-scores 5000
minCellsPerWindowOnt Minimum cells per window-ontology required to calculate medians for that window-ontology 20
minCtsPerCell Minimum counts per cell for a window required to include this cell in calculating medians for that window-ontology 10
nPermutations Number of permutations to be used in significance calculation 1000
numPlots Number of top windows to generate read distribution histograms for 20
peak_method Subclustering method for calling peaks, options: knee, max knee
zscores_only Calculate ReadZS values for cells over genomic bins, without annotating cells with cell-types or any downstram analyses. false
skip_plot Run all steps of pipeline, except for plot generation of read distributions. false
skip_subcluster Run all steps of pipeline, except for subclustering/peak calling. false
plot_only If all steps up to median calculation have been previously performed, only perform plot generation of read distributions. false
subcluster_only If all steps up to annotation steps have been previously performed, only perform subclustering/peak calling. false

plot_only Parameters

Example run command:

nextflow run salzmanlab/readzs \
    --plot_only true \
    --input Tumor_5_samplesheet.csv \
    --all_pvals_path *home/results/results/annotated_files/`${runName}`_all_pvals.txt* \
    --resultsDir *home/results* \
    --runName Tumor5 \
    --ontologyCols "'tissue, compartment, annotation'" \
    --gff genome.gff
Argument Description Example Usage
input Input samplesheet in csv format, format described below Tumor_5_samplesheet.csv
all_pvals_path Path to all_pvals file generated by previous ReadZS run, containing a windows column. home/results/results/annotated_files/${runName}_all_pvals.txt
resultsDir Path to results directory of previous run. home/results
runName Descriptive name for ReadZS run, used in the final output files Tumor_5
ontologyCols Double-encapsulated list string describing the metadata columns that will create the "ontology" variable, e.g. cell type "'tissue, compartment, annotation'"
gff Location of genome GFF file, used for plotting; can be obtained from GENCODE /home/refs/humanv37.gff
binSize Size of genomic bins, used to calculate z-scores Defaults to 5000
numPlots Number of top windows to generate read distribution histograms for Defaults to 20

peaks_only Parameters

Example run command:

nextflow run salzmanlab/readzs \
    --peaks_only true \
    --input Tumor_5_samplesheet.csv \
    --counts_path home/results/counts \
    --ann_pvals_path home/results/results/annotated_files/`${runName}`_ann_pvals.txt \
    --runName Tumor5
Argument Description Example Usage
input Input samplesheet in csv format, format described below Tumor_5_samplesheet.csv
counts_path Path to directory containing counts files, in results directory of previous ReadZS run. home/results/counts
ann_pvals_path Path to ann_pvals file generated by previous ReadZS run. home/results/results/annotated_files/${runName}_ann_pvals.txt
runName Descriptive name for ReadZS run, used in the final output files Tumor_5
peak_method Subclustering method for calling peaks, options: knee, max Defaults to knee

File Descriptions

input

The samplesheet specifies the BAM files to be used. It should be comma-delimited (no spaces after the comma), with no header.

If useChannels = true, the file must have 3 columns:

  • channel name (i.e. Tumor5_bladder)
  • file identifier (i.e. Tumor5_bladder_L001)
  • path to bam file (i.e. data/Tumor5_bladder_001.bam)

If useChannels = false, the file must have 2 columns:

  • file identifier (i.e. Tumor5_bladder_L001)
  • path to bam file (i.e. data/Tumor5_bladder_001.bam)

metadata

The metadata gives additional information about each cell, to be used for calculation of median ReadZS and significance calculation. The metadata file should contain the following columns:

  • cell_id
    • Cell identification column, with each entry structured as "${channel}_barcode"
    • If useChannels = false, each entry should be structured as "${file_identifier}_barcode"
    • Example cell_id value:
      • "Tumor5_AACCATGCAGCTCGCA"
  • Metadata columns used to define the ontology (grouping for which the median ReadZS will be calculated)
    • Examples:
      • tissue, compartment, annotation
      • ontology = "lung_immune_macrophage"

Output

Files

  • Significant windows (genomic windows determined to have significant ontology-specific differences in RNA processing):
    • results/${runName}/medians/${runName}_*_medians_*.txt
      • List of all median values for cells in window-ontologies passing filter
    • results/${runName}/signif_medians/${runName}_*_medians_*.txt
      • List of significant median values for all cells in window-ontologies passing filter
    • results/${runName}/annotated_files/${runName}_all_pvals.txt
      • All windows, annotated with annotation_bed
    • results/${runName}/annotated_files/${runName}_ann_pvals.txt
      • All significant windows, annotated with annotation_bed and ranked by effect size
      • Rankings from this file are used to rank plotted windows and windows for calling peaks
    • results/${runName}/annotated_files/annotated_windows.file
      • Bed file of windows of size = binSize, with each window annotated with annotation_bed
  • Peaks (peaks in read distributions across ALL cells, as determined by the GMM):
    • results/${runName}/peaks/${runName}_${peak_method}_peaks.tsv
      • Peaks called for each significant window, with one peak per line
  • Plots (histograms of read distributions, grouped by ontology):
    • results/${runName}/plots/*.png
      • Read distributions of the top 2 and bottom ontologies of each significant window, as ranked by effect size

results/${runName}/signif_medians/${runName}_*_medians_*.txt

Field Description
window Stranded window from genome, as created by windowsFile
ontology Metadata grouping, as created by ontologyCols
sum_counts_per_window_per_ont Total read counts per window-ontology
med_counts_per_window_per_ont Median read counts per window-ontology
median_z_scaled Median ReadZS value of the window-ontology
chi2_p_val Chi2 p-value for median_z_scaled
perm_p_val Permutation p-value for median_z_scaled
significant true if perm_p_val is statistically signficant
medians_range Range of median ReadZS for all ontologies in a window

results/${runName}/peaks/${runName}_${peak_method}_peaks.tsv

Field Description
window Significant window, ranked by effect size
num_peaks Number of peaks per window
peak_pos Peak position
comp_prob Component probability
ICL_vec The vector of ICL criterion values for each number of components

Credits

salzmanlab/readzs was originally written by the Salzman Lab.

We thank the following people for their extensive assistance in the development of this pipeline: Julia Oliveri, Robert Bierman, and Sarthak Satpathy.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite ReadZS and Nextflow as follows:

ReadZS detects developmentally regulated RNA processing programs in single cell RNA-seq and defines subpopulations independent of gene expression Elisabeth Meyer, Roozbeh Dehghannasiri, Kaitlin Chaung, Julia Salzman. bioRxiv 2021 Oct 01. doi: 10.1101/2021.09.29.462469.

The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

About

The Read Z-score (ReadZS) is a metric that summarizes the transcriptional state of a gene in a single cell.

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •