Introduction

salzmanlab/readzs is a bioinformatics best-practice analysis pipeline for The Read Z-score (ReadZS), a metric that summarizes the transcriptional state of a gene in a single cell. About the ReadZS

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Pipeline summary

Filter and quantify read enrichment from BAM files
Calculate the ReadZS (read z-score) for single cells across genomic bins
Downstream analyses
1. Aggregate the ReadZS along cell type annotations and identify windows with significant cell type-specific regulation
2. Plot read distributions for cell types in windows of interest
3. Perform GMM-based subclustering to identify peaks in read distributions
4. Annotate peaks with distances to annotated features

Quick Start

Install nextflow (>=20.04.0)
Depending on your use case, install conda, docker, or singularity. By using the docker or singularity nextflow profile, the pipeline can be run within the ReadZS docker container (also available on dockerhub, which contains all the required dependencies.

Run the pipeline on test data.

nextflow run salzmanlab/readzs \
    -r master \
    -latest \
    -profile small_test_data,<docker,singularity,conda>

Stanford Sherlock users should use the sherlock profile:

 nextflow run salzmanlab/readzs \
     -r master \
     -latest \
     -profile small_test_data,sherlock

To run on other datasets, modify a config file with data-specific parameters, using conf/test.config as a template. Note: do not include dashes in the run names or channel names. You may need to modify the executor scope in the config file, in accordance to your compute needs.

Input Arguments

Argument	Description	Example Usage
`runName`	Descriptive name for ReadZS run, used in the final output files. Note: the run name should not contain any dashes (-).	Tumor_5
`input`	Input samplesheet in csv format, format described below	`Tumor_5_samplesheet.csv`
`useChannels`	`true` if the same samples were split across multiple sequencer lanes, with barcode overlap between different samples	`true`, `false`
`isSICILIAN`	If the input bam files are output from SICILIAN	`true`, `false`
`isCellranger`	`true` if input data is output from Cellranger	`true`, `false`
`ontologyCols`	Double-encapsulated list string describing the `metadata` columns that will create the "ontology" variable, e.g. cell type	"'tissue, compartment, annotation'"
`metadata`	Path to metadata (e.g. cell type) annotation file, described below	`metadata_Tumor5.tsv`
`chr_lengths`	Two-column, tab-delimited file containing chromosome names in the first column and chromosome lengths in the second column. Chromosome names must match those in BAM files.	/home/refs/human.chrs
`gff`	Location of genome GFF file, used for plotting; can be obtained from GENCODE	/home/refs/humanv37.gff
`annotation_bed`	BED-formatted file, used to annotate the windows, e.g. "refFlat" table of genes in BED format obtained from UCSC Table Browser	/home/refs/hg38_genes.bed

Default Parameters

These default values can be modified to suit the needs of your data.

Argument	Description	Default Value
`binSize`	Size of genomic bins ("windows"), used to calculate z-scores	5000
`minCellsPerWindowOnt`	Minimum cells per window-ontology required to calculate medians for that window-ontology	20
`minCtsPerCell`	Minimum counts per cell for a window required to include this cell in calculating medians for that window-ontology	10
`nPermutations`	Number of permutations to be used in significance calculation	1000
`numPlots`	Number of top windows to generate read distribution histograms for	20
`peak_method`	Subclustering method for calling peaks, options: `knee`, `max`	`knee`
`zscores_only`	Calculate ReadZS values for cells over genomic bins, without annotating cells with cell-types or any downstram analyses.	`false`
`skip_plot`	Run all steps of pipeline, except for plot generation of read distributions.	`false`
`skip_subcluster`	Run all steps of pipeline, except for subclustering/peak calling.	`false`
`plot_only`	If all steps up to median calculation have been previously performed, only perform plot generation of read distributions.	`false`
`subcluster_only`	If all steps up to annotation steps have been previously performed, only perform subclustering/peak calling.	`false`

`plot_only` Parameters

Example run command:

nextflow run salzmanlab/readzs \
    --plot_only true \
    --input Tumor_5_samplesheet.csv \
    --all_pvals_path *home/results/results/annotated_files/`${runName}`_all_pvals.txt* \
    --resultsDir *home/results* \
    --runName Tumor5 \
    --ontologyCols "'tissue, compartment, annotation'" \
    --gff genome.gff

Argument	Description	Example Usage
`input`	Input samplesheet in csv format, format described below	`Tumor_5_samplesheet.csv`
`all_pvals_path`	Path to all_pvals file generated by previous ReadZS run, containing a `windows` column.	home/results/results/annotated_files/`${runName}`_all_pvals.txt
`resultsDir`	Path to results directory of previous run.	home/results
`runName`	Descriptive name for ReadZS run, used in the final output files	Tumor_5
`ontologyCols`	Double-encapsulated list string describing the `metadata` columns that will create the "ontology" variable, e.g. cell type	"'tissue, compartment, annotation'"
`gff`	Location of genome GFF file, used for plotting; can be obtained from GENCODE	/home/refs/humanv37.gff
`binSize`	Size of genomic bins, used to calculate z-scores	Defaults to 5000
`numPlots`	Number of top windows to generate read distribution histograms for	Defaults to 20

`peaks_only` Parameters

Example run command:

nextflow run salzmanlab/readzs \
    --peaks_only true \
    --input Tumor_5_samplesheet.csv \
    --counts_path home/results/counts \
    --ann_pvals_path home/results/results/annotated_files/`${runName}`_ann_pvals.txt \
    --runName Tumor5

Argument	Description	Example Usage
`input`	Input samplesheet in csv format, format described below	`Tumor_5_samplesheet.csv`
`counts_path`	Path to directory containing counts files, in results directory of previous ReadZS run.	home/results/counts
`ann_pvals_path`	Path to ann_pvals file generated by previous ReadZS run.	home/results/results/annotated_files/`${runName}`_ann_pvals.txt
`runName`	Descriptive name for ReadZS run, used in the final output files	Tumor_5
`peak_method`	Subclustering method for calling peaks, options: `knee`, `max`	Defaults to `knee`

File Descriptions

`input`

The samplesheet specifies the BAM files to be used. It should be comma-delimited (no spaces after the comma), with no header.

If useChannels = true, the file must have 3 columns:

channel name (i.e. Tumor5_bladder)
file identifier (i.e. Tumor5_bladder_L001)
path to bam file (i.e. data/Tumor5_bladder_001.bam)

If useChannels = false, the file must have 2 columns:

file identifier (i.e. Tumor5_bladder_L001)
path to bam file (i.e. data/Tumor5_bladder_001.bam)

`metadata`

The metadata gives additional information about each cell, to be used for calculation of median ReadZS and significance calculation. The metadata file should contain the following columns:

cell_id
- Cell identification column, with each entry structured as "${channel}_barcode"
- If useChannels = false, each entry should be structured as "${file_identifier}_barcode"
- Example cell_id value:
  - "Tumor5_AACCATGCAGCTCGCA"
Metadata columns used to define the ontology (grouping for which the median ReadZS will be calculated)
- Examples:
  - tissue, compartment, annotation
  - ontology = "lung_immune_macrophage"

Output

Files

Significant windows (genomic windows determined to have significant ontology-specific differences in RNA processing):
- results/${runName}/medians/${runName}_*_medians_*.txt
  - List of all median values for cells in window-ontologies passing filter
- results/${runName}/signif_medians/${runName}_*_medians_*.txt
  - List of significant median values for all cells in window-ontologies passing filter
- results/${runName}/annotated_files/${runName}_all_pvals.txt
  - All windows, annotated with annotation_bed
- results/${runName}/annotated_files/${runName}_ann_pvals.txt
  - All significant windows, annotated with annotation_bed and ranked by effect size
  - Rankings from this file are used to rank plotted windows and windows for calling peaks
- results/${runName}/annotated_files/annotated_windows.file
  - Bed file of windows of size = binSize, with each window annotated with annotation_bed
Peaks (peaks in read distributions across ALL cells, as determined by the GMM):
- results/${runName}/peaks/${runName}_${peak_method}_peaks.tsv
  - Peaks called for each significant window, with one peak per line
Plots (histograms of read distributions, grouped by ontology):
- results/${runName}/plots/*.png
  - Read distributions of the top 2 and bottom ontologies of each significant window, as ranked by effect size

`results/${runName}/signif_medians/${runName}__medians_.txt`

Field	Description
`window`	Stranded window from genome, as created by `windowsFile`
`ontology`	Metadata grouping, as created by `ontologyCols`
`sum_counts_per_window_per_ont`	Total read counts per window-ontology
`med_counts_per_window_per_ont`	Median read counts per window-ontology
`median_z_scaled`	Median ReadZS value of the window-ontology
`chi2_p_val`	Chi² p-value for `median_z_scaled`
`perm_p_val`	Permutation p-value for `median_z_scaled`
`significant`	`true` if `perm_p_val` is statistically signficant
`medians_range`	Range of median ReadZS for all ontologies in a window

`results/${runName}/peaks/${runName}_${peak_method}_peaks.tsv`

Field	Description
`window`	Significant window, ranked by effect size
`num_peaks`	Number of peaks per window
`peak_pos`	Peak position
`comp_prob`	Component probability
`ICL_vec`	The vector of ICL criterion values for each number of components

Credits

salzmanlab/readzs was originally written by the Salzman Lab.

We thank the following people for their extensive assistance in the development of this pipeline: Julia Oliveri, Robert Bierman, and Sarthak Satpathy.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite ReadZS and Nextflow as follows:

ReadZS detects developmentally regulated RNA processing programs in single cell RNA-seq and defines subpopulations independent of gene expression Elisabeth Meyer, Roozbeh Dehghannasiri, Kaitlin Chaung, Julia Salzman. bioRxiv 2021 Oct 01. doi: 10.1101/2021.09.29.462469.

The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Name		Name	Last commit message	Last commit date
Latest commit History 157 Commits
.github		.github
assets		assets
bin		bin
conf		conf
docs		docs
lib		lib
modules		modules
small_test_data		small_test_data
subworkflows/local		subworkflows/local
workflows		workflows
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.markdownlint.yml		.markdownlint.yml
CHANGELOG.md		CHANGELOG.md
CITATIONS.md		CITATIONS.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
main.nf		main.nf
modules.json		modules.json
nextflow.config		nextflow.config
nextflow_schema.json		nextflow_schema.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Pipeline summary

Quick Start

Input Arguments

Default Parameters

`plot_only` Parameters

`peaks_only` Parameters

File Descriptions

`input`

`metadata`

Output

Files

`results/${runName}/signif_medians/${runName}__medians_.txt`

`results/${runName}/peaks/${runName}_${peak_method}_peaks.tsv`

Credits

Contributions and Support

Citations

About

Releases 1

Packages

Contributors 4

Languages

License

salzman-lab/ReadZS

Folders and files

Latest commit

History

Repository files navigation

Introduction

Pipeline summary

Quick Start

Input Arguments

Default Parameters

plot_only Parameters

peaks_only Parameters

File Descriptions

input

metadata

Output

Files

results/${runName}/signif_medians/${runName}_*_medians_*.txt

results/${runName}/peaks/${runName}_${peak_method}_peaks.tsv

Credits

Contributions and Support

Citations

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 4

Languages

`plot_only` Parameters

`peaks_only` Parameters

`input`

`metadata`

`results/${runName}/signif_medians/${runName}__medians_.txt`

`results/${runName}/peaks/${runName}_${peak_method}_peaks.tsv`

Packages