404
+ +Page not found
+ + +diff --git a/site/404.html b/site/404.html new file mode 100644 index 0000000..09ca6a0 --- /dev/null +++ b/site/404.html @@ -0,0 +1,148 @@ + + +
+ + + + +Page not found
+ + +The Ensemblex pipeline was produced for projects funded by the Canadian Institutes of Health Research and the Michael J. Fox Foundation Parkinson's Progression Markers Initiative (MJFF PPMI) in collaboration with The Neuro's Early Drug Discovery Unit (EDDU), McGill University. It is written by Michael Fiorini and Saeid Amiri with supervision from Rhalena Thomas and Sali Farhan at the Montreal Neurological Institute-Hospital. Copyright belongs to MNI BIOINFO CORE.
+ +This guide illustrates how to use the Ensemblex pipeline to demultiplex pooled scRNAseq samples with prior genotype information. Here, we will leverage a pooled scRNAseq dataset produced by Jerber et al. This pool contains induced pluripotent stem cell (iPSC) lines from 9 healthy controls that were differentiated towards a dopaminergic neuron state. The Ensemblex pipeline is illustrated in the diagram below:
+
+
+
NOTE: To download the necessary files for the tutorial please see the Downloading data section of the Ensemblex documentation.
+[to be completed]
+module load StdEnv/2023
+module load apptainer/1.2.4
+In Step 1, we will set up the working directory for the Ensemblex pipeline and decide which version of the pipeline we want to use.
+First, create a dedicated folder for the analysis (hereafter referred to as the working directory). Then, define the path to the working directory and the path to ensemblex.pip:
+## Create and navigate to the working directory
+cd ~/ensemblex_tutorial
+mkdir -p working_directory
+cd ~/ensemblex_tutorial/working_directory
+
+## Define the path to ensemblex.pip
+ensemblex_HOME=~/ensemblex.pip
+
+## Define the path to the working directory
+ensemblex_PWD=~/ensemblex_tutorial/working_directory
+
+Next, we can set up the working directory and choose the Ensemblex pipeline for demultiplexing with prior genotype information (--step init-GT) using the following code:
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT
+
+After running the above code, the working directory should have the following structure:
+ensemblex_tutorial
+└── working_directory
+ ├── demuxalot
+ ├── demuxlet
+ ├── ensemblex_gt
+ ├── input_files
+ ├── job_info
+ │ ├── configs
+ │ │ └── ensemblex_config.ini
+ │ ├── logs
+ │ └── summary_report.txt
+ ├── souporcell
+ └── vireo_gt
+
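Before moving on, it can help to confirm that the initialization produced the layout shown above. The following is a minimal sketch, not part of Ensemblex: the function name check_gt_layout is hypothetical, and the list of entries is simply copied from the tree above.

```shell
# Sketch: report which expected entries are missing from a working directory.
# check_gt_layout is a hypothetical helper; the entry list mirrors the tree above.
check_gt_layout() {
    wd=$1
    missing=0
    for entry in demuxalot demuxlet ensemblex_gt input_files \
                 job_info/configs/ensemblex_config.ini job_info/logs \
                 job_info/summary_report.txt souporcell vireo_gt; do
        if [ ! -e "$wd/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise
    return "$missing"
}

# Usage (assumes $ensemblex_PWD is defined as above):
# check_gt_layout "$ensemblex_PWD" && echo "layout OK"
```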
+Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools.
+In Step 2, we will define the files needed for Ensemblex's constituent genetic demultiplexing tools and will place them within the working directory.
+Note: For the tutorial we will be using the data downloaded in the Downloading data section of the Ensemblex documentation.
+First, define all of the required files:
+BAM=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam
+
+BAM_INDEX=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam.bai
+
+BARCODES=~/ensemblex_tutorial/CellRanger/outs/filtered_gene_bc_matrices/refdata-cellranger-GRCh37/barcodes.tsv
+
+SAMPLE_VCF=~/ensemblex_tutorial/sample_genotype/sample_genotype_merge.vcf
+
+REFERENCE_VCF=~/ensemblex_tutorial/reference_files/common_SNPs_only.recode.vcf
+
+REFERENCE_FASTA=~/ensemblex_tutorial/reference_files/genome.fa
+
+REFERENCE_FASTA_INDEX=~/ensemblex_tutorial/reference_files/genome.fa.fai
+
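+Next, we will sort the pooled samples and reference .vcf files according to the .bam file and place them within the working directory. Note that the commands below read the .bam file from $ensemblex_PWD/input_files/pooled_bam.bam, so the pooled .bam file must be copied there (see the cp commands further down) before sorting: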
+## Sort pooled samples .vcf file
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/pooled_samples.vcf --step sort --vcf $SAMPLE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam
+
+## Sort reference .vcf file
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/reference.vcf --step sort --vcf $REFERENCE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam
+
+NOTE: To sort the vcf files we use the pipeline produced by the authors of Demuxlet/Freemuxlet (Kang et al.).
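Sorting reorders records but does not rename contigs, so a mismatch in naming (for example `chr1` in the .vcf versus `1` in the reference) is worth catching early. The sketch below is an optional sanity check and not part of Ensemblex; the function name is hypothetical, and it assumes an uncompressed .vcf and a standard tab-separated .fai index.

```shell
# Sketch: list contigs that appear in a VCF body but not in a FASTA index (.fai).
# vcf_contigs_missing_from_fai is a hypothetical helper name.
vcf_contigs_missing_from_fai() {
    vcf=$1
    fai=$2
    awk -F '\t' '
        # While reading the .fai (first file), remember contig names from column 1
        FILENAME == ARGV[1] { known[$1] = 1; next }
        # Skip VCF header lines
        /^#/ { next }
        # Print each unknown VCF contig once
        !($1 in known) && !($1 in reported) { print $1; reported[$1] = 1 }
    ' "$fai" "$vcf"
}

# Usage (variables as defined above); no output means every contig was found:
# vcf_contigs_missing_from_fai "$SAMPLE_VCF" "$REFERENCE_FASTA_INDEX"
```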
+Next, we will place the remaining necessary files within the working directory:
+cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam
+cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai
+cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv
+cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa
+cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai
+
+After running the above code, $ensemblex_PWD/input_files should contain the following files:
+input_files
+├── pooled_bam.bam
+├── pooled_bam.bam.bai
+├── pooled_barcodes.tsv
+├── pooled_samples.vcf
+├── reference.fa
+├── reference.fa.fai
+└── reference.vcf
+
+NOTE: It is important that the file names match those listed above as they are necessary for the Ensemblex pipeline to recognize them.
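Because the pipeline relies on these exact file names, a quick check for missing or empty files can catch a failed copy or sort before Step 3. This is a minimal sketch with a hypothetical function name; the file list is taken from the listing above.

```shell
# Sketch: confirm each expected input file exists and is non-empty.
# check_input_files is a hypothetical helper; names come from the listing above.
check_input_files() {
    dir=$1
    status=0
    for f in pooled_bam.bam pooled_bam.bam.bai pooled_barcodes.tsv \
             pooled_samples.vcf reference.fa reference.fa.fai reference.vcf; do
        # -s is true only for files that exist and have a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    return "$status"
}

# Usage:
# check_input_files "$ensemblex_PWD/input_files" && echo "all input files present"
```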
+In Step 3, we will demultiplex the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools:
+First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for each of the constituent genetic demultiplexing tools:
+## Navigate to the .ini file
+cd $ensemblex_PWD/job_info/configs
+
+## Open the .ini file and adjust parameters directly in the terminal
+nano ensemblex_config.ini
+
+For the tutorial, we set the following parameters for the constituent genetic demultiplexing tools:
+Parameter | Value |
---|---|
PAR_demuxalot_genotype_names | 'HPSI0115i-hecn_6,HPSI0214i-pelm_3,HPSI0314i-sojd_3,HPSI0414i-sebn_3,HPSI0514i-uenn_3,HPSI0714i-pipw_4,HPSI0715i-meue_5,HPSI0914i-vaka_5,HPSI1014i-quls_2' |
PAR_demuxalot_prior_strength | 100 |
PAR_demuxalot_minimum_coverage | 200 |
PAR_demuxalot_minimum_alternative_coverage | 10 |
PAR_demuxalot_n_best_snps_per_donor | 100 |
PAR_demuxalot_genotypes_prior_strength | 1 |
PAR_demuxalot_doublet_prior | 0.25 |
PAR_demuxlet_field | GT |
PAR_vireo_N | 9 |
PAR_vireo_type | GT |
PAR_vireo_processes | 20 |
PAR_vireo_minMAF | 0.1 |
PAR_vireo_minCOUNT | 20 |
PAR_vireo_forcelearnGT | T |
PAR_minimap2 | '-ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no' |
PAR_freebayes | '-iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6' |
PAR_vartrix_umi | TRUE |
PAR_vartrix_mapq | 30 |
PAR_vartrix_threads | 8 |
PAR_souporcell_k | 9 |
PAR_souporcell_t | 8 |
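As an alternative to editing the file interactively in nano, parameters could be set from a script. The sketch below assumes the .ini file stores each parameter as a `KEY=value` line; that layout is an assumption and should be checked against your own ensemblex_config.ini. The function name set_ini_param is hypothetical.

```shell
# Sketch: replace the value of KEY in a file of KEY=value lines.
# Assumes a KEY=value layout (an assumption about ensemblex_config.ini).
# Note: awk -v interprets backslash escapes in the value.
set_ini_param() {
    ini_file=$1
    ini_key=$2
    ini_value=$3
    ini_tmp=$(mktemp) || return 1
    if awk -v k="$ini_key" -v v="$ini_value" '
            BEGIN { FS = "=" }
            $1 == k { print k "=" v; found = 1; next }
            { print }
            END { if (!found) exit 3 }
        ' "$ini_file" > "$ini_tmp"; then
        # Write back through the original file to keep its permissions
        cat "$ini_tmp" > "$ini_file"
        rm -f "$ini_tmp"
    else
        rm -f "$ini_tmp"
        echo "parameter not found: $ini_key" >&2
        return 1
    fi
}

# Usage:
# set_ini_param "$ensemblex_PWD/job_info/configs/ensemblex_config.ini" PAR_vireo_N 9
```

Failing loudly when the key is absent is deliberate: a misspelled parameter name would otherwise leave the default value in place without any warning.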
Now that the parameters have been defined, we can demultiplex the pools with the constituent genetic demultiplexing tools.
+To run Demuxalot use the following code:
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot
+
+If Demuxalot completed successfully, the following files should be available in $ensemblex_PWD/demuxalot:
+demuxalot
+ ├── Demuxalot_result.csv
+ └── new_snps_single_file.betas
+
+To run Demuxlet use the following code:
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet
+
+If Demuxlet completed successfully, the following files should be available in $ensemblex_PWD/demuxlet:
+demuxlet
+ ├── outs.best
+ ├── pileup.cel.gz
+ ├── pileup.plp.gz
+ ├── pileup.umi.gz
+ └── pileup.var.gz
+
+To run Souporcell use the following code:
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell
+
+If Souporcell completed successfully, the following files should be available in $ensemblex_PWD/souporcell:
+souporcell
+ ├── alt.mtx
+ ├── cluster_genotypes.vcf
+ ├── clusters_tmp.tsv
+ ├── clusters.tsv
+ ├── fq.fq
+ ├── minimap.sam
+ ├── minitagged.bam
+ ├── minitagged_sorted.bam
+ ├── minitagged_sorted.bam.bai
+ ├── Pool.vcf
+ ├── ref.mtx
+ └── soup.txt
+
+To run Vireo-GT use the following code:
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo
+
+If Vireo-GT completed successfully, the following files should be available in $ensemblex_PWD/vireo_gt:
+vireo_gt
+ ├── cellSNP.base.vcf.gz
+ ├── cellSNP.cells.vcf.gz
+ ├── cellSNP.samples.tsv
+ ├── cellSNP.tag.AD.mtx
+ ├── cellSNP.tag.DP.mtx
+ ├── cellSNP.tag.OTH.mtx
+ ├── donor_ids.tsv
+ ├── fig_GT_distance_estimated.pdf
+ ├── fig_GT_distance_input.pdf
+ ├── GT_donors.vireo.vcf.gz
+ ├── _log.txt
+ ├── prob_doublet.tsv.gz
+ ├── prob_singlet.tsv.gz
+ └── summary.tsv
+
+Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications.
+NOTE: To minimize computation time for the tutorial, we have provided the necessary output files from the constituent tools here. To access the files and place them in the working directory, use the following code:
+## Demuxalot
+cd $ensemblex_PWD/demuxalot
+wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/Demuxalot_result.csv
+
+## Demuxlet
+cd $ensemblex_PWD/demuxlet
+wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/outs.best
+
+## Souporcell
+cd $ensemblex_PWD/souporcell
+wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/clusters.tsv
+
+## Vireo
+cd $ensemblex_PWD/vireo_gt
+wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/donor_ids.tsv
+
+
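Whatever URL is used, it is worth confirming that each download is the data file itself rather than a web page (GitHub's `/blob/` URLs, for instance, serve an HTML file viewer). The following sketch uses a hypothetical helper name and a simple heuristic; it is not part of Ensemblex.

```shell
# Sketch: heuristic test for "this download is actually an HTML page".
# is_html_page is a hypothetical helper; it inspects only the first 1 KiB.
is_html_page() {
    head -c 1024 "$1" | grep -qi -e '<!doctype html' -e '<html'
}

# Usage: warn about any tutorial output file that looks like HTML
# for f in "$ensemblex_PWD/demuxalot/Demuxalot_result.csv" \
#          "$ensemblex_PWD/demuxlet/outs.best" \
#          "$ensemblex_PWD/souporcell/clusters.tsv" \
#          "$ensemblex_PWD/vireo_gt/donor_ids.tsv"; do
#     if is_html_page "$f"; then echo "looks like a web page, re-download: $f"; fi
# done
```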
+In Step 4, we will process the output files of the four constituent genetic demultiplexing tools with the three-step Ensemblex algorithm:
+First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for the Ensemblex algorithm:
+## Navigate to the .ini file
+cd $ensemblex_PWD/job_info/configs
+
+## Open the .ini file and adjust parameters directly in the terminal
+nano ensemblex_config.ini
+
+For the tutorial, we set the following parameters for the Ensemblex algorithm:
+Parameter | Value |
---|---|
Pool parameters | |
PAR_ensemblex_sample_size | 9 |
PAR_ensemblex_expected_doublet_rate | 0.10 |
Set up parameters | |
PAR_ensemblex_merge_constituents | Yes |
Step 1 parameters: Probabilistic-weighted ensemble | |
PAR_ensemblex_probabilistic_weighted_ensemble | Yes |
Step 2 parameters: Graph-based doublet detection | |
PAR_ensemblex_preliminary_parameter_sweep | No |
PAR_ensemblex_nCD | NULL |
PAR_ensemblex_pT | NULL |
PAR_ensemblex_graph_based_doublet_detection | Yes |
Step 3 parameters: Ensemble-independent doublet detection | |
PAR_ensemblex_preliminary_ensemble_independent_doublet | No |
PAR_ensemblex_ensemble_independent_doublet | Yes |
PAR_ensemblex_doublet_Demuxalot_threshold | Yes |
PAR_ensemblex_doublet_Demuxalot_no_threshold | No |
PAR_ensemblex_doublet_Demuxlet_threshold | No |
PAR_ensemblex_doublet_Demuxlet_no_threshold | No |
PAR_ensemblex_doublet_Souporcell_threshold | No |
PAR_ensemblex_doublet_Souporcell_no_threshold | No |
PAR_ensemblex_doublet_Vireo_threshold | Yes |
PAR_ensemblex_doublet_Vireo_no_threshold | No |
Confidence score parameters | |
PAR_ensemblex_compute_singlet_confidence | Yes |
If Ensemblex completed successfully, the following files should be available in $ensemblex_PWD/ensemblex_gt:
+ensemblex_gt
+├── confidence
+│ └── ensemblex_final_cell_assignment.csv
+├── constituent_tool_merge.csv
+├── step1
+│ ├── ARI_demultiplexing_tools.pdf
+│ ├── BA_demultiplexing_tools.pdf
+│ ├── Balanced_accuracy_summary.csv
+│ └── step1_cell_assignment.csv
+├── step2
+│ ├── optimal_nCD.pdf
+│ ├── optimal_pT.pdf
+│ ├── PC1_var_contrib.pdf
+│ ├── PC2_var_contrib.pdf
+│ ├── PCA1_graph_based_doublet_detection.pdf
+│ ├── PCA2_graph_based_doublet_detection.pdf
+│ ├── PCA3_graph_based_doublet_detection.pdf
+│ ├── PCA_plot.pdf
+│ ├── PCA_scree_plot.pdf
+│ └── Step2_cell_assignment.csv
+└── step3
+ ├── Doublet_overlap_no_threshold.pdf
+ ├── Doublet_overlap_threshold.pdf
+ ├── Number_ensemblex_doublets_EID_no_threshold.pdf
+ ├── Number_ensemblex_doublets_EID_threshold.pdf
+ └── Step3_cell_assignment.csv
+
+Ensemblex's final assignments are described in the ensemblex_final_cell_assignment.csv file. Specifically, the ensemblex_assignment column describes Ensemblex's final assignments after application of the singlet confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are labelled as unassigned); we recommend that users use this column to label their cells for downstream analyses. The ensemblex_best_assignment column describes Ensemblex's best assignments, independent of the singlet confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are NOT labelled as unassigned). The cell barcodes listed under the barcode column can be used to add the ensemblex_final_cell_assignment.csv information to the metadata of a Seurat object.
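To get a quick overview of the final assignments before loading them into R, the number of cells per label can be tallied on the command line. This is a minimal sketch: the function name is hypothetical, and it assumes a plain comma-separated file with a header row and no commas inside fields. The column names are the ones described above.

```shell
# Sketch: count cells per label in one column of the final assignment file.
# count_assignments is a hypothetical helper; the default column is
# ensemblex_assignment, and a second argument selects another column.
count_assignments() {
    csv=$1
    col=${2:-ensemblex_assignment}
    awk -F ',' -v col="$col" '
        NR == 1 {
            # Locate the requested column by its header name (quotes stripped)
            for (i = 1; i <= NF; i++) {
                name = $i; gsub(/"/, "", name)
                if (name == col) idx = i
            }
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { value = $idx; gsub(/"/, "", value); n[value]++ }
        END { for (v in n) print v "\t" n[v] }
    ' "$csv" | sort
}

# Usage:
# count_assignments "$ensemblex_PWD/ensemblex_gt/confidence/ensemblex_final_cell_assignment.csv"
# count_assignments "$ensemblex_PWD/ensemblex_gt/confidence/ensemblex_final_cell_assignment.csv" ensemblex_best_assignment
```

Comparing the two columns in this way shows how many cells the singlet confidence threshold moves to unassigned.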
The following table describes the computational resources used in this tutorial for genetic demultiplexing by the constituent tools and application of the Ensemblex algorithm.
+Tool | Time | CPU | Memory |
---|---|---|---|
Demuxalot | 01:34:59 | 6 | 12.95 GB |
Demuxlet | 03:16:03 | 6 | 138.32 GB |
Souporcell | 2-14:49:21 | 1 | 21.83 GB |
Vireo | 2-01:30:24 | 6 | 29.42 GB |
Ensemblex | 02:05:27 | 1 | 5.67 GB |
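The run times above mix two notations: `HH:MM:SS` and what appears to be `D-HH:MM:SS` (days first). Assuming that reading is correct, the sketch below converts either form to seconds so the tools can be compared directly; the function name is hypothetical.

```shell
# Sketch: convert HH:MM:SS or D-HH:MM:SS to seconds.
# elapsed_to_seconds is a hypothetical helper; the D- prefix is assumed to mean days.
elapsed_to_seconds() {
    echo "$1" | awk '{
        days = 0; clock = $0
        if (index(clock, "-") > 0) { split(clock, d, "-"); days = d[1]; clock = d[2] }
        split(clock, t, ":")
        print days * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
    }'
}

# Usage:
# elapsed_to_seconds 01:34:59     # Demuxalot: 5699
# elapsed_to_seconds 2-14:49:21   # Souporcell: 226161
```

awk is used for the arithmetic so that zero-padded fields such as `08` are read as decimal numbers.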
This guide illustrates the steps taken for our analysis of the PBMC dataset in our pre-print manuscript. Here, we are using the HTO analysis track of scRNAbox to analyze a publicly available scRNAseq dataset produced by Stoeckius et al. This data set describes peripheral blood mononuclear cells (PBMC) from eight human donors, which were tagged with sample-specific barcodes, pooled, and sequenced together in a single run.
+If you want to use the PBMC dataset to test the scRNAbox pipeline, please see here for detailed instructions on how to download the publicly available data.
+To download the latest version of scrnabox.slurm (v0.1.52.50) run the following command:
+wget https://github.com/neurobioinfo/scrnabox/releases/download/v0.1.52.5/scrnabox.slurm.zip
+unzip scrnabox.slurm.zip
+
+For a description of the options for running scrnabox.slurm run the following command:
+bash /pathway/to/scrnabox.slurm/launch_scrnabox.sh -h
+
+If scrnabox.slurm has been installed properly, the above command should return the following:
+scrnabox pipeline version 0.1.52.50
+-------------------
+mandatory arguments:
+ -d (--dir) = Working directory (where all the outputs will be printed) (give full path)
+ --steps = Specify what steps, e.g., 2 to run step 2. 2-6, run steps 2 through 6
+
+ optional arguments:
+ -h (--help) = See helps regarding the pipeline arguments.
+ --method = Select your preferred method: HTO and SCRNA for hashtag, and Standard scRNA, respectively.
+ --msd = You can get the hashtag labels by running the following code (HTO Step 4).
+ --markergsea = Identify marker genes for each cluster and run marker gene set enrichment analysis (GSEA) using EnrichR libraries (Step 7).
+ --knownmarkers = Profile the individual or aggregated expression of known marker genes.
+ --referenceannotation = Generate annotation predictions based on the annotations of a reference Seurat object (Step 7).
+ --annotate = Add clustering annotations to Seurat object metadata (Step 7).
+ --addmeta = Add metadata columns to the Seurat object (Step 8).
+ --rundge = Perform differential gene expression contrasts (Step 8).
+ --seulist = You can directly call the list of Seurat objects to the pipeline.
+ --rcheck = You can identify which libraries are not installed.
+
+ -------------------
+ For a comprehensive help, visit https://neurobioinfo.github.io/scrnabox/site/ for documentation.
+
+For information regarding the installation of CellRanger, please visit the 10X Genomics documentation. If CellRanger is already installed on your HPC system, you may skip the CellRanger installation procedures.
+For our analysis of the midbrain dataset we used the 10X Genomics GRCh38-3.0.0 reference genome and CellRanger v5.0.1. For more information regarding how to prepare reference genomes for the CellRanger counts pipeline, please see the 10X Genomics documentation.
+We must prepare a common R library where we will load all of the required R packages. If the required R packages are already installed on your HPC system in a common R library, you may skip the following procedures.
+
We will first install R. The analyses presented in our pre-print manuscript were conducted using v4.2.1.
+# install R
+module load r/4.2.1
+
+Then, we will run the installation code, which creates a directory where the R packages will be loaded and will install the required R packages:
+# Folder for R packages
+R_PATH=~/path/to/R/library
+mkdir -p $R_PATH
+
+# Install package
+Rscript ./scrnabox.slurm/soft/R/install_packages.R $R_PATH
+
+Now that scrnabox.slurm, CellRanger, R, and the required R packages have been installed, we can proceed to our analysis with the scRNAbox pipeline. We will create a pipeline folder designated for the analysis and run Step 0, selecting the HTO analysis track (--method HTO), using the following code:
+mkdir pipeline
+cd pipeline
+
+export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
+export SCRNABOX_PWD=~/pipeline
+
+bash $SCRNABOX_HOME/launch_scrnabox.sh \
+-d ${SCRNABOX_PWD} \
+--steps 0 \
+--method HTO
+
+Next, we will navigate to the scrnabox_config.ini file in ~/pipeline/job_info/configs to define the HPC account holder (ACCOUNT), the path to the environmental module (MODULEUSE), the path to CellRanger from the environmental module directory (CELLRANGER), CellRanger version (CELLRANGER_VERSION), R version (R_VERSION), and the path to the R library (R_LIB_PATH):
+cd ~/pipeline/job_info/configs
+nano scrnabox_config.ini
+
+ACCOUNT=account-name
+MODULEUSE=/path/to/environmental/module
+CELLRANGER=/path/to/cellranger/from/module/directory
+CELLRANGER_VERSION=5.0.1
+R_VERSION=4.2.1
+R_LIB_PATH=/path/to/R/library
+
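Before launching any step, it may be useful to confirm that none of these entries was left blank. The following sketch lists keys that are absent or empty; the function name is hypothetical, and the `KEY=value` layout is taken from the example above.

```shell
# Sketch: print every requested key that is absent or has an empty value.
# missing_config_keys is a hypothetical helper for KEY=value config files.
missing_config_keys() {
    cfg=$1
    shift
    for key in "$@"; do
        # A key counts as set if a line starts with KEY= followed by at least one character
        grep -q "^${key}=." "$cfg" || echo "$key"
    done
}

# Usage:
# missing_config_keys ~/pipeline/job_info/configs/scrnabox_config.ini \
#     ACCOUNT MODULEUSE CELLRANGER CELLRANGER_VERSION R_VERSION R_LIB_PATH
```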
+Next, we can check to see if all of the required R packages have been properly installed using the following command:
+bash $SCRNABOX_HOME/launch_scrnabox.sh \
+-d ${SCRNABOX_PWD} \
+--steps 0 \
+--rcheck
+
+In Step 1, we will run the CellRanger counts pipeline to generate feature-barcode expression matrices from the FASTQ files. While it is possible to manually prepare the library.csv and feature_ref.csv files for the sequencing run prior to running Step 1, for this analysis we are going to opt for automated library preparation. For more information regarding the manual preparation of library.csv and feature_ref.csv files, please see the CellRanger library preparation tutorial.
+
+For our analysis of the PBMC dataset we set the following execution parameters for Step 1 (~/pipeline/job_info/parameters/step1_par.txt):
Parameter | Value |
---|---|
par_automated_library_prep | Yes |
par_fastq_directory | /path/to/directory/containing/fastqs |
par_RNA_run_names | run1GEX |
par_HTO_run_names | run1HTO |
par_seq_run_names | run1 |
par_paired_end_seq | Yes |
par_id | Hash1, Hash2, Hash3, Hash4, Hash5, Hash6, Hash7, Hash8 |
par_name | A_TotalSeqA, B_TotalSeqA, C_TotalSeqA, D_TotalSeqA, E_TotalSeqA, F_TotalSeqA, G_TotalSeqA, H_TotalSeqA |
par_read | R2 |
par_pattern | 5P(BC) |
par_sequence | AGGACCATCCAA, ACATGTTACCGT, AGCTTACTATCC, TCGATAATGCGA, GAGGCTGAGCTA, GTGTGACGTATT, ACTGTCTAACGG, TATCACATCGGT |
par_ref_dir_grch | ~/genome/10xGenomics/refdata-cellranger-GRCh38-3.0.0 |
par_r1_length | NULL (commented out) |
par_r2_length | NULL (commented out) |
par_mempercode | 30 |
par_include_introns | NULL (commented out) |
par_no_target_umi_filter | NULL (commented out) |
par_expect_cells | NULL (commented out) |
par_force_cells | NULL (commented out) |
par_no_bam | NULL (commented out) |
Note: The parameters file for each step is located in ~/pipeline/job_info/parameters. For a comprehensive description of the execution parameters for each step see here.
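The Step 1 values par_id, par_name, and par_sequence are comma-separated lists that presumably pair up position by position (one entry per hashtag), so a mismatch in their lengths is an easy mistake to make. The sketch below checks this; both function names are hypothetical, and the positional pairing is an assumption rather than documented behaviour.

```shell
# Sketch: check that three comma-separated lists have the same number of entries.
# count_items and check_hto_lists are hypothetical helper names.
count_items() {
    # Number of comma-separated entries in a string
    printf '%s\n' "$1" | awk -F ',' '{ print NF }'
}

check_hto_lists() {
    n_id=$(count_items "$1")
    n_name=$(count_items "$2")
    n_seq=$(count_items "$3")
    if [ "$n_id" -ne "$n_name" ] || [ "$n_id" -ne "$n_seq" ]; then
        echo "length mismatch: par_id=$n_id par_name=$n_name par_sequence=$n_seq"
        return 1
    fi
    echo "$n_id hashtags"
}

# Usage, with the values from the Step 1 parameters:
# check_hto_lists \
#     'Hash1, Hash2, Hash3, Hash4, Hash5, Hash6, Hash7, Hash8' \
#     'A_TotalSeqA, B_TotalSeqA, C_TotalSeqA, D_TotalSeqA, E_TotalSeqA, F_TotalSeqA, G_TotalSeqA, H_TotalSeqA' \
#     'AGGACCATCCAA, ACATGTTACCGT, AGCTTACTATCC, TCGATAATGCGA, GAGGCTGAGCTA, GTGTGACGTATT, ACTGTCTAACGG, TATCACATCGGT'
```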
Given that CellRanger runs a user interface and is not submitted as a job, it is recommended to run Step 1 in a 'screen', which will allow the task to keep running if the connection is broken. To run Step 1, use the following command:
+export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
+export SCRNABOX_PWD=~/pipeline
+
+screen -S run_PBMC_application_case
+bash $SCRNABOX_HOME/launch_scrnabox.sh \
+-d ${SCRNABOX_PWD} \
+--steps 1
+
+The outputs of the CellRanger counts pipeline are deposited into ~/pipeline/step1.
In Step 2, we are going to begin by correcting the RNA assay for ambient RNA contamination using SoupX (Young et al. 2020). We will then use the ambient RNA-corrected feature-barcode matrices to create a Seurat object.
+
+For our analysis of the PBMC dataset we set the following execution parameters for Step 2 (~/pipeline/job_info/parameters/step2_par.txt):
Parameter | Value |
---|---|
par_save_RNA | Yes |
par_save_metadata | Yes |
par_ambient_RNA | Yes |
par_normalization.method | LogNormalize |
par_scale.factor | 10000 |
par_selection.method | vst |
par_nfeatures | 2500 |
We can run Step 2 using the following code:
+export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
+export SCRNABOX_PWD=~/pipeline
+
+bash $SCRNABOX_HOME/launch_scrnabox.sh \
+-d ${SCRNABOX_PWD} \
+--steps 2
+
+Step 2 produces the following outputs:
+~/pipeline
+step2
+├── figs2
+│ ├── ambient_RNA_estimation_run1.pdf
+│ ├── ambient_RNA_markers_run1.pdf
+│ ├── cell_cyle_dim_plot_run1.pdf
+│ ├── vioplot_run1.pdf
+│ └── zoomed_in_vioplot_run1.pdf
+├── info2
+│ ├── estimated_ambient_RNA_run1.txt
+│ ├── MetaData_1.txt
+│ ├── meta_info_1.txt
+│ ├── run1_ambient_rna_summary.rds
+│ ├── sessionInfo.txt
+│ ├── seu1_RNA.txt
+│ └── summary_seu1.txt
+├── objs2
+│ └── run1.rds
+└── step2_ambient
+ └── run1
+ ├── barcodes.tsv
+ ├── genes.tsv
+ └── matrix.mtxs
+
+Note: For a comprehensive description of the outputs for each analytical step, please see the Outputs section of the scRNAbox documentation.
+
+
+
Figure 1. Figures produced by Step 2 of the scRNAbox pipeline. A) Estimated ambient RNA contamination rate (Rho) by SoupX. Estimates of the RNA contamination rate using various estimators are visualized via a frequency distribution; the true contamination rate is assigned as the most frequent estimate (red line; 8.7%). B) Log10 ratios of observed counts to expected counts for marker genes from each cluster. Clusters are defined by the CellRanger counts pipeline. The red line displays the estimated RNA contamination rate if the estimation was based entirely on the corresponding gene. C) Principal component analysis (PCA) of Seurat S and G2M cell cycle reference genes. D) Violin plots showing the distribution of cells according to quality control metrics calculated in Step 2. E) Zoomed in violin plots, from the minimum to the mean, showing the distribution of cells according to quality control metrics calculated in Step 2.
+In Step 3, we are going to perform quality control procedures and filter out low quality cells. We are going to filter out cells with < 50 unique RNA transcripts, > 6000 unique RNA transcripts, < 200 total RNA transcripts, > 7000 total RNA transcripts, and > 50% mitochondrial transcripts.
+For our analysis of the PBMC dataset we set the following execution parameters for Step 3 (~/pipeline/job_info/parameters/step3_par.txt):
Parameter | Value |
---|---|
par_save_RNA | Yes |
par_save_metadata | Yes |
par_seurat_object | NULL |
par_nFeature_RNA_L | 50 |
par_nFeature_RNA_U | 6000 |
par_nCount_RNA_L | 200 |
par_nCount_RNA_U | 7000 |
par_mitochondria_percent_L | 0 |
par_mitochondria_percent_U | 50 |
par_ribosomal_percent_L | 0 |
par_ribosomal_percent_U | 100 |
par_remove_mitochondrial_genes | No |
par_remove_ribosomal_genes | No |
par_remove_genes | NULL |
par_regress_cell_cycle_genes | Yes |
par_normalization.method | LogNormalize |
par_scale.factor | 10000 |
par_selection.method | vst |
par_nfeatures | 2500 |
par_top | 10 |
par_npcs_pca | 30 |
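To make the effect of the lower (_L) and upper (_U) thresholds concrete, the sketch below applies the same cut-offs to a tab-separated metadata table with awk. It only illustrates the filtering logic and is not how scRNAbox implements Step 3: the function name is hypothetical, and the column names nFeature_RNA, nCount_RNA, and percent.mt are assumed to follow the usual Seurat metadata naming.

```shell
# Sketch: keep cells whose QC metrics fall inside the tutorial thresholds.
# filter_cells is a hypothetical helper; the input is a tab-separated table
# with a header row containing nFeature_RNA, nCount_RNA and percent.mt.
filter_cells() {
    awk -F '\t' '
        NR == 1 {
            # Map header names to column positions, then pass the header through
            for (i = 1; i <= NF; i++) col[$i] = i
            print; next
        }
        # Bounds are inclusive: only values below _L or above _U are removed
        $col["nFeature_RNA"] >= 50  && $col["nFeature_RNA"] <= 6000 &&
        $col["nCount_RNA"]   >= 200 && $col["nCount_RNA"]   <= 7000 &&
        $col["percent.mt"]   <= 50
    ' "$1"
}

# Usage (metadata.tsv is a placeholder file name):
# filter_cells metadata.tsv > metadata_filtered.tsv
```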
We can run Step 3 using the following code:
+export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
+export SCRNABOX_PWD=~/pipeline
+
+bash $SCRNABOX_HOME/launch_scrnabox.sh \
+-d ${SCRNABOX_PWD} \
+--steps 3
+
+Step 3 produces the following outputs.
+step3
+├── figs3
+│ ├── dimplot_pca_run1.pdf
+│ ├── elbowplot_run1.pdf
+│ ├── filtered_QC_vioplot_run1.pdf
+│ └── VariableFeaturePlot_run1.pdf
+├── info3
+│ ├── MetaData_run1.txt
+│ ├── meta_info_run1.txt
+│ ├── most_variable_genes_run1.txt
+│ ├── run1_RNA.txt
+│ ├── sessionInfo.txt
+│ └── summary_run1.txt
+└── objs3
+ └── run1.rds
+
+
+
+
Figure 2. Figures produced by Step 3 of the scRNAbox pipeline. A) Violin plots showing the distribution of cells according to quality control metrics after filtering by user-defined thresholds. B) Scatter plot showing the top 2500 most variable features; the top 10 most variable features are labelled. C) Principal component analysis (PCA) visualizing the first two principal components (PCs). D) Elbow plot to visualize the percentage of variance explained by each PC.
+In Step 4, we are going to demultiplex the pooled samples and remove doublets (erroneous libraries produced by two or more cells) based on the expression of the sample-specific barcodes (antibody assay).
+If the barcode labels used in the analysis are unknown, the first step is to retrieve them from the Seurat object. To do this, we do not need to modify the execution parameters and can go straight to running the following code:
+export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
+export SCRNABOX_PWD=~/pipeline
+
+bash $SCRNABOX_HOME/launch_scrnabox.sh \
+-d ${SCRNABOX_PWD} \
+--steps 4 \
+--msd T
+
+The above code produces the following file:
+step4
+├── figs4
+├── info4
+│ └── seu1.rds_old_antibody_label_MULTIseqDemuxHTOcounts.csv
+└── objs4
+
+This file contains the names of the barcode labels (i.e. A_TotalSeqA, B_TotalSeqA, C_TotalSeqA, D_TotalSeqA, E_TotalSeqA, F_TotalSeqA, G_TotalSeqA, H_TotalSeqA, Doublet, Negative).
+Now that we know the barcode labels used in the PBMC dataset, we can perform demultiplexing and doublet detection.
+For our analysis of the PBMC dataset we set the following execution parameters for Step 4 (~/pipeline/job_info/parameters/step4_par.txt):
Parameter | Value |
---|---|
par_save_RNA | Yes |
par_save_metadata | Yes |
par_normalization.method | CLR |
par_scale.factor | 10000 |
par_selection.method | vst |
par_nfeatures | 2500 |
par_dimensionality_reduction | Yes |
par_npcs_pca | 30 |
par_dims_umap | 3 |
par_n.neighbor | 65 |
par_dropDN | Yes |
par_label_dropDN | Doublet, Negative |
par_quantile | 0.9 |
par_autoThresh | TRUE |
par_maxiter | 5 |
par_RidgePlot_ncol | 3 |
par_old_antibody_label | A-TotalSeqA, B-TotalSeqA, C-TotalSeqA, D-TotalSeqA, E-TotalSeqA, F-TotalSeqA, G-TotalSeqA, H-TotalSeqA, Doublet |
par_new_antibody_label | sample-A, sample-B, sample-C, sample-D, sample-E, sample-F, sample-G, sample-H, Doublet |
We can run Step 4 using the following code:
+export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
+export SCRNABOX_PWD=~/pipeline
+
+bash $SCRNABOX_HOME/launch_scrnabox.sh \
+-d ${SCRNABOX_PWD} \
+--steps 4
+
+Step 4 produces the following outputs.
+step4
+├── figs4
+│ ├── run1_DotPlot_HTO_MSD.pdf
+│ ├── run1_Heatmap_HTO_MSD.pdf
+│ ├── run1_HTO_dimplot_pca.pdf
+│ ├── run1_HTO_dimplot_umap.pdf
+│ ├── run1_nCounts_RNA_MSD.pdf
+│ └── run1_Ridgeplot_HTO_MSD.pdf
+├── info4
+│ ├── run1_filtered_MULTIseqDemuxHTOcounts.csv
+│ ├── run1_MetaData.txt
+│ ├── run1_meta_info_.txt
+│ ├── run1_MULTIseqDemuxHTOcounts.csv
+│ ├── run1_RNA.txt
+│ └── sessionInfo.txt
+└── objs4
+ └── run1.rds
+
+
+
+
Figure 3. Figures produced by Step 4 of the Cell Hashtag Analysis Track. A) Uniform Manifold Approximation and Projection (UMAP) plot, taking the first three principal components (PCs) of the antibody assay as input. B) Principal component analysis (PCA) showing the first two PCs of the antibody assay. C) Ridgeplot visualizing the enrichment of barcode labels across sample assignments at the sample level. D) Dot plot visualizing the enrichment of barcode labels across sample assignments at the sample level. E) Heatmap visualizing the enrichment of barcode labels across sample assignments at the cell level. F) Violin plot visualizing the distribution of the number of total RNA transcripts identified per cell, stratified by sample assignment.
+The code used to produce the publication-ready figures used in our pre-print manuscript is available here.
+The following job configurations were used for our analysis of the PBMC dataset. Job configurations can be modified for each analytical step in the scrnabox_config.ini file in ~/pipeline/job_info/configs.
Step | THREADS_ARRAY | MEM_ARRAY | WALLTIME_ARRAY |
---|---|---|---|
Step2 | 4 | 16g | 00-05:00 |
Step3 | 4 | 16g | 00-05:00 |
Step4 | 4 | 16g | 00-05:00 |
MIT License
+Copyright (c) 2022 The Neuro Bioinformatics Core
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ +In Step 1, we will set up the working directory for the Ensemblex pipeline and decide which version of the pipeline we want to use:
+First, create a dedicated folder for the analysis (hereafter referred to as the working directory). Then, define the path to the working directory and the path to ensemblex.pip:
+## Create and navigate to the working directory
+mkdir working_directory
+cd /path/to/working_directory
+
+## Define the path to ensemblex.pip
+ensemblex_HOME=/path/to/ensemblex.pip
+
+## Define the path to the working directory
+ensemblex_PWD=/path/to/working_directory
+
+Next, we can set up the working directory for demultiplexing with prior genotype information using the following code:
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT
+
+After running the above code, the working directory should have the following structure:
+working_directory
+├── demuxalot
+├── demuxlet
+├── ensemblex_gt
+├── input_files
+├── job_info
+│ ├── configs
+│ │ └── ensemblex_config.ini
+│ ├── logs
+│ └── summary_report.txt
+├── souporcell
+└── vireo_gt
+
+Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools: Preparation of input files
+First, create a dedicated folder for the analysis (hereafter referred to as the working directory). Then, define the path to the working directory and the path to ensemblex.pip:
+## Create and navigate to the working directory
+mkdir working_directory
+cd /path/to/working_directory
+
+## Define the path to ensemblex.pip
+ensemblex_HOME=/path/to/ensemblex.pip
+
+## Define the path to the working directory
+ensemblex_PWD=/path/to/working_directory
+
+Next, we can set up the working directory for demultiplexing without prior genotype information using the following code:
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-noGT
+
+After running the above code, the working directory should have the following structure:
+working_directory
+├── demuxalot
+├── freemuxlet
+├── ensemblex
+├── input_files
+├── job_info
+│ ├── configs
+│ │ └── ensemblex_config.ini
+│ ├── logs
+│ └── summary_report.txt
+├── souporcell
+└── vireo
+
+Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools: Preparation of input files
+ +In Step 2, we will define the necessary files needed for Ensemblex's constituent genetic demultiplexing tools and will place them within the working directory. The necessary files vary depending on the version of the Ensemblex pipeline being used:
+To demultiplex the pooled samples with prior genotype information, the following files are required:
+| File | Description |
+|---|---|
+| gene_expression.bam | Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) |
+| gene_expression.bam.bai | Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) |
+| barcodes.tsv | Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) |
+| pooled_samples.vcf | vcf file describing the genotypes of the pooled samples |
+| genome_reference.fa | Genome reference fasta file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa) |
+| genome_reference.fa.fai | Genome reference fasta index file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa.fai) |
+| genotype_reference.vcf | Population reference vcf file (e.g., 1000 Genomes Project) |
NOTE: We demonstrate how to download reference vcf and fasta files in the Tutorial section of the Ensemblex documentation.
+First, define all of the required files:
+BAM=/path/to/possorted_genome_bam.bam
+BAM_INDEX=/path/to/possorted_genome_bam.bam.bai
+BARCODES=/path/to/barcodes.tsv
+SAMPLE_VCF=/path/to/pooled_samples.vcf
+REFERENCE_VCF=/path/to/genotype_reference.vcf
+REFERENCE_FASTA=/path/to/genome.fa
+REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai
+
+Then, place the required files in the Ensemblex pipeline working directory:
+## Define the path to the working directory
+ensemblex_PWD=/path/to/working_directory
+
+## Copy the files to the input_files directory in the working directory
+cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam
+cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai
+cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv
+cp $SAMPLE_VCF $ensemblex_PWD/input_files/pooled_samples.vcf
+cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf
+cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa
+cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai
+
+If the file transfer was successful, the input_files directory of the Ensemblex pipeline working directory will contain the following files:
+working_directory
+└── input_files
+ ├── pooled_bam.bam
+ ├── pooled_bam.bam.bai
+ ├── pooled_barcodes.tsv
+ ├── pooled_samples.vcf
+ ├── reference.fa
+ ├── reference.fa.fai
+ └── reference.vcf
+
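Because the input files can be large, it may be worth confirming that each copy matches its source before launching the constituent tools. The sketch below is only an illustration: `verify_copy` is a hypothetical helper, and it relies on the standard `cmp` utility, which can be slow for a large bam file.

```shell
#!/usr/bin/env bash
# verify_copy <source> <copy>
# Hypothetical helper: succeeds only if both files exist and are byte-identical.
verify_copy() {
    if cmp -s "$1" "$2"; then
        echo "ok: $2"
    else
        echo "MISMATCH or missing: $2" >&2
        return 1
    fi
}

# Example calls (variables as defined earlier in this step):
# verify_copy "$BAM" "$ensemblex_PWD/input_files/pooled_bam.bam"
# verify_copy "$SAMPLE_VCF" "$ensemblex_PWD/input_files/pooled_samples.vcf"
```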
+NOTE: The names of the input files have been standardized. The input files must have the corresponding names shown above for the Ensemblex pipeline to work properly.
+Upon placing the required files in the working directory, we can proceed to Step 3 where we will demultiplex the pooled samples using Ensemblex's constituent genetic demultiplexing tools: Genetic demultiplexing by constituent tools
+To demultiplex the pooled samples without prior genotype information, the following files are required:
+| File | Description |
+|---|---|
+| gene_expression.bam | Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) |
+| gene_expression.bam.bai | Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) |
+| barcodes.tsv | Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) |
+| genome_reference.fa | Genome reference fasta file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa) |
+| genome_reference.fa.fai | Genome reference fasta index file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa.fai) |
+| genotype_reference.vcf | Population reference vcf file (e.g., 1000 Genomes Project) |
NOTE: We demonstrate how to download reference vcf and fasta files in the Tutorial section of the Ensemblex documentation.
+First, define all of the required files:
+BAM=/path/to/possorted_genome_bam.bam
+BAM_INDEX=/path/to/possorted_genome_bam.bam.bai
+BARCODES=/path/to/barcodes.tsv
+REFERENCE_VCF=/path/to/genotype_reference.vcf
+REFERENCE_FASTA=/path/to/genome.fa
+REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai
+
+Then, place the required files in the Ensemblex pipeline working directory:
+## Define the path to the working directory
+ensemblex_PWD=/path/to/working_directory
+
+## Copy the files to the input_files directory in the working directory
+cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam
+cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai
+cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv
+cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf
+cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa
+cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai
+
+If the file transfer was successful, the input_files directory of the Ensemblex pipeline working directory will contain the following files:
+working_directory
+└── input_files
+ ├── pooled_bam.bam
+ ├── pooled_bam.bam.bai
+ ├── pooled_barcodes.tsv
+ ├── reference.fa
+ ├── reference.fa.fai
+ └── reference.vcf
+
+NOTE: The names of the input files have been standardized. The input files must have the corresponding names shown above for the Ensemblex pipeline to work properly.
+Upon placing the required files in the working directory, we can proceed to Step 3 where we will demultiplex the pooled samples using Ensemblex's constituent genetic demultiplexing tools: Genetic demultiplexing by constituent tools
+ +In Step 3, we will demultiplex the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools. The constituent genetic demultiplexing tools will vary depending on the version of the Ensemblex pipeline being used:
+NOTE: The analytical parameters for each constituent tool can be adjusted using the ensemblex_config.ini file located in ~/working_directory/job_info/configs. For a comprehensive description of how to adjust the analytical parameters of the Ensemblex pipeline, please see Execution parameters.
+When demultiplexing with prior genotype information, Ensemblex leverages the sample labels from Demuxalot, Demuxlet, Souporcell, and Vireo-GT.
+To run Demuxalot use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot
+
+If Demuxalot completed successfully, the following files should be available in ~/working_directory/demuxalot:
working_directory
+└── demuxalot
+ ├── Demuxalot_result.csv
+ └── new_snps_single_file.betas
+
+To run Demuxlet use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet
+
+If Demuxlet completed successfully, the following files should be available in ~/working_directory/demuxlet:
working_directory
+└── demuxlet
+ ├── outs.best
+ ├── pileup.cel.gz
+ ├── pileup.plp.gz
+ ├── pileup.umi.gz
+ └── pileup.var.gz
+
+To run Souporcell use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell
+
+If Souporcell completed successfully, the following files should be available in ~/working_directory/souporcell:
working_directory
+└── souporcell
+ ├── alt.mtx
+ ├── cluster_genotypes.vcf
+ ├── clusters_tmp.tsv
+ ├── clusters.tsv
+ ├── fq.fq
+ ├── minimap.sam
+ ├── minitagged.bam
+ ├── minitagged_sorted.bam
+ ├── minitagged_sorted.bam.bai
+ ├── Pool.vcf
+ ├── ref.mtx
+ └── soup.txt
+
+To run Vireo-GT use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo
+
+If Vireo-GT completed successfully, the following files should be available in ~/working_directory/vireo_gt:
working_directory
+└── vireo_gt
+ ├── cellSNP.base.vcf.gz
+ ├── cellSNP.cells.vcf.gz
+ ├── cellSNP.samples.tsv
+ ├── cellSNP.tag.AD.mtx
+ ├── cellSNP.tag.DP.mtx
+ ├── cellSNP.tag.OTH.mtx
+ ├── donor_ids.tsv
+ ├── fig_GT_distance_estimated.pdf
+ ├── fig_GT_distance_input.pdf
+ ├── GT_donors.vireo.vcf.gz
+ ├── _log.txt
+ ├── prob_doublet.tsv.gz
+ ├── prob_singlet.tsv.gz
+ └── summary.tsv
+
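The four launch commands above differ only in the value passed to `--step`, so they can be wrapped in a loop. This is a minimal sketch under stated assumptions: `run_steps` is a hypothetical helper, and it assumes that `launch_ensemblex.sh` runs each step in the foreground and exits with a non-zero status on failure. If the launcher hands work to a job scheduler on your system, the loop only confirms that each step was launched, not that it finished.

```shell
#!/usr/bin/env bash
# run_steps <launcher> <working_directory> <step>...
# Hypothetical helper: runs each step in order and stops at the first
# non-zero exit status.
run_steps() {
    local launcher=$1 workdir=$2; shift 2
    local step
    for step in "$@"; do
        echo "Running step: $step"
        bash "$launcher" -d "$workdir" --step "$step" || {
            echo "Step '$step' failed; stopping." >&2
            return 1
        }
    done
}

# Example call for the pipeline with prior genotype information:
# run_steps "$ensemblex_HOME/launch_ensemblex.sh" "$ensemblex_PWD" demuxalot demuxlet souporcell vireo
```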
+Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications: Application of Ensemblex
+When demultiplexing without prior genotype information, Ensemblex leverages the sample labels from Freemuxlet, Souporcell, Vireo, and Demuxalot.
+To run Freemuxlet use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step freemuxlet
+
+If Freemuxlet completed successfully, the following files should be available in ~/working_directory/freemuxlet:
working_directory
+└── freemuxlet
+ ├── outs.clust1.samples.gz
+ ├── outs.clust1.vcf
+ ├── outs.lmix
+ ├── pileup.cel.gz
+ ├── pileup.plp.gz
+ ├── pileup.umi.gz
+ └── pileup.var.gz
+
+To run Souporcell use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell
+
+If Souporcell completed successfully, the following files should be available in ~/working_directory/souporcell:
working_directory
+└── souporcell
+ ├── alt.mtx
+ ├── cluster_genotypes.vcf
+ ├── clusters_tmp.tsv
+ ├── clusters.tsv
+ ├── fq.fq
+ ├── minimap.sam
+ ├── minitagged.bam
+ ├── minitagged_sorted.bam
+ ├── minitagged_sorted.bam.bai
+ ├── Pool.vcf
+ ├── ref.mtx
+ └── soup.txt
+
+To run Vireo use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo
+
+If Vireo completed successfully, the following files should be available in ~/working_directory/vireo:
working_directory
+└── vireo
+ ├── cellSNP.base.vcf.gz
+ ├── cellSNP.cells.vcf.gz
+ ├── cellSNP.samples.tsv
+ ├── cellSNP.tag.AD.mtx
+ ├── cellSNP.tag.DP.mtx
+ ├── cellSNP.tag.OTH.mtx
+ ├── donor_ids.tsv
+ ├── fig_GT_distance_estimated.pdf
+ ├── GT_donors.vireo.vcf.gz
+ ├── _log.txt
+ ├── prob_doublet.tsv.gz
+ ├── prob_singlet.tsv.gz
+ └── summary.tsv
+
+NOTE: Because the Demuxalot algorithm requires prior genotype information, the Ensemblex pipeline uses the predicted vcf file generated by Freemuxlet as input into Demuxalot when prior genotype information is not available. Therefore, it is important to wait for Freemuxlet to complete before running Demuxalot. To check if the required Freemuxlet-generated vcf file is available prior to running Demuxalot, you can use the following code:
+if test -f /path/to/working_directory/freemuxlet/outs.clust1.vcf; then
+ echo "File exists."
+fi
+
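If Freemuxlet is still running, the one-off check above can be turned into a polling loop. The sketch below is an illustration only: `wait_for_file` is a hypothetical helper and the timeout and polling interval are arbitrary defaults. Note that a non-empty file does not by itself prove that Freemuxlet has finished writing it, so the job status should still be checked.

```shell
#!/usr/bin/env bash
# wait_for_file <path> [timeout_seconds] [poll_interval_seconds]
# Hypothetical helper: polls until <path> exists and is non-empty,
# or gives up once the timeout has been reached.
wait_for_file() {
    local path=$1 timeout=${2:-3600} interval=${3:-60}
    local waited=0
    while [ ! -s "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out waiting for $path" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "Found $path"
}

# Example call (wait up to two hours, checking every two minutes):
# wait_for_file /path/to/working_directory/freemuxlet/outs.clust1.vcf 7200 120
```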
+Upon confirming that the required Freemuxlet-generated file exists, we can run Demuxalot using the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot
+
+If Demuxalot completed successfully, the following files should be available in ~/working_directory/demuxalot:
working_directory
+└── demuxalot
+ ├── Demuxalot_result.csv
+ └── new_snps_single_file.betas
+
+Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications: Application of Ensemblex
+ +In Step 4, we will process the output files from the constituent genetic demultiplexing tools with the Ensemblex framework. Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools:
+Step 1: Probabilistic-weighted ensemble
+In Step 1, Ensemblex utilizes an unsupervised weighting model to identify the most probable sample label for each cell. Ensemblex weighs each constituent tool’s assignment probability distribution by its estimated balanced accuracy for the dataset. The weighted assignment probabilities across all four constituent tools are then used to inform the most probable sample label for each cell.
Step 2: Graph-based doublet detection
+In Step 2, Ensemblex utilizes a graph-based approach to identify doublets that were incorrectly labeled as singlets in Step 1. Pooled cells are embedded into PCA space and the most confident doublets in the pool (nCD) are identified. Then, based on the Euclidean distance in PCA space, the pooled cells that surpass the percentile threshold (pT) of the nearest neighbour frequency to the confident doublets are labelled as doublets by Ensemblex. Ensemblex performs an automated parameter sweep to identify the optimal nCD and pT values; however, users can opt to manually define these parameters.
Step 3: Ensemble-independent doublet detection
+In Step 3, Ensemblex utilizes an ensemble-independent approach to further improve doublet detection. Here, cells that are labelled as doublets by Demuxalot or Vireo are labelled as doublets by Ensemblex; however, users can nominate different tools to utilize for Step 3, depending on the desired doublet detection stringency.
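The weighting idea behind Step 1 can be illustrated with a toy calculation for a single cell. This is a deliberately simplified sketch and not the Ensemblex implementation: the function name `weighted_label`, the input layout, and all numbers are invented for illustration, with the `weight` column standing in for each tool's estimated balanced accuracy.

```shell
#!/usr/bin/env bash
# weighted_label: reads a whitespace-separated table from standard input.
# Row 1 is a header (tool, weight, then one column per candidate label);
# each later row holds one tool's weight and its assignment probabilities.
# Prints the label with the largest sum of weight * probability.
weighted_label() {
    awk '
        NR == 1 { for (i = 3; i <= NF; i++) label[i] = $i; next }   # label names
        { for (i = 3; i <= NF; i++) score[i] += $2 * $i; n = NF }   # weight * probability
        END {
            best = 3
            for (i = 4; i <= n; i++) if (score[i] > score[best]) best = i
            print label[best]
        }
    '
}

# Toy example for one cell and four tools (invented values).
weighted_label <<'EOF'
tool weight sampleA sampleB doublet
tool1 0.90 0.80 0.10 0.10
tool2 0.85 0.70 0.20 0.10
tool3 0.60 0.20 0.70 0.10
tool4 0.80 0.60 0.30 0.10
EOF
# prints "sampleA" (weighted scores: sampleA 1.915, sampleB 0.92, doublet 0.315)
```

In this toy input the lowest-weighted tool disagrees with the other three, and its vote contributes least to the final label.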
Users can choose to run each step of the Ensemblex framework sequentially (Steps 1 to 3) or can opt to skip certain steps. While Step 1 is necessary to generate the ensemble sample labels, Steps 2 and 3 were implemented to improve Ensemblex's ability to identify doublets; thus, if users do not want to prioritize doublet detection, they may skip Steps 2 and/or 3. Nonetheless, we demonstrated in our pre-print manuscript that utilizing the entire Ensemblex framework is important for maximizing the demultiplexing accuracy. Users can define which steps of the Ensemblex framework they want to utilize in the adjustable parameters file.
+The adjustable parameters file (ensemblex_config.ini) is located in ~/working_directory/job_info/configs/. For a comprehensive description of how to adjust the analytical parameters of the Ensemblex pipeline, please see Execution parameters. The following parameters are adjustable when applying the Ensemblex algorithm:
+| Parameter | Default | Description |
+|---|---|---|
+| Pool parameters | | |
+| PAR_ensemblex_sample_size | NULL | Number of samples multiplexed in the pool. |
+| PAR_ensemblex_expected_doublet_rate | NULL | Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation. |
+| Set up parameters | | |
+| PAR_ensemblex_merge_constituents | Yes | Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated merged file. |
+| Step 1 parameters: Probabilistic-weighted ensemble | | |
+| PAR_ensemblex_probabilistic_weighted_ensemble | Yes | Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated Step 1 output file. |
+| Step 2 parameters: Graph-based doublet detection | | |
+| PAR_ensemblex_preliminary_parameter_sweep | No | Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighbour frequency (pT), which can be defined in the following two parameters, respectively. |
+| PAR_ensemblex_nCD | NULL | Manually defined number of confident doublets in the pool (nCD). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes". |
+| PAR_ensemblex_pT | NULL | Manually defined percentile threshold of the nearest neighbour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes". |
+| PAR_ensemblex_graph_based_doublet_detection | Yes | Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values. |
+| Step 3 parameters: Ensemble-independent doublet detection | | |
+| PAR_ensemblex_preliminary_ensemble_independent_doublet | No | Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters. |
+| PAR_ensemblex_ensemble_independent_doublet | Yes | Whether or not to perform Step 3: Ensemble-independent doublet detection. |
+| PAR_ensemblex_doublet_Demuxalot_threshold | Yes | Whether or not to label doublets identified by Demuxalot as doublets. Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex. |
+| PAR_ensemblex_doublet_Demuxalot_no_threshold | No | Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability. |
+| PAR_ensemblex_doublet_Demuxlet_threshold | No | Whether or not to label doublets identified by Demuxlet as doublets. Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex. |
+| PAR_ensemblex_doublet_Demuxlet_no_threshold | No | Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability. |
+| PAR_ensemblex_doublet_Souporcell_threshold | No | Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex. |
+| PAR_ensemblex_doublet_Souporcell_no_threshold | No | Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability. |
+| PAR_ensemblex_doublet_Vireo_threshold | Yes | Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex. |
+| PAR_ensemblex_doublet_Vireo_no_threshold | No | Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability. |
+| Confidence score parameters | | |
+| PAR_ensemblex_compute_singlet_confidence | Yes | Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses. |
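Parameters can be edited in any text editor. For scripted runs, a small helper along the following lines may be convenient. This is a sketch under a loud assumption: it presumes that ensemblex_config.ini stores each parameter as one unquoted `KEY=VALUE` line, which should be checked against your own copy of the file before use. The helper name `set_param` is hypothetical, and values containing `|` or `&` are not handled.

```shell
#!/usr/bin/env bash
# set_param <config_file> <parameter> <value>
# Hypothetical helper; assumes one KEY=VALUE assignment per line.
set_param() {
    local file=$1 key=$2 value=$3
    if ! grep -q "^${key}=" "$file"; then
        echo "parameter not found: $key" >&2
        return 1
    fi
    # Write to a temporary file first so a failed edit cannot truncate the config.
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example calls (the tutorial pool contains nine samples):
# CONFIG=/path/to/working_directory/job_info/configs/ensemblex_config.ini
# set_param "$CONFIG" PAR_ensemblex_sample_size 9
# grep '^PAR_ensemblex_' "$CONFIG"    # review the resulting values
```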
To apply the Ensemblex algorithm use the following code:
+ensemblex_HOME=/path/to/ensemblex.pip
+ensemblex_PWD=/path/to/working_directory
+
+bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step ensemblexing
+
+If the Ensemblex algorithm completed successfully, the following files should be available in ~/working_directory/ensemblex:
working_directory
+└── ensemblex
+ ├── confidence
+ │ └── ensemblex_final_cell_assignment.csv
+ ├── constituent_tool_merge.csv
+ ├── step1
+ │ ├── ARI_demultiplexing_tools.pdf
+ │ ├── BA_demultiplexing_tools.pdf
+ │ ├── Balanced_accuracy_summary.csv
+ │ └── step1_cell_assignment.csv
+ ├── step2
+ │ ├── optimal_nCD.pdf
+ │ ├── optimal_pT.pdf
+ │ ├── PC1_var_contrib.pdf
+ │ ├── PC2_var_contrib.pdf
+ │ ├── PCA1_graph_based_doublet_detection.pdf
+ │ ├── PCA2_graph_based_doublet_detection.pdf
+ │ ├── PCA3_graph_based_doublet_detection.pdf
+ │ ├── PCA_plot.pdf
+ │ ├── PCA_scree_plot.pdf
+ │ └── Step2_cell_assignment.csv
+ └── step3
+ ├── Doublet_overlap_no_threshold.pdf
+ ├── Doublet_overlap_threshold.pdf
+ ├── Number_Ensemblux_doublets_EID_no_threshold.pdf
+ ├── Number_Ensemblux_doublets_EID_threshold.pdf
+ └── Step3_cell_assignment.csv
+
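Once the run has finished, a quick tally of one column of the final assignment file can serve as a sanity check. The sketch below is an assumption-laden illustration: `tally_column` is a hypothetical helper, the column name must be taken from the header of your own output file (none is assumed here), and the simple comma split does not handle quoted fields that contain commas.

```shell
#!/usr/bin/env bash
# tally_column <csv_file> <column_name>
# Hypothetical helper: counts how often each value occurs in the named
# column of a comma-separated file that has a header row.
tally_column() {
    awk -F',' -v col="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) { gsub(/"/, "", $i); if ($i == col) idx = i }
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { gsub(/"/, "", $idx); count[$idx]++ }
        END { for (v in count) print count[v], v }
    ' "$1" | sort -k1,1nr -k2,2
}

# Example call; replace <column_name> with a header from your own file:
# head -n 1 /path/to/working_directory/ensemblex/confidence/ensemblex_final_cell_assignment.csv
# tally_column /path/to/working_directory/ensemblex/confidence/ensemblex_final_cell_assignment.csv <column_name>
```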
+For a comprehensive description of the Ensemblex algorithm output files, please see Ensemblex outputs.
+ +Any contributions or suggestions for improving the Ensemblex pipeline are welcome and appreciated. You may directly contact Michael Fiorini or Saeid Amiri.
+If you encounter any issues, please open an issue in the GitHub repository.
+Alternatively, you are welcome to email the developers directly; for any questions please contact Michael Fiorini: michael.fiorini@mail.mcgill.ca
Ensemblex is an accuracy-weighted ensemble framework for genetic demultiplexing of pooled single-cell RNA sequencing (scRNAseq) data. By addressing the limitations of individual genetic demultiplexing tools, we demonstrated that Ensemblex:
+The ensemble method capitalizes on the added confidence of combining distinct statistical frameworks for genetic demultiplexing, but the modular algorithm can adapt to the overall performance of its constituent tools on the respective dataset, making it resilient against a poorly performing constituent tool.
+Ensemblex can be used to demultiplex pools with or without prior genotype information. When demultiplexing with prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools: Demuxalot, Demuxlet, Souporcell, and Vireo-GT.
+When demultiplexing without prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools: Demuxalot, Freemuxlet, Souporcell, and Vireo.
+Upon demultiplexing pools with each of the four constituent genetic demultiplexing tools, Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools:
+Step 1: Probabilistic-weighted ensemble
+Step 2: Graph-based doublet detection
+Step 3: Ensemble-independent doublet detection
As output, Ensemblex returns its own cell-specific sample labels with the corresponding assignment probabilities and singlet confidence scores, as well as the sample labels and corresponding assignment probabilities for each of its constituent tools. The demultiplexed sample labels can then be used to perform downstream analyses.
+
+
+
Figure 1. Overview of the Ensemblex workflow. A) The Ensemblex workflow begins with demultiplexing pooled samples by each of the constituent tools. The outputs from each individual demultiplexing tool are then used as input into the Ensemblex framework. B) The Ensemblex framework comprises three distinct steps that are assembled into a pipeline: 1) accuracy-weighted probabilistic ensemble, 2) graph-based doublet detection, and 3) ensemble-independent doublet detection. C) As output, Ensemblex returns its own sample-cell assignments as well as the sample-cell assignments of each of its constituent tools. D) Ensemblex's sample-cell assignments can be used to perform downstream analysis on the pooled scRNAseq data.
+To facilitate the application of Ensemblex, we provide a pipeline that demultiplexes pooled cells by each of the individual constituent genetic demultiplexing tools and processes the outputs with the Ensemblex algorithm. In this documentation, we outline each step of the Ensemblex pipeline, illustrate how to run the pipeline, define best practices, and provide a tutorial with publicly available datasets.
+For a comprehensive description of Ensemblex, ground-truth benchmarking, and application to real-world datasets, see our pre-print manuscript: Pre-print
+The Ensemblex container is freely available under an MIT open-source license at https://zenodo.org/records/11639103.
+The Ensemblex container can be downloaded using the following code:
+## Download the Ensemblex container
+curl "https://zenodo.org/records/11639103/files/ensemblex.pip.zip?download=1" --output ensemblex.pip.zip
+
+## Unzip the Ensemblex container
+unzip ensemblex.pip.zip
+
+If installation was successful the following will be available:
+ensemblex.pip
+├── gt
+│ ├── configs
+│ │ └── ensemblex_config.ini
+│ └── scripts
+│ ├── demuxalot
+│ │ ├── pipeline_demuxalot.sh
+│ │ └── pipline_demuxalot.py
+│ ├── demuxlet
+│ │ └── pipeline_demuxlet.sh
+│ ├── ensemblexing
+│ │ ├── ensemblexing.R
+│ │ ├── functions.R
+│ │ └── pipeline_ensemblexing.sh
+│ ├── souporcell
+│ │ └── pipeline_souporcell_generate.sh
+│ └── vireo
+│ └── pipeline_vireo.sh
+├── launch
+│ ├── launch_gt.sh
+│ └── launch_nogt.sh
+├── launch_ensemblex.sh
+├── nogt
+│ ├── configs
+│ │ └── ensemblex_config.ini
+│ └── scripts
+│ ├── demuxalot
+│ │ ├── pipeline_demuxalot.py
+│ │ └── pipeline_demuxalot.sh
+│ ├── ensemblexing
+│ │ ├── ensemblexing_nogt.R
+│ │ ├── functions_nogt.R
+│ │ └── pipeline_ensemblexing.sh
+│ ├── freemuxlet
+│ │ └── pipeline_freemuxlet.sh
+│ ├── souporcell
+│ │ └── pipeline_souporcell_generate.sh
+│ └── vireo
+│ └── pipeline_vireo.sh
+├── README
+├── soft
+│ └── ensemblex.sif
+└── tools
+ ├── sort_vcf_same_as_bam.sh
+ └── utils.sh
+
+In addition to the Ensemblex container, Apptainer must be available on the system. For example, on a cluster that provides Apptainer through environment modules:
+## Load Apptainer
+module load apptainer/1.2.4
+
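Before printing the help message, it can be useful to confirm that the prerequisites are in place. The following is a minimal sketch: `check_prereqs` is a hypothetical helper, and the two file names it checks are taken from the ensemblex.pip directory tree shown above.

```shell
#!/usr/bin/env bash
# check_prereqs <ensemblex_HOME>
# Hypothetical helper: checks that apptainer is on the PATH and that the
# launcher script and container image from the tree above are present.
check_prereqs() {
    local home=$1 status=0 f
    if ! command -v apptainer >/dev/null 2>&1; then
        echo "apptainer not found on PATH" >&2
        status=1
    fi
    for f in launch_ensemblex.sh soft/ensemblex.sif; do
        if [ ! -e "$home/$f" ]; then
            echo "missing: $home/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call:
# check_prereqs /path/to/ensemblex.pip && echo "Prerequisites found."
```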
+To test if the Ensemblex container is installed properly, run the following code:
+## Define the path to ensemblex.pip
+ensemblex_HOME=/path/to/ensemblex.pip
+
+## Print help message
+bash $ensemblex_HOME/launch_ensemblex.sh -h
+
+Which should return the following help message:
+-------------------
+Usage: /home/fiorini9/scratch/ensemblex.pip/launch_ensemblex.sh [arguments]
+ mandatory arguments:
+ -d (--dir) = Working directory (where all the outputs will be printed) (give full path)
+ --steps = Specify the steps to execute. Begin by selecting either init-GT or init-noGT to establish the working directory.
+ For GT: vireo, demuxalot, demuxlet, souporcell, ensemblexing
+ For noGT: vireo, demuxalot, freemuxlet, souporcell, ensemblexing
+
+ optional arguments:
+ -h (--help) = See helps regarding the pipeline arguments
+ --vcf = The path of vcf file
+ --bam = The path of bam file
+ --sortout = The path snd nsme of vcf generated using sort
+ -------------------
+ For a comprehensive help, visit https://neurobioinfo.github.io/ensemblex/site/ for documentation.
+
+Upon installing the Ensemblex container, we can proceed to Step 1 where we will initiate the Ensemblex pipeline for demultiplexing: Set up
+ +For the tutorial, we will leverage a pooled scRNAseq dataset produced by Jerber et al.. This pool contains induced pluripotent cell lines (iPSC) from 9 healthy controls that were differentiated towards a dopaminergic neuron state.
+In this section of the tutorial, we will:
+Before we begin, we will create a designated folder for the Ensemblex tutorial:
+mkdir ensemblex_tutorial
+cd ensemblex_tutorial
+
+We will begin by downloading the pooled scRNAseq data from the Sequence Read Archive (SRA):
+## Create a folder to place pooled scRNAseq data
+mkdir pooled_scRNAseq
+cd pooled_scRNAseq
+
+
+## Download pooled scRNAseq FASTQ files
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/009/ERR4700019/ERR4700019_1.fastq.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/009/ERR4700019/ERR4700019_2.fastq.gz
+
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/000/ERR4700020/ERR4700020_1.fastq.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/000/ERR4700020/ERR4700020_2.fastq.gz
+
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/001/ERR4700021/ERR4700021_1.fastq.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/001/ERR4700021/ERR4700021_2.fastq.gz
+
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/002/ERR4700022/ERR4700022_1.fastq.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/002/ERR4700022/ERR4700022_2.fastq.gz
+
+
+## Rename pooled scRNAseq FASTQ files
+mv ERR4700019_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L001_R1_001.fastq.gz
+mv ERR4700019_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L001_R2_001.fastq.gz
+
+mv ERR4700020_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L002_R1_001.fastq.gz
+mv ERR4700020_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L002_R2_001.fastq.gz
+
+mv ERR4700021_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L003_R1_001.fastq.gz
+mv ERR4700021_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L003_R2_001.fastq.gz
+
+mv ERR4700022_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L004_R1_001.fastq.gz
+mv ERR4700022_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L004_R2_001.fastq.gz
+
+
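The renaming pattern above maps each run accession to one lane, in the order listed. As an illustrative sketch, the same mapping can be generated in a loop; `fastq_name` is a hypothetical helper, and the loop only prints the `mv` commands (a dry run) so that they can be inspected before anything is renamed.

```shell
#!/usr/bin/env bash
# fastq_name <lane_number> <read_number>
# Hypothetical helper: builds the file name used above,
# e.g. lane 1, read 1 gives pool_S1_L001_R1_001.fastq.gz
fastq_name() {
    printf 'pool_S1_L%03d_R%d_001.fastq.gz\n' "$1" "$2"
}

# Dry run: print the mv commands instead of executing them.
lane=1
for acc in ERR4700019 ERR4700020 ERR4700021 ERR4700022; do
    for read in 1 2; do
        echo mv "${acc}_${read}.fastq.gz" "$(fastq_name "$lane" "$read")"
    done
    lane=$((lane + 1))
done
```

Removing the `echo` would perform the renaming in the current directory.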
+Next, we will process the pooled scRNAseq data with the Cell Ranger count pipeline (cellranger count):
+## Create CellRanger directory
+cd ~/ensemblex_tutorial
+mkdir CellRanger
+cd CellRanger
+
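+## Run cellranger count on the renamed FASTQ files
+## ($HOME is used instead of ~ because ~ is not expanded after "=" in an option)
+cellranger count \
+--id=pool \
+--fastqs="$HOME/ensemblex_tutorial/pooled_scRNAseq" \
+--sample=pool \
+--transcriptome="$HOME/10xGenomics/refdata-cellranger-GRCh37"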
+
+If the Cell Ranger count pipeline completed successfully, it will have generated the following files that we will use for genetic demultiplexing downstream:
+NOTE: For more information regarding the Cell Ranger count pipeline, please see the 10X documentation.
+Next, we will download the whole exome .vcf files corresponding to the nine pooled individuals from which the iPSC lines were derived. We will download the .vcf files from the European Nucleotide Archive (ENA):
+## Create a folder to place sample genotype data
+cd ~/ensemblex_tutorial
+mkdir sample_genotype
+cd sample_genotype
+
+## HPSI0115i-hecn_6
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487971/HPSI0115i-hecn_6.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487971/HPSI0115i-hecn_6.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi
+
+## HPSI0214i-pelm_3
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ122/ERZ122924/HPSI0214i-pelm_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20150415.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ122/ERZ122924/HPSI0214i-pelm_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20150415.genotypes.vcf.gz.tbi
+
+## HPSI0314i-sojd_3
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ266/ERZ266723/HPSI0314i-sojd_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20160122.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ266/ERZ266723/HPSI0314i-sojd_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20160122.genotypes.vcf.gz.tbi
+
+## HPSI0414i-sebn_3
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376769/HPSI0414i-sebn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376769/HPSI0414i-sebn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi
+
+## HPSI0514i-uenn_3
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ488/ERZ488039/HPSI0514i-uenn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ488/ERZ488039/HPSI0514i-uenn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi
+
+## HPSI0714i-pipw_4
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376869/HPSI0714i-pipw_4.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376869/HPSI0714i-pipw_4.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi
+
+## HPSI0715i-meue_5
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376787/HPSI0715i-meue_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376787/HPSI0715i-meue_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi
+
+## HPSI0914i-vaka_5
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487965/HPSI0914i-vaka_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487965/HPSI0914i-vaka_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi
+
+## HPSI1014i-quls_2
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487886/HPSI1014i-quls_2.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz
+wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487886/HPSI1014i-quls_2.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi
+
+Upon downloading the individual genotype data, we will merge the individual files to generate a single .vcf file.
+## Merge .vcf files
+module load bcftools
+bcftools merge *.vcf.gz > sample_genotype_merge.vcf
+
+The resulting sample_genotype_merge.vcf
file will be used as prior genotype information for genetic demultiplexing downstream.
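As a quick sanity check, the merged file should contain one sample column per pooled individual (nine in this tutorial). The sketch below counts sample columns from the `#CHROM` header line using only `awk`; the function name `count_vcf_samples` is hypothetical and it assumes an uncompressed, tab-delimited VCF.

```shell
# Hypothetical helper (not part of Ensemblex): count the sample columns of a VCF.
# In a VCF, the #CHROM header line has 9 fixed columns followed by one column per sample.
count_vcf_samples() {
  # $1: path to an uncompressed .vcf file
  awk -F'\t' '/^#CHROM/ { print (NF > 9 ? NF - 9 : 0); exit }' "$1"
}

# Example call (file name taken from the tutorial):
#   count_vcf_samples sample_genotype_merge.vcf
# With bcftools loaded, `bcftools query -l sample_genotype_merge.vcf` lists the sample names.
```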
Next, we will download a reference genotype file from the 1000 Genomes Project, Phase 3:
+## Create a folder to place the reference files
+cd ~/ensemblex_tutorial
+mkdir reference_files
+cd reference_files
+
+## Download reference .vcf
+wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
+wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz.tbi
+
+## Unzip .vcf file
+gunzip ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
+
+## Only keep SNPs
+module load vcftools
+vcftools --vcf ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only
+
+
+## Only keep common variants
+module load bcftools
+bcftools filter -e 'AF<0.01' SNPs_only.recode.vcf > common_SNPs_only.recode.vcf
+
+The resulting common_SNPs_only.recode.vcf
file will be used as reference genotype data for genetic demultiplexing downstream.
Finally, we will prepare a reference genome. For our tutorial we will use the GRCh37 10X reference genome. For information regarding references, see the 10X documentation.
+## Copy pre-built reference genome to working directory
+cp /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa ~/ensemblex_tutorial/reference_files
+
+We will use the genome.fa
reference genome for genetic demultiplexing downstream.
To run the Ensemblex pipeline on the downloaded data please see the Ensemblex with prior genotype information section of the Ensemblex pipeline.
+ +After applying the Ensemblex algorithm to the output files of the constituent genetic demultiplexing tools in Step 4, the ~/working_directory/ensemblex
folder will have the following structure:
working_directory
+└── ensemblex
+ ├── constituent_tool_merge.csv
+ ├── step1
+ ├── step2
+ ├── step3
+ └── confidence
+
+constituent_tool_merge.csv contains the merged outputs from each constituent genetic demultiplexing tool.
step1/ contains the outputs from Step 1: probabilistic-weighted ensemble.
step2/ contains the outputs from Step 2: graph-based doublet detection.
step3/ contains the outputs from Step 3: ensemble-independent doublet detection.
confidence/ contains the final Ensemblex output file, whose sample labels have been annotated with the Ensemblex singlet confidence score.
Note: If users re-run a step of the Ensemblex workflow, the outputs from the previous run will automatically be overwritten. To keep the outputs from a previous run, copy them to a separate directory first.
+Ensemblex begins by merging the output files of the constituent genetic demultiplexing tools by cell barcode, which produces the constituent_tool_merge.csv
 file. In this file, each constituent genetic demultiplexing tool has two columns corresponding to its sample labels:
demuxalot_assignment
demuxalot_best_assignment
demuxlet_assignment
demuxlet_best_assignment
souporcell_assignment
souporcell_best_assignment
vireo_assignment
vireo_best_assignment
Taking Vireo as an example, vireo_assignment
shows Vireo's sample labels after applying its recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will be labeled as "unassigned". In turn, vireo_best_assignment
 shows Vireo's best-guess assignments without applying the recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will still show the best sample label and will not be labeled as "unassigned".
The constituent_tool_merge.csv
file also contains a general_consensus
column. This is not Ensemblex's sample labels. The general_consensus
column simply shows the sample labels that result from a majority vote classifier; split decisions are labeled as unassigned.
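The majority vote behind a column such as general_consensus can be illustrated with a small sketch. This is not Ensemblex's code: the function name `majority_vote` is hypothetical, and it assumes that a label wins only when more than half of the tools agree, so anything else (including a 2-2 split) is reported as "unassigned". Labels are assumed to contain no whitespace.

```shell
# Hypothetical helper (not Ensemblex's implementation): strict majority vote over tool labels.
# Assumption: a label wins only if more than half of the votes agree; otherwise "unassigned".
majority_vote() {
  # Arguments: one sample label per constituent tool
  printf '%s\n' "$@" \
    | sort | uniq -c | sort -k1,1nr \
    | awk -v n="$#" 'NR == 1 { if ($1 * 2 > n) print $2; else print "unassigned" }'
}

# Example calls:
#   majority_vote Sample_1 Sample_1 Sample_1 Sample_2
#   majority_vote Sample_1 Sample_1 Sample_2 Sample_2
```

Under the stated assumption, the first call yields Sample_1 and the second yields unassigned.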
After running Step 1 of the Ensemblex algorithm, the /step1
folder will contain the following files:
working_directory
+└── ensemblex
+ └── step1
+ ├── ARI_demultiplexing_tools.pdf
+ ├── BA_demultiplexing_tools.pdf
+ ├── Balanced_accuracy_summary.csv
+ └── Step1_cell_assignment.csv
+
+Output type | +Name | +Description | +
---|---|---|
Figure | +ARI_demultiplexing_tools.pdf | +Heatmap showing the Adjusted Rand Index (ARI) between the sample labels of the constituent genetic demultiplexing tools. | +
Figure | +BA_demultiplexing_tools.pdf | +Barplot showing the estimated balanced accuracy for each constituent genetic demultiplexing tool. | +
File | +Balanced_accuracy_summary.csv | +Summary file describing the estimated balanced accuracy computation for each constituent genetic demultiplexing tool. | +
File | +Step1_cell_assignment.csv | +Data file containing Ensemblex's sample labels after Step 1: accuracy-weighted probabilistic ensemble. | +
The Step1_cell_assignment.csv
file contains the following important columns:
ensemblex_assignment: Ensemblex sample labels after performing accuracy-weighted probabilistic ensemble.
ensemblex_probability: Accuracy-weighted ensemble probability corresponding to Ensemblex's sample labels.
NOTE: Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low-confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts.
+After running Step 2 of the Ensemblex algorithm, the /step2
folder will contain the following files:
working_directory
+└── ensemblex
+ └── step2
+ ├── optimal_nCD.pdf
+ ├── optimal_pT.pdf
+ ├── PC1_var_contrib.pdf
+ ├── PC2_var_contrib.pdf
+ ├── PCA1_graph_based_doublet_detection.pdf
+ ├── PCA2_graph_based_doublet_detection.pdf
+ ├── PCA3_graph_based_doublet_detection.pdf
+ ├── PCA_plot.pdf
+ ├── PCA_scree_plot.pdf
+ └── Step2_cell_assignment.csv
+
+Output type | +Name | +Description | +
---|---|---|
Figure | +optimal_nCD.pdf | +Dot plot showing the optimal nCD value. | +
Figure | +optimal_pT.pdf | +Dot plot showing the optimal pT value. | +
Figure | +PC1_var_contrib.pdf | +Bar plot showing the contribution of each variable to the variation across the first principal component. | +
Figure | +PC2_var_contrib.pdf | +Bar plot showing the contribution of each variable to the variation across the second principal component. | +
Figure | +PCA1_graph_based_doublet_detection.pdf | +PCA showing Ensemblex sample labels (singlet or doublet) prior to performing graph-based doublet detection. | +
Figure | +PCA2_graph_based_doublet_detection.pdf | +PCA showing the cells identified as the n most confident doublets in the pool. | +
Figure | +PCA3_graph_based_doublet_detection.pdf | +PCA showing Ensemblex sample labels (singlet or doublet) after performing graph-based doublet detection. | +
Figure | +PCA_plot.pdf | +PCA of pooled cells. | +
Figure | +PCA_scree_plot.pdf | +Bar plot showing the variance explained by each principal component. | +
File | +Step2_cell_assignment.csv | +Data file containing Ensemblex's sample labels after Step 2: graph-based doublet detection. | +
The Step2_cell_assignment.csv
file contains the following important column:
ensemblex_assignment: Ensemblex sample labels after performing graph-based doublet detection.
NOTE: Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low-confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts.
+After running Step 3 of the Ensemblex algorithm, the /step3
folder will contain the following files:
working_directory
+└── ensemblex
+ └── step3
+ ├── Doublet_overlap_no_threshold.pdf
+ ├── Doublet_overlap_threshold.pdf
+ ├── Number_ensemblex_doublets_EID_no_threshold.pdf
+ ├── Number_ensemblex_doublets_EID_threshold.pdf
+ └── Step3_cell_assignment.csv
+
+
+Output type | +Name | +Description | +
---|---|---|
Figure | +Doublet_overlap_no_threshold.pdf | +Proportion of doublet calls overlapping between constituent genetic demultiplexing tools without applying assignment probability thresholds. | +
Figure | +Doublet_overlap_threshold.pdf | +Proportion of doublet calls overlapping between constituent genetic demultiplexing tools after applying assignment probability thresholds. | +
Figure | +Number_ensemblex_doublets_EID_no_threshold.pdf | +Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, without applying assignment probability thresholds. | +
Figure | +Number_ensemblex_doublets_EID_threshold.pdf | +Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, after applying assignment probability thresholds. | +
File | +Step3_cell_assignment.csv | +Data file containing Ensemblex's sample labels after Step 3: ensemble-independent doublet detection. | +
The Step3_cell_assignment.csv
file contains the following important column:
ensemblex_assignment: Ensemblex sample labels after performing ensemble-independent doublet detection.
NOTE: Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low-confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts.
+After computing the Ensemblex singlet confidence score, the /confidence
folder will contain the following file:
working_directory
+└── ensemblex
+ └── confidence
+ └── ensemblex_final_cell_assignment.csv
+
+
+
+Output type | +Name | +Description | +
---|---|---|
File | +ensemblex_final_cell_assignment.csv | +Data file containing Ensemblex's final sample labels after computing the singlet confidence score. | +
The ensemblex_final_cell_assignment.csv
 file contains the following important columns:
ensemblex_assignment: Ensemblex sample labels after applying the recommended singlet confidence score threshold; singlets with a confidence score < 1 are labeled as "unassigned".
ensemblex_best_assignment: Ensemblex's best-guess assignments without applying the recommended confidence score threshold; singlets with a confidence score < 1 will still show the best sample label and will not be labeled as "unassigned".
ensemblex_singlet_confidence: Ensemblex singlet confidence score.
NOTE: We recommend using the sample labels from ensemblex_assignment
for downstream analyses.
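To get a quick overview of the final labels, the number of cells per value of a column such as ensemblex_assignment can be tabulated from the terminal. The sketch below is an illustration only: the function name `count_labels` is hypothetical, and it assumes a comma-separated file with a header row and no commas inside fields (double quotes around fields, if present, are stripped).

```shell
# Hypothetical helper (not part of Ensemblex): count rows per value of a named CSV column.
# Assumptions: comma-separated file, header row, no commas embedded inside fields.
count_labels() {
  # $1: CSV file; $2: column name (e.g., ensemblex_assignment)
  awk -F',' -v col="$2" '
    { gsub(/"/, "") }                      # drop quotes; awk re-splits the fields
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { n[$c]++ }
    END { for (k in n) print k "," n[k] }
  ' "$1" | sort
}

# Example call (path taken from the directory tree above):
#   count_labels ~/working_directory/ensemblex/confidence/ensemblex_final_cell_assignment.csv ensemblex_assignment
```

Each output line has the form label,count, which makes it easy to see how many cells were labeled "unassigned" or as doublets.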
The Ensemblex workflow begins by demultiplexing pooled cells with each of its constituent tools: Demuxalot, Demuxlet, Souporcell and Vireo-GT if using prior genotype information or Demuxalot, Freemuxlet, Souporcell and Vireo if prior genotype information is not available.
+
+
+
Figure 1. Input into the Ensemblex framework. The Ensemblex workflow begins with demultiplexing pooled samples by each of the constituent tools. The outputs from each individual demultiplexing tool are then used as input into the Ensemblex framework.
+Upon demultiplexing pools with each individual constituent genetic demultiplexing tool, Ensemblex processes the outputs in a three-step pipeline:
+
+
+
Figure 2. Overview of the three-step Ensemblex framework. The Ensemblex framework comprises three distinct steps that are assembled into a pipeline: 1) accuracy-weighted probabilistic ensemble, 2) graph-based doublet detection, and 3) ensemble-independent doublet detection.
+For demonstration purposes throughout this section, we leveraged simulated pools with known ground-truth sample labels that were generated with 80 independently sequenced induced pluripotent stem cell (iPSC) lines from individuals with Parkinson's disease and neurologically healthy controls. The lines were differentiated towards a dopaminergic cell fate as part of the Foundational Data Initiative for Parkinson's disease (FOUNDIN-PD; Bressan et al.).
+The accuracy-weighted probabilistic ensemble component of the Ensemblex framework uses an unsupervised weighting model to identify the most probable sample label for each cell. Ensemblex weighs each constituent tool's assignment probability distribution by its estimated balanced accuracy for the dataset, in a framework largely inspired by the work of Large et al. To estimate the balanced accuracy of a particular constituent tool (e.g. Demuxalot) for real-world datasets lacking ground-truth labels, Ensemblex uses the cells with a consensus assignment across the three remaining tools (e.g. Demuxlet, Souporcell, and Vireo-GT) as a proxy for ground truth. The weighted assignment probabilities across all four constituent tools are then used to determine the most probable sample label for each cell.
+
+
+
Figure 3. Graphical representation of the accuracy-weighted probabilistic ensemble component of the Ensemblex framework.
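For readers unfamiliar with the metric, balanced accuracy is commonly defined as the mean of the per-class recall. The sketch below computes it from pairs of reference and predicted labels; it illustrates the metric only and is not Ensemblex's implementation. The function name `balanced_accuracy` and the two-column input format are assumptions made for this example; in the setting described above, the reference column would hold the consensus labels of the three remaining tools.

```shell
# Hypothetical helper (illustration of the metric, not Ensemblex's code).
# Balanced accuracy here = mean over reference classes of (correct calls / cells in class).
balanced_accuracy() {
  # stdin: one "reference_label,predicted_label" pair per line, no header
  awk -F',' '
    { total[$1]++; if ($1 == $2) hit[$1]++ }
    END {
      for (k in total) { sum += hit[k] / total[k]; classes++ }
      if (classes) printf "%.3f\n", sum / classes
    }
  '
}

# Example call:
#   printf 'A,A\nA,A\nA,B\nB,B\n' | balanced_accuracy
```

In the example call, class A has a recall of 2/3 and class B a recall of 1, so the reported balanced accuracy is their mean.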
+The graph-based doublet detection component of the Ensemblex framework was implemented to identify doublets that are incorrectly labeled as singlets by the accuracy-weighted probabilistic ensemble component (Step 1). To demonstrate Step 2 of the Ensemblex framework we leveraged a simulated pool comprising 24 pooled samples, 17,384 cells, and a 15% doublet rate.
+
+
+
Figure 4. Graphical representation of the graph-based doublet detection component of the Ensemblex framework.
+The graph-based doublet detection component begins by leveraging select variables returned from each constituent tool:
+
+
+
Figure 5. Select variables returned by the constituent genetic demultiplexing tools used for graph-based doublet detection.
+Using these variables, Ensemblex screens each pooled cell to identify the n most confident doublets in the pool and performs a principal component analysis (PCA).
+
+
+
Figure 6. PCA of pooled cells using select variables returned by the constituent genetic demultiplexing tools. A) PCA highlighting ground truth cell labels: singlet or doublet. B) PCA highlighting the n most confident doublets identified by Ensemblex.
+The PCA embedding is then converted into a Euclidean distance matrix and each cell is assigned a percentile rank based on its distance to each confident doublet. After performing an automated parameter sweep, Ensemblex labels as doublets the droplets that appear most frequently amongst the nearest neighbours of confident doublets.
+
+
+
Figure 7. PCA of pooled cells labeled according to Ensemblex labels prior to and after graph-based doublet detection. A) PCA highlighting ground truth cell labels: singlet or doublet. B) PCA highlighting Ensemblex's labels prior to graph-based doublet detection. C) PCA highlighting Ensemblex's labels after graph-based doublet detection.
+The ensemble-independent doublet detection component of the Ensemblex framework was implemented to further improve Ensemblex's ability to identify doublets. Benchmarking on simulated pools with known ground-truth sample labels revealed that certain genetic demultiplexing tools, namely Demuxalot and Vireo, showed high doublet detection specificity.
+
+
+
Figure 8. Constituent genetic demultiplexing tool doublet specificity on computationally multiplexed pools with ground truth sample labels. Doublet specificity was evaluated on pools ranging in size from 4 to 80 multiplexed samples.
+However, Steps 1 and 2 of the Ensemblex workflow failed to correctly label a subset of doublet calls by these tools. To mitigate this issue and maximize the rate of doublet identification, Ensemblex labels the cells that are identified as doublets by Vireo or Demuxalot as doublets, by default; however, users can nominate different tools for the ensemble-independent doublet detection component depending on the desired doublet detection stringency.
+
+
+
Figure 9. Graphical representation of the ensemble-independent doublet detection component of the Ensemblex framework.
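The override rule described above can be sketched in a few lines. This is an illustration under assumptions, not Ensemblex's implementation: the function name `eid_label` is hypothetical, and it assumes that doublet calls are encoded with the literal label "doublet".

```shell
# Hypothetical helper (not Ensemblex's code): ensemble-independent doublet override.
# Assumption: a doublet call from any nominated tool is encoded as the string "doublet".
eid_label() {
  # $1: label after Steps 1 and 2; remaining arguments: labels from the nominated tools
  eid_result="$1"
  shift
  for eid_call in "$@"; do
    if [ "$eid_call" = "doublet" ]; then
      eid_result="doublet"
    fi
  done
  printf '%s\n' "$eid_result"
}

# Example calls (second and third arguments stand for the Vireo and Demuxalot labels):
#   eid_label Sample_1 Sample_1 doublet
#   eid_label Sample_1 Sample_1 Sample_1
```

Nominating more tools makes the rule more permissive, since a doublet call from any one of them is enough to relabel the cell.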
+We sequentially applied each step of the Ensemblex framework to 96 computationally multiplexed pools with known ground truth sample labels ranging in size from 4 to 80 samples. The proportion of correctly classified singlets and doublets identified by Ensemblex after each step of the framework is shown in Figure 10.
+
+
+
Figure 10. Contribution of each component of the Ensemblex framework to demultiplexing accuracy. The average proportion of correctly classified A) singlets and B) doublets across replicates at a given pool size is shown after sequentially applying each step of the Ensemblex framework. The right panels show the average proportion of correct classifications across all 96 pools. The blue points show the proportion of cells that were correctly classified by at least one tool: Demuxalot, Demuxlet, Souporcell, or Vireo.
+For detailed methodology please see our pre-print manuscript.
+ +The Ensemblex pipeline was developed to facilitate the application of each of Ensemblex's constituent demultiplexing tools and seamlessly integrate the output files into the Ensemblex framework. We provide two distinct, yet highly comparable pipelines:
+The pipelines comprise four distinct steps:
+
+
+
Each step of the pipeline is comprehensively described in the following sections of the Ensemblex documentation.
+Prior to running the Ensemblex pipeline, users should modify the execution parameters for the constituent genetic demultiplexing tools and the Ensemblex algorithm. Upon running Step 1: Set up, a /job_info
 folder will be created in the working directory. Within the /job_info
folder is a /configs
folder which contains the ensemblex_config.ini
; this .ini file contains all of the adjustable parameters for the Ensemblex pipeline.
working_directory
+└── job_info
+ ├── configs
+ │ └── ensemblex_config.ini
+ ├── logs
+ └── summary_report.txt
+
+To ensure replicability, the execution parameters are documented in ~/working_directory/job_info/summary_report.txt
.
The following section illustrates how to modify the ensemblex_config.ini
parameter file directly from the terminal. To begin, navigate to the /configs
folder and view its contents:
cd ~/working_directory/job_info/configs
+ls
+
+The following file will be available: ensemblex_config.ini
To modify the ensemblex_config.ini
parameter file directly in the terminal we will use Nano:
nano ensemblex_config.ini
+
+This will open ensemblex_config.ini
in the terminal and allow users to modify the parameters. To save the modifications and exit the parameter file, type ctrl+o
 (then Enter to confirm the file name), followed by ctrl+x
.
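For scripted runs, a parameter can also be changed without opening an editor. The sketch below is a hedged example only: the function name `set_param` is hypothetical, and it assumes that ensemblex_config.ini stores each parameter on its own line as NAME=VALUE, which should be checked against the actual file before use.

```shell
# Hypothetical helper (not part of Ensemblex): set NAME=VALUE in a config file.
# Assumption: each parameter sits on its own line in the form NAME=VALUE.
set_param() {
  # $1: config file; $2: parameter name; $3: new value
  cfg_tmp="$(mktemp)" || return 1
  awk -v key="$2" -v val="$3" '
    index($0, key "=") == 1 { print key "=" val; next }   # exact key at line start
    { print }
  ' "$1" > "$cfg_tmp" && cat "$cfg_tmp" > "$1"             # rewrite in place, keep permissions
  cfg_status=$?
  rm -f "$cfg_tmp"
  return "$cfg_status"
}

# Example call (parameter name taken from the tables below):
#   set_param ~/working_directory/job_info/configs/ensemblex_config.ini PAR_vireo_N 9
```

Lines whose names merely begin with the same prefix (for example a longer parameter name) are left untouched because the match includes the trailing `=`.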
The following parameters are adjustable for Demuxalot:
+Parameter | +Default | +Description | +
---|---|---|
PAR_demuxalot_genotype_names | +NULL | +List of sample IDs in the sample VCF file (e.g., 'Sample_1,Sample_2,Sample_3'). | +
PAR_demuxalot_minimum_coverage | +200 | +Minimum read coverage. | +
PAR_demuxalot_minimum_alternative_coverage | +10 | +Minimum alternative read coverage. | +
PAR_demuxalot_n_best_snps_per_donor | +100 | +Number of best SNPs for each donor to use for demultiplexing. | +
PAR_demuxalot_genotypes_prior_strength | +1 | +Genotype prior strength. | +
PAR_demuxalot_doublet_prior | +0.25 | +Doublet prior strength. | +
The following parameters are adjustable for Demuxlet:
+Parameter | +Default | +Description | +
---|---|---|
PAR_demuxlet_field | +GT | +Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. | +
NOTE: We are currently working on expanding the execution parameters for Demuxlet.
+The following parameters are adjustable for Vireo:
+Parameter | +Default | +Description | +
---|---|---|
PAR_vireo_N | +NULL | +Number of pooled samples. | +
PAR_vireo_type | +GT | +Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. | +
PAR_vireo_processes | +20 | +Number of subprocesses for computing. | +
PAR_vireo_minMAF | +0.1 | +Minimum minor allele frequency. | +
PAR_vireo_minCOUNT | +20 | +Minimum aggregated count. | +
PAR_vireo_forcelearnGT | +T | +Whether or not to treat donor GT as prior only. | +
NOTE: We are currently working on expanding the execution parameters for Vireo.
+The following parameters are adjustable for Souporcell:
+Parameter | +Default | +Description | +
---|---|---|
PAR_minimap2 | +-ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no | +For information regarding the minimap2 parameters, please see the documentation. | +
PAR_freebayes | +-iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 | +For information regarding the freebayes parameters, please see the documentation. | +
PAR_vartrix_umi | +TRUE | +Whether or not to consider UMI information when populating coverage matrices. | +
PAR_vartrix_mapq | +30 | +Minimum read mapping quality. | +
PAR_vartrix_threads | +8 | +Number of threads for computing. | +
PAR_souporcell_k | +NULL | +Number of pooled samples. | +
PAR_souporcell_t | +8 | +Number of threads for computing. | +
NOTE: We are currently working on expanding the execution parameters for Souporcell.
+The following parameters are adjustable for Demuxalot:
+Parameter | +Default | +Description | +
---|---|---|
PAR_demuxalot_genotype_names | +NULL | +List of sample IDs in the sample VCF file generated by Freemuxlet: outs.clust1.vcf (e.g., 'CLUST0,CLUST1,CLUST2'). | +
PAR_demuxalot_minimum_coverage | +200 | +Minimum read coverage. | +
PAR_demuxalot_minimum_alternative_coverage | +10 | +Minimum alternative read coverage. | +
PAR_demuxalot_n_best_snps_per_donor | +100 | +Number of best SNPs for each donor to use for demultiplexing. | +
PAR_demuxalot_genotypes_prior_strength | +1 | +Genotype prior strength. | +
PAR_demuxalot_doublet_prior | +0.25 | +Doublet prior strength. | +
The following parameters are adjustable for Freemuxlet:
+Parameter | +Default | +Description | +
---|---|---|
PAR_freemuxlet_nsample | +NULL | +Number of pooled samples. | +
NOTE: We are currently working on expanding the execution parameters for Freemuxlet.
+The following parameters are adjustable for Vireo:
+Parameter | +Default | +Description | +
---|---|---|
PAR_vireo_N | +NULL | +Number of pooled samples. | +
PAR_vireo_processes | +20 | +Number of subprocesses for computing. | +
PAR_vireo_minMAF | +0.1 | +Minimum minor allele frequency. | +
PAR_vireo_minCOUNT | +20 | +Minimum aggregated count. | +
NOTE: We are currently working on expanding the execution parameters for Vireo.
+The following parameters are adjustable for Souporcell:
+Parameter | +Default | +Description | +
---|---|---|
PAR_minimap2 | +-ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no | +For information regarding the minimap2 parameters, please see the documentation. | +
PAR_freebayes | +-iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 | +For information regarding the freebayes parameters, please see the documentation. | +
PAR_vartrix_umi | +TRUE | +Whether or not to consider UMI information when populating coverage matrices. | +
PAR_vartrix_mapq | +30 | +Minimum read mapping quality. | +
PAR_vartrix_threads | +8 | +Number of threads for computing. | +
PAR_souporcell_k | +NULL | +Number of pooled samples. | +
PAR_souporcell_t | +8 | +Number of threads for computing. | +
NOTE: We are currently working on expanding the execution parameters for Souporcell.
+The following parameters are adjustable for the Ensemblex algorithm:
+Parameter | +Default | +Description | +
---|---|---|
Pool parameters | ++ | + |
PAR_ensemblex_sample_size | +NULL | +Number of samples multiplexed in the pool. | +
PAR_ensemblex_expected_doublet_rate | +NULL | +Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation. | +
Set up parameters | ++ | + |
PAR_ensemblex_merge_constituents | +Yes | +Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated merged file. | +
Step 1 parameters: Probabilistic-weighted ensemble | ++ | + |
PAR_ensemblex_probabilistic_weighted_ensemble | +Yes | +Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated Step 1 output file. | +
Step 2 parameters: Graph-based doublet detection | ++ | + |
PAR_ensemblex_preliminary_parameter_sweep | +No | +Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighbour frequency (pT), which can be defined in the following two parameters, respectively. | +
PAR_ensemblex_nCD | +NULL | +Manually defined number of confident doublets in the pool (nCD). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes". | +
PAR_ensemblex_pT | +NULL | +Manually defined percentile threshold of the nearest neighbour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes". | +
PAR_ensemblex_graph_based_doublet_detection | +Yes | +Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values. | +
Step 3 parameters: Ensemble-independent doublet detection | ++ | + |
PAR_ensemblex_preliminary_ensemble_independent_doublet | +No | +Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters. | +
PAR_ensemblex_ensemble_independent_doublet | +Yes | +Whether or not to perform Step 3: Ensemble-independent doublet detection. | +
PAR_ensemblex_doublet_Demuxalot_threshold | +Yes | +Whether or not to label doublets identified by Demuxalot as doublets. Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex. | +
PAR_ensemblex_doublet_Demuxalot_no_threshold | +No | +Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability. | +
PAR_ensemblex_doublet_Demuxlet_threshold | +No | +Whether or not to label doublets identified by Demuxlet as doublets. Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex. | +
PAR_ensemblex_doublet_Demuxlet_no_threshold | +No | +Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability. | +
PAR_ensemblex_doublet_Souporcell_threshold | +No | +Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex. | +
PAR_ensemblex_doublet_Souporcell_no_threshold | +No | +Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability. | +
PAR_ensemblex_doublet_Vireo_threshold | +Yes | +Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex. | +
PAR_ensemblex_doublet_Vireo_no_threshold | +No | +Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability. | +
Confidence score parameters | ++ | + |
PAR_ensemblex_compute_singlet_confidence | +Yes | +Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses. | +
' + escapeHtml(summary) +'
' + noResultsText + '
'); + } +} + +function doSearch () { + var query = document.getElementById('mkdocs-search-query').value; + if (query.length > min_search_length) { + if (!window.Worker) { + displayResults(search(query)); + } else { + searchWorker.postMessage({query: query}); + } + } else { + // Clear results for short queries + displayResults([]); + } +} + +function initSearch () { + var search_input = document.getElementById('mkdocs-search-query'); + if (search_input) { + search_input.addEventListener("keyup", doSearch); + } + var term = getSearchTermFromLocation(); + if (term) { + search_input.value = term; + doSearch(); + } +} + +function onWorkerMessage (e) { + if (e.data.allowSearch) { + initSearch(); + } else if (e.data.results) { + var results = e.data.results; + displayResults(results); + } else if (e.data.config) { + min_search_length = e.data.config.min_search_length-1; + } +} + +if (!window.Worker) { + console.log('Web Worker API not supported'); + // load index in main thread + $.getScript(joinUrl(base_url, "search/worker.js")).done(function () { + console.log('Loaded worker'); + init(); + window.postMessage = function (msg) { + onWorkerMessage({data: msg}); + }; + }).fail(function (jqxhr, settings, exception) { + console.error('Could not load worker.js'); + }); +} else { + // Wrap search in a web worker + var searchWorker = new Worker(joinUrl(base_url, "search/worker.js")); + searchWorker.postMessage({init: true}); + searchWorker.onmessage = onWorkerMessage; +} diff --git a/site/search/search_index.json b/site/search/search_index.json new file mode 100644 index 0000000..9ccbead --- /dev/null +++ b/site/search/search_index.json @@ -0,0 +1 @@ +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Welcome to the Ensemblex documentation! Ensemblex is an accuracy-weighted ensemble framework for genetic demultiplexing of pooled single-cell RNA seqeuncing (scRNAseq) data. 
By addressing the limitations of individual genetic demultiplexing tools, we demonstrated that Ensemblex: Achieves higher demultiplexing accuracy Limits the introduction of technical noise into scRNAseq analysis Retains a high proportion of cells for downstream analyses. The ensemble method capitalizes on the added confidence of combining distinct statistical frameworks for genetic demultiplexing, but the modular algorithm can adapt to the overall performance of its constituent tools on the respective dataset, making it resilient against a poorly performing constituent tool. Ensemblex can be used to demultiplex pools with or without prior genotype information. When demultiplexing with prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools: Demuxalot ( Rogozhnikov et al. ) Demuxlet ( Kang et al. ) Souporcell ( Heaton et al. ) Vireo-GT ( Huang et al. ) When demultiplexing without prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools: Demuxalot ( Rogozhnikov et al. ) Freemuxlet ( Kang et al. ) Souporcell ( Heaton et al. ) Vireo ( Huang et al. ) Upon demultiplexing pools with each of the four constituent genetic demultiplexing tools, Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools: Step 1 : Probabilistic-weighted ensemble Step 2 : Graph-based doublet detection Step 3 : Ensemble-independent doublet detection As output, Ensemblex returns its own cell-specific sample labels and corresponding assignment probabilities and singlet confidence score, as well as the sample labels and corresponding assignment probabilities for each of its constituents. The demultiplexed sample labels could then be used to perform downstream analyses. Figure 1. Overview of the Ensemblex workflow. 
A) The Ensemblex workflow begins with demultiplexing pooled samples by each of the constituent tools. The outputs from each individual demultiplexing tool are then used as input into the Ensemblex framework. B) The Ensemblex framework comprises three distinct steps that are assembled into a pipeline: 1) accuracy-weighted probabilistic ensemble, 2) graph-based doublet detection, and 3) ensemble-independent doublet detection. C) As output, Ensemblex returns its own sample-cell assignments as well as the sample-cell assignments of each of its constituent tools. D) Ensemblex's sample-cell assignments can be used to perform downstream analysis on the pooled scRNAseq data. To facilitate the application of Ensemblex, we provide a pipeline that demultiplexes pooled cells by each of the individual constituent genetic demultiplexing tools and processes the outputs with the Ensemblex algorithm. In this documentation, we outline each step of the Ensemblex pipeline, illustrate how to run the pipeline, define best practices, and provide a tutorial with publicly available datasets. For a comprehensive description of Ensemblex, ground-truth benchmarking, and application to real-world datasets, see our pre-print manuscript: Pre-print Contents The Ensemblex Algorithm: Ensemblex algorithm overview The Ensemblex Pipeline: Ensemblex pipeline overview Installation Step 1: Set up Step 2: Preparation of input files Step 3: Genetic demultiplexing by constituent tools Step 4: Application of Ensemblex Documentation: Execution parameters Ensemblex outputs Tutorial: Downloading data Ensemblex with prior genotype information About: Help and Feedback Acknowledgement License","title":"Home"},{"location":"#welcome-to-the-ensemblex-documentation","text":"Ensemblex is an accuracy-weighted ensemble framework for genetic demultiplexing of pooled single-cell RNA sequencing (scRNAseq) data. 
By addressing the limitations of individual genetic demultiplexing tools, we demonstrated that Ensemblex: Achieves higher demultiplexing accuracy Limits the introduction of technical noise into scRNAseq analysis Retains a high proportion of cells for downstream analyses. The ensemble method capitalizes on the added confidence of combining distinct statistical frameworks for genetic demultiplexing, but the modular algorithm can adapt to the overall performance of its constituent tools on the respective dataset, making it resilient against a poorly performing constituent tool. Ensemblex can be used to demultiplex pools with or without prior genotype information. When demultiplexing with prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools: Demuxalot ( Rogozhnikov et al. ) Demuxlet ( Kang et al. ) Souporcell ( Heaton et al. ) Vireo-GT ( Huang et al. ) When demultiplexing without prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools: Demuxalot ( Rogozhnikov et al. ) Freemuxlet ( Kang et al. ) Souporcell ( Heaton et al. ) Vireo ( Huang et al. ) Upon demultiplexing pools with each of the four constituent genetic demultiplexing tools, Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools: Step 1 : Probabilistic-weighted ensemble Step 2 : Graph-based doublet detection Step 3 : Ensemble-independent doublet detection As output, Ensemblex returns its own cell-specific sample labels and corresponding assignment probabilities and singlet confidence score, as well as the sample labels and corresponding assignment probabilities for each of its constituents. The demultiplexed sample labels could then be used to perform downstream analyses. Figure 1. Overview of the Ensemblex workflow. 
A) The Ensemblex workflow begins with demultiplexing pooled samples by each of the constituent tools. The outputs from each individual demultiplexing tool are then used as input into the Ensemblex framework. B) The Ensemblex framework comprises three distinct steps that are assembled into a pipeline: 1) accuracy-weighted probabilistic ensemble, 2) graph-based doublet detection, and 3) ensemble-independent doublet detection. C) As output, Ensemblex returns its own sample-cell assignments as well as the sample-cell assignments of each of its constituent tools. D) Ensemblex's sample-cell assignments can be used to perform downstream analysis on the pooled scRNAseq data. To facilitate the application of Ensemblex, we provide a pipeline that demultiplexes pooled cells by each of the individual constituent genetic demultiplexing tools and processes the outputs with the Ensemblex algorithm. In this documentation, we outline each step of the Ensemblex pipeline, illustrate how to run the pipeline, define best practices, and provide a tutorial with publicly available datasets. For a comprehensive description of Ensemblex, ground-truth benchmarking, and application to real-world datasets, see our pre-print manuscript: Pre-print","title":"Welcome to the Ensemblex documentation!"},{"location":"#contents","text":"The Ensemblex Algorithm: Ensemblex algorithm overview The Ensemblex Pipeline: Ensemblex pipeline overview Installation Step 1: Set up Step 2: Preparation of input files Step 3: Genetic demultiplexing by constituent tools Step 4: Application of Ensemblex Documentation: Execution parameters Ensemblex outputs Tutorial: Downloading data Ensemblex with prior genotype information About: Help and Feedback Acknowledgement License","title":"Contents"},{"location":"Acknowledgement/","text":"Acknowledgement The Ensemblex pipeline was produced for projects funded by the Canadian Institute of Health Research and Michael J. 
Fox Foundation Parkinson's Progression Markers Initiative (MJFF PPMI) in collaboration with The Neuro's Early Drug Discovery Unit (EDDU), McGill University. It is written by Michael Fiorini and Saeid Amiri with supervision from Rhalena Thomas and Sali Farhan at the Montreal Neurological Institute-Hospital. Copyright belongs to MNI BIOINFO CORE .","title":"Acknowledgement"},{"location":"Acknowledgement/#acknowledgement","text":"The Ensemblex pipeline was produced for projects funded by the Canadian Institute of Health Research and Michael J. Fox Foundation Parkinson's Progression Markers Initiative (MJFF PPMI) in collaboration with The Neuro's Early Drug Discovery Unit (EDDU), McGill University. It is written by Michael Fiorini and Saeid Amiri with supervision from Rhalena Thomas and Sali Farhan at the Montreal Neurological Institute-Hospital. Copyright belongs to MNI BIOINFO CORE .","title":"Acknowledgement"},{"location":"Dataset1/","text":"Ensemblex pipeline with prior genotype information Introduction Installation Step 1: Set up Step 2: Preparation of input files Step 3: Genetic demultiplexing by constituent tools Step 4: Application of Ensemblex Resource requirements Introduction This guide illustrates how to use the Ensemblex pipeline to demultiplex pooled scRNAseq samples with prior genotype information. Here, we will leverage a pooled scRNAseq dataset produced by Jerber et al. . This pool contains induced pluripotent cell lines (iPSC) from 9 healthy controls that were differentiated towards a dopaminergic neuron state. The Ensemblex pipeline is illustrated in the diagram below: NOTE : To download the necessary files for the tutorial please see the Downloading data section of the Ensemblex documentation. Installation [to be completed] module load StdEnv/2023 module load apptainer/1.2.4 Step 1: Set up In Step 1, we will set up the working directory for the Ensemblex pipeline and decide which version of the pipeline we want to use. 
First, create a dedicated folder for the analysis (hereafter referred to as the working directory). Then, define the path to the working directory and the path to ensemblex.pip: ## Create and navigate to the working directory cd ensemblex_tutorial mkdir working_directory cd ~/ensemblex_tutorial/working_directory ## Define the path to ensemblex.pip ensemblex_HOME=~/ensemblex.pip ## Define the path to the working directory ensemblex_PWD=~/ensemblex_tutorial/working_directory Next, we can set up the working directory and choose the Ensemblex pipeline for demultiplexing with prior genotype information ( --step init-GT ) using the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT After running the above code, the working directory should have the following structure: ensemblex_tutorial \u2514\u2500\u2500 working_directory \u251c\u2500\u2500 demuxalot \u251c\u2500\u2500 demuxlet \u251c\u2500\u2500 ensemblex_gt \u251c\u2500\u2500 input_files \u251c\u2500\u2500 job_info \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u251c\u2500\u2500 logs \u2502 \u2514\u2500\u2500 summary_report.txt \u251c\u2500\u2500 souporcell \u2514\u2500\u2500 vireo_gt Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools. Step 2: Preparation of input files In Step 2, we will define the necessary files needed for ensemblex's constituent genetic demultiplexing tools and will place them within the working directory. Note : For the tutorial we will be using the data downloaded in the Downloading data section of the Ensemblex documentation. 
First, define all of the required files: BAM=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam BAM_INDEX=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam.bai BARCODES=~/ensemblex_tutorial/CellRanger/outs/filtered_gene_bc_matrices/refdata-cellranger-GRCh37/barcodes.tsv SAMPLE_VCF=~/ensemblex_tutorial/sample_genotype/sample_genotype_merge.vcf REFERENCE_VCF=~/ensemblex_tutorial/reference_files/common_SNPs_only.recode.vcf REFERENCE_FASTA=~/ensemblex_tutorial/reference_files/genome.fa REFERENCE_FASTA_INDEX=~/ensemblex_tutorial/reference_files/genome.fa.fai Next, we will sort the pooled samples and reference .vcf files according to the .bam file and place them within the working directory: ## Sort pooled samples .vcf file bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/pooled_samples.vcf --step sort --vcf $SAMPLE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam ## Sort reference .vcf file bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/reference.vcf --step sort --vcf $REFERENCE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam NOTE : To sort the vcf files we use the pipeline produced by the authors of Demuxlet/Freemuxlet ( Kang et al. ). 
Next, we will place the remaining necessary files within the working directory: cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai After running the above code, $ensemblex_PWD/input_files should contain the following files: input_files \u251c\u2500\u2500 pooled_bam.bam \u251c\u2500\u2500 pooled_bam.bam.bai \u251c\u2500\u2500 pooled_barcodes.tsv \u251c\u2500\u2500 pooled_samples.vcf \u251c\u2500\u2500 reference.fa \u251c\u2500\u2500 reference.fa.fai \u2514\u2500\u2500 reference.vcf NOTE : It is important that the file names match those listed above as they are necessary for the Ensemblex pipeline to recognize them. Step 3: Genetic demultiplexing by constituent tools In Step 3, we will demultiplex the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools: Demuxalot Demuxlet Souporcell Vireo-GT First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for each of the constituent genetic demultiplexing tools: ## Navigate to the .ini file cd $ensemblex_PWD/job_info/configs ## Open the .ini file and adjust parameters directly in the terminal nano ensemblex_config.ini For the tutorial, we set the following parameters for the constituent genetic demultiplexing tools: Parameter Value PAR_demuxalot_genotype_names 'HPSI0115i-hecn_6,HPSI0214i-pelm_3,HPSI0314i-sojd_3,HPSI0414i-sebn_3,HPSI0514i-uenn_3,HPSI0714i-pipw_4,HPSI0715i-meue_5,HPSI0914i-vaka_5,HPSI1014i-quls_2' PAR_demuxalot_prior_strength 100 PAR_demuxalot_minimum_coverage 200 PAR_demuxalot_minimum_alternative_coverage 10 PAR_demuxalot_n_best_snps_per_donor 100 PAR_demuxalot_genotypes_prior_strength 1 PAR_demuxalot_doublet_prior 0.25 PAR_demuxlet_field GT PAR_vireo_N 9 PAR_vireo_type GT PAR_vireo_processes 20 
PAR_vireo_minMAF 0.1 PAR_vireo_minCOUNT 20 PAR_vireo_forcelearnGT T PAR_minimap2 '-ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no' PAR_freebayes '-iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6' PAR_vartrix_umi TRUE PAR_vartrix_mapq 30 PAR_vartrix_threads 8 PAR_souporcell_k 9 PAR_souporcell_t 8 Now that the parameters have been defined, we can demultiplex the pools with the constituent genetic demultiplexing tools. Demuxalot To run Demuxalot use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot If Demuxalot completed successfully, the following files should be available in $ensemblex_PWD/demuxalot : demuxalot \u251c\u2500\u2500 Demuxalot_result.csv \u2514\u2500\u2500 new_snps_single_file.betas Demuxlet To run Demuxlet use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet If Demuxlet completed successfully, the following files should be available in $ensemblex_PWD/demuxlet : demuxlet \u251c\u2500\u2500 outs.best \u251c\u2500\u2500 pileup.cel.gz \u251c\u2500\u2500 pileup.plp.gz \u251c\u2500\u2500 pileup.umi.gz \u2514\u2500\u2500 pileup.var.gz Souporcell To run Souporcell use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell If Souporcell completed successfully, the following files should be available in $ensemblex_PWD/souporcell : souporcell \u251c\u2500\u2500 alt.mtx \u251c\u2500\u2500 cluster_genotypes.vcf \u251c\u2500\u2500 clusters_tmp.tsv \u251c\u2500\u2500 clusters.tsv \u251c\u2500\u2500 fq.fq \u251c\u2500\u2500 minimap.sam \u251c\u2500\u2500 minitagged.bam \u251c\u2500\u2500 minitagged_sorted.bam \u251c\u2500\u2500 minitagged_sorted.bam.bai \u251c\u2500\u2500 Pool.vcf \u251c\u2500\u2500 ref.mtx \u2514\u2500\u2500 soup.txt Vireo To run Vireo-GT use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo If 
Vireo-GT completed successfully, the following files should be available in $ensemblex_PWD/vireo_gt : vireo_gt \u251c\u2500\u2500 cellSNP.base.vcf.gz \u251c\u2500\u2500 cellSNP.cells.vcf.gz \u251c\u2500\u2500 cellSNP.samples.tsv \u251c\u2500\u2500 cellSNP.tag.AD.mtx \u251c\u2500\u2500 cellSNP.tag.DP.mtx \u251c\u2500\u2500 cellSNP.tag.OTH.mtx \u251c\u2500\u2500 donor_ids.tsv \u251c\u2500\u2500 fig_GT_distance_estimated.pdf \u251c\u2500\u2500 fig_GT_distance_input.pdf \u251c\u2500\u2500 GT_donors.vireo.vcf.gz \u251c\u2500\u2500 _log.txt \u251c\u2500\u2500 prob_doublet.tsv.gz \u251c\u2500\u2500 prob_singlet.tsv.gz \u2514\u2500\u2500 summary.tsv Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications. NOTE : To minimize computation time for the tutorial, we have provided the necessary output files from the constituent tools here . 
To access the files and place them in the working directory, use the following code: ## Demuxalot cd $ensemblex_PWD/demuxalot wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/Demuxalot_result.csv ## Demuxlet cd $ensemblex_PWD/demuxlet wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/outs.best ## Souporcell cd $ensemblex_PWD/souporcell wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/clusters.tsv ## Vireo cd $ensemblex_PWD/vireo_gt wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/donor_ids.tsv Step 4: Application of Ensemblex In Step 4, we will process the output files of the four constituent genetic demultiplexing tools with the three-step Ensemblex algorithm: Step 1: Probabilistic-weighted ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for the Ensemblex algorithm: ## Navigate to the .ini file cd $ensemblex_PWD/job_info/configs ## Open the .ini file and adjust parameters directly in the terminal nano ensemblex_config.ini For the tutorial, we set the following parameters for the Ensemblex algorithm: Parameter Value Pool parameters PAR_ensemblex_sample_size 9 PAR_ensemblex_expected_doublet_rate 0.10 Set up parameters PAR_ensemblex_merge_constituents Yes Step 1 parameters: Probabilistic-weighted ensemble PAR_ensemblex_probabilistic_weighted_ensemble Yes Step 2 parameters: Graph-based doublet detection PAR_ensemblex_preliminary_parameter_sweep No PAR_ensemblex_nCD NULL PAR_ensemblex_pT NULL PAR_ensemblex_graph_based_doublet_detection Yes Step 3 parameters: Ensemble-independent doublet detection PAR_ensemblex_preliminary_ensemble_independent_doublet No PAR_ensemblex_ensemble_independent_doublet Yes 
PAR_ensemblex_doublet_Demuxalot_threshold Yes PAR_ensemblex_doublet_Demuxalot_no_threshold No PAR_ensemblex_doublet_Demuxlet_threshold No PAR_ensemblex_doublet_Demuxlet_no_threshold No PAR_ensemblex_doublet_Souporcell_threshold No PAR_ensemblex_doublet_Souporcell_no_threshold No PAR_ensemblex_doublet_Vireo_threshold Yes PAR_ensemblex_doublet_Vireo_no_threshold No Confidence score parameters PAR_ensemblex_compute_singlet_confidence Yes If Ensemblex completed successfully, the following files should be available in $ensemblex_PWD/ensemblex_gt : ensemblex_gt \u251c\u2500\u2500 confidence \u2502 \u2514\u2500\u2500 ensemblex_final_cell_assignment.csv \u251c\u2500\u2500 constituent_tool_merge.csv \u251c\u2500\u2500 step1 \u2502 \u251c\u2500\u2500 ARI_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 BA_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 Balanced_accuracy_summary.csv \u2502 \u2514\u2500\u2500 step1_cell_assignment.csv \u251c\u2500\u2500 step2 \u2502 \u251c\u2500\u2500 optimal_nCD.pdf \u2502 \u251c\u2500\u2500 optimal_pT.pdf \u2502 \u251c\u2500\u2500 PC1_var_contrib.pdf \u2502 \u251c\u2500\u2500 PC2_var_contrib.pdf \u2502 \u251c\u2500\u2500 PCA1_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA2_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA3_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA_plot.pdf \u2502 \u251c\u2500\u2500 PCA_scree_plot.pdf \u2502 \u2514\u2500\u2500 Step2_cell_assignment.csv \u2514\u2500\u2500 step3 \u251c\u2500\u2500 Doublet_overlap_no_threshold.pdf \u251c\u2500\u2500 Doublet_overlap_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_no_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_threshold.pdf \u2514\u2500\u2500 Step3_cell_assignment.csv Ensemblex's final assignments are described in the ensemblex_final_cell_assignment.csv file. 
Specifically, the ensemblex_assignment column describes Ensemblex's final assignments after application of the singlet confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are labelled as unassigned); we recommend that users use this column to label their cells for downstream analyses. The ensemblex_best_assignment column describes Ensemblex's best assignments, independent of the singlet confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are NOT labelled as unassigned). The cell barcodes listed under the barcode column can be used to add the ensemblex_final_cell_assignment.csv information to the metadata of a Seurat object. Resource requirements The following table describes the computational resources used in this tutorial for genetic demultiplexing by the constituent tools and application of the Ensemblex algorithm. Tool Time CPU Memory Demuxalot 01:34:59 6 12.95 GB Demuxlet 03:16:03 6 138.32 GB Souporcell 2-14:49:21 1 21.83 GB Vireo 2-01:30:24 6 29.42 GB Ensemblex 02:05:27 1 5.67 GB","title":"Ensemblex with prior genotype information"},{"location":"Dataset1/#ensemblex-pipeline-with-prior-genotype-information","text":"Introduction Installation Step 1: Set up Step 2: Preparation of input files Step 3: Genetic demultiplexing by constituent tools Step 4: Application of Ensemblex Resource requirements","title":"Ensemblex pipeline with prior genotype information"},{"location":"Dataset1/#introduction","text":"This guide illustrates how to use the Ensemblex pipeline to demultiplex pooled scRNAseq samples with prior genotype information. Here, we will leverage a pooled scRNAseq dataset produced by Jerber et al. . This pool contains induced pluripotent cell lines (iPSC) from 9 healthy controls that were differentiated towards a dopaminergic neuron state. 
The Ensemblex pipeline is illustrated in the diagram below: NOTE : To download the necessary files for the tutorial please see the Downloading data section of the Ensemblex documentation.","title":"Introduction"},{"location":"Dataset1/#installation","text":"[to be completed] module load StdEnv/2023 module load apptainer/1.2.4","title":"Installation"},{"location":"Dataset1/#step-1-set-up","text":"In Step 1, we will set up the working directory for the Ensemblex pipeline and decide which version of the pipeline we want to use. First, create a dedicated folder for the analysis (hereafter referred to as the working directory). Then, define the path to the working directory and the path to ensemblex.pip: ## Create and navigate to the working directory cd ensemblex_tutorial mkdir working_directory cd ~/ensemblex_tutorial/working_directory ## Define the path to ensemblex.pip ensemblex_HOME=~/ensemblex.pip ## Define the path to the working directory ensemblex_PWD=~/ensemblex_tutorial/working_directory Next, we can set up the working directory and choose the Ensemblex pipeline for demultiplexing with prior genotype information ( --step init-GT ) using the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT After running the above code, the working directory should have the following structure: ensemblex_tutorial \u2514\u2500\u2500 working_directory \u251c\u2500\u2500 demuxalot \u251c\u2500\u2500 demuxlet \u251c\u2500\u2500 ensemblex_gt \u251c\u2500\u2500 input_files \u251c\u2500\u2500 job_info \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u251c\u2500\u2500 logs \u2502 \u2514\u2500\u2500 summary_report.txt \u251c\u2500\u2500 souporcell \u2514\u2500\u2500 vireo_gt Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools.","title":"Step 1: Set 
up"},{"location":"Dataset1/#step-2-preparation-of-input-files","text":"In Step 2, we will define the necessary files needed for ensemblex's constituent genetic demultiplexing tools and will place them within the working directory. Note : For the tutorial we will be using the data downloaded in the Downloading data section of the Ensemblex documentation. First, define all of the required files: BAM=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam BAM_INDEX=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam.bai BARCODES=~/ensemblex_tutorial/CellRanger/outs/filtered_gene_bc_matrices/refdata-cellranger-GRCh37/barcodes.tsv SAMPLE_VCF=~/ensemblex_tutorial/sample_genotype/sample_genotype_merge.vcf REFERENCE_VCF=~/ensemblex_tutorial/reference_files/common_SNPs_only.recode.vcf REFERENCE_FASTA=~/ensemblex_tutorial/reference_files/genome.fa REFERENCE_FASTA_INDEX=~/ensemblex_tutorial/reference_files/genome.fa.fai Next, we will sort the pooled samples and reference .vcf files according to the .bam file and place them within the working directory: ## Sort pooled samples .vcf file bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/pooled_samples.vcf --step sort --vcf $SAMPLE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam ## Sort reference .vcf file bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/reference.vcf --step sort --vcf $REFERENCE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam NOTE : To sort the vcf files we use the pipeline produced by the authors of Demuxlet/Freemuxlet ( Kang et al. ). 
Next, we will place the remaining necessary files within the working directory: cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai After running the above code, $ensemblex_PWD/input_files should contain the following files: input_files \u251c\u2500\u2500 pooled_bam.bam \u251c\u2500\u2500 pooled_bam.bam.bai \u251c\u2500\u2500 pooled_barcodes.tsv \u251c\u2500\u2500 pooled_samples.vcf \u251c\u2500\u2500 reference.fa \u251c\u2500\u2500 reference.fa.fai \u2514\u2500\u2500 reference.vcf NOTE : It is important that the file names match those listed above as they are necessary for the Ensemblex pipeline to recognize them.","title":"Step 2: Preparation of input files"},{"location":"Dataset1/#step-3-genetic-demultiplexing-by-constituent-tools","text":"In Step 3, we will demultiplex the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools: Demuxalot Demuxlet Souporcell Vireo-GT First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for each of the constituent genetic demultiplexing tools: ## Navigate to the .ini file cd $ensemblex_PWD/job_info/configs ## Open the .ini file and adjust parameters directly in the terminal nano ensemblex_config.ini For the tutorial, we set the following parameters for the constituent genetic demultiplexing tools: Parameter Value PAR_demuxalot_genotype_names 'HPSI0115i-hecn_6,HPSI0214i-pelm_3,HPSI0314i-sojd_3,HPSI0414i-sebn_3,HPSI0514i-uenn_3,HPSI0714i-pipw_4,HPSI0715i-meue_5,HPSI0914i-vaka_5,HPSI1014i-quls_2' PAR_demuxalot_prior_strength 100 PAR_demuxalot_minimum_coverage 200 PAR_demuxalot_minimum_alternative_coverage 10 PAR_demuxalot_n_best_snps_per_donor 100 PAR_demuxalot_genotypes_prior_strength 1 PAR_demuxalot_doublet_prior 0.25 
PAR_demuxlet_field GT PAR_vireo_N 9 PAR_vireo_type GT PAR_vireo_processes 20 PAR_vireo_minMAF 0.1 PAR_vireo_minCOUNT 20 PAR_vireo_forcelearnGT T PAR_minimap2 '-ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no' PAR_freebayes '-iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6' PAR_vartrix_umi TRUE PAR_vartrix_mapq 30 PAR_vartrix_threads 8 PAR_souporcell_k 9 PAR_souporcell_t 8 Now that the parameters have been defined, we can demultiplex the pools with the constituent genetic demultiplexing tools.","title":"Step 3: Genetic demultiplexing by constituent tools"},{"location":"Dataset1/#demuxalot","text":"To run Demuxalot use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot If Demuxalot completed successfully, the following files should be available in $ensemblex_PWD/demuxalot : demuxalot \u251c\u2500\u2500 Demuxalot_result.csv \u2514\u2500\u2500 new_snps_single_file.betas","title":"Demuxalot"},{"location":"Dataset1/#demuxlet","text":"To run Demuxlet use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet If Demuxlet completed successfully, the following files should be available in $ensemblex_PWD/demuxlet : demuxlet \u251c\u2500\u2500 outs.best \u251c\u2500\u2500 pileup.cel.gz \u251c\u2500\u2500 pileup.plp.gz \u251c\u2500\u2500 pileup.umi.gz \u2514\u2500\u2500 pileup.var.gz","title":"Demuxlet"},{"location":"Dataset1/#souporcell","text":"To run Souporcell use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell If Souporcell completed successfully, the following files should be available in $ensemblex_PWD/souporcell : souporcell \u251c\u2500\u2500 alt.mtx \u251c\u2500\u2500 cluster_genotypes.vcf \u251c\u2500\u2500 clusters_tmp.tsv \u251c\u2500\u2500 clusters.tsv \u251c\u2500\u2500 fq.fq \u251c\u2500\u2500 minimap.sam \u251c\u2500\u2500 minitagged.bam 
\u251c\u2500\u2500 minitagged_sorted.bam \u251c\u2500\u2500 minitagged_sorted.bam.bai \u251c\u2500\u2500 Pool.vcf \u251c\u2500\u2500 ref.mtx \u2514\u2500\u2500 soup.txt","title":"Souporcell"},{"location":"Dataset1/#vireo","text":"To run Vireo-GT use the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo If Vireo-GT completed successfully, the following files should be available in $ensemblex_PWD/vireo_gt : vireo_gt \u251c\u2500\u2500 cellSNP.base.vcf.gz \u251c\u2500\u2500 cellSNP.cells.vcf.gz \u251c\u2500\u2500 cellSNP.samples.tsv \u251c\u2500\u2500 cellSNP.tag.AD.mtx \u251c\u2500\u2500 cellSNP.tag.DP.mtx \u251c\u2500\u2500 cellSNP.tag.OTH.mtx \u251c\u2500\u2500 donor_ids.tsv \u251c\u2500\u2500 fig_GT_distance_estimated.pdf \u251c\u2500\u2500 fig_GT_distance_input.pdf \u251c\u2500\u2500 GT_donors.vireo.vcf.gz \u251c\u2500\u2500 _log.txt \u251c\u2500\u2500 prob_doublet.tsv.gz \u251c\u2500\u2500 prob_singlet.tsv.gz \u2514\u2500\u2500 summary.tsv Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications. NOTE : To minimize computation time for the tutorial, we have provided the necessary output files from the constituent tools here . 
To access the files and place them in the working directory, use the following code: ## Demuxalot cd $ensemblex_PWD/demuxalot wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/Demuxalot_result.csv ## Demuxlet cd $ensemblex_PWD/demuxlet wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/outs.best ## Souporcell cd $ensemblex_PWD/souporcell wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/clusters.tsv ## Vireo cd $ensemblex_PWD/vireo_gt wget https://github.com/neurobioinfo/ensemblex/raw/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/donor_ids.tsv","title":"Vireo"},{"location":"Dataset1/#step-4-application-of-ensemblex","text":"In Step 4, we will process the output files of the four constituent genetic demultiplexing tools with the three-step Ensemblex algorithm: Step 1: Probabilistic-weighted ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for the Ensemblex algorithm: ## Navigate to the .ini file cd $ensemblex_PWD/job_info/configs ## Open the .ini file and adjust parameters directly in the terminal nano ensemblex_config.ini For the tutorial, we set the following parameters for the Ensemblex algorithm: Parameter Value Pool parameters PAR_ensemblex_sample_size 9 PAR_ensemblex_expected_doublet_rate 0.10 Set up parameters PAR_ensemblex_merge_constituents Yes Step 1 parameters: Probabilistic-weighted ensemble PAR_ensemblex_probabilistic_weighted_ensemble Yes Step 2 parameters: Graph-based doublet detection PAR_ensemblex_preliminary_parameter_sweep No PAR_ensemblex_nCD NULL PAR_ensemblex_pT NULL PAR_ensemblex_graph_based_doublet_detection Yes Step 3 parameters: Ensemble-independent doublet detection PAR_ensemblex_preliminary_ensemble_independent_doublet No 
PAR_ensemblex_ensemble_independent_doublet Yes PAR_ensemblex_doublet_Demuxalot_threshold Yes PAR_ensemblex_doublet_Demuxalot_no_threshold No PAR_ensemblex_doublet_Demuxlet_threshold No PAR_ensemblex_doublet_Demuxlet_no_threshold No PAR_ensemblex_doublet_Souporcell_threshold No PAR_ensemblex_doublet_Souporcell_no_threshold No PAR_ensemblex_doublet_Vireo_threshold Yes PAR_ensemblex_doublet_Vireo_no_threshold No Confidence score parameters PAR_ensemblex_compute_singlet_confidence Yes If Ensemblex completed successfully, the following files should be available in $ensemblex_PWD/ensemblex_gt : ensemblex_gt \u251c\u2500\u2500 confidence \u2502 \u2514\u2500\u2500 ensemblex_final_cell_assignment.csv \u251c\u2500\u2500 constituent_tool_merge.csv \u251c\u2500\u2500 step1 \u2502 \u251c\u2500\u2500 ARI_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 BA_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 Balanced_accuracy_summary.csv \u2502 \u2514\u2500\u2500 step1_cell_assignment.csv \u251c\u2500\u2500 step2 \u2502 \u251c\u2500\u2500 optimal_nCD.pdf \u2502 \u251c\u2500\u2500 optimal_pT.pdf \u2502 \u251c\u2500\u2500 PC1_var_contrib.pdf \u2502 \u251c\u2500\u2500 PC2_var_contrib.pdf \u2502 \u251c\u2500\u2500 PCA1_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA2_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA3_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA_plot.pdf \u2502 \u251c\u2500\u2500 PCA_scree_plot.pdf \u2502 \u2514\u2500\u2500 Step2_cell_assignment.csv \u2514\u2500\u2500 step3 \u251c\u2500\u2500 Doublet_overlap_no_threshold.pdf \u251c\u2500\u2500 Doublet_overlap_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_no_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_threshold.pdf \u2514\u2500\u2500 Step3_cell_assignment.csv Ensemblex's final assignments are described in the ensemblex_final_cell_assignment.csv file. 
Specifically, the ensemblex_assignment column describes Ensemblex's final assignments after application of the singlet confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are labelled as unassigned); we recommend that users use this column to label their cells for downstream analyses. The ensemblex_best_assignment column describes Ensemblex's best assignments, independent of the singlet confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are NOT labelled as unassigned). The cell barcodes listed under the barcode column can be used to add the ensemblex_final_cell_assignment.csv information to the metadata of a Seurat object.","title":"Step 4: Application of Ensemblex"},{"location":"Dataset1/#resource-requirements","text":"The following table describes the computational resources used in this tutorial for genetic demultiplexing by the constituent tools and application of the Ensemblex algorithm. Tool Time CPU Memory Demuxalot 01:34:59 6 12.95 GB Demuxlet 03:16:03 6 138.32 GB Souporcell 2-14:49:21 1 21.83 GB Vireo 2-01:30:24 6 29.42 GB Ensemblex 02:05:27 1 5.67 GB","title":"Resource requirements"},{"location":"Dataset2/","text":"HTO analysis track: PBMC dataset Contents Introduction Downloading the pbmc dataset Installation scrnabox.slurm installation CellRanger installation R library preparation and R package installation scRNAbox: HTO Analysis Track Step 0: Set up Step 1: FASTQ to gene expression matrix Step 2: Create Seurat object and remove ambient RNA Step 3: Quality control and filtering Step 4: Demultiplexing and doublet detection Publication-ready figures Job Configurations Introduction This guide illustrates the steps taken for our analysis of the PBMC dataset in our pre-print manuscript . Here, we are using the HTO analysis track of scRNAbox to analyze a publicly available scRNAseq dataset produced by Stoeckius et al. . 
This data set describes peripheral blood mononuclear cells (PBMC) from eight human donors, which were tagged with sample-specific barcodes, pooled, and sequenced together in a single run. Downloading the PBMC dataset If you want to use the PBMC dataset to test the scRNAbox pipeline, please see here for detailed instructions on how to download the publicly available data. Installation scrnabox.slurm installation To download the latest version of scrnabox.slurm (v0.1.52.50) run the following command: wget https://github.com/neurobioinfo/scrnabox/releases/download/v0.1.52.5/scrnabox.slurm.zip unzip scrnabox.slurm.zip For a description of the options for running scrnabox.slurm run the following command: bash /pathway/to/scrnabox.slurm/launch_scrnabox.sh -h If scrnabox.slurm has been installed properly, the above command should return the following: scrnabox pipeline version 0.1.52.50 ------------------- mandatory arguments: -d (--dir) = Working directory (where all the outputs will be printed) (give full path) --steps = Specify what steps, e.g., 2 to run step 2. 2-6, run steps 2 through 6 optional arguments: -h (--help) = See helps regarding the pipeline arguments. --method = Select your preferred method: HTO and SCRNA for hashtag, and Standard scRNA, respectively. --msd = You can get the hashtag labels by running the following code (HTO Step 4). --markergsea = Identify marker genes for each cluster and run marker gene set enrichment analysis (GSEA) using EnrichR libraries (Step 7). --knownmarkers = Profile the individual or aggregated expression of known marker genes. --referenceannotation = Generate annotation predictions based on the annotations of a reference Seurat object (Step 7). --annotate = Add clustering annotations to Seurat object metadata (Step 7). --addmeta = Add metadata columns to the Seurat object (Step 8). --rundge = Perform differential gene expression contrasts (Step 8). 
--seulist = You can directly call the list of Seurat objects to the pipeline. --rcheck = You can identify which libraries are not installed. ------------------- For a comprehensive help, visit https://neurobioinfo.github.io/scrnabox/site/ for documentation. CellRanger installation For information regarding the installation of CellRanger, please visit the 10X Genomics documentation . If CellRanger is already installed on your HPC system, you may skip the CellRanger installation procedures. For our analysis of the midbrain dataset we used the 10XGenomics GRCh38-3.0.0 reference genome and CellRanger v5.0.1. For more information regarding how to prepare reference genomes for the CellRanger counts pipeline, please see the 10X Genomics documentation . R library preparation and R package installation We must prepare a common R library where we will load all of the required R packages. If the required R packages are already installed on your HPC system in a common R library, you may skip the following procedures. We will first install R . The analyses presented in our pre-print manuscript were conducted using v4.2.1. # install R module load r/4.2.1 Then, we will run the installation code, which creates a directory where the R packages will be loaded and will install the required R packages: # Folder for R packages R_PATH=~/path/to/R/library mkdir -p $R_PATH # Install package Rscript ./scrnabox.slurm/soft/R/install_packages.R $R_PATH scRNAbox pipeline Step 0: Set up Now that scrnabox.slurm , CellRanger , R , and the required R packages have been installed, we can proceed to our analysis with the scRNAbox pipeline. 
We will create a pipeline folder designated for the analysis and run Step 0, selecting the HTO analysis track ( --method HTO ), using the following code: mkdir pipeline cd pipeline export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 0 \\ --method HTO Next, we will navigate to the scrnabox_config.ini file in ~/pipeline/job_info/configs to define the HPC account holder ( ACCOUNT ), the path to the environmental module ( MODULEUSE ), the path to CellRanger from the environmental module directory ( CELLRANGER ), CellRanger version ( CELLRANGER_VERSION ), R version ( R_VERSION ), and the path to the R library ( R_LIB_PATH ): cd ~/pipeline/job_info/configs nano scrnabox_config.ini ACCOUNT=account-name MODULEUSE=/path/to/environmental/module CELLRANGER=/path/to/cellranger/from/module/directory CELLRANGER_VERSION=5.0.1 R_VERSION=4.2.1 R_LIB_PATH=/path/to/R/library Next, we can check to see if all of the required R packages have been properly installed using the following command: bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 0 \\ --rcheck Step 1: FASTQ to gene expression matrix In Step 1, we will run the CellRanger counts pipeline to generate feature-barcode expression matrices from the FASTQ files. While it is possible to manually prepare the library.csv and feature_ref.csv files for the sequencing run prior to running Step 1, for this analysis we are going to opt for automated library preparation. For more information regarding the manual preparation of library.csv and feature_ref.csv files, please see the CellRanger library preparation tutorial. 
For our analysis of the PBMC dataset we set the following execution parameters for Step 1 ( ~/pipeline/job_info/parameters/step1_par.txt ): Parameter Value par_automated_library_prep Yes par_fastq_directory /path/to/directory/contaning/fastqs par_RNA_run_names run1GEX par_HTO_run_names run1HTO par_seq_run_names run1 par_paired_end_seq Yes par_id Hash1, Hash2, Hash3, Hash4, Hash5, Hash6, Hash7, Hash8 par_name A_TotalSeqA, B_TotalSeqA, C_TotalSeqA, D_TotalSeqA, E_TotalSeqA, F_TotalSeqA, G_TotalSeqA, H_TotalSeqA par_read R2 par_pattern 5P(BC) par_sequence AGGACCATCCAA, ACATGTTACCGT, AGCTTACTATCC, TCGATAATGCGA, GAGGCTGAGCTA, GTGTGACGTATT, ACTGTCTAACGG, TATCACATCGGT par_ref_dir_grch ~/genome/10xGenomics/refdata-cellranger-GRCh38-3.0.0 par_r1_length NULL (commented out) par_r2_length NULL (commented out) par_mempercode 30 par_include_introns NULL (commented out) par_no_target_umi_filter NULL (commented out) par_expect_cells NULL (commented out) par_force_cells NULL (commented out) par_no_bam NULL (commented out) Note: The parameters file for each step is located in ~/pipeline/job_info/parameters . For a comprehensive description of the execution parameters for each step see here . Given that CellRanger runs a user interface and is not submitted as a Job, it is recommended to run Step 1 in a 'screen' session, which will allow the task to keep running if the connection is broken. To run Step 1, use the following command: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline screen -S run_PBMC_application_case bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 1 The outputs of the CellRanger counts pipeline are deposited into ~/pipeline/step1 . Step 2: Create Seurat object and remove ambient RNA In Step 2, we are going to begin by correcting the RNA assay for ambient RNA removal using SoupX ( Young et al. 2020 ). We will then use the ambient RNA-corrected feature-barcode matrices to create a Seurat object. 
For our analysis of the PBMC dataset we set the following execution parameters for Step 2 ( ~/pipeline/job_info/parameters/step2_par.txt ): Parameter Value par_save_RNA Yes par_save_metadata Yes par_ambient_RNA Yes par_normalization.method LogNormalize par_scale.factor 10000 par_selection.method vst par_nfeatures 2500 We can run Step 2 using the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 2 Step 2 produces the following outputs: ~/pipeline step2 \u251c\u2500\u2500 figs2 \u2502 \u251c\u2500\u2500 ambient_RNA_estimation_run1.pdf \u2502 \u251c\u2500\u2500 ambient_RNA_markers_run1.pdf \u2502 \u251c\u2500\u2500 cell_cyle_dim_plot_run1.pdf \u2502 \u251c\u2500\u2500 vioplot_run1.pdf \u2502 \u2514\u2500\u2500 zoomed_in_vioplot_run1.pdf \u251c\u2500\u2500 info2 \u2502 \u251c\u2500\u2500 estimated_ambient_RNA_run1.txt \u2502 \u251c\u2500\u2500 MetaData_1.txt \u2502 \u251c\u2500\u2500 meta_info_1.txt \u2502 \u251c\u2500\u2500 run1_ambient_rna_summary.rds \u2502 \u251c\u2500\u2500 sessionInfo.txt \u2502 \u251c\u2500\u2500 seu1_RNA.txt \u2502 \u2514\u2500\u2500 summary_seu1.txt \u251c\u2500\u2500 objs2 \u2502 \u2514\u2500\u2500 run1.rds \u2514\u2500\u2500 step2_ambient \u2514\u2500\u2500 run1 \u251c\u2500\u2500 barcodes.tsv \u251c\u2500\u2500 genes.tsv \u2514\u2500\u2500 matrix.mtxs Note: For a comprehensive description of the outputs for each analytical step, please see the Outputs section of the scRNAbox documentation. Figure 1. Figures produced by Step 2 of the scRNAbox pipeline. A) Estimated ambient RNA contamination rate (Rho) by SoupX. Estimates of the RNA contamination rate using various estimators are visualized via a frequency distribution; the true contamination rate is assigned as the most frequent estimate (red line; 8.7%). B) Log10 ratios of observed counts to expected counts for marker genes from each cluster. 
Clusters are defined by the CellRanger counts pipeline. The red line displays the estimated RNA contamination rate if the estimation was based entirely on the corresponding gene. C) Principal component analysis (PCA) of Seurat S and G2M cell cycle reference genes. D) Violin plots showing the distribution of cells according to quality control metrics calculated in Step 2. E) Zoomed in violin plots, from the minimum to the mean, showing the distribution of cells according to quality control metrics calculated in Step 2. Step 3: Quality control and filtering In Step 3, we are going to perform quality control procedures and filter out low-quality cells. We are going to filter out cells with < 50 unique RNA transcripts, > 6000 unique RNA transcripts, < 200 total RNA transcripts, > 7000 total RNA transcripts, and > 50% mitochondria. For our analysis of the PBMC dataset we set the following execution parameters for Step 3 ( ~/pipeline/job_info/parameters/step3_par.txt ): Parameter Value par_save_RNA Yes par_save_metadata Yes par_seurat_object NULL par_nFeature_RNA_L 50 par_nFeature_RNA_U 6000 par_nCount_RNA_L 200 par_nCount_RNA_U 7000 par_mitochondria_percent_L 0 par_mitochondria_percent_U 50 par_ribosomal_percent_L 0 par_ribosomal_percent_U 100 par_remove_mitochondrial_genes No par_remove_ribosomal_genes No par_remove_genes NULL par_regress_cell_cycle_genes Yes par_normalization.method LogNormalize par_scale.factor 10000 par_selection.method vst par_nfeatures 2500 par_top 10 par_npcs_pca 30 We can run Step 3 using the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 3 Step 3 produces the following outputs. 
step3 \u251c\u2500\u2500 figs3 \u2502 \u251c\u2500\u2500 dimplot_pca_run1.pdf \u2502 \u251c\u2500\u2500 elbowplot_run1.pdf \u2502 \u251c\u2500\u2500 filtered_QC_vioplot_run1.pdf \u2502 \u2514\u2500\u2500 VariableFeaturePlot_run1.pdf \u251c\u2500\u2500 info3 \u2502 \u251c\u2500\u2500 MetaData_run1.txt \u2502 \u251c\u2500\u2500 meta_info_run1.txt \u2502 \u251c\u2500\u2500 most_variable_genes_run1.txt \u2502 \u251c\u2500\u2500 run1_RNA.txt \u2502 \u251c\u2500\u2500 sessionInfo.txt \u2502 \u2514\u2500\u2500 summary_run1.txt \u2514\u2500\u2500 objs3 \u2514\u2500\u2500 run1.rds Figure 2. Figures produced by Step 3 of the scRNAbox pipeline. A) Violin plots showing the distribution of cells according to quality control metrics after filtering by user-defined thresholds. B) Scatter plot showing the top 2500 most variable features; the top 10 most variable features are labelled. C) Principal component analysis (PCA) visualizing the first two principal components (PCs). D) Elbow plot to visualize the percentage of variance explained by each PC. Step 4: Demultiplexing and doublet detection In Step 4, we are going to demultiplex the pooled samples and remove doublets (erroneous libraries produced by two or more cells) based on the expression of the sample-specific barcodes (antibody assay). If the barcode labels used in the analysis are unknown, the first step is to retrieve them from the Seurat object. To do this, we do not need to modify the execution parameters and can go straight to running the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 4 \\ --msd T The above code produces the following file: step4 \u251c\u2500\u2500 figs4 \u251c\u2500\u2500 info4 \u2502 \u2514\u2500\u2500 seu1.rds_old_antibody_label_MULTIseqDemuxHTOcounts.csv \u2514\u2500\u2500 objs4 Which contains the names of the barcode labels (i.e. 
A_TotalSeqA , B_TotalSeqA , C_TotalSeqA , D_TotalSeqA , E_TotalSeqA , F_TotalSeqA , G_TotalSeqA , H_TotalSeqA , Doublet , Negative ). Now that we know the barcode labels used in the PBMC dataset, we can perform demultiplexing and doublet detection. For our analysis of the PBMC dataset we set the following execution parameters for Step 4 ( ~/pipeline/job_info/parameters/step4_par.txt ): Parameter Value par_save_RNA Yes par_save_metadata Yes par_normalization.method CLR par_scale.factor 10000 par_selection.method vst par_nfeatures 2500 par_dimensionality_reduction Yes par_npcs_pca 30 par_dims_umap 3 par_n.neighbor 65 par_dropDN Yes par_label_dropDN Doublet, Negative par_quantile 0.9 par_autoThresh TRUE par_maxiter 5 par_RidgePlot_ncol 3 par_old_antibody_label A-TotalSeqA, B-TotalSeqA, C-TotalSeqA, D-TotalSeqA, E-TotalSeqA, F-TotalSeqA, G-TotalSeqA, H-TotalSeqA, Doublet par_new_antibody_label sample-A, sample-B, sample-C, sample-D, sample-E, sample-F, sample-G, sample-H, Doublet We can run Step 4 using the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 4 Step 4 produces the following outputs. step4 \u251c\u2500\u2500 figs4 \u2502 \u251c\u2500\u2500 run1_DotPlot_HTO_MSD.pdf \u2502 \u251c\u2500\u2500 run1_Heatmap_HTO_MSD.pdf \u2502 \u251c\u2500\u2500 run1_HTO_dimplot_pca.pdf \u2502 \u251c\u2500\u2500 run1_HTO_dimplot_umap.pdf \u2502 \u251c\u2500\u2500 run1_nCounts_RNA_MSD.pdf \u2502 \u2514\u2500\u2500 run1_Ridgeplot_HTO_MSD.pdf \u251c\u2500\u2500 info4 \u2502 \u251c\u2500\u2500 run1_filtered_MULTIseqDemuxHTOcounts.csv \u2502 \u251c\u2500\u2500 run1_MetaData.txt \u2502 \u251c\u2500\u2500 run1_meta_info_.txt \u2502 \u251c\u2500\u2500 run1_MULTIseqDemuxHTOcounts.csv \u2502 \u251c\u2500\u2500 run1_RNA.txt \u2502 \u2514\u2500\u2500 sessionInfo.txt \u2514\u2500\u2500 objs4 \u2514\u2500\u2500 run1.rds Figure 3. 
Figures produced by Step 4 of the Cell Hashtag Analysis Track. A) Uniform Manifold Approximation and Projection (UMAP) plot, taking the first three principal components (PC) of the antibody assay as input. B) Principal component analysis (PCA) showing the first two PCs of the antibody assay. C) Ridge plot visualizing the enrichment of barcode labels across sample assignments at the sample level. D) Dot plot visualizing the enrichment of barcode labels across sample assignments at the sample level. E) Heatmap visualizing the enrichment of barcode labels across sample assignments at the cell level. F) Violin plot visualizing the distribution of the number of total RNA transcripts identified per cell, stratified by sample assignment. Publication-ready figures The code used to produce the publication-ready figures used in our pre-print manuscript is available here . Job Configurations The following job configurations were used for our analysis of the PBMC dataset. Job Configurations can be modified for each analytical step in the scrnabox_config.ini file in ~/pipeline/job_info/configs Step THREADS_ARRAY MEM_ARRAY WALLTIME_ARRAY Step2 4 16g 00-05:00 Step3 4 16g 00-05:00 Step4 4 16g 00-05:00","title":"Run pipeline on processed data"},{"location":"Dataset2/#hto-analysis-track-pbmc-dataset","text":"","title":"HTO analysis track: PBMC dataset"},{"location":"Dataset2/#contents","text":"Introduction Downloading the pbmc dataset Installation scrnabox.slurm installation CellRanger installation R library preparation and R package installation scRNAbox: HTO Analysis Track Step 0: Set up Step 1: FASTQ to gene expression matrix Step 2: Create Seurat object and remove ambient RNA Step 3: Quality control and filtering Step 4: Demultiplexing and doublet detection Publication-ready figures Job Configurations","title":"Contents"},{"location":"Dataset2/#introduction","text":"This guide illustrates the steps taken for our analysis of the PBMC dataset in our pre-print manuscript . 
Here, we are using the HTO analysis track of scRNAbox to analyze a publicly available scRNAseq dataset produced by Stoeckius et al. . This data set describes peripheral blood mononuclear cells (PBMC) from eight human donors, which were tagged with sample-specific barcodes, pooled, and sequenced together in a single run.","title":"Introduction"},{"location":"Dataset2/#downloading-the-pbmc-dataset","text":"If you want to use the PBMC dataset to test the scRNAbox pipeline, please see here for detailed instructions on how to download the publicly available data.","title":"Downloading the PBMC dataset"},{"location":"Dataset2/#installation","text":"","title":"Installation"},{"location":"Dataset2/#scrnaboxslurm-installation","text":"To download the latest version of scrnabox.slurm (v0.1.52.50) run the following command: wget https://github.com/neurobioinfo/scrnabox/releases/download/v0.1.52.5/scrnabox.slurm.zip unzip scrnabox.slurm.zip For a description of the options for running scrnabox.slurm run the following command: bash /pathway/to/scrnabox.slurm/launch_scrnabox.sh -h If scrnabox.slurm has been installed properly, the above command should return the following: scrnabox pipeline version 0.1.52.50 ------------------- mandatory arguments: -d (--dir) = Working directory (where all the outputs will be printed) (give full path) --steps = Specify what steps, e.g., 2 to run step 2. 2-6, run steps 2 through 6 optional arguments: -h (--help) = See helps regarding the pipeline arguments. --method = Select your preferred method: HTO and SCRNA for hashtag, and Standard scRNA, respectively. --msd = You can get the hashtag labels by running the following code (HTO Step 4). --markergsea = Identify marker genes for each cluster and run marker gene set enrichment analysis (GSEA) using EnrichR libraries (Step 7). --knownmarkers = Profile the individual or aggregated expression of known marker genes. 
--referenceannotation = Generate annotation predictions based on the annotations of a reference Seurat object (Step 7). --annotate = Add clustering annotations to Seurat object metadata (Step 7). --addmeta = Add metadata columns to the Seurat object (Step 8). --rundge = Perform differential gene expression contrasts (Step 8). --seulist = You can directly call the list of Seurat objects to the pipeline. --rcheck = You can identify which libraries are not installed. ------------------- For a comprehensive help, visit https://neurobioinfo.github.io/scrnabox/site/ for documentation.","title":"scrnabox.slurm installation"},{"location":"Dataset2/#cellranger-installation","text":"For information regarding the installation of CellRanger, please visit the 10X Genomics documentation . If CellRanger is already installed on your HPC system, you may skip the CellRanger installation procedures. For our analysis of the midbrain dataset we used the 10XGenomics GRCh38-3.0.0 reference genome and CellRanger v5.0.1. For more information regarding how to prepare reference genomes for the CellRanger counts pipeline, please see the 10X Genomics documentation .","title":"CellRanger installation"},{"location":"Dataset2/#r-library-preparation-and-r-package-installation","text":"We must prepare a common R library where we will load all of the required R packages. If the required R packages are already installed on your HPC system in a common R library, you may skip the following procedures. We will first install R . The analyses presented in our pre-print manuscript were conducted using v4.2.1. 
# install R module load r/4.2.1 Then, we will run the installation code, which creates a directory where the R packages will be loaded and will install the required R packages: # Folder for R packages R_PATH=~/path/to/R/library mkdir -p $R_PATH # Install package Rscript ./scrnabox.slurm/soft/R/install_packages.R $R_PATH","title":"R library preparation and R package installation"},{"location":"Dataset2/#scrnabox-pipeline","text":"","title":"scRNAbox pipeline"},{"location":"Dataset2/#step-0-set-up","text":"Now that scrnabox.slurm , CellRanger , R , and the required R packages have been installed, we can proceed to our analysis with the scRNAbox pipeline. We will create a pipeline folder designated for the analysis and run Step 0, selecting the HTO analysis track ( --method HTO ), using the following code: mkdir pipeline cd pipeline export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 0 \\ --method HTO Next, we will navigate to the scrnabox_config.ini file in ~/pipeline/job_info/configs to define the HPC account holder ( ACCOUNT ), the path to the environmental module ( MODULEUSE ), the path to CellRanger from the environmental module directory ( CELLRANGER ), CellRanger version ( CELLRANGER_VERSION ), R version ( R_VERSION ), and the path to the R library ( R_LIB_PATH ): cd ~/pipeline/job_info/configs nano scrnabox_config.ini ACCOUNT=account-name MODULEUSE=/path/to/environmental/module CELLRANGER=/path/to/cellranger/from/module/directory CELLRANGER_VERSION=5.0.1 R_VERSION=4.2.1 R_LIB_PATH=/path/to/R/library Next, we can check to see if all of the required R packages have been properly installed using the following command: bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 0 \\ --rcheck","title":"Step 0: Set up"},{"location":"Dataset2/#step-1-fastq-to-gene-expression-matrix","text":"In Step 1, we will run the CellRanger counts pipeline to generate 
feature-barcode expression matrices from the FASTQ files. While it is possible to manually prepare the library.csv and feature_ref.csv files for the sequencing run prior to running Step 1, for this analysis we are going to opt for automated library preparation. For more information regarding the manual preparation of library.csv and feature_ref.csv files, please see the CellRanger library preparation tutorial. For our analysis of the PBMC dataset we set the following execution parameters for Step 1 ( ~/pipeline/job_info/parameters/step1_par.txt ): Parameter Value par_automated_library_prep Yes par_fastq_directory /path/to/directory/contaning/fastqs par_RNA_run_names run1GEX par_HTO_run_names run1HTO par_seq_run_names run1 par_paired_end_seq Yes par_id Hash1, Hash2, Hash3, Hash4, Hash5, Hash6, Hash7, Hash8 par_name A_TotalSeqA, B_TotalSeqA, C_TotalSeqA, D_TotalSeqA, E_TotalSeqA, F_TotalSeqA, G_TotalSeqA, H_TotalSeqA par_read R2 par_pattern 5P(BC) par_sequence AGGACCATCCAA, ACATGTTACCGT, AGCTTACTATCC, TCGATAATGCGA, GAGGCTGAGCTA, GTGTGACGTATT, ACTGTCTAACGG, TATCACATCGGT par_ref_dir_grch ~/genome/10xGenomics/refdata-cellranger-GRCh38-3.0.0 par_r1_length NULL (commented out) par_r2_length NULL (commented out) par_mempercode 30 par_include_introns NULL (commented out) par_no_target_umi_filter NULL (commented out) par_expect_cells NULL (commented out) par_force_cells NULL (commented out) par_no_bam NULL (commented out) Note: The parameters file for each step is located in ~/pipeline/job_info/parameters . For a comprehensive description of the execution parameters for each step see here . Given that CellRanger runs a user interface and is not submitted as a Job, it is recommended to run Step 1 in a 'screen' session, which will allow the task to keep running if the connection is broken. 
To run Step 1, use the following command: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline screen -S run_PBMC_application_case bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 1 The outputs of the CellRanger counts pipeline are deposited into ~/pipeline/step1 .","title":"Step 1: FASTQ to gene expression matrix"},{"location":"Dataset2/#step-2-create-seurat-object-and-remove-ambient-rna","text":"In Step 2, we are going to begin by correcting the RNA assay for ambient RNA removal using SoupX ( Young et al. 2020 ). We will then use the ambient RNA-corrected feature-barcode matrices to create a Seurat object. For our analysis of the PBMC dataset we set the following execution parameters for Step 2 ( ~/pipeline/job_info/parameters/step2_par.txt ): Parameter Value par_save_RNA Yes par_save_metadata Yes par_ambient_RNA Yes par_normalization.method LogNormalize par_scale.factor 10000 par_selection.method vst par_nfeatures 2500 We can run Step 2 using the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 2 Step 2 produces the following outputs: ~/pipeline step2 \u251c\u2500\u2500 figs2 \u2502 \u251c\u2500\u2500 ambient_RNA_estimation_run1.pdf \u2502 \u251c\u2500\u2500 ambient_RNA_markers_run1.pdf \u2502 \u251c\u2500\u2500 cell_cyle_dim_plot_run1.pdf \u2502 \u251c\u2500\u2500 vioplot_run1.pdf \u2502 \u2514\u2500\u2500 zoomed_in_vioplot_run1.pdf \u251c\u2500\u2500 info2 \u2502 \u251c\u2500\u2500 estimated_ambient_RNA_run1.txt \u2502 \u251c\u2500\u2500 MetaData_1.txt \u2502 \u251c\u2500\u2500 meta_info_1.txt \u2502 \u251c\u2500\u2500 run1_ambient_rna_summary.rds \u2502 \u251c\u2500\u2500 sessionInfo.txt \u2502 \u251c\u2500\u2500 seu1_RNA.txt \u2502 \u2514\u2500\u2500 summary_seu1.txt \u251c\u2500\u2500 objs2 \u2502 \u2514\u2500\u2500 run1.rds \u2514\u2500\u2500 step2_ambient \u2514\u2500\u2500 run1 
\u251c\u2500\u2500 barcodes.tsv \u251c\u2500\u2500 genes.tsv \u2514\u2500\u2500 matrix.mtxs Note: For a comprehensive description of the outputs for each analytical step, please see the Outputs section of the scRNAbox documentation. Figure 1. Figures produced by Step 2 of the scRNAbox pipeline. A) Estimated ambient RNA contamination rate (Rho) by SoupX. Estimates of the RNA contamination rate using various estimators are visualized via a frequency distribution; the true contamination rate is assigned as the most frequent estimate (red line; 8.7%). B) Log10 ratios of observed counts to expected counts for marker genes from each cluster. Clusters are defined by the CellRanger counts pipeline. The red line displays the estimated RNA contamination rate if the estimation was based entirely on the corresponding gene. C) Principal component analysis (PCA) of Seurat S and G2M cell cycle reference genes. D) Violin plots showing the distribution of cells according to quality control metrics calculated in Step 2. E) Zoomed in violin plots, from the minimum to the mean, showing the distribution of cells according to quality control metrics calculated in Step 2.","title":"Step 2: Create Seurat object and remove ambient RNA"},{"location":"Dataset2/#step-3-quality-control-and-filtering","text":"In Step 3, we are going to perform quality control procedures and filter out low quality cells. We are going to filter out cells with < 50 unique RNA transcripts, > 6000 unique RNA transcripts, < 200 total RNA transcripts, > 7000 total RNA transcripts, and > 50% mitochondria. 
For our analysis of the PBMC dataset we set the following execution parameters for Step 3 ( ~/pipeline/job_info/parameters/step3_par.txt ): Parameter Value par_save_RNA Yes par_save_metadata Yes par_seurat_object NULL par_nFeature_RNA_L 50 par_nFeature_RNA_U 6000 par_nCount_RNA_L 200 par_nCount_RNA_U 7000 par_mitochondria_percent_L 0 par_mitochondria_percent_U 50 par_ribosomal_percent_L 0 par_ribosomal_percent_U 100 par_remove_mitochondrial_genes No par_remove_ribosomal_genes No par_remove_genes NULL par_regress_cell_cycle_genes Yes par_normalization.method LogNormalize par_scale.factor 10000 par_selection.method vst par_nfeatures 2500 par_top 10 par_npcs_pca 30 We can run Step 3 using the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 3 Step 3 produces the following outputs. step3 \u251c\u2500\u2500 figs3 \u2502 \u251c\u2500\u2500 dimplot_pca_run1.pdf \u2502 \u251c\u2500\u2500 elbowplot_run1.pdf \u2502 \u251c\u2500\u2500 filtered_QC_vioplot_run1.pdf \u2502 \u2514\u2500\u2500 VariableFeaturePlot_run1.pdf \u251c\u2500\u2500 info3 \u2502 \u251c\u2500\u2500 MetaData_run1.txt \u2502 \u251c\u2500\u2500 meta_info_run1.txt \u2502 \u251c\u2500\u2500 most_variable_genes_run1.txt \u2502 \u251c\u2500\u2500 run1_RNA.txt \u2502 \u251c\u2500\u2500 sessionInfo.txt \u2502 \u2514\u2500\u2500 summary_run1.txt \u2514\u2500\u2500 objs3 \u2514\u2500\u2500 run1.rds Figure 2. Figures produced by Step 3 of the scRNAbox pipeline. A) Violin plots showing the distribution of cells according to quality control metrics after filtering by user-defined thresholds. B) Scatter plot showing the top 2500 most variable features; the top 10 most variable features are labelled. C) Principal component analysis (PCA) visualizing the first two principal components (PCs). 
D) Elbow plot to visualize the percentage of variance explained by each PC.","title":"Step 3: Quality control and filtering"},{"location":"Dataset2/#step-4-demultiplexing-and-doublet-detection","text":"In Step 4, we are going to demultiplex the pooled samples and remove doublets (erroneous libraries produced by two or more cells) based on the expression of the sample-specific barcodes (antibody assay). If the barcode labels used in the analysis are unknown, the first step is to retrieve them from the Seurat object. To do this, we do not need to modify the execution parameters and can go straight to running the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 4 \\ --msd T The above code produces the following file: step4 \u251c\u2500\u2500 figs4 \u251c\u2500\u2500 info4 \u2502 \u2514\u2500\u2500 seu1.rds_old_antibody_label_MULTIseqDemuxHTOcounts.csv \u2514\u2500\u2500 objs4 Which contains the names of the barcode labels (i.e. A_TotalSeqA , B_TotalSeqA , C_TotalSeqA , D_TotalSeqA , E_TotalSeqA , F_TotalSeqA , G_TotalSeqA , H_TotalSeqA , Doublet , Negative ). Now that we know the barcode labels used in the PBMC dataset, we can perform demultiplexing and doublet detection. 
For our analysis of the PBMC dataset we set the following execution parameters for Step 4 ( ~/pipeline/job_info/parameters/step4_par.txt ): Parameter Value par_save_RNA Yes par_save_metadata Yes par_normalization.method CLR par_scale.factor 10000 par_selection.method vst par_nfeatures 2500 par_dimensionality_reduction Yes par_npcs_pca 30 par_dims_umap 3 par_n.neighbor 65 par_dropDN Yes par_label_dropDN Doublet, Negative par_quantile 0.9 par_autoThresh TRUE par_maxiter 5 par_RidgePlot_ncol 3 par_old_antibody_label A-TotalSeqA, B-TotalSeqA, C-TotalSeqA, D-TotalSeqA, E-TotalSeqA, F-TotalSeqA, G-TotalSeqA, H-TotalSeqA, Doublet par_new_antibody_label sample-A, sample-B, sample-C, sample-D, sample-E, sample-F, sample-G, sample-H, Doublet We can run Step 4 using the following code: export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm export SCRNABOX_PWD=~/pipeline bash $SCRNABOX_HOME/launch_scrnabox.sh \\ -d ${SCRNABOX_PWD} \\ --steps 4 Step 4 produces the following outputs. step4 \u251c\u2500\u2500 figs4 \u2502 \u251c\u2500\u2500 run1_DotPlot_HTO_MSD.pdf \u2502 \u251c\u2500\u2500 run1_Heatmap_HTO_MSD.pdf \u2502 \u251c\u2500\u2500 run1_HTO_dimplot_pca.pdf \u2502 \u251c\u2500\u2500 run1_HTO_dimplot_umap.pdf \u2502 \u251c\u2500\u2500 run1_nCounts_RNA_MSD.pdf \u2502 \u2514\u2500\u2500 run1_Ridgeplot_HTO_MSD.pdf \u251c\u2500\u2500 info4 \u2502 \u251c\u2500\u2500 run1_filtered_MULTIseqDemuxHTOcounts.csv \u2502 \u251c\u2500\u2500 run1_MetaData.txt \u2502 \u251c\u2500\u2500 run1_meta_info_.txt \u2502 \u251c\u2500\u2500 run1_MULTIseqDemuxHTOcounts.csv \u2502 \u251c\u2500\u2500 run1_RNA.txt \u2502 \u2514\u2500\u2500 sessionInfo.txt \u2514\u2500\u2500 objs4 \u2514\u2500\u2500 run1.rds Figure 3. Figures produced by Step 4 of the Cell Hashtag Analysis Track. A) Uniform Manifold Approximation and Projections (UMAP) plot, taking the first three principal components (PC) of the antibody assay as input. B) Principal component analysis (PCA) showing the first two PCs of the antibody assay. 
C) Ridgeplot visualizing the enrichment of barcode labels across sample assignments at the sample level. D) Dot plot visualizing the enrichment of barcode labels across sample assignments at the sample level. E) Heatmap visualizing the enrichment of barcode labels across sample assignments at the cell level. F) Violin plot visualizing the distribution of the number of total RNA transcripts identified per cell, stratified by sample assignment.","title":"Step 4: Demultiplexing and doublet detection"},{"location":"Dataset2/#publication-ready-figures","text":"The code used to produce the publication-ready figures used in our pre-print manuscript is available here .","title":"Publication-ready figures"},{"location":"Dataset2/#job-configurations","text":"The following job configurations were used for our analysis of the PBMC dataset. Job Configurations can be modified for each analytical step in the scrnabox_config.ini file in ~/pipeline/job_info/configs Step THREADS_ARRAY MEM_ARRAY WALLTIME_ARRAY Step2 4 16g 00-05:00 Step3 4 16g 00-05:00 Step4 4 16g 00-05:00","title":"Job Configurations"},{"location":"LICENSE/","text":"License MIT License Copyright (c) 2022 The Neuro Bioinformatics Core Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.","title":"License"},{"location":"LICENSE/#license","text":"MIT License Copyright (c) 2022 The Neuro Bioinformatics Core Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.","title":"License"},{"location":"Step0/","text":"Step 1: Setting up the Ensemblex pipeline In Step 1, we will set up the working directory for the Ensemblex pipeline and decide which version of the pipeline we want to use: Demultiplexing with prior genotype information Demultiplexing without prior genotype information Demultiplexing with prior genotype information First, create a dedicated folder for the analysis (hereafter referred to as the working directory). 
Then, define the path to the working directory and the path to ensemblex.pip: ## Create and navigate to the working directory mkdir working_directory cd /path/to/working_directory ## Define the path to ensemblex.pip ensemblex_HOME=/path/to/ensemblex.pip ## Define the path to the working directory ensemblex_PWD=/path/to/working_directory Next, we can set up the working directory for demultiplexing with prior genotype information using the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT After running the above code, the working directory should have the following structure working_directory \u251c\u2500\u2500 demuxalot \u251c\u2500\u2500 demuxlet \u251c\u2500\u2500 ensemblex_gt \u251c\u2500\u2500 input_files \u251c\u2500\u2500 job_info \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u251c\u2500\u2500 logs \u2502 \u2514\u2500\u2500 summary_report.txt \u251c\u2500\u2500 souporcell \u2514\u2500\u2500 vireo_gt Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools: Preparation of input files Demultiplexing without prior genotype information First, create a dedicated folder for the analysis (hereafter referred to as the working directory). 
Then, define the path to the working directory and the path to ensemblex.pip: ## Create and navigate to the working directory mkdir working_directory cd /path/to/working_directory ## Define the path to ensemblex.pip ensemblex_HOME=/path/to/ensemblex.pip ## Define the path to the working directory ensemblex_PWD=/path/to/working_directory Next, we can set up the working directory for demultiplexing without prior genotype information using the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-noGT After running the above code, the working directory should have the following structure working_directory \u251c\u2500\u2500 demuxalot \u251c\u2500\u2500 freemuxlet \u251c\u2500\u2500 ensemblex \u251c\u2500\u2500 input_files \u251c\u2500\u2500 job_info \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u251c\u2500\u2500 logs \u2502 \u2514\u2500\u2500 summary_report.txt \u251c\u2500\u2500 souporcell \u2514\u2500\u2500 vireo Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools: Preparation of input files","title":"Step 1: Set up"},{"location":"Step0/#step-1-setting-up-the-ensemblex-pipeline","text":"In Step 1, we will set up the working directory for the Ensemblex pipeline and decide which version of the pipeline we want to use: Demultiplexing with prior genotype information Demultiplexing without prior genotype information","title":"Step 1: Setting up the Ensemblex pipeline"},{"location":"Step0/#demultiplexing-with-prior-genotype-information","text":"First, create a dedicated folder for the analysis (hereafter referred to as the working directory). 
Then, define the path to the working directory and the path to ensemblex.pip: ## Create and navigate to the working directory mkdir working_directory cd /path/to/working_directory ## Define the path to ensemblex.pip ensemblex_HOME=/path/to/ensemblex.pip ## Define the path to the working directory ensemblex_PWD=/path/to/working_directory Next, we can set up the working directory for demultiplexing with prior genotype information using the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT After running the above code, the working directory should have the following structure working_directory \u251c\u2500\u2500 demuxalot \u251c\u2500\u2500 demuxlet \u251c\u2500\u2500 ensemblex_gt \u251c\u2500\u2500 input_files \u251c\u2500\u2500 job_info \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u251c\u2500\u2500 logs \u2502 \u2514\u2500\u2500 summary_report.txt \u251c\u2500\u2500 souporcell \u2514\u2500\u2500 vireo_gt Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools: Preparation of input files","title":"Demultiplexing with prior genotype information"},{"location":"Step0/#demultiplexing-without-prior-genotype-information","text":"First, create a dedicated folder for the analysis (hereafter referred to as the working directory). 
Then, define the path to the working directory and the path to ensemblex.pip: ## Create and navigate to the working directory mkdir working_directory cd /path/to/working_directory ## Define the path to ensemblex.pip ensemblex_HOME=/path/to/ensemblex.pip ## Define the path to the working directory ensemblex_PWD=/path/to/working_directory Next, we can set up the working directory for demultiplexing without prior genotype information using the following code: bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-noGT After running the above code, the working directory should have the following structure working_directory \u251c\u2500\u2500 demuxalot \u251c\u2500\u2500 freemuxlet \u251c\u2500\u2500 ensemblex \u251c\u2500\u2500 input_files \u251c\u2500\u2500 job_info \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u251c\u2500\u2500 logs \u2502 \u2514\u2500\u2500 summary_report.txt \u251c\u2500\u2500 souporcell \u2514\u2500\u2500 vireo Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools: Preparation of input files","title":"Demultiplexing without prior genotype information"},{"location":"Step1/","text":"Step 2: Preparing input files for genetic demultiplexing In Step 2, we will define the necessary files needed for Ensemblex's constituent genetic demultiplexing tools and will place them within the working directory. 
The necessary files vary depending on the version of the Ensemblex pipeline being used: Demultiplexing with prior genotype information Demultiplexing without prior genotype information Demultiplexing with prior genotype information Required files To demultiplex the pooled samples with prior genotype information, the following files are required: File Description gene_expression.bam Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) gene_expression.bam.bai Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) barcodes.tsv Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) pooled_samples.vcf vcf file describing the genotypes of the pooled samples genome_reference.fa Genome reference fasta file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa) genome_reference.fa.fai Genome reference fasta index file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa.fai) genotype_reference.vcf Population reference vcf file (e.g., 1000 Genomes Project) NOTE: We demonstrate how to download reference vcf and fasta files in the Tutorial section of the Ensemblex documentation. 
Placing files into the Ensemblex pipeline working directory First, define all of the required files: BAM=/path/to/possorted_genome_bam.bam BAM_INDEX=/path/to/possorted_genome_bam.bam.bai BARCODES=/path/to/barcodes.tsv SAMPLE_VCF=/path/to/pooled_samples.vcf REFERENCE_VCF=/path/to/genotype_reference.vcf REFERENCE_FASTA=/path/to/genome.fa REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai Then, place the required files in the Ensemblex pipeline working directory: ## Define the path to the working directory ensemblex_PWD=/path/to/working_directory ## Copy the files to the input_files directory in the working directory cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv cp $SAMPLE_VCF $ensemblex_PWD/input_files/pooled_samples.vcf cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai If the file transfer was successful, the input_files directory of the Ensemblex pipeline working directory will contain the following files: working_directory \u2514\u2500\u2500 input_files \u251c\u2500\u2500 pooled_bam.bam \u251c\u2500\u2500 pooled_bam.bam.bai \u251c\u2500\u2500 pooled_barcodes.tsv \u251c\u2500\u2500 pooled_samples.vcf \u251c\u2500\u2500 reference.fa \u251c\u2500\u2500 reference.fa.fai \u2514\u2500\u2500 reference.vcf NOTE: You will notice that the names of the input files have been standardized; it is important that the input files have the corresponding names for the Ensemblex pipeline to work properly. 
Upon placing the required files in the Ensemblex pipeline, we can proceed to Step 3 where we will demultiplex the pooled samples using Ensemblex's constituent genetic demultiplexing tools: Genetic demultiplexing by constituent tools Demultiplexing without prior genotype information Required files To demultiplex the pooled samples without prior genotype information, the following files are required: File Description gene_expression.bam Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) gene_expression.bam.bai Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) barcodes.tsv Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) genome_reference.fa Genome reference fasta file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa) genome_reference.fa.fai Genome reference fasta index file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa.fai) genotype_reference.vcf Population reference vcf file (e.g., 1000 Genomes Project) NOTE: We demonstrate how to download reference vcf and fasta files in the Tutorial section of the Ensemblex documentation. 
Placing files into the Ensemblex pipeline working directory First, define all of the required files: BAM=/path/to/possorted_genome_bam.bam BAM_INDEX=/path/to/possorted_genome_bam.bam.bai BARCODES=/path/to/barcodes.tsv REFERENCE_VCF=/path/to/genotype_reference.vcf REFERENCE_FASTA=/path/to/genome.fa REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai Then, place the required files in the Ensemblex pipeline working directory: ## Define the path to the working directory ensemblex_PWD=/path/to/working_directory ## Copy the files to the input_files directory in the working directory cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai If the file transfer was successful, the input_files directory of the Ensemblex pipeline working directory will contain the following files: working_directory \u2514\u2500\u2500 input_files \u251c\u2500\u2500 pooled_bam.bam \u251c\u2500\u2500 pooled_bam.bam.bai \u251c\u2500\u2500 pooled_barcodes.tsv \u251c\u2500\u2500 reference.fa \u251c\u2500\u2500 reference.fa.fai \u2514\u2500\u2500 reference.vcf NOTE: You will notice that the names of the input files have been standardized; it is important that the input files have the corresponding names for the Ensemblex pipeline to work properly. 
Upon placing the required files in the Ensemblex pipeline, we can proceed to Step 3 where we will demultiplex the pooled samples using Ensemblex's constituent genetic demultiplexing tools: Genetic demultiplexing by constituent tools","title":"Step 2: Preparation of input files"},{"location":"Step1/#step-2-preparing-input-files-for-genetic-demultiplexing","text":"In Step 2, we will define the necessary files needed for Ensemblex's constituent genetic demultiplexing tools and will place them within the working directory. The necessary files vary depending on the version of the Ensemblex pipeline being used: Demultiplexing with prior genotype information Demultiplexing without prior genotype information","title":"Step 2: Preparing input files for genetic demultiplexing"},{"location":"Step1/#demultiplexing-with-prior-genotype-information","text":"","title":"Demultiplexing with prior genotype information"},{"location":"Step1/#required-files","text":"To demultiplex the pooled samples with prior genotype information, the following files are required: File Description gene_expression.bam Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) gene_expression.bam.bai Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) barcodes.tsv Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) pooled_samples.vcf vcf file describing the genotypes of the pooled samples genome_reference.fa Genome reference fasta file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa) genome_reference.fa.fai Genome reference fasta index file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa.fai) genotype_reference.vcf Population reference vcf file (e.g., 1000 Genomes Project) NOTE: We demonstrate how to download reference vcf and fasta files in the Tutorial section of the Ensemblex 
documentation.","title":"Required files"},{"location":"Step1/#placing-files-into-the-ensemblex-pipeline-working-directory","text":"First, define all of the required files: BAM=/path/to/possorted_genome_bam.bam BAM_INDEX=/path/to/possorted_genome_bam.bam.bai BARCODES=/path/to/barcodes.tsv SAMPLE_VCF=/path/to/pooled_samples.vcf REFERENCE_VCF=/path/to/genotype_reference.vcf REFERENCE_FASTA=/path/to/genome.fa REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai Then, place the required files in the Ensemblex pipeline working directory: ## Define the path to the working directory ensemblex_PWD=/path/to/working_directory ## Copy the files to the input_files directory in the working directory cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv cp $SAMPLE_VCF $ensemblex_PWD/input_files/pooled_samples.vcf cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai If the file transfer was successful, the input_files directory of the Ensemblex pipeline working directory will contain the following files: working_directory \u2514\u2500\u2500 input_files \u251c\u2500\u2500 pooled_bam.bam \u251c\u2500\u2500 pooled_bam.bam.bai \u251c\u2500\u2500 pooled_barcodes.tsv \u251c\u2500\u2500 pooled_samples.vcf \u251c\u2500\u2500 reference.fa \u251c\u2500\u2500 reference.fa.fai \u2514\u2500\u2500 reference.vcf NOTE: You will notice that the names of the input files have been standardized; it is important that the input files have the corresponding names for the Ensemblex pipeline to work properly. 
Upon placing the required files in the Ensemblex pipeline, we can proceed to Step 3 where we will demultiplex the pooled samples using Ensemblex's constituent genetic demultiplexing tools: Genetic demultiplexing by constituent tools","title":"Placing files into the Ensemblex pipeline working directory"},{"location":"Step1/#demultiplexing-without-prior-genotype-information","text":"","title":"Demultiplexing without prior genotype information"},{"location":"Step1/#required-files_1","text":"To demultiplex the pooled samples without prior genotype information, the following files are required: File Description gene_expression.bam Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) gene_expression.bam.bai Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) barcodes.tsv Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) genome_reference.fa Genome reference fasta file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa) genome_reference.fa.fai Genome reference fasta index file (e.g., 10X Genomics: ~/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa.fai) genotype_reference.vcf Population reference vcf file (e.g., 1000 Genomes Project) NOTE: We demonstrate how to download reference vcf and fasta files in the Tutorial section of the Ensemblex documentation.","title":"Required files"},{"location":"Step1/#placing-files-into-the-ensemblex-pipeline-working-directory_1","text":"First, define all of the required files: BAM=/path/to/possorted_genome_bam.bam BAM_INDEX=/path/to/possorted_genome_bam.bam.bai BARCODES=/path/to/barcodes.tsv REFERENCE_VCF=/path/to/genotype_reference.vcf REFERENCE_FASTA=/path/to/genome.fa REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai Then, place the required files in the Ensemblex pipeline working directory: ## Define the path to the working directory 
ensemblex_PWD=/path/to/working_directory ## Copy the files to the input_files directory in the working directory cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai If the file transfer was successful, the input_files directory of the Ensemblex pipeline working directory will contain the following files: working_directory \u2514\u2500\u2500 input_files \u251c\u2500\u2500 pooled_bam.bam \u251c\u2500\u2500 pooled_bam.bam.bai \u251c\u2500\u2500 pooled_barcodes.tsv \u251c\u2500\u2500 reference.fa \u251c\u2500\u2500 reference.fa.fai \u2514\u2500\u2500 reference.vcf NOTE: You will notice that the names of the input files have been standardized; it is important that the input files have the corresponding names for the Ensemblex pipeline to work properly. Upon placing the required files in the Ensemblex pipeline, we can proceed to Step 3 where we will demultiplex the pooled samples using Ensemblex's constituent genetic demultiplexing tools: Genetic demultiplexing by constituent tools","title":"Placing files into the Ensemblex pipeline working directory"},{"location":"Step2/","text":"Step 3: Genetic demultiplexing by constituent demultiplexing tools In Step 3, we will demultiplex the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools. The constituent genetic demultiplexing tools will vary depending on the version of the Ensemblex pipeline being used: Demultiplexing with prior genotype information Demultiplexing without prior genotype information NOTE : The analytical parameters for each constituent tool can be adjusted using the ensemblex_config.ini file located in ~/working_directory/job_info/configs . 
For a comprehensive description of how to adjust the analytical parameters of the Ensemblex pipeline please see Execution parameters . Demultiplexing with prior genotype information When demultiplexing with prior genotype information, Ensemblex leverages the sample labels from Demuxalot Demuxlet Souporcell Vireo-GT Demuxalot To run Demuxalot use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot If Demuxalot completed successfully, the following files should be available in ~/working_directory/demuxalot working_directory \u2514\u2500\u2500 demuxalot \u251c\u2500\u2500 Demuxalot_result.csv \u2514\u2500\u2500 new_snps_single_file.betas Demuxlet To run Demuxlet use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet If Demuxlet completed successfully, the following files should be available in ~/working_directory/demuxlet working_directory \u2514\u2500\u2500 demuxlet \u251c\u2500\u2500 outs.best \u251c\u2500\u2500 pileup.cel.gz \u251c\u2500\u2500 pileup.plp.gz \u251c\u2500\u2500 pileup.umi.gz \u2514\u2500\u2500 pileup.var.gz Souporcell To run Souporcell use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell If Souporcell completed successfully, the following files should be available in ~/working_directory/souporcell working_directory \u2514\u2500\u2500 souporcell \u251c\u2500\u2500 alt.mtx \u251c\u2500\u2500 cluster_genotypes.vcf \u251c\u2500\u2500 clusters_tmp.tsv \u251c\u2500\u2500 clusters.tsv \u251c\u2500\u2500 fq.fq \u251c\u2500\u2500 minimap.sam \u251c\u2500\u2500 minitagged.bam \u251c\u2500\u2500 minitagged_sorted.bam \u251c\u2500\u2500 minitagged_sorted.bam.bai \u251c\u2500\u2500 Pool.vcf 
\u251c\u2500\u2500 ref.mtx \u2514\u2500\u2500 soup.txt Vireo-GT To run Vireo-GT use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo If Vireo-GT completed successfully, the following files should be available in ~/working_directory/vireo_gt working_directory \u2514\u2500\u2500 vireo_gt \u251c\u2500\u2500 cellSNP.base.vcf.gz \u251c\u2500\u2500 cellSNP.cells.vcf.gz \u251c\u2500\u2500 cellSNP.samples.tsv \u251c\u2500\u2500 cellSNP.tag.AD.mtx \u251c\u2500\u2500 cellSNP.tag.DP.mtx \u251c\u2500\u2500 cellSNP.tag.OTH.mtx \u251c\u2500\u2500 donor_ids.tsv \u251c\u2500\u2500 fig_GT_distance_estimated.pdf \u251c\u2500\u2500 fig_GT_distance_input.pdf \u251c\u2500\u2500 GT_donors.vireo.vcf.gz \u251c\u2500\u2500 _log.txt \u251c\u2500\u2500 prob_doublet.tsv.gz \u251c\u2500\u2500 prob_singlet.tsv.gz \u2514\u2500\u2500 summary.tsv Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications: Application of Ensemblex Demultiplexing without prior genotype information When demultiplexing without prior genotype information, Ensemblex leverages the sample labels from Freemuxlet Souporcell Vireo Demuxalot Freemuxlet To run Freemuxlet use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step freemuxlet If Freemuxlet completed successfully, the following files should be available in ~/working_directory/freemuxlet working_directory \u2514\u2500\u2500 freemuxlet \u251c\u2500\u2500 outs.clust1.samples.gz \u251c\u2500\u2500 outs.clust1.vcf \u251c\u2500\u2500 outs.lmix \u251c\u2500\u2500 pileup.cel.gz \u251c\u2500\u2500 pileup.plp.gz \u251c\u2500\u2500 
pileup.umi.gz \u2514\u2500\u2500 pileup.var.gz Souporcell To run Souporcell use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell If Souporcell completed successfully, the following files should be available in ~/working_directory/souporcell working_directory \u2514\u2500\u2500 souporcell \u251c\u2500\u2500 alt.mtx \u251c\u2500\u2500 cluster_genotypes.vcf \u251c\u2500\u2500 clusters_tmp.tsv \u251c\u2500\u2500 clusters.tsv \u251c\u2500\u2500 fq.fq \u251c\u2500\u2500 minimap.sam \u251c\u2500\u2500 minitagged.bam \u251c\u2500\u2500 minitagged_sorted.bam \u251c\u2500\u2500 minitagged_sorted.bam.bai \u251c\u2500\u2500 Pool.vcf \u251c\u2500\u2500 ref.mtx \u2514\u2500\u2500 soup.txt Vireo To run Vireo use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo If Vireo completed successfully, the following files should be available in ~/working_directory/vireo working_directory \u2514\u2500\u2500 vireo \u251c\u2500\u2500 cellSNP.base.vcf.gz \u251c\u2500\u2500 cellSNP.cells.vcf.gz \u251c\u2500\u2500 cellSNP.samples.tsv \u251c\u2500\u2500 cellSNP.tag.AD.mtx \u251c\u2500\u2500 cellSNP.tag.DP.mtx \u251c\u2500\u2500 cellSNP.tag.OTH.mtx \u251c\u2500\u2500 donor_ids.tsv \u251c\u2500\u2500 fig_GT_distance_estimated.pdf \u251c\u2500\u2500 GT_donors.vireo.vcf.gz \u251c\u2500\u2500 _log.txt \u251c\u2500\u2500 prob_doublet.tsv.gz \u251c\u2500\u2500 prob_singlet.tsv.gz \u2514\u2500\u2500 summary.tsv Demuxalot NOTE : Because the Demuxalot algorithm requires prior genotype information, the Ensemblex pipeline uses the predicted vcf file generated by Freemuxlet as input into Demuxalot when prior genotype information is not available. Therefore, it is important to wait for Freemuxlet to complete before running Demuxalot. 
To check if the required Freemuxlet-generated vcf file is available prior to running Demuxalot, you can use the following code: if test -f /path/to/working_directory/freemuxlet/outs.clust1.vcf; then echo \"File exists.\" fi Upon confirming that the required Freemuxlet-generated file exists, we can run Demuxalot using the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot If Demuxalot completed successfully, the following files should be available in ~/working_directory/demuxalot working_directory \u2514\u2500\u2500 demuxalot \u251c\u2500\u2500 Demuxalot_result.csv \u2514\u2500\u2500 new_snps_single_file.betas Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications: Application of Ensemblex","title":"Step 3: Genetic demultiplexing by constituent tools"},{"location":"Step2/#step-3-genetic-demultiplexing-by-constituent-demultiplexing-tools","text":"In Step 3, we will demultiplex the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools. The constituent genetic demultiplexing tools will vary depending on the version of the Ensemblex pipeline being used: Demultiplexing with prior genotype information Demultiplexing without prior genotype information NOTE : The analytical parameters for each constituent tool can be adjusted using the ensemblex_config.ini file located in ~/working_directory/job_info/configs . 
For a comprehensive description of how to adjust the analytical parameters of the Ensemblex pipeline please see Execution parameters .","title":"Step 3: Genetic demultiplexing by constituent demultiplexing tools"},{"location":"Step2/#demultiplexing-with-prior-genotype-information","text":"When demultiplexing with prior genotype information, Ensemblex leverages the sample labels from Demuxalot Demuxlet Souporcell Vireo-GT","title":"Demultiplexing with prior genotype information"},{"location":"Step2/#demuxalot","text":"To run Demuxalot use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot If Demuxalot completed successfully, the following files should be available in ~/working_directory/demuxalot working_directory \u2514\u2500\u2500 demuxalot \u251c\u2500\u2500 Demuxalot_result.csv \u2514\u2500\u2500 new_snps_single_file.betas","title":"Demuxalot"},{"location":"Step2/#demuxlet","text":"To run Demuxlet use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet If Demuxlet completed successfully, the following files should be available in ~/working_directory/demuxlet working_directory \u2514\u2500\u2500 demuxlet \u251c\u2500\u2500 outs.best \u251c\u2500\u2500 pileup.cel.gz \u251c\u2500\u2500 pileup.plp.gz \u251c\u2500\u2500 pileup.umi.gz \u2514\u2500\u2500 pileup.var.gz","title":"Demuxlet"},{"location":"Step2/#souporcell","text":"To run Souporcell use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell If Souporcell completed successfully, the following files should be available in ~/working_directory/souporcell working_directory \u2514\u2500\u2500 souporcell \u251c\u2500\u2500 alt.mtx \u251c\u2500\u2500 
cluster_genotypes.vcf \u251c\u2500\u2500 clusters_tmp.tsv \u251c\u2500\u2500 clusters.tsv \u251c\u2500\u2500 fq.fq \u251c\u2500\u2500 minimap.sam \u251c\u2500\u2500 minitagged.bam \u251c\u2500\u2500 minitagged_sorted.bam \u251c\u2500\u2500 minitagged_sorted.bam.bai \u251c\u2500\u2500 Pool.vcf \u251c\u2500\u2500 ref.mtx \u2514\u2500\u2500 soup.txt","title":"Souporcell"},{"location":"Step2/#vireo-gt","text":"To run Vireo-GT use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo If Vireo-GT completed successfully, the following files should be available in ~/working_directory/vireo_gt working_directory \u2514\u2500\u2500 vireo_gt \u251c\u2500\u2500 cellSNP.base.vcf.gz \u251c\u2500\u2500 cellSNP.cells.vcf.gz \u251c\u2500\u2500 cellSNP.samples.tsv \u251c\u2500\u2500 cellSNP.tag.AD.mtx \u251c\u2500\u2500 cellSNP.tag.DP.mtx \u251c\u2500\u2500 cellSNP.tag.OTH.mtx \u251c\u2500\u2500 donor_ids.tsv \u251c\u2500\u2500 fig_GT_distance_estimated.pdf \u251c\u2500\u2500 fig_GT_distance_input.pdf \u251c\u2500\u2500 GT_donors.vireo.vcf.gz \u251c\u2500\u2500 _log.txt \u251c\u2500\u2500 prob_doublet.tsv.gz \u251c\u2500\u2500 prob_singlet.tsv.gz \u2514\u2500\u2500 summary.tsv Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications: Application of Ensemblex","title":"Vireo-GT"},{"location":"Step2/#demultiplexing-without-prior-genotype-information","text":"When demultiplexing without prior genotype information, Ensemblex leverages the sample labels from Freemuxlet Souporcell Vireo Demuxalot","title":"Demultiplexing without prior genotype information"},{"location":"Step2/#freemuxlet","text":"To run Freemuxlet use the following code: 
ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step freemuxlet If Freemuxlet completed successfully, the following files should be available in ~/working_directory/freemuxlet working_directory \u2514\u2500\u2500 freemuxlet \u251c\u2500\u2500 outs.clust1.samples.gz \u251c\u2500\u2500 outs.clust1.vcf \u251c\u2500\u2500 outs.lmix \u251c\u2500\u2500 pileup.cel.gz \u251c\u2500\u2500 pileup.plp.gz \u251c\u2500\u2500 pileup.umi.gz \u2514\u2500\u2500 pileup.var.gz","title":"Freemuxlet"},{"location":"Step2/#souporcell_1","text":"To run Souporcell use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell If Souporcell completed successfully, the following files should be available in ~/working_directory/souporcell working_directory \u2514\u2500\u2500 souporcell \u251c\u2500\u2500 alt.mtx \u251c\u2500\u2500 cluster_genotypes.vcf \u251c\u2500\u2500 clusters_tmp.tsv \u251c\u2500\u2500 clusters.tsv \u251c\u2500\u2500 fq.fq \u251c\u2500\u2500 minimap.sam \u251c\u2500\u2500 minitagged.bam \u251c\u2500\u2500 minitagged_sorted.bam \u251c\u2500\u2500 minitagged_sorted.bam.bai \u251c\u2500\u2500 Pool.vcf \u251c\u2500\u2500 ref.mtx \u2514\u2500\u2500 soup.txt","title":"Souporcell"},{"location":"Step2/#vireo","text":"To run Vireo use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo If Vireo completed successfully, the following files should be available in ~/working_directory/vireo working_directory \u2514\u2500\u2500 vireo \u251c\u2500\u2500 cellSNP.base.vcf.gz \u251c\u2500\u2500 cellSNP.cells.vcf.gz \u251c\u2500\u2500 cellSNP.samples.tsv \u251c\u2500\u2500 cellSNP.tag.AD.mtx \u251c\u2500\u2500 cellSNP.tag.DP.mtx \u251c\u2500\u2500 cellSNP.tag.OTH.mtx 
\u251c\u2500\u2500 donor_ids.tsv \u251c\u2500\u2500 fig_GT_distance_estimated.pdf \u251c\u2500\u2500 GT_donors.vireo.vcf.gz \u251c\u2500\u2500 _log.txt \u251c\u2500\u2500 prob_doublet.tsv.gz \u251c\u2500\u2500 prob_singlet.tsv.gz \u2514\u2500\u2500 summary.tsv","title":"Vireo"},{"location":"Step2/#demuxalot_1","text":"NOTE : Because the Demuxalot algorithm requires prior genotype information, the Ensemblex pipeline uses the predicted vcf file generated by Freemuxlet as input into Demuxalot when prior genotype information is not available. Therefore, it is important to wait for Freemuxlet to complete before running Demuxalot. To check if the required Freemuxlet-generated vcf file is available prior to running Demuxalot, you can use the following code: if test -f /path/to/working_directory/freemuxlet/outs.clust1.vcf; then echo \"File exists.\" fi Upon confirming that the required Freemuxlet-generated file exists, we can run Demuxalot using the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot If Demuxalot completed successfully, the following files should be available in ~/working_directory/demuxalot working_directory \u2514\u2500\u2500 demuxalot \u251c\u2500\u2500 Demuxalot_result.csv \u2514\u2500\u2500 new_snps_single_file.betas Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the constituent tools with the Ensemblex algorithm to generate the ensemble sample classifications: Application of Ensemblex","title":"Demuxalot"},{"location":"Step3/","text":"Step 4: Application of Ensemblex Introduction Ensemblex parameters Applying the Ensemblex algorithm Introduction In Step 4, we will process the output files from the constituent genetic demultiplexing tools with the Ensemblex framework. 
Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools: Step 1: Probabilistic-weighted ensemble In Step 1, Ensemblex utilizes an unsupervised weighting model to identify the most probable sample label for each cell. Ensemblex weighs each constituent tool\u2019s assignment probability distribution by its estimated balanced accuracy for the dataset. The weighted assignment probabilities across all four constituent tools are then used to inform the most probable sample label for each cell. Step 2: Graph-based doublet detection In Step 2, Ensemblex utilizes a graph-based approach to identify doublets that were incorrectly labeled as singlets in Step 1. Pooled cells are embedded into PCA space and the most confident doublets in the pool (nCD) are identified. Then, based on the Euclidean distance in PCA space, the pooled cells that surpass the percentile threshold (pT) of the nearest neighbour frequency to the confident doublets are labelled as doublets by Ensemblex. Ensemblex performs an automated parameter sweep to identify the optimal nCD and pT values; however, users can opt to manually define these parameters. Step 3: Ensemble-independent doublet detection In Step 3, Ensemblex utilizes an ensemble-independent approach to further improve doublet detection. Here, cells that are labelled as doublets by Demuxalot or Vireo are labelled as doublets by Ensemblex; however, users can nominate different tools to utilize for Step 3, depending on the desired doublet detection stringency. Ensemblex parameters Users can choose to run each step of the Ensemblex framework sequentially (Steps 1 to 3) or can opt to skip certain steps. While Step 1 is necessary to generate the ensemble sample labels, Steps 2 and 3 were implemented to improve Ensemblex's ability to identify doublets; thus, if users do not want to prioritize doublet detection, they may skip Steps 2 and/or 3. 
Nonetheless, we demonstrated in our pre-print manuscript that utilizing the entire Ensemblex framework is important for maximizing the demultiplexing accuracy. Users can define which steps of the Ensemblex framework they want to utilize in the adjustable parameters file. The adjustable parameters file ( ensemblex_config.ini ) is located in ~/working_directory/job_info/configs/ . For a comprehensive description of how to adjust the analytical parameters of the Ensemblex pipeline please see Execution parameters . The following parameters are adjustable when applying the Ensemblex algorithm: Parameter Default Description Pool parameters PAR_ensemblex_sample_size NULL Number of samples multiplexed in the pool. PAR_ensemblex_expected_doublet_rate NULL Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation . Set up parameters PAR_ensemblex_merge_constituents Yes Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". Subsequent runs of ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated merged file. Step 1 parameters: Probabilistic-weighted ensemble PAR_ensemblex_probabilistic_weighted_ensemble Yes Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". Subsequent runs of ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated Step 1 output file. Step 2 parameters: Graph-based doublet detection PAR_ensemblex_preliminary_parameter_sweep No Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. 
Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighbour frequency (pT), which can be defined in the following two parameters, respectively. PAR_ensemblex_nCD NULL Manually defined number of confident doublets in the pool (nCD). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_pT NULL Manually defined percentile threshold of the nearest neighbour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_graph_based_doublet_detection Yes Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values. Step 3 parameters: Ensemble-independent doublet detection PAR_ensemblex_preliminary_ensemble_independent_doublet No Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters. PAR_ensemblex_ensemble_independent_doublet Yes Whether or not to perform Step 3: Ensemble-independent doublet detection. PAR_ensemblex_doublet_Demuxalot_threshold Yes Whether or not to label doublets identified by Demuxalot as doublets. 
Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxalot_no_threshold No Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Demuxlet_threshold No Whether or not to label doublets identified by Demuxlet as doublets. Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxlet_no_threshold No Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Souporcell_threshold No Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Souporcell_no_threshold No Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Vireo_threshold Yes Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Vireo_no_threshold No Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability. Confidence score parameters PAR_ensemblex_compute_singlet_confidence Yes Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses. 
Applying the Ensemblex algorithm To apply the Ensemblex algorithm use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step ensemblexing If the ensemblex algorithm completed successfully, the following files should be available in ~/working_directory/ensemblex working_directory \u2514\u2500\u2500 ensemblex \u251c\u2500\u2500 confidence \u2502 \u2514\u2500\u2500 ensemblex_final_cell_assignment.csv \u251c\u2500\u2500 constituent_tool_merge.csv \u251c\u2500\u2500 step1 \u2502 \u251c\u2500\u2500 ARI_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 BA_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 Balanced_accuracy_summary.csv \u2502 \u2514\u2500\u2500 step1_cell_assignment.csv \u251c\u2500\u2500 step2 \u2502 \u251c\u2500\u2500 optimal_nCD.pdf \u2502 \u251c\u2500\u2500 optimal_pT.pdf \u2502 \u251c\u2500\u2500 PC1_var_contrib.pdf \u2502 \u251c\u2500\u2500 PC2_var_contrib.pdf \u2502 \u251c\u2500\u2500 PCA1_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA2_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA3_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA_plot.pdf \u2502 \u251c\u2500\u2500 PCA_scree_plot.pdf \u2502 \u2514\u2500\u2500 Step2_cell_assignment.csv \u2514\u2500\u2500 step3 \u251c\u2500\u2500 Doublet_overlap_no_threshold.pdf \u251c\u2500\u2500 Doublet_overlap_threshold.pdf \u251c\u2500\u2500 Number_Ensemblux_doublets_EID_no_threshold.pdf \u251c\u2500\u2500 Number_Ensemblux_doublets_EID_threshold.pdf \u2514\u2500\u2500 Step3_cell_assignment.csv For a comprehensive description of the Ensemblex algorithm output files, please see Ensemblex outputs .","title":"Step 4: Application of Ensemblex"},{"location":"Step3/#step-4-application-of-ensemblex","text":"Introduction Ensemblex parameters Applying the Ensemblex algorithm","title":"Step 4: Application of Ensemblex"},{"location":"Step3/#introduction","text":"In 
Step 4, we will process the output files from the constituent genetic demultiplexing tools with the Ensemblex framework. Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools: Step 1: Probabilistic-weighted ensemble In Step 1, Ensemblex utilizes an unsupervised weighting model to identify the most probable sample label for each cell. Ensemblex weighs each constituent tool\u2019s assignment probability distribution by its estimated balanced accuracy for the dataset. The weighted assignment probabilities across all four constituent tools are then used to inform the most probable sample label for each cell. Step 2: Graph-based doublet detection In Step 2, Ensemblex utilizes a graph-based approach to identify doublets that were incorrectly labeled as singlets in Step 1. Pooled cells are embedded into PCA space and the most confident doublets in the pool (nCD) are identified. Then, based on the Euclidean distance in PCA space, the pooled cells that surpass the percentile threshold (pT) of the nearest neighbour frequency to the confident doublets are labelled as doublets by Ensemblex. Ensemblex performs an automated parameter sweep to identify the optimal nCD and pT values; however, users can opt to manually define these parameters. Step 3: Ensemble-independent doublet detection In Step 3, Ensemblex utilizes an ensemble-independent approach to further improve doublet detection. Here, cells that are labelled as doublets by Demuxalot or Vireo are labelled as doublets by Ensemblex; however, users can nominate different tools to utilize for Step 3, depending on the desired doublet detection stringency.","title":"Introduction"},{"location":"Step3/#ensemblex-parameters","text":"Users can choose to run each step of the Ensemblex framework sequentially (Steps 1 to 3) or can opt to skip certain steps. 
While Step 1 is necessary to generate the ensemble sample labels, Steps 2 and 3 were implemented to improve Ensemblex's ability to identify doublets; thus, if users do not want to prioritize doublet detection, they may skip Steps 2 and/or 3. Nonetheless, we demonstrated in our pre-print manuscript that utilizing the entire Ensemblex framework is important for maximizing the demultiplexing accuracy. Users can define which steps of the Ensemblex framework they want to utilize in the adjustable parameters file. The adjustable parameters file ( ensemblex_config.ini ) is located in ~/working_directory/job_info/configs/ . For a comprehensive description of how to adjust the analytical parameters of the Ensemblex pipeline please see Execution parameters . The following parameters are adjustable when applying the Ensemblex algorithm: Parameter Default Description Pool parameters PAR_ensemblex_sample_size NULL Number of samples multiplexed in the pool. PAR_ensemblex_expected_doublet_rate NULL Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation . Set up parameters PAR_ensemblex_merge_constituents Yes Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". Subsequent runs of ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated merged file. Step 1 parameters: Probabilistic-weighted ensemble PAR_ensemblex_probabilistic_weighted_ensemble Yes Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". 
Subsequent runs of ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated Step 1 output file. Step 2 parameters: Graph-based doublet detection PAR_ensemblex_preliminary_parameter_sweep No Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighbour frequency (pT), which can be defined in the following two parameters, respectively. PAR_ensemblex_nCD NULL Manually defined number of confident doublets in the pool (nCD). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_pT NULL Manually defined percentile threshold of the nearest neighbour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_graph_based_doublet_detection Yes Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values. Step 3 parameters: Ensemble-independent doublet detection PAR_ensemblex_preliminary_ensemble_independent_doublet No Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. 
Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters. PAR_ensemblex_ensemble_independent_doublet Yes Whether or not to perform Step 3: Ensemble-independent doublet detection. PAR_ensemblex_doublet_Demuxalot_threshold Yes Whether or not to label doublets identified by Demuxalot as doublets. Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxalot_no_threshold No Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Demuxlet_threshold No Whether or not to label doublets identified by Demuxlet as doublets. Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxlet_no_threshold No Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Souporcell_threshold No Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Souporcell_no_threshold No Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Vireo_threshold Yes Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Vireo_no_threshold No Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability. 
Confidence score parameters PAR_ensemblex_compute_singlet_confidence Yes Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses.","title":"Ensemblex parameters"},{"location":"Step3/#applying-the-ensemblex-algorithm","text":"To apply the Ensemblex algorithm use the following code: ensemblex_HOME=/path/to/ensemblex.pip ensemblex_PWD=/path/to/working_directory bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step ensemblexing If the ensemblex algorithm completed successfully, the following files should be available in ~/working_directory/ensemblex working_directory \u2514\u2500\u2500 ensemblex \u251c\u2500\u2500 confidence \u2502 \u2514\u2500\u2500 ensemblex_final_cell_assignment.csv \u251c\u2500\u2500 constituent_tool_merge.csv \u251c\u2500\u2500 step1 \u2502 \u251c\u2500\u2500 ARI_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 BA_demultiplexing_tools.pdf \u2502 \u251c\u2500\u2500 Balanced_accuracy_summary.csv \u2502 \u2514\u2500\u2500 step1_cell_assignment.csv \u251c\u2500\u2500 step2 \u2502 \u251c\u2500\u2500 optimal_nCD.pdf \u2502 \u251c\u2500\u2500 optimal_pT.pdf \u2502 \u251c\u2500\u2500 PC1_var_contrib.pdf \u2502 \u251c\u2500\u2500 PC2_var_contrib.pdf \u2502 \u251c\u2500\u2500 PCA1_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA2_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA3_graph_based_doublet_detection.pdf \u2502 \u251c\u2500\u2500 PCA_plot.pdf \u2502 \u251c\u2500\u2500 PCA_scree_plot.pdf \u2502 \u2514\u2500\u2500 Step2_cell_assignment.csv \u2514\u2500\u2500 step3 \u251c\u2500\u2500 Doublet_overlap_no_threshold.pdf \u251c\u2500\u2500 Doublet_overlap_threshold.pdf \u251c\u2500\u2500 Number_Ensemblux_doublets_EID_no_threshold.pdf \u251c\u2500\u2500 Number_Ensemblux_doublets_EID_threshold.pdf \u2514\u2500\u2500 Step3_cell_assignment.csv For a comprehensive description of the Ensemblex algorithm output files, 
please see Ensemblex outputs .","title":"Applying the Ensemblex algorithm"},{"location":"contributing/","text":"Help and Feedback Any contributions or suggestions for improving the ensemblex pipeline are welcomed and appreciated. You may directly contact Michael Fiorini or Saeid Amiri . If you encounter any issues, please open an issue in the GitHub repository . Alternatively, you are welcome to email the developers directly; for any questions please contact Michael Fiorini: michael.fiorini@mail.mcgill.ca","title":"Help and Feedback"},{"location":"contributing/#help-and-feedback","text":"Any contributions or suggestions for improving the ensemblex pipeline are welcomed and appreciated. You may directly contact Michael Fiorini or Saeid Amiri . If you encounter any issues, please open an issue in the GitHub repository . Alternatively, you are welcome to email the developers directly; for any questions please contact Michael Fiorini: michael.fiorini@mail.mcgill.ca","title":"Help and Feedback"},{"location":"installation/","text":"Installation The Ensemblex container is freely available under an MIT open-source license at https://zenodo.org/records/11639103 . 
The Ensemblex container can be downloaded using the following code: ## Download the Ensemblex container curl \"https://zenodo.org/records/11639103/files/ensemblex.pip.zip?download=1\" --output ensemblex.pip.zip ## Unzip the Ensemblex container unzip ensemblex.pip.zip If installation was successful the following will be available: ensemblex.pip \u251c\u2500\u2500 gt \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u2514\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 demuxalot \u2502 \u2502 \u251c\u2500\u2500 pipeline_demuxalot.sh \u2502 \u2502 \u2514\u2500\u2500 pipline_demuxalot.py \u2502 \u251c\u2500\u2500 demuxlet \u2502 \u2502 \u2514\u2500\u2500 pipeline_demuxlet.sh \u2502 \u251c\u2500\u2500 ensemblexing \u2502 \u2502 \u251c\u2500\u2500 ensemblexing.R \u2502 \u2502 \u251c\u2500\u2500 functions.R \u2502 \u2502 \u2514\u2500\u2500 pipeline_ensemblexing.sh \u2502 \u251c\u2500\u2500 souporcell \u2502 \u2502 \u2514\u2500\u2500 pipeline_souporcell_generate.sh \u2502 \u2514\u2500\u2500 vireo \u2502 \u2514\u2500\u2500 pipeline_vireo.sh \u251c\u2500\u2500 launch \u2502 \u251c\u2500\u2500 launch_gt.sh \u2502 \u2514\u2500\u2500 launch_nogt.sh \u251c\u2500\u2500 launch_ensemblex.sh \u251c\u2500\u2500 nogt \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u2514\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 demuxalot \u2502 \u2502 \u251c\u2500\u2500 pipeline_demuxalot.py \u2502 \u2502 \u2514\u2500\u2500 pipeline_demuxalot.sh \u2502 \u251c\u2500\u2500 ensemblexing \u2502 \u2502 \u251c\u2500\u2500 ensemblexing_nogt.R \u2502 \u2502 \u251c\u2500\u2500 functions_nogt.R \u2502 \u2502 \u2514\u2500\u2500 pipeline_ensemblexing.sh \u2502 \u251c\u2500\u2500 freemuxlet \u2502 \u2502 \u2514\u2500\u2500 pipeline_freemuxlet.sh \u2502 \u251c\u2500\u2500 souporcell \u2502 \u2502 \u2514\u2500\u2500 pipeline_souporcell_generate.sh \u2502 \u2514\u2500\u2500 vireo \u2502 \u2514\u2500\u2500 pipeline_vireo.sh 
\u251c\u2500\u2500 README \u251c\u2500\u2500 soft \u2502 \u2514\u2500\u2500 ensemblex.sif \u2514\u2500\u2500 tools \u251c\u2500\u2500 sort_vcf_same_as_bam.sh \u2514\u2500\u2500 utils.sh In addition to the Ensemblex container, users must install Apptainer . For example: ## Load Apptainer module load apptainer/1.2.4 To test if the Ensemblex container is installed properly, run the following code: ## Define the path to ensemblex.pip ensemblex_HOME=/path/to/ensemblex.pip ## Print help message bash $ensemblex_HOME/launch_ensemblex.sh -h Which should return the following help message: ------------------- Usage: /home/fiorini9/scratch/ensemblex.pip/launch_ensemblex.sh [arguments] mandatory arguments: -d (--dir) = Working directory (where all the outputs will be printed) (give full path) --steps = Specify the steps to execute. Begin by selecting either init-GT or init-noGT to establish the working directory. For GT: vireo, demuxalot, demuxlet, souporcell, ensemblexing For noGT: vireo, demuxalot, freemuxlet, souporcell, ensemblexing optional arguments: -h (--help) = See helps regarding the pipeline arguments --vcf = The path of vcf file --bam = The path of bam file --sortout = The path snd nsme of vcf generated using sort ------------------- For a comprehensive help, visit https://neurobioinfo.github.io/ensemblex/site/ for documentation. After installing the Ensemblex container, we can proceed to Step 1, where we will initiate the Ensemblex pipeline for demultiplexing: Set up","title":"Installation"},{"location":"installation/#installation","text":"The Ensemblex container is freely available under an MIT open-source license at https://zenodo.org/records/11639103 . 
The Ensemblex container can be downloaded using the following code: ## Download the Ensemblex container curl \"https://zenodo.org/records/11639103/files/ensemblex.pip.zip?download=1\" --output ensemblex.pip.zip ## Unzip the Ensemblex container unzip ensemblex.pip.zip If installation was successful the following will be available: ensemblex.pip \u251c\u2500\u2500 gt \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u2514\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 demuxalot \u2502 \u2502 \u251c\u2500\u2500 pipeline_demuxalot.sh \u2502 \u2502 \u2514\u2500\u2500 pipline_demuxalot.py \u2502 \u251c\u2500\u2500 demuxlet \u2502 \u2502 \u2514\u2500\u2500 pipeline_demuxlet.sh \u2502 \u251c\u2500\u2500 ensemblexing \u2502 \u2502 \u251c\u2500\u2500 ensemblexing.R \u2502 \u2502 \u251c\u2500\u2500 functions.R \u2502 \u2502 \u2514\u2500\u2500 pipeline_ensemblexing.sh \u2502 \u251c\u2500\u2500 souporcell \u2502 \u2502 \u2514\u2500\u2500 pipeline_souporcell_generate.sh \u2502 \u2514\u2500\u2500 vireo \u2502 \u2514\u2500\u2500 pipeline_vireo.sh \u251c\u2500\u2500 launch \u2502 \u251c\u2500\u2500 launch_gt.sh \u2502 \u2514\u2500\u2500 launch_nogt.sh \u251c\u2500\u2500 launch_ensemblex.sh \u251c\u2500\u2500 nogt \u2502 \u251c\u2500\u2500 configs \u2502 \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u2502 \u2514\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 demuxalot \u2502 \u2502 \u251c\u2500\u2500 pipeline_demuxalot.py \u2502 \u2502 \u2514\u2500\u2500 pipeline_demuxalot.sh \u2502 \u251c\u2500\u2500 ensemblexing \u2502 \u2502 \u251c\u2500\u2500 ensemblexing_nogt.R \u2502 \u2502 \u251c\u2500\u2500 functions_nogt.R \u2502 \u2502 \u2514\u2500\u2500 pipeline_ensemblexing.sh \u2502 \u251c\u2500\u2500 freemuxlet \u2502 \u2502 \u2514\u2500\u2500 pipeline_freemuxlet.sh \u2502 \u251c\u2500\u2500 souporcell \u2502 \u2502 \u2514\u2500\u2500 pipeline_souporcell_generate.sh \u2502 \u2514\u2500\u2500 vireo \u2502 \u2514\u2500\u2500 pipeline_vireo.sh 
\u251c\u2500\u2500 README \u251c\u2500\u2500 soft \u2502 \u2514\u2500\u2500 ensemblex.sif \u2514\u2500\u2500 tools \u251c\u2500\u2500 sort_vcf_same_as_bam.sh \u2514\u2500\u2500 utils.sh In addition to the Ensemblex container, users must install Apptainer . For example: ## Load Apptainer module load apptainer/1.2.4 To test if the Ensemblex container is installed properly, run the following code: ## Define the path to ensemblex.pip ensemblex_HOME=/path/to/ensemblex.pip ## Print help message bash $ensemblex_HOME/launch_ensemblex.sh -h Which should return the following help message: ------------------- Usage: /home/fiorini9/scratch/ensemblex.pip/launch_ensemblex.sh [arguments] mandatory arguments: -d (--dir) = Working directory (where all the outputs will be printed) (give full path) --steps = Specify the steps to execute. Begin by selecting either init-GT or init-noGT to establish the working directory. For GT: vireo, demuxalot, demuxlet, souporcell, ensemblexing For noGT: vireo, demuxalot, freemuxlet, souporcell, ensemblexing optional arguments: -h (--help) = See helps regarding the pipeline arguments --vcf = The path of vcf file --bam = The path of bam file --sortout = The path snd nsme of vcf generated using sort ------------------- For a comprehensive help, visit https://neurobioinfo.github.io/ensemblex/site/ for documentation. After installing the Ensemblex container, we can proceed to Step 1, where we will initiate the Ensemblex pipeline for demultiplexing: Set up","title":"Installation"},{"location":"midbrain_download/","text":"Data Download Introduction Downloading and processing scRNAseq data Downloading sample genotype data Downloading reference genotype data Downloading genome reference file Introduction For the tutorial, we will leverage a pooled scRNAseq dataset produced by Jerber et al. . This pool contains induced pluripotent stem cell (iPSC) lines from 9 healthy controls that were differentiated towards a dopaminergic neuron state. 
In this section of the tutorial, we will: Download and process the pooled scRNAseq data with the CellRanger counts pipeline Download and process the sample genotype data Download reference genotype data Download a reference genome file Before we begin, we will create a designated folder for the Ensemblex tutorial: mkdir ensemblex_tutorial cd ensemblex_tutorial Downloading and processing scRNAseq data We will begin by downloading the pooled scRNAseq data from the Sequence Read Archive (SRA): ## Create a folder to place pooled scRNAseq data mkdir pooled_scRNAseq cd pooled_scRNAseq ## Download pooled scRNAseq FASTQ files wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/009/ERR4700019/ERR4700019_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/009/ERR4700019/ERR4700019_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/000/ERR4700020/ERR4700020_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/000/ERR4700020/ERR4700020_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/001/ERR4700021/ERR4700021_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/001/ERR4700021/ERR4700021_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/002/ERR4700022/ERR4700022_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/002/ERR4700022/ERR4700022_2.fastq.gz ## Rename pooled scRNAseq FASTQ files mv ERR4700019_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L001_R1_001.fastq.gz mv ERR4700019_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L001_R2_001.fastq.gz mv ERR4700020_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L002_R1_001.fastq.gz mv ERR4700020_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L002_R2_001.fastq.gz mv ERR4700021_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L003_R1_001.fastq.gz mv ERR4700021_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L003_R2_001.fastq.gz mv ERR4700022_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L004_R1_001.fastq.gz mv ERR4700022_2.fastq.gz 
~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L004_R2_001.fastq.gz Next, we will process the pooled scRNAseq data with the CellRanger counts pipeline: ## Create CellRanger directory cd ~/ensemblex_tutorial mkdir CellRanger cd CellRanger cellranger count \\ --id=pool \\ --fastqs=/home/fiorini9/scratch/ensemblex_pipeline_test/ensemblex_tutorial/pooled_scRNAseq \\ --sample=pool \\ --transcriptome=~/10xGenomics/refdata-cellranger-GRCh37 If the CellRanger counts pipeline completed successfully, it will have generated the following files that we will use for genetic demultiplexing downstream: possorted_genome_bam.bam possorted_genome_bam.bam.bai barcodes.tsv NOTE : For more information regarding the CellRanger counts pipeline, please see the 10X documentation . Downloading sample genotype data Next, we will download the whole exome .vcf files corresponding to the nine pooled individuals from which the iPSC lines derived. We will download the .vcf files from the European Nucleotide Archive (ENA): ## Create a folder to place sample genotype data cd ~/ensemblex_tutorial mkdir sample_genotype cd sample_genotype ## HPSI0115i-hecn_6 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487971/HPSI0115i-hecn_6.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487971/HPSI0115i-hecn_6.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi ## HPSI0214i-pelm_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ122/ERZ122924/HPSI0214i-pelm_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20150415.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ122/ERZ122924/HPSI0214i-pelm_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20150415.genotypes.vcf.gz.tbi ## HPSI0314i-sojd_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ266/ERZ266723/HPSI0314i-sojd_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20160122.genotypes.vcf.gz wget 
ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ266/ERZ266723/HPSI0314i-sojd_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20160122.genotypes.vcf.gz.tbi ## HPSI0414i-sebn_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376769/HPSI0414i-sebn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376769/HPSI0414i-sebn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi ## HPSI0514i-uenn_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ488/ERZ488039/HPSI0514i-uenn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ488/ERZ488039/HPSI0514i-uenn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi ## HPSI0714i-pipw_4 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376869/HPSI0714i-pipw_4.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376869/HPSI0714i-pipw_4.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi ## HPSI0715i-meue_5 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376787/HPSI0715i-meue_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376787/HPSI0715i-meue_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi ## HPSI0914i-vaka_5 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487965/HPSI0914i-vaka_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487965/HPSI0914i-vaka_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi ## HPSI1014i-quls_2 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487886/HPSI1014i-quls_2.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget 
ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487886/HPSI1014i-quls_2.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi Upon downloading the individual genotype data, we will merge the individual files to generate a single .vcf file. ## Merge .vcf files module load bcftools bcftools merge *.vcf.gz > sample_genotype_merge.vcf The resulting sample_genotype_merge.vcf file will be used as prior genotype information for genetic demultiplexing downstream. Downloading reference genotype data Next, we will download a reference genotype file from the 1000 Genomes Project, Phase 3 : ## Create a folder to place the reference files cd ~/ensemblex_tutorial mkdir reference_files cd reference_files ## Download reference .vcf wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz.tbi ## Unzip .vcf file gunzip ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz ## Only keep SNPs module load vcftools vcftools --vcf ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only ## Only keep common variants module load bcftools bcftools filter -e 'AF<0.01' SNPs_only.recode.vcf > common_SNPs_only.recode.vcf The resulting common_SNPs_only.recode.vcf file will be used as reference genotype data for genetic demultiplexing downstream. Downloading genome reference file Finally, we will prepare a reference genome. For our tutorial we will use the GRCh37 10X reference genome. For information regarding references, see the 10X documentation . 
## Copy pre-built reference genome to working directory cp /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa ~/ensemblex_pipeline_test/ensemblex_tutorial/reference_files We will use the genome.fa reference genome for genetic demultiplexing downstream. To run the Ensemblex pipeline on the downloaded data, please see the Ensemblex with prior genotype information section of the Ensemblex documentation.","title":"Downloading data"},{"location":"midbrain_download/#data-download","text":"Introduction Downloading and processing scRNAseq data Downloading sample genotype data Downloading reference genotype data Downloading genome reference file","title":"Data Download"},{"location":"midbrain_download/#introduction","text":"For the tutorial, we will leverage a pooled scRNAseq dataset produced by Jerber et al. . This pool contains induced pluripotent stem cell (iPSC) lines from 9 healthy controls that were differentiated towards a dopaminergic neuron state. 
In this section of the tutorial, we will: Download and process the pooled scRNAseq data with the CellRanger counts pipeline Download and process the sample genotype data Download reference genotype data Download a reference genome file Before we begin, we will create a designated folder for the Ensemblex tutorial: mkdir ensemblex_tutorial cd ensemblex_tutorial","title":"Introduction"},{"location":"midbrain_download/#downloading-and-processing-scrnaseq-data","text":"We will begin by downloading the pooled scRNAseq data from the Sequence Read Archive (SRA): ## Create a folder to place pooled scRNAseq data mkdir pooled_scRNAseq cd pooled_scRNAseq ## Download pooled scRNAseq FASTQ files wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/009/ERR4700019/ERR4700019_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/009/ERR4700019/ERR4700019_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/000/ERR4700020/ERR4700020_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/000/ERR4700020/ERR4700020_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/001/ERR4700021/ERR4700021_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/001/ERR4700021/ERR4700021_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/002/ERR4700022/ERR4700022_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR470/002/ERR4700022/ERR4700022_2.fastq.gz ## Rename pooled scRNAseq FASTQ files mv ERR4700019_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L001_R1_001.fastq.gz mv ERR4700019_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L001_R2_001.fastq.gz mv ERR4700020_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L002_R1_001.fastq.gz mv ERR4700020_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L002_R2_001.fastq.gz mv ERR4700021_1.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L003_R1_001.fastq.gz mv ERR4700021_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L003_R2_001.fastq.gz mv ERR4700022_1.fastq.gz 
~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L004_R1_001.fastq.gz mv ERR4700022_2.fastq.gz ~/ensemblex_tutorial/pooled_scRNAseq/pool_S1_L004_R2_001.fastq.gz Next, we will process the pooled scRNAseq data with the CellRanger counts pipeline: ## Create CellRanger directory cd ~/ensemblex_tutorial mkdir CellRanger cd CellRanger cellranger count \\ --id=pool \\ --fastqs=/home/fiorini9/scratch/ensemblex_pipeline_test/ensemblex_tutorial/pooled_scRNAseq \\ --sample=pool \\ --transcriptome=~/10xGenomics/refdata-cellranger-GRCh37 If the CellRanger counts pipeline completed successfully, it will have generated the following files that we will use for genetic demultiplexing downstream: possorted_genome_bam.bam possorted_genome_bam.bam.bai barcodes.tsv NOTE : For more information regarding the CellRanger counts pipeline, please see the 10X documentation .","title":"Downloading and processing scRNAseq data"},{"location":"midbrain_download/#downloading-sample-genotype-data","text":"Next, we will download the whole exome .vcf files corresponding to the nine pooled individuals from which the iPSC lines derived. 
We will download the .vcf files from the European Nucleotide Archive (ENA): ## Create a folder to place sample genotype data cd ~/ensemblex_tutorial mkdir sample_genotype cd sample_genotype ## HPSI0115i-hecn_6 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487971/HPSI0115i-hecn_6.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487971/HPSI0115i-hecn_6.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi ## HPSI0214i-pelm_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ122/ERZ122924/HPSI0214i-pelm_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20150415.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ122/ERZ122924/HPSI0214i-pelm_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20150415.genotypes.vcf.gz.tbi ## HPSI0314i-sojd_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ266/ERZ266723/HPSI0314i-sojd_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20160122.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ266/ERZ266723/HPSI0314i-sojd_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20160122.genotypes.vcf.gz.tbi ## HPSI0414i-sebn_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376769/HPSI0414i-sebn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376769/HPSI0414i-sebn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi ## HPSI0514i-uenn_3 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ488/ERZ488039/HPSI0514i-uenn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ488/ERZ488039/HPSI0514i-uenn_3.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi ## HPSI0714i-pipw_4 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376869/HPSI0714i-pipw_4.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz wget 
ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376869/HPSI0714i-pipw_4.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi ## HPSI0715i-meue_5 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376787/HPSI0715i-meue_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ376/ERZ376787/HPSI0715i-meue_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20161031.genotypes.vcf.gz.tbi ## HPSI0914i-vaka_5 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487965/HPSI0914i-vaka_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487965/HPSI0914i-vaka_5.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi ## HPSI1014i-quls_2 wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487886/HPSI1014i-quls_2.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ487/ERZ487886/HPSI1014i-quls_2.wes.exomeseq.SureSelect_HumanAllExon_v5.mpileup.20170327.genotypes.vcf.gz.tbi Upon downloading the individual genotype data, we will merge the individual files to generate a single .vcf file. 
## Merge .vcf files module load bcftools bcftools merge *.vcf.gz > sample_genotype_merge.vcf The resulting sample_genotype_merge.vcf file will be used as prior genotype information for genetic demultiplexing downstream.","title":"Downloading sample genotype data"},{"location":"midbrain_download/#downloading-reference-genotype-data","text":"Next, we will download a reference genotype file from the 1000 Genomes Project, Phase 3 : ## Create a folder to place the reference files cd ~/ensemblex_tutorial mkdir reference_files cd reference_files ## Download reference .vcf wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz.tbi ## Unzip .vcf file gunzip ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz ## Only keep SNPs module load vcftools vcftools --vcf ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only ## Only keep common variants module load bcftools bcftools filter -e 'AF<0.01' SNPs_only.recode.vcf > common_SNPs_only.recode.vcf The resulting common_SNPs_only.recode.vcf file will be used as reference genotype data for genetic demultiplexing downstream.","title":"Downloading reference genotype data"},{"location":"midbrain_download/#downloading-genome-reference-file","text":"Finally, we will prepare a reference genome. For our tutorial we will use the GRCh37 10X reference genome. For information regarding references, see the 10X documentation . ## Copy pre-built reference genome to working directory cp /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh37/genome/10xGenomics/refdata-cellranger-GRCh37/fasta/genome.fa ~/ensemblex_pipeline_test/ensemblex_tutorial/reference_files We will use the genome.fa reference genome for genetic demultiplexing downstream. 
To run the Ensemblex pipeline on the downloaded data, please see the Ensemblex with prior genotype information section of the Ensemblex documentation.","title":"Downloading genome reference file"},{"location":"outputs/","text":"Ensemblex algorithm outputs Introduction Outputs Merging constituent output files Step 1: Accuracy-weighted probabilistic ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection Singlet confidence score Introduction After applying the Ensemblex algorithm to the output files of the constituent genetic demultiplexing tools in Step 4, the ~/working_directory/ensemblex folder will have the following structure: working_directory \u2514\u2500\u2500 ensemblex \u251c\u2500\u2500 constituent_tool_merge.csv \u251c\u2500\u2500 step1 \u251c\u2500\u2500 step2 \u251c\u2500\u2500 step3 \u2514\u2500\u2500 confidence constituent_tool_merge.csv contains the merged outputs from each constituent genetic demultiplexing tool. step1/ contains the outputs from Step 1: accuracy-weighted probabilistic ensemble. step2/ contains the outputs from Step 2: graph-based doublet detection. step3/ contains the outputs from Step 3: ensemble-independent doublet detection. confidence/ contains the final Ensemblex output file, whose sample labels have been annotated with the Ensemblex singlet confidence score. Note: If users re-run a step of the Ensemblex workflow, the outputs from the previous run will automatically be overwritten. If you do not want to lose the outputs from a previous run, it is important to copy the materials to a separate directory. Outputs Merging constituent output files Ensemblex begins by merging the output files of the constituent genetic demultiplexing tools by cell barcode, which produces the constituent_tool_merge.csv file. 
In this file, each constituent genetic demultiplexing tool has two columns corresponding to its sample labels: demuxalot_assignment demuxalot_best_assignment demuxlet_assignment demuxlet_best_assignment souporcell_assignment souporcell_best_assignment vireo_assignment vireo_best_assignment Taking Vireo as an example, vireo_assignment shows Vireo's sample labels after applying its recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will be labeled as \"unassigned\". In turn, vireo_best_assignment shows Vireo's best guess assignments without applying the recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will still show the best sample label and will not be labeled as \"unassigned\". The constituent_tool_merge.csv file also contains a general_consensus column. These are not Ensemblex's sample labels . The general_consensus column simply shows the sample labels that result from a majority vote classifier; split decisions are labeled as unassigned. Step 1: Accuracy-weighted probabilistic ensemble After running Step 1 of the Ensemblex algorithm, the /step1 folder will contain the following files: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 step1 \u251c\u2500\u2500 ARI_demultiplexing_tools.pdf \u251c\u2500\u2500 BA_demultiplexing_tools.pdf \u251c\u2500\u2500 Balanced_accuracy_summary.csv \u2514\u2500\u2500 Step1_cell_assignment.csv Output type Name Description Figure ARI_demultiplexing_tools.pdf Heatmap showing the Adjusted Rand Index (ARI) between the sample labels of the constituent genetic demultiplexing tools. Figure BA_demultiplexing_tools.pdf Barplot showing the estimated balanced accuracy for each constituent genetic demultiplexing tool. File Balanced_accuracy_summary.csv Summary file describing the estimated balanced accuracy computation for each constituent genetic demultiplexing tool. 
File Step1_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 1: accuracy-weighted probabilistic ensemble. The Step1_cell_assignment.csv file contains the following important columns: ensemblex_assignment : Ensemblex sample labels after performing accuracy-weighted probabilistic ensemble. ensemblex_probability : Accuracy-weighted ensemble probability corresponding to Ensemblex's sample labels. NOTE : Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low-confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts. Step 2: Graph-based doublet detection After running Step 2 of the Ensemblex algorithm, the /step2 folder will contain the following files: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 step2 \u251c\u2500\u2500 optimal_nCD.pdf \u251c\u2500\u2500 optimal_pT.pdf \u251c\u2500\u2500 PC1_var_contrib.pdf \u251c\u2500\u2500 PC2_var_contrib.pdf \u251c\u2500\u2500 PCA1_graph_based_doublet_detection.pdf \u251c\u2500\u2500 PCA2_graph_based_doublet_detection.pdf \u251c\u2500\u2500 PCA3_graph_based_doublet_detection.pdf \u251c\u2500\u2500 PCA_plot.pdf \u251c\u2500\u2500 PCA_scree_plot.pdf \u2514\u2500\u2500 Step2_cell_assignment.csv Output type Name Description Figure optimal_nCD.pdf Dot plot showing the optimal nCD value. Figure optimal_pT.pdf Dot plot showing the optimal pT value. Figure PC1_var_contrib.pdf Bar plot showing the contribution of each variable to the variation across the first principal component. Figure PC2_var_contrib.pdf Bar plot showing the contribution of each variable to the variation across the second principal component. Figure PCA1_graph_based_doublet_detection.pdf PCA showing Ensemblex sample labels (singlet or doublet) prior to performing graph-based doublet detection. 
Figure PCA2_graph_based_doublet_detection.pdf PCA showing the cells identified as the n most confident doublets in the pool. Figure PCA3_graph_based_doublet_detection.pdf PCA showing Ensemblex sample labels (singlet or doublet) after performing graph-based doublet detection. Figure PCA_plot.pdf PCA of pooled cells. Figure PCA_scree_plot.pdf Bar plot showing the variance explained by each principal component. File Step2_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 2: graph-based doublet detection. The Step2_cell_assignment.csv file contains the following important column: ensemblex_assignment : Ensemblex sample labels after performing graph-based doublet detection. NOTE : Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low-confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts. Step 3: Ensemble-independent doublet detection After running Step 3 of the Ensemblex algorithm, the /step3 folder will contain the following files: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 step3 \u251c\u2500\u2500 Doublet_overlap_no_threshold.pdf \u251c\u2500\u2500 Doublet_overlap_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_no_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_threshold.pdf \u2514\u2500\u2500 Step3_cell_assignment.csv Output type Name Description Figure Doublet_overlap_no_threshold.pdf Proportion of doublet calls overlapping between constituent genetic demultiplexing tools without applying assignment probability thresholds. Figure Doublet_overlap_threshold.pdf Proportion of doublet calls overlapping between constituent genetic demultiplexing tools after applying assignment probability thresholds. 
Figure Number_ensemblex_doublets_EID_no_threshold.pdf Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, without applying assignment probability thresholds. Figure Number_ensemblex_doublets_EID_threshold.pdf Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, after applying assignment probability thresholds. File Step3_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 3: ensemble-independent doublet detection. The Step3_cell_assignment.csv file contains the following important column: ensemblex_assignment : Ensemblex sample labels after performing ensemble-independent doublet detection. NOTE : Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts. Singlet confidence score After computing the Ensemblex singlet confidence score, the /confidence folder will contain the following file: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 confidence \u2514\u2500\u2500 ensemblex_final_cell_assignment.csv Output type Name Description File ensemblex_final_cell_assignment.csv Data file containing Ensemblex's final sample labels after computing the singlet confidence score. The ensemblex_final_cell_assignment.csv file contains the following important column: ensemblex_assignment : Ensemblex sample labels after applying the recommended singlet confidence score threshold; singlets with a confidence score < 1 are labeled as \"unassigned\". 
ensemblex_best_assignment : Ensemblex's best guess assignments without applying the recommended confidence score threshold; singlets with a confidence score < 1 will still show the best sample label and will not be labelled as \"unassigned\". ensemblex_singlet_confidence : Ensemblex singlet confidence score. NOTE : We recommend using the sample labels from ensemblex_assignment for downstream analyses.","title":"Ensemblex outputs"},{"location":"outputs/#ensemblex-algorithm-outputs","text":"Introduction Outputs Merging constituent output files Step 1: Accuracy-weighted probabilistic ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection Singlet confidence score","title":"Ensemblex algorithm outputs"},{"location":"outputs/#introduction","text":"After applying the Ensemblex algorithm to the output files of the constituent genetic demultiplexing tools in Step 4, the ~/working_directory/ensemblex folder will have the following structure: working_directory \u2514\u2500\u2500 ensemblex \u251c\u2500\u2500 constituent_tool_merge.csv \u251c\u2500\u2500 step1 \u251c\u2500\u2500 step2 \u251c\u2500\u2500 step3 \u2514\u2500\u2500 confidence constituent_tool_merge.csv is the merged outputs from each constituent genetic demultiplexing tool. step1/ contains the outputs from Step 1: accuracy-weighted probabilistic ensemble. step2/ contains the outputs from Step 2: graph-based doublet detection. step3/ contains the outputs from Step 3: ensemble-independent doublet detection. confidence/ contains the final Ensemblex output file, whose sample labels have been annotated with the Ensemblex singlet confidence score. Note: If users re-run a step of the Ensemblex workflow, the outputs from the previous run will automatically be overwritten. 
If you do not want to lose the outputs from a previous run, it is important to copy the materials to a separate directory.","title":"Introduction"},{"location":"outputs/#outputs","text":"","title":"Outputs"},{"location":"outputs/#merging-constituent-output-files","text":"Ensemblex begins by merging the output files of the constituent genetic demultiplexing tools by cell barcode, which produces the constituent_tool_merge.csv file. In this file, each constituent genetic demultiplexing tool has two columns corresponding to their sample labels: demuxalot_assignment demuxalot_best_assignment demuxlet_assignment demuxlet_best_assignment souporcell_assignment souporcell_best_assignment vireo_assignment vireo_best_assignment Taking Vireo as an example, vireo_assignment shows Vireo's sample labels after applying its recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will be labeled as \"unassigned\". In turn, vireo_best_assignment shows Vireo's best guess assignments without applying the recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will still show the best sample label and will not be labelled as \"unassigned\". The constituent_tool_merge.csv file also contains a general_consensus column. This is not Ensemblex's sample labels . 
The general_consensus column simply shows the sample labels that result from a majority vote classifier; split decisions are labeled as unassigned.","title":"Merging constituent output files"},{"location":"outputs/#step-1-accuracy-weighted-probabilistic-ensemble","text":"After running Step 1 of the Ensemblex algorithm, the /PWE folder will contain the following files: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 step1 \u251c\u2500\u2500 ARI_demultiplexing_tools.pdf \u251c\u2500\u2500 BA_demultiplexing_tools.pdf \u251c\u2500\u2500 Balanced_accuracy_summary.csv \u2514\u2500\u2500 Step1_cell_assignment.csv Output type Name Description Figure ARI_demultiplexing_tools.pdf Heatmap showing the Adjusted Rand Index (ARI) between the sample labels of the constituent genetic demultiplexing tools. Figure BA_demultiplexing_tools.pdf Barplot showing the estimated balanced accuracy for each constituent genetic demultiplexing tool. File Balanced_accuracy_summary.csv Summary file describing the estimated balanced accuracy computation for each constituent genetic demultiplexing tool. File Step1_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 1: accuracy-weighted probabilistic ensemble. The Step1_cell_assignment.csv file contains the following important columns: ensemblex_assignment : Ensemblex sample labels after performing accuracy-weighted probabilistic ensemble. ensemblex_probability : Accuracy-weighted ensemble probability corresponding to Ensemblex's sample labels. 
NOTE : Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts.","title":"Step 1: Accuracy-weighted probabilistic ensemble"},{"location":"outputs/#step-2-graph-based-doublet-detection","text":"After running Step 2 of the Ensemblex algorithm, the /GBD folder will contain the following files: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 step2 \u251c\u2500\u2500 optimal_nCD.pdf \u251c\u2500\u2500 optimal_pT.pdf \u251c\u2500\u2500 PC1_var_contrib.pdf \u251c\u2500\u2500 PC2_var_contrib.pdf \u251c\u2500\u2500 PCA1_graph_based_doublet_detection.pdf \u251c\u2500\u2500 PCA2_graph_based_doublet_detection.pdf \u251c\u2500\u2500 PCA3_graph_based_doublet_detection.pdf \u251c\u2500\u2500 PCA_plot.pdf \u251c\u2500\u2500 PCA_scree_plot.pdf \u2514\u2500\u2500 Step2_cell_assignment.csv Output type Name Description Figure optimal_nCD.pdf Dot plot showing the optimal nCD value. Figure optimal_pT.pdf Dot plot showing the optimal pT value. Figure PC1_var_contrib.pdf Bar plot showing the contribution of each variable to the variation across the first principal component. Figure PC2_var_contrib.pdf Bar plot showing the contribution of each variable to the variation across the second principal component. Figure PCA1_graph_based_doublet_detection.pdf PCA showing Ensemblex sample labels (singlet or doublet) prior to performing graph-based doublet detection. Figure PCA2_graph_based_doublet_detection.pdf PCA showing the cells identified as the n most confident doublets in the pool. Figure PCA3_graph_based_doublet_detection.pdf PCA showing Ensemblex sample labels (singlet or doublet) after performing graph-based doublet detection. Figure PCA_plot.pdf PCA of pooled cells. Figure PCA_scree_plot.pdf Bar plot showing the variance explained by each principal component. 
File Step2_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 2: graph-based doublet detection. The Step2_cell_assignment.csv file contains the following important column: ensemblex_assignment : Ensemblex sample labels after performing graph-based doublet detection. NOTE : Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts.","title":"Step 2: Graph-based doublet detection"},{"location":"outputs/#step-3-ensemble-independent-doublet-detection","text":"After running Step 3 of the Ensemblex algorithm, the /EID folder will contain the following files: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 step3 \u251c\u2500\u2500 Doublet_overlap_no_threshold.pdf \u251c\u2500\u2500 Doublet_overlap_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_no_threshold.pdf \u251c\u2500\u2500 Number_ensemblex_doublets_EID_threshold.pdf \u2514\u2500\u2500 Step3_cell_assignment.csv Output type Name Description Figure Doublet_overlap_no_threshold.pdf Proportion of doublet calls overlapping between constituent genetic demultiplexing tools without applying assignment probability thresholds. Figure Doublet_overlap_threshold.pdf Proportion of doublet calls overlapping between constituent genetic demultiplexing tools after applying assignment probability thresholds. Figure Number_ensemblex_doublets_EID_no_threshold.pdf Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, without applying assignment probability thresholds. 
Figure Number_ensemblex_doublets_EID_threshold.pdf Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, after applying assignment probability thresholds. File Step3_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 3: ensemble-independent doublet detection. The Step3_cell_assignment.csv file contains the following important column: ensemblex_assignment : Ensemblex sample labels after performing ensemble-independent doublet detection. NOTE : Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artifacts.","title":"Step 3: Ensemble-independent doublet detection"},{"location":"outputs/#singlet-confidence-score","text":"After computing the Ensemblex singlet confidence score, the /confidence folder will contain the following file: working_directory \u2514\u2500\u2500 ensemblex \u2514\u2500\u2500 confidence \u2514\u2500\u2500 ensemblex_final_cell_assignment.csv Output type Name Description File ensemblex_final_cell_assignment.csv Data file containing Ensemblex's final sample labels after computing the singlet confidence score. The ensemblex_final_cell_assignment.csv file contains the following important column: ensemblex_assignment : Ensemblex sample labels after applying the recommended singlet confidence score threshold; singlets with a confidence score < 1 are labeled as \"unassigned\". ensemblex_best_assignment : Ensemblex's best guess assignments without applying the recommended confidence score threshold; singlets with a confidence score < 1 will still show the best sample label and will not be labelled as \"unassigned\". ensemblex_singlet_confidence : Ensemblex singlet confidence score. 
NOTE : We recommend using the sample labels from ensemblex_assignment for downstream analyses.","title":"Singlet confidence score"},{"location":"overview/","text":"Ensemblex algorithm overview Workflow Step 1: Accuracy-weighted probabilistic ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection Contribution of each step to overall demultiplexing accuracy Workflow The Ensemblex workflow begins by demultiplexing pooled cells with each of its constituent tools: Demuxalot, Demuxlet, Souporcell and Vireo-GT if using prior genotype information or Demuxalot, Freemuxlet, Souporcell and Vireo if prior genotype information is not available. Figure 1. Input into the Ensemblex framework. The Ensemblex workflow begins with demultiplexing pooled samples by each of the constituent tools. The outputs from each individual demultiplexing tool are then used as input into the Ensemblex framework. Upon demultiplexing pools with each individual constituent genetic demultiplexing tool, Ensemblex processes the outputs in a three-step pipeline: Step 1: Accuracy-weighted probabilistic ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection Figure 2. Overview of the three-step Ensemblex framework. The Ensemblex framework comprises three distinct steps that are assembled into a pipeline: 1) accuracy-weighted probabilistic ensemble, 2) graph-based doublet detection, and 3) ensemble-independent doublet detection. For demonstration purposes throughout this section, we leveraged simulated pools with known ground-truth sample labels that were generated with 80 independently sequenced induced pluripotent stem cell (iPSC) lines from individuals with Parkinson's disease and neurologically healthy controls. The lines were differentiated towards a dopaminergic cell fate as part of the Foundational Data Initiative for Parkinson's disease (FOUNDIN-PD; Bressan et al. 
) Step 1: Accuracy-weighted probabilistic ensemble The accuracy-weighted probabilistic ensemble component of the Ensemblex framework utilizes an unsupervised weighting model to identify the most probable sample label for each cell. Ensemblex weighs each constituent tool\u2019s assignment probability distribution by its estimated balanced accuracy for the dataset in a framework that was largely inspired by the work of Large et al. . To estimate the balanced accuracy of a particular constituent tool (e.g. Demuxalot) for real-world datasets lacking ground-truth labels, Ensemblex leverages the cells with a consensus assignment across the three remaining tools (e.g. Demuxlet, Souporcell, and Vireo-GT) as a proxy for ground-truth. The weighted assignment probabilities across all four constituent tools are then used to inform the most probable sample label for each cell. Figure 3. Graphical representation of the accuracy-weighted probabilistic ensemble component of the Ensemblex framework. Step 2: Graph-based doublet detection The graph-based doublet detection component of the Ensemblex framework was implemented to identify doublets that are incorrectly labeled as singlets by the accuracy-weighted probabilistic ensemble component (Step 1). To demonstrate Step 2 of the Ensemblex framework we leveraged a simulated pool comprising 24 pooled samples, 17,384 cells, and a 15% doublet rate. Figure 4. Graphical representation of the graph-based doublet detection component of the Ensemblex framework. The graph-based doublet detection component begins by leveraging select variables returned from each constituent tool: Demuxalot: doublet probability; Demuxlet/Freemuxlet: singlet log likelihood \u2013 doublet log likelihood; Demuxlet/Freemuxlet: number of single nucleotide polymorphisms (SNP) per cell; Demuxlet/Freemuxlet: number of reads per cell; Souporcell: doublet log probability; Vireo: doublet probability; Vireo: doublet log likelihood ratio Figure 5. 
Select variables returned by the constituent genetic demultiplexing tools used for graph-based doublet detection. Using these variables, Ensemblex screens each pooled cell to identify the n most confident doublets in the pool and performs a principal component analysis (PCA). Figure 6. PCA of pooled cells using select variables returned by the constituent genetic demultiplexing tools. A) PCA highlighting ground truth cell labels: singlet or doublet. B) PCA highlighting the n most confident doublets identified by Ensemblex. The PCA embedding is then converted into a Euclidean distance matrix and each cell is assigned a percentile rank based on their distance to each confident doublet. After performing an automated parameter sweep, Ensemblex identifies the droplets that appear most frequently amongst the nearest neighbours of confident doublets as doublets. Figure 7. PCA of pooled cells labeled according to Ensemblex labels prior to and after graph-based doublet detection. A) PCA highlighting ground truth cell labels: singlet or doublet. B) PCA highlighting Ensemblex's labels prior to graph-based doublet detection. C) PCA highlighting Ensemblex's labels after graph-based doublet detection. Step 3: Ensemble-independent doublet detection The ensemble-independent doublet detection component of the Ensemblex framework was implemented to further improve Ensemblex's ability to identify doublets. Benchmarking on simulated pools with known ground-truth sample labels revealed that certain genetic demultiplexing tools, namely Demuxalot and Vireo, showed high doublet detection specificity. Figure 8. Constituent genetic demultiplexing tool doublet specificity on computationally multiplexed pools with ground truth sample labels. Doublet specificity was evaluated on pools ranging in size from 4 to 80 multiplexed samples. However, Steps 1 and 2 of the Ensemblex workflow failed to correctly label a subset of doublet calls by these tools. 
To mitigate this issue and maximize the rate of doublet identification, Ensemblex labels the cells that are identified as doublets by Vireo or Demuxalot as doublets, by default; however, users can nominate different tools for the ensemble-independent doublet detection component depending on the desired doublet detection stringency. Figure 9. Graphical representation of the ensemble-independent doublet detection component of the Ensemblex framework. Contribution of each step to overall demultiplexing accuracy We sequentially applied each step of the Ensemblex framework to 96 computationally multiplexed pools with known ground truth sample labels ranging in size from 4 to 80 samples. The proportion of correctly classified singlets and doublets identified by Ensemblex after each step of the framework is shown in Figure 10. Figure 10. Contribution of each component of the Ensemblex framework to demultiplexing accuracy. The average proportion of correctly classified A) singlets and B) doublets across replicates at a given pool size is shown after sequentially applying each step of the Ensemblex framework. The right panels show the average proportion of correct classifications across all 96 pools. The blue points show the proportion of cells that were correctly classified by at least one tool: Demuxalot, Demuxlet, Souporcell, or Vireo. 
For detailed methodology please see our pre-print manuscript .","title":"Ensemblex algorithm overview"},{"location":"overview/#ensemblex-algorithm-overview","text":"Workflow Step 1: Accuracy-weighted probabilistic ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection Contribution of each step to overall demultiplexing accuracy","title":"Ensemblex algorithm overview"},{"location":"overview/#workflow","text":"The Ensemblex workflow begins by demultiplexing pooled cells with each of its constituent tools: Demuxalot, Demuxlet, Souporcell and Vireo-GT if using prior genotype information or Demuxalot, Freemuxlet, Souporcell and Vireo if prior genotype information is not available. Figure 1. Input into the Ensemblex framework. The Ensemblex workflow begins with demultiplexing pooled samples by each of the constituent tools. The outputs from each individual demultiplexing tool are then used as input into the Ensemblex framework. Upon demultiplexing pools with each individual constituent genetic demultiplexing tool, Ensemblex processes the outputs in a three-step pipeline: Step 1: Accuracy-weighted probabilistic ensemble Step 2: Graph-based doublet detection Step 3: Ensemble-independent doublet detection Figure 2. Overview of the three-step Ensemblex framework. The Ensemblex framework comprises three distinct steps that are assembled into a pipeline: 1) accuracy-weighted probabilistic ensemble, 2) graph-based doublet detection, and 3) ensemble-independent doublet detection. For demonstration purposes throughout this section, we leveraged simulated pools with known ground-truth sample labels that were generated with 80 independently sequenced induced pluripotent stem cell (iPSC) lines from individuals with Parkinson's disease and neurologically healthy controls. The lines were differentiated towards a dopaminergic cell fate as part of the Foundational Data Initiative for Parkinson's disease (FOUNDIN-PD; Bressan et al. 
)","title":"Workflow"},{"location":"overview/#step-1-accuracy-weighted-probabilistic-ensemble","text":"The accuracy-weighted probabilistic ensemble component of the Ensemblex framework utilizes an unsupervised weighting model to identify the most probable sample label for each cell. Ensemblex weighs each constituent tool\u2019s assignment probability distribution by its estimated balanced accuracy for the dataset in a framework that was largely inspired by the work of Large et al. . To estimate the balanced accuracy of a particular constituent tool (e.g. Demuxalot) for real-world datasets lacking ground-truth labels, Ensemblex leverages the cells with a consensus assignment across the three remaining tools (e.g. Demuxlet, Souporcell, and Vireo-GT) as a proxy for ground-truth. The weighted assignment probabilities across all four constituent tools are then used to inform the most probable sample label for each cell. Figure 3. Graphical representation of the accuracy-weighted probabilistic ensemble component of the Ensemblex framework.","title":"Step 1: Accuracy-weighted probabilistic ensemble"},{"location":"overview/#step-2-graph-based-doublet-detection","text":"The graph-based doublet detection component of the Ensemblex framework was implemented to identify doublets that are incorrectly labeled as singlets by the accuracy-weighted probabilistic ensemble component (Step 1). To demonstrate Step 2 of the Ensemblex framework we leveraged a simulated pool comprising 24 pooled samples, 17,384 cells, and a 15% doublet rate. Figure 4. Graphical representation of the graph-based doublet detection component of the Ensemblex framework. 
The graph-based doublet detection component begins by leveraging select variables returned from each constituent tool: Demuxalot: doublet probability; Demuxlet/Freemuxlet: singlet log likelihood \u2013 doublet log likelihood; Demuxlet/Freemuxlet: number of single nucleotide polymorphisms (SNP) per cell; Demuxlet/Freemuxlet: number of reads per cell; Souporcell: doublet log probability; Vireo: doublet probability; Vireo: doublet log likelihood ratio Figure 5. Select variables returned by the constituent genetic demultiplexing tools used for graph-based doublet detection. Using these variables, Ensemblex screens each pooled cell to identify the n most confident doublets in the pool and performs a principal component analysis (PCA). Figure 6. PCA of pooled cells using select variables returned by the constituent genetic demultiplexing tools. A) PCA highlighting ground truth cell labels: singlet or doublet. B) PCA highlighting the n most confident doublets identified by Ensemblex. The PCA embedding is then converted into a Euclidean distance matrix and each cell is assigned a percentile rank based on their distance to each confident doublet. After performing an automated parameter sweep, Ensemblex identifies the droplets that appear most frequently amongst the nearest neighbours of confident doublets as doublets. Figure 7. PCA of pooled cells labeled according to Ensemblex labels prior to and after graph-based doublet detection. A) PCA highlighting ground truth cell labels: singlet or doublet. B) PCA highlighting Ensemblex's labels prior to graph-based doublet detection. C) PCA highlighting Ensemblex's labels after graph-based doublet detection.","title":"Step 2: Graph-based doublet detection"},{"location":"overview/#step-3-ensemble-independent-doublet-detection","text":"The ensemble-independent doublet detection component of the Ensemblex framework was implemented to further improve Ensemblex's ability to identify doublets. 
Benchmarking on simulated pools with known ground-truth sample labels revealed that certain genetic demultiplexing tools, namely Demuxalot and Vireo, showed high doublet detection specificity. Figure 8. Constituent genetic demultiplexing tool doublet specificity on computationally multiplexed pools with ground truth sample labels. Doublet specificity was evaluated on pools ranging in size from 4 to 80 multiplexed samples. However, Steps 1 and 2 of the Ensemblex workflow failed to correctly label a subset of doublet calls by these tools. To mitigate this issue and maximize the rate of doublet identification, Ensemblex labels the cells that are identified as doublets by Vireo or Demuxalot as doublets, by default; however, users can nominate different tools for the ensemble-independent doublet detection component depending on the desired doublet detection stringency. Figure 9. Graphical representation of the ensemble-independent doublet detection component of the Ensemblex framework.","title":"Step 3: Ensemble-independent doublet detection"},{"location":"overview/#contribution-of-each-step-to-overall-demultiplexing-accuracy","text":"We sequentially applied each step of the Ensemblex framework to 96 computationally multiplexed pools with known ground truth sample labels ranging in size from 4 to 80 samples. The proportion of correctly classified singlets and doublets identified by Ensemblex after each step of the framework is shown in Figure 10. Figure 10. Contribution of each component of the Ensemblex framework to demultiplexing accuracy. The average proportion of correctly classified A) singlets and B) doublets across replicates at a given pool size is shown after sequentially applying each step of the Ensemblex framework. The right panels show the average proportion of correct classifications across all 96 pools. The blue points show the proportion of cells that were correctly classified by at least one tool: Demuxalot, Demuxlet, Souporcell, or Vireo. 
For detailed methodology please see our pre-print manuscript .","title":"Contribution of each step to overall demultiplexing accuracy"},{"location":"overview_pipeline/","text":"Ensemblex pipeline overview The Ensemblex pipeline was developed to facilitate the application of each of Ensemblex's constituent demultiplexing tools and seamlessly integrate the output files into the Ensemblex framework. We provide two distinct, yet highly comparable pipelines: Demultiplexing with prior genotype information Demultiplexing without prior genotype information The pipelines comprise four distinct steps: Selection of Ensemblex pipeline and establishing the working directory (Set up) Prepare input files for constituent genetic demultiplexing tools Genetic demultiplexing by constituent demultiplexing tools Application of the Ensemblex framework Each step of the pipeline is comprehensively described in the following sections of the Ensemblex documentation.","title":"Ensemblex pipeline overview"},{"location":"overview_pipeline/#ensemblex-pipeline-overview","text":"The Ensemblex pipeline was developed to facilitate the application of each of Ensemblex's constituent demultiplexing tools and seamlessly integrate the output files into the Ensemblex framework. 
We provide two distinct, yet highly comparable pipelines: Demultiplexing with prior genotype information Demultiplexing without prior genotype information The pipelines comprise four distinct steps: Selection of Ensemblex pipeline and establishing the working directory (Set up) Prepare input files for constituent genetic demultiplexing tools Genetic demultiplexing by constituent demultiplexing tools Application of the Ensemblex framework Each step of the pipeline is comprehensively described in the following sections of the Ensemblex documentation.","title":"Ensemblex pipeline overview"},{"location":"reference/","text":"Adjustable execution parameters for the Ensemblex pipeline Introduction How to modify the parameter files Constituent genetic demultiplexing tools with prior genotype information Demuxalot Demuxlet Souporcell Vireo Constituent genetic demultiplexing tools without prior genotype information Demuxalot Freemuxlet Souporcell Vireo Ensemblex algorithm Introduction Prior to running the Ensemblex pipeline, users should modify the execution parameters for the constituent genetic demultiplexing tools and the Ensemblex algorithm. Upon running Step 1: Set up , a /job_info folder will be created in the working directory. Within the /job_info folder is a /configs folder which contains the ensemblex_config.ini ; this .ini file contains all of the adjustable parameters for the Ensemblex pipeline. working_directory \u2514\u2500\u2500 job_info \u251c\u2500\u2500 configs \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u251c\u2500\u2500 logs \u2514\u2500\u2500 summary_report.txt To ensure replicability, the execution parameters are documented in ~/working_directory/job_info/summary_report.txt . How to modify the parameter files The following section illustrates how to modify the ensemblex_config.ini parameter file directly from the terminal. 
To begin, navigate to the /configs folder and view its contents: cd ~/working_directory/job_info/configs ls The following file will be available: ensemblex_config.ini To modify the ensemblex_config.ini parameter file directly in the terminal we will use Nano : nano ensemblex_config.ini This will open ensemblex_config.ini in the terminal and allow users to modify the parameters. To save the modifications and exit the parameter file, type ctrl+o followed by ctrl+x . Constituent genetic demultiplexing tools with prior genotype information Demuxalot The following parameters are adjustable for Demuxalot: Parameter Default Description PAR_demuxalot_genotype_names NULL List of Sample ID's in the sample VCF file (e.g., 'Sample_1,Sample_2,Sample_3'). PAR_demuxalot_minimum_coverage 200 Minimum read coverage. PAR_demuxalot_minimum_alternative_coverage 10 Minimum alternative read coverage. PAR_demuxalot_n_best_snps_per_donor 100 Number of best snps for each donor to use for demultiplexing. PAR_demuxalot_genotypes_prior_strength 1 Genotype prior strength. PAR_demuxalot_doublet_prior 0.25 Doublet prior strength. Demuxlet The following parameters are adjustable for Demuxlet: Parameter Default Description PAR_demuxlet_field GT Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. NOTE : We are currently working on expanding the execution parameters for Demuxlet. Vireo The following parameters are adjustable for Vireo: Parameter Default Description PAR_vireo_N NULL Number of pooled samples. PAR_vireo_type GT Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. PAR_vireo_processes 20 Number of subprocesses for computing. PAR_vireo_minMAF 0.1 Minimum minor allele frequency. PAR_vireo_minCOUNT 20 Minimum aggregated count. PAR_vireo_forcelearnGT T Whether or not to treat donor GT as prior only. 
NOTE : We are currently working on expanding the execution parameters for Vireo. Souporcell The following parameters are adjustable for Souporcell: Parameter Default Description PAR_minimap2 -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no For information regarding the minimap2 parameters, please see the documentation . PAR_freebayes -iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 For information regarding the freebayes parameters, please see the documentation . PAR_vartrix_umi TRUE Whether or not to consider UMI information when populating coverage matrices. PAR_vartrix_mapq 30 Minimum read mapping quality. PAR_vartrix_threads 8 Number of threads for computing. PAR_souporcell_k NULL Number of pooled samples. PAR_souporcell_t 8 Number of threads for computing. NOTE : We are currently working on expanding the execution parameters for Souporcell. Constituent genetic demultiplexing tools without prior genotype information Demuxalot The following parameters are adjustable for Demuxalot: Parameter Default Description PAR_demuxalot_genotype_names NULL List of Sample ID's in the sample VCF file generated by Freemuxlet: outs.clust1.vcf (e.g., 'CLUST0,CLUST1,CLUST2'). PAR_demuxalot_minimum_coverage 200 Minimum read coverage. PAR_demuxalot_minimum_alternative_coverage 10 Minimum alternative read coverage. PAR_demuxalot_n_best_snps_per_donor 100 Number of best snps for each donor to use for demultiplexing. PAR_demuxalot_genotypes_prior_strength 1 Genotype prior strength. PAR_demuxalot_doublet_prior 0.25 Doublet prior strength. Freemuxlet The following parameters are adjustable for Freemuxlet: Parameter Default Description PAR_freemuxlet_nsample NULL Number of pooled samples. NOTE : We are currently working on expanding the execution parameters for Freemuxlet. Vireo The following parameters are adjustable for Vireo: Parameter Default Description PAR_vireo_N NULL Number of pooled samples. 
PAR_vireo_processes 20 Number of subprocesses for computing. PAR_vireo_minMAF 0.1 Minimum minor allele frequency. PAR_vireo_minCOUNT 20 Minimum aggregated count. NOTE : We are currently working on expanding the execution parameters for Vireo. Souporcell The following parameters are adjustable for Souporcell: Parameter Default Description PAR_minimap2 -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no For information regarding the minimap2 parameters, please see the documentation . PAR_freebayes -iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 For information regarding the freebayes parameters, please see the documentation . PAR_vartrix_umi TRUE Whether or not to consider UMI information when populating coverage matrices. PAR_vartrix_mapq 30 Minimum read mapping quality. PAR_vartrix_threads 8 Number of threads for computing. PAR_souporcell_k NULL Number of pooled samples. PAR_souporcell_t 8 Number of threads for computing. NOTE : We are currently working on expanding the execution parameters for Souporcell. Ensemblex The following parameters are adjustable for the Ensemblex algorithm: Parameter Default Description Pool parameters PAR_ensemblex_sample_size NULL Number of samples multiplexed in the pool. PAR_ensemblex_expected_doublet_rate NULL Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation . Set up parameters PAR_ensemblex_merge_constituents Yes Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated merged file. 
Step 1 parameters: Probabilistic-weighted ensemble PAR_ensemblex_probabilistic_weighted_ensemble Yes Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated Step 1 output file. Step 2 parameters: Graph-based doublet detection PAR_ensemblex_preliminary_parameter_sweep No Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighbour frequency (pT), which can be defined in the following two parameters, respectively. PAR_ensemblex_nCD NULL Manually defined number of confident doublets in the pool (nCD). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_pT NULL Manually defined percentile threshold of the nearest neighbour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_graph_based_doublet_detection Yes Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values. Step 3 parameters: Ensemble-independent doublet detection PAR_ensemblex_preliminary_ensemble_independent_doublet No Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. 
Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters. PAR_ensemblex_ensemble_independent_doublet Yes Whether or not to perform Step 3: Ensemble-independent doublet detection. PAR_ensemblex_doublet_Demuxalot_threshold Yes Whether or not to label doublets identified by Demuxalot as doublets. Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxalot_no_threshold No Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Demuxlet_threshold No Whether or not to label doublets identified by Demuxlet as doublets. Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxlet_no_threshold No Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Souporcell_threshold No Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Souporcell_no_threshold No Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Vireo_threshold Yes Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex. 
PAR_ensemblex_doublet_Vireo_no_threshold No Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability. Confidence score parameters PAR_ensemblex_compute_singlet_confidence Yes Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses.","title":"Execution parameters"},{"location":"reference/#adjustable-execution-parameters-for-the-ensemblex-pipeline","text":"Introduction How to modify the parameter files Constituent genetic demultiplexing tools with prior genotype information Demuxalot Demuxlet Souporcell Vireo Constituent genetic demultiplexing tools without prior genotype information Demuxalot Freemuxlet Souporcell Vireo Ensemblex algorithm","title":"Adjustable execution parameters for the Ensemblex pipeline"},{"location":"reference/#introduction","text":"Prior to running the Ensemblex pipeline, users should modify the execution parameters for the constituent genetic demultiplexing tools and the Ensemblex algorithm. Upon running Step 1: Set up , a /job_info folder will be created in the working directory. Within the /job_info folder is a /configs folder which contains the ensemblex_config.ini ; this .ini file contains all of the adjustable parameters for the Ensemblex pipeline. working_directory \u2514\u2500\u2500 job_info \u251c\u2500\u2500 configs \u2502 \u2514\u2500\u2500 ensemblex_config.ini \u251c\u2500\u2500 logs \u2514\u2500\u2500 summary_report.txt To ensure replicability, the execution parameters are documented in ~/working_directory/job_info/summary_report.txt .","title":"Introduction"},{"location":"reference/#how-to-modify-the-parameter-files","text":"The following section illustrates how to modify the ensemblex_config.ini parameter file directly from the terminal. 
To begin, navigate to the /configs folder and view its contents: cd ~/working_directory/job_info/configs ls The following file will be available: ensemblex_config.ini To modify the ensemblex_config.ini parameter file directly in the terminal we will use Nano : nano ensemblex_config.ini This will open ensemblex_config.ini in the terminal and allow users to modify the parameters. To save the modifications and exit the parameter file, type ctrl+o followed by ctrl+x .","title":"How to modify the parameter files"},{"location":"reference/#constituent-genetic-demultiplexing-tools-with-prior-genotype-information","text":"","title":"Constituent genetic demultiplexing tools with prior genotype information"},{"location":"reference/#demuxalot","text":"The following parameters are adjustable for Demuxalot: Parameter Default Description PAR_demuxalot_genotype_names NULL List of Sample ID's in the sample VCF file (e.g., 'Sample_1,Sample_2,Sample_3'). PAR_demuxalot_minimum_coverage 200 Minimum read coverage. PAR_demuxalot_minimum_alternative_coverage 10 Minimum alternative read coverage. PAR_demuxalot_n_best_snps_per_donor 100 Number of best snps for each donor to use for demultiplexing. PAR_demuxalot_genotypes_prior_strength 1 Genotype prior strength. PAR_demuxalot_doublet_prior 0.25 Doublet prior strength.","title":"Demuxalot"},{"location":"reference/#demuxlet","text":"The following parameters are adjustable for Demuxlet: Parameter Default Description PAR_demuxlet_field GT Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. NOTE : We are currently working on expanding the execution parameters for Demuxlet.","title":"Demuxlet"},{"location":"reference/#vireo","text":"The following parameters are adjustable for Vireo: Parameter Default Description PAR_vireo_N NULL Number of pooled samples. 
PAR_vireo_type GT Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. PAR_vireo_processes 20 Number of subprocesses for computing. PAR_vireo_minMAF 0.1 Minimum minor allele frequency. PAR_vireo_minCOUNT 20 Minimum aggregated count. PAR_vireo_forcelearnGT T Whether or not to treat donor GT as prior only. NOTE : We are currently working on expanding the execution parameters for Vireo.","title":"Vireo"},{"location":"reference/#souporcell","text":"The following parameters are adjustable for Souporcell: Parameter Default Description PAR_minimap2 -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no For information regarding the minimap2 parameters, please see the documentation . PAR_freebayes -iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 For information regarding the freebayes parameters, please see the documentation . PAR_vartrix_umi TRUE Whether or not to consider UMI information when populating coverage matrices. PAR_vartrix_mapq 30 Minimum read mapping quality. PAR_vartrix_threads 8 Number of threads for computing. PAR_souporcell_k NULL Number of pooled samples. PAR_souporcell_t 8 Number of threads for computing. NOTE : We are currently working on expanding the execution parameters for Souporcell.","title":"Souporcell"},{"location":"reference/#constituent-genetic-demultiplexing-tools-without-prior-genotype-information","text":"","title":"Constituent genetic demultiplexing tools without prior genotype information"},{"location":"reference/#demuxalot_1","text":"The following parameters are adjustable for Demuxalot: Parameter Default Description PAR_demuxalot_genotype_names NULL List of Sample ID's in the sample VCF file generated by Freemuxlet: outs.clust1.vcf (e.g., 'CLUST0,CLUST1,CLUST2'). PAR_demuxalot_minimum_coverage 200 Minimum read coverage. 
PAR_demuxalot_minimum_alternative_coverage 10 Minimum alternative read coverage. PAR_demuxalot_n_best_snps_per_donor 100 Number of best snps for each donor to use for demultiplexing. PAR_demuxalot_genotypes_prior_strength 1 Genotype prior strength. PAR_demuxalot_doublet_prior 0.25 Doublet prior strength.","title":"Demuxalot"},{"location":"reference/#freemuxlet","text":"The following parameters are adjustable for Freemuxlet: Parameter Default Description PAR_freemuxlet_nsample NULL Number of pooled samples. NOTE : We are currently working on expanding the execution parameters for Freemuxlet.","title":"Freemuxlet"},{"location":"reference/#vireo_1","text":"The following parameters are adjustable for Vireo: Parameter Default Description PAR_vireo_N NULL Number of pooled samples. PAR_vireo_processes 20 Number of subprocesses for computing. PAR_vireo_minMAF 0.1 Minimum minor allele frequency. PAR_vireo_minCOUNT 20 Minimum aggregated count. NOTE : We are currently working on expanding the execution parameters for Vireo.","title":"Vireo"},{"location":"reference/#souporcell_1","text":"The following parameters are adjustable for Souporcell: Parameter Default Description PAR_minimap2 -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no For information regarding the minimap2 parameters, please see the documentation . PAR_freebayes -iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 For information regarding the freebayes parameters, please see the documentation . PAR_vartrix_umi TRUE Whether or not to consider UMI information when populating coverage matrices. PAR_vartrix_mapq 30 Minimum read mapping quality. PAR_vartrix_threads 8 Number of threads for computing. PAR_souporcell_k NULL Number of pooled samples. PAR_souporcell_t 8 Number of threads for computing. 
NOTE : We are currently working on expanding the execution parameters for Souporcell.","title":"Souporcell"},{"location":"reference/#ensemblex","text":"The following parameters are adjustable for the Ensemblex algorithm: Parameter Default Description Pool parameters PAR_ensemblex_sample_size NULL Number of samples multiplexed in the pool. PAR_ensemblex_expected_doublet_rate NULL Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation . Set up parameters PAR_ensemblex_merge_constituents Yes Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated merged file. Step 1 parameters: Probabilistic-weighted ensemble PAR_ensemblex_probabilistic_weighted_ensemble Yes Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to \"Yes\". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to \"No\" as the pipeline will automatically detect the previously generated Step 1 output file. Step 2 parameters: Graph-based doublet detection PAR_ensemblex_preliminary_parameter_sweep No Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighbour frequency (pT), which can be defined in the following two parameters, respectively. PAR_ensemblex_nCD NULL Manually defined number of confident doublets in the pool (nCD). 
Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_pT NULL Manually defined percentile threshold of the nearest neighbour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to \"Yes\". PAR_ensemblex_graph_based_doublet_detection Yes Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values. Step 3 parameters: Ensemble-independent doublet detection PAR_ensemblex_preliminary_ensemble_independent_doublet No Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters. PAR_ensemblex_ensemble_independent_doublet Yes Whether or not to perform Step 3: Ensemble-independent doublet detection. PAR_ensemblex_doublet_Demuxalot_threshold Yes Whether or not to label doublets identified by Demuxalot as doublets. Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxalot_no_threshold No Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Demuxlet_threshold No Whether or not to label doublets identified by Demuxlet as doublets. 
Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Demuxlet_no_threshold No Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Souporcell_threshold No Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Souporcell_no_threshold No Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability. PAR_ensemblex_doublet_Vireo_threshold Yes Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex. PAR_ensemblex_doublet_Vireo_no_threshold No Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability. Confidence score parameters PAR_ensemblex_compute_singlet_confidence Yes Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses.","title":"Ensemblex"}]} \ No newline at end of file diff --git a/site/search/worker.js b/site/search/worker.js new file mode 100644 index 0000000..8628dbc --- /dev/null +++ b/site/search/worker.js @@ -0,0 +1,133 @@ +var base_path = 'function' === typeof importScripts ? '.' 
: '/search/'; +var allowSearch = false; +var index; +var documents = {}; +var lang = ['en']; +var data; + +function getScript(script, callback) { + console.log('Loading script: ' + script); + $.getScript(base_path + script).done(function () { + callback(); + }).fail(function (jqxhr, settings, exception) { + console.log('Error: ' + exception); + }); +} + +function getScriptsInOrder(scripts, callback) { + if (scripts.length === 0) { + callback(); + return; + } + getScript(scripts[0], function() { + getScriptsInOrder(scripts.slice(1), callback); + }); +} + +function loadScripts(urls, callback) { + if( 'function' === typeof importScripts ) { + importScripts.apply(null, urls); + callback(); + } else { + getScriptsInOrder(urls, callback); + } +} + +function onJSONLoaded () { + data = JSON.parse(this.responseText); + var scriptsToLoad = ['lunr.js']; + if (data.config && data.config.lang && data.config.lang.length) { + lang = data.config.lang; + } + if (lang.length > 1 || lang[0] !== "en") { + scriptsToLoad.push('lunr.stemmer.support.js'); + if (lang.length > 1) { + scriptsToLoad.push('lunr.multi.js'); + } + if (lang.includes("ja") || lang.includes("jp")) { + scriptsToLoad.push('tinyseg.js'); + } + for (var i=0; i < lang.length; i++) { + if (lang[i] != 'en') { + scriptsToLoad.push(['lunr', lang[i], 'js'].join('.')); + } + } + } + loadScripts(scriptsToLoad, onScriptsLoaded); +} + +function onScriptsLoaded () { + console.log('All search scripts loaded, building Lunr index...'); + if (data.config && data.config.separator && data.config.separator.length) { + lunr.tokenizer.separator = new RegExp(data.config.separator); + } + + if (data.index) { + index = lunr.Index.load(data.index); + data.docs.forEach(function (doc) { + documents[doc.location] = doc; + }); + console.log('Lunr pre-built index loaded, search ready'); + } else { + index = lunr(function () { + if (lang.length === 1 && lang[0] !== "en" && lunr[lang[0]]) { + this.use(lunr[lang[0]]); + } else if (lang.length > 1) { 
+ this.use(lunr.multiLanguage.apply(null, lang)); // spread operator not supported in all browsers: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_operator#Browser_compatibility + } + this.field('title'); + this.field('text'); + this.ref('location'); + + for (var i=0; i < data.docs.length; i++) { + var doc = data.docs[i]; + this.add(doc); + documents[doc.location] = doc; + } + }); + console.log('Lunr index built, search ready'); + } + allowSearch = true; + postMessage({config: data.config}); + postMessage({allowSearch: allowSearch}); +} + +function init () { + var oReq = new XMLHttpRequest(); + oReq.addEventListener("load", onJSONLoaded); + var index_path = base_path + '/search_index.json'; + if( 'function' === typeof importScripts ){ + index_path = 'search_index.json'; + } + oReq.open("GET", index_path); + oReq.send(); +} + +function search (query) { + if (!allowSearch) { + console.error('Assets for search still loading'); + return; + } + + var resultDocuments = []; + var results = index.search(query); + for (var i=0; i < results.length; i++){ + var result = results[i]; + var doc = documents[result.ref]; + doc.summary = doc.text.substring(0, 200); + resultDocuments.push(doc); + } + return resultDocuments; +} + +if( 'function' === typeof importScripts ) { + onmessage = function (e) { + if (e.data.init) { + init(); + } else if (e.data.query) { + postMessage({ results: search(e.data.query) }); + } else { + console.error("Worker - Unrecognized message: " + e); + } + }; +} diff --git a/site/sitemap.xml b/site/sitemap.xml new file mode 100644 index 0000000..32db3fe --- /dev/null +++ b/site/sitemap.xml @@ -0,0 +1,83 @@ + +