The default configuration for the IGVF single-cell pipeline operates in a Docker-based Terra cloud environment. However, Docker is not always available on high-performance computing (HPC) environments. To accommodate this, Singularity can be used as an alternative. This documentation outlines the setup for running the pipeline on a single host using Singularity. As of now, the Cromwell configuration for specific queuing systems has not been finalized, and individual experiments are being executed on a standalone machine.
The shell snippets in this document refer to the following default variables:
| root_dir | /home/user/proj |
| single_cell_pipeline_git | https://github.com/IGVF/single-cell-pipeline |
| diane_conda_url | https://woldlab.caltech.edu/~diane/conda/ |
| conda_command | /home/user/bin/micromamba |
| environment | the IGVF single-cell pipeline |
These variables can be accessed using ${defaults['variable_name']} in the shell scripts provided.
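The source does not show how the defaults lookup is populated. As a minimal sketch, assuming the scripts run under Bash 4 or newer and that defaults is an ordinary associative array, it could be declared like this (the values are copied from the table above):

```shell
#!/usr/bin/env bash
# Hypothetical setup: the original document does not show this declaration.
# Associative arrays require Bash 4 or newer.
declare -A defaults=(
  [root_dir]="/home/user/proj"
  [single_cell_pipeline_git]="https://github.com/IGVF/single-cell-pipeline"
  [diane_conda_url]="https://woldlab.caltech.edu/~diane/conda/"
  [conda_command]="/home/user/bin/micromamba"
  [environment]="the IGVF single-cell pipeline"
)

# Quote the expansions: the example environment value contains spaces.
echo "root: ${defaults['root_dir']}"
echo "env:  ${defaults['environment']}"
```

If you prefer plain variables, substituting the literal values directly into the commands works just as well.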
It is crucial to use a recent version of Cromwell, as older versions available on Conda may not be compatible with the pipeline’s requirements. Version 40, for example, is not supported.
The following block installs most of the necessary dependencies for the IGVF single-cell pipeline into a Conda environment. Micromamba is recommended for faster installation, though it has fewer features than Conda or Mamba.
Make sure to replace ${defaults['conda_command']} and ${defaults['environment']} with your own settings.
${defaults['conda_command']} create -q \
-n ${defaults['environment']} \
-c conda-forge -c bioconda \
'cromwell>=86' \
singularity \
gsutil \
'synapseclient>=4' >/dev/null &
Once that process finishes, you can use the following run commands to make sure the main components were installed.
${defaults['conda_command']} run -n ${defaults['environment']} womtool --version
${defaults['conda_command']} run -n ${defaults['environment']} cromwell --version
${defaults['conda_command']} run -n ${defaults['environment']} singularity --version
${defaults['conda_command']} run -n ${defaults['environment']} gsutil --version
${defaults['conda_command']} run -n ${defaults['environment']} synapse --version
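The same check can be done in one pass. The sketch below is an illustration rather than part of the pipeline: check_tools is a hypothetical helper that reports which commands are missing from PATH, and it assumes you run it from a shell where the environment is already active.

```shell
# check_tools NAME...
# Reports each command that cannot be found on PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  tools_missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      tools_missing=1
    fi
  done
  return "$tools_missing"
}
```

For example, check_tools womtool cromwell singularity gsutil synapse prints one line per missing tool and returns non-zero if any is absent. It only confirms that the commands exist; the --version calls above remain the way to confirm which release was installed.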
Check out the IGVF single-cell pipeline repository. As of Sep 2024, this is for release v1.
pushd ${defaults['root_dir']}
git clone ${defaults['single_cell_pipeline_git']} -b updated-qcs
popd
To use Singularity instead of Docker with Cromwell, the following configuration is required. The example below is for running the pipeline on a single machine. When launching the pipeline, pass the -Dconfig.file=cromwell.conf argument to Cromwell to use this configuration. Additional adjustments are needed to integrate with a local queuing system.
This configuration is based on the Cromwell Containers documentation. Note that the Synapse client needs access to its configuration file and cache directory, which is why both are bind-mounted into the container. Replace the ${HOME} paths in the configuration with absolute paths to avoid issues with Singularity.
include required(classpath("application"))
backend {
  default: singularity
  providers: {
    singularity {
      # The backend custom configuration.
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        run-in-background = true
        runtime-attributes = """
          String? docker
        """
        submit-docker = """
          singularity exec --containall --bind ${HOME}/.synapseConfig:${HOME}/.synapseConfig --bind ${HOME}/.synapseCache:${HOME}/.synapseCache --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}
        """
      }
    }
  }
}
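Replacing the ${HOME} placeholders by hand is easy to get wrong, so one option is to keep the configuration above as a template and generate the real file from it. The helper below is a sketch under that assumption: expand_home_in_conf and the cromwell.conf.in template name are hypothetical and not part of the pipeline.

```shell
# expand_home_in_conf SRC DEST
# Copies SRC to DEST, replacing every literal ${HOME} with the value of $HOME.
# Other placeholders such as ${cwd} and ${docker} are left for Cromwell to fill in.
# Assumes $HOME contains no '|' or '&', which are special in this sed expression.
expand_home_in_conf() {
  sed 's|[$]{HOME}|'"$HOME"'|g' "$1" > "$2"
}
```

For example, expand_home_in_conf cromwell.conf.in cromwell.conf writes a configuration in which both --bind arguments for Synapse use absolute paths.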
When running outside of Terra, the input .json file for Cromwell must be created manually. Below is an example input file for a customized 10x multiome experiment.
{
"single_cell_pipeline.prefix": "Team_6-GM12878-10XMultiome",
"single_cell_pipeline.chemistry": "10x_multiome",
"single_cell_pipeline.genome_fasta": "https://api.data.igvf.org/reference-files/IGVFFI0653VCGH/@@download/IGVFFI0653VCGH.fasta.gz",
"single_cell_pipeline.genome_tsv": "IGVF_human_v43_Homo_sapiens_genome_files_hg38_v43.tsv",
"single_cell_pipeline.gtf": "https://api.data.igvf.org/reference-files/IGVFFI7217ZMJZ/@@download/IGVFFI7217ZMJZ.gtf.gz",
"single_cell_pipeline.atac_barcode_offset": 0,
"single_cell_pipeline.fastq_barcode": [
"syn61457432",
"syn61457437",
"syn61457449",
"syn61457459"
],
"single_cell_pipeline.read1_atac": [
"syn61457431",
"syn61457436",
"syn61457448",
"syn61457458"
],
"single_cell_pipeline.read2_atac": [
"syn61457434",
"syn61457438",
"syn61457454",
"syn61457460"
],
"single_cell_pipeline.read1_rna": [
"syn61457461",
"syn61457463",
"syn61457465",
"syn61457469"
],
"single_cell_pipeline.read2_rna": [
"syn61457462",
"syn61457464",
"syn61457468",
"syn61457476"
],
"single_cell_pipeline.seqspecs": [
"https://raw.githubusercontent.com/detrout/y2ave_seqspecs/main/Team_6_GM12878_10XMultiome-L001_seqspec.yaml"
],
"single_cell_pipeline.whitelist_atac": [
"737K-arc-v1_ATAC.txt.gz"
],
"single_cell_pipeline.whitelist_rna": [
"737K-arc-v1_GEX.txt.gz"
],
"single_cell_pipeline.whitelists_tsv": "gs://broad-buenrostro-pipeline-genome-annotations/whitelists/whitelists.tsv",
"single_cell_pipeline.check_read1_rna.disk_factor": 1,
"single_cell_pipeline.check_read2_rna.disk_factor": 1,
"single_cell_pipeline.check_read1_atac.disk_factor": 1,
"single_cell_pipeline.check_read2_atac.disk_factor": 1,
"single_cell_pipeline.check_fastq_barcode.disk_factor": 1
}
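Because the input file is written by hand, copy-and-paste mistakes in the long lists of Synapse IDs are easy to make. The sketch below is a simple sanity check, not something the pipeline provides: duplicate_synapse_ids is a hypothetical helper, and it assumes each FASTQ should be listed exactly once across all lists, as in the example above.

```shell
# duplicate_synapse_ids FILE
# Prints each Synapse ID (syn followed by digits) that appears more than once in FILE.
# Prints nothing when every ID is unique.
duplicate_synapse_ids() {
  grep -oE 'syn[0-9]+' "$1" | sort | uniq -d
}
```

Running duplicate_synapse_ids tiny-13a.json before launching should print nothing. Any ID it does print appears in more than one place and is worth checking against the Synapse project.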
To execute the pipeline, ensure the script and configuration files are properly referenced, including the Cromwell configuration file (cromwell.conf), the WDL script, and the generated JSON file.
PATH=/woldlab/loxcyc/home/user/proj/single-cell-pipeline/src/bash/:$PATH \
cromwell -Dconfig.file=cromwell.conf run \
../single-cell-pipeline/single_cell_pipeline.wdl \
-i tiny-13a.json
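A mistyped path to any of the three files only surfaces after Cromwell has started, so it can help to check them first. The helper below is a small sketch for that; require_files is a hypothetical name and not part of the pipeline.

```shell
# require_files FILE...
# Reports each file that is missing or unreadable.
# Returns 0 when all files are readable, 1 otherwise.
require_files() {
  files_missing=0
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "cannot read: $f" >&2
      files_missing=1
    fi
  done
  return "$files_missing"
}
```

Prefixing the launch with require_files cromwell.conf ../single-cell-pipeline/single_cell_pipeline.wdl tiny-13a.json && ensures Cromwell starts only when all three files are in place.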
Cromwell stores results in a nested directory structure under cromwell-executions/. To locate error or output logs, navigate to the appropriate subdirectory based on the workflow step. The example below shows a partial directory tree from a failed run:
- cromwell-executions
  - single_cell_pipeline
    - ${random_uuid}
      - call-atac
      - call-barcode_mapping
      - call-check_fastq_barcode
      - call-check_read1_atac
      - call-check_read1_rna
        - shard-0
          - execution
            - check_inputs_monitor.log
            - files
            - glob-aae8b15f635ae9fc31e845b03c8537e4
            - glob-aae8b15f635ae9fc31e845b03c8537e4.list
            - rc
            - script
            - script.background
            - script.submit
            - stderr
            - stderr.background
            - stdout
            - stdout.background
            - tmp.${suffix}
        - shard-1
        - shard-2
        - shard-3
      - call-check_read2_atac
      - call-check_read2_rna
      - call-check_seqspec
To find specific files, use the following command:
find cromwell-executions -name "${filename}"
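Each execution directory contains an rc file holding the return code of that call, so the same find approach can point directly at the steps that failed. The function below is a sketch of that idea; failed_calls is a hypothetical helper, not part of Cromwell or the pipeline.

```shell
# failed_calls DIR
# Prints "<return code> <path to rc file>" for every rc file under DIR
# whose content is not 0.
failed_calls() {
  find "$1" -type f -name rc | sort | while IFS= read -r rc_file; do
    code=$(tr -d '[:space:]' < "$rc_file")
    if [ "$code" != "0" ]; then
      echo "$code $rc_file"
    fi
  done
}
```

Running failed_calls cromwell-executions lists the failing calls. The stderr and stdout files in the same directory as each reported rc file are usually the next place to look.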
Finding the right file in this structure can take some trial and error. For easier results management, consider using a tool such as Caper, although the manual approach shown above also works.
We extend our gratitude to @detrout for their contribution in providing the initial draft of this document.