IGVF Single-Cell Pipeline Setup and Execution Documentation

The default configuration for the IGVF single-cell pipeline operates in a Docker-based Terra cloud environment. However, Docker is not always available on high-performance computing (HPC) environments. To accommodate this, Singularity can be used as an alternative. This documentation outlines the setup for running the pipeline on a single host using Singularity. As of now, the Cromwell configuration for specific queuing systems has not been finalized, and individual experiments are being executed on a standalone machine.

Define some variables

root_dir                  /home/user/proj
single_cell_pipeline_git  https://github.com/IGVF/single-cell-pipeline
diane_conda_url           https://woldlab.caltech.edu/~diane/conda/
conda_command             /home/user/bin/micromamba
environment               the IGVF single-cell pipeline

These variables can be accessed using ${defaults['variable_name']} in the shell scripts provided.
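One way to make these lookups work in a plain Bash session is to declare the variables as an associative array. The following is a minimal sketch, assuming Bash 4 or newer; the environment name igvf-single-cell is a hypothetical placeholder, so substitute your own.

```bash
#!/usr/bin/env bash
# Sketch: declare the document's variables as a Bash associative array
# (requires Bash 4+). Adjust the values for your own site.
declare -A defaults=(
  [root_dir]="/home/user/proj"
  [single_cell_pipeline_git]="https://github.com/IGVF/single-cell-pipeline"
  [diane_conda_url]="https://woldlab.caltech.edu/~diane/conda/"
  [conda_command]="/home/user/bin/micromamba"
  [environment]="igvf-single-cell"   # hypothetical environment name
)

# Look up a value the same way the later snippets do.
echo "Pipeline will be cloned under ${defaults['root_dir']}"
```

With the array in place, the later commands that reference ${defaults['conda_command']} and similar keys should expand as written.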

Set up the conda environment

It is crucial to use a recent version of Cromwell, as older versions available on Conda may not be compatible with the pipeline’s requirements. Version 40, for example, is not supported.
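Once Cromwell is installed, a small guard can catch an outdated version before a long run starts. The sketch below assumes that `cromwell --version` prints a string containing the major version as its first number (for example "cromwell 86"), which may not hold for every build; the minimum of 86 is taken from the install command in this document, and the function name is hypothetical.

```bash
# Sketch: succeed only if the first integer found in a version string
# is at least the required minimum.
version_at_least() {
  local version_text=$1 minimum=$2 major
  # Pull out the first run of digits, e.g. "86" from "cromwell 86".
  major=$(printf '%s\n' "$version_text" | grep -oE '[0-9]+' | head -n 1)
  [ -n "$major" ] && [ "$major" -ge "$minimum" ]
}

# Example (not run here):
#   version_at_least "$(cromwell --version)" 86 || echo "Cromwell is too old" >&2
```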

The following block installs most of the necessary dependencies for the IGVF single-cell pipeline into a Conda environment. Micromamba is recommended because it installs faster, though it offers fewer features than Conda or Mamba.

Make sure to replace ${defaults['conda_command']} and ${defaults['environment']} with your own settings.

${defaults['conda_command']} create -q \
           -n ${defaults['environment']} \
           -c conda-forge -c bioconda \
           'cromwell>=86' \
           singularity \
           gsutil \
           'synapseclient>=4' >/dev/null &

Once that process finishes, you can use the following run commands to make sure the main components were installed.

${defaults['conda_command']} run -n ${defaults['environment']} womtool --version
${defaults['conda_command']} run -n ${defaults['environment']} cromwell --version
${defaults['conda_command']} run -n ${defaults['environment']} singularity --version
${defaults['conda_command']} run -n ${defaults['environment']} gsutil --version
${defaults['conda_command']} run -n ${defaults['environment']} synapse --version
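The same check can be collapsed into one pass that reports only what is missing. This is a minimal sketch, assuming the environment has already been activated in the current shell so that the tools are on PATH; the function name missing_tools is hypothetical.

```bash
# Sketch: print each command that is not on PATH and return non-zero
# if at least one is missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Example, run inside the activated environment (not run here):
#   missing_tools womtool cromwell singularity gsutil synapse
```

This only confirms that each command resolves; it does not confirm that the command runs correctly, which the --version calls above do.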

Check out the IGVF single-cell pipeline repository. As of Sep 2024, this is for release v1.

pushd ${defaults['root_dir']}
git clone ${defaults['single_cell_pipeline_git']} -b updated-qcs
popd

Configuring Cromwell

To use Singularity instead of Docker with Cromwell, the following configuration is required. The example below is for running the pipeline on a single machine. When launching the pipeline, pass the -Dconfig.file=cromwell.conf argument to Cromwell to use this configuration. Additional adjustments are needed to integrate with a local queuing system.

This configuration is based on Cromwell's Containers documentation. Note that Synapse requires access to a configuration file and a cache directory to function properly, which is why both are bound into the container. Replace each ${HOME} in the configuration with an absolute path to avoid issues with Singularity.

include required(classpath("application"))

backend {
    default: singularity
    providers: {
        singularity {
            # The backend custom configuration.
            actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"

            config {
                run-in-background = true
                runtime-attributes = """
                  String? docker
                """
                submit-docker = """
                  singularity exec --containall --bind ${HOME}/.synapseConfig:${HOME}/.synapseConfig --bind ${HOME}/.synapseCache:${HOME}/.synapseCache --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}
                """
            }
        }
    }
}
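The ${HOME} replacement can be done by hand or scripted. The following is a minimal sketch using sed, assuming the configuration is saved as cromwell.conf and that your home directory path contains no "|" or "&" characters; the function name is hypothetical.

```bash
# Sketch: rewrite every literal "${HOME}" in a config file to the caller's
# real home directory, leaving a ".bak" copy of the original file.
expand_home_in_config() {
  local conf=$1
  # "[$]" matches a literal dollar sign; "|" is the sed delimiter so the
  # slashes in $HOME need no escaping.
  sed -i.bak "s|[\$]{HOME}|${HOME}|g" "$conf"
}

# Example (not run here):
#   expand_home_in_config cromwell.conf
```

Other placeholders such as ${cwd} and ${docker} are left untouched, since Cromwell fills those in itself.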

Running the IGVF Single-Cell Pipeline Locally

When running outside of Terra, the inputs .json file for Cromwell must be written manually. Below is an example .json configuration for a customized 10x multiome experiment.

{
  "single_cell_pipeline.prefix": "Team_6-GM12878-10XMultiome",
  "single_cell_pipeline.chemistry": "10x_multiome",
  "single_cell_pipeline.genome_fasta": "https://api.data.igvf.org/reference-files/IGVFFI0653VCGH/@@download/IGVFFI0653VCGH.fasta.gz",
  "single_cell_pipeline.genome_tsv": "IGVF_human_v43_Homo_sapiens_genome_files_hg38_v43.tsv",
  "single_cell_pipeline.gtf": "https://api.data.igvf.org/reference-files/IGVFFI7217ZMJZ/@@download/IGVFFI7217ZMJZ.gtf.gz",
  "single_cell_pipeline.atac_barcode_offset": 0,
  "single_cell_pipeline.fastq_barcode": [
    "syn61457432",
    "syn61457437",
    "syn61457449",
    "syn61457459"
  ],
  "single_cell_pipeline.read1_atac": [
    "syn61457431",
    "syn61457436",
    "syn61457448",
    "syn61457458"
  ],
  "single_cell_pipeline.read2_atac": [
    "syn61457434",
    "syn61457438",
    "syn61457454",
    "syn61457460"
  ],
  "single_cell_pipeline.read1_rna": [
    "syn61457461",
    "syn61457463",
    "syn61457465",
    "syn61457469"
  ],
  "single_cell_pipeline.read2_rna": [
    "syn61457462",
    "syn61457464",
    "syn61457468",
    "syn61457476"
  ],
  "single_cell_pipeline.seqspecs": [
    "https://raw.githubusercontent.com/detrout/y2ave_seqspecs/main/Team_6_GM12878_10XMultiome-L001_seqspec.yaml"
  ],
  "single_cell_pipeline.whitelist_atac": [
    "737K-arc-v1_ATAC.txt.gz"
  ],
  "single_cell_pipeline.whitelist_rna": [
    "737K-arc-v1_GEX.txt.gz"
  ],
  "single_cell_pipeline.whitelists_tsv": "gs://broad-buenrostro-pipeline-genome-annotations/whitelists/whitelists.tsv",
  "single_cell_pipeline.check_read1_rna.disk_factor": 1,
  "single_cell_pipeline.check_read2_rna.disk_factor": 1,
  "single_cell_pipeline.check_read1_atac.disk_factor": 1,
  "single_cell_pipeline.check_read2_atac.disk_factor": 1,
  "single_cell_pipeline.check_fastq_barcode.disk_factor": 1
}
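Because this file is written by hand, it can be worth checking it against the workflow before launching. The sketch below wraps womtool validate with its --inputs option; the WOMTOOL variable is a hypothetical override added here for illustration, and the file names in the example are the ones used later in this document.

```bash
# Sketch: confirm both files are readable, then ask womtool to check the
# inputs JSON against the workflow's declared inputs.
# WOMTOOL is a hypothetical override so a different executable can be used.
validate_inputs() {
  local wdl=$1 inputs=$2 path
  for path in "$wdl" "$inputs"; do
    if [ ! -r "$path" ]; then
      echo "cannot read: $path" >&2
      return 2
    fi
  done
  "${WOMTOOL:-womtool}" validate "$wdl" --inputs "$inputs"
}

# Example (not run here):
#   validate_inputs ../single-cell-pipeline/single_cell_pipeline.wdl tiny-13a.json
```

As far as we know, this checks input names and types only; it does not confirm that the Synapse IDs or URLs can actually be fetched.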

Running Cromwell

To execute the pipeline, ensure the script and configuration files are properly referenced, including the Cromwell configuration file (cromwell.conf), the WDL script, and the generated JSON file.

PATH=/woldlab/loxcyc/home/user/proj/single-cell-pipeline/src/bash/:$PATH \
  cromwell -Dconfig.file=cromwell.conf run \
  ../single-cell-pipeline/single_cell_pipeline.wdl \
  -i tiny-13a.json

Finding results with Cromwell

Cromwell stores results in a nested directory structure under cromwell-executions/. To locate error or output logs, navigate to the appropriate subdirectory based on the workflow step. The example below shows a partial directory tree from a failed run:

  • cromwell-executions
    • single_cell_pipeline
      • ${random_uuid}
        • call-atac
        • call-barcode_mapping
        • call-check_fastq_barcode
        • call-check_read1_atac
        • call-check_read1_rna
          • shard-0
            • execution
              • check_inputs_monitor.log
              • files
              • glob-aae8b15f635ae9fc31e845b03c8537e4
              • glob-aae8b15f635ae9fc31e845b03c8537e4.list
              • rc
              • script
              • script.background
              • script.submit
              • stderr
              • stderr.background
              • stdout
              • stdout.background
            • tmp.${suffix}
          • shard-1
          • shard-2
          • shard-3
        • call-check_read2_atac
        • call-check_read2_rna
        • call-check_seqspec

To find specific files, use the following command: find cromwell-executions -name ${filename}
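When a run fails, the rc file in each execution directory is a quick way to narrow down which call to inspect. The following is a minimal sketch, assuming each rc file holds the exit code of its task; the function name failed_calls is hypothetical.

```bash
# Sketch: list execution directories whose "rc" file holds a non-zero
# exit code, so the matching stderr files can be inspected.
failed_calls() {
  local root=${1:-cromwell-executions} rc_file code
  find "$root" -type f -name rc | sort | while IFS= read -r rc_file; do
    # Strip the trailing newline and any stray whitespace from the code.
    code=$(tr -d '[:space:]' < "$rc_file")
    if [ "$code" != "0" ]; then
      echo "$(dirname "$rc_file") (exit code ${code:-unknown})"
    fi
  done
}

# Example (not run here):
#   failed_calls cromwell-executions
```

Each directory it prints should contain the stderr and stdout files shown in the tree above. Calls that never started have no rc file and will not be listed.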

This directory structure can take some trial and error to navigate. For easier results management, consider using tools like Caper. However, it is also possible to manage this manually as demonstrated above.

We extend our gratitude to @detrout for their contribution in providing the initial draft of this document.