Running SPEAQeasy

On JHPCE

For LIBD users, I installed a customized instance of the SPEAQeasy pipeline in this directory:

/dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy

This instance was pre-configured to use Gencode release 25 with hg38, main chromosomes only. The run_g25m.sh template script found there should be copied into the working directory where the samples.manifest file resides. Detailed instructions are given below.

These instructions assume you already have a convenient way to log on to a JHPCE login node.

One time setup before the first run

This configuration should only be performed once, before the first run of this version of SPEAQeasy.

Skip to the section Running SPEAQeasy on JHPCE on your data below if you have already performed these one-time setup steps.

In order to begin using this version of the pipeline, one has to first run the jhpce_setup.sh script found there, in an interactive shell session.

Here are the steps:

  1. Get an interactive session on a JHPCE compute node. Unless you already have a shortcut for this, like the one described here, simply type this command on a JHPCE login node:

srun --pty --mem=8G --cpus-per-task=2 -p interactive bash

  2. After being allocated an interactive session, execute this command: /dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/jhpce_setup.sh

After that, exit the interactive session (exit command) and follow the steps below.

Running SPEAQeasy on JHPCE on your data

1. Preparing input data and a working directory

The default mode of operation for the run_g25m.sh template script is to use the current working directory as both the input directory (i.e. a samples.manifest file must be placed there, pointing to the actual location of the FASTQ files) and the output base directory: the final output of the pipeline is written to a results sub-directory created there. A wrk sub-directory is also created; it is only used for temporary/intermediate data files and can be deleted after the pipeline has finished running successfully.
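For orientation, here is what the working directory might look like after a successful run (a sketch only; the actual contents of results/ will vary):

spqz_wrk1/
  run_g25m.sh        # copy of the template script
  samples.manifest   # list of input FASTQ files
  results/           # final pipeline output
  wrk/               # temporary/intermediate files, safe to delete after a successful run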

On JHPCE such a working directory should be in a partition where you have plenty of storage available, and it should not be under your home directory.

Here are the steps, which can be performed on a JHPCE login or transfer node:

  • create/choose a working directory (which will have the results subdirectory) and copy the template script in there:
cd /dcs0x/mywork_area/on_jhpce/ #go to a suitable work area
mkdir spqz_wrk1 # create the new working directory with a suitable name
cd spqz_wrk1
cp /dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/run_g25m.sh .
  • prepare the samples.manifest file in this new working directory. The samples.manifest file should list all the input fastq files in a specific format, using their absolute paths. For paired reads I have a little perl script that creates such a file for all fastq.gz files found in multiple directories under a directory tree. This script can be found here:
/dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/ls2manifest.pl

For convenience on repeated use, simply copy that script somewhere in your $PATH directories; otherwise, in the examples below, you will have to invoke it with its full path (shown after the example commands).

The way to use this script with a set of fastq.gz files that may be spread across multiple sub-directories:

  • cd into the parent/base directory containing the folders with fastq.gz files to be processed
  • check that ls */*.fastq.gz (or similar) shows the samples as expected
  • run the same ls command and pipe its output into the script:
wrkdir=`pwd -P` # save the current working directory
cd /base/path/to/FASTQ-containing-directories # go to your FASTQ base path
ls `pwd -P`/*/*.fastq.gz | ls2manifest.pl > $wrkdir/samples.manifest
cd $wrkdir
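
If you did not copy ls2manifest.pl into a $PATH directory, invoke it with its full path instead:

ls `pwd -P`/*/*.fastq.gz | /dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/ls2manifest.pl > $wrkdir/samples.manifest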

Note that the script expects a certain file naming scheme so it might fail if that naming scheme is not respected.
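
For reference, the resulting samples.manifest is tab-delimited, one sample per line; for paired-end reads each line holds the R1 path, an optional MD5 checksum (0 when omitted), the R2 path, its checksum, and the sample ID. The paths and sample IDs below are made up for illustration (columns are separated by tabs):

/data/myproj/S1/S1_R1.fastq.gz  0  /data/myproj/S1/S1_R2.fastq.gz  0  S1
/data/myproj/S2/S2_R1.fastq.gz  0  /data/myproj/S2/S2_R2.fastq.gz  0  S2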

2. Run SPEAQeasy in the working directory

On a JHPCE login node, in the working directory created above, edit the run_g25m.sh file you just copied (you can use nano for this in your terminal) to modify the values of the variables STRAND, STYPE, and possibly REF at the top of the file. Optionally, you can also change the job name in the #SBATCH --job-name= line at the top of the script; a sketch of such an edit follows.
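
As an illustration only (the values below are hypothetical; check the comments in your copy of the script for the accepted values), the top of run_g25m.sh might look like this after editing:

#!/bin/bash
#SBATCH --job-name=spqz_myrun
STRAND=reverse   # library strandedness of the samples
STYPE=paired     # single or paired reads
REF=hg38         # reference pre-configured for this instance (Gencode 25, main chromosomes)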

Then you can submit the script to run the pipeline (assuming you're on a JHPCE login node) based on the samples.manifest file in that directory:

sbatch run_g25m.sh
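
You can then monitor the submitted job with the usual Slurm commands, for example:

squeue -u $USER      # check whether the pipeline job is still queued/running
sacct -j <jobid> -X  # state and exit code of the job after it finishes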

If the job fails at some point (due to a cluster failure, perhaps), you can try resuming the pipeline by resubmitting the script with the same sbatch command shown above (the RESUME option is set to 1 by default in the template script).

Running on srv07 (in progress, being updated)

A working local installation in /opt/sw/spqz uses the software dependencies installed in /opt/sw/bin/.

  • pick a working directory and make a copy of the main script: cp /opt/sw/spqz/run.sh .
  • the output directory (results) is going to be created in this working directory
  • place the samples.manifest file in this working directory
  • edit the local copy of run.sh and verify/update the values of these variables as needed: STRAND, STYPE, RESUME, REF (and BG if needed)
  • start the pipeline with ./run.sh & (see the sketch below for keeping it running after logout)
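
If you want the pipeline to keep running after you log out of srv07, a common alternative to the plain background launch above is (a sketch; the log file name is arbitrary):

nohup ./run.sh > run.log 2>&1 &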

Also note that Nextflow will run as many samples in parallel as the available hardware resources allow. For local runs the only hard requirement is a lower limit of 8 CPUs; beyond that, the pipeline will process many samples in parallel by default and may use all the cores of the Linux machine/server, because the local scheduler built into Nextflow is very basic and cannot really cap the number of CPU cores used for such runs.

Brain swap check and adding phenotype data

This procedure is documented here: http://research.libd.org/SPEAQeasy-example/swap_speaqeasy.html (adding phenotype data is covered at the bottom of that page). In short, it amounts to updating the colData of the RSE object, after properly swapping the SAMPLE_IDs if needed.