Running SPEAQeasy
For LIBD users, I installed a customized instance of the SPEAQeasy pipeline in this directory:
/dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy
This was pre-configured to use Gencode 25 with hg38, main chromosomes only. The `run_g25m.sh` template script found there should be copied into the working directory where the `samples.manifest` file resides. Detailed instructions are below.
These instructions assume you already have a convenient way to log in to a JHPCE login node.
This configuration should only be performed once, before the first run of this version of SPEAQeasy.
Skip to the next section below, about Running SPEAQeasy on a dataset, if you already performed these one-time setup steps.
In order to begin using this version of the pipeline, one first has to run the `jhpce_setup.sh` script found there, in an interactive shell session.
Here are the steps:
- get an interactive session on a JHPCE compute node. Unless you already have a shortcut for this, like the one described here, simply type this command on a JHPCE login node:
```bash
srun --pty --mem=8G --cpus-per-task=2 -p interactive bash
```
- after being allocated an interactive session, execute this command:

```bash
/dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/jhpce_setup.sh
```

After that, exit the interactive session (with the `exit` command) and follow the steps below.
Running SPEAQeasy on a dataset
The default mode of operation for the `run_g25m.sh` template script is to use the current working directory as both the input directory (i.e. a `samples.manifest` file must be placed there, pointing to the actual location of the FASTQ files) and the output base directory: the final results of the pipeline will be written to a `results` sub-directory created in there. A `wrk` sub-directory is also created, but it is only used for temporary/intermediate data files; it can be deleted after the pipeline has finished running successfully.
On JHPCE such a working directory should be in a partition where you have plenty of storage available, and it should not be under your home directory.
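After a successful run, the working directory will therefore look roughly like this (illustrative layout; only the names mentioned above are taken from this document):

```
spqz_wrk1/
├── run_g25m.sh        # copied template script
├── samples.manifest   # list of input FASTQ files
├── results/           # final pipeline output
└── wrk/               # intermediate files; can be deleted after a successful run
```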
Here are the steps, which can be performed on a JHPCE login or transfer node:
- create/choose a working directory (which will contain the `results` subdirectory) and copy the template script in there:

```bash
cd /dcs0x/mywork_area/on_jhpce/  # go to a suitable work area
mkdir spqz_wrk1                  # create the new working directory with a suitable name
cd spqz_wrk1
cp /dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/run_g25m.sh .
```
- prepare the `samples.manifest` file in this new working directory. The `samples.manifest` file should list all the input FASTQ files in a specific format, using their absolute paths. For paired reads I have a little perl script that creates such a file for all `fastq.gz` files found in multiple directories under a directory tree. This script can be found here: `/dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/ls2manifest.pl`
For your convenience on repeated use, simply copy that script somewhere in your `$PATH` directories; otherwise, in the examples below, you will have to call it with its full path.
To use this script with a bunch of `fastq.gz` files that may be spread across multiple sub-directories:
- cd into the parent/base directory containing the folders with `fastq.gz` files to be processed
- check that `ls */*.fastq.gz` (or similar) shows the samples as expected
- run the same ls command and pipe its output into the script:
```bash
wrkdir=`pwd -P`  # save the current working directory
cd /base/path/to/FASTQ-containing-directories  # go to your FASTQ base path
ls `pwd -P`/*/*.fastq.gz | ls2manifest.pl > $wrkdir/samples.manifest
cd $wrkdir
```
Note that the script expects a certain file naming scheme, so it might fail if that naming scheme is not respected.
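For reference, a paired-end `samples.manifest` follows a tab-delimited layout (as I understand SPEAQeasy's documented format): R1 path, an optional MD5 sum (0 if unused), R2 path, another optional MD5, and the sample ID. The paths and sample names below are made up for illustration:

```
/data/fastq/run1/Sample1_R1.fastq.gz	0	/data/fastq/run1/Sample1_R2.fastq.gz	0	Sample1
/data/fastq/run1/Sample2_R1.fastq.gz	0	/data/fastq/run1/Sample2_R2.fastq.gz	0	Sample2
```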
On a JHPCE login node, in the working directory created above, edit the `run_g25m.sh` file you just copied (you can use `nano` for this in your terminal) to modify the values of the variables `STRAND`, `STYPE`, and maybe `REF` at the top of the file. Optionally, you can also change the job name in the `#SBATCH --job-name=` line at the top of the script.
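The top of the template might look something like this sketch; the values shown are illustrative assumptions only (check the comments in your copy of `run_g25m.sh` for the values it actually accepts):

```bash
#SBATCH --job-name=spqz_run1   # optional: pick a recognizable job name

# Illustrative values only -- consult the template's own comments:
STRAND=reverse   # library strandness (e.g. forward / reverse / unstranded)
STYPE=paired     # read type (e.g. paired vs. single-end)
REF=hg38         # reference; this instance is pre-configured for Gencode 25 / hg38
```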
Then you can submit the script in order to run the pipeline (assuming you're on a JHPCE login node), based on the `samples.manifest` file in that directory:
```bash
sbatch run_g25m.sh
```
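Once submitted, you can keep an eye on the job with standard Slurm commands; the log file name below assumes Slurm's default output naming, which the template's `#SBATCH` settings may override:

```bash
squeue -u $USER      # check whether the pipeline job is queued or running
tail -f slurm-*.out  # follow the job log (default Slurm name; may differ)
```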
If the job fails at some point (due to a grid failure, perhaps), you can try resuming the pipeline by resubmitting the script with the same `sbatch` command shown above (the `RESUME` option is set to 1 by default in the template script).
Running SPEAQeasy locally
A working local installation is in `/opt/sw/spqz` and uses the various software dependencies installed in `/opt/sw/bin/`.
- pick a working directory and make a copy of the main script:

```bash
cp /opt/sw/spqz/run.sh .
```

- the output directory (`results`) is going to be created in this working directory
- place the `samples.manifest` file in this working directory
- edit the local copy of `run.sh` and verify/update the values of these variables as needed: `STRAND`, `STYPE`, `RESUME`, `REF` (and `BG` if needed)
- start the pipeline with `./run.sh &` (see the consolidated sketch after this list)
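Putting these steps together, a local run might look like the sketch below; the `nohup` and log redirection are optional additions (my suggestion, not part of the template) so the pipeline keeps running if the terminal is closed:

```bash
mkdir spqz_local1 && cd spqz_local1
cp /opt/sw/spqz/run.sh .
# place/generate samples.manifest here, then edit STRAND/STYPE/RESUME/REF in run.sh
nohup ./run.sh > run.log 2>&1 &   # run in the background, logging to run.log
```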
Also note that nextflow will run as many samples in parallel as it can given the available hardware resources. For local runs there is only a lower limit of 8 CPUs; otherwise the pipeline will process many samples in parallel by default, so it might use all the cores of the Linux machine/server when run locally this way, since the local scheduler built into nextflow is very basic and by default does not cap the number of CPU cores used for such runs.
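That said, nextflow's local executor does honor an overall resource cap set in a configuration file; whether `run.sh` exposes a way to pass an extra config (for example via nextflow's `-c` option) is an assumption you would need to verify against the script itself:

```groovy
// custom.config -- hypothetical file name; passed as: nextflow run ... -c custom.config
executor {
    name = 'local'
    cpus = 16          // total CPU cores nextflow may schedule for local tasks
    memory = '64 GB'   // total memory nextflow may schedule for local tasks
}
```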
Swapping sample IDs and adding phenotype data
This procedure is documented here: http://research.libd.org/SPEAQeasy-example/swap_speaqeasy.html (adding phenotype data is covered at the bottom of that page). Basically, it amounts to updating the `colData` of the RSE object, after properly swapping the SAMPLE_IDs if needed.