1. Create a run directory, change current directory to it, and retrieve files from GitHub to it using the command:

    git clone https://github.com/compbiomed/RNA_Seq ./

2. Generate a tab-delimited Nextflow input file following the format described below under params.infile.
3. Edit the RNA-seq_template.config file:
   - Set params.infile to the full path to the tab-delimited file describing the FASTQ input files.
   - Set params.output_dir to the full path to the Nextflow run directory.
   - Set params.prefix to a meaningful name for the project. This string will be used as a prefix to label many output files.
   - Set the fields of params.genome:
     - species: The scientific name of the species being used (e.g., "Homo sapiens", "Mus musculus")
     - ucsc: The UCSC build corresponding to the FASTA reference that will be used (e.g., "hg38", "mm10"), as generated, for example, by make_ucsc_references.qsub. If an Ensembl genome reference is to be used instead, either remove this field or leave it set to "".
     - assembly: The corresponding genome assembly (e.g., "GRCh38", "GRCm38")
     - set: The subset of sequences that will be used: "base" (autosomes, sex chromosomes, and mitochondrial chromosome), "base_random" (base sequences plus random/unplaced contigs), or "base_random_althap" (base and random sequences plus alternative haplotype sequences); see make_ucsc_references.qsub for more details
     - ensembl: The Ensembl build number that will be used (e.g., 100)
   - Change params.read_length, params.paired_end, and params.stranded if needed (rare).
4. Rename the RNA-seq_template.config file to something more meaningful (e.g., [params.prefix].config)
5. Start the Nextflow run using the qsub file as follows:

    qsub -P [SGE project name] submit_RNA_Seq.qsub [config filename]

   This will run the Nextflow script RNA_Seq.nf using the input file and config files from steps 2-4.
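The tab-delimited input file from step 2 can be written by hand or scripted. The following is a minimal Python sketch, not part of this repository, assuming one library per sample and one sample per individual. The function names `make_row` and `write_infile`, the "." separator used to build RG_ID, the header row, and the column order are all assumptions made for illustration; the column names themselves are the ones listed under params.infile.

```python
import csv
from pathlib import Path

# Column names as listed under params.infile (order assumed to match the list).
COLUMNS = [
    "INDIVIDUAL_ID", "SAMPLE_ID", "LIBRARY_ID", "RG_ID", "PLATFORM_UNIT",
    "PLATFORM", "PLATFORM_MODEL", "RUN_DATE", "CENTER", "R1", "R2",
]


def make_row(sample_id, flowcell, r1, r2, lane=None,
             platform="illumina", platform_model="NextSeq"):
    """Build one input row for a sample sequenced as a single library."""
    # Assumption: "." joins the flowcell ID, optional lane suffix, and
    # sample ID; the README does not specify a separator.
    parts = [flowcell] + ([lane] if lane else []) + [sample_id]
    rg_id = ".".join(parts)
    return {
        "INDIVIDUAL_ID": sample_id,  # one sample per individual
        "SAMPLE_ID": sample_id,
        "LIBRARY_ID": sample_id,     # one library per sample
        "RG_ID": rg_id,
        "PLATFORM_UNIT": rg_id,      # no library-specific suffix needed
        "PLATFORM": platform,
        "PLATFORM_MODEL": platform_model,
        "RUN_DATE": "",              # optional
        "CENTER": "",                # optional
        "R1": str(r1),               # full path to the first-read FASTQ
        "R2": str(r2),               # full path to the second-read FASTQ
    }


def write_infile(rows, path):
    """Write rows as a tab-delimited file with a header line."""
    with Path(path).open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_infile([make_row("S1", "FC1", "/data/S1_R1.fastq.gz", "/data/S1_R2.fastq.gz")], "input.tsv")` would produce a two-line file whose full path can then be used for params.infile. The sample ID, flowcell ID, and paths here are placeholders.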
This Nextflow pipeline contains the following processes:
- generateGTF:
  - Retrieve a GTF file from Ensembl for the species, genome assembly, and Ensembl build number specified in the .config file
  - Use the UCSC chromAlias table for the corresponding UCSC genome build to convert the sequence names (i.e., chromosomes) from Ensembl to UCSC nomenclature
  - Limit the result to the subset of sequences specified in params.genome.set (e.g., "base_random")
- generateBED: Use the UCSC command-line utilities gtfToGenePred and genePredToBed to convert the GTF file to a BED file (for use with RSeQC below)
- runRSEMprepareReference: Prepare a set of RSEM reference files using the FASTA reference specified in the .config file and the Ensembl
- runFastQC: Use FastQC to perform QC on each pair of FASTQ files
- runMultiQCFastq: Compile output from runFastQC into TSV tables and HTML report using MultiQC
- runSTAR1pass: Perform a first-pass alignment to a specified genome using STAR
- runSTARgenomeGenerate: Create a new genome reference from splice junctions inferred from first-pass STAR alignment
- runSTAR2pass: Perform a second-pass alignment to the genome reference produced by runSTARgenomeGenerate
- BAM QC, performed using RSeQC:
  - runRSeQCbamStat
  - runRSeQCclippingProfile
  - runRSeQCdeletionProfile
  - runRSeQCgeneBodyCoverage (Note: this is a very slow step)
  - runRSeQCinferExperiment
  - runRSeQCinnerDistance
  - runRSeQCinsertionProfile
  - runRSeQCjunctionAnnotation
  - runRSeQCjunctionSaturation
  - runRSeQCreadDistribution
  - runRSeQCreadDuplication
  - runRSeQCreadGC
  - runRSeQCreadNVC
  - runRSeQCreadQuality
  - runRSeQCtin (Note: this is a very slow step)
- runMultiQCSample: Compile output from RSeQC and STAR using MultiQC
- runRSEMcalculateExpression: Use RSEM to estimate gene- and transcript (isoform)-level expression
- createSE: Create SummarizedExperiment R objects, one for gene-level expression and one for transcript-level expression; each object contains several estimates of expression from RSEM, as well as QC parameters and feature-level annotation
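The sequence-name conversion and subsetting described for generateGTF can be illustrated with a short Python sketch. This is not the code used by the pipeline; the function name `convert_gtf_seqnames` and the in-memory alias dictionary are assumptions made for illustration, standing in for a lookup built from the UCSC chromAlias table. The alias entries shown are illustrative examples of Ensembl-to-UCSC naming (e.g., "1" to "chr1", "MT" to "chrM").

```python
# Illustrative Ensembl -> UCSC aliases; a real run would load these from
# the chromAlias table of the chosen UCSC build.
EXAMPLE_ALIAS = {
    "1": "chr1",
    "MT": "chrM",
    "KI270711.1": "chr1_KI270711v1_random",
}


def convert_gtf_seqnames(gtf_lines, alias, keep):
    """Yield GTF lines with the first column mapped through `alias`,
    keeping only records whose converted name is in `keep`."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line  # pass header/comment lines through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        ucsc_name = alias.get(fields[0])
        if ucsc_name is None or ucsc_name not in keep:
            continue  # unmapped, or outside the requested subset
        fields[0] = ucsc_name
        yield "\t".join(fields) + "\n"
```

With `keep={"chr1", "chrM"}`, a record on "KI270711.1" is dropped while records on "1" and "MT" are renamed and kept, which corresponds to restricting the output to a "base"-style subset.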
The configuration file RNA-seq_template.config is intended to be used only with the Nextflow script RNA_Seq.nf. It makes a number of assumptions about underlying directory structures and filenames. The parameters that are typically changed are:
- params.infile: Full path to a TSV file containing the following columns (those in bold are not optional and cannot be left blank):
  - INDIVIDUAL_ID: An ID for an individual from which one or more samples was obtained (if only one sample was sequenced from each individual, this can be the same as SAMPLE_ID, or left blank)
  - SAMPLE_ID: An ID for each sample
  - LIBRARY_ID: An ID for each library prepared from a sample (if only one library was sequenced from each sample, this can be the same as SAMPLE_ID, or left blank)
  - RG_ID: Read Group ID: the flowcell ID, optionally followed by a lane-specific suffix (for instruments with independent lanes), followed by a sample-specific identifier (e.g., SAMPLE_ID)
  - PLATFORM_UNIT: RG_ID, optionally followed by a suffix specific to a library (if more than one library was sequenced per sample)
  - PLATFORM: Sequencing platform, e.g., "illumina" for Illumina instruments
  - PLATFORM_MODEL: Instrument model (e.g., "NextSeq", "HiSeq", etc.)
  - RUN_DATE: Optional run date
  - CENTER: Optional name for center at which sequencing was performed
  - R1: Full path to FASTQ file containing first paired-end read
  - R2: Full path to FASTQ file containing second paired-end read

Note: some of these fields are discussed in more detail within the GATK read groups documentation.
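Before starting a run, the input file can be sanity-checked against the column descriptions above. The following Python sketch is an illustration only and is not part of the pipeline; the function name `check_infile` is hypothetical, and it assumes the file has a header row using the column names listed above. It checks three things taken from the descriptions: all columns are present, R1 and R2 (when filled in) are full paths, and PLATFORM_UNIT begins with RG_ID.

```python
import csv
from pathlib import Path, PurePosixPath

EXPECTED_COLUMNS = [
    "INDIVIDUAL_ID", "SAMPLE_ID", "LIBRARY_ID", "RG_ID", "PLATFORM_UNIT",
    "PLATFORM", "PLATFORM_MODEL", "RUN_DATE", "CENTER", "R1", "R2",
]


def check_infile(path):
    """Return a list of problems found in the input TSV (empty if none)."""
    problems = []
    with Path(path).open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [c for c in EXPECTED_COLUMNS if c not in header]
        if missing:
            return ["missing columns: " + ", ".join(missing)]
        # Data rows start on line 2, after the header.
        for line_no, row in enumerate(reader, start=2):
            for col in ("R1", "R2"):
                value = row[col]
                # POSIX-style absolute paths are assumed (cluster filesystem).
                if value and not PurePosixPath(value).is_absolute():
                    problems.append(
                        f"line {line_no}: {col} is not a full path: {value}")
            rg_id = row["RG_ID"] or ""
            unit = row["PLATFORM_UNIT"] or ""
            if not unit.startswith(rg_id):
                problems.append(
                    f"line {line_no}: PLATFORM_UNIT does not start with RG_ID")
    return problems
```

An empty list from `check_infile("input.tsv")` means none of these checks failed; it does not guarantee that the file satisfies every requirement of the pipeline.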
- params.output_dir: Full path where the Nextflow output should be written
- params.prefix: Prefix for Nextflow output files
- params.paired_end: Indicates whether paired-end sequencing was used (true or false)
- params.genome: Parameters specific to the genome and annotation build used; these can be uncommented and edited as needed
The script submit_RNA_Seq.qsub kicks off the Nextflow process on SGE using the .config file specified in its sole argument.