An RNA-Seq pipeline for eukaryotic systems written in Python
C. Breuer, 2024
This pipeline is designed for the analysis of RNA-Seq data from eukaryotic systems. The offers quality assessment with FastQC, and calls pipeline RNA STAR, an efficient splicing-aware aligner via the linux (or WSL) command line. Additional scripts are provided for running differential expression analysis with DESeq2 in R, and for annotating those results with gene names using a GTF file.
- Linux or WSL. It is expected that this pipeline will be executed in a linux environment.
- Python3 and Java (for FastQC)
- Install with: "apt-install python3"
- The RNA-STAR aligner should be installed. This can be done using the following commands in Ubuntu:
" apt-get update"
" apt-install RNA-STAR"
For more information, see the STAR manual. - An RNA-STAR reference genome (see the STAR manual for commands).
This requires you to have a fasta formatted genome file and (optionally) an annotations file (e.g. a .gtf file). It is highly beneficial to include the GTF annotation file as this will allow STAR to tally read counts while doing the alignmenment. Place the output STAR genome files in a folder named "genomefiles". - For rDNA quality control mapping, an rRNA (or tRNA, etc.) fasta sequence and Bowtie2 are required. Bowtie2 can be installed with:
"apt-get update"
"apt-install bowtie2"
Follow the instructions for installing a bowtie genome. e.g. "bowtie2-build path/to/genome.fa yourGenomeName". Move the output genome files to the "genomefiles" folder.
- Create a working folder for your analysis
- Within the working folder, you should have your fastq files in a folder called "fastq". Fastq file name should be formatted like "sample1_R1.fastq" and "sample1_R2.fastq" for a typical paired-end sample. Files can also be gzipped and named like "Sample1_R1.fq.gz" or "Sample1_R1.fastq.gz". If additional formats are required, please edit the rnaSeq.py script, or change the filenames with the provided change_filenames.py script (no arguments needed).
- Ensure that STAR can be called in the linux/WSL command line using command "RNA-STAR".
- All STAR and Bowtie2 genome files should be placed in a subdirectory called "genomefiles". For example, for STAR genome named "dm6", the files should be under yourworkingdirectory/genomefiles/dm6. For bowtie2 genome "bs168", the files should be under yourworkingdirectory/genomefiles. Note that these filepaths can be changed if you edit the rnaSeq.py script customization section.
- Fasta formatted genome files and genome annotation files (e.g. .gtf) should be in ./genomefiles.
- Edit the User Customization section of the rnaSeq.py script in any text editor. Specifically, change line 265 as needed to indicate the name of the genome you're mapping to. i.e. if you named the STAR genome "dm6" for Drosophila melanogaster build number 6, then that line should read g='dm6'.
- Fastq files can be trimmed on adapters if desired. To do this, start a WSL terminal in the working directory and enter python ./scripts/runTrimmomatic.py. Follow the prompts to trim universal TruSeq adaptors from your reads. The script can be edited for other primer sequences.
- For quickly test for rRNA (or other) undesirable contaminants, the a small sample of each fastq file can be created. To do this, start a WSL terminal in the working directory and enter python./scripts/shrinkFastq.py. Follow the on-screen prompts to select the desired numbmer of reads in the small fastq outfiles (e.g. 250000 is often enough). See QC steps below for instructions on running the rRNA mapping.
- The pipeline can be run by executing the rnaSeq.py script with the command: python rnaSeq.py.
- The script will provide run options as screen prompts.
- You may want to first run the script with only the FastQC option. At that point, you can examine your sequence quality and, if needed, trim the ends with Trimmomatic (wrapper and executable are provided in the scripts folder).
- Follow the prompts to activate FastQC, alignment by STAR, and to use FeatureCounts.
- Notes:
- Default STAR settings will result in STAR producing read counts as part of the alignment process. This will only occur if you input a gene annotation file when building your STAR genome.
- STAR will output aligned files in .bam format.
- To test for contaminating rRNA/tRNA/etc. follow the instructions above (File Prep) to sample the fastq files (or you can map the whole file, it will just take longer). Open a WSL terminal in the working directory and enter python ./scripts/map2rDNA.py to do the alignment. Alignment statistics will be output to a log file in the working directory. Note that you will have to have previously set up genome files for this mapping to work.
- Follow the instructions in the DESeq manual to set up readcounts input files and metadata. The data (readcounts) file looks like this:
gene | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 |
---|---|---|---|---|---|---|
gene1 | 2257 | 2440 | 2537 | 611 | 501 | 413 |
gene2 | 3 | 0 | 1 | 0 | 1 | 7 |
gene3 | 1217 | 1394 | 1239 | 801 | 413 | 578 |
gene4 | 0 | 0 | 0 | 0 | 0 | 0 |
gene5 | 20 | 19 | 7 | 12 | 15 | 4 |
Here the samples 1-3 are controls and samples 3-6 are test. As such the metadata file looks like this:
id | genotype |
---|---|
Sample1 | WT |
Sample2 | WT |
Sample3 | WT |
Sample4 | Mutant |
Sample5 | Mutant |
Sample6 | Mutant |
- To analyze these the data file, load both the data and metadata files into R using the ./scripts/DESeq2.R script in RStudio.
- This will output a volcano plot with significantly changed genes highlighted in red, and a 2D principal component analysis plot to visually display sample variability.
- To label the DESEq outfiles with gene level information, open the annotateRC.R script in R studio. Load your GTF file and follow the steps to merge the two tables.
- Though not covered here, differential gene expression profiles can be loaded into pathway analysis software (e.g. Ingenuity Pathway Analysis) to get a sense of the broader significance of your gene signatures.