DerDocs Home

Guide to MAKER Genome Annotation Pipeline on Kepler

Overview

MAKER is a software pipeline designed for annotating whole-genome assemblies, but it may be useful for annotating shorter sequences as well. MAKER streamlines genome annotation by automatically carrying out sequence alignment and ab initio gene prediction, handling intermediate data files, and synthesizing final annotations from multiple lines of evidence. It is meant to be run iteratively (several consecutive runs in the same directory) and can update and extend existing genome annotations, a process that works best if the existing annotation was created by MAKER. Protein-coding gene annotations are expected to be most complete when transcriptome evidence, proteome evidence, and HMM-based predictions are combined. Repeat loci are annotated by RepeatMasker, and some are classified if classifications are present in the repeat database used. Repbase is a good repeat database to include, and it can be supplemented with a de novo set of consensus repeat sequences generated by RepeatModeler (separate from MAKER). tRNA and other noncoding RNA genes can be identified as part of a MAKER run by turning on the relevant options and adding the relevant paths to maker_exe.ctl.



Loading MAKER

MAKER is installed both locally (/home/joshd/software/) and globally (/share/apps/genomics/). The global installation is probably less reliable than the local one. If you don't already have one, you may need to create or copy a local module file for MAKER. Information about how to set up your module environment is in KeplerModules.md.

To load MAKER, type

module load local/maker

A message should print with the MAKER version number and a short description. This information is hardcoded in the maker module file at



Setting up your MAKER space

It's best to create a new directory from which to run MAKER. When you are ready to start the run, this directory will contain the three control files and the shell script that you submit to the TORQUE resource manager with qsub; both are explained below. While running, MAKER will generate a very large number of files. Warning: moving a MAKER output directory around in your filesystem takes a very long time.

In this document, $RUNDIR is the location of the hypothetical directory from which MAKER will be executed.



Control Files

To generate template control files, from within $RUNDIR and after loading the MAKER module, type maker -CTL. This will generate three files:

maker_opts.ctl contains general settings,
maker_bopts.ctl contains BLAST settings,
maker_exe.ctl contains paths to dependency executables.
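
As a minimal sketch of the setup described above, the helper below checks that the three control files are present before you go further. The function name check_ctl_files and the example path under $HOME are my own placeholders, not part of MAKER.

```shell
# Hypothetical helper: succeed only if all three MAKER control files
# exist in the given directory (defaults to the current directory).
check_ctl_files() {
    local dir="${1:-.}" missing=0 f
    for f in maker_opts.ctl maker_bopts.ctl maker_exe.ctl; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use (RUNDIR is a placeholder path; adjust to your own layout):
#   RUNDIR="$HOME/maker_runs/my_genome"
#   mkdir -p "$RUNDIR" && cd "$RUNDIR"
#   module load local/maker
#   maker -CTL
#   check_ctl_files "$RUNDIR" && echo "control files ready"
```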

The syntax for the control files is key=value, with comments preceded by #. The control files contain descriptive comments. Here is my explanation of the most important options to know about. Default BLAST settings were chosen because MAKER performed well under them for eukaryotic genomes.
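
Because the format is plain key=value with # comments, it is easy to review which options actually have a value before a run. The function below is a sketch of one way to do that; list_set_options is a name I made up, and it assumes values never contain a # character.

```shell
# Hypothetical helper: print key=value for every option in a control
# file that has a non-empty value, with comments stripped.
list_set_options() {
    awk '
        /^[ \t]*#/ { next }              # skip comment-only lines
        {
            sub(/#.*/, "")               # drop the trailing comment
            sub(/[ \t]+$/, "")           # drop trailing whitespace
            eq = index($0, "=")
            # keep the line only if something follows the "="
            if (eq > 1 && length($0) > eq) print
        }
    ' "$1"
}

# Typical use:
#   list_set_options maker_opts.ctl
```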

DNA and protein sequences can be provided to MAKER in FASTA or GFF3 format. Below, the FASTA options are shown.


Selected Options in maker_opts.ctl

syntax: key=value e.g., genome=/path/to/mygenome.fasta

DNA sequence (genome assembly)

genome
Required, in FASTA format. Value is a path to a FASTA file.

Eu/Pro-karyotic

organism_type
Value is either eukaryotic or prokaryotic.

Transcript(ome)s

est
Value is a path to a nucleotide FASTA file of expressed sequences (e.g. ESTs, mRNA-seq transcripts). MAKER will use alignments of these sequences as evidence for the existence of genes, and will infer gene models directly from the alignments using Exonerate est2genome if est2genome=1.

Alt-species transcript(ome)s

altest
Value is a path to a nucleotide FASTA file of expressed sequences (e.g. ESTs, mRNA-seq transcripts) from a different organism than the one you are annotating. MAKER will use alignments of these sequences as evidence for the existence of genes, and will infer gene models directly from the alignments using Exonerate est2genome if est2genome=1. This option will likely not add much to the annotation if assembled mRNA-seq data are given with est.

Protein sequences

protein
Value is a path to a protein FASTA file. MAKER will use homology to these proteins as evidence for the existence of genes, and will infer gene models directly from the alignments using Exonerate protein2genome if protein2genome=1. The MAKER documentation suggests using "NP" sequences from RefSeq but not unreviewed sequences.

Repeat library (TEs, other repeats)

rmlib
Value is a path to a nucleotide FASTA file. Repbase is a good start, but de novo libraries can be created with RepeatModeler.

TE proteins

repeat_protein
Value is a path to a protein FASTA file of transposable element proteins, which assist with masking. A file containing some of these, te_proteins.fasta, is included with MAKER at makerinstalldir/data/

Gene predictor trained HMM

snaphmm
gmhmm
augustus_species
fgenesh_par_file
Value is a path to a program-specific HMM file (for augustus_species, the name of an Augustus species model rather than a path). To use a gene predictor, you need a trained HMM for one of the four integrated ab initio gene prediction programs: SNAP, GeneMark, Augustus, or FGENESH.

Infer gene models directly from sequence alignments

est2genome
protein2genome
Value is 0 or 1. These turn on direct inference of gene models from transcript (est & altest) and protein (protein) sequences, respectively.

Annotate tRNA genes

trna
Value is 0 or 1.

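Annotate snoRNA genes

snoscan_rrna
Value is a path to an rRNA FASTA file. Snoscan uses these rRNA sequences to find snoRNAs, so this option is a file path rather than a 0/1 switch.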

Run ab initio on unmasked sequence

unmask
Value is 0 or 1. This could be helpful in annotating novel transposons.

Number of processors available

cpus
Value is any positive integer. Leave this value at 1 if you are using MPI; that is, running MAKER on multiple nodes simultaneously.

Extra stats in GFF3

pred_stats
Value is 0 or 1. By default, only transcripts contain AED and other statistics (QI), while features such as exons do not. Turn it on to get this information for all features.

Quality threshold

AED_threshold
Value is a number between 0 and 1. Setting maximum tolerated AED to less than one will result in fewer models of higher average quality.

Min protein length from gene predictors

min_protein
Value is a positive integer. Gene predictors can predict many small proteins. Setting a minimum length can reduce spurious predictions by the ab initio predictors.

Find alternate splice forms

alt_splice
Value is 0 or 1. Gene predictors will be trained to find alternative splice forms if this is on, which will be output in the GFF3.
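
The options above can be edited by hand, but for repeated runs it may be convenient to set them from the shell. The sketch below replaces the value of one key while keeping the descriptive comment on that line. set_opt is a hypothetical helper of mine, the paths in the usage lines are placeholders, and it assumes each key appears once at the start of a line.

```shell
# Hypothetical helper: set_opt FILE KEY VALUE
# Rewrites the line "KEY=... #comment" as "KEY=VALUE #comment".
# Returns non-zero (and leaves FILE untouched) if KEY is not found.
set_opt() {
    local file="$1" key="$2" value="$3" tmp
    tmp=$(mktemp) || return 1
    if awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 {
            hash = index($0, "#")
            comment = (hash > 0) ? (" " substr($0, hash)) : ""
            print k "=" v comment
            found = 1
            next
        }
        { print }
        END { exit(found ? 0 : 1) }
    ' "$file" > "$tmp"; then
        cat "$tmp" > "$file"      # overwrite in place, keeping permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Typical use for a first evidence-based run (placeholder paths):
#   set_opt maker_opts.ctl genome  /path/to/mygenome.fasta
#   set_opt maker_opts.ctl est     /path/to/transcripts.fasta
#   set_opt maker_opts.ctl protein /path/to/proteins.fasta
#   set_opt maker_opts.ctl est2genome 1
#   set_opt maker_opts.ctl protein2genome 1
```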


Paths in maker_exe.ctl


Paths to ab initio gene prediction programs are required only for the programs you use in your MAKER run.

#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb= #location of NCBI+ makeblastdb executable
blastn= #location of NCBI+ blastn executable
blastx= #location of NCBI+ blastx executable
tblastx= #location of NCBI+ tblastx executable
RepeatMasker=/home/joshd/software/maker/bin/../exe/RepeatMasker/RepeatMasker #location of RepeatMasker executable
exonerate=/home/joshd/software/maker/bin/../exe/exonerate/bin/exonerate #location of exonerate executable

#-----Ab-initio Gene Prediction Algorithms
snap= #location of snap executable
gmhmme3= #location of eukaryotic genemark executable
gmhmmp= #location of prokaryotic genemark executable
augustus= #location of augustus executable
fgenesh= #location of fgenesh executable
tRNAscan-SE= #location of trnascan executable
snoscan= #location of snoscan executable
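
A wrong path in maker_exe.ctl is an easy mistake to make, so it may be worth checking the file before submitting a job. The function below is a rough sketch that reports, for every key with a value, whether that value points to an executable file. check_exe_paths is a name I chose, and the sketch assumes paths contain no # or = characters.

```shell
# Hypothetical helper: check_exe_paths FILE
# Prints one line per configured executable and returns non-zero
# if any configured path is not an executable file.
check_exe_paths() {
    local status=0 key value
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$key" in ''|\#*) continue ;; esac      # blank or comment line
        value="${value%%#*}"                          # strip trailing comment
        value="${value%"${value##*[![:space:]]}"}"    # trim trailing whitespace
        [ -z "$value" ] && continue                   # option left empty
        if [ -x "$value" ] && [ ! -d "$value" ]; then
            echo "ok       $key -> $value"
        else
            echo "MISSING  $key -> $value"
            status=1
        fi
    done < "$1"
    return "$status"
}

# Typical use:
#   check_exe_paths maker_exe.ctl
```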


Command Line Options and Submission Script

The first line of the submission script tells the computer which program to use to run it. Information about the values on the #PBS lines that follow can be found by typing man qsub; in the screen that pops up, press / then type a term to search for, for example, -k. The search looks only ahead of the current view position in the file, but you can go to the top of the file by pressing g. To go to the next search match press n, and to go to the previous search match press N. The available options for the line #PBS -q are q40, q24, long, and performance.

MAKER needs to be loaded along with tRNAscan-SE if tRNA annotation will be carried out by your MAKER run, and OpenMPI needs to be loaded only if you will run MAKER with MPI. As an alternative to MPI, multiple identical jobs may be started in the same directory to achieve cross-node parallelization, but the MAKER developers say MPI is faster. If you are running with MPI, begin the command with mpirun -n i, where i is the total number of processors you will use, followed by a normal MAKER call beginning with maker. If using MPI, leave cpus=1 in the file maker_opts.ctl. Next on the line, specify the paths to the control files, then redirect stdout and stderr to files to serve as references if anything goes wrong.

#!/bin/bash
#PBS -k oe
#PBS -N JobName
#PBS -q q40
#PBS -j oe
#PBS -m ea
#PBS -M [email protected]
#PBS -l nodes=1:ppn=40

module load local/maker
module load local/trnascan
module load openmpi

cd /home/derstudent/data/santalales/annotation_all_taxa

mpirun -n 40 maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl 1>maker.log 2>maker.err
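
The process count given to mpirun has to agree with the nodes and ppn values requested from TORQUE. One way to avoid keeping the two in sync by hand is to count the slots TORQUE assigned, as sketched below. This assumes the job runs under TORQUE, which sets $PBS_NODEFILE to a file with one line per allocated processor; count_slots is a hypothetical helper name.

```shell
# Hypothetical helper: print the number of processor slots assigned to
# this job, or 1 when not running under TORQUE.
count_slots() {
    local nodefile="${PBS_NODEFILE:-}"
    if [ -n "$nodefile" ] && [ -r "$nodefile" ]; then
        grep -c . "$nodefile"     # one non-empty line per slot
    else
        echo 1
    fi
}

# In the submission script, in place of a hard-coded number:
#   NP=$(count_slots)
#   mpirun -n "$NP" maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl
```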


Output

Within the directory from which the shell script calling maker is submitted with qsub, a directory will appear named <inputFastaFilename>.maker.output/. Many subdirectories exist within this directory where MAKER's processes read and write files in parallel. When MAKER has finished running, GFF3 and FASTA files of annotations can be obtained using accessory scripts.

  1. Collecting final annotations in GFF3 format

gff3_merge -d <outputDir/inputFilename_master_datastore_index.log>

  2. Collecting final annotations in FASTA format

fasta_merge -d <outputDir/inputFilename_master_datastore_index.log>
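
Before merging, it can help to confirm that every sequence finished. The sketch below reads the master datastore index log and lists sequences whose most recent status is not FINISHED. It assumes the log is tab-delimited with three columns (sequence name, datastore directory, status) and that each status change is appended as a new line, which is my understanding of the format; check a few lines of your own log first. unfinished_contigs is a hypothetical name.

```shell
# Hypothetical helper: unfinished_contigs INDEX_LOG
# Prints "name<TAB>last_status" for each sequence whose last recorded
# status in the index log is something other than FINISHED.
unfinished_contigs() {
    awk -F'\t' '
        NF >= 3 { last[$1] = $3 }      # later lines overwrite earlier ones
        END {
            for (s in last)
                if (last[s] != "FINISHED") print s "\t" last[s]
        }
    ' "$1" | sort
}

# Typical use (placeholder file name):
#   unfinished_contigs mygenome.maker.output/mygenome_master_datastore_index.log
```

No output means every sequence in the log reached FINISHED.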



MAKER Accessory Scripts



Training Gene Predictors


SNAP


AUGUSTUS



Example MAKER Protocol

What we did for Azolla and Salvinia.



Manipulating GFF3-format Files


Adding Attributes


Renaming Attributes


Command-Line Pipeline Examples for Extracting Information from GFF3


Creating Tracks for Circos


Feature-Coordinate Arithmetic with BED Format



DerDocs Home