Skip to content

convert nucleic sequence in protein sequence

License

Notifications You must be signed in to change notification settings

feliixx/gotranseq

Repository files navigation

Go Report Card codecov PkgGoDev

Gotranseq

Translate nucleic acid sequences to their corresponding peptide sequences. Like EMBOSS transeq, but written in go

Purpose

EMBOSS transeq is a great tool, but can be quite painfull for some use cases, because it silently truncate the sequence ID if it contains chars like ':', or rename the sequence ID if it contains chars like '|'

This tool is an attempt to solve this problem. It's also way faster than EMBOSS transeq because it can be parrallelized:

benchmark on ubuntu 16.04, machine with 2 CPU Intel(R) Core(TM)2 Duo CPU 3.00GHz with a 189MB fasta file:

#EMBOSS transeq
time transeq -sequence file.fna -outseq out.faa -frame 6  
41,82s user 0,76s system 85% cpu 49,696 total

#gotranseq
time ./gotranseq --sequence file.fna --outseq out.faa --frame 6 -n 2
7,75s user 0,98s system 159% cpu 5,472 total

Works on Linux, Mac and windows

Installation

Download the binary from the release page

or

Build from source:

First, make sure that go is installed on your machine (see install go for details ). Then clone the repo and build it:

git clone https://github.com/feliixx/gotranseq.git
cd gotranseq
go build

Usage

use gotranseq --help to print the help:

gotranseq version 0.3.0

Usage:
  gotranseq --sequence file.fna --outseq out.faa

required:
  -s, --sequence=<filename>    Nucleotide sequence(s) filename
  -o, --outseq=<filename>      Protein sequence filename

optional:
  -f, --frame=<code>           Frame to translate. Possible values:
                               [1, 2, 3, F, -1, -2, -3, R, 6]
                               F: forward three frames
                               R: reverse three frames
                               6: all 6 frames
                               (default: 1)
  -t, --table=<code>           NCBI code to use, see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?chapter=tgencodes#SG1 for
                               details. Available codes:
                               0: Standard code
                               2: The Vertebrate Mitochondrial Code
                               3: The Yeast Mitochondrial Code
                               4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
                               5: The Invertebrate Mitochondrial Code
                               6: The Ciliate, Dasycladacean and Hexamita Nuclear Code
                               9: The Echinoderm and Flatworm Mitochondrial Code
                               10: The Euplotid Nuclear Code
                               11: The Bacterial, Archaeal and Plant Plastid Code
                               12: The Alternative Yeast Nuclear Code
                               13: The Ascidian Mitochondrial Code
                               14: The Alternative Flatworm Mitochondrial Code
                               16: Chlorophycean Mitochondrial Code
                               21: Trematode Mitochondrial Code
                               22: Scenedesmus obliquus Mitochondrial Code
                               23: Thraustochytrium Mitochondrial Code
                               24: Pterobranchia Mitochondrial Code
                               25: Candidate Division SR1 and Gracilibacteria Code
                               26: Pachysolen tannophilus Nuclear Code
                               29: Mesodinium Nuclear
                               30: Peritrich Nuclear
                               (default: 0)
  -c, --clean                  Replace stop codon '*' by 'X'
  -a, --alternative            Define frame '-1' as using the set of codons starting with the last codon of the sequence
  -T, --trim                   Removes all 'X' and '*' characters from the right end of the translation. The trimming process starts at the
                               end and continues until the next character is not a 'X' or a '*'
  -n, --numcpu=<n>             Number of worker to use (default: number of CPU)

general:
  -h, --help                   Show this help message
  -v, --version                Print the tool version and exit