Merge pull request #1 from aweimann/master
v1.01
aweimann committed Mar 11, 2016
2 parents 1a78555 + 9f2d576 commit 922c781
Showing 28 changed files with 209 additions and 79 deletions.
18 changes: 10 additions & 8 deletions README.md
@@ -1,5 +1,5 @@
# traitar – the microbial trait analyzer
traitar is a software for characterizing microbial samples from nucleotide or protein sequences. It can accurately phenotype [67 diverse traits](traits.tsv).
# Traitar – the microbial trait analyzer
Traitar is a software for characterizing microbial samples from nucleotide or protein sequences. It can accurately phenotype [67 diverse traits](traits.tsv).

### Table of Contents
[Installation](#installation)
@@ -17,7 +17,9 @@ Please see [INSTALL.md](INSTALL.md) for installation instructions.

``traitar phenotype <in dir> <sample file> from_nucleotides <out_dir> ``

will trigger the standard workflow of traitar, which is to predict open reading frames with Prodigal, annotate the coding sequences provided as nucleotide FASTAs in the <in_dir> for all samples in <sample_file> with Pfam families using HMMer and finally predict phenotypes from the models for the 67 traits.
will trigger the standard workflow of Traitar, which is to predict open reading frames with Prodigal, annotate the coding sequences provided as nucleotide FASTAs in the <in_dir> for all samples in <sample_file> with Pfam families using HMMer and finally predict phenotypes from the models for the 67 traits.

![Alt text](/workflow.png?raw=true "Optional Title")

The sample file has one column for the sample file names and one for the names as specified by the user. You can also specify a grouping of the samples in the third column, which will be shown in the generated plots. The template looks like following - The header row is mandatory; please also take a look at the sample file for the packaged example data:
sample_file_name{tab}sample_name{tab}category
@@ -26,16 +28,16 @@ sample2_file_name{tab}sample2_name[{tabl}sample_category2]

``traitar phenotype <in dir> <sample file> from_genes <out_dir> ``

assumes that gene prediction has been conducted already externally. In this case analysis will start with the Pfam annotation. If the output directory already exists, traitar will offer to recompute or resume the individual analysis steps. This option is only available if the process is run interactively.
assumes that gene prediction has been conducted already externally. In this case analysis will start with the Pfam annotation. If the output directory already exists, Traitar will offer to recompute or resume the individual analysis steps. This option is only available if the process is run interactively.

### Parallel usage
traitar can benefit from parallel execution. The ``-c`` parameter sets the number of processes used e.g. ``-c 2`` for using two processes
Traitar can benefit from parallel execution. The ``-c`` parameter sets the number of processes used e.g. ``-c 2`` for using two processes

``traitar phenotype <in dir> <sample file> from_nucleotides out_dir -c 2``

This requires installing GNU parallel as noted above.

### Run traitar with packaged sample data
### Run Traitar with packaged sample data
``traitar phenotype <traitar_dir>/data/sample_data <traitar_dir>/data/sample_data/samples.txt from_genes <out_dir> -c 2`` will trigger phenotyping of *Listeria grayi DSM_20601* and *Listeria ivanovii WSLC3009*. Computation should be done within 5 minutes. You can find out ``<traitar_dir>`` by running

```
@@ -46,7 +48,7 @@ python


# Results
traitar provides the gene prediction results in ``<out_dir>/gene_prediction``, the Pfam annotation in ``<out_dir>/pfam_annotation`` and the phenotype prediction in``<out_dir>/phenotype prediction``.
Traitar provides the gene prediction results in ``<out_dir>/gene_prediction``, the Pfam annotation in ``<out_dir>/pfam_annotation`` and the phenotype prediction in``<out_dir>/phenotype prediction``.

### Heatmaps
The phenotype prediction is summarized in heatmaps individually for the phyletic pattern classifier in ``heatmap_phypat.png``, for the phylogeny-aware classifier in ``heatmap_phypat_ggl.png`` and for both classifiers combined in ```heatmap_comb.png``` and provide hierarchical clustering dendrograms for phenotypes and the samples.
@@ -57,7 +59,7 @@ The phenotype prediction is summarized in heatmaps individually for the phyletic
These heatmaps are based on tab separated text files e.g. ``predictions_majority-votes_combined.txt``. A negative prediction is encoded as 0, a prediction made only by the pure phyletic classifier as 1, one made by the phylogeny-aware classifier by 2 and a prediction supported by both algorithms as 3. ``predictions_flat_majority-votes_combined.txt`` provides a flat version of this table with one prediction per row. The expert user might also want to access the individual results for each algorithm in the respective sub folders ``phypat`` and ``phypat+PGL``.

### Feature tracks
If traitar is run from_nucleotides it will generate a link between the Prodigal gene prediction and predicted phenotypes in ``phypat/feat_gffs`` and ``phypat+PGL/feat_gffs`` (no example in the sample data). The user can visualize gene prediction phenotype-specific Pfam annotations tracks via GFF files.
If Traitar is run from_nucleotides it will generate a link between the Prodigal gene prediction and predicted phenotypes in ``phypat/feat_gffs`` and ``phypat+PGL/feat_gffs`` (no example in the sample data). The user can visualize gene prediction phenotype-specific Pfam annotations tracks via GFF files.

#### Feature tracks with *from_genes* option (experimental feature)
If the *from_genes* option is set, the user may specify gene GFF files via an additional column called gene_gff in the sample file. As gene ids are not consistent across gene GFFs from different sources e.g. img, RefSeq or Prodigal the user needs to specify the origin of the gene gff file via the -g / --gene_gff_type parameter. Still there is no guarantee that this works currently. Using samples_gene_gff.txt as the sample file in the above example will generate phenotype-specific Pfam tracks for the two genomes.
124 changes: 124 additions & 0 deletions README.rst
@@ -0,0 +1,124 @@
Traitar – the microbial trait analyzer
======================================

Traitar is software for characterizing microbial samples from
nucleotide or protein sequences. It can accurately phenotype `67 diverse
traits <https://github.com/hzi-bifo/traitar/blob/master/traits.tsv>`__.
Please take a look at the `GitHub repository <https://github.com/hzi-bifo/traitar/>`__ for further information.

Table of Contents
~~~~~~~~~~~~~~~~~

| `Installation <#installation>`__
| `Basic usage <#basic-usage>`__
| `Results <#results>`__

Installation
============

Please see `INSTALL.md <https://github.com/hzi-bifo/traitar/blob/master/INSTALL.md>`__ for installation instructions.

Basic usage
===========

``traitar phenotype <in dir> <sample file> from_nucleotides <out_dir>``

will trigger the standard `workflow <https://raw.githubusercontent.com/hzi-bifo/traitar/master/workflow.png>`__ of Traitar, which is to predict open
reading frames with Prodigal, annotate the coding sequences provided as
nucleotide FASTAs in ``<in_dir>`` for all samples in ``<sample_file>`` with Pfam families using
HMMer, and finally predict phenotypes from the models for the 67 traits.

The sample file has one column for the sample file names and one for the
names as specified by the user. You can also specify a grouping of the
samples in the third column, which will be shown in the generated plots.
The template looks like the following. The header row is mandatory; please
also take a look at the sample file for the packaged example data:

| sample\_file\_name{tab}sample\_name{tab}category
| sample1\_file\_name{tab}sample1\_name[{tab}sample\_category1]
| sample2\_file\_name{tab}sample2\_name[{tab}sample\_category2]
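
The sample file layout described above can be sketched in Python. This is only an illustration, not part of Traitar: the helper name `write_sample_file` and the example file, sample and category names are hypothetical, and the sketch assumes that a row without a category simply omits the third column.

```python
import csv
import io


def write_sample_file(handle, samples):
    """Write a tab-separated sample table to an open text handle.

    `samples` is a list of (file_name, sample_name, category) tuples.
    A category of None leaves the optional third column out of that row.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # The header row is mandatory; column names follow the template.
    writer.writerow(["sample_file_name", "sample_name", "category"])
    for file_name, sample_name, category in samples:
        row = [file_name, sample_name]
        if category is not None:
            row.append(category)
        writer.writerow(row)


# Hypothetical samples, written to an in-memory buffer for illustration.
buffer = io.StringIO()
write_sample_file(
    buffer,
    [("sample1.faa", "sample1", "group_a"), ("sample2.faa", "sample2", None)],
)
sample_table = buffer.getvalue()
print(sample_table)
```

Passing an opened file instead of the buffer would write the same table to disk.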

``traitar phenotype <in dir> <sample file> from_genes <out_dir>``

assumes that gene prediction has already been carried out externally. In
this case the analysis will start with the Pfam annotation. If the output
directory already exists, Traitar will offer to recompute or resume the
individual analysis steps. This option is only available if the process
is run interactively.

Parallel usage
~~~~~~~~~~~~~~

Traitar can benefit from parallel execution. The ``-c`` parameter sets
the number of processes used, e.g. ``-c 2`` for two processes:

``traitar phenotype <in dir> <sample file> from_nucleotides <out_dir> -c 2``

This requires installing GNU parallel as noted above.
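
For scripted runs, the command line shown above can be assembled in Python. The following is a minimal sketch: the helper `build_traitar_command` and the directory and file names are hypothetical, and only the arguments that appear in this README (`phenotype`, `from_nucleotides`, `from_genes`, `-c`) are used.

```python
import shlex


def build_traitar_command(in_dir, sample_file, mode, out_dir, processes=1):
    """Assemble the argument list for a `traitar phenotype` call."""
    if mode not in ("from_nucleotides", "from_genes"):
        raise ValueError("mode must be from_nucleotides or from_genes")
    command = ["traitar", "phenotype", in_dir, sample_file, mode, out_dir]
    if processes > 1:
        # -c sets the number of processes, as described above.
        command += ["-c", str(processes)]
    return command


# Hypothetical input and output locations.
command = build_traitar_command(
    "genomes", "samples.txt", "from_nucleotides", "traitar_out", processes=2
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The returned list could be handed to `subprocess.run(command, check=True)` on a machine where Traitar is installed.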

Run Traitar with packaged sample data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``traitar phenotype <traitar_dir>/data/sample_data <traitar_dir>/data/sample_data/samples.txt from_genes <out_dir> -c 2``
will trigger phenotyping of *Listeria grayi DSM\_20601* and *Listeria
ivanovii WSLC3009*. The computation should finish within 5 minutes. You can
find ``<traitar_dir>`` by running

::

    python
    >>> import traitar
    >>> traitar.__path__
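
Once the package directory is known, the two sample data paths used in the command above can be derived from it. This sketch assumes the `data/sample_data` layout quoted in this README; the helper name `sample_data_paths` and the install location `/opt/traitar` are hypothetical.

```python
import os


def sample_data_paths(traitar_dir):
    """Return (sample_dir, sample_file) below a Traitar package directory."""
    sample_dir = os.path.join(traitar_dir, "data", "sample_data")
    sample_file = os.path.join(sample_dir, "samples.txt")
    return sample_dir, sample_file


# Hypothetical install location; with Traitar installed,
# traitar.__path__[0] could be passed instead.
sample_dir, sample_file = sample_data_paths("/opt/traitar")
print(sample_dir)
print(sample_file)
```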

Results
=======

Traitar provides the gene prediction results in
``<out_dir>/gene_prediction``, the Pfam annotation in
``<out_dir>/pfam_annotation`` and the phenotype prediction
in ``<out_dir>/phenotype_prediction``.

Heatmaps
~~~~~~~~

The phenotype prediction is summarized in heatmaps: one for the
phyletic pattern classifier in ``heatmap_phypat.png``, one for the
phylogeny-aware classifier in ``heatmap_phypat_ggl.png`` and one for both
classifiers `combined <https://github.com/aweimann/traitar/blob/master/traitar/data/sample_data/traitar_out/phenotype_prediction/heatmap_combined.png?raw=true>`__
in ``heatmap_comb.png``. The heatmaps include hierarchical
clustering dendrograms for the phenotypes and the samples.

Phenotype prediction - Tables and flat files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These heatmaps are based on tab-separated text files, e.g.
``predictions_majority-votes_combined.txt``. A negative prediction is
encoded as 0, a prediction made only by the pure phyletic classifier as
1, one made only by the phylogeny-aware classifier as 2 and a prediction
supported by both algorithms as 3.
``predictions_flat_majority-votes_combined.txt`` provides a flat version
of this table with one prediction per row. The expert user might also
want to access the individual results for each algorithm in the
respective subfolders ``phypat`` and ``phypat+PGL``.
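
The 0 to 3 encoding described above can be decoded with a few lines of Python. This is a sketch under assumptions: the exact table layout is not specified here, so the code assumes one header row of phenotype names followed by one row per sample, with the sample name in the first column. The helper `decode_predictions`, the label strings and the two example values are hypothetical.

```python
import csv
import io

# Meaning of each code, following the description above.
LABELS = {
    0: "negative",
    1: "phyletic pattern classifier only",
    2: "phylogeny-aware classifier only",
    3: "both classifiers",
}


def decode_predictions(handle):
    """Map sample name -> {phenotype: label} from a combined prediction table."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    decoded = {}
    for row in body:
        sample, codes = row[0], row[1:]
        # The header may or may not have a leading cell for the sample column,
        # so align phenotype names with the codes from the right.
        names = header[-len(codes):]
        decoded[sample] = {
            name: LABELS[int(code)] for name, code in zip(names, codes)
        }
    return decoded


# Tiny made-up table with two phenotypes, for illustration only.
example = "\tCatalase\tMotile\nListeria_grayi_DSM_20601\t3\t0\n"
decoded = decode_predictions(io.StringIO(example))
print(decoded)
```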

Feature tracks
~~~~~~~~~~~~~~

If Traitar is run with ``from_nucleotides``, it will generate a link between the
Prodigal gene prediction and the predicted phenotypes in
``phypat/feat_gffs`` and ``phypat+PGL/feat_gffs`` (no example in the
sample data). The user can visualize the gene predictions and the
phenotype-specific Pfam annotation tracks via these GFF files.

Feature tracks with *from\_genes* option (experimental feature)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the *from\_genes* option is set, the user may specify gene GFF files
via an additional column called gene\_gff in the sample file. As gene
IDs are not consistent across gene GFFs from different sources, e.g. img,
RefSeq or Prodigal, the user needs to specify the origin of the gene GFF
file via the -g / --gene\_gff\_type parameter. There is currently no
guarantee that this works. Using samples\_gene\_gff.txt as the
sample file in the above example will generate phenotype-specific Pfam
tracks for the two genomes.

``traitar phenotype . samples_gene_gff.txt from_genes traitar_out -g refseq``
4 changes: 4 additions & 0 deletions setup.py
@@ -12,9 +12,13 @@
    verstr = mo.group(1)
else:
    raise RuntimeError("Unable to find version string in %s." % (VERSIONFILE,))

long_description = open('README.rst', 'r').read()

setup(name='traitar',
version = verstr,
description='traitar - The microbial trait analyzer',
long_description = long_description,
url = 'http://github.com/aweimann/traitar',
author='Aaron Weimann',
author_email='[email protected]',
2 changes: 1 addition & 1 deletion traitar/_version.py
@@ -1 +1 @@
__version__ = "1.0.0"
__version__ = "1.0.1"
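
The setup.py hunk above reads the version string from this file and raises a `RuntimeError` if it cannot find one. A minimal sketch of that pattern follows; the regular expression and the helper name `parse_version` are assumptions, since the hunk only shows the `mo.group(1)` lookup and the error message.

```python
import re

# Assumed pattern: a line of the form __version__ = "x.y.z".
VERSION_PATTERN = r"^__version__ = ['\"]([^'\"]*)['\"]"


def parse_version(text, source="traitar/_version.py"):
    """Extract the version string from the contents of a version file."""
    match = re.search(VERSION_PATTERN, text, re.M)
    if match:
        return match.group(1)
    raise RuntimeError("Unable to find version string in %s." % (source,))


version = parse_version('__version__ = "1.0.1"\n')
print(version)
```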
Binary file modified traitar/data/models/phypat+PGL.tar.gz
Binary file not shown.
Binary file modified traitar/data/models/phypat.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion traitar/data/pt2cat2col.txt
@@ -20,7 +20,7 @@ Growth in 6.5% NaCl Growth 0x00538A 0 83 138
Bile-susceptible Growth 0x00538A 0 83 138
Growth in KCN Growth 0x00538A 0 83 138
Mucate utilization Growth 0x00538A 0 83 138
Growth at 42 degrees C Growth 0x00538A 0 83 138
Growth at 42°C Growth 0x00538A 0 83 138
Colistin-Polymyxin susceptible Growth: Antibiotic 0xA6BDD7 166 189 215
Acetate utilization Carboxylic Acid 0xFF6800 255 104 0
Citrate Carboxylic Acid 0xFF6800 255 104 0
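
Each row of this file appears to carry the same color twice, once as a hex value and once as red, green and blue components. The sketch below checks that reading against the three colors visible in this hunk; the helper name `hex_to_rgb` is hypothetical, and the column interpretation is an assumption based on the values shown.

```python
def hex_to_rgb(hex_color):
    """Split a color such as '0x00538A' into (red, green, blue) integers."""
    value = int(hex_color, 16)  # int() accepts the 0x prefix for base 16
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


# Colors and components as they appear in the rows above.
rows = [
    ("0x00538A", (0, 83, 138)),
    ("0xA6BDD7", (166, 189, 215)),
    ("0xFF6800", (255, 104, 0)),
]
for hex_color, rgb in rows:
    print(hex_color, hex_to_rgb(hex_color), hex_to_rgb(hex_color) == rgb)
```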
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -1,3 +1,3 @@
Salicin Catalase Gelatin hydrolysis Coccus Lysine decarboxylase Motile Coccus - pairs or chains predominate Maltose Growth on ordinary blood agar Colistin-Polymyxin susceptible Melibiose Spore formation Yellow pigment DNase Nitrate to nitrite Gram positive Anaerobe Bile-susceptible Glucose oxidizer Gram negative Ornithine decarboxylase L-Arabinose Casein hydrolysis Gas from glucose Lactose Tartrate utilization Raffinose Cellobiose L-Rhamnose Bacillus or coccobacillus Mucate utilization Indole D-Xylose Starch hydrolysis Growth on MacConkey agar Citrate Urea hydrolysis Glycerol Voges Proskauer Pyrrolidonyl-beta-naphthylamide Lipase D-Mannitol Trehalose Nitrite to gas Arginine dihydrolase Acetate utilization Malonate myo-Inositol Methyl red ONPG (beta galactosidase) D-Mannose Growth in 6.5% NaCl Growth at 42 degrees C Glucose fermenter Aerobe Coccus - clusters or groups predominate Capnophilic Oxidase Alkaline phosphatase Beta hemolysis Growth in KCN Hydrogen sulfide Facultative Esculin hydrolysis Sucrose D-Sorbitol Coagulase production
Salicin Catalase Gelatin hydrolysis Coccus Lysine decarboxylase Motile Coccus - pairs or chains predominate Maltose Growth on ordinary blood agar Colistin-Polymyxin susceptible Melibiose Spore formation Yellow pigment DNase Nitrate to nitrite Gram positive Anaerobe Bile-susceptible Glucose oxidizer Gram negative Ornithine decarboxylase L-Arabinose Casein hydrolysis Gas from glucose Lactose Tartrate utilization Raffinose Cellobiose L-Rhamnose Bacillus or coccobacillus Mucate utilization Indole D-Xylose Starch hydrolysis Growth on MacConkey agar Citrate Urea hydrolysis Glycerol Voges Proskauer Pyrrolidonyl-beta-naphthylamide Lipase D-Mannitol Trehalose Nitrite to gas Arginine dihydrolase Acetate utilization Malonate myo-Inositol Methyl red ONPG (beta galactosidase) D-Mannose Growth in 6.5% NaCl Growth at 42°C Glucose fermenter Aerobe Coccus - clusters or groups predominate Capnophilic Oxidase Alkaline phosphatase Beta hemolysis Growth in KCN Hydrogen sulfide Facultative Esculin hydrolysis Sucrose D-Sorbitol Coagulase production
Listeria_grayi_DSM_20601 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1
Listeria_ivanovii_WSLC3009 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1
@@ -1,3 +1,3 @@
Salicin Catalase Gelatin hydrolysis Coccus Lysine decarboxylase Motile Coccus - pairs or chains predominate Maltose Growth on ordinary blood agar Colistin-Polymyxin susceptible Melibiose Spore formation Yellow pigment DNase Nitrate to nitrite Gram positive Anaerobe Bile-susceptible Glucose oxidizer Gram negative Ornithine decarboxylase L-Arabinose Casein hydrolysis Gas from glucose Lactose Tartrate utilization Raffinose Cellobiose L-Rhamnose Bacillus or coccobacillus Mucate utilization Indole D-Xylose Starch hydrolysis Growth on MacConkey agar Citrate Urea hydrolysis Glycerol Voges Proskauer Pyrrolidonyl-beta-naphthylamide Lipase D-Mannitol Trehalose Nitrite to gas Arginine dihydrolase Acetate utilization Malonate myo-Inositol Methyl red ONPG (beta galactosidase) D-Mannose Growth in 6.5% NaCl Growth at 42 degrees C Glucose fermenter Aerobe Coccus - clusters or groups predominate Capnophilic Oxidase Alkaline phosphatase Beta hemolysis Growth in KCN Hydrogen sulfide Facultative Esculin hydrolysis Sucrose D-Sorbitol Coagulase production
Salicin Catalase Gelatin hydrolysis Coccus Lysine decarboxylase Motile Coccus - pairs or chains predominate Maltose Growth on ordinary blood agar Colistin-Polymyxin susceptible Melibiose Spore formation Yellow pigment DNase Nitrate to nitrite Gram positive Anaerobe Bile-susceptible Glucose oxidizer Gram negative Ornithine decarboxylase L-Arabinose Casein hydrolysis Gas from glucose Lactose Tartrate utilization Raffinose Cellobiose L-Rhamnose Bacillus or coccobacillus Mucate utilization Indole D-Xylose Starch hydrolysis Growth on MacConkey agar Citrate Urea hydrolysis Glycerol Voges Proskauer Pyrrolidonyl-beta-naphthylamide Lipase D-Mannitol Trehalose Nitrite to gas Arginine dihydrolase Acetate utilization Malonate myo-Inositol Methyl red ONPG (beta galactosidase) D-Mannose Growth in 6.5% NaCl Growth at 42°C Glucose fermenter Aerobe Coccus - clusters or groups predominate Capnophilic Oxidase Alkaline phosphatase Beta hemolysis Growth in KCN Hydrogen sulfide Facultative Esculin hydrolysis Sucrose D-Sorbitol Coagulase production
Listeria_grayi_DSM_20601 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1 0 1
Listeria_ivanovii_WSLC3009 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 1 1 0 1 1
@@ -1,3 +1,3 @@
Salicin Catalase Gelatin hydrolysis Coccus Lysine decarboxylase Motile Coccus - pairs or chains predominate Maltose Growth on ordinary blood agar Colistin-Polymyxin susceptible Melibiose Spore formation Yellow pigment DNase Nitrate to nitrite Gram positive Anaerobe Bile-susceptible Glucose oxidizer Gram negative Ornithine decarboxylase L-Arabinose Casein hydrolysis Gas from glucose Lactose Tartrate utilization Raffinose Cellobiose L-Rhamnose Bacillus or coccobacillus Mucate utilization Indole D-Xylose Starch hydrolysis Growth on MacConkey agar Citrate Urea hydrolysis Glycerol Voges Proskauer Pyrrolidonyl-beta-naphthylamide Lipase D-Mannitol Trehalose Nitrite to gas Arginine dihydrolase Acetate utilization Malonate myo-Inositol Methyl red ONPG (beta galactosidase) D-Mannose Growth in 6.5% NaCl Growth at 42 degrees C Glucose fermenter Aerobe Coccus - clusters or groups predominate Capnophilic Oxidase Alkaline phosphatase Beta hemolysis Growth in KCN Hydrogen sulfide Facultative Esculin hydrolysis Sucrose D-Sorbitol Coagulase production
Salicin Catalase Gelatin hydrolysis Coccus Lysine decarboxylase Motile Coccus - pairs or chains predominate Maltose Growth on ordinary blood agar Colistin-Polymyxin susceptible Melibiose Spore formation Yellow pigment DNase Nitrate to nitrite Gram positive Anaerobe Bile-susceptible Glucose oxidizer Gram negative Ornithine decarboxylase L-Arabinose Casein hydrolysis Gas from glucose Lactose Tartrate utilization Raffinose Cellobiose L-Rhamnose Bacillus or coccobacillus Mucate utilization Indole D-Xylose Starch hydrolysis Growth on MacConkey agar Citrate Urea hydrolysis Glycerol Voges Proskauer Pyrrolidonyl-beta-naphthylamide Lipase D-Mannitol Trehalose Nitrite to gas Arginine dihydrolase Acetate utilization Malonate myo-Inositol Methyl red ONPG (beta galactosidase) D-Mannose Growth in 6.5% NaCl Growth at 42°C Glucose fermenter Aerobe Coccus - clusters or groups predominate Capnophilic Oxidase Alkaline phosphatase Beta hemolysis Growth in KCN Hydrogen sulfide Facultative Esculin hydrolysis Sucrose D-Sorbitol Coagulase production
Listeria_grayi_DSM_20601 0.525 1.170 1.137 1.234 1.454 1.002 0.268 0.440 0.780 0.177 0.178 0.751 0.453 0.275 0.969 0.383 0.971 0.290 1.389 0.792 1.306 1.827 0.060 0.723
Listeria_ivanovii_WSLC3009 0.887 1.218 1.181 0.908 1.803 1.054 0.810 2.162 0.074 0.593 0.142 1.042 0.980 0.693 0.843 0.213 1.139 1.714 0.792 0.987 1.941 2.021 0.085 0.820