
New options for dealing with tax information

@iquasere iquasere released this 29 May 09:17

The original KEGGCharter workflow attempts to download taxon-specific KGMLs for organisms in KEGG Genomes (Fig. 1).

Fig. 1 - Original KEGGCharter workflow. Only arcticus had KOs attributed to TCA cycle functions that were also present in the arcticus-specific KGML for the TCA cycle.

This type of workflow uses both taxon-specific information and results from the input datasets. All functions represented are validated by KEGG (i.e., those functions are available for those organisms), but many identifications may be missing, since information at KEGG is often incomplete.

Setting "--include-missing-genomes" represents organisms that are not in KEGG Genomes

Organisms that are not found in KEGG Genomes can still be represented by running KEGGCharter with the option --include-missing-genomes. All functions for the KOs identified for such an organism will be represented (Fig. 2).

Fig. 2 - KEGGCharter output expanded with the --include-missing-genomes parameter. hydrocola is not present in KEGG Genomes, but all functions attributed to its KOs are still represented.

This setting still provides validated information for the taxa that are present in KEGG Genomes, while also representing organisms that are not present in KEGG Genomes. It should offer the best compromise between false positives and false negatives.
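The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not KEGGCharter's actual code: the function name `kos_to_represent`, its arguments, and the KO identifiers used as toy data are all assumptions made for the example.

```python
from typing import Dict, Set


def kos_to_represent(
    identified: Dict[str, Set[str]],
    kgml_kos: Dict[str, Set[str]],
    include_missing_genomes: bool = False,
) -> Dict[str, Set[str]]:
    """Hypothetical helper: choose which KOs to draw for each taxon.

    identified: taxon -> KOs found in the input dataset.
    kgml_kos:   taxon -> KOs listed in that taxon's KGML; a taxon that is
                absent from this mapping stands for "not in KEGG Genomes".
    """
    selected = {}
    for taxon, kos in identified.items():
        if taxon in kgml_kos:
            # Taxon has a KGML: keep only KOs that the KGML also contains.
            selected[taxon] = kos & kgml_kos[taxon]
        elif include_missing_genomes:
            # No KGML available: represent every KO identified for the taxon.
            selected[taxon] = set(kos)
        else:
            # Default behaviour: taxa without a KGML are not represented.
            selected[taxon] = set()
    return selected


# Toy data (illustrative identifiers only).
identified = {
    "arcticus": {"K01647", "K00031"},
    "hydrocola": {"K01647"},
}
kgml_kos = {"arcticus": {"K01647", "K01902"}}  # hydrocola has no KGML

default_result = kos_to_represent(identified, kgml_kos)
expanded_result = kos_to_represent(
    identified, kgml_kos, include_missing_genomes=True
)
```

With the toy data, arcticus keeps only the KO shared with its KGML in both runs, while hydrocola is empty by default and fully represented once the flag is set.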

Setting "--map-all" ignores KEGG Genomes completely, and represents all functions identified

Functions that are not present in organism-specific KGMLs can still be represented by running KEGGCharter with the option --map-all. This bypasses all taxon-specific KGMLs and maps all functions for all KOs present in the input dataset (Fig. 3).

Fig. 3 - KEGGCharter output expanded with the --map-all parameter. For oleophylus and franklandus, no function was both identified among the KOs and available in their KGMLs. In this case, the requirement for presence in the KGMLs is bypassed, and all functions are represented for all taxa.

This setting represents the most information on the KEGG maps and will produce the most colourful representations, but will likely return many false positives. Maps produced should be analyzed with caution. This setting may be required, however, if information for organisms in KEGG Genomes is very incomplete.
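One way to decide which parts of a --map-all output deserve extra scrutiny is to list, per taxon, the KOs that would not have been drawn under the KGML requirement. The sketch below assumes the identified KOs and the KGML contents are already available as Python sets; the helper name `unvalidated_kos` and the toy identifiers are hypothetical and not part of KEGGCharter.

```python
from typing import Dict, Set


def unvalidated_kos(
    identified: Dict[str, Set[str]],
    kgml_kos: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Hypothetical helper: KOs drawn when the KGML check is bypassed
    that are not backed by the taxon-specific KGML."""
    extras = {}
    for taxon, kos in identified.items():
        # KOs confirmed by the taxon's KGML (empty set if there is no KGML).
        validated = kos & kgml_kos.get(taxon, set())
        # Everything else appears on the map only because the check is skipped.
        extras[taxon] = kos - validated
    return extras


# Toy data (illustrative identifiers only).
identified = {
    "arcticus": {"K01647", "K00031"},
    "oleophylus": {"K00031", "K01902"},
}
kgml_kos = {
    "arcticus": {"K01647"},
    "oleophylus": {"K01647"},
}

to_review = unvalidated_kos(identified, kgml_kos)
```

In this toy case, one arcticus KO and both oleophylus KOs end up in `to_review`, which marks them as candidates for possible false positives.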