title | layout | show_sidebar | menubar |
---|---|---|---|
CobiontID overview | page | false | docs_menu |
Disentangling sequences from different sources can be both interesting and challenging. It can reveal interactions between organisms and their genomes through time (considering both extant associations and molecular fossils). In some cases, the aim may simply be to remove contamination that has found its way into a sample. However, determining where a sequence came from is not always straightforward, especially when exploring less well-sampled parts of the tree of life, where few close relatives have been sequenced.
On these pages, you will find some examples showing how we are tackling this issue in the Tree of Life programme. Our approach, which is tailored for HiFi data, combines grouping sequences from the same source with finding reliable taxonomic hints. This allows us to reduce reliance on databases, which can be incomplete and contain mislabelled sequences.
The CobiontID process has two parts. First, MarkerScan provides taxonomic information. HMM profiles of marker genes such as rRNAs, which are well sampled and conserved, can classify sequences from genomes that are otherwise too diverged from their closest sequenced relative. This lets us gauge which species are present in a given sample and construct streamlined databases for read classification. Second, a combination of assembly, read mapping and compositional clustering assigns the sequences to groups, which can then be tagged with this taxonomic information.
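
To make the idea of compositional grouping more concrete, here is a minimal Python sketch. It is not the CobiontID implementation: the function names, the toy reads and the use of cosine similarity are all assumptions chosen for illustration. It turns each sequence into a normalised tetranucleotide frequency vector and compares those vectors, which is the kind of signal a compositional clustering step can work from.

```python
from collections import Counter
from itertools import product
from math import sqrt

K = 4
# All 256 possible tetranucleotides in a fixed order, so every profile
# has the same layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def tetranucleotide_profile(seq: str) -> list[float]:
    """Return the normalised 4-mer frequencies of seq (hypothetical helper).

    k-mers containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two profiles; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sequences (invented for this example, far shorter than real HiFi reads).
reads = {
    "at_rich_1": "ATATTTAAATATTTTAATATAAATTTATATAT",
    "at_rich_2": "TTTATATAAATATTATAAATTTTATATTAAAT",
    "gc_rich_1": "GCGGCCGCGCCGGGCGCGCCGCGGGCCGCGCG",
}
profiles = {name: tetranucleotide_profile(s) for name, s in reads.items()}

same_source = cosine_similarity(profiles["at_rich_1"], profiles["at_rich_2"])
different_source = cosine_similarity(profiles["at_rich_1"], profiles["gc_rich_1"])
print(f"AT-rich vs AT-rich: {same_source:.3f}")
print(f"AT-rich vs GC-rich: {different_source:.3f}")
```

In this toy case the two AT-rich sequences share tetranucleotides and the GC-rich one shares none with them, so the first similarity is positive and the second is zero. Real data are much less clear-cut, which is why composition is combined with assembly, read mapping and taxonomic markers rather than used on its own.
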
See here for an illustration of the outputs the tools presented here provide, and how to interpret them. If you have ever looked at the "Cobionts" section of a page on Tree of Life QC and wondered how to read the tables and plots, your questions will hopefully be answered here (a list of pages with examples can be found here).
Tool | Description | Application | Language |
---|---|---|---|
kmer-counter | Fast k-mer counter for large read sets | Get tetranucleotide counts | Rust |
unique-kmers | Count distinct k-mers in sequences | Calculate k-mer diversity | Rust |
hexamer | Detect likely coding regions | Estimate coding density | C |
fastk-medians | Calculate median number of times each large k-mer in a sequence occurs across the set (modified version of Profex from the original FASTK library) | Approximate k-mer coverage | C |
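
The quantities named in the table above can be illustrated with a short Python sketch. This is only an approximation of the concepts, not the code of the Rust and C tools: the helper names, the value of `k` and the toy reads are assumptions, and unlike typical k-mer counters the sketch does not merge a k-mer with its reverse complement.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping k-mers of seq, in order (hypothetical helper)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def distinct_kmer_fraction(seq: str, k: int) -> float:
    """Distinct k-mers divided by total k-mers: a simple diversity measure."""
    ks = kmers(seq, k)
    return len(set(ks)) / len(ks) if ks else 0.0


def median_kmer_occurrence(seq: str, table: Counter, k: int) -> float:
    """Median count across the whole read set of the k-mers found in seq."""
    ks = kmers(seq, k)
    return median(table[x] for x in ks) if ks else 0.0


K = 5  # assumed small value for the toy data; real analyses use larger k

# Toy read set: three copies of one read and a single unrelated read.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "TTGACCAGTA"]

# Count every k-mer across the full read set.
table = Counter(km for read in reads for km in kmers(read, K))

for read in ("ACGTACGTAC", "TTGACCAGTA"):
    print(
        read,
        f"diversity={distinct_kmer_fraction(read, K):.2f}",
        f"median_occurrence={median_kmer_occurrence(read, table, K)}",
    )
```

The repeated read has a high median occurrence and a lower distinct k-mer fraction because it is internally repetitive, while the unrelated read is seen only once. A median is used here because it is less sensitive than a mean to a few very common or very rare k-mers.
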
Workflow | Description |
---|---|
MarkerScan | Determine taxonomic composition of an assembly; separate and assemble individual components |
read VAE | Generate annotated 2D visualisations for long reads; interactively explore and select data for downstream analyses |
- Slides from talk on CobiontID at PopGroup55 (2022)
- Flash presentation accompanying PopGroup talk
- Disentangling Cobionts and Contamination in Long-Read Genomic Data using Sequence Composition https://www.biorxiv.org/content/10.1101/2024.05.30.596622v1
- Phylogenomic analysis of Wolbachia genomes from the Darwin Tree of Life biodiversity genomics project https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001972
- MarkerScan: Separation and assembly of cobionts sequenced alongside target species in biodiversity genomics projects https://doi.org/10.12688/wellcomeopenres.20730.1