Increasing the discoverability and usability of data originating from multiple domains of biodiversity research remains a major challenge in biodiversity informatics. Enabling connections between datasets or records across domains, such as sequence data and specimen or sample records, is essential to facilitate access, provenance tracking and reproducibility in biodiversity research and conservation.
Sequence records held in the databases of the International Nucleotide Sequence Database Collaboration (INSDC) carry metadata that refer to their biological source (specimen or sample). These metadata are recorded in the source feature qualifiers of the sequence records or in the sample attributes. However, they are not always well structured, which hampers persistent and machine-readable linking to specimen and sample records in natural history collections and biobanks.
In this project we aim to develop tools to facilitate linking between sequence data and the associated specimen and/or sample records in public collections. To do so, we will build on tools previously developed by team members to derive mappings between sequence data and specimen or sample information, based on the datasets available at the Global Genome Biodiversity Network (GGBN) data portal. These mappings will allow the generation of machine-readable links between data types and the enrichment of the sequence records at INSDC through the submission of improved annotations to the ELIXIR Contextual Data ClearingHouse.
We expect that the tools and workflows developed within this project can then be applied to other datasets and therefore may further contribute to improved linking between sequence data and specimen information, promoting the reusability of data for biodiversity research.
All the data we will work with is openly licensed and accessible through web-based HTTP APIs and downloads. We plan to do some preliminary research prior to the hackathon to ensure we can start the core work of the project immediately.
This preliminary work will involve identifying the GGBN records that can be linked most easily. We can then work outward towards more difficult cases, such as records that carry less information or whose identifiers are harder to disambiguate. All the mappings will then be used to link records and improve the current source metadata on the sequence records (e.g. by linking to DiSSCo identifiers).
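The easy-to-hard progression described above could be sketched as a tiered matching step. The Python below is only an illustration under stated assumptions, not the team's existing tooling: it assumes that voucher strings follow a colon-separated "institution:collection:specimen_id" pattern, and that collection records are plain dictionaries with the hypothetical keys "institution", "collection" and "specimen_id". Real GGBN and INSDC records are structured differently and would need their own parsing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Voucher:
    """Hypothetical parsed form of a voucher string from a sequence record."""
    institution: Optional[str]
    collection: Optional[str]
    specimen_id: str


def parse_voucher(raw: str) -> Voucher:
    """Split '[institution:[collection:]]specimen_id' (assumed format)."""
    parts = [p.strip() for p in raw.split(":")]
    if len(parts) >= 3:
        return Voucher(parts[0], parts[1], ":".join(parts[2:]))
    if len(parts) == 2:
        return Voucher(parts[0], None, parts[1])
    return Voucher(None, None, parts[0])


def normalise(text: Optional[str]) -> str:
    """Upper-case and drop everything except letters and digits."""
    return "".join(ch for ch in (text or "").upper() if ch.isalnum())


def link(voucher: Voucher, records: list[dict]) -> tuple[str, list[dict]]:
    """Return (tier, candidates), trying the strictest match first."""
    vid = normalise(voucher.specimen_id)
    same_id = [r for r in records if normalise(r["specimen_id"]) == vid]
    if voucher.institution:
        inst = normalise(voucher.institution)
        same_inst = [r for r in same_id if normalise(r["institution"]) == inst]
        if voucher.collection:
            coll = normalise(voucher.collection)
            exact = [r for r in same_inst if normalise(r["collection"]) == coll]
            if len(exact) == 1:
                return "exact", exact
        if len(same_inst) == 1:
            return "institution+id", same_inst
    if len(same_id) == 1:
        return "id-only", same_id
    # Several candidates need disambiguation; none means no link yet.
    return ("ambiguous" if same_id else "unmatched"), same_id


# Made-up example records, for illustration only.
records = [
    {"institution": "B", "collection": "DNA", "specimen_id": "B 10 0012345"},
    {"institution": "NHMUK", "collection": "Ent", "specimen_id": "12345"},
    {"institution": "ZMUC", "collection": "Aves", "specimen_id": "12345"},
]
tier, hits = link(parse_voucher("NHMUK:12345"), records)
```

Returning the tier together with the candidates would let confident links be submitted as annotations first, while ambiguous and unmatched cases are set aside for the harder disambiguation work.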
We would like to engage people with a broad range of skills and experience to work on this proposal, including bioinformaticians, molecular biologists with knowledge of the molecular databases, ecologists and taxonomic experts.
A team of 6-10 people would be sufficient to work on this project.
Joana Pauperio, Sam Leeflang, Quentin Groom