It is common for data resources, including ELIXIR Core Data Resources, to connect with scientific literature as the gateway to scientific evidence, in support of the curation effort. The granularity of annotation is often at the level of the article, identified by a PMID or DOI, meaning that a paper is cross-referenced from a database entry or from some sub-annotations of the entry.
This cross-reference is often described via terminological or ontological descriptors (e.g., molecular functions, cellular locations, diseases, traits) within the entry. Finer-grained annotation is available only in rare cases, e.g., the sentence-level annotations in GeneRIF.
In complement, literature services (e.g., JATS), text mining communities (e.g., BioC), and specific biological communities (e.g., TaxPub for taxonomic treatments) have developed standards that attempt to capture annotations either directly in the published text or as a supplement to it. While simple annotations such as accession numbers are trivially captured by literature services (e.g., Europe PMC, SIBiLS), more structured evidence (e.g., named entities or relationships between entities) remains challenging for both curation-support and text mining pipelines. Further, non-textual publication materials (e.g., supplementary data files) have been used less by both the curation and publication communities because exploration tools and standards to process these files are lacking.
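To illustrate the gap between article-level cross-references and sentence-level evidence, the following is a minimal sketch of how accession numbers might be captured with sentence granularity. It is only an illustration under stated assumptions: the `SentenceAnnotation` record, the `annotate` function, the naive sentence splitter, and the placeholder article identifier are all hypothetical and do not correspond to any JATS, BioC, or literature-service data model. The regular expression is the accession pattern documented by UniProt; other resources would need their own patterns.

```python
import re
from dataclasses import dataclass
from typing import List

# Accession pattern as documented by UniProt, wrapped in word boundaries.
UNIPROT_ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


@dataclass(frozen=True)
class SentenceAnnotation:
    """Hypothetical record: one accession found in one sentence of one article."""
    article_id: str      # article-level identifier, e.g. a PMID or DOI
    sentence_index: int  # position of the sentence within the text
    start: int           # character offset of the match within the sentence
    end: int
    accession: str


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    # A real pipeline would use a proper sentence segmenter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def annotate(article_id: str, text: str) -> List[SentenceAnnotation]:
    """Return one annotation per accession match, keyed by sentence."""
    annotations = []
    for index, sentence in enumerate(split_sentences(text)):
        for match in UNIPROT_ACCESSION.finditer(sentence):
            annotations.append(
                SentenceAnnotation(
                    article_id=article_id,
                    sentence_index=index,
                    start=match.start(),
                    end=match.end(),
                    accession=match.group(),
                )
            )
    return annotations


if __name__ == "__main__":
    # Placeholder identifier and invented example text, for illustration only.
    example = (
        "We studied the kinase (UniProt P12345). "
        "No protein is named here. "
        "Its partner Q9Y6K9 binds it."
    )
    for annotation in annotate("PMID:0000000", example):
        print(annotation)
```

Pattern matching of this kind covers the accession-number case. Named entities and relationships between entities cannot be captured this way, which is where the standardization effort described below comes in.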
Project plan
- O1: to establish a landscape analysis of data and services resources;
- O2: to enhance existing standards (e.g., JATS) to better capture curated evidence from the literature;
- O3: to explore how literature and crediting services (e.g., APICURON) can benefit from these new standardization efforts.
O1 should be delivered by the end of the Hackathon. O2 should be mostly (~80%) completed by the end of the Hackathon. O3 will benefit from the Hackathon’s prototyping effort and could be completed within a year or in a future hackathon.
Timeline: A possible non-linear timeline is the following: landscape analysis [2 days], standards development [2 days], prototyping [2 days].
Level of expertise and population: We expect balanced contributions from two types of profiles: biocurators (N=3-5) and data/software developers (N=3-5).
Methods: We plan to alternate between focus group meetings and rapid application development (RAD) phases, all coordinated by senior DevOps engineers (Mihail Anton/SciLifeLab and Alexandre Flament/SIB).
Silvio Tosatto, Ulrike Wittig, Mihail Anton