Releases: EBIvariation/CMAT
v2.2.1: Minor updates for the 21.06 submission
- Update ClinVar investigations report (#243)
- Bump OT schema version to 2.0.9
This was the code version used to process the 21.06 submission.
v2.2.0: Data and operational updates for batch 21.06; Major test suite revamp
Data updates
- More strict duplication checks for evidence strings and corresponding documentation updates (#229 by @M-casado)
- Flatten cohortPhenotypes representation to include all names instead of only the primary ones (#238 by @apriltuesday)
- Report all evidence regardless of ontology mapping status (#239 by @apriltuesday)
- Remove nonspecific allele origin values from the evidence strings. Evidence strings are reported even if they have no allele origin values; such records are assumed to be germline by default (#240 by @apriltuesday)
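The allele origin change (#240) can be illustrated with a minimal sketch. The set of values treated as nonspecific, the function names, and the point at which the germline default is applied are all assumptions for illustration; these notes do not state how the pipeline implements them.

```python
# Hypothetical list of nonspecific allele origin values (an assumption,
# not taken from the pipeline).
NONSPECIFIC_ORIGINS = {
    "unknown", "not provided", "not applicable",
    "not-reported", "not-tested", "tested-inconclusive",
}
DEFAULT_ORIGIN = "germline"


def clean_allele_origins(origins):
    """Return the sorted, de-duplicated allele origins with nonspecific values removed."""
    specific = {o.strip().lower() for o in origins if o and o.strip()}
    return sorted(specific - NONSPECIFIC_ORIGINS)


def effective_allele_origins(origins):
    """Return the cleaned origins, falling back to germline when none remain."""
    return clean_allele_origins(origins) or [DEFAULT_ORIGIN]
```

With this sketch, a record whose only origin is "not provided" keeps an empty origin list after cleaning and is still reported, while a consumer calling `effective_allele_origins` would treat it as germline.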
Operational updates and bug fixes
- Strip white spaces from ontology identifiers (#216 by @afoix)
- Update FTP upload documentation (#219 by @afoix)
- Fixes for new OMIM identifier format in ClinVar (#230 by @apriltuesday)
- Bump Open Targets schema to v2.0.8 (#241 by @apriltuesday, #242 by @tskir)
Test suite revamp
- Update VEP pipeline tests to use XML input (#217 by @apriltuesday)
- Update tests to support VEP 104 (#225 by @apriltuesday)
- Migrate the tests to GitHub actions (#227 by @afoix)
- Improve and unify testing (#228 by @apriltuesday)
- Fix VEP tests and GitHub actions (#234 by @apriltuesday)
v2.1.0: Updated ClinVar model investigations; Quality control system revamp
This release leaves the actual evidence strings unchanged compared to v2.0.2, but introduces other important changes:
- #208 Significant updates to the ClinVar data model investigation scripts & the resulting report
- #212, #214 Major refactor of the quality control system and the associated spreadsheet
- #213 Update the workflow diagram to reflect changes in v2.0.0...v2.0.2 of the pipeline.
The reason for the minor version change is that the quality control metrics are now more precise and not always comparable to the metrics generated previously.
v2.0.2: Corrections and updates for the Open Targets batch 2021.04
Resolves several issues that caused loss of evidence strings in v2.0.0 and v2.0.1 compared to v1.3.2:
- All ClinVar traits are now processed instead of only “Disease” type traits.
- All names of a ClinVar trait are now used for looking up the corresponding ontology term. Previously, only the preferred name was used, which does not always correspond to the one in the string-to-ontology mapping database.
- Reintroduced processing of mitochondrial variants and variants containing IUPAC ambiguity bases. They were previously skipped due to not being supported by the Open Targets schema.
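For readers unfamiliar with IUPAC ambiguity bases, the following sketch shows one way such alleles could be recognised. The function name is hypothetical and the check is not taken from the pipeline; only the IUPAC nucleotide codes themselves are standard.

```python
# Standard IUPAC nucleotide ambiguity codes (everything other than A, C, G, T).
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHVN")


def has_ambiguity_bases(sequence):
    """Return True if the allele sequence contains any IUPAC ambiguity code."""
    return any(base in IUPAC_AMBIGUITY_CODES for base in sequence.upper())
```

For example, `has_ambiguity_bases("ACRT")` is true because `R` stands for either `A` or `G`, whereas `has_ambiguity_bases("ACGT")` is false.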
Technical changes:
- Updated the Open Targets schema version: 2.0.5 → 2.0.6.
- Updated test data and assertions to fix some inconsistencies which weren't spotted in the v2.0.0 and v2.0.1 releases.
v2.0.1: Evidence string duplication, literature references, and ontology mapping adjustments
Version 2.0.1 addresses three groups of issues.
- Evidence string duplication
- Processing of PubMed references
- Handling string to ontology mappings
  - Verified that the preferred trait names are used consistently across the pipeline (#177).
  - Verified that multiple string-to-ontology mappings are consistently supported across the pipeline, fixed a minor bug and amended documentation (#115).
  - Prevented non-specific terms like “disease” from reappearing in the manual curation results (#179).
  - Fixed a bug in construction of MONDO IRIs from ClinVar data (#175).
See also PR #202.
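These notes do not say what the MONDO IRI bug in #175 was. As background only, here is a minimal sketch of building a MONDO IRI from an identifier, assuming the standard OBO PURL form and tolerating a few plausible input spellings. The function name and the accepted input formats are assumptions.

```python
import re

OBO_PURL_PREFIX = "http://purl.obolibrary.org/obo/"

# Accepts "MONDO:0009061", "MONDO_0009061" or a bare "0009061" (assumed inputs).
_MONDO_ID_PATTERN = re.compile(r"(?:MONDO[:_])?(\d+)")


def mondo_iri(identifier):
    """Build an OBO PURL for a MONDO identifier, ignoring surrounding whitespace."""
    match = _MONDO_ID_PATTERN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"Not a MONDO identifier: {identifier!r}")
    return f"{OBO_PURL_PREFIX}MONDO_{match.group(1)}"
```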
v2.0.0: Major refactor of ClinVar input, repeat expansion pipeline, and JSON schema
ClinVar input rewrite
All components of the pipeline now use the comprehensive XML data dump from ClinVar as input. The use of VCF and TSV summary files has been discontinued. This should make the results more consistent and comprehensive.
This is made possible by the new clinvar_xml_utils module, which provides a Python interface to work with ClinVar data. External users with similar goals are welcome to also try it out.
Repeat expansion pipeline refactor
Under the new approach, the following Microsatellite records are considered repeat expansion events:
- Variants with explicit allele sequences which represent insertions of 12 bases or more;
- Variants without explicit allele sequences, the HGVS-like notation of which does not represent a deletion.
The old approach was essentially confined to the second category. As a result, the number of repeat expansion consequences processed is now larger by approximately a factor of 6.
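The two categories above can be sketched as a single predicate. This is an illustration under stated assumptions, not the pipeline's implementation: the function name and arguments are hypothetical, and the deletion check is a deliberately crude substring test on the HGVS-like notation.

```python
MIN_INSERTION_LENGTH = 12  # threshold stated in these release notes


def is_repeat_expansion(ref=None, alt=None, hgvs_like=None):
    """Decide whether a Microsatellite record counts as a repeat expansion.

    ref/alt: explicit allele sequences, if the record has them.
    hgvs_like: HGVS-like notation, used when explicit sequences are absent.
    """
    if ref is not None and alt is not None:
        # First category: explicit sequences, insertion of 12 bases or more.
        return len(alt) - len(ref) >= MIN_INSERTION_LENGTH
    if hgvs_like:
        # Second category: no explicit sequences; anything that does not
        # look like a deletion (crude check, an assumption).
        return "del" not in hgvs_like.lower()
    return False
```

A real implementation would need a proper parser for the notation, since a substring test also rejects names that merely contain "del" elsewhere.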
JSON schema migration
The pipeline output was migrated to accommodate the new major version of the Open Targets JSON schema, 2.0.5 (up from 1.7.5), described and discussed in detail in #189.
Other changes
- Substantial refactoring and documentation updates under the hood.
- A copy of the JSON schema is no longer stored in the repository; it is fetched on the fly instead.
- Manual curation protocol now includes a “Notes” column, which stores the “NT expansion” annotation without replacing the trait frequency.
- Removed a number of unused modules, including the old ClinVar XML parser written in Java.
v1.3.2: Minor updates for the 2021.02 batch
- Migrated to Open Targets schema version 1.7.5.
v1.3.1: Minor updates for the 2020.11 batch
- Evidence string related changes
  - Migrated to Open Targets schema version 1.7.3.
  - Minor updates to the evidence string generation review checklist.
  - Evidence string name format changed from DD-MM-YYYY to YYYY-MM-DD.
- Other changes
  - Minor fixes to the manual curation protocol to ensure stable sort order.
  - ClinVar data examination script now calculates distributions of allele origins as well.
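The notes do not give the reason for the date format change in evidence string names. One practical effect of YYYY-MM-DD is that plain lexicographic sorting matches chronological order. A minimal sketch of converting an old-style date, with a hypothetical helper name:

```python
from datetime import datetime


def to_iso_date(ddmmyyyy):
    """Convert a DD-MM-YYYY date string to YYYY-MM-DD."""
    return datetime.strptime(ddmmyyyy, "%d-%m-%Y").strftime("%Y-%m-%d")


old_style = ["01-02-2021", "05-11-2020"]
# Sorting the converted strings as text now orders them by date.
new_style_sorted = sorted(to_iso_date(d) for d in old_style)
```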
The latest ClinVar version with which this pipeline will work is 2020/08. After that, the variant_summary.tsv format changed and no longer includes a “NT expansion” category.
v1.3: Process additional ClinVar attributes
These changes introduce additional ClinVar attributes into the evidence strings, in preparation for implementing a better and more comprehensive scoring mechanism. All changes affect both genetic_association and somatic_mutation evidence strings.
- #146 Report records with all clinical significance levels
  - Removed filtering by clinical significance throughout the pipeline.
  - Format and process the clinical significance levels according to the new schema, allowing multiple values per record.
  - Removed the obsolete target.activity attribute.
  - Always set the evidence.gene2variant.is_associated and evidence.variant2disease.is_associated fields to True.
- #148 Add ClinVar star rating and review status
  - Add star rating, which ranges from 0 to 4.
  - Add review status, e.g. “criteria provided, conflicting interpretations”.
- #149 Add mode of inheritance
  - Reported as strings verbatim from ClinVar and not additionally processed.
  - This field will contain an array, even when there is only one mode of inheritance (which is true for the majority of all records), for consistency between all records.
- #150 Add last evaluated date
  - This field tracks the timestamp of the most recent clinically meaningful update of the record: essentially, the latest (re)evaluation of the clinical significance level.
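The star rating and review status added in #148 are closely related: ClinVar derives the number of stars from the review status. The sketch below follows the mapping in ClinVar's public documentation at around the time of this release. It is not taken from the pipeline, the function name is hypothetical, and ClinVar's wording of the statuses may have changed since.

```python
# Review status -> star rating, per ClinVar's public documentation
# (assumed here; not copied from the pipeline).
REVIEW_STATUS_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting interpretations": 1,
    "criteria provided, single submitter": 1,
    "no assertion for the individual variant": 0,
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
}


def star_rating(review_status):
    """Return the 0 to 4 star rating for a ClinVar review status string."""
    key = review_status.strip().lower()
    if key not in REVIEW_STATUS_STARS:
        raise ValueError(f"Unknown review status: {review_status!r}")
    return REVIEW_STATUS_STARS[key]
```

Raising on an unknown status, rather than silently returning 0, is a design choice in this sketch so that new ClinVar wording is noticed instead of being scored as unreviewed.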
v1.2: Technical improvements and bug fixes
- #138 Refactor approach for submitting and reusing ZOOMA feedback
  - The trait-to-ontology mappings from previous iterations of manual curation are now reused directly, rather than relying on files for ZOOMA feedback. The feedback files themselves are now generated at more appropriate stages of the pipeline.
  - This solves a number of issues which occur when two iterations of manual curation happen back to back without evidence string generation in between.
- #140 Use virtualenv, reorganise dependencies and pin their versions
  - For more consistent dependency management, the pipeline now uses virtualenv for all purposes.
  - The list of dependencies was reorganised and their versions were pinned.
  - Fixed problems caused by the release of Pandas 1.1.0, which introduced multiple regressions, by downgrading to Pandas 1.0.5.
- #141 Changes for batch 2020.09. Includes an update from JSON schema 1.6.7 to 1.7.1 (only test files and version updates, no actual evidence string format changes necessary) and minor documentation fixes.