Exon boundary / intron validation - need for genome build specific validation? #700
Labels: data provider schema change, enhancement, keep alive
Biocommons HGVS currently performs no validation of intronic coordinates: it does not check that the exon boundary is valid, or that the position actually falls inside an intron.
The trouble is that validation requires the transcript's strandedness, which HGVS does not have access to until it knows the genome build. This means you can't do the validation in the obvious place (ExtrinsicValidator, probably on the "var_n", ie the sequence variant of type "n").
For instance, you could provide a wrong exon boundary. The HGVS Spec on numbering says:
If offset is positive, exon boundary should be in stranded ends
If offset is negative, exon boundary should be in stranded starts
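The two rules above can be sketched as a small standalone check. This is only an illustration, not Biocommons HGVS code: the function name, the `exons` argument and the toy coordinates are all made up, and it assumes you already have exon coordinates in transcript orientation (1-based, inclusive, as in "n." numbering).

```python
from typing import Iterable, Tuple


def boundary_matches_offset(base: int, offset: int,
                            exons: Iterable[Tuple[int, int]]) -> bool:
    """Check that an intronic position anchors to the right kind of boundary.

    exons: hypothetical (start, end) pairs, 1-based inclusive, already in
    transcript orientation. The first exon start and the last exon end are
    accepted here; positions anchored to them point outside the transcript
    and would need separate handling.
    """
    if offset == 0:
        return True  # no offset: an exonic position, nothing to check
    exons = list(exons)
    starts = {start for start, _ in exons}
    ends = {end for _, end in exons}
    # Positive offsets count downstream from an exon end,
    # negative offsets count upstream from an exon start.
    return base in ends if offset > 0 else base in starts


# Toy transcript with three exons (made-up numbers).
toy_exons = [(1, 100), (101, 250), (251, 400)]
print(boundary_matches_offset(100, 5, toy_exons))    # anchored on an exon end
print(boundary_matches_offset(101, 5, toy_exons))    # "+" on an exon start
print(boundary_matches_offset(101, -5, toy_exons))   # anchored on an exon start
```

The hard part, as discussed in this issue, is getting `exons` in transcript orientation in the first place.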
Example 1 (no error w/Biocommons HGVS) - correct boundary is 228, I provide the wrong exon boundary:
ClinGen gives the same error message if the exon boundary is wrong, even if you would be inside the intron (eg NM_152587.3:c.227+5A>T)
Example 2 (no error w/Biocommons HGVS) - I reverse the offset (from "-" to "+") leaving boundary as is
VariantValidator looks the exon boundary up in the set that matches the strandedness (ie starts or ends), so even a valid exon boundary with the sign reversed gives the same error
Notes on validation implementation
Key issue: To know the correct exon starts/exon ends - you need to know the transcript's strandedness
It's often easier to work with sequence variants of type "n" as their boundaries correspond to transcript exon start/ends, eg:
But how to map upstream/downstream to exon starts/ends? You need to know the strand, and this is NOT provided by any data provider method that doesn't take a genome build / contig
Valid exon boundaries that map outside the transcript
To work either of these out, you need to know how big the introns are - which you can only get via data provider methods that take a contig/genome build
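A rough sketch of why the genomic alignment is needed: intron sizes fall out of the gaps between genomic exon intervals, and the strand decides the order in which the transcript sees them. The function names and the interval format (0-based, half-open genomic coordinates plus a strand of 1 or -1) are assumptions for illustration, not the data provider's actual return format.

```python
from typing import List, Tuple


def intron_lengths(alt_exons: List[Tuple[int, int]], strand: int) -> List[int]:
    """Return intron lengths in transcript order.

    alt_exons: hypothetical genomic exon intervals, 0-based half-open.
    strand: 1 for the plus strand, -1 for the minus strand.
    """
    ordered = sorted(alt_exons)
    # Each intron is the gap between one exon's end and the next exon's start.
    gaps = [next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]
    # On the minus strand the transcript runs right to left on the genome.
    return gaps if strand == 1 else gaps[::-1]


def offset_fits_intron(offset: int, intron_length: int) -> bool:
    """Lenient check: a non-zero offset must not overshoot the intron."""
    return 0 < abs(offset) <= intron_length


# Made-up genomic exons; the same intervals give a different intron order per strand.
toy_alt_exons = [(100, 200), (300, 400), (450, 500)]
print(intron_lengths(toy_alt_exons, 1))
print(intron_lengths(toy_alt_exons, -1))
```

This only checks that the offset stays inside the intron. A stricter check could also apply the spec's convention for which half of the intron is numbered from the upstream exon and which from the downstream one.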
Offsets of 0 are prohibited
This is a low priority issue, and probably doesn't hurt much to leave it.
But technically, NM_152587.3:c.228-0= is invalid.
VariantValidator doesn't throw an error, but ClinGen Allele Registry throws "HgvsParsingError - Cannot parse definition of mutation"
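Unlike the exon boundary checks, a zero offset can be spotted without any transcript data. A minimal sketch, assuming the check runs on the raw string (once "-0" is parsed to an integer the explicit zero offset is lost); the regex and function name are hypothetical and are a heuristic, not the HGVS grammar:

```python
import re

# A digit, then "+" or "-", then an offset made only of zeros,
# e.g. "228-0" or "227+00".
_ZERO_OFFSET = re.compile(r"\d[+-]0+(?!\d)")


def has_zero_offset(hgvs_string: str) -> bool:
    """Return True if the string contains an explicit zero intronic offset."""
    return _ZERO_OFFSET.search(hgvs_string) is not None


print(has_zero_offset("NM_152587.3:c.228-0="))
print(has_zero_offset("NM_152587.3:c.227+5A>T"))
```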