Problem Summary
There are several flavors of GRCh38. All are coordinate-compatible but have distinct sequences. "Official" GRCh38 sequences are uppercase and contain ambiguity characters. Ensembl replaces ambiguity characters with N. hg38 from UCSC represents repeat regions in lowercase.
SeqRepo needs a way to preserve the original sequence verbatim, but also to support commonly used transformations, and to make this choice apparent to users.
Background
The GRC defines official genomic references, which include the assembly name, member accessions, nucleotide sequences, alternate assemblies, etc. For an example, see the GCF_000001405.26 assembly report.
According to GRCh38, the sequence referred to by GRCh38:1 and refseq:NC_000001.11 is a (masked) sequence with ambiguity characters. It is unacceptable to hijack these identifiers to mean another sequence. However, these sequences are not very usable as-is because, for example, no one expects lower case in a genomic sequence. (Embedding annotations like masking into sequences is a mistake.)
Because the GRC sequences are inconvenient to use as-is, UCSC and Ensembl transform the sequences to be more useful. The transformations preserve coordinates, but change the sequence by upper-casing. Thus, we have two versions of each sequence for a given assembly.
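To make "coordinate-preserving transformation" concrete, here is a minimal Python sketch of the two transformations discussed above: case-squashing and replacing ambiguity characters with N. The function names `squash_case` and `disambiguate` are hypothetical and are not part of SeqRepo; the sketch only illustrates that both operations change residues without changing length or positions.

```python
# IUPAC nucleotide codes that stand for more than one base (N itself excluded).
AMBIGUITY_CODES = "RYSWKMBDHV"

# Map each ambiguity code to N, keeping the case of the original character.
_TO_N = str.maketrans(
    AMBIGUITY_CODES + AMBIGUITY_CODES.lower(),
    "N" * len(AMBIGUITY_CODES) + "n" * len(AMBIGUITY_CODES),
)


def squash_case(seq: str) -> str:
    """Drop lower-case soft-masking by upper-casing every residue."""
    return seq.upper()


def disambiguate(seq: str) -> str:
    """Replace IUPAC ambiguity codes with N, leaving case untouched."""
    return seq.translate(_TO_N)


# Toy sequence (not real GRCh38 data): masked region plus ambiguity codes.
original = "ACGTacgtRYnn"
squashed = squash_case(original)
plain = disambiguate(squashed)

# Both transformations are one-to-one per position, so coordinates are preserved.
assert len(original) == len(squashed) == len(plain)
```

Because each transformation is a pure function of the verbatim sequence, the original can stay stored unchanged and the transformed form can be derived on request.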
In addition to case-squashing and disambiguating sequences, it should also be possible to support reverse complement and circular sequences and coordinates.
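As a rough illustration of the last point, the sketch below shows reverse complement and a slice over circular coordinates in Python. The names `reverse_complement` and `circular_slice` are hypothetical, and the coordinate convention (0-based, half-open, with start > end meaning the slice wraps past the origin) is an assumption made for this example, not something the issue specifies.

```python
# Complement table covering the full IUPAC nucleotide alphabet, in both cases.
_IUPAC = "ACGTRYSWKMBDHVN"
_IUPAC_COMPLEMENT = "TGCAYRSWMKVHDBN"
_COMPLEMENT = str.maketrans(
    _IUPAC + _IUPAC.lower(),
    _IUPAC_COMPLEMENT + _IUPAC_COMPLEMENT.lower(),
)


def reverse_complement(seq: str) -> str:
    """Complement every residue (case preserved), then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_slice(seq: str, start: int, end: int) -> str:
    """Return seq[start:end] on a circular sequence.

    Assumed convention: 0-based, half-open coordinates; when start > end,
    the slice runs to the end of the sequence and continues from the origin.
    """
    n = len(seq)
    if not (0 <= start <= n and 0 <= end <= n):
        raise ValueError("coordinates must lie within the sequence")
    if start <= end:
        return seq[start:end]
    return seq[start:] + seq[:end]
```

For example, `circular_slice("ACGTAC", 4, 2)` joins the last two residues to the first two, and applying `reverse_complement` twice returns the input unchanged.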