
AAV Atlas Database Schema

Robert J. Gifford edited this page Dec 29, 2024 · 4 revisions

Schema Documentation for AAV-Atlas

The AAV-Atlas schema defines the data structure for managing adeno-associated virus (AAV) sequence information and associated analyses within the GLUE framework. This schema extends the core GLUE schema to include tables and fields tailored to AAV-specific research, supporting rich metadata, custom analyses, and integration of sequence data. Below is a detailed overview of the schema structure.


Schema Components

Custom Tables

The schema introduces two custom tables:

  1. aav_replacement: Stores information about amino acid replacements in AAV sequences.

    • Fields:
      • display_name: A textual identifier for the replacement.
      • codon_label: The codon where the replacement occurs.
      • reference_aa: The reference amino acid.
      • reference_nt: The reference nucleotide position.
      • replacement_aa: The replacement amino acid.
      • num_seqs: The number of sequences exhibiting this replacement.
      • grantham_distance_int: Grantham distance as an integer.
      • miyata_distance: Miyata distance for amino acid replacement.
      • parent_feature: The parent genomic feature affected by the replacement.
      • parent_codon_label: The label of the parent codon affected.
  2. aav_replacement_sequence: Maps sequences to replacements, supporting a many-to-one relationship with aav_replacement.

    • Configured with an extended ID field length (200 characters).
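
As an illustration of the record shape described above, here is a minimal Python sketch of an aav_replacement row. The field names come from the list above; the Python types are assumptions, since this page does not state column types for the custom tables. The helper replacement_sequence_id and its "replacement:sequence" ID format are hypothetical and only demonstrate the 200-character ID limit.

```python
from dataclasses import dataclass

# Extended ID length stated for aav_replacement_sequence.
MAX_ID_LENGTH = 200


@dataclass(frozen=True)
class AavReplacement:
    """One row of aav_replacement; the Python types are assumptions."""
    display_name: str
    codon_label: str
    reference_aa: str
    reference_nt: int
    replacement_aa: str
    num_seqs: int
    grantham_distance_int: int
    miyata_distance: float
    parent_feature: str
    parent_codon_label: str


def replacement_sequence_id(replacement_id: str, sequence_id: str) -> str:
    """Build a hypothetical row ID and enforce the 200-character limit."""
    row_id = f"{replacement_id}:{sequence_id}"
    if len(row_id) > MAX_ID_LENGTH:
        raise ValueError(
            f"ID is {len(row_id)} characters; limit is {MAX_ID_LENGTH}"
        )
    return row_id
```

The frozen dataclass keeps each record immutable, which suits rows that are derived once from an analysis and then only read.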

Extensions to the Core Tables

sequence Table

The core sequence table is extended with fields capturing metadata and sequence properties:

  • GenBank Fields:
    • gb_create_date (DATE): The sequence's creation date in GenBank.
    • gb_update_date (DATE): The last update date in GenBank.
  • Taxonomy and Clade Information:
    • full_name (VARCHAR 100): Full taxonomic name.
    • association (VARCHAR 100): Associated group or property.
    • species (VARCHAR 50): Virus species name.
    • lineage (VARCHAR 50): Taxonomic lineage.
    • clade (VARCHAR 50): Clade designation.
  • Source and Host Information:
    • isolate (VARCHAR 100): Isolate identifier.
    • isolation_source (VARCHAR 200): Source of the isolation.
    • sampled_host_sci_name (VARCHAR 150): Scientific name of the sampled host.
    • country (VARCHAR 100): Country where the sample was collected.
    • place_sampled (VARCHAR 100): Specific location of sampling.
  • Collection Details:
    • molecule_type (VARCHAR): Type of molecule (e.g., DNA or RNA).
    • collection_year (INTEGER): Year of sample collection.
    • collection_month (VARCHAR): Month of sample collection.
    • collection_month_day (INTEGER): Day of the month of sample collection.
  • Additional Metadata:
    • length (INTEGER): Length of the sequence.
    • pubmed_id (INTEGER): Associated PubMed ID for reference.
    • variation_present (BOOLEAN): Indicates the presence of variations.
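
Because the collection date is split across collection_year, collection_month, and collection_month_day, a consumer of the data usually has to reassemble it. The sketch below shows one way to do that in Python. It assumes collection_month holds English month names or three-letter abbreviations, which this page does not confirm; the function name collection_date is hypothetical.

```python
from datetime import date
from typing import Optional

# Assumption: months are stored as English names or 3-letter abbreviations.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def collection_date(
    year: Optional[int], month: Optional[str], day: Optional[int]
) -> Optional[date]:
    """Return a full date when all three fields are present and valid."""
    if year is None or month is None or day is None:
        return None
    # Normalise "Mar", "mar", or "March" to the key "MAR".
    month_number = MONTHS.get(month.strip().upper()[:3])
    if month_number is None:
        return None
    try:
        return date(year, month_number, day)
    except ValueError:
        # Impossible calendar dates such as 30 February.
        return None
```

Returning None for partial or invalid dates keeps incomplete records usable, since many GenBank entries report only a collection year.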

alignment Table

The alignment table is extended to include:

  • clade_category (VARCHAR 20): Categorizes clades.
  • phylogeny (CLOB): Stores phylogenetic data in a large object.

member_floc_note Table

This table is extended with:

  • ref_nt_coverage_pct (DOUBLE): Percentage of reference nucleotide positions covered.
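
To make the coverage percentage concrete, the following Python sketch computes it from aligned segments expressed in reference coordinates. This is an illustration under stated assumptions, not the calculation GLUE itself performs: the inclusive (start, end) segment representation, the feature_start and feature_end parameters, and the function name are all hypothetical.

```python
from typing import Iterable, Tuple


def ref_nt_coverage_pct(
    segments: Iterable[Tuple[int, int]],
    feature_start: int,
    feature_end: int,
) -> float:
    """Percentage of reference positions in a feature covered by segments.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    feature_length = feature_end - feature_start + 1
    if feature_length <= 0:
        raise ValueError("feature_end must be >= feature_start")
    covered = set()
    for start, end in segments:
        # Clip each segment to the feature; overlaps are counted once.
        lo = max(start, feature_start)
        hi = min(end, feature_end)
        covered.update(range(lo, hi + 1))
    return 100.0 * len(covered) / feature_length
```

Collecting positions in a set means overlapping segments are not double-counted, at the cost of memory proportional to the feature length.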

Schema Extensions for Replacements

The schema supports relationships and linkages for managing replacements:

  • Links:
    • variation → aav_replacement (ONE_TO_ONE): Links variations to replacements.
    • aav_replacement → aav_replacement_sequence (ONE_TO_MANY): Links replacements to their associated sequences.
    • sequence → aav_replacement_sequence (ONE_TO_MANY): Links sequences to associated replacements.
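
The three links can be pictured as foreign keys in a relational layout. The sketch below models them in an in-memory SQLite database using Python's standard sqlite3 module. The column names, the SQL layout, and the demo row IDs (seqA, rep1, and so on) are assumptions for illustration; GLUE manages its own database schema, and this is not its actual DDL.

```python
import sqlite3
from typing import List


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database mirroring the three links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sequence (id TEXT PRIMARY KEY);
        CREATE TABLE variation (id TEXT PRIMARY KEY);
        CREATE TABLE aav_replacement (
            id TEXT PRIMARY KEY,
            -- UNIQUE makes the variation link one-to-one
            variation_id TEXT UNIQUE REFERENCES variation(id)
        );
        CREATE TABLE aav_replacement_sequence (
            id TEXT PRIMARY KEY,
            -- many rows may point at one replacement
            aav_replacement_id TEXT REFERENCES aav_replacement(id),
            -- many rows may point at one sequence
            sequence_id TEXT REFERENCES sequence(id)
        );
        """
    )
    # Hypothetical demo rows.
    conn.executemany("INSERT INTO sequence VALUES (?)", [("seqA",), ("seqB",)])
    conn.executemany("INSERT INTO variation VALUES (?)", [("var1",), ("var2",)])
    conn.executemany(
        "INSERT INTO aav_replacement VALUES (?, ?)",
        [("rep1", "var1"), ("rep2", "var2")],
    )
    conn.executemany(
        "INSERT INTO aav_replacement_sequence VALUES (?, ?, ?)",
        [
            ("rep1:seqA", "rep1", "seqA"),
            ("rep2:seqA", "rep2", "seqA"),
            ("rep1:seqB", "rep1", "seqB"),
        ],
    )
    return conn


def replacements_for(conn: sqlite3.Connection, sequence_id: str) -> List[str]:
    """List the replacement IDs observed in one sequence."""
    rows = conn.execute(
        "SELECT aav_replacement_id FROM aav_replacement_sequence "
        "WHERE sequence_id = ? ORDER BY aav_replacement_id",
        (sequence_id,),
    )
    return [row[0] for row in rows]
```

In this layout aav_replacement_sequence acts as a join table, so a query can start from either a sequence or a replacement and reach the other side in one step.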

Summary of Key Features

  • Custom Tables for AAV-specific analyses, focusing on amino acid replacements.
  • Extended Metadata fields in the sequence table for detailed annotation of sequences.
  • Alignment Table Enhancements for clade categorization and phylogenetic storage.
  • Replacement Schema supporting amino acid replacement tracking and analysis with comprehensive metadata.
  • Relational Links between sequences, replacements, and variations to enable integrated queries and analyses.

Together, these extensions allow AAV sequence metadata, clade assignments, phylogenies, and amino acid replacements to be stored and queried within a single GLUE project, supporting reproducible analyses of sequence variation, clade-specific patterns, and evolutionary relationships.