Roadmap


Adam Novak edited this page Jul 25, 2022 · 71 revisions

This document sets out the high-level tasks which the vg development team hopes to accomplish in the next few versions of vg and beyond.

By Time

These are the things we hope to achieve on several planning horizons:

End of Summer (9/22)

Releases for vg libraries (libhandlegraph, libbdsg, libvgio)
- Enhanced release acceptance testing so a release means production quality, like we would do for a paper. Proven to be able to run some number of genomes. (Tag paper-validated hashes as release?)
Actual tutorials and examples for using libvgio/libbdsg for DV use cases (see pysam's progressively increasing examples)
See if snarl normalization improves mapping/calling (Robin)
Giraffe chain seed fill-in for long reads (Adam)
Good understanding of/models for how many/which minimizers to use for long reads in Giraffe, for nanopore and PacBio (Stephen)
Stop speeding up DI2 (Xian)
Delete at least one index each from vg index #3144 (Adam, Jordan)

End of Year (12/22)

Better developer documentation (model: pysam docs) for vg library
- Links from libvgio Doxygen section to Protobuf-derived doc pages
Stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
Support DV use cases in vg libraries (libhandlegraph/libvgio)
- Query the graph (for nodes/edges)
- GRCh38 location -> retrieve aligned reads near there
- Get read attributes and know what they mean
User-facing, under-test Giraffe docs for HPRC graphs (Jordan)
Delete old snarl manager (except as a backward-compatible load?) and use snarl-only DI2 everywhere
Figure out if we need a libsnarls actually (Adam, Xian, Jordan)
No cruft in vg index #3144 (Adam, Jordan)
- Just front-end the autoindex system with a sensible CLI for single indexes?

Next Summer (6/23)

Libraries free of cryptic error messages when the user or inputs are wrong
Giraffe actually competitive on long reads

To Assign

Previous Roadmap

Next 3 Months (3/22)

Idiomatic Giraffe on long-node rGFA
- Input and Output "stable" coordinates in GAF
- Output rGFA-style cover of vg graph, using prioritization of existing paths as option, inventing paths if necessary
- Node auto-chopping from rGFA is the last missing piece of #3126
- Chopping overlay (Jordan)
- Saving, loading, and using coordinate translations to/from node-coalesced, string-ID'd input GFA space
  - HG interface and implementation in libbdsg graphs?
- Back-translate mappings
- Implicit node chopping on GFA input
RNA project
- dbGAP access
- Better splice junction identification/graph distance (Jordan)
HPRC tooling
- Let surject generate supplementary alignments for e.g. mappings over inversions
  - Eliminate intermediate Alignment as surject output, go graph Alignment -> BAM record(s)

Also:

Eliminate vg::VG (Jordan)
- Steal all the things only it can do away from it
Default everything to GAF instead of GAM
- mpGAF (Jordan, Jonas)
- Also pgvf (Graph to graph)
- Calls and snarls in one of these?
Get rid of GFAKluge!
Full subpath support in vg (Adam, Jordan)
- HG API support (see old GitHub issue on handlegraph)
- Plugging in to tools
Drop pinchesAndCacti and sonlib
- Drop Cactus-library-based snarl finder (Adam)
Python bindings for libhandlegraph algorithms
- Are they the right algorithms?

Next Year (9/22)

Pipeline plan for long read Giraffe (Stephen)
- Chain minimizers (Xian, Adam)
- Generalize DP
- Replace gapless extension with gapped WFA
Giraffe on really complex graphs: Non-minimum distance index? Linear layout distance index? Inject from GBWT path set mappings?
Use memory-mapped graphs
- For tube map, to enable interactive whole-genome use (Future data vis enthusiast)
- For Giraffe
Get GBWT build working in under 200 GB memory on 100m variants with fancy disk-backed in-progress GBWT implementation (need 300m random access vectors that grow independently)
Support Erik's multi-level graph format when mature
Redesign and reorganize little tools (Where should each manipulation live? Should some just be scripts you write?)
- vg mod
- vg chunk
- vg circularize
- vg view
- vg paths

Running Projects

These are things we are working on, with no particular delivery date goal.

Use of MCMC techniques in the genotyper with multipath alignments

Wishlist

These are things we would like to do eventually.

vg deconstruct and Beagle to impute genotypes into partially-mapped and called data, as a PanGenie alternative (Erik, Andrea)
Alignment
- Adoption of the multipath alignment paradigm as the default
- Graph-to-graph mapping (Xian)
Variant Calling
- Implementation of an HHGA-like machine learning based variant caller
- Integration of variant calling and assembly polishing processes
- Prune the zoo of TraversalFinders, and expose the useful ones to Python
Visualization
- Browser-free tube map
- Better tube map handling of edge cases
  - No haplotypes on a node
  - Starting on a rare haplotype
Infrastructure
- Destructively modernize and unify IO
  - Eliminate VPKG framing if possible in favor of magic numbers everywhere
    - Resolve ensuing questions about GAM format
      - Just use GAF?
    - Handle things like GFA that need to manually sniff
  - Just save from the object; no more save_handle_graph
  - Magic format registration for libvgio magic numbers for loading
  - Depend on libvgio in libbdsg to do the IO there and pick the right handle graph implementation
- Replace Protobuf internal formats with faster ones
- Revision of ID assignment logic to allow deterministic node breaking
- Accept gzipped GFA if practical (can't mmap)
- Improved HandleGraph API
  - Abstract away node boundaries
  - View all sequence as C++17 string_views instead of sequence-owning strings
  - O(1) reverse complement DNAStringView
- CMake-ify the main vg build
- Eliminate old systems and their associated submodules, or factor them out into their own projects
  - vg vectorize could be its own project
    - Update vg vectorize to modern, system Vowpal Wabbit
    - Or pull it out into its own submodule and remove Vowpal Wabbit dependency from vg
  - Eliminate RocksDB from vg; everybody using vg map uses GCSA indexes now.
  - vg genotype
  - vg srpe
- More cross-language support
  - Interoperate with Rust handle graph users/providers
  - Interoperate with Java handle graph users/providers
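
To make the "O(1) reverse complement DNAStringView" wishlist item above more concrete, here is a minimal sketch of the idea in Python rather than C++17. It is only an illustration under stated assumptions: the class name is borrowed from the wishlist item, but the constructor arguments, methods, and complement table are hypothetical and do not correspond to any existing vg or libhandlegraph API. The point is that taking the reverse complement only flips an orientation flag on a view that shares the backing sequence, so no bases are copied until someone actually reads them.

```python
# Hypothetical sketch: none of these names come from vg or libhandlegraph.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


class DNAStringView:
    """Read-only view over a DNA string, optionally reverse-complemented."""

    def __init__(self, backing, start=0, stop=None, is_reverse=False):
        self._backing = backing  # shared with other views, never copied
        self._start = start
        self._stop = len(backing) if stop is None else stop
        self._is_reverse = is_reverse

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, i):
        n = len(self)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("index out of range")
        if self._is_reverse:
            # Read from the far end of the window and complement on the fly.
            return _COMPLEMENT[self._backing[self._stop - 1 - i]]
        return self._backing[self._start + i]

    def reverse_complement(self):
        # O(1): flips a flag and reuses the same backing string and window.
        return DNAStringView(
            self._backing, self._start, self._stop, not self._is_reverse
        )

    def __str__(self):
        # Materializing the sequence is O(length); only do it when needed.
        return "".join(self[i] for i in range(len(self)))


view = DNAStringView("GATTACA")
rc = view.reverse_complement()
print(str(rc))
```

In this sketch the cost of complementing is paid per base at read time, which is the usual trade-off for a view of this kind: creating the reverse complement is constant time, while each access does a small amount of extra work.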
