-
Notifications
You must be signed in to change notification settings - Fork 194
Programming with the vg API
You can use the graph data structures used by vg
in your own programs. Different data structures, useful for different purposes, are packaged in different vg-ecosystem libraries.
The most useful vg-ecosystem library is libbdsg, which provides the basic bi-directed sequence graph implementations that vg
uses to store and manipulate graphs.
To use this library in a C++ project, you can add it as a Git submodule, and use it from your project's CMake-based build system with add_subdirectory
:
mkdir myproject
cd myproject
git init
mkdir deps
cd deps
git submodule add https://github.com/vgteam/libbdsg.git
cd libbdsg
git checkout 06ab14bbc7fcfd40745885cd85c301e4bf1cf0cd
git submodule update --init --recursive
cd ../..
cat >CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.10)
project(myproject VERSION 0.0.1)
set(CMAKE_CXX_STANDARD 14)
# We will not need libbdsg's Python bindings
set(BUILD_PYTHON_BINDINGS OFF CACHE INTERNAL "" FORCE)
# We also will not need its Doxygen docs
set(RUN_DOXYGEN OFF CACHE INTERNAL "" FORCE)
add_subdirectory("${CMAKE_CURRENT_SOURCE_DIR}/deps/libbdsg")
EOF
Then you can build the project:
cd myproject
mkdir build
cd build
cmake ..
make -j $(nproc)
To actually use the library, you will need to define a CMake target for your code, and connect it to the libbdsg
library target:
cd myproject
mkdir src
cat >src/main.cpp <<'EOF'
int main(int argc, char** argv) {
return 0;
}
EOF
cat >>CMakeLists.txt <<'EOF'
add_executable(myproject ${CMAKE_CURRENT_SOURCE_DIR}/src/main.cpp)
target_link_libraries(myproject libbdsg)
EOF
(cd build && cmake ..)
Let's access node and edge data from an example graph.
First, we need to get a graph in the correct format. Since libbdsg can't yet read GFA format, we can use vg
to convert to the HashGraph
format, which it can understand.
cd myproject
wget -q https://raw.githubusercontent.com/vgteam/vg/ffd491375f7e6d97c63598e35fefc653c9aea286/test/graphs/cactus-BRCA2.gfa
vg convert --hash-out cactus-BRCA2.gfa >cactus-BRCA2.vg
Then we can check to make sure we have the right format:
cd myproject
vg stats --format cactus-BRCA2.vg
Since we want to load the graph with a bdsg::HashGraph
, make sure this says:
format: HashGraph
If you have a graph that isn't a HashGraph, you would need to either use a different libbdsg class (bdsg::PackedGraph
or bdsg::ODGI
), or else convert to a supported libbdsg format with vg
, like we do above.
Once you have a HashGraph, you can load it like this:
cd myproject
cat >src/main.cpp <<'EOF'
#include <bdsg/hash_graph.hpp>
#include <iostream>
int main(int argc, char** argv) {
// Make an empty graph
bdsg::HashGraph graph;
// Populate it by loading the graph from the given filename
graph.deserialize("cactus-BRCA2.vg");
// Report the number of nodes in the graph
std::cout << "Graph has " << graph.get_node_count() << " nodes." << std::endl;
return 0;
}
EOF
# Build the program
(cd build && make -j $(nproc))
When you run the program:
cd myproject
./build/myproject
You should get the right node count:
Graph has 1134 nodes.
Coming soon! See also the Python tutorial.
A bdsg::HashGraph
implements handlegraph::MutablePathDeletableHandleGraph
, a full-featured interface for a graph that stores embedded paths, and supports modification and deletion of graph elements.
References for vg often include haplotype information, which tracks how individual genomes are embedded in the graph, and what paths they take. This often comes in the form of a GBZ file (see File Types), which includes both the base graph nodes and edges, and the haplotype information, in a single file. To work with real-world data, you might need to be able to load and access a GBZ file.
First, we need to get a GBZ file. The vg
binary knows how to make one.
vg gbwt -G cactus-BRCA2.gfa --gbz-format -g cactus-BRCA2.gbz
The GBZ file format is implemented in the GBWTGraph library. So, the first step is to add that library and its dependencies to our project.
cd myproject
cd deps
# Install GBWT library
git submodule add https://github.com/jltsiren/gbwt.git
cd gbwt
git checkout 180e041a029af05f1d74e57058e3c16974c85362
cd ..
# Install gbwtgraph library
git submodule add https://github.com/jltsiren/gbwtgraph.git
cd gbwtgraph
git checkout 3cb45d5f0c9950d2db4493b2445acb60c8f24199
cd ..
cd ..
# Attach to CMake
cat >>CMakeLists.txt <<'EOF'
# SDSL library ("sdsl" target) already exists from libbdsg
# GBWT and GBWTGraph don't have CMake yet so we need to explain how to build them.
# GBWT
foreach (gbwt_source IN ITEMS algorithms.cpp bwtmerge.cpp cached_gbwt.cpp dynamic_gbwt.cpp fast_locate.cpp files.cpp gbwt.cpp internal.cpp metadata.cpp support.cpp test.cpp utils.cpp variants.cpp)
list(APPEND gbwt_libFiles "deps/gbwt/src/${gbwt_source}")
endforeach()
if(MSVC)
file(GLOB gbwt_headerFiles RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwt/include/gbwt/*.h")
endif()
set( gbwt_SRCS ${gbwt_libFiles} ${gbwt_headerFiles})
add_library( gbwt ${gbwt_SRCS} )
target_include_directories( gbwt PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwt/include")
find_package( OpenMP REQUIRED )
target_link_libraries( gbwt PUBLIC sdsl OpenMP::OpenMP_CXX )
install(TARGETS gbwt
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib
PUBLIC_HEADER DESTINATION include)
# GBWTGraph
foreach (gbwtgraph_source IN ITEMS algorithms.cpp cached_gbwtgraph.cpp gbwtgraph.cpp gbz.cpp gfa.cpp internal.cpp minimizer.cpp path_cover.cpp utils.cpp)
list(APPEND gbwtgraph_libFiles "deps/gbwtgraph/src/${gbwtgraph_source}")
endforeach()
if(MSVC)
file(GLOB gbwtgraph_headerFiles RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwtgraph/include/gbwt/*.h")
endif()
set( gbwtgraph_SRCS ${gbwtgraph_libFiles} ${gbwtgraph_headerFiles})
add_library( gbwtgraph ${gbwtgraph_SRCS} )
target_include_directories( gbwtgraph PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/deps/gbwtgraph/include")
target_link_libraries( gbwtgraph PUBLIC gbwt handlegraph_shared )
install(TARGETS gbwtgraph
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib
PUBLIC_HEADER DESTINATION include)
# Link gbwtgraph into our executable.
# Other libraries will come along
target_link_libraries(myproject gbwtgraph)
EOF
# Configure the build
(cd build && cmake ..)
Once that's done, we can load a GBZ file and interact with it using the normal handlegraph::HandleGraph
interface.
cd myproject
cat >src/main.cpp <<'EOF'
#include <gbwtgraph/gbz.h>
#include <iostream>
#include <fstream>
int main(int argc, char** argv) {
// Make an empty GBZ object, which handles deserialization.
gbwtgraph::GBZ gbz;
// Open a stream to read the GBZ from
std::ifstream gbz_file("cactus-BRCA2.gbz");
// Load from the stream
gbz.simple_sds_load(gbz_file);
// The GBZ object now owns a HandleGraph subclass (GBWTGraph) that we can use
gbwtgraph::GBWTGraph& graph = gbz.graph;
// Report the number of nodes in the graph
std::cout << "Graph has " << graph.get_node_count() << " nodes." << std::endl;
return 0;
}
EOF
# Build the program
(cd build && make -j $(nproc))
A GBZ file's graph only contains the nodes and edges that are visited by paths in the graph, so the node and edge counts may be different than those in the original graph used to generate the GBZ.
When you run the program:
cd myproject
./build/myproject
You should get the right node count:
Graph has 1134 nodes.
To recognize a GBZ file, you can use its leading magic number tag, 0x205A4247
little-endian, or GBZ
.
In addition to directly working with the graph, you can use predefined algorithms from libhandlegraph, a dependency that libbdsg brings along. You can find a list of available algorithms in the Doxygen documentation.
Coming soon!
If you are done with the tutorial and don't need your project anymore, you can get rid of it:
rm -Rf myproject