reads2graph is a practical method for constructing an edit-distance-based read graph for a short-read sequencing dataset via minimizer and order min hash bucketing and graph traversal.
- If you attempt to install reads2graph from both
bioconda
and source code in the sameconda
environment, the installation may fail or take a long time to resolve dependencies. The bioconda channel currently does not supportgxx_linux-64==13.2.0
, and the following instructions for installing from source codes use thegxx_linux-64==13.2.0
. - reads2graph has only been tested and released on Linux. Support for other platforms will be added in the future.
Note: reads2graph installed from bioconda
currently only support linux-64
Add the necessary channels if you are not currently using them.
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n reads2graph-env
conda activate reads2graph-env
conda install bioconda::reads2graph
Note: Given that reads2graph depends on the Seqan3, Boost, and Sharg libraries, and requires a modern C++20 supported compiler, it is recommended to use the Conda environment for configuring and compiling this software, rather than relying on the default compiler of the Linux system.
Add the necessary channels if you are not currently using them.
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n reads2graph-env
conda activate reads2graph-env
conda install cmake==3.27.6 gxx_linux-64==13.2.0 sharg==1.1.1 seqan3==3.3.0 boost==1.82.0 openmp==8.0.1
git clone https://github.com/Jappy0/reads2graph.git
cd reads2graph
mkdir build
cd build
cmake ..
make
Please check the number of available CPU cores that can be used. Specify the exact number of CPU cores you want to use in parallel with the -p your_cpu_cores
option. If you do not set this, reads2graph may not run in parallel correctly. Using only one CPU core will run slowly.
cd ..
cd example
../bin/reads2graph -i ./data/SRR1543965.umi.fasta -o ./ -x 2 -n 1 -p 64 -k 4 --window_number 2 --omh_times 4 --omh_k 4
download the example data SRR1543965.umi.fasta under /example/data of this repo
git clone https://github.com/Jappy0/reads2graph.git
cd reads2graph/example
reads2graph -i ./data/SRR1543965.umi.fasta -o ./ -x 2 -n 1 -p 64 -k 4 --window_number 2 --omh_times 4 --omh_k 4
Note: The date, time, and commands will reflect the current status upon executing reads2graph. The example process is expected to complete in approximately 4 minutes using 64 CPU cores. Some HPC servers can finish the test example in less than one minute with 64 CPU cores.
2024-05-26 10:35:37: INFO: Welcome to use reads2graph!
2024-05-26 10:35:37: INFO: ../../../bin/reads2graph -i ../../../data/12/SRR1543965.umi.fasta -o ./ -x 2 -n 1 -p 64 -k 4 --window_number 2 --omh_times 4 --omh_k 4
2024-05-26 10:35:37: INFO: The number of threads: 64
2024-05-26 10:35:48: INFO: Loading data done!
2024-05-26 10:35:48: INFO: The number of unique reads: 101557, minimum read length: 12.
2024-05-26 10:35:48: INFO: Number of windows: 2, Window size: 6, K size: 4.
2024-05-26 10:37:40: INFO: Pairwise comparison for the small- or medium-size-based buckets done!
2024-05-26 10:37:40: INFO: The number of edges with weight of 1: 95262.
2024-05-26 10:37:40: INFO: The number of edges with weight of 2: 665519.
2024-05-26 10:37:40: INFO: The total number of edges: 760781.
2024-05-26 10:37:45: INFO: Bucketing unique reads using OMH done!
2024-05-26 10:39:58: INFO: Graph update for the small- or medium-size-based buckets generated by OMH bucketing done!
2024-05-26 10:39:58: INFO: The number of edges with weight of 1: 95262.
2024-05-26 10:39:58: INFO: The number of edges with weight of 2: 702067.
2024-05-26 10:39:58: INFO: The total number of edges: 797329.
If you are constructing graphs for short reads (e.g., with read lengths between 50bp and 300bp), you may only need to customise the following paramters following your needs. Using all the other parameters as default, we have optimised this for you.
-x, --max_edit_dis (unsigned 8 bit integer)
The maximum edit distance for constructing edges between reads Default: 2.
-n, --min_edit_dis (unsigned 8 bit integer)
The minimum edit distance for constructing edges between reads. Default: 1.
-p, --num_process (signed 32 bit integer)
The number of expected processes. Default: 26.
If you are constructing UMI graphs, you must input the following parameters by yourself, as some of the defaults may not work for UMIs.
-k, --k_size (unsigned 8 bit integer)
The size for minimiser. Default: 4.
-w, --window_number (unsigned 8 bit integer)
The window number for minimiser. Default: 3.
-x, --max_edit_dis (unsigned 8 bit integer)
The maximum edit distance for constructing edges between reads Default: 2.
-n, --min_edit_dis (unsigned 8 bit integer)
The minimum edit distance for constructing edges between reads. Default: 1.
--omh_k (unsigned 32 bit integer)
K-mer size used in order min hashing. Default: 4.
--omh_times (unsigned 32 bit integer)
The number of times to perform permutation in order min hashing. Default: 3.
-p, --num_process (signed 32 bit integer)
The number of expected processes. Default: 26.
Please use the following commands to view all the options if you need
reads2graph -h
installed from bioconda
or
xx/bin/reads2graph -h
installed from Source codes
2024-06-22 13:57:19: INFO: Welcome to use reads2graph!
2024-06-22 13:57:19: INFO: reads2graph -h
reads2graph - Construction of edit-distance graphs from a set of short reads.
=============================================================================
DESCRIPTION
Construction of edit-distance graphs from large scale sets of short reads through minimizer- and gOMH-bucketing
and graph traversal.
OPTIONS
-i, --input_data (std::filesystem::path)
Please provide a fasta/fastq/ data file. Default: "".
-o, --output_dir (std::filesystem::path)
The directory for outputs. Default: "/data/pping/Repo/reads2graph".
-r, --read_length (unsigned 32 bit integer)
No need to input this parameter, reads2graph will calculate the minimum read length. Default: 0.
-k, --k_size (unsigned 8 bit integer)
The size for minimiser. Default: 4.
-w, --window_number (unsigned 8 bit integer)
The window number for minimiser. Default: 3.
-x, --max_edit_dis (unsigned 8 bit integer)
The maximum edit distance for constructing edges between reads Default: 2.
-n, --min_edit_dis (unsigned 8 bit integer)
The minimum edit distance for constructing edges between reads. Default: 1.
-p, --num_process (signed 32 bit integer)
The number of expected processes. Default: 26.
--pair_wise (bool)
Brute Force calcualte the pairwise edit distance. Default: 0.
--bin_size_max (unsigned 32 bit integer)
The larger threshold used to group buckets of different sizes. Default: 10000.
--omh_k (unsigned 32 bit integer)
K-mer size used in order min hashing. Default: 4.
--omh_times (unsigned 32 bit integer)
The number of times to perform permutation in order min hashing. Default: 3.
--omh_seed (unsigned 64 bit integer)
The seed to generate a series of seeds for OMH bucketing. Default: 2024.
--omh_flag (bool)
Do not set this flag by yourself. When the permutation_times larger than the number of k-mer candidates and
the kmer size are the same one, bucketing the reads using each kmer candidate. Default: 0.
--bad_kmer_ratio (double)
The maximum ratio of bad k-mers out of total number of kmers in a window of a read. Default: 0.3.
--probability (double)
The expected probability P for grouping two similar reads into same bucket by at least one minimiser that
does not include the different bases Default: 0.86.
--visit_depth (unsigned 32 bit integer)
The maximum distance of nodes from the give node for updating more potential edges. Default: 15.
--save_graph (bool)
If ture, reads2graph will save graph to file in graphviz dot format. Default: 0.
VERSION
Last update: 04.06.2024
reads2graph version: 1.0.0
Sharg version: 1.1.1
SeqAn version: 3.3.0
LEGAL
reads2graph Copyright: BSD 3-Clause License
Author: Pengyao Ping
SeqAn Copyright: 2006-2023 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
For full copyright and/or warranty information see --copyright.
Feel free to contact me if you have any questions or bugs on runnning reads2graph or are interested in reads2graph