d'accord is a non hybrid long read consensus program based on local de Bruijn graph assembly. The package contains the following programs
- daccord: main consensus computation program
- computeintrinsicqv: compute intrinsic quality value track
- lasdetectsimplerepeats: simple repeat detection for repeats completed contained inside a long read
- lasfilteralignments: remove improper alignments (likely repeat induced alignments)
- lasfilteralignmentsborderrepeats: remove repeat induced alignments at read borders
- maftobam: convert MAF files produced by PBsim to BAM
- bamidrename: replace read names in BAM file by unique identifiers
- generateperfectpile: generate perfect alignment pile based on read to reference alignments
- checklas: compare aligner generated data with perfect alignment piles
- checkconsensus: check consensus accurary based on ground truth data
- sortfasta: sort a FastA file by read name
- mapconstoraw: insert corrected fragments into uncorrected reads
- fillfasta: insert missing reads into FastA file
A short list of options is available for each program by calling it with the -h parameter, e.g.
daccord -h
The daccord source code is hosted on github:
[email protected]:gt1/daccord.git
Release packages can be found at
https://github.com/gt1/daccord/releases
Please make sure to choose a package containing the word "release" in it's name if you intend to compile daccord for production (i.e. non development) use.
daccord needs libmaus2 [https://github.com/gt1/libmaus2] . libmaus2 needs to
be built with support for the GMP library. When libmaus2
is installed in
- autoreconf -i -f
- ./configure --with-libmaus2=${LIBMAUSPREFIX} \
--prefix=${HOME}/daccord
- make install
The release packages come with a configure script included (making the autoreconf call unnecessary for source obtained via one of those).
The core algorithms used by daccord are described in the following two papers:
- German Tischler and Eugene W. Myers: Non Hybrid Long Read Consensus Using Local De Bruijn Graph Assembly, preprint on bioRxiv, https://doi.org/10.1101/106252
- German Tischler: Haplotype And Repeat Separation In Long Reads, preprint on bioRxiv, https://doi.org/10.1101/145474
The first paper describes the algorithm used in the daccord program. The second one explains the algorithms used by the programs split_agr and split_dis.
daccord is the main consensus computation program. It requires the name of an LAS file and a Dazzler database as input arguments:
daccord reads.las reads.db
Please see DAZZ_DB (https://github.com/thegenemyers/DAZZ_DB) for creating Dazzler databases and DALIGNER (https://github.com/thegenemyers/DALIGNER) for creating LAS alignment files.
daccord has a set of optional parameters which can be set on the command line:
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -w: window size for building De Bruijn graphs
- -a: advance size for De Bruijn graph windows. Window i is at positions [ia,ia+w)
- -d: maximum depth used. If set to some v then daccord will use only the v longest alignments for each read
- -f: produce full alignments. Set by calling with -f1. If set then daccord will output one record per corrected read. Corrected stretches are printed as upper case letters, uncorrected stretches as lower case letters.
- -V: verbosity. Smaller values make the program more quiet.
- -I: read interval. Given as -Ii,j . This means the program will process the read id interval [i,j).
- -J: batch interval. Given as -Ji,j for 0 <= i < j. The program will process batch id i out of j packages. More precisely, if the input LAS file starts at A-read id a and ends at A-read z, then each batch package has size s=(z-a+j-1)/j and batch package i contains reads [a+i*s,a+(i+1)*s)
- -E: sets the name of the error profile to be used. By default this is the name of the input LAS file with .eprof appended. If this file does not exist then the program will estime the error profile based on the alignments in the LAS file.
- -m: minimum window coverage. This denotes the minimum number of reads which need to cover a window so consensus construction will be attempted.
- -e: maximum window error. If this is set then the windows which have an average error between the reads and the consensus of more than this will not be used for the consensus.
- -l: minimum length of output. Corrected stretches shorter than this length will not be produced. This is ignored if the -f parameter is set.
- -D: maximum number of alignments processed per read (default 5000)
Please note that the parameters need to be provided before any other arguments (i.e. before the LAS file name and the name of the Dazzler DB).
daccord produces a FastA file on standard output. The read names follow the scheme
read-id/unique-id/from_to
An example for this is
2/7/500_23141
This designates that the respective sequence is the corrected version of read id 2 (in the trimmed Dazzler database) from base 500 to 23141. The unique-id field provides a unique identifier for each corrected fragment and has otherwise no specific meaning.
computeintrinsicqv computes intrinsic quality values and requires a numerical depth parameter -d, a Dazzler database and an LAS file to compute intrinsic quality values. An example call is:
computeintrinsicqv -d40 reads.db reads.las
The argument for the depth parameter is the sequencing depth, which is given by the number of sequenced bases divided by the size of the genome sequenced. The reads file can refer to a single block of the database, e.g.
computeintrinsicqv -d40 reads.db reads.1.las
is a valid call for the first block of reads.db . The database however needs to be complete, i.e.
computeintrinsicqv -d40 reads.1 reads.1.las
is not a valid call. When intrinsic quality values have been computed for all blocks of a database, then the quality tracks for the single blocks can be concatenated to the quality track of the complete database using the Catrack tool of DAZZ_DB.
lasdetectsimplerepeats is a program which detects simple repeats (such which are completely contained in a read) by virtue of checking intrinsic quality values. The program takes a Dazzler database and an LAS file as input. An example call is
lasdetectsimplerepeats reads.db reads.las >reads.rep
The database given has to be complete, the LAS file can refer to a single block of the database.
lasdetectsimplerepeats has a set of optional parameters which can be set on the command line:
- -e: error threshold for proper alignment ends. The end of an alignment is considered proper if it either extends to a read end or to a region with an intrinsic quality corresponding to an error of at least this value. By default this value is set to 0.35, i.e 35%.
- -d: depth threshold for repeat detection. A region of a read will be considered as a repeat if at least this many other reads give evidence that the region is a repeat. By default this is set to 10, i.e. 10.
lasdetectsimplerepeats produces it's output on the standard output channel. The output for single blocks can be concatenated to the output for the whole database using the Unix cat program. This requires the outputs of lasdetectsimplerepeats to be concatenated in increasing order of the block id, i.e. run
cat reads.1.rep reads.2.rep ... reads.n.rep >reads.rep
if n is the number of blocks in the database.
lasfilteralignments is a program which removes improper alignments from an LAS file. It requires intrinsic qualities to do so. The program takes a Dazzler database and an LAS file as input. An example call is
lasfilteralignments reads.db reads.las
This will create the output file reads_filtered.las .
lasfilteralignments has a set of optional parameters which can be set on the command line:
- -e: error threshold for proper alignment ends. The end of an alignment is considered proper if it either extends to a read end or to a region with an intrinsic quality corresponding to an error of at least this value. By default this value is set to 0.35, i.e 35%.
lasfilteralignmentsborderrepeats is a program which removes (probably) repeat induced alignments involving a prefix or suffix of a read. The repeats are not detected by the program but need to be provided in the form of an output file as produced by lasdetectsimplerepeats. It expect four input arguments:
- an output file name (output in LAS format)
- a Dazzler database
- a repeats file (as produced by lasdetectsimplerepeats)
- an input LAS file name
An example call is
lasfilteralignmentsborderrepeats out.las in.db in.rep in.las
lasfilteralignmentsborderrepeats has a set of optional parameters which can be set on the command line:
- -e: error threshold for proper alignment ends. The end of an alignment is considered proper if it either extends to a read end or to a region with an intrinsic quality corresponding to an error of at least this value. By default this value is set to 0.35, i.e 35%.
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -I: read interval. Given as -Ii,j . This means the program will process the read id interval [i,j).
- -J: batch interval. Given as -Ji,j for 0 <= i < j. The program will process batch id i out of j packages. More precisely, if the input LAS file starts at A-read id a and ends at A-read z, then each batch package has size s=(z-a+j-1)/j and batch package i contains reads [a+i*s,a+(i+1)*s)
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
maftobam converts alignments produced in the MAF format by PBsim to the BAM format. It requires one argument designating a FastA file containing the underlying reference. An example call is
maftobam ref.fasta <in.maf >out.bam
An optional second file name can be given to perform online reference id replacements. This is useful as PBsim does not output the name of the original reference sequence.
bamidrename processes a BAM file an replaces all read names after the scheme
L0/id/0_len
where id is the index of the alignment record (i.e. 0 for the first one, 1 for the second, etc.) and len is the length of the query sequence. This allows to identify the index of a record in the output file after the file has later been sorted to a different order.
generateperfectpiles generates alignments contained in perfect alignment piles from a set of read to reference alignments given in a BAM file. The alignments produced are output in DALIGNER's LAS format. The input BAM file is assumed to have geen generated by first running bamidrename on it and subsequently having been sorted by coordinate order (use for instance biobambam2's bamsort utility to obtain this order).
A sample call is
generateperfectpiles out.las in.bam
generateperfectpiles has a set of optional parameters which can be set on the command line:
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
checklas compares an aligner produced LAS file with an LAS file containing perfect piles only and prints a statstic comparing both for each read. A sample call is
checklas in.db perfectpiles.las in.bam in.las
It requires four arguments
- in.db: a Dazzler database on which perfectpiles.las and in.las are based
- perfectpiles.las: ground truth perfect alignment piles in LAS format
- in.bam: ground truth reads to reference alignments
- in.las: aligner generated data in LAS format
The file in.bam is expected to contain read names as produced by bamidrename. It has to be sorted by read name using biobambam2's bamsort utility.
Optional parameters are:
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
- --verbose: a higher value leads to more verbosity (default: 0)
- --minsiglen: minimum length of ground truth alignemtns considered (default 1000) The ground truth alignments file may contain arbitrarily short alignments because two reads overlap by a very small amount of bases on the reference. As such short alignments do not designate significant alignment events no real aligner will produce them.
- --tlow: minimum read id considered (by default all reads in the input file are considered)
- --high: maximum read id considered (by default all reads in the input file are considered)
- --mark: produce file in.las.mark.las in which ground truth alignments are marked using flag 2^30, any other alignments in in.las are copied as they are (i.e. no flag is added)
- --mis: produce file in.las.mis.las containing all ground truth read against read alignments which are not contained in in.las
- --keep: produce file in.las.keep.las containing all ground truth read against read alignment which are contained in in.las
- --seqdepth: average sequencing coverage
Statistic lines are printed on the standard output channel. A sample output line is
[R] 1941 missing 0 got 62 extra 0 erate 0.118001 U00096.2:409510,426938 0 1 1 1 180 regular
The columns are
- line id ([R])
- read id (1941)
- string missing
- number of missing ground truth alignments in in.las (0)
- string got
- number of ground truth alignments present in in.las (62)
- string extra
- number of extraneous alignments in in.las which are not true overlaps (0)
- string erate
- error rate of ground truth alignment (0.118001)
- alignment coordinates of ground truth mapping in format refid:from,to (U00096.2:409510,426938)
- minimum true rate of any trace block on read
- average true rate of any trace block on read
- maximum true rate of any trace block on read
- number of blocks average is based on
- line category "regular" or complete miss (regular). A read is marked as regular if in.las contains any alignments for it, no matter whether they are contained in the ground truth set or not. Otherwise it is a complete miss.
The true rate values are computed as follows. For each trace block (usually of size 100) the (at most) seqdepth best (i.e. lowest number of errors) aligning reads are collected. Then the number n_t of true (contained in the ground truth) and n_f of false alignments in this list of seqdepth are counted. Based on this a fraction f = n_t / (n_t+n_f) is computed which designates how many of the (at most) seqdepth minimum error alignments are true. The program then outputs the minimum, average and maximum over all trace blocks for the read. In essence the minimum column should be 1 for any reads not stemming from repetetive regions.
checkconsensus checks consensus accuracy by comparing consensus sequences to ground truth data. The program expects four arguments
- a sub command argument. The available sub commands are index, check, batchlist. These sub commands are explained below.
- ref.fasta: a FastA file containing the reference sequences
- reads.bam: a BAM file containing ground truth read to reference alignments
- reads_cons.fasta: a FastA file containing the consensus sequences
The names of the sequences in ref.fasta must match the sequence lines in the header of reads.bam and the sequences need to appear in the same order in ref.fasta and the header of reads.bam. The read names in reads.bam need to follow the scheme described for bamidrename. reads.bam needs to have been sorted to query name order using biobambam2's bamsort. The read names in reads_cons.fasta need to follow the scheme used in the about of daccord as described above.
Before any other sub commands can be run the program needs to be run with teh index sub command. This computes indices for the various input files which will be used by the other sub command.
If the program is to be run on a single node, then the check sub command can be used subsequently to perform the accuracy comparison. Optional arguments are
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
For larger read sets the checking can be distributed over the nodes of a compute cluster. To this end the program can be run with the batchlist sub command. This will produce a shell script containing program calls which need to be called to obtain the same result as if the check sub command would have been called. An example call is
checkconsensus -p4 batchlist ref.fasta reads.bam reads_cons.fasta
This will produce a shell script similar to
checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000000
checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000001
checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000002
checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000003
checkconsensus <opts> batchmerge ref.fasta reads.bam reads_cons.fasta
checkconsensus <opts> cleanup ref.fasta reads.bam reads_cons.fasta
using the sub commands batchprocess, batchmerge and cleanup (which should not be called directly). All the lines for batchprocess can be processed in parallel. After the batchprocess lines have been processed the batchmerge sub command will merge the resulting partial data and produce the output of the accuracy check. The cleanup sub command removes partial files produced by the batchprocess commands.
The batchlist sub command has the following optional parameters:
- -t: sets the number of threads used by each sub command in the shell script produced. By default this is the number of logical CPU cores detected
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
- -p: number of work packages produced, i.e. number of batchprocess calls in the resulting script.
The output of checkconsensus contains several types of lines:
- single read lines
- coverage lines
- global statistics line(s)
- EP lines
- EM lines
An example for a single read line is
6 AlignmentStatistics(matches=14732,mismatches=0,insertions=2,deletions=0,editdistance=2,erate=0.0001357405) AlignmentStatistics(matches=112436,mismatches=0,insertions=10,deletions=0,editdistance=10,erate=0.0000889316) [0,14731] 0:97757,112489
The columns are
- read id (6)
- accuracy of read fragment in comparison with ground truth (AlignmentStatistics(matches=14732,mismatches=0,insertions=2,deletions=0,editdistance=2,erate=0.0001357405))
- accumulated average accuracy for this read and reads with smaller read id
- interval on read corrected ([0,14731])
- reference region corrected fragment was aligned to for comparison in the format refid:from-to (0:97757,112489)
An example for a coverage line is
[C] 16 13645 13642 0.99978
The columns are
- line identifier ([C])
- read id (16)
- length of ground truth in reference bases (13645)
- number of bases covered by reconstruction fragments
- column four / column three (i.e. fraction of ground truth bases covered by reconstruction fragments)
An example of a global statistics line is
[G] 29460533 29447890 0.999571 AlignmentStatistics(matches=29447461,mismatches=195,insertions=2199,deletions=234,editdistance=2628,erate=0.0000892357)
The columns are
- line identifier ([G])
- reference base sum (29460533, sum over third columns of all coverage lines)
- covered base sum (29447890, sum over fourth columns of all coverage lines)
- column three / column two (0.999571, i.e. fraction of ground truth bases covered by reconstruction fragments)
- accumulated error statistics over all single read lines (AlignmentStatistics(matches=29447461,mismatches=195,insertions=2199,deletions=234,editdistance=2628,erate=0.0000892357))
An example of an EM line is
[EM] 0.0004741958 18 0.0091047
The columns are
- line identifier ([EM])
- error rate (0.0004741958)
- number of reads having this error rate compared to ground truth or higher (18)
- fraction of reads having this error rate compare to ground truth or higher (0.0091047)
An example of an EP line is
[EP] 0.0000485201 669 0.338392
The columns are
- line identifier ([EP])
- error rate (0.0000485201)
- number of reads having this error rate compared to ground truth or lower (669)
- fraction of reads having this error rate compare to ground truth or lower (0.338392)
sortfasta reads a FastA file from standard input, sorts it by read name and outputs the sorted data on standard output. Read names are compared in the same way as biobambam2's bamsort does for query name sorting.
mapconstoraw inserts corrected fragments into uncorrected reads by mapping the fragments onto the reads. It requires two arguments, both in FastA format:
- the uncorrected reads
- the corected read fragments
Both of these input files are required to have been sorted using the sortfasta program.
The read names in the uncorrected reads file need to follow the scheme
read_{id}
where {id} is a numerical id. A valid example would be
read_5
The read names in the corrected read fragments file need to follow the scheme
read_{id}_{subid}
where {id} and {subid} are numerical values. A valid example would be
read_5_0
which would designate the first fragment (sub id 0) of read 5. The sub ids for each read need to be consecutive and start from 0.
mapconstoraw outputs a FastA file with the mapped corrected fragments inserted into the uncorrected reads. Corrected fragments are in upper case, uncorrected regions in lower case.
mapconstoraw has the following optional parameters:
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
fillfasta processes two sorted FastA files and computes their union such that the second file takes precedence (i.e. if a read name is present in both files, then the data from the second file will be kept and the one from the first one discarded). The read names contained in the second file need to be a subset of the read names found in the first file. The read names in both files need to follow the scheme
read_{id}
where {id} is a numerical id. A valid example would be
read_5
computeextrinsicqv produces a Dazzler database track providing the edit distance between a source read and the consensus produced for that read for each block of size tspace on the source read. It has two mandatory arguments. The mandatory arguments are a consensus file (as produced by daccord) in FastA format and the name of a Dazzler database (.db/.dam) file. A valid example would be
computeextrinsicqv data_fcgr/reads_cons.fasta data_fcgr/reads.dam
split_agr performs agreement based read pile splitting for haplotypes and repeats. It expects four arguments
- out.las: the name of the output file, which will be written in the LAS file format
- cons.fasta: a read consensus file for the reads in the input database in FastA format as produced by daccord
- in.las: alignments for in.db as generated by DALIGNER
- in.db: input read database
In addition the -d (average sequencing depth) parameter is mandatory.
A sample call is
split_agr -d20 out.las cons.fasta in.las in.db
The program requires an extrinsic quality value track for the input database. This track can be computed using the computeextrinsicqv program.
The following options can be used (no space between option name and parameter allowed):
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -d: sequencing depth (no default)
- -D: maximum number of alignments processed per read (default 5000)
- -I: read interval. Given as -Ii,j . This means the program will process the read id interval [i,j).
- -J: batch interval. Given as -Ji,j for 0 <= i < j. The program will process batch id i out of j packages. More precisely, if the input LAS file starts at A-read id a and ends at A-read z, then each batch package has size s=(z-a+j-1)/j and batch package i contains reads [a+i*s,a+(i+1)*s)
- -E: sets the name of the error profile to be used. By default this is the name of the input LAS file with .eprof appended.
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
split_dis performs disagreement based read pile splitting for haplotypes and repeats. It expects four arguments
- out.las: the name of the output file, which will be written in the LAS file format
- cons.fasta: a read consensus file for the reads in the input database in FastA format as produced by daccord
- in.las: alignments for in.db as generated by DALIGNER
- in.db: input read database
In addition the -d (average sequencing depth) parameter is mandatory.
A sample call is
split_dis -d20 out.las cons.fasta in.las in.db
The following options can be used (no space between option name and parameter allowed):
- -t: sets the number of threads used. By default this is the number of logical CPU cores detected
- -d: sequencing depth (no default)
- -D: maximum number of alignments processed per read (default 5000)
- -p: phase threshold. Overlapping regions shorter than this will be disregarded (default 4000)
- -I: read interval. Given as -Ii,j . This means the program will process the read id interval [i,j).
- -J: batch interval. Given as -Ji,j for 0 <= i < j. The program will process batch id i out of j packages. More precisely, if the input LAS file starts at A-read id a and ends at A-read z, then each batch package has size s=(z-a+j-1)/j and batch package i contains reads [a+i*s,a+(i+1)*s)
- -E: sets the name of the error profile to be used. By default this is the name of the input LAS file with .eprof appended.
- -T: prefix for temporary files. By default temporary files will be created in the current directory. The program tries to create file names avoiding collisions with other program runs based on the name of the program, the host name it runs on and the process id.
- --drate: difference rate (default 0.01)
- --dcthres: splitting tuple count threshold (default computed using d parameter)
filterchain processes an input LAS file and filters it using a consensus fragment list as produced by daccord. For each pair of reads the program chains up alignments using a greedy algorithm. Chains which overlap with the respective consensus sequences by less than a given number of bases (-l parameter) are removed from the output.
It expects three arguments
- out.las: the name of the output file, which will be written in the LAS file format
- cons.fasta: a read consensus file for the reads in the input database in FastA format as produced by daccord
- in.las: alignments in LAS file format as generated by DALIGNER
The following options can be used (no space between option name and parameter allowed):
- -l: minimum chain length (default 4000)
A sample call is
filterchains -l5000 out.las cons.fasta in.las
wgsimtobam converts simulated reads produced by wgsim to a BAM file taking the positions into account and computing alignments between the reads and the designated reference region for each read.
It expects two arguments
- ref.fasta: the name of the FastA file used as a reference
- reads.fasta: FastA file obtained by converting a FastQ file produced by wgsim to FastA
A sample call is
wgsimtobam ref.fasta reads.fasta >reads.bam
The program treats the input file as single ended reads.