GitHub - KPU-AGC/ddRAD-seq: A repository for performing double-digest Restriction Associated DNA sequencing (ddRAD-seq) for reduced representation sequencing.

ddRAD-seq

A repository for performing double-digest Restriction Associated DNA sequencing (ddRAD-seq) for reduced representation sequencing.

Background

Double digest restriction-site associated DNA sequencing (ddRAD-seq) involves the use of two restriction enzymes to digest genomic DNA and create fragments of different sizes. The fragments are then size-selected to include only those in a specific length range. This selective process ensures that only a subset of the genome is sequences which reduces complexity and cost. Adapters are ligated to the size-selected fragments such that there is a different adapter for the overhang produced by the respective restriction enzyme. These are then amplified by PCR and sequenced using next-generation sequencing technology.

This method allows for reduced representation sequencing, which is a cost-effective alternative to whole genome sequencing. By using ddRAD-seq, we can identify single nucleotide polymorphisms (SNPs) without sequencing the entire genome. This technique provides control over the fraction of the genome sequenced, making it suitable for lower throughput applications. This approach allows for strain-level tracking and population structure analysis, achieving the necessary genetic insights within certain budget and technical constraints.

Figure. Diagram of ddRAD-seq workflow showing the construction of adapter ligation on double-digested fragments¹.

Objective

To simulate double digestion to identify the optimal pair of enzymes for performing ddRAD-seq on a given reference genome or simulated genome given genome length and GC content.

Requirements

The recommended way to install dependencies is via conda using the environment.yml provided in the root directory of this repository.

The following programs and packages should be installed.

julia
parallel
fastx
argparse

Usage

1A. Generating restriction enzyme statistics from a reference genome.

This method is probably the most reliable. With a very solid reference, this provides consistent cut sites and accurate prediction of restriction fragment loci.

(recommended) Put the reference genome in the directory provided.
Run the following command to generate statistics for a single run of double digestion of specific enzymes.

REF="/path/to/reference_genome.fa"
ENZYME_1="EcoRI"
ENZYME_2="MseI"
julia src/jl-simRAD.jl use-ref ${REF} ${ENZYME_1} ${ENZYME_2}

# Or with size-selection (300 bp < x < 600 bp):
julia src/jl-simRAD.jl use-ref ${REF} ${ENZYME_1} ${ENZYME_2} -m 300 -M 600

1B. Generating restriction enzyme statistics without a reference genome.

If no reference genome is available, this program has the capability to simulate a genome with a given length and GC content. Compared to a stable reference genome, the results of this method are very inconsistent and should be run multiple times to be more accurate to the mean number of fragments produced.

Run the following command to generate statistics for a single run of double digestion of specific enzymes, similar to above.

GENOME_SIZE=46600000
GC_CONTENT=46
ENZYME_1="EcoRI"
ENZYME_2="MseI"
julia src/jl-simRAD.jl no-ref ${GENOME_SIZE} ${GC_CONTENT} ${ENZYME_1} ${ENZYME_2}

# Or with size-selection (300 bp < x < 600 bp):
julia src/jl-simRAD.jl no-ref ${GENOME_SIZE} ${GC_CONTENT} ${ENZYME_1} ${ENZYME_2} -m 300 -M 600

2. Running simulations of enzyme combinations in parallel.

To efficiently accomplish the goal of checking enzyme combinations, this program should be run in parallel using GNU parallel. There is a script provided in src/ which takes an input list of enzymes and outputs a list of all unique combinations without replacement.

# 1. For use with a reference genome:
REF="/path/to/reference_genome.fa"
ENZYMES_LIST="/path/to/enzymes_list.txt"
./src/generate_combinations.sh ${ENZYMES_LIST} | parallel --colsep '\t' --progress \
    'julia src/jl-simRAD.jl use-ref '"${REF}"' {1} {2} > output/{1}-{2}.csv'

# 2. For use with a simulated genome of 46.6 Mb, GC content of 46%, and with 10 repetitions:
GENOME_SIZE=46600000
GC_CONTENT=46
ENZYMES_LIST="/path/to/enzymes_list.txt"
for i in $(seq 1 10); do
    mkdir -p gen_output/${i}
    ./src/generate_combinations.sh ${ENZYMES_LIST} | parallel --colsep '\t' --progress \
        'julia src/jl-simRAD.jl no-ref '"${GENOME_SIZE}"' '"${GC_CONTENT}"' {1} {2} > gen_output/'"${i}"'/{1}-{2}.csv'
done

3. Aggregating results.

By default, results are output in .csv format for easy processing with R and other programs.

# 1. For reference-guided digestions:
find output -name "*.csv" -print | xargs cat > results.csv
# or honestly cat output/*.csv also works 

# 2. For reference-free digestions with multiple repetitions:
find gen_output -name "*.csv" -print | xargs cat  > results.csv

# And use this to add a header to the output.
sed -i '1s/^/enzymes,n_bases_covered,n_fragments_post_filter,n_fragments_pre_filter,genome_coverage\n/' results.csv

4. Output in GFF3 format.

When performing restriction digest with a given reference, it may be useful to see where restriction digestion occurs relative to genomic elements like genes or CpG islands. This makes the most sense, I think once the respective combination of restriction fragments has already been chosen.

REF="/path/to/reference_genome.fa"
ENZYME_1="EcoRI"
ENZYME_2="MseI"
julia src/jl-simRAD.jl use-ref ${REF} ${ENZYME_1} ${ENZYME_2} --gff3 true --csv false --pretty false > results.gff3

References

Liu, Michael & Worden, Paul & Monahan, Leigh & DeMaere, Matthew & Burke, Catherine & Djordjevic, Steven & Charles, Ian & Darling, Aaron. (2017). Evaluation of ddRADseq for reduced representation metagenome sequencing. PeerJ. 5. e3837. 10.7717/peerj.3837. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
ref		ref
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
enzyme_list.txt		enzyme_list.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ddRAD-seq

Background

Objective

Requirements

Usage

1A. Generating restriction enzyme statistics from a reference genome.

1B. Generating restriction enzyme statistics without a reference genome.

2. Running simulations of enzyme combinations in parallel.

3. Aggregating results.

4. Output in GFF3 format.

References

About

Releases

Packages

Languages

License

KPU-AGC/ddRAD-seq

Folders and files

Latest commit

History

Repository files navigation

ddRAD-seq

Background

Objective

Requirements

Usage

1A. Generating restriction enzyme statistics from a reference genome.

1B. Generating restriction enzyme statistics without a reference genome.

2. Running simulations of enzyme combinations in parallel.

3. Aggregating results.

4. Output in GFF3 format.

References

Footnotes

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages