This workflow is heavily based on https://github.com/seb-mueller/snakemake-bisulfite
by @seb-mueller
. I basically just streamlined/tailored the workflow to my needs.
The workflow is based on the following tools:
The separate installation of the tools is not necessary, they are installed 'on the fly' (see Usage below).
Snakemake
should be installed as outlined in its documentation for instance using conda
/mamba
. It is recommended to create a dedicated conda
environment for Snakemake.
The workflow can be equally used for the analysis of data derived from NEB's EMseq approach as well as data derived from WGBS (whole-genome bisulfite sequencing). Working with environmental microbes, we made the experience that the usage of commonly available bisulfite treatment kits lead to a strong loss of DNA. As a result, we ended up giving EMseq a try. For details about the methodology have a look at NEBs EMseq paper.
A reference genome stored in resources/
is bisulfite-treated in silico with bismark
. Paired-end sequencing data (stored in data/
) is subjected to quality-control and adapter-trimming using bbduk
. Quality reports are written using fastQC
before and after trimming.
Read pairs are subsequently mapped onto the bisulfite-treated genome, alignments with identical mapping positions are removed. Methylations are extracted for all three contexts (CpG, CHH, CHX) and used to generate a .bedGraph
and coverage file. The latter can be used for downstream methylation analysis.
The below DAG graph outlines the different processes of the workflow.
Start by cloning the repository and move into respective directory.
git clone https://github.com/wegnerce/smk_emseq.git
cd smk_emseq
Place paired sequence data (R{1,2}.fastq.gz) in data/
. The repository contains two pairs of exemplary files (La.1_R1.fastq.gz + La.1_R2.fastq.gz & Nd.1_R1.fastq.gz + Nd.1_R2.fastq.gz).
config/
contains, besides from the configuration of the workflow (config/config.yaml
), a tab-separated table samples.tsv
, which contains a list of all datasets, one per line. The workflow expects *.fastq.gz
files and R{1,2}
as prefixes for forward and reverse read files.
From the root directory of the workflow, processing the data can then be started.
# --use-conda makes sure that needed tools are installed based
# on the requirements specified in the respective *.yaml in /envs
snakemake --use-conda
The directory structure of the workflow is shown below:
├── config
│ ├── config.yaml
│ └── samples.tsv
├── dag.svg
├── data
│ ├── La.1_R1.fastq.gz
│ ├── La.1_R2.fastq.gz
│ ├── Nd.1_R1.fastq.gz
│ └── Nd.1_R2.fastq.gz
├── logs
├── LICENSE
├── README.md
├── resources
│ ├── adapters.fa
│ └── RHAL1_chromosome_plasmid.fa
├── results
└── workflow
├── envs
│ ├── bbmap.yaml
│ ├── bismark.yaml
│ ├── fastqc.yaml
│ └── samtools.yaml
├── rules
│ ├── bismark.smk
│ ├── qc.smk
│ └── sort.smk
└── Snakefile
Output from the different steps of the workflow are stored in /results
and /logs
.
The resulting *.cov.gz
files (results/04_coverage/*.cov.gz
) can be used for downstream methylation analysis.
©️ Carl-Eric Wegner, 2023