A standardized, reproducible pipeline to process WGBS bisulfite & EM-seq data. This goes from .fastq to methylation calls (via biscuit) and includes extensive QC and plotting, using a Snakemake pipeline.
At a high level, this pipeline reproducibly:
- Builds a reference genome
- Trims & (minimally) filters reads
- Aligns & calls methylation using biscuit
- Flags non-converted reads
- Generates standardized outputs & QC including
- FastQC
- fastp
- Biscuit QC
- samtools stats
- MethylDackel mbias plots
- Goleft covplots
- epibed/epiread files
- Runs multiqc across entire projects
- Clone this repo:
git clone https://github.com/semenko/serpent-methylation-pipeline.git
- Install mamba:
conda install -c conda-forge mamba
or mambaforge - Install snakemake:
mamba install -c bioconda -c conda-forge snakemake
- (Optionally) Test install the dependencies:
mamba env create -n test_env -f serpent-methylation-pipeline/workflow/envs/env.yaml
$ nice snakemake --cores 50 --use-conda --printshellcmds --rerun-incomplete --rerun-triggers mtime --keep-going --dry-run
After removing the --dry-run
flag, this will download reference genomes and build indices.
Raw data files from data are processed and analyzed by this snakemake workflow. Within each project directory, the output is (roughly) structured as:
SAMPLE_01/ # e.g. melanoma / crc / healthy, etc.
│ SAMPLE.bam # The final alignment file
| SAMPLE.bam.bai # (and its index)
|── biscuit_qc/ # biscuit QC.sh text files
|── epibeds/ # epibed files (bgzip-compressed & tabix-indexed)
├── fastp/ # fastp statistics & logs
├── fastqc/ # fastqc graphs
├── goleft/ # goleft coverage plots
├── logs/ # runlogs from each pipeline component
├── methyldackel/ # mbias plots
├── raw/
│ ├── ...fastq.gz # Raw reads
| ├── ...md5.txt # Checksums and validation
├── samtools/ # samtools statistics
SAMPLE_02/
...
...
multiqc/ # A project-level multiqc stats across all data
Note each project also has a _subsampled
directory with identical structure, which is the result of the pipeline run on only 10M reads/sample.
This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.
I chose GRCh38, with these specifics:
- No patches
- Includes the hs38d1 decoy
- Includes Alt chromosomes
- Applies the U2AF1 masking file
- Applies the Encode DAC exclusion
You can see a good explanation of the rationale for some of these components at this NCBI explainer.
All software requirements are specified in env.yaml.
Most are relatively common, but a few are semi-unique:
- biscuit (for alignment)
- NEB's mark-nonconverted-reads (to mark partially converted reads)
biscuit was chosen after a comparison with bwa-meth and bismark — its latest version was the most flexible with extremely well annotated .bams (some critical tags are missing from bwa-meth for identifying read level methylation, and would require patching MethylDackel to extract data).
I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.
I chose a relatively conservative approach to trimming -- which is needed due to end-repair bias, adaptase bias, and more.
For EMseq, I trim 10 bp everywhere, after personal QC and offline discussions with NEB. See my notes here.
For BSseq, I trim 15 bp 5' R2, and 10 bp everywhere else due to adaptase bias.
For all reads, I set --trim_poly_g
(due to two color bias) and set a --length_required
(minimum read length) of 10 bp.
Notably I do NOT do quality filtering here (I set --disable_quality_filtering
), and save this for downstream analyses as desired.
I experimented with more stringent quality filtering early on, and found it had little yield / performance benefit.
I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:
- TrimGalore's RRBS guide
- The Babraham WGBS/RRBS tutorials
For similar pipelines and inspiration, see:
- NEB's EM-seq pipeline
- Felix Krueger's Nextflow WGBS Pipeline
- The Snakepipes WGBS pipeline
Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png
)