The purpose of this tool is to provide a simplified way to examine and explore coverage and mutations in sequence alignments.
The alignment-viewer tool was built around the metagenome dataset of Lake Washington sediment microbes used in Janet Matsen's dissertation, which can be found here. The tool is built as a Jupyter notebook to allow visual exploration of different genes and regions.
The tool looks for one directory per metagenome/reference pair.
Each data directory (base_directory/referenceID/sampleID/) must contain at least:
- a .coverage.bed file containing coverage depth data along the genome
- a .vcf file containing sequence variant data along the genome
Additionally, the base directory must contain two .csv reference files:
- aligned_isolate_genomes.csv: contains the human-readable names of the reference genomes and the names of the corresponding directories
- fastq_sample_lookup.csv: contains metadata about the metagenomes included in the analysis
See the miscellaneous directory for examples of the .csv reference files.
Details on generating these files and the directory organization can be found below, and in the documentation for the bash scripts in the scripts directory.
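As a quick sanity check of the requirements above, the following minimal sketch reports which required files are absent. The function names are hypothetical and not part of the notebook; the only things taken from this README are the two file suffixes and the two .csv file names.

```python
from pathlib import Path

# Suffixes and file names come from the requirements listed above.
REQUIRED_SUFFIXES = (".coverage.bed", ".vcf")
REQUIRED_CSVS = ("aligned_isolate_genomes.csv", "fastq_sample_lookup.csv")


def missing_in_data_directory(data_dir):
    """Return the required suffixes that no file in data_dir ends with."""
    names = [p.name for p in Path(data_dir).iterdir() if p.is_file()]
    return [s for s in REQUIRED_SUFFIXES if not any(n.endswith(s) for n in names)]


def missing_reference_csvs(base_directory):
    """Return the required .csv reference files absent from the base directory."""
    base = Path(base_directory)
    return [name for name in REQUIRED_CSVS if not (base / name).is_file()]
```

An empty list from either function means that level of the layout looks complete.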
dependencies:
The notebook depends on pybedtools, among other packages. I found some dependency conflicts on my machine between samtools and pybedtools. A .yml file with the specifications for the environment that worked for me can be found in the miscellaneous directory.
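Before opening the notebook, it can help to confirm that the active environment actually exposes the tools mentioned here. The sketch below is an illustration only: check_dependencies is a hypothetical helper, and the default package and executable names are assumptions drawn from the tools this README names, not a complete dependency list.

```python
import importlib.util
import shutil


def check_dependencies(python_packages=("pybedtools",),
                       executables=("samtools", "bedtools")):
    """Report which top-level Python packages and command-line tools are visible."""
    return {
        # find_spec returns None when a top-level package cannot be imported.
        "python": {name: importlib.util.find_spec(name) is not None
                   for name in python_packages},
        # shutil.which returns None when the executable is not on PATH.
        "executables": {name: shutil.which(name) is not None
                        for name in executables},
    }
```

Any False entry suggests the environment differs from the one described in the .yml file.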
data organization:
Organized data directories were generated using the generate_align_tasklist.sh script that can be found in the scripts directory. The directories should be organized in a nested fashion as follows:
base_directory
|   aligned_isolate_genomes.csv
|   fastq_sample_lookup.csv
|
└───referenceID (directory named with reference genome ID)
    |
    └───sampleID (directory named with metagenome sample ID)
            sampleID_referenceID.coverage.bed
            sampleID_referenceID.vcf
            sampleID_referenceID.extension (other data files, such as .bam)
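The naming convention in the tree above can be expressed as a small path-building sketch. Both functions are hypothetical helpers written for illustration, assuming only the base_directory/referenceID/sampleID nesting and the sampleID_referenceID file prefix shown in the tree.

```python
from pathlib import Path


def expected_file(base_directory, reference_id, sample_id, extension):
    """Build base_directory/referenceID/sampleID/sampleID_referenceID<extension>."""
    filename = f"{sample_id}_{reference_id}{extension}"
    return Path(base_directory) / reference_id / sample_id / filename


def list_pairs(base_directory):
    """Yield (reference_id, sample_id) for every nested sample directory."""
    for ref_dir in sorted(Path(base_directory).iterdir()):
        if not ref_dir.is_dir():
            continue  # skips the .csv reference files in the base directory
        for sample_dir in sorted(ref_dir.iterdir()):
            if sample_dir.is_dir():
                yield ref_dir.name, sample_dir.name
```

For example, expected_file("base_directory", "refA", "s1", ".vcf") points at base_directory/refA/s1/s1_refA.vcf.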
file generation:
Alignment output files (.coverage.bed & .vcf) were generated using the align_coverage_call.sh script that can be found in the scripts directory.
This workflow runs the following steps:
- uses bwa to run a Burrows-Wheeler alignment, outputting a .sam file
- uses samtools to convert the .sam to a .bam file
- uses samtools to sort the .bam file, outputting a .sorted.bam file
- uses samtools to index the .sorted.bam file
- uses bedtools to calculate coverage across the alignment, outputting a .coverage.bed file
- uses bcftools to calculate sequence variants, outputting a .vcf file
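The six steps above can be sketched as a list of shell commands assembled in Python. This is not the content of align_coverage_call.sh: the subcommands and flags (bwa mem, bedtools genomecov -d, bcftools mpileup piped into bcftools call -mv) are assumptions about one common way to run each step, and the function name is hypothetical. The sketch also assumes the reference has already been indexed with bwa index.

```python
import shlex


def alignment_commands(reference_fasta, reads_fastq, prefix):
    """Return one shell command per workflow step, in order, without running them."""
    ref, reads, out = (shlex.quote(str(x))
                       for x in (reference_fasta, reads_fastq, prefix))
    return [
        # 1. align reads to the reference, writing SAM to stdout
        f"bwa mem {ref} {reads} > {out}.sam",
        # 2. convert SAM to BAM
        f"samtools view -b -o {out}.bam {out}.sam",
        # 3. sort the BAM by coordinate
        f"samtools sort -o {out}.sorted.bam {out}.bam",
        # 4. index the sorted BAM
        f"samtools index {out}.sorted.bam",
        # 5. per-base depth (assumed flag choice; the script may differ)
        f"bedtools genomecov -ibam {out}.sorted.bam -d > {out}.coverage.bed",
        # 6. call variants, keeping variant sites only
        f"bcftools mpileup -f {ref} {out}.sorted.bam | bcftools call -mv -o {out}.vcf",
    ]
```

Printing the list for a given sample gives a dry run that can be compared against the actual script before anything is executed.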
running the notebook:
- Import packages
- Point the notebook to the correct base directory for the data by editing the base_directory global variable
- Run the code cell to define the core functionality
- Render the user input widgets and use them to identify the region of interest
- Check that the .coverage.bed and .vcf files exist in the expected location for the data of interest
- Use the notebook to render the visualization and explore the data of interest
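Outside the notebook, the variants in a region of interest can also be inspected with a few lines of plain Python. The sketch below is not the notebook's implementation; variants_in_region is a hypothetical helper that relies only on the standard VCF layout (tab-separated records, header lines starting with "#", and CHROM, POS, ID, REF, ALT as the first five columns).

```python
def variants_in_region(vcf_path, chrom, start, end):
    """Return (pos, ref, alt) for records on chrom with start <= pos <= end (1-based)."""
    hits = []
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and column header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # skip blank or malformed lines
            record_chrom, pos, _id, ref, alt = fields[:5]
            if record_chrom == chrom and start <= int(pos) <= end:
                hits.append((int(pos), ref, alt))
    return hits
```

The chrom argument must match the sequence name used in the reference genome that the sample was aligned against.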