In this example, we want to model a cross of two diploid yeast strains. Depending on which sequence data you have available, you can take multiple approaches depending on which genome data you have available. In this document we will demonstrate a workflow that makes us of variant call files, and a workflow that uses whole genome alignment. In both cases we study two beer yeasts from the 1002 Yeast Genome collection: Orval trappist yeast strain DBVPG6695 from Belgium and American ale strain 1.3_Safale_US05 (codenames BRQ and CFD).
First we set up a new gen repository to store and track our sequences. For this project we don't anticipate using more than one gen database file, and also don't need more than one collection. By setting defaults we won't have to enter these details in future commands.
bvh@mbp:~$ gen init
Gen repository initialized.
bvh@mbp:~$ gen defaults --database breeding_experiment.db --collection genome
Default database set to breeding_experiment.db
Default collection set to genome
The dataset we are working with includes variant call files that were generated by aligning sequencing reads to the S288C reference genome, specifically version R64-1-1. We'll start by downloading this reference sequence from the Saccharomyces Genome Database (SGD) and extracting the archive.
bvh@mbp:~$ wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz
bvh@mbp:~$ tar -xzvf S288C_reference_genome_R64-1-1_20110203.tgz
Let's add the reference genome sequence to our gen repository
bvh@mbp:~$ gen import --fasta S288C_reference_genome_R64-1-1_20110203\/S288C_reference_sequence_R64-1-1_20110203.fsa
Created it
The variant calls themselves are stored in a large Genomic VCF (GVCF) file that includes variants from all samples in the dataset. We download a compressed version of this file from the the 1002 Yeast Genomes project, and then use bcftools to extract the two samples we are interested in.
bvh@mbp:~$ wget http://1002genomes.u-strasbg.fr/files/1011Matrix.gvcf.gz
We then use the bcftools application to extract just the samples we want. To do this, we must first create a text file that contains the sample identifiers separated by a line break.
bvh@mbp:~$ bcftools view -s BRQ -o BRQ.vcf 1011Matrix.gvcf.gz
bvh@mbp:~$ bcftools view -s CFD -o CFD.vcf 1011Matrix.gvcf.gz
Unfortunately these variant call files refer to the chromosomes by their number, whereas the reference sequence listed them by their accession identifier. We rename them in the VCF file using bcftools with a text file called contig_mapping.txt that has the following contents:
chromosome1 ref|NC_001133|
chromosome2 ref|NC_001134|
chromosome3 ref|NC_001135|
chromosome4 ref|NC_001136|
chromosome5 ref|NC_001137|
chromosome6 ref|NC_001138|
chromosome7 ref|NC_001139|
chromosome8 ref|NC_001140|
chromosome9 ref|NC_001141|
chromosome10 ref|NC_001142|
chromosome11 ref|NC_001143|
chromosome12 ref|NC_001144|
chromosome13 ref|NC_001145|
chromosome14 ref|NC_001146|
chromosome15 ref|NC_001147|
chromosome16 ref|NC_001148|
chromosome17 ref|NC_001224|
bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt BRQ.vcf -o BRQ_2.vcf
bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt CFD.vcf -o CFD_2.vcf
Because Gen also uses the sample names from the VCF file to create new block groups, or find an existing block group to update, we need to rename the samples as well so that the variants will be loaded into one sample called F1. We again use bcftools to do this, using a text file sample_mapping.txt with the following contents:
BRQ F1
CFD F1
bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o BRQ3.vcf BRQ2.vcf
bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o CFD3.vcf CFD2.vcf
We can now import the individual VCF files into gen, which will then use the chromosome/contig and sample names in the VCF file to find the appropriate block groups to update or create.
bvh@mbp:~$ gen update --vcf BRQ3.vcf
bvh@mbp:~$ gen update --vcf CFD3.vcf
Lastly, we export the F1 sample to a GFA file that can be used by graph-aware sequence aligners.
bvh@mbp:~$ gen export --gfa cross.gfa
First we download the genome assembly archive and extract the genome sequence for the strains we want to mate.
wget http://1002genomes.u-strasbg.fr/files/1011Assemblies.tar.gz
tar -xzf 1011Assemblies.tar.gz GENOMES_ASSEMBLED/BRQ_4.re.fa GENOMES_ASSEMBLED/CFD_4.re.fa
Then we use Cactus to create a graph via whole genome alignment. The Comparative Genomics Toolkit provides a Docker container that includes all the infrastructure needed to do so. We launch the container and bind our working directory to the /data directory in the container:
bvh@mbp:~$ docker run --mount type=bind,source=$PWD,target=/data -it quay.io/comparative-genomics-toolkit/cactus:v2.9.0 bash
We tell Cactus which genomes to align by providing a text file containing names and file paths. You can set this up manually, or generate it for all fasta files in our current directory and subdirectories:
bvh@mbp:~$ find . -type f -name "*.fa" -printf "%f %p\n" | awk '{ sub(/\..*$/, "", $1); print $1, $2 }' > file_list.txt
Then we can start the Cactus pipeline using the command list below. ./js
specifies where intermediate files are stored,
outName and outDir influence the pipeline, and --gfa
tells Cactus to output a GFA file. While this is a graph based
alignment, we must still designate one genome as reference, in this case we pick BRQ.
bvh@mbp:~$ cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outName cross --reference BRQ_4 --gfa
After this completes (approximately 10 minutes), we can load it back into gen by running:
bvh@mbp:~$ gen import --gfa cactus_output/cross.gfa