-
Notifications
You must be signed in to change notification settings - Fork 5
Tutorial
This page will walk you through how to use BetaScan, starting with vcf files and ending with Beta Scores. It uses the toolkit glactools for file format conversion. If you already have your data in BetaScan format, then you can skip the steps which use glactools.
The first decision you must make is if you are going to use an outgroup sequence. An outgroup sequence can be useful for two things:
- Calling ancestral alleles. An ancestral allele is the older nucleotide at a position (the one that was in the population before the mutation caused a polymorphism at that basepair). This is how the Beta1 statistic uses an outgroup.
- Calling substitutions. Substitutions are positions at which the nucleotide in the outgroup (e.g. chimpanzee) is different than the nucleotide in your species of interest (e.g. human). Doing both this and (1) is how the Beta2 statistic uses an outgroup.
Using an outgroup sequence to do (1) improves power a good amount, while using it to do (2) improves power slightly. So if you do have an outgroup available, we recommend using it, but if not then you will still be well equipped for a genome-wide scan for balancing selection.
The sections below walk through how to generate BetaScores for two situations: no outgroup (Beta1* statistic), or using an outgroup either to call ancestral alleles (Beta1 statistic), or using the outgroup to call both ancestral alleles and substitutions (Beta2 statistic).
If no outgroup is available, then we only need vcfs from your species of interest. As an example, we will use chromosome 6 from the 1000 genomes project, which contains the MHC region. The MHC contains many variants thought to be under balancing selection, so it's a good positive control.
We first obtain the 1000 genomes data. We need two files for glactools: the fasta file (human_g1k_v37.fasta.fai) which contains the reference genome, and the genotypes for the individuals (ALL.chr6.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz). Because the number of individuals is so large in the 1000 genomes project, we will also use vcftools to select a subset of individuals and a subset of the chromosome so that the file conversion steps in this tutorial don't take hours. If you don't have vcftools, you can skip using it and pass vcfm2acf the full vcf.gz file from 1000 genomes.
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget https://github.com/ksiewert/BetaScan/blob/master/AFRsubset_fortutorial.txt
vcftools --gzvcf ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --chr 6 --from-bp 1 --to-bp 40000000 --keep AFRsubset_fortutorial.txt --recode --mac 1 --out chr6.AFRsubset
We then run these files through glactools vcfm2acf to convert them to acf format. Then we use glactools acf2betascan to convert from acf format to betascan.
glactools vcfm2acf --onlyGT --fai human_g1k_v37.fasta.fai chr6.AFRsubset.recode.vcf > 1kg_chr6.AFRsubset.acf.gz
glactools acf2betascan --fold 1kg_chr6.AFRsubset.acf.gz > 1kg_chr6.AFRsubset.beta.txt
Or alternatively, the above commands all in one line. This allows you to generate betascan format without having intermediate files saved on your computer:
glactools vcfm2acf --onlyGT --fai <( wget -q -O /dev/stdout
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
) <(wget -q -O /dev/stdout
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chr6.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | vcftools --gzvcf - --chr 6 --from-bp 1 --to-bp 40000000 --keep AFRsubset_fortutorial.txt --recode --mac 1 --stdout )
| glactools acf2betascan --fold - > 1kg_chr6.beta.txt
Now, we have a BetaScan formatted file. If we look at the head of this file (1kg_chr6.beta.txt), it looks like this:
63979 1 20
63980 1 20
100100 2 20
100112 1 20
100114 1 20
100116 6 20
117271 4 20
143500 8 20
145008 3 20
145579 2 20
This section covers basepairs 63979-145579 of chr6. Allele frequencies range from singletons to 40%. There are 20 haploid genomes (i.e. 10 diploids) in the vcf file at these positions. You'll notice that in this file all frequencies are the minor allele frequency (the frequency of the less common allele). This is because Beta1 uses a folded/minor allele site frequency spectrum.
We next want to run BetaScan.
python BetaScan.py -i 1kg_chr6.beta.txt -fold -o 1kg_chr6.betascores.txt
The head of the output looks like:
Position Beta1*
63979 0.209713
63980 0.209713
100100 0.024024
100112 0.131487
100114 0.131487
100116 -0.68808
117271 0.0
143500 0.0
145008 0.0
The first column is the positions and the second column is the Beta1* scores. You'll notice that some of the scores are 0.0. This is because there are no SNPs within 500bp (the default window size on either side of the core SNP). If you sort the file by Beta Score, you'll see that the top scoring SNP is at position 32525747. This SNP is in the MHC region, which is reassuring.
For more details on running BetaScan see the input format section of the wiki and the output section of the wiki.
We first obtain the 1000 genomes data as before. To call ancestral alleles we also need to have outgroup information. To do this glactools uses an epo file, which gives information about variant states in primate outgroups.
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
wget https://sid.erda.dk/share_redirect/f2hHSEpl8w -O all.epo.gz
wget https://sid.erda.dk/share_redirect/et5TpffnmS -O all.epo.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget https://github.com/ksiewert/BetaScan/blob/master/AFRsubset_fortutorial.txt
vcftools --gzvcf ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --chr 6 --from-bp 1 --to-bp 40000000 --keep AFRsubset_fortutorial.txt --recode --mac 1 --out chr6.AFRsubset
Next, we use these files to generate BetaScan format. The --useanc flag will cause glactools to use the chimpanzee information from the epo file to call ancestral/derived alleles and substitutions.
glactools vcfm2acf --onlyGT --epo all.epo.gz --fai human_g1k_v37.fasta.fai chr6.AFRsubset.recode.vcf > 1kg_chr6.AFRsubset.acf.gz
glactools acf2betascan --useanc 1kg_chr6.AFRsubset.acf.gz | gzip > 1kg_chr6.beta.gz
You'll notice that unlike for the folded scan, some of the SNPs in this file have frequency over 50%. This is because the derived allele (the nucleotide not found in chimp) is at higher frequency than the ancestral allele. Finally, to run BetaScan with this unfolded data the command is:
python BetaScan.py -i 1kg_chr6.beta.gz -o 1kg_chr6.betascores.txt
This tutorial covers only Beta scans with very few modifications to default parameters. We recommend looking into the -thetaMap and -std options if you want to standardize Beta (i.e. divide by the standard deviation). We also highly recommend changing the window size using the -w flag if your species has a different recombination rate than humans. For more information on these options and guidance on their values, see the Usage page of the wiki.