Skip to content

Latest commit

 

History

History
231 lines (157 loc) · 7.57 KB

beagle3xVCF.md

File metadata and controls

231 lines (157 loc) · 7.57 KB

Beagle3 to VCF Conversion

By Pavel Salazar-Fernandez ([email protected])

Human Population and Evolutionary Genomics Lab | LANGEBIO

About

Documentation: BEAGLE 3, VCF [1000 Genomes]

This document explains how to convert a PHASED file (Beagle3) to VCF (Variant Call Format).

WARNING: The conversion will de-phase the input file.

Conversion

PHASED to VCF

Program:

Requires:

  • PHASED file for each chromosome

Command: fcgene --bgl [FILE].phased --oformat vcf --out [OUTPUT]

Output:

  • [FILE].vcf

WARNING: This conversion has some bugs, the resulting VCF file must be fixed before using it.

Fixing the header

The output VCF may have some errors in its header that will not be allowed for some software. Use the following commands to correct them.

  • Missing: contig line awk '/##source=fcGENE/ { print; print "##contig=<ID=[chr]",assembly=b37>"; next }1' [FILE].vcf > [FILE].fix.vcf

  • Switched FORMAT and FILTER tags: sed -i 's/##FILTER=<ID/##FORMAT=<ID/1' [FILE].fix.vcf sed -i 's/##FORMAT=<ID=PASS/##FILTER=<ID=PASS/1' [FILE].fix.vcf

Output:

  • [FILE].fix.vcf

Adding position info

Since PHASED does not save any position info, the chromosome and position must be re-supplied to the generated VCF file. This requires a SNP list.

SNP List

Source: dbSNP FTP readme

A table containing a list of SNPs for each chromosome with their position. This information can be downloaded from dbSNP:

dbSNP: human, build 147, GRCh37

The downloaded files can be reduced to simpler table like this one:

1   1000  rs123
1   2000  rs456
1   3000  rs789

Command: awk '{if($8 ~ /^NT_/ && $12 > 0)print$7,$12,"rs"$1 | "sort -n -k2"}' OFS="\t" [chr].txt > [chr].cprs.txt

Procedure

Program:

  • awk

Requires:

  • Converted VCF file
  • SNP list per chromosome (see above)

Command: awk 'NR==FNR{a[$3]=$2;next}($1 ~ /^#/){print;next}($3 in a){$1=[chr];$2=a[$3];print | "sort -n -k2"}' FS="\t" OFS="\t" [chr].cprs.txt [FILE].fix.vcf > [FILE].cprs.vcf

  • NR==FNR{a[$3]=$2;next}: Generates index from first file.
  • ($1 ~ /^#/){print;next}: Prints headers without any modification.
  • ($3 in a){$1=[chr];$2=a[$3];print | "sort -n -k2"}: If the rsID was found in the index, modifies the CHROM and POS columns, and then sorts SNPs by position in increasing order.

Warning: this command will erase any SNP left with no position info.

Output:

  • [FILE].cprs.vcf

Auto-conversion

This script loops through all the autosomic chromosome files and fixes the beagle outputs. Customize it for your needs:

for i in {1..22}; do
  printf "\nConverting chr$i ...\n"
  ./fcgene --bgl [file].chr$i.phased --oformat vcf --out [file].chr$i.raw
  gunzip [file].chr$i.raw_vcf.gz
  mv [file].chr$i.raw_vcf [file].chr$i.raw.vcf
  rm [file].chr$i.raw_*
  
  printf "\nFixing chr$i ...\n"
  awk -v chr=$i '/##source=fcGENE/ { print; print "##contig=<ID="chr",assembly=b37>"; next }1' "[file].chr$i.raw.vcf" > temp.vcf
  sed -i 's/##FILTER=<ID/##FORMAT=<ID/1' temp.vcf
  sed -i 's/##FORMAT=<ID=PASS/##FILTER=<ID=PASS/1' temp.vcf
  awk -v chr=$i 'NR==FNR{a[$3]=$2;next}($1 ~ /^#/){print;next}($3 in a){$1=chr;$2=a[$3];print | "sort -n -k2"}' FS="\t" OFS="\t" "chr_$i.cprs.txt" temp.vcf > "[file].chr$i.fix.vcf"
  rm temp.vcf
  
  printf "\nFinished chr$i ...\n"

done

Subset VCF

After converting from Beagle3 to VCF and , software designed for

Program:

Requirements:

  • VCF file
  • List of sample IDs

Command: bcftools view [FILE].vcf -S samples.ids -o [FILE].sub.vcf -M2 -c 2:minor -O v

  • view: bcftools mode used for extracting data.
  • -S [file]: reads a list of IDs (one ID per line) and keepsonly those individuals.
  • -M2: Only outputs biallelic SNPs.
  • -c 2:minor Discards monomorphic SNPs (minor allele count = 0) and singletons (minor allele count = 1)
  • -O: Outputs to VCF

Output: [FILE].sub.vcf

Re-phasing

Basic Phasing

The basic mode of SHAPEIT, fast and requires little input, but more prone to errors.

Requirements:

  • Converted VCF file (fixed and with SNP positions)
  • Genetic map

Command: shapeit -V [FILE].cprs.vcf -M genetic_map.[chr].txt -O [FILE].ph.vcf

  • -V: VCF file to be phased
  • -M: Genetic map file for that chromosome
  • -O: Output name

Output:

  • [FILE].phs.haps
  • [FILE].phs.sample

Convert to VCF

To output SHAPEIT results to a VCF (again), use:

shapeit -convert --input-haps [FILE] --output-vcf [FILE].vcf

Phasing with Reference

Source: SHAPEIT: Reference

Since the output VCF lost all phasing information during the conversion, it is necessary to re-phase the data using a reference panel.

Requirements:

NOTE: check all the data is in the same reference genome build (NCBI b37)

It is recommended to test the files before attempting the phasing.

Command: shapeit -check --input-vcf [FILE].cprs.vcf -M genetic_map.txt --input-ref reference.haplotypes.gz reference.legend.gz reference.sample --output-log [FILE].aligments

Alignment Errors

SHAPEIT will produce an error warning when:

  • The input file and the reference have incompatible alleles (strand issues).
  • The input file has SNPs not present in the reference.

Two log files will be created:

  • myLogFile.snp.strand: describes in detail all the problems found.
  • myLogFile.snp.strand.exclude: a list of the physical positions of all problems

If this problem occurs, add to the command above the following flag: --exclude-snp myLogFile.snp.strand.exclude

Using specific groups

echo 'XYZ' > grp.list

Start Reference Phasing

If the files pass the previous tests, start the phasing process.

Command: shapeit --input-vcf [FILE].cprs.vcf -M genetic_map.txt --input-ref reference.haplotypes.gz reference.legend.gz reference.sample -O [FILE].refphs

Optional flags:

  • --exclude-snp myLogFile.snp.strand.exclude
  • --include-grp grp.list or --include-grp grp.list

Output:

  • [FILE].refphs.haps
  • [FILE].refphs.sample

Convert to VCF

To output SHAPEIT results to a VCF (again), use:

shapeit -convert --input-haps [FILE] --output-vcf [FILE].vcf

Concatenation

Documentation: bcftools concat

If the dataset is split in two or more VCF files (e.g. in chromosomes), it is necessary to concatenate them to create a single fused file. This procedure will create a compressed VCF file (.vcf.gz).

Program:

  • bcftools

Prerequisites:

  • All source files must have the same sample columns appearing in the same order.
  • Input files must be sorted by chromosome and positions.
  • Files must be given in the correct order.
  • (Optional) A file containing a ordered list of the input files

Command: bcftools concat -f [VCF files list] -o [Output file name] -O z

Output:

  • [FILE].vcf.gz