
Automated genome preparation for single or dual hybrid strains

@FelixKrueger FelixKrueger released this 18 May 16:08
· 187 commits to master since this release

SNPsplit


  • Changed sorting command for BAM files to also work with Samtools versions 1.3+
  • The sorting report for single-end files is now also written to the report files.
  • Added the # of SNPs used for the allele-discrimination to the report file to make it easier to spot errors
  • CR and LF line endings are now removed when reading in the SNP file. With SNP annotation files copied from a Windows machine, no allele-specific reads were assigned to genome 2 at all, because an invisible \r character was attached to the SNP call
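
To illustrate the line-ending fix, here is a minimal Python sketch (SNPsplit itself is not written in Python, and this is not its actual code). It assumes a tab-separated SNP file whose last column holds the Ref/SNP pair, e.g. `C/T`; the function name `parse_ref_snp` and that column layout are assumptions made for illustration only.

```python
def parse_ref_snp(line: str) -> tuple[str, str]:
    """Return (ref, snp) from the last tab-separated column of a SNP line.

    Hypothetical helper: the column layout is an assumption, not taken
    from the SNPsplit source.
    """
    # Strip both CR and LF so a Windows "\r\n" ending cannot leak into the data.
    clean = line.rstrip("\r\n")
    ref, snp = clean.split("\t")[-1].split("/")
    return ref, snp


# With a Windows-style line ending the SNP base is still a clean "T".
windows_line = "id1\t5\t100\t1\tC/T\r\n"
print(parse_ref_snp(windows_line))

# Stripping only "\n" would leave "T\r", which never matches a read base "T".
naive_snp = windows_line.rstrip("\n").split("\t")[-1].split("/")[1]
print(repr(naive_snp))
```

The second print shows why genome 2 received no reads: the SNP base carried a trailing `\r` and so never compared equal to a real nucleotide.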

SNPsplit_genome_preparation


Added entirely new functionality to construct single- or dual-hybrid genomes from VCF files, which are obtainable from the Mouse Genomes Project (http://www.sanger.ac.uk/science/data/mouse-genomes-project). Here is a brief description of what it does:

SNPsplit_genome_preparation is designed to read in a variant call file from the Mouse Genomes Project (e.g. this latest file: ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz) and generate new genome versions where the strain SNPs are either incorporated into the new genome (full sequence) or masked by the ambiguity nucleobase 'N' (N-masking).

SNPsplit_genome_preparation may be run in two different modes:

Single strain mode:

  1. The VCF file is read and filtered for high-confidence SNPs in the strain specified with --strain <name>
  2. The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated either as N-masking (default) or full sequence (option --full_sequence)
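
The difference between N-masking and full-sequence incorporation can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the tool's own implementation: the function name `incorporate_snps`, the 1-based position dictionary, and the reference-base sanity check are all hypothetical.

```python
def incorporate_snps(sequence: str, snps: dict, full_sequence: bool = False) -> str:
    """Apply SNPs to one chromosome sequence.

    snps maps a 1-based position to a (ref_base, alt_base) pair.
    By default each SNP position is replaced by 'N' (N-masking);
    with full_sequence=True the strain's alternative base is written instead.
    """
    bases = list(sequence)
    for pos, (ref, alt) in snps.items():
        idx = pos - 1  # convert the 1-based position to a 0-based index
        # Sanity check (an assumption of this sketch): the genome must
        # carry the expected reference base at this position.
        if bases[idx].upper() != ref.upper():
            raise ValueError(f"reference mismatch at position {pos}")
        bases[idx] = alt if full_sequence else "N"
    return "".join(bases)


reference = "ACGTACGT"
strain_snps = {2: ("C", "T"), 5: ("A", "G")}

print(incorporate_snps(reference, strain_snps))                      # ANGTNCGT
print(incorporate_snps(reference, strain_snps, full_sequence=True))  # ATGTGCGT
```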

Dual strain mode:

  1. The VCF file is read and filtered for high-confidence SNPs in the strain specified with --strain <name>
  2. The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated as full sequence and optionally as N-masking
  3. The VCF file is read one more time and filtered for high-confidence SNPs in strain 2 specified with --strain2 <name>
  4. The filtered high-confidence SNP positions of strain 2 are incorporated as full sequence and optionally as N-masking
  5. The SNP information of strain and strain 2 relative to the reference genome build are compared, and a new Ref/SNP annotation is constructed whereby the new Ref/SNP information will be Strain/Strain2 (and no longer the standard reference genome strain Black6 (C57BL/6J))
  6. The full genome sequence given with --strain <name> is read into memory, and the high-confidence SNP positions between Strain and Strain2 are incorporated as full sequence and optionally as N-masking
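
The comparison in step 5 can be pictured with the following Python sketch. It rests on assumptions: each strain's SNPs are held as a dictionary keyed by (chromosome, position) with (ref, alt) values relative to the reference build, a strain without a SNP at a position carries the reference base there, and positions where both strains carry the same base are dropped. The function name `compare_strains` is hypothetical, and the real tool may treat edge cases differently.

```python
def compare_strains(strain1: dict, strain2: dict) -> dict:
    """Build a Strain/Strain2 annotation from two reference-based SNP sets.

    Each input maps (chromosome, position) -> (ref_base, alt_base).
    The result maps (chromosome, position) -> (strain1_base, strain2_base)
    for every position where the two strains differ.
    """
    annotation = {}
    for key in sorted(set(strain1) | set(strain2)):
        # The reference base is taken from whichever strain lists this position.
        ref = strain1[key][0] if key in strain1 else strain2[key][0]
        # A strain without a SNP here is assumed to carry the reference base.
        base1 = strain1[key][1] if key in strain1 else ref
        base2 = strain2[key][1] if key in strain2 else ref
        if base1 != base2:
            annotation[key] = (base1, base2)
    return annotation


strain1_snps = {("1", 100): ("A", "G"), ("1", 200): ("C", "T")}
strain2_snps = {("1", 200): ("C", "T"), ("1", 300): ("T", "C")}

for (chrom, pos), (b1, b2) in compare_strains(strain1_snps, strain2_snps).items():
    print(chrom, pos, f"{b1}/{b2}")
```

Position 200 drops out because both strains carry the same base there, so it cannot discriminate between them even though each differs from the reference.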

The resulting .fa files are ready to be indexed with your favourite aligner. Tried and tested aligners include Bowtie2, Tophat, STAR, Hisat2, HiCUP and Bismark. Please note that STAR and Hisat2 may require you to disable soft-clipping; please see the SNPsplit manual for more details.

Both the SNP filtering and the genome preparation write out little report files for record keeping.