Automated genome preparation for single or dual hybrid strains
SNPsplit
- Changed sorting command for BAM files to also work with Samtools versions 1.3+
- The sorting report for single-end files is now also written to the report files.
- Added the number of SNPs used for allele discrimination to the report file, which makes errors easier to spot
- CR and LF line endings are now both removed when reading in the SNP file. With SNP annotation files copied from a Windows machine we saw no allele-specific reads for genome 2 at all, which was caused by an invisible \r character left attached to the SNP call
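To illustrate why a stray \r suppresses all genome 2 calls, here is a minimal Python sketch. It is not SNPsplit's own code (SNPsplit is written in Perl). The five-column tab-delimited layout (ID, chromosome, position, strand, Ref/SNP), the example values and the helper name parse_snp_line are assumptions made for illustration.

```python
def parse_snp_line(line, strip_line_endings=True):
    """Split one tab-delimited SNP annotation line into (chrom, pos, ref, alt).

    Assumed column layout (illustrative): ID, chromosome, position, strand, Ref/SNP.
    """
    if strip_line_endings:
        # Remove both LF and CR so Windows (CRLF) files behave like Unix files
        line = line.rstrip("\r\n")
    snp_id, chrom, pos, strand, alleles = line.split("\t")
    ref, alt = alleles.split("/")
    return chrom, int(pos), ref, alt


# A line as it would appear in a file saved on Windows (CRLF line ending)
windows_line = "33941939\t9\t68878541\t1\tT/G\r\n"

# Removing only the LF leaves the CR attached to the genome 2 allele
naive = parse_snp_line(windows_line.rstrip("\n"), strip_line_endings=False)
fixed = parse_snp_line(windows_line)

print(repr(naive[3]))  # 'G\r' can never equal a read base such as 'G'
print(repr(fixed[3]))  # 'G'
```

Because the SNP allele is the last field on the line, only the genome 2 base picks up the extra character, which matches the symptom described above.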
SNPsplit_genome_preparation
Added entirely new functionality to construct single- or dual-hybrid genomes starting from VCF files, which are obtainable from the Mouse Genomes Project (http://www.sanger.ac.uk/science/data/mouse-genomes-project). Here is a brief description of what it does:
SNPsplit_genome_preparation is designed to read in a variant call file from the Mouse Genomes Project (e.g. this latest file: ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz) and to generate new genome versions in which the strain SNPs are either incorporated into the new genome (full sequence) or masked by the ambiguity nucleobase 'N' (N-masking).
SNPsplit_genome_preparation may be run in two different modes:
Single strain mode:
- The VCF file is read and filtered for high-confidence SNPs in the strain specified with --strain <name>
- The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated either as N-masking (default) or as full sequence (option --full_sequence)
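The difference between N-masking and full-sequence incorporation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the tool's actual implementation: the function name incorporate_snps, the toy sequence and the SNP dictionary are hypothetical, and the reference-base check is a design choice of this sketch.

```python
def incorporate_snps(sequence, snps, full_sequence=False):
    """Return a copy of `sequence` with SNP positions N-masked or replaced.

    `snps` maps a 1-based position to a (ref_base, alt_base) tuple.
    """
    bases = list(sequence)
    for pos, (ref, alt) in snps.items():
        # Sanity check chosen for this sketch: the annotated reference base
        # must agree with the genome sequence at that position
        if bases[pos - 1].upper() != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = alt if full_sequence else "N"
    return "".join(bases)


chr_seq = "ACGTACGTAC"                          # toy reference chromosome
strain_snps = {3: ("G", "A"), 8: ("T", "C")}    # toy high-confidence SNPs

masked = incorporate_snps(chr_seq, strain_snps)                    # default mode
full = incorporate_snps(chr_seq, strain_snps, full_sequence=True)

print(masked)  # ACNTACGNAC
print(full)    # ACATACGCAC
```

N-masking keeps alignments unbiased towards either allele, whereas the full sequence yields the strain's own genome.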
Dual strain mode:
- The VCF file is read and filtered for high-confidence SNPs in the strain specified with --strain <name>
- The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated as full sequence and optionally as N-masking
- The VCF file is read one more time and filtered for high-confidence SNPs in strain 2 specified with --strain2 <name>
- The filtered high-confidence SNP positions of strain 2 are incorporated as full sequence and optionally as N-masking
- The SNP information of strain and strain 2 relative to the reference genome build is compared, and a new Ref/SNP annotation is constructed in which the Ref/SNP information is Strain/Strain2 (and no longer relative to the standard reference genome strain Black6 (C57BL/6J))
- The full genome sequence of the strain given with --strain <name> is read into memory, and the high-confidence SNP positions between Strain and Strain2 are incorporated as full sequence and optionally as N-masking
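The comparison step in dual strain mode can be pictured with the following Python sketch. It assumes that a position is informative only when the two strains carry different bases, so SNPs shared by both strains drop out; that rule, the function name strain_vs_strain2 and the toy data are illustrative assumptions, not taken from the tool's source.

```python
def strain_vs_strain2(snps1, snps2):
    """Build {pos: (strain_base, strain2_base)} where the two strains differ.

    Each input maps a 1-based position to (ref_base, alt_base) relative to
    the reference genome.
    """
    new_annotation = {}
    for pos in sorted(set(snps1) | set(snps2)):
        ref = (snps1.get(pos) or snps2[pos])[0]
        # A strain without a SNP at this position carries the reference base
        base1 = snps1[pos][1] if pos in snps1 else ref
        base2 = snps2[pos][1] if pos in snps2 else ref
        if base1 != base2:
            new_annotation[pos] = (base1, base2)
    return new_annotation


strain_snps = {3: ("G", "A"), 8: ("T", "C")}
strain2_snps = {5: ("A", "G"), 8: ("T", "C")}

annotation = strain_vs_strain2(strain_snps, strain2_snps)
print(annotation)  # {3: ('A', 'G'), 5: ('A', 'G')}

# Last step: N-mask the Strain full sequence at the Strain/Strain2 positions
strain_genome = "ACATACGCAC"  # toy reference ACGTACGTAC with strain SNPs applied
dual_masked = "".join(
    "N" if i + 1 in annotation else base for i, base in enumerate(strain_genome)
)
print(dual_masked)  # ACNTNCGCAC
```

Position 8 is absent from the new annotation because both toy strains carry the same base there, so it cannot discriminate between them.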
The resulting .fa files are ready to be indexed with your favourite aligner. Tested aligners include Bowtie2, Tophat, STAR, Hisat2, HiCUP and Bismark. Please note that STAR and Hisat2 may require you to disable soft-clipping; please see the SNPsplit manual for more details.
Both the SNP filtering and the genome preparation steps write short report files for record keeping.