07 Jan 11:47

FelixKrueger

6e49c58

MGP v8 annotations and GRCm39 Latest

Latest

v0.6.0 - GRCm39 genome build and new Docs

Restructured the documentation, using mkdocs. The new User Guide lives at this address: http://felixkrueger.github.io/SNPsplit/
Reworked all of SNPsplit to reflect changes of the Mouse Genomes Project. This includes the overdue switch-over to the latest v8 annotation (available here) and the GRCm39 mouse genome build
Kept the old v5 (and v7) genome build instructions as legacy documentation

SNPsplit

Added an option --single_end to skip the paired-end auto-detection entirely (which failed for e.g. alignments with STAR see here)

SNPsplit_genome_preparation

Changed the chromosome detection regex to a non-greedy match (so it only uses the NAME entry following the ID=NAME, up to, but not including the first ,)

Assets 2

26 Jul 07:02

FelixKrueger

0.5.0

65cf8d7

v7-Genome Preparation

v0.5.0

SNPsplit_genome_preparation

Added option --v7_MGP; now also accepts the v7 file (mgp_REL2005_snps_indels.vcf.gz) of Mouse Genomes Project which may be downloaded here: ftp://ftp-mouse.sanger.ac.uk/REL-2004-v7-SNPs_Indels/mgp_REL2005_snps_indels.vcf.gz. INDEL variants are being skipped (this is noted in the report). This new version adds a number of additional strains to choose from, now amounting to 51 strains in total:

Available genomes to choose from are:

SEA_GnJ
SM_J
ST_bJ
CAST_EiJ
BALB_cByJ
NON_LtJ
FVB_NJ
RIIIS_J
CE_J
NZO_HlLtJ
C58_J
BTBR_T+_Itpr3tf_J
MOLF_EiJ
BUB_BnJ
C57L_J
CZECHII_EiJ
C57BL_10J
B10.RIII
AKR_J
C3H_HeJ
LP_J
DBA_2J
QSi3
ZALENDE_EiJ
A_J
PL_J
129S1_SvImJ
NZW_LacJ
PWK_PhJ
C57BL_10SnJ
C57BR_cdJ
QSi5
C57BL_6NJ
SWR_J
MA_MyJ
C3H_HeH
SPRET_EiJ
LEWES_EiJ
WSB_EiJ
129P2_OlaHsd
CBA_J
SJL_J
BALB_cJ
KK_HiJ
JF1_MsJ
NZB_B1NJ
I_LnJ
DBA_1J
129S5SvEvBrd
NOD_ShiLtJ
RF_J

If the file mgp_REL2005_snps_indels.vcf.gz is given, --v7 is set automatically.

now attempts to extract the fields FORMAT and INFO from the VCF file automatically, to get access to the required information GT (genotype) and FI (filter). See more here.

Assets 2

29 Sep 09:51

FelixKrueger

0.4.0

36fd82a

0.4.0 - Soft-clipping, YAML and more

SNPsplit now supports soft-clipping of reads (CIGAR operation S).
SNPsplit now writes important statistics out in YAML format to enable easier integration into MultiQC. If tag2sort is called via SNPsplit itself, the ...sort.yaml file will be integrated into the main ...SNPsplit_report.yaml file (and deleted afterwards)
Added option --skip_tag2sort to allow the separation of the allele-tagging and allele-sorting (tag2sort) processes. This might be desired to add a de-duplication step such as markduplicates or deduplicate_bismark for Nextflow pipelines
For genomes that consist of chromosomes for which SNPs are recorded, and scaffolds for which there are no SNPs, now all chromosomes and scaffolds are printed to both the N-masked and full sequence genomes (see here).
Added auto-detection of single-end or paired-end files. This avoids accidentally processing paired-end files in single-end mode see here.
Now making use of variable genome_build instead of using GRCm38 invariably

Assets 2

28 May 13:58

FelixKrueger

0.3.4

5df9762

v0.3.4 - Added SNPsplit to Bioconda

Changed /usr/bin/perl to /usr/bin/env perl, which was required for adding SNPsplit to bioconda. Thanks to @vivekbhr for these changes.
Fixed output-path handling for paired-end and Hi-C mode (was only working for single-end files).

tag2sort

Added option -o/--output_dir to specify an output directory.

Assets 2

15 Jun 14:13

FelixKrueger

0.3.3

8385f79

Fixed allele-assignment for certain SNPs in --bisulfite mode

v0.3.3

SNPsplit

Changed FindBin qw($Bin) to FindBin qw($RealBin) so that symlinks to tag2sort are resolved properly.
In certain cases, specific SNPs were only used for the allele assignment if they were methylated. In more detail: In cases where the SNP was either C/G (REF/ALT) or G/C (REF/ALT), and the read was on the opposing strand, only the methylated form of the C on the reverse strand had previously been allowed as a valid expected base. This has now been changed so that both G and A are considered valid for the strain containing a G at the SNP position (see also this issue).
Changed the way in which C>T SNPs are handled in the allele-tagging report (note that this was merely a report/interpretation thing and did not have any effect the on the actual results). Previously, reads without a call for genome 1 or genome 2 had been listed as:
reads did not contain one of the expected bases at known SNP positions.
In a bisulfite setting this also included C>T SNPs however, and hence the number could have been rather high (>10%). I have now changed this so that reads which had at least one C>T SNP and were unassignable at the same time are scored differently:
reads that were unassignable contained C>T SNPs preventing the assignment
Changed all instances of zcat to gunzip -c in SNPsplit and SNPsplit_genome_preparation to prevent errors on certain OSX platforms

Assets 2

29 Mar 10:15

FelixKrueger

0.3.2

8dcb8d2

v0.3.2 - Much improved SNP genome preparation

SNPsplit

Changed the samtools command throughout SNPsplit to now correctly use the path supplied by the user with --samtools_path. Thanks to Kenzo Hillion for spotting this (see here).
Option --genome_build [NAME] should now work as intended (used to be --build only).

SNPsplit_genome_preparation

Relaxed SNP filtering criteria to now support multiple homozygous variants for the same position in the genome. This step should incresae the number of usable SNPs slightly (but noticably). See here
Changed the SNP filtering for --dual_hybrid mode to only include positions where both strains had a high confidence call (irrespective of the nature of the call). This step should greatly reduce the number of false positive allele calls. See here for more details.
Added a check to SNPsplit_genome_preparation that produces a [FATAL ERROR] if the stored chromosome names are not the same as the ones in the VCF file (which is a rather common mistake when people use the Ensembl VCF file but get the genome from UCSC. This should change soon if and when Ensembl adopts the same standard used by NCBI/UCSC).
Added a new version of the genome preparation script that can deal with the latest version of the VCF file for the old NCBIM37 genome build ("mgp.v2.snps.annot.reformat.vcf.gz"). The script is called "SNPsplit_genome_preparation_v2VCF" and may be found in the folder "outdated_VCF_versions" on Github. Please note that this does not include the changes to we made the current version (see above).

Assets 2

18 May 16:08

FelixKrueger

0.3.0

10bf621

Automated genome preparation for single or dual hybrid strains

SNPsplit

Changed sorting command for BAM files to also work with Samtools versions 1.3+
The sorting report for single-end files is now also written to the report files.
Added the # of SNPs used for the allele-discrimination to the report file to make it easier to spot errors
Now removing CR and LF line endings when reading in the SNP file. For SNP annotation files copied from a Windows machine we saw problems with no allele-specific reads for genome 2 at all which was due to the invisible \r character for the SNP call

SNPsplit_genome_preparation

Added whole new functionality to construct single- or dual-hybrid genomes starting from VCF files which are obtainable from the Mouse Genomes Project (http://www.sanger.ac.uk/science/data/mouse-genomes-project), here is a brief description of what it does:

SNPsplit_genome_preparation is designed to read in a variant call files from the Mouse Genomes Project (e.g. this latest file: ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz) and generate new genome versions where the strain SNPs are either incorporated into the new genome (full sequence) or masked by the ambiguity nucleobase 'N' (N-masking).

SNPsplit_genome_preparation may be run in two different modes:

Single strain mode:

The VCF file is read and filtered for high-confidence SNPs in the strain specified with strain
The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated either as N-masking (default) or full sequence (option --full_sequence)

Dual strain mode:

The VCF file is read and filtered for high-confidence SNPs in the strain specified with --strain <name>
The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated as full sequence and optionally as N-masking
The VCF file is read one more time and filtered for high-confidence SNPs in strain 2 specified with --strain2 <name>
The filtered high-confidence SNP positions of strain 2 are incorporated as full sequence and optionally as N-masking
The SNP information of strain and strain 2 relative to the reference genome build are compared, and a new Ref/SNP annotation is constructed whereby the new Ref/SNP information will be Strain/Strain2 (and no longer the standard reference genome strain Black6 (C57BL/6J))
6.The full genome sequence given with --strain <name> is read into memory, and the high-confidence SNP positions between Strain and Strain2 are incorporated as full sequence and optionally as N-masking

The resulting .fa files are ready to be indexed with your favourite aligner. Proved and tested aligners include Bowtie2, Tophat, STAR, Hisat2, HiCUP and Bismark. Please note that STAR and Hisat2 may require you to disable soft-clipping, please see the SNPsplit manual more details

Both the SNP filtering and the genome preparation write out little report files for record keeping.

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.6.0 - GRCm39 genome build and new Docs

SNPsplit

SNPsplit_genome_preparation

v0.5.0

SNPsplit_genome_preparation

tag2sort

v0.3.3

SNPsplit

SNPsplit

SNPsplit_genome_preparation

SNPsplit

SNPsplit_genome_preparation

Single strain mode:

Dual strain mode:

Releases: FelixKrueger/SNPsplit

MGP v8 annotations and GRCm39

v0.6.0 - GRCm39 genome build and new Docs

SNPsplit

SNPsplit_genome_preparation

v7-Genome Preparation

v0.5.0

SNPsplit_genome_preparation

0.4.0 - Soft-clipping, YAML and more

v0.3.4 - Added SNPsplit to Bioconda

tag2sort

Fixed allele-assignment for certain SNPs in --bisulfite mode

v0.3.3

SNPsplit

v0.3.2 - Much improved SNP genome preparation

SNPsplit

SNPsplit_genome_preparation

Automated genome preparation for single or dual hybrid strains

SNPsplit

SNPsplit_genome_preparation

Single strain mode:

Dual strain mode: