Releases: FelixKrueger/SNPsplit
MGP v8 annotations and GRCm39
v0.6.0 - GRCm39 genome build and new Docs
- Restructured the documentation using `mkdocs`. The new User Guide lives at this address: http://felixkrueger.github.io/SNPsplit/
- Reworked all of SNPsplit to reflect changes of the Mouse Genomes Project. This includes the overdue switch-over to the latest v8 annotation (available here) and the GRCm39 mouse genome build.
- Kept the old v5 (and v7) genome build instructions as legacy documentation.
SNPsplit
- Added an option `--single_end` to skip the paired-end auto-detection entirely (which failed for e.g. alignments with STAR, see here).
SNPsplit_genome_preparation
- Changed the chromosome detection regex to a non-greedy match (so it only uses the NAME entry of `ID=NAME`, up to but not including the first `,`).
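The effect of a non-greedy match can be seen on a single header line. The following is a minimal sketch in Python; the header line and both patterns are illustrative assumptions and are not the regex used inside `SNPsplit_genome_preparation`.

```python
import re

# Hypothetical VCF header line; the real header may carry other attributes.
header = "##contig=<ID=1,length=195154279,assembly=GRCm39>"

# Greedy: .+ extends to the LAST comma, so extra attributes leak into the name.
greedy = re.search(r"ID=(.+),", header).group(1)

# Non-greedy: .+? stops before the FIRST comma, leaving only the name.
non_greedy = re.search(r"ID=(.+?),", header).group(1)

print(greedy)
print(non_greedy)
```

With the greedy pattern the captured name includes the `length` attribute; the non-greedy pattern captures only the chromosome name.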
v7-Genome Preparation
v0.5.0
SNPsplit_genome_preparation
- Added option `--v7_MGP`; now also accepts the v7 file (`mgp_REL2005_snps_indels.vcf.gz`) of the Mouse Genomes Project, which may be downloaded here: ftp://ftp-mouse.sanger.ac.uk/REL-2004-v7-SNPs_Indels/mgp_REL2005_snps_indels.vcf.gz. INDEL variants are skipped (this is noted in the report). This new version adds a number of additional strains to choose from, now amounting to 51 strains in total:
Available genomes to choose from are:
SEA_GnJ
SM_J
ST_bJ
CAST_EiJ
BALB_cByJ
NON_LtJ
FVB_NJ
RIIIS_J
CE_J
NZO_HlLtJ
C58_J
BTBR_T+_Itpr3tf_J
MOLF_EiJ
BUB_BnJ
C57L_J
CZECHII_EiJ
C57BL_10J
B10.RIII
AKR_J
C3H_HeJ
LP_J
DBA_2J
QSi3
ZALENDE_EiJ
A_J
PL_J
129S1_SvImJ
NZW_LacJ
PWK_PhJ
C57BL_10SnJ
C57BR_cdJ
QSi5
C57BL_6NJ
SWR_J
MA_MyJ
C3H_HeH
SPRET_EiJ
LEWES_EiJ
WSB_EiJ
129P2_OlaHsd
CBA_J
SJL_J
BALB_cJ
KK_HiJ
JF1_MsJ
NZB_B1NJ
I_LnJ
DBA_1J
129S5SvEvBrd
NOD_ShiLtJ
RF_J
If the file `mgp_REL2005_snps_indels.vcf.gz` is given, `--v7` is set automatically.
- Now attempts to extract the fields `FORMAT` and `INFO` from the VCF file automatically, to get access to the required information `GT` (genotype) and `FI` (filter). See more here.
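As an illustration of how `GT` and `FI` can be looked up by name instead of by a fixed position, here is a minimal sketch in Python. The function name, the shortened record and the reading of `FI` as "1 = passed" are assumptions made for this example; this is not the code used by `SNPsplit_genome_preparation`.

```python
def strain_call(header_line, record_line, strain):
    """Return (GT, FI) for one strain, locating both keys via the FORMAT column."""
    columns = header_line.lstrip("#").rstrip("\r\n").split("\t")
    record = record_line.rstrip("\r\n").split("\t")
    keys = record[columns.index("FORMAT")].split(":")    # e.g. GT, GQ, FI
    values = record[columns.index(strain)].split(":")    # this strain's values
    fields = dict(zip(keys, values))
    return fields["GT"], fields["FI"]

# Hypothetical, shortened VCF lines; real records carry more FORMAT keys.
header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tCAST_EiJ"
record_line = "1\t3050050\t.\tC\tG\t228\tPASS\tDP=100\tGT:GQ:FI\t1/1:99:1"

genotype, filter_flag = strain_call(header_line, record_line, "CAST_EiJ")
print(genotype, filter_flag)
```

Looking the keys up by name keeps the parsing independent of the order in which a given VCF release lists its `FORMAT` entries.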
0.4.0 - Soft-clipping, YAML and more
- SNPsplit now supports soft-clipping of reads (`CIGAR` operation `S`).
- SNPsplit now writes important statistics out in YAML format to enable easier integration into `MultiQC`. If `tag2sort` is called via `SNPsplit` itself, the `...sort.yaml` file will be integrated into the main `...SNPsplit_report.yaml` file (and deleted afterwards).
- Added option `--skip_tag2sort` to allow the separation of the allele-tagging and allele-sorting (`tag2sort`) processes. This might be desired to add a de-duplication step such as `markduplicates` or `deduplicate_bismark` for Nextflow pipelines.
- For genomes that consist of chromosomes for which SNPs are recorded, and scaffolds for which there are no SNPs, now all chromosomes and scaffolds are printed to both the N-masked and full sequence genomes (see here).
- Added auto-detection of single-end or paired-end files. This avoids accidentally processing paired-end files in single-end mode (see here).
- Now making use of the variable genome_build instead of using GRCm38 invariably.
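To illustrate what handling the `S` operation involves, the sketch below walks a CIGAR string and maps reference positions to read bases. It follows the SAM convention that soft-clipped bases are present in the read sequence but do not consume the reference. The function name and the example reads are assumptions for illustration only, not SNPsplit's implementation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def reference_to_read_base(pos, cigar, seq):
    """Map 1-based reference positions to read bases for one alignment."""
    ref_pos = pos      # SAM POS: position of the first aligned (not clipped) base
    read_idx = 0
    bases = {}
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":              # consumes read and reference
            for i in range(length):
                bases[ref_pos + i] = seq[read_idx + i]
            ref_pos += length
            read_idx += length
        elif op in "IS":             # consumes read only; S = soft-clipped bases
            read_idx += length
        elif op in "DN":             # consumes reference only
            ref_pos += length
        # H and P consume neither read nor reference
    return bases

clipped = reference_to_read_base(100, "2S4M", "TTACGT")
deleted = reference_to_read_base(10, "2M1D2M", "ACGT")
print(clipped)
print(deleted)
```

In the first example the two clipped bases are skipped in the read, so the base compared against a SNP at position 100 is the third base of the read.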
v0.3.4 - Added SNPsplit to Bioconda
- Changed `/usr/bin/perl` to `/usr/bin/env perl`, which was required for adding SNPsplit to bioconda. Thanks to @vivekbhr for these changes.
- Fixed output-path handling for paired-end and Hi-C mode (was only working for single-end files).
tag2sort
- Added option `-o/--output_dir` to specify an output directory.
Fixed allele-assignment for certain SNPs in --bisulfite mode
v0.3.3
SNPsplit
- Changed `FindBin qw($Bin)` to `FindBin qw($RealBin)` so that symlinks to `tag2sort` are resolved properly.
- In certain cases, specific SNPs were only used for the allele assignment if they were methylated. In more detail: in cases where the SNP was either C/G (REF/ALT) or G/C (REF/ALT), and the read was on the opposing strand, only the methylated form of the C on the reverse strand had previously been allowed as a valid expected base. This has now been changed so that both G and A are considered valid for the strain containing a G at the SNP position (see also this issue).
- Changed the way in which C>T SNPs are handled in the allele-tagging report (note that this was merely a report/interpretation thing and did not have any effect on the actual results). Previously, reads without a call for genome 1 or genome 2 had been listed as "reads did not contain one of the expected bases at known SNP positions". In a bisulfite setting this also included C>T SNPs, however, and hence the number could have been rather high (>10%). I have now changed this so that reads which had at least one C>T SNP and were unassignable at the same time are scored differently, as "reads that were unassignable contained C>T SNPs preventing the assignment".
- Changed all instances of `zcat` to `gunzip -c` in `SNPsplit` and `SNPsplit_genome_preparation` to prevent errors on certain OSX platforms.
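The bisulfite handling described in these notes can be pictured with a simplified model: for each genomic base, list the bases a read may show after conversion, and treat a SNP as usable only when the two genomes cannot produce the same base. The sketch below is such a model in Python; the function names and the `"top"`/`"bottom"` strand labels are hypothetical, and it is not taken from the SNPsplit source.

```python
def valid_bases(genomic_base, strand):
    """Bases a bisulfite read may show for one genomic base (simplified model)."""
    if strand == "top" and genomic_base == "C":
        return {"C", "T"}    # methylated C stays C, unmethylated C reads as T
    if strand == "bottom" and genomic_base == "G":
        return {"G", "A"}    # the same conversion seen from the opposite strand
    return {genomic_base}

def is_informative(ref, alt, strand):
    """A SNP separates the genomes only if the two base sets do not overlap."""
    return valid_bases(ref, strand).isdisjoint(valid_bases(alt, strand))

# C/G SNP, read from the opposing strand: the G genome may show G or A.
print(valid_bases("G", "bottom"))
print(is_informative("C", "G", "bottom"))

# C>T SNP on the top strand: a T could be the SNP or an unmethylated C.
print(is_informative("C", "T", "top"))
```

Under this model the C/G SNP stays usable on the opposing strand once both G and A are accepted, whereas a C>T SNP on the top strand cannot be assigned.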
v0.3.2 - Much improved SNP genome preparation
SNPsplit
- Changed the `samtools` command throughout SNPsplit to now correctly use the path supplied by the user with `--samtools_path`. Thanks to Kenzo Hillion for spotting this (see here).
- Option `--genome_build [NAME]` should now work as intended (used to be `--build` only).
SNPsplit_genome_preparation
- Relaxed SNP filtering criteria to now support multiple homozygous variants for the same position in the genome. This step should increase the number of usable SNPs slightly (but noticeably). See here.
- Changed the SNP filtering for `--dual_hybrid` mode to only include positions where both strains had a high-confidence call (irrespective of the nature of the call). This step should greatly reduce the number of false positive allele calls. See here for more details.
- Added a check to `SNPsplit_genome_preparation` that produces a [FATAL ERROR] if the stored chromosome names are not the same as the ones in the VCF file (which is a rather common mistake when people use the Ensembl VCF file but get the genome from UCSC. This should change soon if and when Ensembl adopts the same standard used by NCBI/UCSC).
- Added a new version of the genome preparation script that can deal with the latest version of the VCF file for the old NCBIM37 genome build ("mgp.v2.snps.annot.reformat.vcf.gz"). The script is called "SNPsplit_genome_preparation_v2VCF" and may be found in the folder "outdated_VCF_versions" on Github. Please note that this does not include the changes we made to the current version (see above).
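The chromosome-name check mentioned above could look roughly like the following Python sketch. The function name and the exact condition (any VCF name missing from the genome) are assumptions for illustration; the notes do not spell out the precise test SNPsplit applies.

```python
def check_chromosome_names(genome_names, vcf_names):
    """Raise if the VCF uses chromosome names that the genome does not contain."""
    missing = sorted(set(vcf_names) - set(genome_names))
    if missing:
        raise ValueError(
            "[FATAL ERROR] chromosome names in the VCF file not found in the genome: "
            + ", ".join(missing)
        )

# Matching naming schemes pass silently.
check_chromosome_names(["1", "2", "X"], ["1", "2", "X"])

# Ensembl-style VCF names against a UCSC-style genome ("chr" prefix) fail.
try:
    check_chromosome_names(["chr1", "chr2", "chrX"], ["1", "2", "X"])
except ValueError as error:
    message = str(error)
    print(message)
```

Failing early here is cheaper than discovering after a long run that no SNP was ever incorporated.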
Automated genome preparation for single or dual hybrid strains
SNPsplit
- Changed sorting command for BAM files to also work with Samtools versions 1.3+
- The sorting report for single-end files is now also written to the report files.
- Added the # of SNPs used for the allele-discrimination to the report file to make it easier to spot errors
- Now removing CR and LF line endings when reading in the SNP file. For SNP annotation files copied from a Windows machine, we saw problems with no allele-specific reads for genome 2 at all, which was due to the invisible `\r` character at the end of the SNP call.
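The line-ending problem is easy to reproduce. The sketch below, in Python, assumes a tab-separated SNP line whose last column holds the Ref/SNP pair; the column layout, the function name and the example line are assumptions made for this illustration.

```python
def parse_snp_line(line):
    """Split one SNP annotation line after removing CR and LF characters."""
    fields = line.rstrip("\r\n").split("\t")
    ref, alt = fields[-1].split("/")
    return ref, alt

# Hypothetical line as it looks after passing through a Windows machine.
windows_line = "rs0001\t5\t48794752\t1\tC/T\r\n"

# Removing only the LF leaves the SNP base as "T\r", which never equals "T".
raw_alt = windows_line.rstrip("\n").split("\t")[-1].split("/")[1]

ref, alt = parse_snp_line(windows_line)
print(repr(raw_alt), ref, alt)
```

Because the stray character sits on the last column, only the genome 2 base is affected, which matches the symptom described above.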
SNPsplit_genome_preparation
Added whole new functionality to construct single- or dual-hybrid genomes starting from VCF files which are obtainable from the Mouse Genomes Project (http://www.sanger.ac.uk/science/data/mouse-genomes-project). Here is a brief description of what it does:
`SNPsplit_genome_preparation` is designed to read in a variant call file from the Mouse Genomes Project (e.g. this latest file: ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz) and generate new genome versions where the strain SNPs are either incorporated into the new genome (full sequence) or masked by the ambiguity nucleobase 'N' (N-masking).

`SNPsplit_genome_preparation` may be run in two different modes:
Single strain mode:
1. The VCF file is read and filtered for high-confidence SNPs in the strain specified with `--strain <name>`
2. The reference genome (given with `--reference_genome <genome>`) is read into memory, and the filtered high-confidence SNP positions are incorporated either as N-masking (default) or full sequence (option `--full_sequence`)
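The difference between N-masking and full-sequence incorporation can be sketched on a toy sequence. The Python below is an illustration under stated assumptions: the function name, the `snps` dictionary layout (1-based position mapped to a Ref/SNP pair) and the example data are hypothetical and do not reflect how SNPsplit stores genomes internally.

```python
def incorporate_snps(sequence, snps, full_sequence=False):
    """Return a copy of `sequence` with SNP positions N-masked or replaced.

    `snps` maps 1-based positions to (ref, alt) tuples.
    """
    bases = list(sequence)
    for position, (ref, alt) in snps.items():
        if bases[position - 1].upper() != ref:
            raise ValueError(f"reference base mismatch at position {position}")
        bases[position - 1] = alt if full_sequence else "N"
    return "".join(bases)

reference = "ACGTACGT"
snps = {2: ("C", "T"), 7: ("G", "A")}

n_masked = incorporate_snps(reference, snps)                  # default: N-masking
full = incorporate_snps(reference, snps, full_sequence=True)  # strain bases written in
print(n_masked)
print(full)
```

The reference-base check in the loop is an extra safeguard added for this sketch; it catches SNP files that belong to a different genome build.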
Dual strain mode:
1. The VCF file is read and filtered for high-confidence SNPs in the strain specified with `--strain <name>`
2. The reference genome (given with `--reference_genome <genome>`) is read into memory, and the filtered high-confidence SNP positions are incorporated as full sequence and optionally as N-masking
3. The VCF file is read one more time and filtered for high-confidence SNPs in strain 2 specified with `--strain2 <name>`
4. The filtered high-confidence SNP positions of strain 2 are incorporated as full sequence and optionally as N-masking
5. The SNP information of strain and strain 2 relative to the reference genome build are compared, and a new Ref/SNP annotation is constructed whereby the new Ref/SNP information will be Strain/Strain2 (and no longer the standard reference genome strain Black6 (C57BL/6J))
6. The full genome sequence given with `--strain <name>` is read into memory, and the high-confidence SNP positions between Strain and Strain2 are incorporated as full sequence and optionally as N-masking
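The comparison step that produces a Strain/Strain2 annotation can be sketched as follows. This Python example is a simplification under stated assumptions: the function name and data layout are hypothetical, and a position missing from one strain's SNP set is treated as carrying the reference base, which is looser than the filter described in the v0.3.2 notes (a high-confidence call in both strains).

```python
def strain_vs_strain(snps1, snps2):
    """Build a Strain1/Strain2 annotation from two reference-based SNP sets.

    Each input maps position -> (ref, alt) relative to the reference genome.
    """
    annotation = {}
    for position in sorted(set(snps1) | set(snps2)):
        ref = (snps1.get(position) or snps2[position])[0]
        base1 = snps1[position][1] if position in snps1 else ref
        base2 = snps2[position][1] if position in snps2 else ref
        if base1 != base2:           # identical bases cannot separate the strains
            annotation[position] = (base1, base2)
    return annotation

strain1 = {10: ("C", "T"), 20: ("G", "A")}
strain2 = {20: ("G", "A"), 30: ("A", "C")}

annotation = strain_vs_strain(strain1, strain2)
print(annotation)
```

Position 20 drops out because both strains carry the same base there, even though each differs from the reference.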
The resulting `.fa` files are ready to be indexed with your favourite aligner. Proven and tested aligners include `Bowtie2`, `Tophat`, `STAR`, `Hisat2`, `HiCUP` and `Bismark`. Please note that `STAR` and `Hisat2` may require you to disable soft-clipping; please see the SNPsplit manual for more details.
Both the SNP filtering and the genome preparation write out little report files for record keeping.