BamToFastq
Pierre Lindenbaum edited this page Nov 20, 2013
·
11 revisions
##Motivation
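Implementation of https://twitter.com/DNAntonie/status/402909852277932032 :
"Shrink your FASTQ.bz2 files by 40+% using this one weird tip -> order them by alignment to reference before compression"
Reads that align to the same region are similar to each other, so writing them next to each other helps the compressor.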
##Compilation
See also Compilation.
$ ant bam2fastq
##Options
Name | Description |
---|---|
-v | Print version and exit. |
-E (name) | Restrict to that enzyme. Can be called multiple times. Optional. |
-t (dir) | Set temporary directory. Optional. |
-F (fastq) | Save fastq_R1 to file (default: stdout). Optional. |
-R (fastq) | Save fastq_R2 to file (default: interlaced with forward). Optional. |
-r | Repair: insert missing read. |
-N (int) | Max records in memory. Optional. |
##Warning
- The Illumina "is filtered" flag is always "N".
- The Illumina control number is always 0.
- The Illumina index sequence is lost.
##Example
$ bwa mem -M human_g1k_v37.fasta Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz |\
java -jar dist/bam2fastq.jar -F tmpR1.fastq.gz -R tmpR2.fastq.gz
before:
$ ls -lah Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz
after (these are HaloPlex data, with a lot of duplicates):
$ ls -lah tmpR1.fastq.gz tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz
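The saving can be worked out from the listings above. The following is a minimal sketch, not part of the tool: `pct_saved` is a hypothetical helper, and the inputs are the rounded sizes in MB printed by `ls -lah`, so the percentages are approximate.

```shell
# pct_saved BEFORE AFTER: percentage of space saved, with one decimal place.
# Hypothetical helper for illustration; sizes can be in any common unit.
pct_saved() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

pct_saved 181 96    # R1: prints 47.0
pct_saved 190 106   # R2: prints 44.2
```

For this data set both files shrink by a bit less than half, in line with the "40+%" claim in the motivation.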
check the number of lines (a FASTQ record is four lines, so equal line counts mean equal read counts):
$ gunzip -c Sample1_L001_R1_001.fastq.gz | wc -l
5824676
$ gunzip -c tmpR1.fastq.gz | wc -l
5824676
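`wc -l` counts lines, not reads. A small sketch of turning that into a read count, assuming a well-formed FASTQ with exactly four lines per record; `count_reads` is a hypothetical helper and the toy file exists only to make the snippet runnable on its own.

```shell
# count_reads FILE.fastq.gz: print the number of FASTQ records (lines / 4).
# Assumes exactly four lines per record (no wrapped sequences).
count_reads() {
  lines=$(gunzip -c "$1" | wc -l)
  echo $((lines / 4))
}

# Toy two-record file, used only to demonstrate the helper.
dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip -c > "$dir/toy.fastq.gz"
count_reads "$dir/toy.fastq.gz"   # prints 2
rm -rf "$dir"
```

Applied to the files above, 5824676 lines correspond to 1456169 reads in both the input and the output.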
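verify one read (sequence and qualities are identical; only the last header field differs, 5 in the input and 1 in the output):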
$ gunzip -c Sample1_L001_R1_001.fastq.gz | cat -n | head -n 4
1 @M00491:25:000000000-A46H3:1:1101:11697:2045 1:N:0:5
2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACACATTGGCAAATAGCATGCCGAGGTACGCTTAAAAAAAAAACGACGCGAGGCAGGGGGGGAGGAAGCAGGGGAGCAACAGGGGGAAGGGAAGGGAAGAGAAGAAGAACGAACGAAAG
3 +
4 AAAAAAAA1AC1FFGCGA0AFFBGAGHHFF2GBGHH0B2DBCF101111D211B////A11///B/1DE1E/>>E//?///</<><C////<?9-9-99A-;/---;---;-9--9=---------9:AF---9//:/9/:9---9-:-9-
$ gunzip -c tmpR1.fastq.gz | cat -n | grep -A 3 -w "@M00491:25:000000000-A46H3:1:1101:11697:2045"
5771577 @M00491:25:000000000-A46H3:1:1101:11697:2045 1:N:0:1
5771578 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACACATTGGCAAATAGCATGCCGAGGTACGCTTAAAAAAAAAACGACGCGAGGCAGGGGGGGAGGAAGCAGGGGAGCAACAGGGGGAAGGGAAGGGAAGAGAAGAAGAACGAACGAAAG
5771579 +
5771580 AAAAAAAA1AC1FFGCGA0AFFBGAGHHFF2GBGHH0B2DBCF101111D211B////A11///B/1DE1E/>>E//?///</<><C////<?9-9-99A-;/---;---;-9--9=---------9:AF---9//:/9/:9---9-:-9-
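Checking a single read by eye does not scale. One possible way to compare every read is sketched below, assuming the converter writes each read back with its original sequence and qualities. `fastq_digest` is a hypothetical helper: it reduces each record to name, sequence and quality, drops the header comment (which is known to change, see the warnings), sorts the records so that order does not matter, and checksums the result.

```shell
# fastq_digest FILE.fastq.gz: order-independent checksum of (name, sequence, quality).
# The header comment after the first space is ignored.
fastq_digest() {
  gunzip -c "$1" | paste - - - - |
    awk -F '\t' '{ split($1, h, " "); print h[1] "\t" $2 "\t" $4 }' |
    LC_ALL=C sort | cksum
}

# Toy data: same two reads, different order and different index field.
dir=$(mktemp -d)
printf '@r1 1:N:0:5\nACGT\n+\nIIII\n@r2 1:N:0:5\nTTTT\n+\nHHHH\n' | gzip -c > "$dir/a.fastq.gz"
printf '@r2 1:N:0:1\nTTTT\n+\nHHHH\n@r1 1:N:0:1\nACGT\n+\nIIII\n' | gzip -c > "$dir/b.fastq.gz"
if [ "$(fastq_digest "$dir/a.fastq.gz")" = "$(fastq_digest "$dir/b.fastq.gz")" ]; then
  echo "same reads"
else
  echo "reads differ"
fi
rm -rf "$dir"
```

On real data the two arguments would be the original file and the reordered one, for example `Sample1_L001_R1_001.fastq.gz` and `tmpR1.fastq.gz`.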
$ java -jar dist/bam2fastq.jar \
-F tmpR1.fastq.gz -R tmpR2.fastq.gz file.bam
(...)
-rw-r--r-- 1 lindenb lindenb 565M Nov 18 10:44 Sample_S1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 649M Nov 18 10:45 Sample_S1_L001_R2_001.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 470M Nov 20 16:17 tmpR1.fastq.gz.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 554M Nov 20 16:17 tmpR2.fastq.gz.fastq.gz