Skip to content
Pierre Lindenbaum edited this page Nov 20, 2013 · 11 revisions

##Motivation

implementation of of https://twitter.com/DNAntonie/status/402909852277932032 " +

Shrink your FASTQ.bz2 files by 40+% using this one weird tip -> order them by alignment to reference before compression

##Compilation See also Compilation.

$ ant bam2fastq

##Options

Name Description
-v print version and exit.
-E (name) restrict to that enzyme. Can be called multiple times. Optional
-t (dir) set temporary directory . Optional
-F (fastq) Save fastq_R1 to file (default: stdout) . Optional. gz or bz2 extensions should be ok
-R (fastq) Save fastq_R2 to file (default: interlaced with forward) . Optional. gz or bz2 extensions should be ok.
-r repair: insert missing read
-N (int) max records in memory. Optional

##Warning

  • illumina read is filtered is always "n"
  • illumina control number is always 0
  • Illumina index sequence is lost.

##Example

Example 1 : piping bwa mem

$ bwa mem -M  human_g1k_v37.fasta  Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz |\
  java -jar dist/bam2fastq.jar  -F tmpR1.fastq.gz -R tmpR2.fastq.gz

before:

$ ls -lah Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz

after (these are Haloplex Data, with a lot of duplicates )

$ ls -lah tmpR1.fastq.gz  tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb  96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz

check the number of reads

$ gunzip -c Sample1_L001_R1_001.fastq.gz | wc -l
5824676
$ gunzip -c tmpR1.fastq.gz | wc -l
5824676

verify one read

$ gunzip -c Sample1_L001_R1_001.fastq.gz | cat -n | head -n 4
     1	@M00491:25:000000000-A46H3:1:1101:11697:2045 1:N:0:5
     2	AGATCGGAAGAGCACACGTCTGAACTCCAGTCACACATTGGCAAATAGCATGCCGAGGTACGCTTAAAAAAAAAACGACGCGAGGCAGGGGGGGAGGAAGCAGGGGAGCAACAGGGGGAAGGGAAGGGAAGAGAAGAAGAACGAACGAAAG
     3	+
     4	AAAAAAAA1AC1FFGCGA0AFFBGAGHHFF2GBGHH0B2DBCF101111D211B////A11///B/1DE1E/>>E//?///</<><C////<?9-9-99A-;/---;---;-9--9=---------9:AF---9//:/9/:9---9-:-9-


$ gunzip -c tmpR1.fastq.gz | cat -n | grep  -A 3 -w "@M00491:25:000000000-A46H3:1:1101:11697:2045"
5771577	@M00491:25:000000000-A46H3:1:1101:11697:2045 1:N:0:1
5771578	AGATCGGAAGAGCACACGTCTGAACTCCAGTCACACATTGGCAAATAGCATGCCGAGGTACGCTTAAAAAAAAAACGACGCGAGGCAGGGGGGGAGGAAGCAGGGGAGCAACAGGGGGAAGGGAAGGGAAGAGAAGAAGAACGAACGAAAG
5771579	+
5771580	AAAAAAAA1AC1FFGCGA0AFFBGAGHHFF2GBGHH0B2DBCF101111D211B////A11///B/1DE1E/>>E//?///</<><C////<?9-9-99A-;/---;---;-9--9=---------9:AF---9//:/9/:9---9-:-9-

Example 2 from FASTQ

$ java -jar dist/bam2fastq.jar \
    -F tmpR1.fastq.gz -R tmpR2.fastq.gz file.bam

(...)
-rw-r--r-- 1 lindenb lindenb 565M Nov 18 10:44 Sample_S1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 649M Nov 18 10:45 Sample_S1_L001_R2_001.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 470M Nov 20 16:17 tmpR1.fastq.gz.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 554M Nov 20 16:17 tmpR2.fastq.gz.fastq.gz
Clone this wiki locally