Skip to content
Olavhaasie edited this page Jan 8, 2018 · 5 revisions

Quick Start Example

Here is a simple example for generating an Illumina Mate-Pair library, checking the alignment against your reference genome, and then converting the output alignment file to fastq format for use in most other tools.

Prerequisites

  1. Download and install samtools: http://samtools.sourceforge.net/ and add executable to your $PATH
  2. Download latest Picard jar file: http://sourceforge.net/projects/picard/files/, and extract .jar files somewhere that makes sense like $HOME/jars/, for the rest of this I will assume that all jar files are located in $HOME/jars/

Generate and check simulated reads

  1. Download SimSeq.jar, error_profile.txt, and AlligatorMito.fa
  2. put SImSeq.jar in your jar directory ($HOME/jars/) and put error_profile.txt and AlligatorMito.fa in a folder where you want to work
  3. Generate the simulated reads: java -jar -Xmx2048m $HOME/jars/SimSeq.jar -1 100 -2 100 -e error_profile.txt --insert_size 3000 --insert_stdev 300 --mate_pair --mate_frag 500 --mate_frag_stdev 50 --mate_pulldown_error_p 0.3 --read_number 10000 --read_prefix AMito_ --reference AlligatorMito.fa --duplicate_probability 0.01 --out out.sam
  4. Convert sam output to bam: samtools view -bS -T AlligatorMito.fa -t AlligatorMito.size -o out.bam out.sam
  5. Sort the bam file: samtools sort out.bam out.sorted
  6. Index the sorted bam file: samtools index out.sorted.bam
  7. Check the alignment in tview: samtools tview out.sorted.bam AlligatorMito.fa

Convert Sam output to Fastq

Run the picard program SamToFastq: java -jar -Xmx2048 $HOME/jars/SamToFastq.jar INPUT=out.sorted.bam FASTQ=library.1.fastq SECOND_END_FASTQ=library.2.fastq INCLUDE_NON_PF_READS=true VALIDATION_STRINGENCY=SILENT

Notes:

  1. setting the VALIDATION_STRINGENCY=SILENT is probably not ideal, but for now it is necessary because samtools isn't putting the length of the reference sequence into the header, which is expected by Picard. You get an error about the first read having a coordinate outside of the range of the reference sequence which has a length of 0. If you use Picard to somehow fix the header given the reference sequence this should work without telling it not to validate your sequence.
  2. The current sam alignment for a chimeric read shows the fragment starting at the left-most coordinate. Since the rest of the sequence occurs before the left most coordinate I use a soft clipping for the remainder of the sequence. This does not effect the output fastq file generated by the above method, but when you are looking at the alignment in samtools tview you will see some sequences as short as 1 nucleotide. This is normal, and doesn't mean that the actual sam records are that short.