Evan Denmark edited this page Aug 1, 2014 · 5 revisions

Welcome to the SAMtoCIRCOS wiki!

This software was made at the Woods Hole Oceanographic Institution by Matthew Neave and Evan Denmark. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Purpose: SAMtoCIRCOS does what its name suggests: it takes a SAM file and outputs the files needed to run the CIRCOS visualization software for prokaryotic genomes. Given a SAM file, the program writes a coverage file, a karyotype file, and an ends file. If a FASTA file is also given as input, the GC content of each subunit and the gaps in the scaffolds are calculated as well.

The program takes a SAM file as its only required argument. Optionally, you may provide a windowSize (the size of each subunit used when calculating coverage and GC content) and an end_size (the distance from the end of a scaffold within which mate pairs are calculated). You do not have to provide these; the defaults are windowSize = 1000 and end_size = 500. A FASTA file is a further optional argument, and its default is NONE. If a FASTA file is provided in addition to the SAM file, the program also calculates the GC content of each subunit and the gaps in the scaffolds, so 5 files are produced instead of 3. The program reads the scaffolds from the SAM file and splits each one into smaller sections (subunits) to make the output more useful.

Use: On a Linux/Unix-based system, the program can be run from the command line:

[user@hostnetwork ~] SAMtoCIRCOS.py sam_file.sam -windowSize 1000 -end_size 500 -fasta fasta_file.fasta

Use the -h flag for help.
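As a rough illustration of how the arguments described above fit together, here is a minimal argparse sketch. It is not taken from the SAMtoCIRCOS source: the function name build_parser and the help strings are hypothetical, and only the flag names and default values come from the text and command shown above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(
        prog="SAMtoCIRCOS.py",
        description="Sketch of the SAMtoCIRCOS command-line arguments.",
    )
    # The SAM file is the only required argument.
    parser.add_argument("sam_file", help="input SAM file")
    # Defaults are the ones stated in the text: 1000 and 500.
    parser.add_argument("-windowSize", type=int, default=1000,
                        help="size of each subunit")
    parser.add_argument("-end_size", type=int, default=500,
                        help="distance from the scaffold end for mate pairs")
    # No FASTA by default; GC and gap output depend on it.
    parser.add_argument("-fasta", default=None, help="optional FASTA file")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["sam_file.sam", "-fasta", "fasta_file.fasta"])
print(args.windowSize, args.end_size, args.fasta)
```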

Outputs: The coverage file gives the scaffold name, each subunit within each scaffold, and how many reads map to that subunit. If no reads map to a subunit, it is given a value of zero. In most cases (but not all), a run of consecutive subunits with zero coverage within a scaffold indicates a gap spanning those subunits.
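The per-subunit counting described above can be sketched as follows. This is an illustration only, under stated assumptions: a read is assigned to the subunit containing its 1-based start position, subunit coordinates are reported as 0-based start and exclusive end, and the function name window_coverage is hypothetical. The actual program may assign reads and number subunits differently.

```python
from collections import Counter


def window_coverage(scaffold_lengths, read_positions, window_size=1000):
    """Count reads per subunit; subunits with no reads get a count of zero."""
    counts = Counter()
    for scaffold, pos in read_positions:
        if scaffold in scaffold_lengths:
            # Assumption: pos is the 1-based leftmost position of the read.
            counts[(scaffold, (pos - 1) // window_size)] += 1

    rows = []
    for scaffold, length in scaffold_lengths.items():
        n_windows = (length + window_size - 1) // window_size  # ceiling division
        for i in range(n_windows):
            start = i * window_size
            end = min(start + window_size, length)  # last subunit may be shorter
            rows.append((scaffold, start, end, counts[(scaffold, i)]))
    return rows


lengths = {"scaffold_1": 3500}
reads = [("scaffold_1", 1), ("scaffold_1", 1000),
         ("scaffold_1", 1001), ("scaffold_1", 2400)]
for row in window_coverage(lengths, reads):
    print(*row)
```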

The ends file gives, for each mate pair, the scaffold of the original mate with that mate's start and end positions, followed by the scaffold of its mate with the mate's start and end positions. In this version of the program, mates are 100 base pairs long, because the sequencing platform used to generate them (HiSeq) produced reads of this length.
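A minimal sketch of how one such row might be derived from a SAM alignment line is shown below. The column positions (RNAME, POS, RNEXT, PNEXT) follow the SAM format, and the 100 bp mate length and end_size default come from the text. Everything else is an assumption for illustration: the function names are hypothetical, and the rule used here (keep a read if its start lies within end_size of either scaffold end) may not match the program's exact criterion.

```python
READ_LENGTH = 100  # mate length stated in the text


def near_scaffold_end(pos, scaffold_length, end_size=500):
    """True if a 1-based read start lies within end_size of either scaffold end."""
    return pos <= end_size or pos > scaffold_length - end_size


def end_link(sam_line, scaffold_lengths, end_size=500):
    """Return (scaffold, start, end, mate_scaffold, mate_start, mate_end) or None."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, pos = fields[2], int(fields[3])    # RNAME, POS
    rnext, pnext = fields[6], int(fields[7])  # RNEXT, PNEXT
    if rname == "*" or rnext == "*":
        return None  # read or mate is unmapped
    # In SAM, "=" in RNEXT means the mate is on the same scaffold.
    mate_scaffold = rname if rnext == "=" else rnext
    if not near_scaffold_end(pos, scaffold_lengths[rname], end_size):
        return None
    return (rname, pos, pos + READ_LENGTH - 1,
            mate_scaffold, pnext, pnext + READ_LENGTH - 1)


lengths = {"scaffold_1": 10000, "scaffold_2": 8000}
line = "read1\t97\tscaffold_1\t9800\t60\t100M\tscaffold_2\t120\t0\t*\t*"
print(end_link(line, lengths))
```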

The karyotype file gives the scaffold name, its length, and the color in which the scaffold will appear when the file is run through visualization software.
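For orientation, a karyotype line in CIRCOS has the form "chr - ID LABEL START END COLOR". The sketch below builds such lines from the @SQ header records of a SAM file, which carry the scaffold name (SN) and length (LN). Reading lengths from the header, the function name karyotype_lines, and the small color palette are assumptions made for this example, not details confirmed by the program.

```python
from itertools import cycle


def karyotype_lines(sam_header_lines, palette=("red", "blue", "green")):
    """Build CIRCOS karyotype lines from SAM @SQ header records."""
    colors = cycle(palette)  # hypothetical palette, reused in order
    lines = []
    for line in sam_header_lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD, @PG and other header records
        # Header fields after the record type look like TAG:VALUE.
        tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        name, length = tags["SN"], int(tags["LN"])
        lines.append(f"chr - {name} {name} 0 {length} {next(colors)}")
    return lines


header = [
    "@HD\tVN:1.0",
    "@SQ\tSN:scaffold_1\tLN:5000",
    "@SQ\tSN:scaffold_2\tLN:3000",
]
print("\n".join(karyotype_lines(header)))
```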

The GC file gives the scaffold name, each subunit within each scaffold, and the GC content of each subunit. In most cases, when the GC content of consecutive subunits is zero, it is due to a gap spanning these subunits. With the default windowSize of 1000, one would expect every subunit to have a GC content percentage with no more than one decimal place (e.g. 541 GC out of 1000 base pairs gives a GC content of 54.1%). In many cases, however, the GC content has far more than one decimal place because of gaps within the subunit. Some gaps may be only a single nucleotide. Because this program does not count gaps as nucleotides when calculating GC content, the fraction has a denominator other than 1000, giving a percentage with many decimals. Of course, this only applies when the default of 1000 is used as the windowSize.
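The arithmetic above can be checked with a short sketch. The function name gc_percent is hypothetical, and returning 0 for a subunit made up entirely of gaps is an assumption chosen to match the zero values described in the text.

```python
def gc_percent(window):
    """GC percentage of a subunit, ignoring gap characters such as N."""
    window = window.upper()
    known = sum(window.count(base) for base in "ACGT")  # gaps are not counted
    if known == 0:
        return 0.0  # assumption: an all-gap subunit is reported as zero
    gc = window.count("G") + window.count("C")
    return 100.0 * gc / known


# 541 GC out of 1000 known bases: one decimal place.
print(gc_percent("G" * 541 + "A" * 459))
# One base replaced by a gap: the denominator becomes 999.
print(gc_percent("G" * 541 + "A" * 458 + "N"))
```

In the second call the result is 541/999 of 100, which no longer terminates after one decimal place.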

The gap file produced gives the scaffold in which the gap resides and the start and end position of the gap. This program defines a gap as unknown nucleotides, given as "N" in the FASTA file.
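A minimal sketch of locating such gaps in a scaffold sequence is shown below. The function name find_gaps is hypothetical, and the coordinate convention (0-based start, exclusive end) and the handling of lowercase "n" are assumptions; the program's own output may use different coordinates.

```python
import re


def find_gaps(scaffold_name, sequence):
    """Return (scaffold, start, end) for every run of N in the sequence."""
    return [
        (scaffold_name, match.start(), match.end())
        for match in re.finditer(r"[Nn]+", sequence)
    ]


for gap in find_gaps("scaffold_1", "ACGTNNNNACGNAC"):
    print(*gap)
```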

In addition to the files produced, the following will be displayed to the user:

  • The number of reads in the SAM file that do not map to any scaffold
  • Warnings if any scaffolds had no reads mapped to them
  • Warnings if the scaffolds in your SAM file differ from those in the FASTA file (possibly due to a recombination of your scaffolds with outside software)
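
The two warnings listed above amount to simple set comparisons. The sketch below is illustrative only: the function name scaffold_report, its inputs, and the keys of the returned dictionary are hypothetical and are not taken from the program.

```python
def scaffold_report(sam_scaffolds, fasta_scaffolds, read_counts):
    """Collect scaffolds that would trigger the warnings described above."""
    sam, fasta = set(sam_scaffolds), set(fasta_scaffolds)
    return {
        # Scaffolds named in the SAM file with no reads mapped to them.
        "no_reads": sorted(s for s in sam if read_counts.get(s, 0) == 0),
        # Scaffolds present in one input but not the other.
        "only_in_sam": sorted(sam - fasta),
        "only_in_fasta": sorted(fasta - sam),
    }


report = scaffold_report(
    ["scaffold_1", "scaffold_2", "scaffold_3"],
    ["scaffold_1", "scaffold_2", "scaffold_4"],
    {"scaffold_1": 10, "scaffold_2": 0},
)
print(report)
```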