Pick out best reference from a sam file #65

averagehat · 2015-10-28T21:46:47Z

What would be a reasonable metric for choosing the best-fitting reference from a mapping?
I am thinking of something like: average number of reads mapped at each position + total reads mapped - number of unmapped positions. We could also take into account the base quality and mapping quality of the reads. We could use a dataframe for this:

www.github.com/averagehat/bioframes

necrolyte2 · 2015-10-28T21:54:27Z

You can sort them by these, I think:

1) % Reference coverage
2) Avg depth
3) Mapped reads

Having complete coverage is the most important, I'm pretty sure.
@mmelendrez
@InaMBerry

averagehat · 2015-10-28T22:15:12Z

Yes. You could pick the best reference (by a wide margin), re-run the mapping automatically, but still save all the data from the original mapping so the PI can confirm the choice.

averagehat · 2016-03-16T19:47:05Z

1) and 2) can be derived from the pileup.
Pileup columns:

refId   position(1-based)    reference    depth   readBases   baseQuals

3) can be achieved with idxstats:
samtools idxstats X.bam
idxstats columns (refId may be `*`, indicating unmapped reads):

refId    sequenceLength   mappedReadCount   unmappedReadCount

for pileup:
group by refId
count missing positions (use sequenceLength)
sum depths

then: average depths (using sequenceLength)

for idxstats:
just read the values
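
A minimal sketch of the steps above, in plain Python rather than a dataframe. The function names `parse_idxstats` and `summarize_pileup` and the sample data are hypothetical, made up for illustration. It assumes tab-separated text with the columns listed above, and that positions absent from the pileup (or listed with depth 0) count as missing.

```python
from collections import defaultdict


def parse_idxstats(text):
    """Parse idxstats text into {refId: (sequenceLength, mapped, unmapped)}."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        ref_id, length, mapped, unmapped = line.split("\t")
        stats[ref_id] = (int(length), int(mapped), int(unmapped))
    return stats


def summarize_pileup(pileup_text, lengths):
    """Group pileup rows by refId.

    Returns {refId: (missing_positions, average_depth)}, where both values
    are computed against sequenceLength taken from idxstats.
    """
    covered = defaultdict(int)    # positions with depth > 0
    depth_sum = defaultdict(int)  # sum of depths over all listed positions
    for line in pileup_text.splitlines():
        if not line.strip():
            continue
        ref_id, _pos, _ref_base, depth = line.split("\t")[:4]
        depth = int(depth)
        if depth > 0:
            covered[ref_id] += 1
        depth_sum[ref_id] += depth

    summary = {}
    for ref_id, length in lengths.items():
        if length == 0:
            continue  # skips the "*" row, which has no sequence
        summary[ref_id] = (length - covered[ref_id], depth_sum[ref_id] / length)
    return summary


# Made-up example data: two 10 bp references plus the "*" row.
idxstats_text = "refA\t10\t6\t0\nrefB\t10\t2\t0\n*\t0\t0\t3\n"
pileup_rows = [("refA", p, 2) for p in range(1, 9)] + [("refB", p, 1) for p in range(1, 4)]
pileup_text = "\n".join(
    f"{r}\t{p}\tN\t{d}\t{'.' * d}\t{'I' * d}" for r, p, d in pileup_rows
)

stats = parse_idxstats(idxstats_text)
lengths = {ref_id: length for ref_id, (length, _m, _u) in stats.items()}
summary = summarize_pileup(pileup_text, lengths)
print(summary)
```

The mapped read count for each reference is then just `stats[ref_id][1]`.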

necrolyte2 · 2016-03-17T17:39:33Z

It seems you could use samtools depth for this:

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
   -a                  output all positions (including zero depth)
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
   -b <bed>            list of positions or regions
   -f <list>           list of input BAM filenames, one per line [null]
   -l <int>            read length threshold (ignore reads shorter than <int>)
   -d/-m <int>         maximum coverage depth [8000]
   -q <int>            base quality threshold
   -Q <int>            mapping quality threshold
   -r <chr:from-to>    region
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth.  Note that positions with zero coverage may be
omitted by default; see the -a option.
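
As a rough sketch, the three-column table could be reduced to coverage and mean depth per reference like this. The function name `coverage_from_depth` and the sample text are hypothetical. It assumes a single input BAM and output produced with `-aa`, so every position of every reference appears and the line count per reference equals its length.

```python
def coverage_from_depth(depth_text):
    """Return {refId: (coverage_fraction, mean_depth)} from depth output.

    Assumes one depth column (single BAM) and that all positions are present
    (-aa), so the number of rows per reference is the reference length.
    """
    positions = {}  # rows seen per reference (taken as its length)
    covered = {}    # rows with depth > 0
    total = {}      # sum of depths
    for line in depth_text.splitlines():
        if not line.strip():
            continue
        ref_id, _pos, depth = line.split("\t")
        depth = int(depth)
        positions[ref_id] = positions.get(ref_id, 0) + 1
        covered[ref_id] = covered.get(ref_id, 0) + (1 if depth > 0 else 0)
        total[ref_id] = total.get(ref_id, 0) + depth
    return {
        ref_id: (covered[ref_id] / n, total[ref_id] / n)
        for ref_id, n in positions.items()
    }


# Made-up example: refA has 3 of 4 positions covered, refB has none.
depth_text = "refA\t1\t3\nrefA\t2\t0\nrefA\t3\t5\nrefA\t4\t4\nrefB\t1\t0\nrefB\t2\t0\n"
print(coverage_from_depth(depth_text))
```

Without `-a`/`-aa`, zero-depth positions may be missing from the table, so the reference lengths would have to come from somewhere else, such as idxstats.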

averagehat · 2016-03-18T17:04:21Z

I'm not sure it makes sense to include both avg. depth and # mapped reads, because they represent basically the same information (assuming all references are about the same length).

I'd propose the following weighted equation for deciding which is the best:

(coverageRatio * 1.5) * (mappedReads / totalReads)

1.5 could be whatever weight value seems appropriate.
Of course, we could include lots of other information in this, if we wanted to.
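
A small sketch of how that equation might be applied to rank references. The names `reference_score` and `rank_references` and the example numbers are hypothetical, and `total_reads` is assumed to be the total number of reads in the mapping (mapped plus unmapped across all references).

```python
def reference_score(coverage_ratio, mapped_reads, total_reads, weight=1.5):
    """(coverageRatio * weight) * (mappedReads / totalReads)."""
    if total_reads == 0:
        return 0.0  # nothing mapped or sequenced; avoid dividing by zero
    return (coverage_ratio * weight) * (mapped_reads / total_reads)


def rank_references(candidates, total_reads, weight=1.5):
    """Rank references from best to worst score.

    candidates: {refId: (coverage_ratio, mapped_reads)}
    Returns a list of (refId, score) tuples, highest score first.
    """
    scored = {
        ref_id: reference_score(cov, mapped, total_reads, weight)
        for ref_id, (cov, mapped) in candidates.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Made-up example numbers.
candidates = {"refA": (0.8, 600), "refB": (0.3, 200)}
ranked = rank_references(candidates, total_reads=1000)
print(ranked[0][0])  # id of the top-scoring reference
```

Keeping the full ranked list makes it possible to check how wide the margin between the top two references is. Note that as written the weight multiplies every score by the same factor, so it scales the scores but does not change their order.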

@mmelendrez thoughts?

averagehat added this to the Analyze Alignments milestone Feb 25, 2016

averagehat self-assigned this Feb 25, 2016

averagehat added the Easy label Feb 25, 2016

averagehat added the needs discussion label Mar 18, 2016

averagehat self-assigned this Mar 30, 2016

averagehat added the in progress label Mar 30, 2016

necrolyte2 added the ready label and removed the in progress label Apr 5, 2016

averagehat added the in progress label Apr 6, 2016
