Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick out best reference from a sam file #65

Open
averagehat opened this issue Oct 28, 2015 · 5 comments
Open

Pick out best reference from a sam file #65

averagehat opened this issue Oct 28, 2015 · 5 comments

Comments

@averagehat
Copy link
Contributor

What would be a reasonable metric for choosing the best-fitting reference from a mapping?
I am thinking something like average # reads mapped at each position + total reads mapped - unmapped position. We could also take into account the quality and mapping quality of reads. We could use a dataframe for this:

www.github.com/averagehat/bioframes

@necrolyte2
Copy link
Member

You can sort them I think

  1. % Reference coverage
  2. Avg depth
  3. mapped reads

Having complete coverage is the most important I'm pretty sure
@mmelendrez
@InaMBerry

@averagehat
Copy link
Contributor Author

Yes. You could pick the best reference (by a wide margin), re-run the mapping automatically, but still save all the data from the original mapping so the PI can confirm the choice.

@averagehat averagehat added this to the Analyze Alignments milestone Feb 25, 2016
@averagehat averagehat self-assigned this Feb 25, 2016
@averagehat
Copy link
Contributor Author

  1. and 2) can be derived from the pileup
    PileUp Columns:
refId   position(1-based)    reference    depth   readBases   baseQauls
  1. can be achieved with idxstats
    samtools idxstats X.bam
    idxStats columns: (refId may be star indicating unmapped)
refId    sequenceLength   mappedReadCount   unmappedReadCount

for pileup:
groupby refId
county missing positions (use sequenceLength)
sum depths

then: average depths (using sequenceLength)

for idxstats:
just read the values

@necrolyte2
Copy link
Member

Seems that you could utilize samtools depth

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
   -a                  output all positions (including zero depth)
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
   -b <bed>            list of positions or regions
   -f <list>           list of input BAM filenames, one per line [null]
   -l <int>            read length threshold (ignore reads shorter than <int>)
   -d/-m <int>         maximum coverage depth [8000]
   -q <int>            base quality threshold
   -Q <int>            mapping quality threshold
   -r <chr:from-to>    region
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth.  Note that positions with zero coverage may be
omitted by default; see the -a option.

@averagehat
Copy link
Contributor Author

I'm not sure if it makes sence to include both avg. depth and # mapped reads, because they represent basically the same information (assuming all references are about the same length.)

I'd propose the following weighted equation for deciding which is the best:

(coverageRatio * 1.5) * (mappedReads / totalReads) 

1.5 could be whatever weight value seems appropriate.
Of course, we could include lots of other information in this, if we wanted to.

@mmelendrez thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants