doc: explain the new -x option in genome2access.py
etal committed May 1, 2015
1 parent c3e8f9e commit 73f457b
Showing 2 changed files with 29 additions and 10 deletions.
37 changes: 28 additions & 9 deletions doc/scripts.rst
@@ -1,17 +1,36 @@
Additional scripts
==================

refFlat2bed.py
Generate a BED file of the genes or exons in the reference genome given in
UCSC refFlat.txt format.
This script can be used in case the original BED file of targeted intervals
is unavailable. Subsequent steps of the pipeline will remove probes that
did not receive sufficient coverage, including those exons or genes that
were not targeted by the sequencing library. However, better results are
expected from CNVkit if the true targeted intervals can be provided.

genome2access.py:
Calculate the sequence-accessible coordinates in chromosomes from the given
reference genome, treating long spans of 'N' characters as the inaccessible
regions.

CNVkit will compute "antitarget" bins only within the accessible genomic
regions specified in the "access" file produced by this script. If there are
many small excluded/inaccessible regions in the genome, then small,
less-reliable antitarget bins would be squeezed into the remaining
accessible regions. The ``-s`` option tells the script to ignore short
regions that would otherwise be excluded as inaccessible, allowing larger
antitarget bins to overlap them.

Additional regions to exclude can also be given with the ``-x`` option. This
option can be used more than once to exclude several BED files listing
different sets of regions. For example, "excludable" regions of poor
mappability have been precalculated by others and are available from the
`UCSC FTP Server <ftp://hgdownload.soe.ucsc.edu/goldenPath/>`_
(see `here for hg19
<ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/>`_).


refFlat2bed.py
Generate a BED file of the genes or exons in the reference genome given in
UCSC refFlat.txt format. (Download the input file from `UCSC Genome
Bioinformatics <http://hgdownload.soe.ucsc.edu/downloads.html>`_).

This script can be used in case the original BED file of targeted intervals
is unavailable. Subsequent steps of the pipeline will remove probes that
did not receive sufficient coverage, including those exons or genes that
were not targeted by the sequencing library. However, CNVkit will give much
better results if the true targeted intervals can be provided.

2 changes: 1 addition & 1 deletion scripts/genome2access.py
@@ -174,7 +174,7 @@ def next_or_inf(iterable):
AP = argparse.ArgumentParser(description=__doc__)
AP.add_argument("fa_fname",
help="Genome FASTA file name")
-AP.add_argument("-s", "--min-gap-size", type=int, default=100,
+AP.add_argument("-s", "--min-gap-size", type=int, default=5000,
help="""Minimum gap size between accessible sequence
regions. Regions separated by less than this distance will
be joined together. [Default: %(default)s]""")
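
The effect of the minimum gap size can be illustrated with a minimal sketch. This is not the code in genome2access.py; the function names ``accessible_spans`` and ``join_regions`` are hypothetical, and the sketch assumes half-open (start, end) coordinates and that a gap exactly equal to the threshold is left open, following the "less than this distance" wording of the help text.

```python
import re


def accessible_spans(sequence):
    """Yield half-open (start, end) spans of sequence that contain no 'N'."""
    for match in re.finditer(r"[^Nn]+", sequence):
        yield match.start(), match.end()


def join_regions(regions, min_gap_size):
    """Merge sorted regions separated by fewer than min_gap_size bases.

    Hypothetical helper: the threshold comparison (strictly less than)
    is an assumption based on the option's help text.
    """
    joined = []
    for start, end in regions:
        if joined and start - joined[-1][1] < min_gap_size:
            # Gap is too short to matter: extend the previous region over it.
            prev_start, prev_end = joined[-1]
            joined[-1] = (prev_start, max(prev_end, end))
        else:
            joined.append((start, end))
    return joined


# Toy sequence: a 3-base gap of N's, then a 10-base gap of N's.
seq = "ACGT" + "N" * 3 + "ACGT" + "N" * 10 + "AC"
spans = list(accessible_spans(seq))
merged = join_regions(spans, min_gap_size=5)
print(spans)
print(merged)
```

With a threshold of 5, the 3-base gap is bridged and the 10-base gap remains excluded. Raising the default from 100 to 5000, as this commit does, bridges more short gaps in the same way, which leaves fewer small accessible fragments for antitarget bins to be squeezed into.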