
Commit

Merge branch 'master' into Version1.1.0
knutdrand committed Apr 6, 2018
2 parents f9eb041 + 9d07b9a commit 1dc611a
Showing 19 changed files with 347 additions and 113 deletions.
12 changes: 8 additions & 4 deletions README.md
@@ -40,13 +40,17 @@ Note that even though Graph Peak Caller is typically used with vg, it is possibl
### Step 2: Call peaks
Finally, we can call peaks by using the *callpeaks* command:
```
graph_peak_caller callpeaks -g graph.nobg -s alignments.json -f FRAGMENT_LENGTH -r READ_LENGTH
graph_peak_caller callpeaks -g graph.nobg -s alignments.json
```

Make sure to replace *FRAGMENT_LENGTH* and *READ_LENGTH* with numbers matching your data. If you do not know the fragment length of your ChIP-seq experiment, you can use MACS2's `predictd` command to estimate it: `macs2 predictd -i alignments.bam`.
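The idea behind such estimators can be sketched in a few lines of Python. This is a toy illustration on synthetic reads, not MACS's actual paired-peak model: reverse-strand 5' ends sit roughly one fragment length downstream of forward-strand 5' ends, so the shift that maximizes the strand cross-correlation estimates the fragment length.

```python
import numpy as np

# Toy fragment-length estimation on synthetic data -- NOT MACS's model.
rng = np.random.default_rng(0)
frag_len = 150
fwd_starts = rng.integers(0, 10_000, size=2_000)

# per-base counts of 5' ends on each strand
fwd = np.bincount(fwd_starts, minlength=11_000)
rev = np.bincount(fwd_starts + frag_len, minlength=11_000)

# the shift maximizing forward/reverse correlation is the estimate
shifts = np.arange(50, 300)
corr = [np.dot(fwd[:len(fwd) - s], rev[s:]) for s in shifts]
estimate = int(shifts[np.argmax(corr)])
print("estimated fragment length:", estimate)
```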

If you do not have a control track, use your sample reads as control and change True to False in the above command. Changing True to False is important, as Graph Peak Caller generates the background signal in a different way when the sample is used as control.

Note that you can easily pass a list of graphs and input files to run on multiple graphs. This is useful if you e.g. have one graph and one input file per chromosome:
```
graph_peak_caller callpeaks -g graph_chr*.nobg -s alignments_chr*.json
```
On a Unix command line, `*` works as a wildcard. `-g` and `-s` can also take multiple file names separated by spaces.
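The pairing of graphs and sample files can be sketched in Python. This mirrors what the shell's wildcard expansion gives you; the per-chromosome file names here are hypothetical:

```python
import os
import tempfile
from glob import glob

# Mimic the shell expanding graph_chr*.nobg: create hypothetical
# per-chromosome files, then pair graphs with alignments by sorting --
# which is also what you rely on when passing wildcards to callpeaks.
workdir = tempfile.mkdtemp()
for chrom in ["1", "2", "X"]:
    open(os.path.join(workdir, "graph_chr%s.nobg" % chrom), "w").close()
    open(os.path.join(workdir, "alignments_chr%s.json" % chrom), "w").close()

graphs = sorted(glob(os.path.join(workdir, "graph_chr*.nobg")))
samples = sorted(glob(os.path.join(workdir, "alignments_chr*.json")))
assert len(graphs) == len(samples), "need one sample file per graph"
```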

### Output
The peak caller will create various output files, some of which are less relevant and only useful for debugging. The two files containing peak information are:
* **(base_name)max_paths.intervalcollection**: This file contains the peaks in a JSON format. It can be read by Graph Peak Caller in a Python script, e.g. by doing:
@@ -69,7 +73,7 @@ Change *chromosome* to the chromosome that you want to be used when writing the


## Advanced usage
If you want to do Peak Calling on a whole-genome reference, vg will typically produce one graph for each chromosome, and it is best to divide the peak calling into one process for each chromosome. Check out [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) for a detailed explanation on how that can be done.
If you want to do peak calling on a whole-genome reference, vg will typically produce one graph per chromosome, and it is best to divide the peak calling into one process per chromosome. You can do that simply by passing multiple graphs and input files to the `callpeaks` command, but on huge data sets it is faster to start one `callpeaks` process per chromosome and run these in parallel. An important thing to keep in mind then is that each process should only run until p-values have been computed before continuing, since q-values need to be computed from all p-values. Check out [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) for a detailed explanation of how that can be done.
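The reason each process must pause after p-values is that a q-value depends on the rank of its p-value among all p-values genome-wide. A minimal Benjamini-Hochberg-style sketch (not Graph Peak Caller's actual implementation) shows the pooling:

```python
import numpy as np

def bh_qvalues(pvalues):
    # Benjamini-Hochberg style transform: each q-value depends on the
    # rank of its p-value among ALL p-values, so per-chromosome
    # processes cannot proceed past this point independently.
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    monotone = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(monotone, 1.0)
    return q

# pooling p-values from two chromosomes changes every q-value,
# which is why each per-chromosome run must stop after p-values
chr1_pvalues = [0.01, 0.2]
chr2_pvalues = [0.03, 0.5]
q_pooled = bh_qvalues(chr1_pvalues + chr2_pvalues)
```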

# Reproducing the results from the Graph Peak Caller manuscript.
Follow [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Reproducing-the-results-in-Graph-Peak-Caller-Paper) in order to run the ChIP-seq experiments presented in the manuscript.
Empty file modified __init__.py
100755 → 100644
Empty file.
32 changes: 17 additions & 15 deletions benchmarking/simple_chip_seq_pipeline.sh
@@ -93,7 +93,8 @@ fi
read_length=$(cat macs_output_whole_run.txt | gawk 'match($0, /tag size = ([0-9]+)/, ary) {print ary[1]}' )
echo "Found read length: $read_length"
fragment_length=$(cat macs_output_whole_run.txt | gawk 'match($0, /fragment length is ([0-9]+)/, ary) {print ary[1]}' )
echo "Found fragment length: $fragment_length"

#echo "Found fragment length: $fragment_length"


# Step 3: Map reads
@@ -131,6 +132,16 @@ echo "Removing low quality reads."
#cp filtered.json filtered_low_qual_reads_removed.json


# Get fragment length

#if [ ! -f shift_estimation_log.txt ]; then
# graph_peak_caller estimate_shift $chromosomes $graph_dir/ filtered_low_qual_reads_removed_ 3 100 > shift_estimation_log.txt 2>&1
#else
# echo "Not finding fragment length. Already done."
#fi
#fragment_length=$(cat shift_estimation_log.txt | gawk 'match($0, /Found shift: ([0-9]+)/, ary) {print ary[1]}' )
#echo "Fragment length is $fragment_length"


# Step 5: Split filtered into chromosomes
if ls filtered_low_qual_reads_removed_* 1> /dev/null 2>&1; then
@@ -149,31 +160,22 @@ fi

unique_reads=$(tail -n 1 count_unique_reads_output.txt)

#if [ ! -f unique_reads.txt ]; then
# echo "Counting unique reads"
# unique_reads=$(pcregrep -o1 '"sequence": "([ACGTNacgtn]{20,})"' filtered_low_qual_reads_removed.json | sort | uniq | wc -l)
# echo $unique_reads > unique_reads.txt
# echo "$unique_reads unique reads in total"
#else
# echo "Using cached count for unique reads."
# unique_reads=$(tail -n 1 unique_reads.txt)
#fi

# Step 6 run peak caller to get p-values for every chromosome
pids=""
RESULT=0
for chromosome in $(echo $chromosomes | tr "," "\n")
do
if [ ! -f ${chromosome}_pvalues_values.npy ]; then
graph_peak_caller callpeaks_whole_genome $chromosome \
-d $graph_dir \
-s filtered_low_qual_reads_removed_ \
-n "" \
graph_peak_caller callpeaks \
-g $graph_dir/$chromosome.nobg \
-s filtered_low_qual_reads_removed_$chromosome.json \
-n ${chromosome}_ \
-f $fragment_length \
-r $read_length \
-p True \
-u $unique_reads \
-g $genome_size \
-G $genome_size \
> log_before_p_values_$chromosome.txt 2>&1 &
pids="$pids $!"
echo "Peak calling (until p-values) for chr $chromosome started as process. Log will be written to $work_dir/log_before_p_values_$chromosome.txt"
4 changes: 2 additions & 2 deletions graph_peak_caller/callpeaks.py
@@ -10,6 +10,7 @@
import json
import sys


class Configuration:
def __init__(self):
self.read_length = None
@@ -47,6 +48,7 @@ def run_pre_callpeaks(self, input_reads, control_reads):
sample_pileup = get_fragment_pileup(
self.graph, input_reads, self.config,
self._reporter)

background_func = get_background_track_from_control
except json.decoder.JSONDecodeError as e:
logging.debug(e)
@@ -155,8 +157,6 @@ def __postprocess(self):
self.touched_nodes
).run()
self._reporter.add("hole_cleaned", self.pre_processed_peaks)

logging.info("Not removing small peaks")
self.filtered_peaks = self.pre_processed_peaks

def __get_max_paths(self):
149 changes: 146 additions & 3 deletions graph_peak_caller/callpeaks_interface.py
@@ -1,5 +1,6 @@
import logging
import os
import numpy as np

import offsetbasedgraph as obg
from pyvg.conversion import vg_json_file_to_interval_collection
@@ -10,9 +11,20 @@
from .peakfasta import PeakFasta
from .reporter import Reporter
from .intervals import UniqueIntervals
from .shiftestimation import MultiGraphShiftEstimator
import sys


def estimate_read_length(file_name, graph_name):
graph = obg.Graph.from_file(graph_name)
if file_name.endswith(".intervalcollection"):
intervals = obg.IntervalCollection.create_generator_from_file(
file_name, graph=graph)
else:
intervals = vg_json_file_to_interval_collection(file_name, graph)
return int(np.median([i.length() for i in intervals]))


def get_confiugration(args):
config = Configuration()
config.has_control = args.control is not None
@@ -32,8 +44,6 @@ def get_intervals(args):
samples = iclass(parse_input_file(args.sample, args.graph))
control_name = args.sample if args.control is None else args.control
controls = iclass(parse_input_file(control_name, args.graph))


return samples, controls


@@ -46,6 +56,7 @@ def get_callpeaks(args):


def parse_input_file(input, graph):
logging.info("Parsing input file %s" % input)
try:
if isinstance(input, obg.IntervalCollection):
return input
@@ -65,6 +76,7 @@ def parse_input_file(input, graph):
logging.critical("Input file %s not found. Aborting. " % input)
sys.exit(1)


def run_callpeaks_interface(args):
caller = get_callpeaks(args)
inputs, controls = get_intervals(args)
@@ -90,6 +102,137 @@ def find_or_create_linear_map(graph, linear_map_name):
logging.info("Done creating linear map")


def _get_file_names(pattern):
    from glob import glob

    if pattern is None:
        return None

    assert isinstance(pattern, list)
    # a single entry containing a wildcard is expanded with glob;
    # an explicit list of file names is just sorted
    if len(pattern) == 1 and "*" in pattern[0]:
        return sorted(glob(pattern[0]))

    return sorted(pattern)


def run_callpeaks2(args):
graphs = _get_file_names(args.graph)
samples = _get_file_names(args.sample)
controls = _get_file_names(args.control)
if controls is None:
controls = samples

if len(samples) != len(graphs):
logging.critical("""Number of sample files must be equal to number of graph files.
Found %d graphs and %d sample files.
Use --verbose 2 for more details.""" % (len(graphs), len(samples)))
sys.exit(1)

logging.debug("Fragment length input: %s" % args.fragment_length)
logging.info("Sample files: %s" % samples)
    logging.debug("Control files: %s" % controls)
logging.debug("Graph files: %s" % graphs)

logging.info("Using graphs: %s " % graphs)
sequence_graph_file_names = [fn + ".sequences" for fn in graphs]
logging.info("Will use sequence graphs. %s" % sequence_graph_file_names)
sequence_retrievers = (obg.SequenceGraph.from_file(fn)
for fn in sequence_graph_file_names)

data_dir = os.path.dirname(graphs[0])
if data_dir == "":
data_dir = "./"

logging.info("Using graphs from data directory %s" % data_dir)
    if len(graphs) > 1:
        # Make custom names for different runs
        names = [fn.split(data_dir, 1)[-1].rsplit(".", 1)[0] for fn in graphs]
else:
names = [""]

    logging.info("Will use %s as extra experiment names for each run, based on"
                 " graph file names. If only running on a single graph, this"
                 " should be empty." % names)

linear_map_file_names = []
for i, graph_file_name in enumerate(graphs):
linear_map_name = graph_file_name.split(".nobg")[0] + "_linear_map.npz"
if not os.path.isfile(linear_map_name):
            logging.warning("Did not find linear map for graph %s."
                            " Will create." % graph_file_name)
graph = obg.Graph.from_file(graphs[i])
create_linear_map(graph, linear_map_name)
else:
logging.info(
"Found linear map %s that will be used." % linear_map_name)
linear_map_file_names.append(linear_map_name)

logging.info("Will use linear maps: %s" % linear_map_file_names)

config = Configuration()
if args.keep_duplicates == "True":
config.keep_duplicates = True
logging.info("Keeping duplicates")

if args.fragment_length is None:
logging.info("Fragment length was not specified. Will now"
" predict fragment length.")

min_m = 5 if args.min_fold_enrichment is None else int(args.min_fold_enrichment)
max_m = 50 if args.max_fold_enrichment is None else int(args.max_fold_enrichment)
config.fragment_length = int(MultiGraphShiftEstimator.from_files(
graphs, samples, min_m, max_m).get_estimates())
logging.info("Estimated fragment length to be %d" % config.fragment_length)
assert config.fragment_length < 1000, "Suspiciously high fragment length. Probably a bad estimate." \
" Change -m/-M to try to get more paired peaks."
else:
config.fragment_length = int(args.fragment_length)
if args.read_length is None:
config.read_length = estimate_read_length(samples[0], graphs[0])
logging.info("Estimated read length to %s" % config.read_length)
else:
config.read_length = int(args.read_length)

if config.fragment_length < config.read_length:
logging.critical("Fragment length is smaller than read length. Cannot call peaks.")
sys.exit(1)

if args.genome_size is not None:
genome_size = int(args.genome_size)
        config.min_background = int(args.unique_reads) * config.fragment_length / genome_size

logging.info(
"Computed min background signal to be %.3f using fragment length %f, "
" %d unique reads, and genome size %d" % (config.min_background,
config.fragment_length,
int(args.unique_reads),
int(genome_size)))
else:
logging.info("Not using min background.")
config.min_background = None
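The min background set in the branch above is just the expected per-base fragment coverage. A toy check of the formula, with hypothetical numbers:

```python
# expected genome-wide fragment coverage used as a background floor
# (hypothetical numbers; same formula as min_background above)
unique_reads = 10_000_000
fragment_length = 200
genome_size = 2_000_000_000

min_background = unique_reads * fragment_length / genome_size
# one fragment covers an average base, so the floor is 1.0
print(min_background)
```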

out_name = args.out_name if args.out_name is not None else ""
reporter = Reporter(out_name)
config.has_control = args.control is not None
caller = MultipleGraphsCallpeaks(
names,
graphs,
samples,
controls,
linear_map_file_names,
config, reporter,
sequence_retrievers=sequence_retrievers,
stop_after_p_values=args.stop_after_p_values == "True",
)
caller.run()


def run_callpeaks_whole_genome(args):
logging.info("Running run_callpeaks_whole_genome")

@@ -109,7 +252,7 @@ def run_callpeaks_whole_genome(args):
if not os.path.isfile(linear_map_name):
logging.warning("Did not find linear map for "
"chromosome %s. Will create." % chrom)
graph = obg.Graph.from_numpy_file(graph_file_names[i])
graph = obg.Graph.from_file(graph_file_names[i])
create_linear_map(graph, linear_map_name)
else:
logging.info("Found linear map %s that will be used." % linear_map_name)
