
Commit

Merge branch 'master' into Version1.1.0
knutdrand committed Apr 6, 2018
2 parents f9eb041 + 9d07b9a commit 1dc611a
Showing 19 changed files with 347 additions and 113 deletions.
12 changes: 8 additions & 4 deletions README.md
@@ -40,13 +40,17 @@ Note that even though Graph Peak Caller is typically used with vg, it is possibl
### Step 2: Call peaks
Finally, we can call peaks by using the *callpeaks* command:
```
graph_peak_caller callpeaks -g graph.nobg -s alignments.json -f FRAGMENT_LENGTH -r READ_LENGTH
graph_peak_caller callpeaks -g graph.nobg -s alignments.json
```

Make sure to replace *FRAGMENT_LENGTH* and *READ_LENGTH* with numbers matching your data. If you do not know the fragment length of your ChIP-seq experiment, you can use MACS2's `predictd` command to estimate it: `macs2 predictd -i alignments.bam`.
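The idea behind such estimators can be sketched in a few lines of Python. This is a toy illustration on synthetic reads, not MACS's actual paired-peak model: reverse-strand 5' ends sit roughly one fragment length downstream of forward-strand 5' ends, so the shift that maximizes the strand cross-correlation estimates the fragment length.

```python
import numpy as np

# Toy fragment-length estimation on synthetic data -- NOT MACS's model.
rng = np.random.default_rng(0)
frag_len = 150
fwd_starts = rng.integers(0, 10_000, size=2_000)

# per-base counts of 5' ends on each strand
fwd = np.bincount(fwd_starts, minlength=11_000)
rev = np.bincount(fwd_starts + frag_len, minlength=11_000)

# the shift maximizing forward/reverse correlation is the estimate
shifts = np.arange(50, 300)
corr = [np.dot(fwd[:len(fwd) - s], rev[s:]) for s in shifts]
estimate = int(shifts[np.argmax(corr)])
print("estimated fragment length:", estimate)
```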

If you do not have a control track, use your sample reads as control and change True to False in the above command. Changing True to False is important, as Graph Peak Caller generates the background signal in a different way when the sample is used as control.

Note that you can easily pass a list of graphs and input files to run on multiple graphs. This is useful if you e.g. have one graph and one input file per chromosome:
```
graph_peak_caller callpeaks -g graph_chr*.nobg -s alignments_chr*.json
```
On a Unix command line, `*` works as a wildcard. `-g` and `-s` can also take multiple file names separated by spaces.
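The pairing of graphs and sample files can be sketched in Python. This mirrors what the shell's wildcard expansion gives you; the per-chromosome file names here are hypothetical:

```python
import os
import tempfile
from glob import glob

# Mimic the shell expanding graph_chr*.nobg: create hypothetical
# per-chromosome files, then pair graphs with alignments by sorting --
# which is also what you rely on when passing wildcards to callpeaks.
workdir = tempfile.mkdtemp()
for chrom in ["1", "2", "X"]:
    open(os.path.join(workdir, "graph_chr%s.nobg" % chrom), "w").close()
    open(os.path.join(workdir, "alignments_chr%s.json" % chrom), "w").close()

graphs = sorted(glob(os.path.join(workdir, "graph_chr*.nobg")))
samples = sorted(glob(os.path.join(workdir, "alignments_chr*.json")))
assert len(graphs) == len(samples), "need one sample file per graph"
```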

### Output
The peak caller will create various output files, some of which are less relevant and only useful for debugging. The two files containing peak information are:
* **(base_name)max_paths.intervalcollection**: This file contains the peaks in a JSON format. It can be read by Graph Peak Caller in a Python script, e.g. by doing:
@@ -69,7 +73,7 @@ Change *chromosome* to the chromosome that you want to be used when writing the


## Advanced usage
If you want to do Peak Calling on a whole-genome reference, vg will typically produce one graph for each chromosome, and it is best to divide the peak calling into one process for each chromosome. Check out [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) for a detailed explanation on how that can be done.
If you want to do peak calling on a whole-genome reference, vg will typically produce one graph per chromosome, and it is best to divide the peak calling into one process per chromosome. You can do that simply by passing multiple graphs and input files to the `callpeaks` command, but on huge data sets it is faster to start one `callpeaks` process per chromosome and run these in parallel. An important thing to keep in mind then is that each process should only run until p-values have been computed before continuing, since q-values need to be computed from all p-values. Check out [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) for a detailed explanation of how that can be done.
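The reason each process must pause after p-values is that a q-value depends on the rank of its p-value among all p-values genome-wide. A minimal Benjamini-Hochberg-style sketch (not Graph Peak Caller's actual implementation) shows the pooling:

```python
import numpy as np

def bh_qvalues(pvalues):
    # Benjamini-Hochberg style transform: each q-value depends on the
    # rank of its p-value among ALL p-values, so per-chromosome
    # processes cannot proceed past this point independently.
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    monotone = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(monotone, 1.0)
    return q

# pooling p-values from two chromosomes changes every q-value,
# which is why each per-chromosome run must stop after p-values
chr1_pvalues = [0.01, 0.2]
chr2_pvalues = [0.03, 0.5]
q_pooled = bh_qvalues(chr1_pvalues + chr2_pvalues)
```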

# Reproducing the results from the Graph Peak Caller manuscript.
Follow [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Reproducing-the-results-in-Graph-Peak-Caller-Paper) in order to run the ChIP-seq experiments presented in the manuscript.
Empty file modified __init__.py
100755 → 100644
Empty file.
32 changes: 17 additions & 15 deletions benchmarking/simple_chip_seq_pipeline.sh
@@ -93,7 +93,8 @@ fi
read_length=$(cat macs_output_whole_run.txt | gawk 'match($0, /tag size = ([0-9]+)/, ary) {print ary[1]}' )
echo "Found read length: $read_length"
fragment_length=$(cat macs_output_whole_run.txt | gawk 'match($0, /fragment length is ([0-9]+)/, ary) {print ary[1]}' )
echo "Found fragment length: $fragment_length"

#echo "Found fragment length: $fragment_length"


# Step 3: Map reads
@@ -131,6 +132,16 @@ echo "Removing low quality reads."
#cp filtered.json filtered_low_qual_reads_removed.json


# Get fragment length

#if [ ! -f shift_estimation_log.txt ]; then
# graph_peak_caller estimate_shift $chromosomes $graph_dir/ filtered_low_qual_reads_removed_ 3 100 > shift_estimation_log.txt 2>&1
#else
# echo "Not finding fragment length. Already done."
#fi
#fragment_length=$(cat shift_estimation_log.txt | gawk 'match($0, /Found shift: ([0-9]+)/, ary) {print ary[1]}' )
#echo "Fragment length is $fragment_length"


# Step 5: Split filtered into chromosomes
if ls filtered_low_qual_reads_removed_* 1> /dev/null 2>&1; then
@@ -149,31 +160,22 @@ fi

unique_reads=$(tail -n 1 count_unique_reads_output.txt)

#if [ ! -f unique_reads.txt ]; then
# echo "Counting unique reads"
# unique_reads=$(pcregrep -o1 '"sequence": "([ACGTNacgtn]{20,})"' filtered_low_qual_reads_removed.json | sort | uniq | wc -l)
# echo $unique_reads > unique_reads.txt
# echo "$unique_reads unique reads in total"
#else
# echo "Using cached count for unique reads."
# unique_reads=$(tail -n 1 unique_reads.txt)
#fi

# Step 6 run peak caller to get p-values for every chromosome
pids=""
RESULT=0
for chromosome in $(echo $chromosomes | tr "," "\n")
do
if [ ! -f ${chromosome}_pvalues_values.npy ]; then
graph_peak_caller callpeaks_whole_genome $chromosome \
-d $graph_dir \
-s filtered_low_qual_reads_removed_ \
-n "" \
graph_peak_caller callpeaks \
-g $graph_dir/$chromosome.nobg \
-s filtered_low_qual_reads_removed_$chromosome.json \
-n ${chromosome}_ \
-f $fragment_length \
-r $read_length \
-p True \
-u $unique_reads \
-g $genome_size \
-G $genome_size \
> log_before_p_values_$chromosome.txt 2>&1 &
pids="$pids $!"
echo "Peak calling (until p-values) for chr $chromosome started as process. Log will be written to $work_dir/log_before_p_values_$chromosome.txt"
4 changes: 2 additions & 2 deletions graph_peak_caller/callpeaks.py
@@ -10,6 +10,7 @@
import json
import sys


class Configuration:
def __init__(self):
self.read_length = None
@@ -47,6 +48,7 @@ def run_pre_callpeaks(self, input_reads, control_reads):
sample_pileup = get_fragment_pileup(
self.graph, input_reads, self.config,
self._reporter)

background_func = get_background_track_from_control
except json.decoder.JSONDecodeError as e:
logging.debug(e)
@@ -155,8 +157,6 @@ def __postprocess(self):
self.touched_nodes
).run()
self._reporter.add("hole_cleaned", self.pre_processed_peaks)

logging.info("Not removing small peaks")
self.filtered_peaks = self.pre_processed_peaks

def __get_max_paths(self):
149 changes: 146 additions & 3 deletions graph_peak_caller/callpeaks_interface.py
@@ -1,5 +1,6 @@
import logging
import os
import numpy as np

import offsetbasedgraph as obg
from pyvg.conversion import vg_json_file_to_interval_collection
@@ -10,9 +11,20 @@
from .peakfasta import PeakFasta
from .reporter import Reporter
from .intervals import UniqueIntervals
from .shiftestimation import MultiGraphShiftEstimator
import sys


def estimate_read_length(file_name, graph_name):
graph = obg.Graph.from_file(graph_name)
if file_name.endswith(".intervalcollection"):
intervals = obg.IntervalCollection.create_generator_from_file(
file_name, graph=graph)
else:
intervals = vg_json_file_to_interval_collection(file_name, graph)
return int(np.median([i.length() for i in intervals]))


def get_confiugration(args):
config = Configuration()
config.has_control = args.control is not None
@@ -32,8 +44,6 @@ def get_intervals(args):
samples = iclass(parse_input_file(args.sample, args.graph))
control_name = args.sample if args.control is None else args.control
controls = iclass(parse_input_file(control_name, args.graph))


return samples, controls


@@ -46,6 +56,7 @@ def get_callpeaks(args):


def parse_input_file(input, graph):
logging.info("Parsing input file %s" % input)
try:
if isinstance(input, obg.IntervalCollection):
return input
@@ -65,6 +76,7 @@ def parse_input_file(input, graph):
logging.critical("Input file %s not found. Aborting. " % input)
sys.exit(1)


def run_callpeaks_interface(args):
caller = get_callpeaks(args)
inputs, controls = get_intervals(args)
@@ -90,6 +102,137 @@ def find_or_create_linear_map(graph, linear_map_name):
logging.info("Done creating linear map")


def _get_file_names(pattern):
    from glob import glob

    if pattern is None:
        return None

    assert isinstance(pattern, list)
    # a single entry containing a wildcard is expanded with glob;
    # an explicit list of file names is just sorted
    if len(pattern) == 1 and "*" in pattern[0]:
        return sorted(glob(pattern[0]))

    return sorted(pattern)


def run_callpeaks2(args):
graphs = _get_file_names(args.graph)
samples = _get_file_names(args.sample)
controls = _get_file_names(args.control)
if controls is None:
controls = samples

if len(samples) != len(graphs):
logging.critical("""Number of sample files must be equal to number of graph files.
Found %d graphs and %d sample files.
Use --verbose 2 for more details.""" % (len(graphs), len(samples)))
sys.exit(1)

logging.debug("Fragment length input: %s" % args.fragment_length)
logging.info("Sample files: %s" % samples)
    logging.debug("Control files: %s" % controls)
logging.debug("Graph files: %s" % graphs)

logging.info("Using graphs: %s " % graphs)
sequence_graph_file_names = [fn + ".sequences" for fn in graphs]
logging.info("Will use sequence graphs. %s" % sequence_graph_file_names)
sequence_retrievers = (obg.SequenceGraph.from_file(fn)
for fn in sequence_graph_file_names)

data_dir = os.path.dirname(graphs[0])
if data_dir == "":
data_dir = "./"

logging.info("Using graphs from data directory %s" % data_dir)
    if len(graphs) > 1:
        # Make custom names for different runs
        names = [fn.split(data_dir, 1)[-1].rsplit(".", 1)[0] for fn in graphs]
else:
names = [""]

    logging.info("Will use %s as extra experiment names for each run, based on"
                 " graph file names. If only running on a single graph, this"
                 " should be empty." % names)

linear_map_file_names = []
for i, graph_file_name in enumerate(graphs):
linear_map_name = graph_file_name.split(".nobg")[0] + "_linear_map.npz"
if not os.path.isfile(linear_map_name):
            logging.warning("Did not find linear map for graph %s."
                            " Will create." % graph_file_name)
graph = obg.Graph.from_file(graphs[i])
create_linear_map(graph, linear_map_name)
else:
logging.info(
"Found linear map %s that will be used." % linear_map_name)
linear_map_file_names.append(linear_map_name)

logging.info("Will use linear maps: %s" % linear_map_file_names)

config = Configuration()
if args.keep_duplicates == "True":
config.keep_duplicates = True
logging.info("Keeping duplicates")

if args.fragment_length is None:
logging.info("Fragment length was not specified. Will now"
" predict fragment length.")

min_m = 5 if args.min_fold_enrichment is None else int(args.min_fold_enrichment)
max_m = 50 if args.max_fold_enrichment is None else int(args.max_fold_enrichment)
config.fragment_length = int(MultiGraphShiftEstimator.from_files(
graphs, samples, min_m, max_m).get_estimates())
logging.info("Estimated fragment length to be %d" % config.fragment_length)
assert config.fragment_length < 1000, "Suspiciously high fragment length. Probably a bad estimate." \
" Change -m/-M to try to get more paired peaks."
else:
config.fragment_length = int(args.fragment_length)
if args.read_length is None:
config.read_length = estimate_read_length(samples[0], graphs[0])
logging.info("Estimated read length to %s" % config.read_length)
else:
config.read_length = int(args.read_length)

if config.fragment_length < config.read_length:
logging.critical("Fragment length is smaller than read length. Cannot call peaks.")
sys.exit(1)

if args.genome_size is not None:
genome_size = int(args.genome_size)
        config.min_background = int(args.unique_reads) * config.fragment_length / genome_size

logging.info(
"Computed min background signal to be %.3f using fragment length %f, "
" %d unique reads, and genome size %d" % (config.min_background,
config.fragment_length,
int(args.unique_reads),
int(genome_size)))
else:
logging.info("Not using min background.")
config.min_background = None
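The min background set in the branch above is just the expected per-base fragment coverage. A toy check of the formula, with hypothetical numbers:

```python
# expected genome-wide fragment coverage used as a background floor
# (hypothetical numbers; same formula as min_background above)
unique_reads = 10_000_000
fragment_length = 200
genome_size = 2_000_000_000

min_background = unique_reads * fragment_length / genome_size
# one fragment covers an average base, so the floor is 1.0
print(min_background)
```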

out_name = args.out_name if args.out_name is not None else ""
reporter = Reporter(out_name)
config.has_control = args.control is not None
caller = MultipleGraphsCallpeaks(
names,
graphs,
samples,
controls,
linear_map_file_names,
config, reporter,
sequence_retrievers=sequence_retrievers,
stop_after_p_values=args.stop_after_p_values == "True",
)
caller.run()


def run_callpeaks_whole_genome(args):
logging.info("Running run_callpeaks_whole_genome")

@@ -109,7 +252,7 @@ def run_callpeaks_whole_genome(args):
if not os.path.isfile(linear_map_name):
logging.warning("Did not find linear map for "
"chromosome %s. Will create." % chrom)
graph = obg.Graph.from_numpy_file(graph_file_names[i])
graph = obg.Graph.from_file(graph_file_names[i])
create_linear_map(graph, linear_map_name)
else:
logging.info("Found linear map %s that will be used." % linear_map_name)
