diff --git a/README.md b/README.md index df7cbe8..dd058f0 100644 --- a/README.md +++ b/README.md @@ -40,13 +40,17 @@ Note that even though Graph Peak Caller is typically used with vg, it is possibl ### Step 2: Call peaks Finally, we can call peaks by using the *callpeaks* command: ``` -graph_peak_caller callpeaks -g graph.nobg -s alignments.json -f FRAGMENT_LENGTH -r READ_LENGTH +graph_peak_caller callpeaks -g graph.nobg -s alignments.json ``` -Make sure to change *FRAGMENT_LENGTH* and *READ_LENGTH* with numbers matching your data. If you do not know the fragment length of your ChIP-seq experiment, you can use Macs' predictd command to estimate it: `macs predict -t alignments.bam`. - If you do not have a control track, use your sample reads as control and change True to False in the above command. Changing True to False is important as Graph Peak Caller will generate the background signal in a different way when the sample is used as control. +Note that you easily can send a list of graphs and input files to run on multiple graphs. This is useful if you e.g. have one graph and one input file for each chromosome: +``` +graph_peak_caller callpeaks -g graph_chr*.nobg -s alignments_chr*.json +``` +On a unix command line, the * will work as a wildcard. `-g` and `-s` can also take multiple file names separated by space. + ### Output The peak caller will create various output files, some of which are less relevant and only useful for debugging. The two files containing peak information are: * **(base_name)max_paths.intervalcollection**: This file contains the peaks in a JSON format. This file can be read by graph peak caller in a Python script, e.g. by doing: @@ -69,7 +73,7 @@ Change *chromosome* to the chromosome that you want to be used when writing the ## Advanced usage -If you want to do Peak Calling on a whole-genome reference, vg will typically produce one graph for each chromosome, and it is best to divide the peak calling into one process for each chromosome. Check out [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) for a detailed explanation on how that can be done. +If you want to do Peak Calling on a whole-genome reference, vg will typically produce one graph for each chromosome, and it is best to divide the peak calling into one process for each chromosome. You can simply do that by sending multiple graphs and input files to the `callpeaks` command, but if you are running on huge data, it will be faster to start one callpeaks process for each chromosome and run these in parallel. An important thing to keep in mind then, is that each process should only run until p-values have been computed before continuing, since q-values needs to be computed from all p-values. Check out [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) for a detailed explanation on how that can be done. # Reproducing the results from the Graph Peak Caller manuscript. Follow [this guide](https://github.com/uio-bmi/graph_peak_caller/wiki/Reproducing-the-results-in-Graph-Peak-Caller-Paper) in order to run the ChIP-seq experiments presented in the manuscript. diff --git a/__init__.py b/__init__.py old mode 100755 new mode 100644 diff --git a/benchmarking/simple_chip_seq_pipeline.sh b/benchmarking/simple_chip_seq_pipeline.sh index eab1aa3..8e7705d 100755 --- a/benchmarking/simple_chip_seq_pipeline.sh +++ b/benchmarking/simple_chip_seq_pipeline.sh @@ -93,7 +93,8 @@ fi read_length=$(cat macs_output_whole_run.txt | gawk 'match($0, /tag size = ([0-9]+)/, ary) {print ary[1]}' ) echo "Found read length: $read_length" fragment_length=$(cat macs_output_whole_run.txt | gawk 'match($0, /fragment length is ([0-9]+)/, ary) {print ary[1]}' ) -echo "Found fragment length: $fragment_length" + +#echo "Found fragment length: $fragment_length" # Step 3: Map reads @@ -131,6 +132,16 @@ echo "Removing low quality reads." #cp filtered.json filtered_low_qual_reads_removed.json +# Get fragment length + +#if [ ! -f shift_estimation_log.txt ]; then +# graph_peak_caller estimate_shift $chromosomes $graph_dir/ filtered_low_qual_reads_removed_ 3 100 > shift_estimation_log.txt 2>&1 +#else +# echo "Not finding fragment length. Already done." +#fi +#fragment_length=$(cat shift_estimation_log.txt | gawk 'match($0, /Found shift: ([0-9]+)/, ary) {print ary[1]}' ) +#echo "Fragment length is $fragment_length" + # Step 5: Split filtered into chromosomes if ls filtered_low_qual_reads_removed_* 1> /dev/null 2>&1; then @@ -149,15 +160,6 @@ fi unique_reads=$(tail -n 1 count_unique_reads_output.txt) -#if [ ! -f unique_reads.txt ]; then -# echo "Counting unique reads" -# unique_reads=$(pcregrep -o1 '"sequence": "([ACGTNacgtn]{20,})"' filtered_low_qual_reads_removed.json | sort | uniq | wc -l) -# echo $unique_reads > unique_reads.txt -# echo "$unique_reads unique reads in total" -#else -# echo "Using cached count for unique reads." -# unique_reads=$(tail -n 1 unique_reads.txt) -#fi # Step 6 run peak caller to get p-values for every chromosome pids="" @@ -165,15 +167,15 @@ RESULT=0 for chromosome in $(echo $chromosomes | tr "," "\n") do if [ ! -f ${chromosome}_pvalues_values.npy ]; then - graph_peak_caller callpeaks_whole_genome $chromosome \ - -d $graph_dir \ - -s filtered_low_qual_reads_removed_ \ - -n "" \ + graph_peak_caller callpeaks \ + -g $graph_dir/$chromosome.nobg \ + -s filtered_low_qual_reads_removed_$chromosome.json \ + -n ${chromosome}_ \ -f $fragment_length \ -r $read_length \ -p True \ -u $unique_reads \ - -g $genome_size \ + -G $genome_size \ > log_before_p_values_$chromosome.txt 2>&1 & pids="$pids $!" echo "Peak calling (until p-values) for chr $chromosome started as process. Log will be written to $work_dir/log_before_p_values_$chromosome.txt" diff --git a/graph_peak_caller/callpeaks.py b/graph_peak_caller/callpeaks.py index fba3a11..9dd2722 100644 --- a/graph_peak_caller/callpeaks.py +++ b/graph_peak_caller/callpeaks.py @@ -10,6 +10,7 @@ import json import sys + class Configuration: def __init__(self): self.read_length = None @@ -47,6 +48,7 @@ def run_pre_callpeaks(self, input_reads, control_reads): sample_pileup = get_fragment_pileup( self.graph, input_reads, self.config, self._reporter) + background_func = get_background_track_from_control except json.decoder.JSONDecodeError as e: logging.debug(e) @@ -155,8 +157,6 @@ def __postprocess(self): self.touched_nodes ).run() self._reporter.add("hole_cleaned", self.pre_processed_peaks) - - logging.info("Not removing small peaks") self.filtered_peaks = self.pre_processed_peaks def __get_max_paths(self): diff --git a/graph_peak_caller/callpeaks_interface.py b/graph_peak_caller/callpeaks_interface.py index da2993b..730792e 100644 --- a/graph_peak_caller/callpeaks_interface.py +++ b/graph_peak_caller/callpeaks_interface.py @@ -1,5 +1,6 @@ import logging import os +import numpy as np import offsetbasedgraph as obg from pyvg.conversion import vg_json_file_to_interval_collection @@ -10,9 +11,20 @@ from .peakfasta import PeakFasta from .reporter import Reporter from .intervals import UniqueIntervals +from .shiftestimation import MultiGraphShiftEstimator import sys +def estimate_read_length(file_name, graph_name): + graph = obg.Graph.from_file(graph_name) + if file_name.endswith(".intervalcollection"): + intervals = obg.IntervalCollection.create_generator_from_file( + file_name, graph=graph) + else: + intervals = vg_json_file_to_interval_collection(file_name, graph) + return int(np.median([i.length() for i in intervals])) + + def get_confiugration(args): config = Configuration() config.has_control = args.control is not None @@ -32,8 +44,6 @@ def get_intervals(args): samples = iclass(parse_input_file(args.sample, args.graph)) control_name = args.sample if args.control is None else args.control controls = iclass(parse_input_file(control_name, args.graph)) - - return samples, controls @@ -46,6 +56,7 @@ def get_callpeaks(args): def parse_input_file(input, graph): + logging.info("Parsing input file %s" % input) try: if isinstance(input, obg.IntervalCollection): return input @@ -65,6 +76,7 @@ def parse_input_file(input, graph): logging.critical("Input file %s not found. Aborting. " % input) sys.exit(1) + def run_callpeaks_interface(args): caller = get_callpeaks(args) inputs, controls = get_intervals(args) @@ -90,6 +102,137 @@ def find_or_create_linear_map(graph, linear_map_name): logging.info("Done creating linear map") +def _get_file_names(pattern): + from glob import glob + + if pattern is None: + return None + + assert isinstance(pattern, list) + pattern = sorted(pattern) + + if isinstance(pattern, list): + if len(pattern) > 1: + return pattern + else: + if "*" in pattern[0]: + return sorted(glob(pattern[0])) + + return sorted(pattern) + + +def run_callpeaks2(args): + graphs = _get_file_names(args.graph) + samples = _get_file_names(args.sample) + controls = _get_file_names(args.control) + if controls is None: + controls = samples + + if len(samples) != len(graphs): + logging.critical("""Number of sample files must be equal to number of graph files. + Found %d graphs and %d sample files. + Use --verbose 2 for more details.""" % (len(graphs), len(samples))) + sys.exit(1) + + logging.debug("Fragment length input: %s" % args.fragment_length) + logging.info("Sample files: %s" % samples) + logging.debug("Control files: %s" % samples) + logging.debug("Graph files: %s" % graphs) + + logging.info("Using graphs: %s " % graphs) + sequence_graph_file_names = [fn + ".sequences" for fn in graphs] + logging.info("Will use sequence graphs. %s" % sequence_graph_file_names) + sequence_retrievers = (obg.SequenceGraph.from_file(fn) + for fn in sequence_graph_file_names) + + data_dir = os.path.dirname(graphs[0]) + if data_dir == "": + data_dir = "./" + + logging.info("Using graphs from data directory %s" % data_dir) + if len(graphs) > 1: + # Make custom names for different runs + print([fn.split(data_dir, 1)[-1] for fn in graphs]) + names = [fn.split(data_dir, 1)[-1].rsplit(".", 1)[0] for fn in graphs] + else: + names = [""] + + logging.info("Will use %s as extra experiments names for each run, based on graph file names." + "If only running on single graph, this should be empty. " % names) + + linear_map_file_names = [] + for i, graph_file_name in enumerate(graphs): + linear_map_name = graph_file_name.split(".nobg")[0] + "_linear_map.npz" + if not os.path.isfile(linear_map_name): + logging.warning("Did not find linear map for " + " for graph %s. Will create." % graph_file_name) + graph = obg.Graph.from_file(graphs[i]) + create_linear_map(graph, linear_map_name) + else: + logging.info( + "Found linear map %s that will be used." % linear_map_name) + linear_map_file_names.append(linear_map_name) + + logging.info("Will use linear maps: %s" % linear_map_file_names) + + config = Configuration() + if args.keep_duplicates == "True": + config.keep_duplicates = True + logging.info("Keeping duplicates") + + if args.fragment_length is None: + logging.info("Fragment length was not specified. Will now" + " predict fragment length.") + + min_m = 5 if args.min_fold_enrichment is None else int(args.min_fold_enrichment) + max_m = 50 if args.max_fold_enrichment is None else int(args.max_fold_enrichment) + config.fragment_length = int(MultiGraphShiftEstimator.from_files( + graphs, samples, min_m, max_m).get_estimates()) + logging.info("Estimated fragment length to be %d" % config.fragment_length) + assert config.fragment_length < 1000, "Suspiciously high fragment length. Probably a bad estimate." \ + " Change -m/-M to try to get more paired peaks." + else: + config.fragment_length = int(args.fragment_length) + if args.read_length is None: + config.read_length = estimate_read_length(samples[0], graphs[0]) + logging.info("Estimated read length to %s" % config.read_length) + else: + config.read_length = int(args.read_length) + + if config.fragment_length < config.read_length: + logging.critical("Fragment length is smaller than read length. Cannot call peaks.") + sys.exit(1) + + if args.genome_size is not None: + genome_size = int(args.genome_size) + config.min_background = int(args.unique_reads) * int(args.fragment_length) / genome_size + + logging.info( + "Computed min background signal to be %.3f using fragment length %f, " + " %d unique reads, and genome size %d" % (config.min_background, + config.fragment_length, + int(args.unique_reads), + int(genome_size))) + else: + logging.info("Not using min background.") + config.min_background = None + + out_name = args.out_name if args.out_name is not None else "" + reporter = Reporter(out_name) + config.has_control = args.control is not None + caller = MultipleGraphsCallpeaks( + names, + graphs, + samples, + controls, + linear_map_file_names, + config, reporter, + sequence_retrievers=sequence_retrievers, + stop_after_p_values=args.stop_after_p_values == "True", + ) + caller.run() + + def run_callpeaks_whole_genome(args): logging.info("Running run_callpeaks_whole_genome") @@ -109,7 +252,7 @@ def run_callpeaks_whole_genome(args): if not os.path.isfile(linear_map_name): logging.warning("Did not find linear map for " "chromosome %s. Will create." % chrom) - graph = obg.Graph.from_numpy_file(graph_file_names[i]) + graph = obg.Graph.from_file(graph_file_names[i]) create_linear_map(graph, linear_map_name) else: logging.info("Found linear map %s that will be used." % linear_map_name) diff --git a/graph_peak_caller/command_line_interface.py b/graph_peak_caller/command_line_interface.py index fa62bac..649ca93 100644 --- a/graph_peak_caller/command_line_interface.py +++ b/graph_peak_caller/command_line_interface.py @@ -16,7 +16,7 @@ from graph_peak_caller.callpeaks_interface import \ run_callpeaks_interface, run_callpeaks_whole_genome,\ - run_callpeaks_whole_genome_from_p_values + run_callpeaks_whole_genome_from_p_values, run_callpeaks2 from graph_peak_caller.analysis.analysis_interface import analyse_peaks_whole_genome,\ analyse_peaks, peaks_to_fasta, linear_peaks_to_fasta,\ @@ -40,54 +40,43 @@ def version(args): interface = \ { + 'callpeaks': { - 'help': 'Call peaks on a single graph.', - 'requires_graph': True, - 'arguments': - [ - ('-m/--linear_map', "Optional. Linear map file name. Will look for '" - " files matching graph name if not set or create if" - " no file is found."), - ('-s/--sample', 'File name to a vg JSON file or intervalcollection file.'), - ('-c/--control', '(Optional) File name to a vg JSON file or intervalcollection file. ' - 'Only include if a separate control is used.'), - ('-n/--out_name', 'Optional. Will be prepended to all output files. Default is nothing.'), - ('-f/--fragment_length', 'The fragment length used in this ChIP-seq experiment. If unknown, set to an ' - 'arbitrary number, e.g. 200. However, for good results, this number should be accurate.'), - ('-r/--read_length', 'The read length.'), - ('-q/--q_value_threshold', 'Optional. Q value threshold. Default 0.05.') - ], - 'method': run_callpeaks_interface - }, - 'callpeaks_whole_genome': - { - 'help': 'Callpeaks on whole genome, using one graph for each chromosome.', + 'help': 'Call peaks one or multiple graphs.', 'arguments': [ - ('chromosomes', 'Comma-separated list of chromosomes to use, e.g. 1,2,X,8,Y'), - ('-d/--data_dir', 'Path to data directory containing ' - 'ob graphs and linear maps.'), - ('-s/--sample', 'Sample reads base name. Will use ' - 'files *_[chromosome].json where * is the base name'), - ('-c/--control', 'Optional. Control reads base ame. Will use ' - 'files *_[chromosome].json where * is the base name'), - - ('-f/--fragment_length', 'Fragment length.'), - ('-r/--read_length', 'Read length'), - ('-p/--stop_after_p_values', 'Optional. True or False (default). Whether to ' - 'only run until p-value track is ' - 'computed (before peak calling)'), - ('-u/--unique_reads', 'Number of unique reads. ' - 'Found by calling count_unique_reads'), - ('-g/--genome_size', 'Number of base pairs covered by ' - 'graphs in total (on a linear genome)'), + ('-g/--graph', 'Graph(s) file name. Either a single file name, e.g. graph.nobg or a ' + 'pattern with the * wildcard, such as graph_*.nobg. In the latter case' + ', * will be replaced by the names specified in --chromosomes.'), + ('-s/--sample', 'Sample alignments. Use wildcard * to specify multiple files'), + ('-c/--control', 'Optional. Use wildcard * to specify multiple files.' + ' Use only this option if you have separate control alignments'), + ('-f/--fragment_length', 'Optional. Specify if you know the exact fragment length. ' + 'Will be estimated if not specified.'), + ('-r/--read_length', 'Optional. '), + ('-p/--stop_after_p_values', 'Optional. Default False. Set to True in order to' + 'stop the whole run after p-values has been computed.' + ' Useful if wanting to run multiple chromosomes in parallell, and' + 'waiting before continuing from p-values.'), ('-n/--out_name', 'Optional. Out base name. Prepended to output files.'), + ('-u/--unique_reads', 'Optional. Number of unique reads. ' + 'Found by calling count_unique_reads'), + ('-G/--genome_size', 'Optional. Number of base pairs covered by ' + 'graphs in total (on a linear genome). ' + 'If not set, will be estimated if running on single graph. ' + 'Must be set if running on multiple graphs.'), ('-D/--keep_duplicates', 'Optional. Set to True in order to keep ' - 'duplicate input alignments.') + 'duplicate input alignments.'), + ('-m/--min_fold_enrichment', 'Optional. Minimum fold enrichment required for ' + 'candidate peaks when estimating fragment length. Default 5.'), + ('-M /--max_fold_enrichment', 'Optional. Maximum fold enrichment required for ' + 'candidate peaks when estimating fragment length. Default 50.'), + ], - 'method': run_callpeaks_whole_genome + 'method': run_callpeaks2, }, + 'callpeaks_whole_genome_from_p_values': { 'help': 'Callpeaks on whole genome from pvalues.', @@ -300,6 +289,10 @@ def version(args): 'list of chromosomes to use, e.g. 1,2,X,8,Y'), ('ob_graphs_location', 'Location of graph files'), ('sample_reads_base_name', 'Will use files [base_name][chromosome].json'), + ('min_fold_enrichment', 'Optional. Minimum fold enrichment when ' + 'finding candidates. Default is 5.'), + ('max_fold_enrichment', 'Optional. Maximum fold enrichment when ' + 'finding candidates. Default is 50.') ], 'method': shift_estimation }, @@ -381,9 +374,13 @@ def run_argument_parser(args): if "Optional" in help or "optional" in help: required = False + nargs = None + if "wildcard" in help: + nargs = "+" + subparser.add_argument(short_command, long_command, dest=long_command.replace("--", ""), help=help, - required=required) + required=required, nargs=nargs) else: subparser.add_argument(argument, help=help) subparser.set_defaults(func=interface[command]["method"]) diff --git a/graph_peak_caller/haplotyping.py b/graph_peak_caller/haplotyping.py index 591c9fc..e30c579 100644 --- a/graph_peak_caller/haplotyping.py +++ b/graph_peak_caller/haplotyping.py @@ -3,6 +3,7 @@ from offsetbasedgraph import IndexedInterval import logging + class HaploTyper: """ Simple naive haplotyper @@ -42,8 +43,6 @@ def get_maximum_interval_through_graph(self): next_blocks = graph.adj_list[current_block] if len(next_blocks) < 1: print("Stopping at %d" % current_block) - print(nodes) - break next = None diff --git a/graph_peak_caller/multiplegraphscallpeaks.py b/graph_peak_caller/multiplegraphscallpeaks.py index 93531ee..a3076c3 100644 --- a/graph_peak_caller/multiplegraphscallpeaks.py +++ b/graph_peak_caller/multiplegraphscallpeaks.py @@ -9,6 +9,7 @@ from .peakfasta import PeakFasta + class MultipleGraphsCallpeaks: def __init__(self, graph_names, graph_file_names, @@ -50,7 +51,7 @@ def count_number_of_unique_reads(cls, sample_reads): n_unique += 1 interval_hashes.add(hash) - print("Found %d duplicates" % n_duplicates) + logging.info("Found %d duplicates" % n_duplicates) logging.info("In total %d unique reads" % n_unique) return n_unique @@ -73,10 +74,18 @@ def get_intervals(self, sample, control, graph): logging.info("Sample is already intervalcollection.") return sample, control elif sample.endswith(".intervalcollection"): - sample = obg.IntervalCollection.create_generator_from_file( - sample, graph=graph) - control = obg.IntervalCollection.create_generator_from_file( - control, graph=graph) + try: + sample = obg.IntervalCollection.from_file( + sample, graph=graph) + except OSError: + sample = obg.IntervalCollection.from_file( + sample, graph=graph, text_file=True) + try: + control = obg.IntervalCollection.from_file( + control, graph=graph) + except: + control = obg.IntervalCollection.from_file( + control, graph=graph, text_file=True) else: logging.info("Creating interval collections from files") sample = vg_json_file_to_interval_collection(sample, graph) @@ -92,8 +101,8 @@ def run_to_p_values(self): for name, graph_file_name, sample, control, lin_map in \ zip(self.names, self.graph_file_names, self.samples, self.controls, self.linear_maps): - logging.info("Running %s" % name) - ob_graph = obg.Graph.from_numpy_file( + logging.info("Running to p values, %s" % name) + ob_graph = obg.Graph.from_file( graph_file_name) sample, control = self.get_intervals(sample, control, ob_graph) config = self._config.copy() @@ -110,6 +119,7 @@ def create_joined_q_value_mapping(self): def run_from_p_values(self, only_chromosome=None): for i, name in enumerate(self.names): + logging.info("Name: %s" % name) if only_chromosome is not None: if only_chromosome != name: logging.info("Skipping %s" % str(name)) @@ -121,10 +131,12 @@ def run_from_p_values(self, only_chromosome=None): caller = CallPeaks(ob_graph, self._config, self._reporter.get_sub_reporter(name)) caller.p_to_q_values_mapping = self._q_value_mapping + if name != "": + name += "_" caller.p_values_pileup = SparseValues.from_sparse_files( - self._reporter._base_name + name + "_" + "pvalues") + self._reporter._base_name + name + "pvalues") caller.touched_nodes = set(np.load( - self._reporter._base_name + name + "_" + "touched_nodes.npy")) + self._reporter._base_name + name + "touched_nodes.npy")) caller.get_q_values() caller.call_peaks_from_q_values() if self.sequence_retrievers is not None: @@ -135,6 +147,6 @@ def run_from_p_values(self, only_chromosome=None): continue PeakFasta(sequencegraph).write_max_path_sequences( - self._reporter._base_name + name + "_sequences.fasta", caller.max_path_peaks) + self._reporter._base_name + name + "sequences.fasta", caller.max_path_peaks) #caller.save_max_path_sequences_to_fasta_file( # "sequences.fasta", self.sequence_retrievers.__next__()) diff --git a/graph_peak_caller/preprocess_interface.py b/graph_peak_caller/preprocess_interface.py index 276f79d..93b063d 100644 --- a/graph_peak_caller/preprocess_interface.py +++ b/graph_peak_caller/preprocess_interface.py @@ -2,7 +2,8 @@ from pyvg.conversion import vg_json_file_to_interval_collection,\ json_file_to_obg_numpy_graph import offsetbasedgraph as obg -from graph_peak_caller.shiftestimation.shift_estimation_multigraph import MultiGraphShiftEstimator +from graph_peak_caller.shiftestimation.shift_estimation_multigraph \ + import MultiGraphShiftEstimator from graph_peak_caller.util import create_linear_map from graph_peak_caller.multiplegraphscallpeaks import MultipleGraphsCallpeaks @@ -16,23 +17,27 @@ def count_unique_reads_interface(args): count_unique_reads(chromosomes, graph_file_names, reads_file_names) + def shift_estimation(args): chromosomes = args.chromosomes.split(",") - graphs = [args.ob_graphs_location + "/" + chrom + ".nobg" for chrom in chromosomes] - print(graphs) + graphs = [args.ob_graphs_location + "/" + chrom + ".nobg" + for chrom in chromosomes] logging.info("Will try to use graphs %s" % graphs) sample_file_names = [args.sample_reads_base_name + chrom + ".json" for chrom in chromosomes] logging.info("Will use reads from %s" % sample_file_names) - estimator = MultiGraphShiftEstimator.from_files( - chromosomes, graphs, sample_file_names) + min_m = 5 if args.min_fold_enrichment is None else int(args.min_fold_enrichment) + max_m = 50 if args.max_fold_enrichment is None else int(args.max_fold_enrichment) - estimator.to_linear_bed_file("linear_bed.bed", read_length=36) + estimator = MultiGraphShiftEstimator.from_files(graphs, sample_file_names, + min_fold_enrichment=min_m, + max_fold_enrichment=max_m) d = estimator.get_estimates() logging.info("Found shift: %d" % d) + def count_unique_reads(chromosomes, graph_file_names, reads_file_names): graphs = (obg.GraphWithReversals.from_numpy_file(f) for f in graph_file_names) diff --git a/graph_peak_caller/reporter.py b/graph_peak_caller/reporter.py index d05e93e..24ddbdb 100644 --- a/graph_peak_caller/reporter.py +++ b/graph_peak_caller/reporter.py @@ -59,4 +59,7 @@ def add(self, name, data): logging.info("Skipping reporting of %s", name) def get_sub_reporter(self, name): - return self.__class__(self._base_name + name + "_") + if name != "": + name += "_" + + return self.__class__(self._base_name + name) diff --git a/graph_peak_caller/sample/__init__.py b/graph_peak_caller/sample/__init__.py index 2279c28..77c267a 100644 --- a/graph_peak_caller/sample/__init__.py +++ b/graph_peak_caller/sample/__init__.py @@ -3,6 +3,7 @@ def get_fragment_pileup(graph, input_intervals, info, reporter=None): - logging.info("Creating fragment pileup") + logging.info("Creating fragment pileup, using fragment length %d " + "and read length %d" % (info.fragment_length, info.read_length)) spg = SamplePileupGenerator(graph, info.fragment_length-info.read_length) return spg.run(input_intervals, reporter) diff --git a/graph_peak_caller/sample/sparsegraphpileup.py b/graph_peak_caller/sample/sparsegraphpileup.py index 7d474bf..3da0120 100644 --- a/graph_peak_caller/sample/sparsegraphpileup.py +++ b/graph_peak_caller/sample/sparsegraphpileup.py @@ -36,6 +36,7 @@ def __str__(self): class ReadsAdder: def __init__(self, graph, pileup, keep_ends=False): + self._graph = graph self.min_id = graph.min_node self._node_indexes = graph.node_indexes self._pileup = pileup @@ -47,12 +48,17 @@ def _handle_interval(self, interval): # self._pileup.add_interval(interval) end_pos = interval.end_position rp = end_pos.region_path_id - if rp < 0: - self.add_open_neg_interval(interval) - self._neg_ends[rp].append(end_pos.offset) - else: - self.add_open_pos_interval(interval) - self._pos_ends[rp].append(end_pos.offset) + try: + if rp < 0: + self.add_open_neg_interval(interval) + self._neg_ends[rp].append(end_pos.offset) + else: + self.add_open_pos_interval(interval) + self._pos_ends[rp].append(end_pos.offset) + except KeyError: + logging.critical("Error with read/interval %s" % interval) + print(self._graph) + raise def get_pos_ends(self): return self._pos_ends @@ -207,6 +213,7 @@ def _add_end(self, index): class SamplePileupGenerator: def __init__(self, graph, extension): + logging.info("Using extension %d when extending reads. " % extension) if extension < 0: raise Exception("Invalid extension size %d used. Must be positive. Is fragment length < read length?" % extension) self._pileup = SparseGraphPileup(graph) diff --git a/graph_peak_caller/shiftestimation/__init__.py b/graph_peak_caller/shiftestimation/__init__.py new file mode 100644 index 0000000..163c2f1 --- /dev/null +++ b/graph_peak_caller/shiftestimation/__init__.py @@ -0,0 +1,2 @@ +from .shiftestimation import * +from .shift_estimation_multigraph import MultiGraphShiftEstimator diff --git a/graph_peak_caller/shiftestimation/shift_estimation_multigraph.py b/graph_peak_caller/shiftestimation/shift_estimation_multigraph.py index 80fc2a8..46631fe 100644 --- a/graph_peak_caller/shiftestimation/shift_estimation_multigraph.py +++ b/graph_peak_caller/shiftestimation/shift_estimation_multigraph.py @@ -4,14 +4,65 @@ class MultiGraphShiftEstimator(object): - def __init__(self, position_tuples, genome_size): + def __init__(self, position_tuples, genome_size, + min_fold_enrichment=5, max_fold_enrichment=50): self._position_tuples = position_tuples self.genome_size = genome_size + self.min_fold_enrichment = min_fold_enrichment + self.max_fold_enrichment = max_fold_enrichment def get_estimates(self): - opt = Opt() + opt = Opt(self.min_fold_enrichment, self.max_fold_enrichment) opt.gsize = self.genome_size treatment = Treatment(self._position_tuples) peakmodel = PeakModel(opt, treatment) peakmodel.build() return round(peakmodel.d) + + @classmethod + def from_files(cls, graph_file_names, interval_json_file_names, min_fold_enrichment=5, + max_fold_enrichment=50): + start_positions = { + "+": {str(i): [] for i, _ in enumerate(graph_file_names)}, + "-": {str(i): [] for i, _ in enumerate(graph_file_names)} + } + genome_size = 0 + i = 0 + for graph, intervals in zip(graph_file_names, + interval_json_file_names): + linear_filter = LinearFilter.from_vg_json_reads_and_graph( + intervals, graph) + + positions = linear_filter.find_start_positions() + start_positions["+"][str(i)] = positions["+"] + start_positions["-"][str(i)] = positions["-"] + i += 1 + + genome_size += linear_filter._indexed_interval.length() + + logging.info("Using genome size %d" % genome_size) + return cls(start_positions, genome_size, min_fold_enrichment, max_fold_enrichment) + + @classmethod + def _from_files(cls, graph_file_names, interval_json_file_names): + start_positions = { + "+": {name: [] for name in chrom_names}, + "-": {name: [] for name in chrom_names} + } + genome_size = 0 + + for name, graph, intervals in zip(chrom_names, + graph_file_names, + interval_json_file_names): + + linear_filter = LinearFilter.from_vg_json_reads_and_graph( + intervals, graph) + + positions = linear_filter.find_start_positions() + start_positions["+"][name] = positions["+"] + start_positions["-"][name] = positions["-"] + + genome_size += linear_filter._indexed_interval.length() + + logging.info("Using genome size %d" % genome_size) + return cls(start_positions, genome_size) diff --git a/graph_peak_caller/shiftestimation/shiftestimation.py b/graph_peak_caller/shiftestimation/shiftestimation.py index 86bf79c..13b6d4b 100644 --- a/graph_peak_caller/shiftestimation/shiftestimation.py +++ b/graph_peak_caller/shiftestimation/shiftestimation.py @@ -14,6 +14,7 @@ import numpy as np import logging + class NotEnoughPairsException(Exception): def __init__(self, value): self.value = value @@ -49,9 +50,11 @@ def get_locations_by_chr(self, chrom): class Opt: - def __init__(self): - self.umfold = 50 - self.lmfold = 5 + def __init__(self, lmfold=5, umfold=50): + logging.info("Initing shift estimator options with" + " fold enrichment min/max %d/%d. " % (lmfold, umfold)) + self.umfold = umfold + self.lmfold = lmfold self.bw = 300 self.info = logging.info self.debug = logging.debug @@ -106,9 +109,13 @@ def build(self): self.info("#2 number of paired peaks: %d" % (num_paired_peakpos)) if num_paired_peakpos < 100: logging.warning("Too few paired peaks (%d) to get good shift estimate" % num_paired_peakpos) - raise NotEnoughPairsException("No enough pairs to build model") + raise NotEnoughPairsException("Not enough pairs to build model. " + "Try a broader range by setting -m lower and/or -M higher.") elif num_paired_peakpos < self.max_pairnum: - logging.warning("Few paired peaks (%d) than %d! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use %d pairs to build model!" % (num_paired_peakpos, self.max_pairnum,num_paired_peakpos_picked)) + logging.warning("Few paired peaks (%d) when estimating fragment length, which may lead" + "to bad estimate of fragment length. Set min " + "fold enrichment (-m lower) or max (-M) higher in " + "order get more paired peaks." % (num_paired_peakpos)) self.__paired_peak_model(paired_peakpos) def __paired_peak_model(self, paired_peakpos): @@ -324,13 +331,11 @@ def __naive_peak_pos(self, pos_list, plus_strand): top_pos = self.__find_top_pos(horizon_line, peak_length) return (top_pos[len(top_pos)//2]+start) - def __find_top_pos(self, horizon_line, peak_length): m = np.max(horizon_line) return np.where(horizon_line == m)[0] - # smooth function from SciPy cookbook: http://www.scipy.org/Cookbook/SignalSmooth def smooth(x, window_len=11, window='hanning'): """smooth the data using a window with requested size. diff --git a/graph_peak_caller/sparsediffs.py b/graph_peak_caller/sparsediffs.py index 73e294b..5178aac 100644 --- a/graph_peak_caller/sparsediffs.py +++ b/graph_peak_caller/sparsediffs.py @@ -103,6 +103,8 @@ def to_sparse_files(self, file_base_name): np.save(file_base_name + "_indexes.npy", self._indices) np.save(file_base_name + "_values.npy", self._diffs) + logging.info("Wrote to %s and %s" % (file_base_name + "_indexes.npy", + file_base_name + "_values.npy")) def to_bed_graph(self, filename): logging.warning("Not writing to %s", filename) diff --git a/graph_peak_caller/sparsepvalues.py b/graph_peak_caller/sparsepvalues.py index bc856f7..3e39969 100644 --- a/graph_peak_caller/sparsepvalues.py +++ b/graph_peak_caller/sparsepvalues.py @@ -1,3 +1,4 @@ +from glob import glob import pickle import numpy as np import os @@ -76,9 +77,7 @@ def __get_sub_counts(cls, sparse_values): def from_files(cls, base_file_name): search = base_file_name logging.info("Searching for files starting with %s" % search) - files = (f for f in os.listdir() - if f.startswith(search) and - "pvalues" in f and f.endswith("_indexes.npy")) + files = glob(base_file_name + "*pvalues_indexes.npy") sub_counts = [] p_values = [] for filename in files: diff --git a/tests/arabidopsis_integration_test/run.sh b/tests/arabidopsis_integration_test/run.sh index 5742746..a9191b2 100755 --- a/tests/arabidopsis_integration_test/run.sh +++ b/tests/arabidopsis_integration_test/run.sh @@ -3,11 +3,12 @@ graph_dir="graphs/" for chromosome in $(seq 1 2) do - graph_peak_caller callpeaks_whole_genome $chromosome \ - -d $graph_dir -s arabidopsis_sample/sample_ -n "" -f 230 -r 50 \ + graph_peak_caller callpeaks \ + -g $graph_dir/$chromosome.nobg -s arabidopsis_sample/sample_$chromosome.json -f 230 -r 50 \ -p True \ -u 250000 \ - -g 135000000 & + -G 135000000 \ + -n ${chromosome}_ & done wait @@ -19,5 +20,5 @@ do done wait -run + diff --git a/tests/test_multiplegraphscallpeaks.py b/tests/test_multiplegraphscallpeaks.py index 8bcae9b..a1a5927 100644 --- a/tests/test_multiplegraphscallpeaks.py +++ b/tests/test_multiplegraphscallpeaks.py @@ -63,7 +63,7 @@ def _create_data(self): node_offset += 3 graph.convert_to_numpy_backend() SequenceGraph.create_empty_from_ob_graph(graph).to_file(chromosome + ".nobg.sequences") - graph.to_numpy_file(chromosome + ".nobg") + graph.to_file(chromosome + ".nobg") def _create_reads(self, chrom_number, chrom, graph): i = chrom_number @@ -100,6 +100,7 @@ def test_run_from_init(self): def test_run_from_init_in_two_steps(self): + set_logging_config(2) caller = MultipleGraphsCallpeaks( self.chromosomes, [chrom + ".nobg" for chrom in self.chromosomes], @@ -145,13 +146,13 @@ def _create_reads(self, *args): def test_typical_run(self): print(" ========= Running start ====") - run_argument_parser(["callpeaks_whole_genome", ','.join(self.chromosomes), - "-d", "./", - "-s", "test_sample_chrom.intervalcollection", + run_argument_parser(["callpeaks", + "-g", "*.nobg", + "-s", "test_sample_*.intervalcollection", "-f", "%s" % self.fragment_length, "-r", "%s" % self.read_length, "-u", "100", - "-g", "150", + "-G", "150", "-n", "multigraphs_", "-p", "True", "-D", "True"])