This folder contains the functions used to evaluate the results of our model and of the benchmarking tools, including TAD size calculation, statistical measures such as the Jaccard index, and Venn diagrams. The scripts that run the benchmarking tools are described in detail in the folder ./tools_benchmarking and the associated README.
The folder structure is shown below. The main scripts, which can be run independently, are marked. After the folder structure, the purpose of each main script is described together with how to call it.
├── tad_detection
│   ├── evaluation
│   │   ├── tad_dicts
│   │   ├── evaluate.py
│   │   ├── parameters.json
│   │   └── utils_evaluate.py
The folder tad_dicts contains dictionaries with the TAD region information for different methods, cell lines, and resolutions.
The functions in utils_evaluate.py are helper functions that calculate the Jaccard index and further statistics, generate Venn diagrams, and calculate the TAD region size.
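To make the two core statistics concrete, here is a minimal sketch of a Jaccard index and a TAD size calculation. It is not the code in this folder: the function names and the representation of a TAD as a `(start_bin, end_bin)` pair with an inclusive end are assumptions made for illustration only.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical representation: a TAD is a (start_bin, end_bin) pair, end inclusive.
Tad = Tuple[int, int]


def tad_bins(tads: Iterable[Tad]) -> Set[int]:
    """Return the set of all bin indices covered by the given TADs."""
    bins: Set[int] = set()
    for start, end in tads:
        bins.update(range(start, end + 1))
    return bins


def jaccard_index(tads_a: Iterable[Tad], tads_b: Iterable[Tad]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over the bins covered by two TAD sets."""
    bins_a, bins_b = tad_bins(tads_a), tad_bins(tads_b)
    union = bins_a | bins_b
    if not union:
        # Both methods predicted no TADs; define the index as 0.0 here.
        return 0.0
    return len(bins_a & bins_b) / len(union)


def tad_sizes(tads: Iterable[Tad], scaling_factor: int) -> List[int]:
    """TAD sizes in base pairs, given the bin resolution (scaling_factor)."""
    return [(end - start + 1) * scaling_factor for start, end in tads]


# Made-up example regions, not real predictions.
arrowhead = [(0, 4), (10, 14)]
topdom = [(2, 6), (10, 12)]

print(jaccard_index(arrowhead, topdom))             # 0.5
print(tad_sizes(arrowhead, scaling_factor=100000))  # [500000, 500000]
```

Comparing covered bins rather than exact boundaries is only one possible choice; a boundary-based comparison would give different values for the same predictions.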
usage: evaluate.py [-h] --path_parameters_json PATH_PARAMETERS_JSON
Run evaluation pipeline on multiple experiments and comparing results by
calculating statistics (Size TADs, Count TADs, Jaccard index etc.), creating
Venn-Diagrams and Hi-C maps with TAD regions.
optional arguments:
-h, --help show this help message and exit
--path_parameters_json PATH_PARAMETERS_JSON
path to JSON with parameters.
The results of this script are written to the output directory set by output_directory in parameters.json.
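The command-line interface shown in the help text above can be reproduced with a few lines of `argparse`. This is a sketch based only on that help text, not the actual parser in evaluate.py; the function name `build_parser` and the example path are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the usage text of evaluate.py (sketch only)."""
    parser = argparse.ArgumentParser(
        prog="evaluate.py",
        description="Run evaluation pipeline on multiple experiments.",
    )
    # The usage text lists this option without brackets, so it is required.
    parser.add_argument(
        "--path_parameters_json",
        type=str,
        required=True,
        help="path to JSON with parameters.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
# The path is an assumed location derived from the folder structure.
args = build_parser().parse_args(
    ["--path_parameters_json", "./tad_detection/evaluation/parameters.json"]
)
print(args.path_parameters_json)
```

On the command line, the equivalent call passes the same option to evaluate.py, followed by the path to the parameters file.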
The parameters.json file contains the parameters used by the different functions in this folder. The variables that can be set in it are described below:
parameters.json
variables:
cell_line: cell line for which the evaluation script is run (this also determines which data is used as input)
"GM12878"
chromosomes: chromosomes for which the dataset in preprocessing.py, the node annotations in node_annotations.py, or the chromosome length dict in chr_len_dict.py should be created
"all", ["1", "2", ...]
dataset_name: name of dataset
"gm12878_no_filter_no_binary_graph_conv_supervised"
node_feature_encoding: genomic annotations used in dataset creation
["CTCF", "RAD21", "SMC3", "Number_HousekeepingGenes"]
output_directory: output directory for the statistics and plots generated by evaluate.py, for example:
"./output/"
paths_predicted_tads_per_tad_prediction_methods: paths of dictionaries with predicted TADs
["./tad_detection/evaluation/results/Arrowhead_GM12878_100kb_dict.p",
"./tad_detection/evaluation/results/TopDom_GM12878_100kb_dict.p"]
path_topdom_bed: path to a bed file
"./tad_detection/evaluation/tools_benchmarking/TopDomTests/bin/bintest.bed"
scaling_factor: resolution of Hi-C adjacency matrix
25000, 100000
tad_prediction_methods: list of strings naming the TAD prediction methods to evaluate
["Arrowhead","TopDom"]
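Putting the variables together, a complete parameters.json could look like the dictionary in the sketch below, which uses only the example values listed above. The `load_parameters` helper and its consistency check (one dictionary path per TAD prediction method) are assumptions for illustration and are not taken from evaluate.py.

```python
import json
import os
import tempfile

# Example parameters assembled from the values documented above.
parameters = {
    "cell_line": "GM12878",
    "chromosomes": "all",
    "dataset_name": "gm12878_no_filter_no_binary_graph_conv_supervised",
    "node_feature_encoding": ["CTCF", "RAD21", "SMC3", "Number_HousekeepingGenes"],
    "output_directory": "./output/",
    "paths_predicted_tads_per_tad_prediction_methods": [
        "./tad_detection/evaluation/results/Arrowhead_GM12878_100kb_dict.p",
        "./tad_detection/evaluation/results/TopDom_GM12878_100kb_dict.p",
    ],
    "path_topdom_bed": "./tad_detection/evaluation/tools_benchmarking/TopDomTests/bin/bintest.bed",
    "scaling_factor": 100000,
    "tad_prediction_methods": ["Arrowhead", "TopDom"],
}


def load_parameters(path: str) -> dict:
    """Load a parameters JSON file and run one basic consistency check."""
    with open(path) as handle:
        params = json.load(handle)
    # Assumed rule: each prediction method has exactly one dictionary path.
    n_paths = len(params["paths_predicted_tads_per_tad_prediction_methods"])
    n_methods = len(params["tad_prediction_methods"])
    if n_paths != n_methods:
        raise ValueError(f"{n_paths} paths given for {n_methods} methods")
    return params


# Write the example to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "parameters.json")
    with open(json_path, "w") as handle:
        json.dump(parameters, handle, indent=2)
    loaded = load_parameters(json_path)

print(loaded["tad_prediction_methods"])
```

Keeping the method names and dictionary paths in the same order makes it easy to pair each prediction method with its results, which is why the sketch checks that both lists have the same length.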