XGB-TFBSContext contains the code and notebooks for "An Ensemble Approach to Elucidating Transcription Factor Binding Specificity and Occupancy". The repository includes the following files and folders.
The main dependencies are:
- numpy
- scipy
- pybedtools
- pysam
- pybigwig
- pandas
- scikit-learn
- xgboost
- seaborn
All of these can be installed into a conda environment using:
# Create a conda environment
conda env create -f environment.yml
# Activate the environment
source activate <env name>
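After activating the environment, a quick check that the dependencies resolve can save time before opening the notebooks. The snippet below is a minimal sketch and not part of the repository: `check_modules` and `REQUIRED_MODULES` are hypothetical names, and the import names are assumptions based on the dependency list above (scikit-learn is imported as `sklearn`, pybigwig as `pyBigWig`).

```python
from importlib.util import find_spec

# Import names assumed for the dependencies listed above. These are the
# names used in "import ..." statements, which can differ from package names.
REQUIRED_MODULES = [
    "numpy", "scipy", "pybedtools", "pysam", "pyBigWig",
    "pandas", "sklearn", "xgboost", "seaborn",
]


def check_modules(names=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one status line per dependency.
    for name, found in check_modules().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Any module reported as missing can then be installed into the active environment before running the notebooks.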
This folder hosts the IPython notebooks for the machine learning modelling and for the PBM-DNase modelling of TF binding specificity by reranking, reweighting and background correction.
Also included is the core module for XGB-TFBSContext.
This folder contains some of the data for training the XGBoost model, including the clustered DNase data and the genome-wide transcription start sites. It also contains the k-mer count files used by PBM-DNase.
In the local repository, additional files such as the human genome, ChIP-seq peaks, PBM intensity data and k-mer scores, and the DNA shape files are included. They are not included here because of the large amount of space they take up. They should be downloaded separately as described below and in the respective IPython notebooks.
The results from feature importance studies and the plots are stored here.
Some standalone modules for feature importance studies are included.
- all_feats.py: Runs the full feature-importance-by-elimination studies.
- rerank.py: Main module for improving PBM in vivo prediction by re-ranking.
- test_xgb_svm_gbc_sgd.py: Module for investigating the performance of XGBoost, gradient boosting, support vector machines and stochastic gradient descent.
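As an illustration of what feature importance by elimination means in general, the sketch below removes one feature group at a time and records how much a score drops. It is not the code in all_feats.py: `importance_by_elimination`, `score_fn`, and the feature group names are hypothetical, and `score_fn` stands in for whatever routine trains and evaluates a model on a given set of columns.

```python
def importance_by_elimination(feature_groups, score_fn):
    """Measure the score drop when each feature group is left out.

    feature_groups: dict mapping a group name to a list of column names.
    score_fn: hypothetical callable that trains and evaluates a model on the
        given columns and returns a score where higher is better.
    Returns the baseline score and a dict of per-group score drops.
    """
    all_columns = [c for cols in feature_groups.values() for c in cols]
    baseline = score_fn(all_columns)
    drops = {}
    for group, cols in feature_groups.items():
        excluded = set(cols)
        kept = [c for c in all_columns if c not in excluded]
        # A large positive drop suggests the model relies on this group.
        drops[group] = baseline - score_fn(kept)
    return baseline, drops


if __name__ == "__main__":
    # Toy stand-in for model training: each column adds a fixed amount.
    weights = {"MGW": 0.25, "ProT": 0.125, "dnase": 0.5, "tss_dist": 0.0625}
    groups = {"shape": ["MGW", "ProT"], "dnase": ["dnase"], "tss": ["tss_dist"]}
    baseline, drops = importance_by_elimination(
        groups, lambda cols: sum(weights[c] for c in cols)
    )
    print(baseline, drops)
```

In practice `score_fn` would wrap cross-validated training of the chosen classifier, so each group costs one additional round of training.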
DNAShape information downloaded from ftp://rohslab.usc.edu/hg19/
- hg19.HelT.wig.bw
- hg19.MGW.wig.bw
- hg19.ProT.wig.bw
- hg19.Roll.wig.bw
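Once downloaded, the shape tracks can be read per base with pyBigWig. The following is a minimal sketch, not the repository's own code: `mean_shape_signal` is a hypothetical helper, the coordinates in the usage comment are arbitrary, and it only assumes that the file object exposes pyBigWig's `values(chrom, start, end)` method, which returns one float per base with NaN where data are missing.

```python
import math


def mean_shape_signal(bw, chrom, start, end):
    """Mean of the non-missing per-base values in the interval [start, end).

    `bw` is assumed to be an open pyBigWig file, or any object whose
    values(chrom, start, end) method returns one float per base, with NaN
    for positions without data.
    """
    values = [v for v in bw.values(chrom, start, end) if not math.isnan(v)]
    # Return NaN when the whole interval has no data.
    return sum(values) / len(values) if values else float("nan")


# Example usage (requires pyBigWig and a downloaded shape file):
#   import pyBigWig
#   bw = pyBigWig.open("hg19.MGW.wig.bw")
#   print(mean_shape_signal(bw, "chr1", 1000000, 1000030))
#   bw.close()
```

Taking the helper's file object as an argument keeps it independent of where the shape files are stored locally.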
The original Seed and Wobble algorithms from PBMAnalysisSuite
A modified version of the algorithm to take in k-mer frequency counts from http://www.bioinf.ict.ru.ac.za/counts_SnW
An executable motif algorithm based on Gibbs sampling from hierarchicalANOVA