iDeep

We propose iDeep, a deep learning based framework that fuses heterogeneous data to predict RNA-protein interaction sites. The framework not only learns hidden feature patterns from each individual data source, but also extracts a shared representation across them. In addition, the convolutional neural network in iDeep can automatically identify binding motifs. To compare our method with existing methods, we performed experiments on large-scale CLIP-seq datasets. The results indicate that iDeep outperforms the state-of-the-art methods.

Dependency

Keras 1.2.0 with the Theano 0.9 backend
sklearn
h5py (install it with "pip install h5py")
Python 2.7

Content

./datasets: the training and testing datasets with extracted features, labels and sequences.
./cbust_folder: the Cluster-Buster tool, used to generate motif features.
./pwms_folder: 102 PWMs (position weight matrices) from CISBP-RNA.
./predicted_motifs: binding motifs detected by iDeep for individual proteins. It also includes the report file ame.html from AME in the MEME Suite, which reports the enrichment scores of the predicted motifs.
./ideep.py: the Python code; run it to reproduce our results.
./make_feature_table.py: modified from primescore.

Usage

python ideep.py [-h] [--data_dir <data_directory>] [--train TRAIN]
[--model_dir MODEL_DIR] [--predict PREDICT]
[--out_file OUT_FILE] [--seq SEQ] [--region_type REGION_TYPE]
[--cobinding COBINDING] [--structure STRUCTURE]
[--motif MOTIF] [--batch_size BATCH_SIZE] [--n_epochs N_EPOCHS]

In our default setting, we use seq, region_type, cobinding and structure; the features are generated by iONMF (https://github.com/mstrazar/iONMF). Thus, with the default setting, data_dir needs to contain the following files: sequences.fa.gz, matrix_RegionType.tab.gz, matrix_RNAfold.tab.gz, matrix_Cobinding.tab.gz, motif_fea.gz, and the label file matrix_Response.tab.gz with labels 0 and 1. If you set an option to True, the corresponding data file must be present.
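Before starting a run, it can help to confirm that data_dir actually holds the files listed above. The following is a minimal sketch and not part of ideep.py: the function name missing_files and the mapping from option names to file names are assumptions made for illustration; only the file names themselves come from this README.

```python
import os

# File names are taken from this README. The option-name -> file-name
# mapping is an assumption for illustration, not read from ideep.py.
FEATURE_FILES = {
    "seq": "sequences.fa.gz",
    "region_type": "matrix_RegionType.tab.gz",
    "cobinding": "matrix_Cobinding.tab.gz",
    "structure": "matrix_RNAfold.tab.gz",
    "motif": "motif_fea.gz",
}
LABEL_FILE = "matrix_Response.tab.gz"


def missing_files(data_dir,
                  features=("seq", "region_type", "cobinding", "structure"),
                  need_labels=True):
    """Return the expected file names that are absent from data_dir."""
    wanted = [FEATURE_FILES[name] for name in features]
    if need_labels:
        wanted.append(LABEL_FILE)
    return [name for name in wanted
            if not os.path.isfile(os.path.join(data_dir, name))]
```

An empty list means every expected file was found. Pass need_labels=False when checking a directory that is only used for prediction, and add "motif" to features if you enable the motif option.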

Use example

1. Train the model using your data (currently only fixed-length sequences are supported; by default it uses the sequence, region type, structure and CLIP cobinding modalities):
python ideep.py --train=True --data_dir=datasets/clip/10_PARCLIP_ELAVL1A_hg19/5000/training_sample_0/ --model_dir=models

--model_dir: the directory where the trained model is saved; the prediction step reads the model from it.
--data_dir: the directory that contains the training feature files (sequences.fa.gz, matrix_RegionType.tab.gz, matrix_RNAfold.tab.gz, matrix_Cobinding.tab.gz) and the label file (matrix_Response.tab.gz).
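If you want to launch training from another Python script, for example to train on several datasets in turn, the command above can be assembled as an argument list. This is only a sketch: the helper name train_command is hypothetical, and it uses only the flags shown in this README.

```python
def train_command(data_dir, model_dir, script="ideep.py"):
    """Build the training command from this README as an argument list.

    train_command is a hypothetical helper; the flags themselves
    (--train, --data_dir, --model_dir) are the ones documented above.
    """
    return [
        "python", script,
        "--train=True",
        "--data_dir=" + data_dir,
        "--model_dir=" + model_dir,
    ]


# Example use (commented out so that importing this file runs nothing):
# import subprocess
# subprocess.check_call(train_command(
#     "datasets/clip/10_PARCLIP_ELAVL1A_hg19/5000/training_sample_0/", "models"))
```

Passing a list to subprocess avoids shell quoting problems when a directory name contains spaces.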

2. Predict the binding probability for your sequences (use the same model directory as in the training step):
python ideep.py --predict=True --data_dir=datasets/clip/10_PARCLIP_ELAVL1A_hg19/5000/test_sample_0/ --model_dir=models --out_file=YOUR_OUTFILE

--model_dir: the directory where the models were saved in the training step.
--data_dir: the directory that contains the testing feature files (sequences.fa.gz, matrix_RegionType.tab.gz, matrix_RNAfold.tab.gz, matrix_Cobinding.tab.gz). The predicted probabilities for your sequences are saved in <YOUR_OUTFILE>; each line is the probability that the corresponding sequence in the FASTA file is an RBP binding site.
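Since the output file holds one probability per line, in the order of the sequences in the FASTA file, the two can be joined after prediction. The sketch below assumes exactly that layout (one numeric value at the start of each non-empty line, one FASTA record per prediction); the function names are hypothetical and are not part of ideep.py.

```python
import gzip


def read_fasta_ids(fasta_gz_path):
    """Return the header (without '>') of each record in a gzipped FASTA file."""
    ids = []
    with gzip.open(fasta_gz_path, "rb") as handle:
        for raw in handle:
            line = raw.decode("utf-8").strip()
            if line.startswith(">"):
                ids.append(line[1:])
    return ids


def read_probabilities(out_file):
    """Read one probability per non-empty line (assumed output layout)."""
    with open(out_file) as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def pair_predictions(fasta_gz_path, out_file):
    """Pair each FASTA header with the probability on the matching line."""
    ids = read_fasta_ids(fasta_gz_path)
    probs = read_probabilities(out_file)
    if len(ids) != len(probs):
        raise ValueError("%d sequences but %d predictions" % (len(ids), len(probs)))
    return list(zip(ids, probs))
```

The length check guards against pairing an output file with the wrong data_dir, which would otherwise silently misalign sequences and scores.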

Reference
Xiaoyong Pan and Hong-Bin Shen. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinformatics, 2017, 18:136. DOI: 10.1186/s12859-017-1561-8

Contact: Xiaoyong Pan (xypan172436atgmail.com)