Accurately predicting protein function via deep learning with domain-guided structure information
Here we provide instructions for two use cases: (1) retraining our model on our data or your own, and (2) testing data with trained models.
If you encounter any bugs or issues, feel free to contact us.
PyTorch: 1.12.0
DGL: 1.1.0
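A quick way to confirm that an environment matches the versions listed above is to query the installed package metadata. The following is a minimal sketch, not part of the repository; it assumes the packages were installed under their usual distribution names `torch` and `dgl`, and the helper name `check_versions` is hypothetical.

```python
from importlib import metadata

# Versions listed in this README; keys are assumed distribution names.
REQUIRED = {"torch": "1.12.0", "dgl": "1.1.0"}


def check_versions(required):
    """Return a list of human-readable problems; an empty list means all match."""
    problems = []
    for package, wanted in required.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (README lists {wanted})")
            continue
        # Local build tags such as "+cu113" are ignored in the comparison.
        if installed.split("+")[0] != wanted:
            problems.append(f"{package}: found {installed}, README lists {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_versions(REQUIRED) or ["All listed versions match."]:
        print(line)
```

A mismatch does not necessarily mean the code will fail; the check only reports differences from the versions stated here.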
You can download our trained models using the link in ./data/download_link.txt.
Prepare your configuration file as required (see ./configure/[mf/cc/bp].yaml for an example). The following items must be specified:
name: mf # The ontology you want to choose: mf/bp/cc. Make sure it matches the file name of the configuration file (mf.yaml/bp.yaml/cc.yaml).
mlb: ./mlb/mf_go.mlb # The predicted labels used in DPFunc, generated automatically during training.
results: ./results # The directory in which to save the predicted results for the test data.
base:
  interpro_whole: ./data/interpro/{}.pkl # The InterPro files of proteins. Each InterPro file corresponds to an array with x columns, where each column is an InterPro property (IPR...), as listed in './data/inter_idx.pkl'.
  residue_feature: # The residue-level esm features of your test proteins. The details can be found in later sections.
  pdb_points: # The coordinate file of proteins, generated by `./DataProcess/generate_points.py`. The details can be found in later sections.
train:
  name: train
  pid_list_file: ./data/mf_train_used_pid_list.pkl
  pid_go_file: ./data/mf_train_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_train_whole_pdb_part{}.pkl
  train_file_count: 7
  interpro_file: ./data/mf_train_interpro.pkl # The path of the InterPro file for the training proteins, generated automatically during training.
valid:
  name: valid
  pid_list_file: ./data/mf_test1_used_pid_list.pkl
  pid_go_file: ./data/mf_test1_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_test1_whole_pdb_part0.pkl
  interpro_file: ./data/mf_test1_interpro.pkl # The path of the InterPro file for the validation proteins, generated automatically during training.
test:
  name: test
  pid_list_file: ./data/mf_test2_used_pid_list.pkl # The test protein list ('.pkl' format).
  pid_go_file: ./data/mf_test2_go.txt # The GO annotations of the test proteins (for evaluation, if provided).
  pid_pdb_file: ./data/PDB/graph_feature/mf_test2_whole_pdb_part0.pkl # The structure graphs of the test proteins.
  interpro_file: ./data/mf_test2_interpro.pkl # The path of the InterPro file for the test proteins, generated automatically during training.
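Before launching a run, it can help to check that a configuration contains the keys shown in the example and to see which files a `{}` template refers to. The sketch below is illustrative only and is not how DPFunc itself reads the file. The helper names `missing_keys` and `expand_parts` are hypothetical, and the assumption that training parts are numbered from 0 to `train_file_count - 1` is inferred from the `part0` file names in the validation and test sections, not confirmed by the source.

```python
# Keys that appear in every split section of the example configuration.
SPLIT_KEYS = ["name", "pid_list_file", "pid_go_file", "pid_pdb_file", "interpro_file"]


def missing_keys(cfg):
    """Return dotted names of expected keys that are absent from cfg (a dict)."""
    missing = [key for key in ("name", "mlb", "results") if key not in cfg]
    for split in ("train", "valid", "test"):
        section = cfg.get(split) or {}
        missing += [f"{split}.{key}" for key in SPLIT_KEYS if key not in section]
    return missing


def expand_parts(template, count):
    """Fill the '{}' placeholder with part indices 0..count-1 (assumed numbering)."""
    return [template.format(index) for index in range(count)]


if __name__ == "__main__":
    template = "./data/PDB/graph_feature/mf_train_whole_pdb_part{}.pkl"
    for path in expand_parts(template, 7):
        print(path)
```

With PyYAML installed, the dictionary passed to `missing_keys` could be obtained with `yaml.safe_load` on the opened configuration file.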
Notably, to generate pid_pdb_file, you need to complete the following steps:
- For pid_pdb_file:
  1.1 Place the PDB files of your proteins (5NTC_RAT.pdb, 6PGL_SCHPO.pdb, ...) in ./data/PDB/PDB_folder/.
  1.2 Use generate_points.py to generate the coordinate file of the proteins. The result is written to ./data/pdb_points.pkl.
      python ./DataProcess/generate_points.py -i ./data/mf_test2_used_pid_list.pkl -o pdb_points
  1.3 Use a pre-trained language model (esm or other PLLMs) to generate the residue features. As the number of proteins may be large, we suggest partitioning the data into several parts. An additional map file, map_pid_esm_file (dict format), is then needed to map each protein to its part id.
  1.4 Based on pdb_points.pkl, map_pid_esm_file.pkl, and pdb_residue_esm_embeddings_part{part_id}.pkl, use process_graph.py to generate the structure graphs for the test data. (Note: change the paths in the file.)
      python ./DataProcess/process_graph.py -d mf
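The partitioning described in step 1.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the map is taken to be a dict from protein ID to an integer part id starting at 0, and the helper names `build_part_map` and `split_by_part` are hypothetical. The exact format expected by process_graph.py should be checked against that script.

```python
import pickle


def build_part_map(pids, part_size):
    """Map each protein ID to the index of the part that will hold its features."""
    if part_size < 1:
        raise ValueError("part_size must be at least 1")
    return {pid: index // part_size for index, pid in enumerate(pids)}


def split_by_part(part_map):
    """Group protein IDs by part index, preserving their original order."""
    parts = {}
    for pid, part_id in part_map.items():
        parts.setdefault(part_id, []).append(pid)
    return parts


def save_part_map(part_map, path):
    """Write the map as a pickle file, e.g. map_pid_esm_file.pkl."""
    with open(path, "wb") as handle:
        pickle.dump(part_map, handle)


if __name__ == "__main__":
    # The first two IDs come from the example above; the third is made up.
    pids = ["5NTC_RAT", "6PGL_SCHPO", "EXAMPLE_PROTEIN"]
    part_map = build_part_map(pids, part_size=2)
    print(part_map)
    print(split_by_part(part_map))
```

Each group returned by `split_by_part` would then be embedded and stored in its own pdb_residue_esm_embeddings_part{part_id}.pkl file.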
Once the data is prepared, you can train our model on your data as follows (make sure your configuration file is correct):
python DPFunc_main.py -d mf -n 0 -e 15 -p temp_model
arguments:
-d: the ontology (mf/cc/bp)
-n: GPU number (default: 0)
-e: number of training epochs (default: 15)
-p: the prefix of results (default: temp_model)
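To train all three ontologies in sequence, the documented command can be assembled programmatically. The sketch below is one possible wrapper, not part of the repository. It uses only the flags listed above; the function names `build_train_command` and `train_all` are hypothetical, and it assumes the script is run from the repository root.

```python
import subprocess
import sys

ONTOLOGIES = ("mf", "cc", "bp")


def build_train_command(ontology, gpu=0, epochs=15, prefix="temp_model"):
    """Return the training command as an argument list, using the documented flags."""
    if ontology not in ONTOLOGIES:
        raise ValueError(f"ontology must be one of {ONTOLOGIES}, got {ontology!r}")
    return [
        sys.executable, "DPFunc_main.py",
        "-d", ontology,
        "-n", str(gpu),
        "-e", str(epochs),
        "-p", prefix,
    ]


def train_all(dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for ontology in ONTOLOGIES:
        command = build_train_command(ontology)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop if a training run fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    train_all(dry_run=True)
```

Running with `dry_run=True` only prints the commands, which makes it easy to verify them before starting a long training job.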
To test proteins with a trained model, comment out the training and validation code, as shown in DPFunc_pred.py.
You can also download our trained model from: https://drive.google.com/file/d/1V0VTFTiB29ilbAIOZn0okBQWPlbOI3wN/view?usp=drive_link
Please feel free to contact us for any further questions.
- Wenkang Wang
- Min Li
Wang W, Shuai Y, Zeng M, et al. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information[J]. Nature Communications, 2025, 16(1): 70.