CSUBioGroup/DPFunc

DPFunc

Accurately predicting protein function via deep learning with domain-guided structure information

Usage

This README covers two use cases: (1) retraining our model on our data or your own, and (2) testing data with trained models.

If you encounter any bugs or issues, feel free to contact us.

Train DPFunc

Key Environment

PyTorch: 1.12.0
DGL: 1.1.0
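
A quick way to confirm that the installed versions match the ones listed above is sketched below. It assumes the packages are installed under the distribution names `torch` and `dgl`; the helper name `check_versions` is hypothetical and not part of DPFunc.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README; the distribution names are an assumption.
EXPECTED = {"torch": "1.12.0", "dgl": "1.1.0"}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu113" are ignored in the comparison.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, wanted, matches)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_versions().items():
        print(f"{name}: installed={installed} expected={wanted} ok={ok}")
```

Other versions may also work; the check only reports whether the environment matches the versions named here.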

Data Download

The download link is given in ./data/download_link.txt; use it to obtain our trained model.

Data Construction

Prepare a configuration file as required (see ./configure/[mf/cc/bp].yaml for examples). The following items must be specified:

name: mf   # The ontology you want to choose: mf/bp/cc. Make sure it matches the file name of configuration file (mf.yaml/bp.yaml/cc.yaml).
mlb: ./mlb/mf_go.mlb  # The predicted labels used in DPFunc, which is generated automatically during training.
results: ./results  # The directory to save predicted results of test data.

base:
  interpro_whole: ./data/interpro/{}.pkl  # The InterPro files of proteins. Each file corresponds to an array with x columns, where each column is an InterPro entry (IPR...); the entries are listed in './data/inter_idx.pkl'.
  residue_feature: # The residue-level esm features of your test proteins. The details can be found in later sections.
  pdb_points: # The coordinate file of proteins, generated by `./DataProcess/generate_points.py`. The details can be found in later sections.

train:
  name: train
  pid_list_file: ./data/mf_train_used_pid_list.pkl
  pid_go_file: ./data/mf_train_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_train_whole_pdb_part{}.pkl
  train_file_count: 7
  interpro_file: ./data/mf_train_interpro.pkl # The path of interpro file including training proteins, which is generated automatically during training.

valid:
  name: valid
  pid_list_file: ./data/mf_test1_used_pid_list.pkl
  pid_go_file: ./data/mf_test1_go.txt
  pid_pdb_file: ./data/PDB/graph_feature/mf_test1_whole_pdb_part0.pkl
  interpro_file: ./data/mf_test1_interpro.pkl # The path of interpro file including validation proteins, which is generated automatically during training.
  
test:
  name: test
  pid_list_file: ./data/mf_test2_used_pid_list.pkl # The test protein list ('.pkl' format).
  pid_go_file: ./data/mf_test2_go.txt # The test proteins GO (for evaluation if provided).
  pid_pdb_file: ./data/PDB/graph_feature/mf_test2_whole_pdb_part0.pkl # The structure graphs of test proteins.
  interpro_file: ./data/mf_test2_interpro.pkl # The path of interpro file including test proteins, which is generated automatically during training.
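
Before training, it can help to sanity-check a loaded configuration. The sketch below works on a plain dict (for example, the result of parsing the YAML file with a YAML library) and reports which of the keys shown above are absent. The helper names `missing_keys` and `expand_part_paths` are hypothetical. The assumption that `{}` in `pid_pdb_file` is filled with part indices 0 to train_file_count - 1 is inferred from the example paths and is not confirmed by the DPFunc code.

```python
# Keys taken from the example configuration above.
TOP_LEVEL_KEYS = ["name", "mlb", "results"]
SPLIT_KEYS = ["name", "pid_list_file", "pid_go_file", "pid_pdb_file", "interpro_file"]
SECTION_KEYS = {
    "base": ["interpro_whole", "residue_feature", "pdb_points"],
    "train": SPLIT_KEYS + ["train_file_count"],
    "valid": SPLIT_KEYS,
    "test": SPLIT_KEYS,
}


def missing_keys(config):
    """List dotted names of expected keys that are absent from the config dict."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in config]
    for section, keys in SECTION_KEYS.items():
        block = config.get(section) or {}  # a missing or empty section counts as {}
        missing += [f"{section}.{key}" for key in keys if key not in block]
    return missing


def expand_part_paths(template, count):
    """Fill the '{}' placeholder with part indices 0..count-1 (assumed numbering)."""
    return [template.format(index) for index in range(count)]
```

For example, `expand_part_paths("./data/PDB/graph_feature/mf_train_whole_pdb_part{}.pkl", 7)` yields seven paths ending in `part0.pkl` through `part6.pkl`, which can then be checked for existence before a long training run.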

To generate pid_pdb_file, complete the following steps:

  1. For pid_pdb_file:

    1.1 Place the PDB files of your proteins (5NTC_RAT.pdb, 6PGL_SCHPO.pdb, ...) in ./data/PDB/PDB_folder/.

    1.2 Use generate_points.py to generate the coordinate files of the proteins. The result file will be placed at ./data/pdb_points.pkl.

    python ./DataProcess/generate_points.py -i ./data/mf_test2_used_pid_list.pkl -o pdb_points
    

    1.3 Use a pre-trained protein language model (ESM or another PLM) to generate the residue features. Because the number of proteins may be large, we suggest partitioning the data into several parts. An additional map file, map_pid_esm_file (dict format), is then needed to map each protein to its part id.

    1.4 Based on pdb_points.pkl, map_pid_esm_file.pkl, and pdb_residue_esm_embeddings_part{part_id}.pkl, use process_graph.py to generate the structure graphs for the test data. (Note: change the paths in the file to match your setup.)

    python ./DataProcess/process_graph.py -d mf
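
Step 1.3 can be sketched as follows: split the protein IDs into parts, write one embedding file per part, and record which part each protein landed in. This is only an illustration under assumptions. The per-part file layout (a dict from protein ID to its embedding), the zero-based part numbering, and the helper names `partition_pids` and `save_parts` are hypothetical, and `embed_fn` stands in for whatever protein language model you use. Check process_graph.py for the exact format it expects.

```python
import os
import pickle


def partition_pids(pids, part_size):
    """Split protein IDs into consecutive parts and map each ID to its part index."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = [pids[i:i + part_size] for i in range(0, len(pids), part_size)]
    pid_to_part = {pid: idx for idx, part in enumerate(parts) for pid in part}
    return parts, pid_to_part


def save_parts(pids, embed_fn, out_dir, part_size=1000):
    """Write one embeddings pickle per part plus the protein-to-part map."""
    os.makedirs(out_dir, exist_ok=True)
    parts, pid_to_part = partition_pids(pids, part_size)
    for idx, part in enumerate(parts):
        # File name follows the pattern mentioned in step 1.4 (assumed layout).
        path = os.path.join(out_dir, f"pdb_residue_esm_embeddings_part{idx}.pkl")
        with open(path, "wb") as handle:
            pickle.dump({pid: embed_fn(pid) for pid in part}, handle)
    with open(os.path.join(out_dir, "map_pid_esm_file.pkl"), "wb") as handle:
        pickle.dump(pid_to_part, handle)
    return pid_to_part
```

Writing each part as soon as it is computed keeps memory use bounded by `part_size` proteins rather than the whole data set.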
    

Train our model on our or your own data

Once the data is prepared, you can train our model on it as follows (make sure your configuration file is correct):

python DPFunc_main.py -d mf -n 0 -e 15 -p temp_model

arguments:
    -d: the ontology (mf/cc/bp)
    -n: GPU number (default: 0)
    -e: number of training epochs (default: 15)
    -p: the prefix of results (default: temp_model)
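
The options above could be declared roughly as follows with `argparse`. This is a sketch and not the parser used in DPFunc_main.py: the long option names are hypothetical, and treating `-d` as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented short options and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the DPFunc training options")
    parser.add_argument("-d", "--ontology", choices=["mf", "cc", "bp"], required=True,
                        help="the ontology (mf/cc/bp)")
    parser.add_argument("-n", "--gpu", type=int, default=0,
                        help="GPU number")
    parser.add_argument("-e", "--epochs", type=int, default=15,
                        help="number of training epochs")
    parser.add_argument("-p", "--prefix", default="temp_model",
                        help="the prefix of results")
    return parser
```

With this sketch, `build_parser().parse_args(["-d", "mf"])` fills in the documented defaults for the remaining options.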

Test

To test proteins with a trained model, comment out the training and validation code, as shown in DPFunc_pred.py.

Model Download

You can also download our trained model from: https://drive.google.com/file/d/1V0VTFTiB29ilbAIOZn0okBQWPlbOI3wN/view?usp=drive_link

Contact

Please feel free to contact us for any further questions.

References

Wang W, Shuai Y, Zeng M, et al. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information. Nature Communications, 2025, 16(1): 70.
