westlake-repl/Denovo-Pinal

Pinal: Toward De Novo Protein Design from Natural Language

This repository is the official implementation of Pinal: Toward De Novo Protein Design from Natural Language.

Quickly try our online server (16B) here

If you have any questions about the paper or the code, feel free to raise an issue!

Environment setup

Create and activate a new conda environment with Python 3.8.

conda create -n pinal python=3.8 --yes
conda activate pinal
pip install -r requirements.txt

Download model weights

Download the pre-trained model weights with the command below. Please download all files and put them in the weights directory, e.g., weights/Pinal/...

huggingface-cli download westlake-repl/Pinal \
                         --repo-type model \
                         --local-dir weights/
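If you prefer to fetch the weights from Python, the same download can be sketched with the huggingface_hub library, whose snapshot_download function mirrors the CLI command above. The wrapper name download_pinal_weights is hypothetical and not part of this repository; the sketch assumes huggingface_hub is installed in your environment.

```python
def download_pinal_weights(local_dir: str = "weights/") -> str:
    """Download every file of the westlake-repl/Pinal model repo into local_dir.

    Hypothetical helper: it only wraps huggingface_hub.snapshot_download
    with the same arguments as the CLI command in this README.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="westlake-repl/Pinal",
        repo_type="model",
        local_dir=local_dir,
    )


# Usage (needs network access and enough disk space for all checkpoints):
# path = download_pinal_weights()
# print(path)
```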

Model checkpoints

The weights directory contains 3 models:

Name          Size
SaProt-T      760M
T2struc-1.2B  1.2B
T2struc-15B   15B

Inference with Pinal

Design proteins from a natural language instruction with only three lines of code!

from utils.design_utils import load_pinal, PinalDesign
load_pinal()
res = PinalDesign(desc="Actin.", num=10)
# res is a list of designed proteins, sorted by the probability per token. 

The code above generates 10 de novo designed proteins from the input description "Actin.", using T2struc-1.2B and SaProt-T. To run inference with T2struc-15B instead, set the environment variable T2struc_NAME before calling load_pinal(), as shown below.

import os
os.environ["T2struc_NAME"] = "T2struc-15B"
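Because the larger model needs much more GPU memory, it can be convenient to choose the model name programmatically. The sketch below is only an illustration: pick_t2struc_name and the 24 GB example value are hypothetical, the 40 GB threshold comes from the warning in this section, and using "T2struc-1.2B" as a valid value for T2struc_NAME is an assumption based on the checkpoint names listed above.

```python
import os

# Minimum GPU memory (GB) this README states for T2struc-15B.
T2STRUC_15B_MIN_GB = 40.0


def pick_t2struc_name(gpu_mem_gb: float) -> str:
    """Return the checkpoint name that fits in the given GPU memory (hypothetical helper)."""
    if gpu_mem_gb >= T2STRUC_15B_MIN_GB:
        return "T2struc-15B"
    # Assumed name of the default model, taken from the checkpoint table.
    return "T2struc-1.2B"


# Must be set before load_pinal() is called; 24.0 is an example value.
os.environ["T2struc_NAME"] = pick_t2struc_name(24.0)
print(os.environ["T2struc_NAME"])
```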

Warning: Inference with T2struc-15B requires at least 40 GB of GPU memory.
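The comment in the inference snippet says the returned designs are sorted by per-token probability. One common way to compute such a score is the mean log-probability per token, which avoids penalizing longer sequences; the sketch below illustrates that idea on made-up data. The dictionary layout, the token_probs field, and both helper names are assumptions for illustration. This README does not state how Pinal computes or stores its score.

```python
import math


def mean_log_prob(token_probs):
    """Average log-probability over the tokens of one design."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def rank_designs(designs):
    """Sort designs from most to least likely per token."""
    return sorted(
        designs,
        key=lambda d: mean_log_prob(d["token_probs"]),
        reverse=True,
    )


# Made-up example data, not real model output.
designs = [
    {"sequence": "MAL", "token_probs": [0.5, 0.4, 0.6]},
    {"sequence": "MKV", "token_probs": [0.9, 0.8, 0.7]},
]
ranked = rank_designs(designs)
print([d["sequence"] for d in ranked])
```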

Predicting amino acid sequence with SaProt-T

The script below uses SaProt-T to predict an amino acid sequence from a natural language description and a desired structure, which you specify as a string of Foldseek tokens.

from utils.design_utils import SaProtPrepareGenerationInputs, SaProtGeneration, load_SaProtT_and_tokenizers
desc = "Actin."
saprot, saprot_text_tokenizer, saprot_tokenizer = load_SaProtT_and_tokenizers()
structure = "dqdppafakewedfqfwifidtfpdqggqdifgqkkwafpdpppcvppdddridgtvrrvvvvvgtdmdgqdalqagpdpvsvlvvvvcvdcprvnhqqlnheyeyegaapydlvrllsvvccscpvsvhqwyayaylqlllcvlvvdqfawefaaalqwtkiwggdnsdtdnqlididrdhnvlllvllqvvvvvvvdhqddpnssvvssvcqlpqaaadldlvvqvvclvvdqpskdwdqdpvrdididtssrhvslccqcvvvsvvdpdhhslvsnvsslvsddpvrslvhqchyeyaysrvqhhcpqsnsqvsncvvddvphdgdydydnvrncssvssvsplspdpvnpvlidgsvncvvppssvnvvrhd"
SaProtInputDict = SaProtPrepareGenerationInputs([" ".join(list(structure))], desc, saprot_text_tokenizer, saprot_tokenizer)
seq = SaProtGeneration(saprot, SaProtInputDict, saprot_tokenizer)["sequence"]
print(seq)

The SaProt-T example above makes predictions from Foldseek tokens. To convert a 3D structure file (e.g., .pdb or .mmcif) into Foldseek tokens, download the Foldseek binary from here and place it in the assets/bin folder. The following code demonstrates how to use it.

from utils.foldseek_utils import get_struc_seq
pdb_path = "assets/8ac8.cif"
# Extract chain "A" from the structure file and encode it as a Foldseek structure sequence
foldseek_seq = get_struc_seq("assets/bin/foldseek", pdb_path, ["A"])["A"][1].lower()
print(f"foldseek_seq: {foldseek_seq}")
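The SaProt-T example passes the structure as space-separated lowercase tokens (" ".join(list(structure))), so a string obtained from get_struc_seq needs the same treatment before it is used there. A minimal sketch of that step with a basic sanity check follows. The helper format_structure_tokens is hypothetical, and it assumes the input contains only the 20 letters of the Foldseek 3Di alphabet; if your pipeline uses extra symbols such as a mask character, the allowed set would need to be extended.

```python
# The Foldseek 3Di alphabet has 20 states, written with the amino acid letters.
FOLDSEEK_ALPHABET = set("acdefghiklmnpqrstvwy")


def format_structure_tokens(structure: str) -> str:
    """Lowercase a Foldseek string, validate it, and join tokens with spaces."""
    tokens = structure.strip().lower()
    if not tokens:
        raise ValueError("structure string is empty")
    invalid = sorted(set(tokens) - FOLDSEEK_ALPHABET)
    if invalid:
        raise ValueError(f"unexpected structure tokens: {invalid}")
    return " ".join(tokens)


print(format_structure_tokens("DQDPPAF"))  # d q d p p a f
```

The returned string can then take the place of " ".join(list(structure)) in the SaProt-T example.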

Computational evaluation of the de novo designed proteins

For textual alignment, we recommend using ProTrek to calculate the sequence-text similarity score.

For foldability, we recommend using pLDDT and PAE, as output by the AlphaFold series or ESMFold.
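As a rough illustration of the foldability check, the sketch below computes the mean pLDDT of a predicted structure in PDB format. It relies on the convention that AlphaFold and ESMFold write per-residue pLDDT into the B-factor column; the function name mean_plddt and the example file name are hypothetical, and the scale of the values (0 to 100 or 0 to 1) can differ between tools, so check your predictor's output before applying a cutoff.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column over C-alpha atoms of a PDB-format string."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16, B-factor in 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


# Usage with a hypothetical prediction file:
# with open("design_0.pdb") as handle:
#     print(mean_plddt(handle.read()))
```

Only C-alpha atoms are counted so that each residue contributes once, regardless of how many atoms it has.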
