
Can you provide the pdb files (*_motif.pdb, *_clean.pdb, *_reference.pdb) for Sequence-conditioned generation #5

Open
paperClub-hub opened this issue Aug 12, 2024 · 13 comments

Comments

@paperClub-hub

@wxy-nlp I cannot find the pdb file [os.path.join('data-bin/scaffolding-pdbs/' + pdb + '_motif.pdb')] when running scaffold_generate.py. Can you provide it? (I searched https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs and found no *_motif.pdb.)

@hhhhh789

I've hit the same issue. Are there any links telling me where to find such files?

@wxy-nlp
Collaborator

wxy-nlp commented Aug 15, 2024

Hello!

Thank you for pointing out this problem. The *_motif.pdb files were used to double-check that the motif sequence is loaded correctly. They are in fact not necessary; we have removed the double-check code, so the *_motif.pdb files are no longer needed.

Please try again, and if you encounter any other issues, feel free to let us know!

@done520

done520 commented Aug 15, 2024

@wxy-nlp Thanks, it's not necessary, and after reading the code I understand the meaning. But some yaml files are needed when loading the model, such as "..hydra/.yaml, ./experiment/lm/yaml" — are there versions matched to the three pretrained models?

@done520

done520 commented Aug 15, 2024

@wxy-nlp Can you share the "..hydra/.yaml, ./experiment/lm/yaml" matched to the three pretrained models? When I use the pretrained model dplm_150m with my own yaml, the motif-scaffolding evaluation is very bad compared with your results.

@hhhhh789

Thank you, the updated version did work for generating sequences and folded structures conditioned on the motif. However, I encountered further problems when evaluating the success rate; it turns out the evodiff repo may be missing some files as well:

Traceback (most recent call last):
File "/project/playground/dplm/./analysis/motif_analysis.py", line 117, in
with open(reference_PDB) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data-bin/scaffolding-pdbs/1prw_reference.pdb'

I've tried to work around this by copying 1prw_clean.pdb to 1prw_reference.pdb (probably the wrong thing to do), but then hit errors at later lines when processing 4jhw:

1prw
1bcf
5tpn
3ixt
4jhw
Traceback (most recent call last):
File "project/playground/dplm/./analysis/motif_analysis.py", line 123, in
calc_rmsd_tmscore(
File "/sharefs/yupei/project/playground/dplm/./analysis/motif_analysis.py", line 81, in calc_rmsd_tmscore
assert (ref.select_atoms(ref_selection).resnames == u.select_atoms(u_selection).resnames).all(), "Resnames for
AssertionError: Resnames for motifRMSD do not match, check indexing

Could you explain how this mismatch happens and provide the right file?

@hhhhh789

@wxy-nlp Thanks, it's not necessary, and after reading the code I understand the meaning. But some yaml files are needed when loading the model, such as "..hydra/.yaml, ./experiment/lm/yaml" — are there versions matched to the three pretrained models?

I've tackled this problem when loading models. Actually, when you're loading models from Hugging Face, yaml files aren't required; please follow the code below:

def from_pretrained(cls, net_name, cfg_override={}, net_override={}):
        # net_name: directory of the model downloaded from Hugging Face
        net = EsmForDPLM.from_pretrained(net_name, **net_override)
        return cls(cfg=cfg_override, net=net)

@wxy-nlp
Collaborator

wxy-nlp commented Aug 16, 2024

Thank you, the updated version did work for generating sequences and folded structures conditioned on the motif. However, I encountered further problems when evaluating the success rate; it turns out the evodiff repo may be missing some files as well:

Traceback (most recent call last):
File "/project/playground/dplm/./analysis/motif_analysis.py", line 117, in
with open(reference_PDB) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data-bin/scaffolding-pdbs/1prw_reference.pdb'

I've tried to work around this by copying 1prw_clean.pdb to 1prw_reference.pdb (probably the wrong thing to do), but then hit errors at later lines when processing 4jhw:

1prw 1bcf 5tpn 3ixt 4jhw
Traceback (most recent call last):
File "project/playground/dplm/./analysis/motif_analysis.py", line 123, in
calc_rmsd_tmscore(
File "/sharefs/yupei/project/playground/dplm/./analysis/motif_analysis.py", line 81, in calc_rmsd_tmscore
assert (ref.select_atoms(ref_selection).resnames == u.select_atoms(u_selection).resnames).all(), "Resnames for
AssertionError: Resnames for motifRMSD do not match, check indexing

Could you explain how this mismatch happens and provide the right file?

Hello!
For the problem of missing "*_reference.pdb" files, we have uploaded all the pdb files we use in this link, and you can download them.

For the error "AssertionError: Resnames for motifRMSD do not match, check indexing", this happens because the motif part of the predicted sequence does not match the ground-truth motif sequence.
You can check the following code in "analysis/motif_analysis.ipynb",

def calc_rmsd_tmscore(pdb_name, reference_PDB, scaffold_pdb_path=None, scaffold_info_path=None, ref_motif_starts=[30], ref_motif_ends=[44], output_path=None):
    "Calculate RMSD between reference structure and generated structure over the defined motif regions"
    motif_df = pd.read_csv(os.path.join(scaffold_info_path, f'{pdb_name}.csv'), index_col=0) #, nrows=num_structures)
    results = []
    for pdb in os.listdir(os.path.join(scaffold_pdb_path, f'{pdb_name}')): # This needs to be in numerical order to match new_starts file
        if not pdb.endswith('.pdb'):
            continue
        ref = mda.Universe(reference_PDB)
        predict_PDB = os.path.join(os.path.join(scaffold_pdb_path, f'{pdb_name}'), pdb)
        u = mda.Universe(predict_PDB)

        ref_selection = 'name CA and resnum '
        u_selection = 'name CA and resnum '
        i = int(pdb.split('_')[1])
        new_motif_starts = literal_eval(motif_df['start_idxs'].iloc[i])
        new_motif_ends = literal_eval(motif_df['end_idxs'].iloc[i])

        for j in range(len(ref_motif_starts)):
            ref_selection += str(ref_motif_starts[j]) + ':' + str(ref_motif_ends[j]) + ' ' 
            u_selection += str(new_motif_starts[j]+1) + ':' + str(new_motif_ends[j]+1) + ' '
        # print("U SELECTION", u_selection)
        # print("SEQUENCE", i)
        # print("ref", ref.select_atoms(ref_selection).resnames)
        # print("gen", u.select_atoms(u_selection).resnames)
        # This asserts that the motif sequences are the same - if you get this error something about your indices are incorrect - check chain/numbering
        assert len(ref.select_atoms(ref_selection).resnames) == len(u.select_atoms(u_selection).resnames), "Motif \
                                                                    lengths do not match, check PDB preprocessing\
                                                                    for extra residues"

        assert (ref.select_atoms(ref_selection).resnames == u.select_atoms(u_selection).resnames).all(), "Resnames for\
                                                                    motifRMSD do not match, check indexing"

and uncomment these lines:

        print("U SELECTION", u_selection)
        print("SEQUENCE", i)
        print("ref", ref.select_atoms(ref_selection).resnames)
        print("gen", u.select_atoms(u_selection).resnames)

then you can see the difference between the motif of the generated sequence and the ground-truth motif.
In most cases, the problem is caused by a wrong index. When we generate a scaffold, we first determine the scaffold length and then save the start and end positions of the motif in the sequence.
For example, we can initialize the input sequence as follows:
seq = XXXXELVAXXX
where ELVA is the motif and X represents scaffold to be generated, so the motif start position is 4 and the end position is 7. After generation, we pick the sub-sequence from index 4 to 7 to extract the motif from the whole sequence. If the start or end position is wrong, the extracted motif will be wrong, and the assert will be triggered.
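The index bookkeeping described here can be sketched in a few lines (a minimal illustration with a hypothetical helper, not the actual DPLM code):

```python
def extract_motif(seq, start, end):
    """Return the motif sub-sequence using inclusive start/end positions."""
    return seq[start:end + 1]

seq = "XXXXELVAXXX"  # ELVA is the motif; X marks scaffold positions to generate
motif = extract_motif(seq, 4, 7)  # -> "ELVA"
# If the saved start/end positions are off (e.g. by one), the extracted
# motif no longer matches the ground truth and the assert above fires.
```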

@done520

done520 commented Aug 16, 2024

@wxy-nlp Thanks, it's not necessary, and after reading the code I understand the meaning. But some yaml files are needed when loading the model, such as "..hydra/.yaml, ./experiment/lm/yaml" — are there versions matched to the three pretrained models?

I've tackled this problem when loading models. Actually, when you're loading models from Hugging Face, yaml files aren't required; please follow the code below:

def from_pretrained(cls, net_name, cfg_override={}, net_override={}):
        # net_name: directory of the model downloaded from Hugging Face
        net = EsmForDPLM.from_pretrained(net_name, **net_override)
        return cls(cfg=cfg_override, net=net)

@hhhhh789 Thanks. What I mean is that I got bad results: the pLDDT was low (20 ~ 50) for "Sequence-conditioned generation: motif scaffolding" when I used the pretrained models [dplm_150m] and [dplm_650m] downloaded from Hugging Face.
But the model I trained on only 10% of the total data does better, with pLDDT 70 ~ 87.8. So I think something is wrong in how the pretrained models are used, or the parameters are inappropriate, especially the "..hydra/_.yaml, ./experiment/lm/yaml". Could you share the "..hydra/.yaml, ./experiment/lm/yaml" you used during pretraining? @wxy-nlp

@wxy-nlp
Collaborator

wxy-nlp commented Aug 16, 2024

@wxy-nlp Thanks, it's not necessary, and after reading the code I understand the meaning. But some yaml files are needed when loading the model, such as "..hydra/.yaml, ./experiment/lm/yaml" — are there versions matched to the three pretrained models?

I've tackled this problem when loading models. Actually, when you're loading models from Hugging Face, yaml files aren't required; please follow the code below:

def from_pretrained(cls, net_name, cfg_override={}, net_override={}):
        # net_name: directory of the model downloaded from Hugging Face
        net = EsmForDPLM.from_pretrained(net_name, **net_override)
        return cls(cfg=cfg_override, net=net)

@hhhhh789 Thanks. What I mean is that I got bad results: the pLDDT was low (20 ~ 50) for "Sequence-conditioned generation: motif scaffolding" when I used the pretrained models [dplm_150m] and [dplm_650m] downloaded from Hugging Face.
But the model I trained on only 10% of the total data does better, with pLDDT 70 ~ 87.8. So I think something is wrong in how the pretrained models are used, or the parameters are inappropriate, especially the "..hydra/_.yaml, ./experiment/lm/yaml". Could you share the "..hydra/.yaml, ./experiment/lm/yaml" you used during pretraining? @wxy-nlp

Hello done520,

the pLDDT was low (20 ~ 50) for "Sequence-conditioned generation: motif scaffolding" when I used the pretrained models [dplm_150m] and [dplm_650m] downloaded from Hugging Face

This is weird; the pLDDT should not be that low. I tried to generate scaffolds with the [dplm_150m] and [dplm_650m] models downloaded from huggingface on my own machine using the following script:

export CUDA_VISIBLE_DEVICES=0

model_name=dplm_650m
output_dir=./generation-results/${model_name}_scaffold

mkdir -p generation-results

# Generate scaffold 
python scaffold_generate.py \
    --model_name airkingbd/${model_name} \
    --num_seqs 100 \
    --saveto $output_dir

and the results are correct.
I think there may be something wrong in the way the model is loaded. Could you please share the code you use to load the model? Thank you!

I guess you may have directly set model_name to a path. If that's the case, try setting model_name to the model id on huggingface, which can be one of "airkingbd/dplm_150m", "airkingbd/dplm_650m" and "airkingbd/dplm_3b", and try again.

@done520

done520 commented Aug 17, 2024

@wxy-nlp thanks.

1. Pretrained model

dplm_150m, downloaded from https://huggingface.co/airkingbd/dplm_150m/tree/main
and saved to ./model_checkpoints/downs_pretrained_models/dplm_150m/checkpoints
-rw-r--r-- 1 users 765 Jul 31 09:14 config.json
-rw-r--r-- 1 users 1519 Jul 31 09:14 gitattributes
-rw-r--r-- 1 users 595359662 Jul 31 09:19 pytorch_model.bin
-rw-r--r-- 1 users 125 Jul 31 09:19 special_tokens_map.json
-rw-r--r-- 1 users 1126 Jul 31 09:19 tokenizer_config.json
-rw-r--r-- 1 users 93 Jul 31 09:19 vocab.txt

2. Run

export CUDA_VISIBLE_DEVICES=5
python scaffold_generate.py --model_name model_checkpoints/downs_pretrained_models/dplm_150m/checkpoints --num_seqs 100 --saveto ./generation-results/new_dplm_150m_scaffold

Running as above, I got the error: No such file or directory: model_checkpoints/downs_pretrained_models/.hydra/config.yaml
so I did this:

mkdir -p model_checkpoints/downs_pretrained_models/.hydra
cp configs/experiment/lm/cond_dplm_150m.yaml   model_checkpoints/downs_pretrained_models/.hydra/config.yaml

python scaffold_generate.py --model_name model_checkpoints/downs_pretrained_models/dplm_150m/checkpoints --num_seqs 100  --saveto ./generation-results/new_dplm_150m_scaffold

This gave "IsADirectoryError: [Errno 21] Is a directory: 'model_checkpoints/downs_pretrained_models/dplm_15m/checkpoints'", so I made some changes, and 30 minutes later I got results:

python scaffold_generate.py --model_name model_checkpoints/downs_pretrained_models/dplm_150m/checkpoints/pytorch_model.bin --num_seqs 100  --saveto ./generation-results/new_dplm_150m_scaffold 

But the fasta file has many special tokens and the pLDDT is very low (pLDDT 30 ~ 40):

>SEQUENCE_0_PDB_1prw
UBAUOUS<null_1>YTD<null_1>ISTDR<null_1>LREHDGD-YPD.CPEWQHIE.DCAAF<null_1><null_1>Z<null_1>RFSLFDKDGDGT
>SEQUENCE_1_PDB_1prw
YH.EDY<null_1>-FZBCZYWIBHR<null_1>W<null_1>DN.WZERLWKZHNDZPU-QDTUCCNYSDUNQ-HA<null_1>SEPWUA-NZAFSLFDKD
>SEQUENCE_2_PDB_1prw
P<null_1>ZHSCQZKYYRUHZYKAFUCMN.ZGIUOZK-ZUYEPHEVURYWZZAFNCZKFSLFDKDGDGTITTKELGTV<null_1>QY.BOVWEWTUCZUK
>SEQUENCE_3_PDB_1prw
NDUQS.EVBAIKU.A.TTBQIWA-UNQQM-EKKH.TIFSLFDKDGDGTITTKELGTVHBG.<null_1>GEETDREEDINEVDADGNGTIDFPEFLTMCR-D
>SEQUENCE_4_PDB_1prw
PBHODNCW-DFSLFDKDGDGTITTKELGTV<null_1>DW-CASC.P-VFHK-INEVDADGNGTIDFPEFLTMZA.ANC<null_1>FRERRBACQHCOWH-
>SEQUENCE_5_PDB_1prw
CNDDC-ZKCS-PZGEIZG-UVRZKUKMUNEKNHCZURDRAZBWSFQVNCELD..WRIWQ-FURTDZDMAYBBTYGDFSLFDKDGDGTITTKELGTVHO--EF
>SEQUENCE_6_PDB_1prw
ZFRV-TN.RLCIYVKLYRNNWWDDDGDTNOACBVRFFSLFDKDGDGTITTKELGTVNEMDQGQQWR-<null_1>Z.V-INEVDADGNGTIDFPEFLTMVEU
>SEQUENCE_7_PDB_1prw
OHB-EFRMWW.N<null_1>EG-QH<null_1>C<null_1>GFDHTR-QZ<null_1>QEHZ.ZGC-AM--FSLFDKDGDGTITTKELGTV-ZRE-CB.QM

3. My own pretrained model

I used my own pretrained model to predict (pLDDT 80 ~ 87):

python scaffold_generate.py --model_name  byprot-checkpoints/dplm_150m/checkpoints/best.ckpt --num_seqs 100 --saveto generation-results/dplm_150m_mytest_scaffold 

## fasta of 1prw
>SEQUENCE_0_PDB_1prw
MYVIRVSAKNEAGFSLFDKDGDGTITTKELGTVAVADLKALADKDKTASINEVDADGNGTIDFPEFLTMTNYLQRYLVLTVPTVDRIYSISAKKFGHMDKIQ
>SEQUENCE_1_PDB_1prw
TKEDVKTVKIKTRVIIKDTRIAKEKLTESGKALFSLFDKDGDGTITTKELGTVKEVNKRSIKKEIKKTKINEVDADGNGTIDFPEFLTMEIPDLIKETVRDI
>SEQUENCE_2_PDB_1prw
TPAGSPPTFLTPLEPVTVIEGYPAVLECQVSGVPKPTITWYRQGATIKFSPDFQMYYDGELYCLKVKFSLFDKDGDGTITTKELGTVATTKCELVVQDADSV
>SEQUENCE_3_PDB_1prw
RVNVETQKANVRSLDEYHAYLFRVCSRNEVAQGEPWETEDFSLFDKDGDGTITTKELGTVEKKSLKGVSFSATDNSINEVDADGNGTIDFPEFLTMQRIENV
>SEQUENCE_4_PDB_1prw
FSLPLEPVTVIEGEPARLEVKVSGDPKPKITWYRQTVPITPSEDFQVYYDGDVATLVIKEAFPEDSGVYRFSLFDKDGDGTITTKELGTVPVTVRSDATTPI
>SEQUENCE_5_PDB_1prw
IHLDCRVEPSGDPTLKVEWFFNGRSLTVSSRFQSTFDFGLVSLDIAYAYPEDSGVYTVRAVNPLGEATTTASLKVEGKEELEGTFSLFDKDGDGTITTKELG
>SEQUENCE_6_PDB_1prw
DGGSPIISYSVEFSLFDKDGDGTITTKELGTVKYVVPGLKRGLEYIFRINEVDADGNGTIDFPEFLTMARDPIAPPDPPTKEMVTDSTKTVDVAWDEPPKDG
>SEQUENCE_7_PDB_1prw
KPGGRQVSESGMPPTFLAPMENVTAVEGYPAVFDCKVIGPPKPKITWYRQGQPLKDSKEFSLFDKDGDGTITTKELGTVPEDDGVYKIKARNKYGINEVDAD
>SEQUENCE_8_PDB_1prw

So, I think there is something wrong in my process, and I am looking forward to your reply.

@wxy-nlp
Collaborator

wxy-nlp commented Aug 17, 2024

Hello @done520,
I notice that you load the model using the following script:

export CUDA_VISIBLE_DEVICES=5
python scaffold_generate.py --model_name model_checkpoints/downs_pretrained_models/dplm_150m/checkpoints --num_seqs 100 --saveto ./generation-results/new_dplm_150m_scaffold

where --model_name is a path.
When model_name is a path, the DiffusionProteinLanguageModel.from_pretrained(model_name) method automatically loads a local pretrained checkpoint of your own, NOT FROM HUGGINGFACE. This therefore requires the .hydra/config.yaml that is generated during training.
I suggest setting model_name to the model id on huggingface, which can be one of "airkingbd/dplm_150m", "airkingbd/dplm_650m" and "airkingbd/dplm_3b", like this:

model_name="airkingbd/dplm_150m"
python scaffold_generate.py --model_name ${model_name} --num_seqs 100 --saveto ./generation-results/dplm_150m_scaffold

This script will load the model from huggingface correctly.
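The path-vs-model-id behavior can be sketched like this (a hypothetical illustration of the dispatch just described, not the actual DPLM implementation):

```python
import os

def resolve_checkpoint_source(model_name):
    """Sketch: decide where from_pretrained would load from."""
    if os.path.exists(model_name):
        # Local path: loads your own checkpoint and therefore
        # needs the .hydra/config.yaml produced during training.
        return "local"
    # Otherwise treated as a Hugging Face model id,
    # e.g. "airkingbd/dplm_150m".
    return "huggingface"
```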

@wxy-nlp
Collaborator

wxy-nlp commented Aug 17, 2024

Hello @done520,

Considering that you have downloaded the models from huggingface,

dplm_150m, downloaded from https://huggingface.co/airkingbd/dplm_150m/tree/main
and saved to ./model_checkpoints/downs_pretrained_models/dplm_150m/checkpoints

you can also update your code to our latest commit, which supports your original usage of directly setting model_name to the path of the model downloaded from huggingface:

export CUDA_VISIBLE_DEVICES=5
python scaffold_generate.py --model_name model_checkpoints/downs_pretrained_models/dplm_150m/checkpoints --num_seqs 100 --saveto ./generation-results/new_dplm_150m_scaffold

@hhhhh789


Thanks for your patience. Now the success-rate evaluation works!
