[P1] Eval time model is not loaded: Unable to replicate results from paper for RoBERTa Base for GLUE tasks like CoLA #114
@m-dev12 Hey! Thanks for the inputs! Here are some pointers about reproducing our results.
Note that here, we (1) use seed
Let me know if these help! and let me know if you have more questions! Thanks! |
Thank you for the detailed response @frankaging. As for my environment, I am using pyvene==0.1.1 and pyreft==0.0.6. Note that pyvene==0.1.2 did not work with RoBERTa (it raised an error about an additional use_cache argument being passed). I believe it is related to this: stanfordnlp/pyvene#152. Is there anything you recommend I check? Ideally I should be able to replicate your logs.
@m-dev12 Thanks for the follow-up! I think it might boil down to differences in the random state of the machine, which is hard to control, especially given how unstable training can be on datasets such as CoLA and RTE (e.g., even with the same random seed, there can be discrepancies across machines). You can try following this ticket to create exactly the same env as we have locally, and test whether it helps: #102. Minor: note that ~61.6% is close to the number we reported in the paper (60.4), given the instability of the setup. I would also recommend trying different seeds, e.g. 46/43/44.
Thanks again @frankaging! Yes, I understand, and yes, the results are close to those reported in the paper. This is related to this fix: stanfordnlp/pyvene#152
@m-dev12 Thanks! Opening an issue would help a lot, as I am slowly ramping up my workload on these two repos again for the summer! |
Thanks @frankaging! I have opened an issue on pyreft. On a side note, could you please share the commands with hyperparameter configs for other GLUE tasks like MNLI and QNLI for RoBERTa, in case there are any nuances not mentioned in the paper, since CoLA is a little unstable? Thanks again!
@m-dev12 Hey, thanks. The whole hyperparameter search space is outlined in Table 8 on pg. 25; Individual task hyperparameter configuration is outlined on the page that follows from Table 9 to Table 12. And we use 256 as our maximum sequence length for all tasks. I double checked, and I think most of the nuances are mentioned already, but maximum sequence length is indeed missing (and probably the only one? i think). We wrote this on pg. 23 last paragraph, which causes confusion:
Thus, we will add a sentence after it to clarify that we also use the same maximum sequence length, which is 256. (If you want to do a hyperparameter search as we did, you can also follow our hyperparameter search procedure:)
@frankaging Sure, yes I have taken everything from the appendix of the paper, will double check any additional details from Wu et al (2024). Thanks a lot! |
Hi @frankaging, I was trying to replicate the results for RoBERTa Base with LoReFT on the MNLI dataset with the configuration given in Table 9, but I got very different accuracy. For instance, for the command below, I get a validation matched accuracy of ~56% (although accuracy during training was ~82%).
Can you please share anything missing / different in your runs? |
@m-dev12 Thanks for raising the issue. This is our previous run with seed 42 (I think 45 should be quite similar given MNLI is stable, unlike CoLA/RTE):
Note that we rename our script, and intervention This is what i got from our old log
See my screenshot for the run name. Based on this, it could be a regression due to our recent code changes. Could you share the top git commit hash of your branch, for both pyvene and pyreft? I will check.
I agree, results should ideally not vary so much between seeds. Also, the gap between train and validation loss seems to indicate some kind of overfitting? Here's the git top commit hash:
Attached is the environment file as well; I am using pyreft 0.0.5 and pyvene 0.1.1. Another question: is there any specific reason you train for as long as 40 epochs? It takes quite long for MNLI, and the loss curve does not seem to decrease much after a while.
@m-dev12 just did a quick check - could not find recent changes that touch the saving and loading code path. But my guess is that classifier head (since for GLUE, we freeze everything else and train the head with interventions) is not properly loaded during evaluation time. Could you downgrade your transformers to |
Thanks! We were following previous works on the epoch selection. I agree, it is on the high side. Given the large gap, we can also start debugging with a much lower epoch count, I think. Maybe epoch=5 or 10? Could you upgrade your pyvene to 0.1.2? Thanks!
I agree, will try with lower epochs. I had switched to pyvene==0.1.1 from 0.1.2 initially, because it was giving out an error with Roberta models. stanfordnlp/pyvene#163 |
@m-dev12 Another thing to check: we wrote a callback function in pyreft to load the best model at the end, as follows:
def _load_best_model(self):
    logger.warning(f"Loading best model from {self.state.best_model_checkpoint} (score: {self.state.best_metric}).")
    self.model.load_intervention(
        f"{self.state.best_model_checkpoint}/intervenable_model",
        include_model=True
    )
It should log something like "Loading best model from ...". Could you find this line in your log? Thanks! And do these files exist in your directory as well? as in
@PinetreePantry Adding Peter here who will provide more detail since he owns the MNLI results, and he is reproducing the error on our cluster. Two minor things:
@m-dev12 I just checked in the fix for |
Yup I can see this in the logs:
You mean warmup ratio right? Sure, let me update pyvene and will run fresh experiments. |
@m-dev12 Thanks. This might require some changes from your end. Would it be possible for you to modify
To load back the classifier heads and interventions you just need to call this:
trainer.model.load_intervention(
    f"<your_best_checkpoint_dir>",
    include_model=True
)
Ideally if you run the same command, it should directly do the evaluation and print out the number. And could you also list out the files under the best checkpoint?
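For the last request, the files under a checkpoint can be listed with a few lines of standard-library Python. This is only a minimal sketch: `list_checkpoint_files` is a hypothetical helper name, not part of pyreft or pyvene, and the directory you pass in is whatever your run reported as its best checkpoint.

```python
from pathlib import Path


def list_checkpoint_files(checkpoint_dir):
    """Return (relative path, size in bytes) for every file under checkpoint_dir.

    Hypothetical helper for illustration; not part of pyreft or pyvene.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"not a directory: {root}")
    return sorted(
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


# Usage (the path is a placeholder for your own best checkpoint directory):
# for name, size in list_checkpoint_files("<your_best_checkpoint_dir>"):
#     print(f"{size:>12d}  {name}")
```

Pasting the resulting listing into the thread makes it easy to see whether the intervention and classifier-head files were written at all.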
Thanks @frankaging will try this. Meanwhile, I updated pyvene using: pip install git+https://github.com/stanfordnlp/pyvene.git But I am still getting this error:
Hey, could you try again by reinstalling? |
Hey @frankaging, I tried this, and it gives {'validation_matched_accuracy': 0.8193987521270562}. Do you happen to have an idea why the model did not load correctly? I have a few runs that look great in training but have similarly poor eval accuracy.
To fix this, i think it might be best just to ensure a manual loading after
This could introduce a small change in our |
Thanks @frankaging. I also believe there might be some issue with the evaluation loop. I printed out the dataset object. I believe the complete validation dataset 9.8k is going in for test set eval. But this should be 8.8k for test and 1k for validation. |
@m-dev12 Could you double check if you are printing the right object? We also print this in the log to make sure we do the split correctly in https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/train.py#L244
print("GLUE validation split (in training): ", len(in_train_eval_datasets))
print("GLUE validation split (testing): ", len(eval_datasets[train_dataset_str][test_split][0]))
Could you search for the log above? Based on your eval batch size which is 32, and the eval step number 276 printed from your image,
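The batch-size arithmetic referred to here can be checked directly. The sketch below assumes a single evaluation process and that the final partial batch is kept rather than dropped; `example_count_bounds` is a hypothetical helper name used only for illustration.

```python
def example_count_bounds(num_steps, batch_size):
    """Smallest and largest dataset sizes that need exactly num_steps batches.

    Assumes one process and that the last, possibly partial, batch is kept.
    """
    smallest = (num_steps - 1) * batch_size + 1
    largest = num_steps * batch_size
    return smallest, largest


low, high = example_count_bounds(num_steps=276, batch_size=32)
print(low, high)  # 8801 8832
```

Under those assumptions, 276 evaluation steps at batch size 32 correspond to roughly 8.8k examples, which matches the 8.8k test portion of the split rather than the full 9.8k validation set.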
This is pretty weird. I was trying commenting out |
Got it, But, I was confused because when I print this: |
Yup, I observed the same and tried with a couple different previous runs and found different results as well :/ |
Hey, thanks for the info. I think you are printing
On the eval loading: I think the easiest way is simply to manually load after
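A minimal sketch of such a manual reload is shown here, built only from the calls quoted earlier in this thread (`trainer.state.best_model_checkpoint` and `load_intervention(..., include_model=True)`). The function name `reload_best_checkpoint` is hypothetical, and the `intervenable_model` subdirectory is taken from the callback shown above; adjust it if your checkpoint layout differs.

```python
def reload_best_checkpoint(trainer):
    """Reload interventions and classifier head from the trainer's best checkpoint.

    Hypothetical helper: it mirrors the _load_best_model callback quoted in
    this thread and is meant to be called after trainer.train() returns.
    """
    best = trainer.state.best_model_checkpoint
    if not best:
        # No best checkpoint was recorded, so there is nothing to reload.
        return None
    path = f"{best}/intervenable_model"
    trainer.model.load_intervention(path, include_model=True)
    return path


# Usage, assuming `trainer` is the trainer built in train.py:
# trainer.train()
# print("reloaded from:", reload_best_checkpoint(trainer))
```

Printing the returned path is a cheap way to confirm that the manual load and the callback point at the same directory.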
Hi @frankaging, I added the line below to my train script so that the best model is manually loaded every time, but strangely I see the same phenomenon again. However, when I rerun the script with the trainer.train() line commented out and load the same path manually, I get an accuracy of 82.4%.
Hi @frankaging, Also, pyvene==0.1.2 still gives the use_cache error with RoBERTa.
@m-dev12 Thanks for the update. Did you install from the source? |
@frankaging Yes, I installed from the source, |
@m-dev12 Thanks! Please try again, and let me know if you are still hitting the issue. On the model loading issue, could you try adding this after reft_model.load_intervention(...) instead of
The trainer might create another reference model for the best model, which is not used.
Hi @frankaging I tried this but still get different accuracies with and without trainer.train(). |
@m-dev12 to manually load, might need to turn off this as well: https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/train.py#L383C12-L383C66 |
Hi @frankaging, I tried this but it did not change the results. Ideally I don't believe this should make a difference, since the manual loading happens after trainer.train(). I also added print statements inside the load_intervention function in pyvene to check whether the path loaded by the best-model callback differs from the path I load manually, but they were exactly the same.
@m-dev12 Thanks! Could you add print statements in
To simply print out the saved state dict, checking the parameter names, etc.? And also maybe eyeball the weight tensor itself? This issue is probably shared by all GLUE tasks, so it might be quicker to debug with a smaller dataset.
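Rather than eyeballing tensors, the two sets of weights can be compared name by name. The sketch below is plain Python and assumes each state dict has first been converted to nested lists, for example with `{k: v.tolist() for k, v in state_dict.items()}`. `diff_state_dicts` is a hypothetical helper, and the parameter names in the usage comment are illustrative only.

```python
import math


def _flatten(value):
    """Yield every number in a nested list/tuple structure as a float."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from _flatten(item)
    else:
        yield float(value)


def diff_state_dicts(first, second, rel_tol=1e-6, abs_tol=1e-8):
    """Return {parameter name: problem} for entries that do not match."""
    report = {}
    for name in sorted(set(first) | set(second)):
        if name not in first:
            report[name] = "missing in first"
        elif name not in second:
            report[name] = "missing in second"
        else:
            xs = list(_flatten(first[name]))
            ys = list(_flatten(second[name]))
            if len(xs) != len(ys):
                report[name] = "size mismatch"
            elif not all(
                math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
                for x, y in zip(xs, ys)
            ):
                report[name] = "values differ"
    return report


# Usage with illustrative names and values:
# after_training = {"rotate_layer.weight": [[1.0, 0.0]], "classifier.bias": [0.5]}
# after_loading = {"rotate_layer.weight": [[0.9, 0.1]], "classifier.bias": [0.5]}
# print(diff_state_dicts(after_training, after_loading))
```

An empty report means the weights in memory after training and the weights loaded from disk agree within tolerance; any entry names the parameter to inspect first.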
Yes, I have been trying to debug with the CoLA dataset and with just 5 epochs, I did print the parameter names and eyeballed one of the layer intervention tensors and classifier weight tensor after loading the model in the train script itself using:
They seemed fine. For the path, I did add the print statements in the line you mentioned above:
Hi @frankaging, pyvene==0.1.2 now works with RoBERTa, thanks! On the model loading issue, in a bid to debug this, I tried the following:
While loading from the best checkpoint path gives me a different (mostly much better) accuracy. Any ideas on how we can remedy this?
Hi @frankaging, Normal train.py with trainer.train() (note here I have tried different methods shared before to load the best checkpoint but it did not work out) train.py with trainer.train() commented and best model loaded from checkpoint. Attached is a debug notebook: |
[P0] Fixing LoReFT rotation layer hot loading problem (#114)
@m-dev12 Thanks for your inputs! I looked into this issue a bit more, and summarized my findings with the changes to fix this issue. See details listed here: #123. In short, it seems like when loading back the low-rank weight matrix of I am not sure why it is not doing the overwrite; but to fix this, we essentially want to reinit a new instance of rotate layer, and inject learned weights (i.e., selected column vectors) into it. |
Hi @frankaging, Thank you so much for the quick fix! I installed and tried out the new version (from your branch). The rotate layer weights now match, and the metrics come out much higher! However, I now observe a mismatch in the learned_source weight and bias between the two versions of the script.
From train script: accuracy is higher now, ~52%
From eval script: accuracy is lower now, ~50%
Attaching debug notebooks for your reference.
@m-dev12 hey, quick question: it seems like these two screenshots are logging for different directories? .*30579 the first one, and .*44666 - is this expected? Thanks. |
Hi @frankaging, yes, that is expected: the second screenshot is from the eval notebook with trainer.train() commented out, so it just creates a new log.
Hey @frankaging, do you still get the same result for this CoLA run after all the changes? I get around 56.9% accuracy for the same command now.
@m-dev12 Hey, with the current ToT, this is what i got with the same command (the postfix
Thanks @frankaging, I will try from the current ToT. I haven't tried other seeds yet; I will try those and check as well.
@m-dev12 Thanks. And I just open sourced our old source code: https://github.com/stanfordnlp/pyreft/tree/main/examples/loreft/original_code We built You can also optionally use the code in that folder to replicate the results, hopefully easier. For instance, it does not have that strange model loading issue.. |
I am using the configuration below but am unable to replicate the paper's results. Is there anything the authors did differently in the paper? I got {'validation_matthews_correlation': 0.40104291315665774} in the end instead of ~61%. Should the seed or any other configuration be updated? It would be great if the authors could share wandb logs for this as well. Thanks!
python train.py \
  -task glue \
  -data_dir ./data \
  -train_dataset cola \
  -eval_dataset cola \
  -model FacebookAI/roberta-base \
  -seed 42 \
  -l all \
  -r 1 \
  -p f3 \
  -e 60 \
  -lr 4e-4 \
  -type LoreftIntervention \
  -batch_size 32 \
  -output_dir ./output \
  -schedule linear \
  -wu 5e-3 \
  -logging_steps 20 \
  -allow_cls_grad \
  -metric_for_best_model matthews_correlation \
  -dropout 0.2 \
  -test_split validation
@frankaging
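Since the replies in this thread suggest trying additional seeds (46/43/44) because CoLA is unstable, here is a small sketch that builds the same command line for several seeds. The flags are copied from the command above; `command_for_seed` and `BASE_ARGS` are hypothetical names, and the script only prints the commands rather than running them.

```python
import shlex

# Flags copied from the command in this issue, with -seed left out so it
# can be appended per run.
BASE_ARGS = [
    "python", "train.py",
    "-task", "glue",
    "-data_dir", "./data",
    "-train_dataset", "cola",
    "-eval_dataset", "cola",
    "-model", "FacebookAI/roberta-base",
    "-l", "all",
    "-r", "1",
    "-p", "f3",
    "-e", "60",
    "-lr", "4e-4",
    "-type", "LoreftIntervention",
    "-batch_size", "32",
    "-output_dir", "./output",
    "-schedule", "linear",
    "-wu", "5e-3",
    "-logging_steps", "20",
    "-allow_cls_grad",
    "-metric_for_best_model", "matthews_correlation",
    "-dropout", "0.2",
    "-test_split", "validation",
]


def command_for_seed(seed):
    """Return the shell command for one seed as a single string."""
    return shlex.join(BASE_ARGS + ["-seed", str(seed)])


for seed in (42, 43, 44, 46):
    print(command_for_seed(seed))
```

Reporting the metric for each seed side by side gives a better picture of the variance than a single run with seed 42.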