-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarification on Evaluation Results for Llama Guard 3 #633
Comments
Thanks for flagging @sheli-kohan! @albertodepaola Can you help take a look please? |
hi, It seems that the notebook correctly calls build_default_prompt(AgentType.USER, create_conversation(....), LlamaGuardVersion.LLAMA_GUARD_3.name). However, it looks like the prompt match to llama guard 2 format, which might be one source of the issue. You can check it here: https://github.com/meta-llama/llama-recipes/blame/main/src/llama_recipes/inference/prompt_format_utils.py#L61 i've tried to use PROMPT_INSTRUCTION as in Llama guard 3 model card but reach only AUCPR of 45% |
@sheli-kohan Thank you very much for digging into the source and pointing this out! I will take a look.
Do you mean using the correct special tokens still doesn't give right result? |
I've updated the prompt format to be compatible with Llama Guard 3, instead of Llama Guard 2. I believe the other differences stem from the way The current use of I would appreciate your input on this. |
Hi, I encountered the same problem! Tried to reproduce the Llama Guard 3 evaluation results using the provided examples and got an AP of 24%. Any help on this would be highly appreciated. Thanks in advance, M |
@init27 when I used the "safe" and "unsafe" token probabilities + fixed the prompt to llama guard 3, I reached AUCPR of 50. still lower than presented 62 for llama guard 2. |
Thanks for raising this issue @sheli-kohan . Indeed it does seem to be an issue with the notebook. We will endeavor to work out what's going on here, and hope to update the notebook in due course. |
Hi @tryrobbo , thanks for your assistance. this is a PR with some bug fixes and additional category-wise evaluation code (zero shot only). A succces criteria to solve this bug are reaching these metrics:
|
Hi, any update on this? |
System Info
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.2
[pip3] torch==2.0.1+cu118
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.15.2+cu118
Information
🐛 Describe the bug
I am currently evaluating the Llama Guard 3 model using the evaluation notebook provided in the llama-recipes repo: Llama Guard Customization via Prompting and Fine-Tuning.
When I ran the evaluation on the ToxicChat dataset, I observed an average precision of 30.20%. This was with the following configurations: split="test".
However, I noticed a discrepancy when comparing this result to the Llama Guard Model Card, which reports an average precision of 62.6%. even though the metric is referred to Llama Guard , I believe this degradation means some error in this notebook.
In other matter, we are failing to replicate paper results also for open ai mod eval dataset by category (figure 2 in the paper). if you'll be able to share the library or code you used for this evaluation that will be very helpful
Could you please provide any insights or guidance on this difference in performance?
Thank you for your time and assistance.
Best regards,
Sheli Kohan
Error logs
average precision 30.02%
Expected behavior
average precision 62.6%
The text was updated successfully, but these errors were encountered: