Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification on Evaluation Results for Llama Guard 3 #633

Open
1 of 2 tasks
sheli-kohan opened this issue Aug 15, 2024 · 10 comments
Open
1 of 2 tasks

Clarification on Evaluation Results for Llama Guard 3 #633

sheli-kohan opened this issue Aug 15, 2024 · 10 comments
Assignees

Comments

@sheli-kohan
Copy link

sheli-kohan commented Aug 15, 2024

System Info

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.2
[pip3] torch==2.0.1+cu118
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.15.2+cu118

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

I am currently evaluating the Llama Guard 3 model using the evaluation notebook provided in the llama-recipes repo: Llama Guard Customization via Prompting and Fine-Tuning.

When I ran the evaluation on the ToxicChat dataset, I observed an average precision of 30.20%. This was with the following configurations: split="test".

However, I noticed a discrepancy when comparing this result to the Llama Guard Model Card, which reports an average precision of 62.6%. even though the metric is referred to Llama Guard , I believe this degradation means some error in this notebook.

In other matter, we are failing to replicate paper results also for open ai mod eval dataset by category (figure 2 in the paper). if you'll be able to share the library or code you used for this evaluation that will be very helpful

Could you please provide any insights or guidance on this difference in performance?

Thank you for your time and assistance.

Best regards,

Sheli Kohan

Error logs

average precision 30.02%

Expected behavior

average precision 62.6%

@init27
Copy link
Contributor

init27 commented Aug 15, 2024

Thanks for flagging @sheli-kohan!

@albertodepaola Can you help take a look please?

@sheli-kohan
Copy link
Author

sheli-kohan commented Aug 18, 2024

hi,

It seems that the notebook correctly calls build_default_prompt(AgentType.USER, create_conversation(....), LlamaGuardVersion.LLAMA_GUARD_3.name). However, it looks like the prompt match to llama guard 2 format, which might be one source of the issue. You can check it here: https://github.com/meta-llama/llama-recipes/blame/main/src/llama_recipes/inference/prompt_format_utils.py#L61

i've tried to use PROMPT_INSTRUCTION as in Llama guard 3 model card but reach only AUCPR of 45%

@init27
Copy link
Contributor

init27 commented Aug 18, 2024

@sheli-kohan Thank you very much for digging into the source and pointing this out! I will take a look.

i've tried to use PROMPT_INSTRUCTION as in Llama guard 3 model card but reach only AUCPR of 45%

Do you mean using the correct special tokens still doesn't give right result?

@sheli-kohan
Copy link
Author

sheli-kohan commented Aug 19, 2024

I've updated the prompt format to be compatible with Llama Guard 3, instead of Llama Guard 2.

I believe the other differences stem from the way parse_logprobs(prompts, type: Type) calculates the class probabilities. Currently, it uses prompt["logprobs"][0][1] for this calculation. However, I would expect the calculation to focus on the token 'safe', or if the token is unsafe, on the violative class number that appears after the token 'S'. or in case of binary classification on the 'unsafe' token.
but I didn't find references to how you calculate AUCPR on Llama guard paper.

The current use of prompt["logprobs"][0][1] would only partially apply if I were still using Llama Guard 2.

I would appreciate your input on this.
thanks,
Sheli

@MLRadfys
Copy link

Hi,

I encountered the same problem! Tried to reproduce the Llama Guard 3 evaluation results using the provided examples and got an AP of 24%.
For me it looks like the model output is wrong, when compared to the GT labels.

Any help on this would be highly appreciated.

Thanks in advance,

M

@HamidShojanazeri
Copy link
Contributor

cc @albertodepaola

@sheli-kohan
Copy link
Author

sheli-kohan commented Aug 26, 2024

@init27 when I used the "safe" and "unsafe" token probabilities + fixed the prompt to llama guard 3, I reached AUCPR of 50. still lower than presented 62 for llama guard 2.

@tryrobbo
Copy link
Contributor

Thanks for raising this issue @sheli-kohan . Indeed it does seem to be an issue with the notebook. We will endeavor to work out what's going on here, and hope to update the notebook in due course.
Can I ask that you open a PR with the modifications you have made which have made an improvement so far. I'll discuss with colleagues how we can work out the issue here.
Thanks again for looking at this.
@tryrobbo (author of the LlamaGuard notebook)

@sheli-kohan
Copy link
Author

sheli-kohan commented Aug 29, 2024

Hi @tryrobbo , thanks for your assistance.

this is a PR with some bug fixes and additional category-wise evaluation code (zero shot only).
keep in mind that the suggested code still does not reach the desired metrics.

A succces criteria to solve this bug are reaching these metrics:

  • At least 62 AUPRC on binary classification on Llama Guard 3
  • Category-wise evaluation that reach the zero-shot ones presented in Fig. 2 in Llama guard paper

Screenshot 2024-08-29 at 12 44 43

@mlaugharn
Copy link

Hi, any update on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants