Clarification on Evaluation Results for Llama Guard 3 #633

sheli-kohan · 2024-08-15T09:08:13Z

System Info

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.2
[pip3] torch==2.0.1+cu118
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.15.2+cu118

Information

The official example scripts
My own modified scripts

🐛 Describe the bug

I am currently evaluating the Llama Guard 3 model using the evaluation notebook provided in the llama-recipes repo: Llama Guard Customization via Prompting and Fine-Tuning.

When I ran the evaluation on the ToxicChat dataset, I observed an average precision of 30.20%. This was with the following configurations: split="test".

However, I noticed a discrepancy when comparing this result to the Llama Guard Model Card, which reports an average precision of 62.6%. even though the metric is referred to Llama Guard , I believe this degradation means some error in this notebook.

In other matter, we are failing to replicate paper results also for open ai mod eval dataset by category (figure 2 in the paper). if you'll be able to share the library or code you used for this evaluation that will be very helpful

Could you please provide any insights or guidance on this difference in performance?

Thank you for your time and assistance.

Best regards,

Sheli Kohan

Error logs

average precision 30.02%

Expected behavior

average precision 62.6%

init27 · 2024-08-15T20:09:14Z

Thanks for flagging @sheli-kohan!

@albertodepaola Can you help take a look please?

sheli-kohan · 2024-08-18T06:31:34Z

hi,

It seems that the notebook correctly calls build_default_prompt(AgentType.USER, create_conversation(....), LlamaGuardVersion.LLAMA_GUARD_3.name). However, it looks like the prompt match to llama guard 2 format, which might be one source of the issue. You can check it here: https://github.com/meta-llama/llama-recipes/blame/main/src/llama_recipes/inference/prompt_format_utils.py#L61

i've tried to use PROMPT_INSTRUCTION as in Llama guard 3 model card but reach only AUCPR of 45%

init27 · 2024-08-18T15:51:52Z

@sheli-kohan Thank you very much for digging into the source and pointing this out! I will take a look.

i've tried to use PROMPT_INSTRUCTION as in Llama guard 3 model card but reach only AUCPR of 45%

Do you mean using the correct special tokens still doesn't give right result?

sheli-kohan · 2024-08-19T06:21:45Z

I've updated the prompt format to be compatible with Llama Guard 3, instead of Llama Guard 2.

I believe the other differences stem from the way parse_logprobs(prompts, type: Type) calculates the class probabilities. Currently, it uses prompt["logprobs"][0][1] for this calculation. However, I would expect the calculation to focus on the token 'safe', or if the token is unsafe, on the violative class number that appears after the token 'S'. or in case of binary classification on the 'unsafe' token.
but I didn't find references to how you calculate AUCPR on Llama guard paper.

The current use of prompt["logprobs"][0][1] would only partially apply if I were still using Llama Guard 2.

I would appreciate your input on this.
thanks,
Sheli

MLRadfys · 2024-08-22T07:01:43Z

Hi,

I encountered the same problem! Tried to reproduce the Llama Guard 3 evaluation results using the provided examples and got an AP of 24%.
For me it looks like the model output is wrong, when compared to the GT labels.

Any help on this would be highly appreciated.

Thanks in advance,

M

HamidShojanazeri · 2024-08-23T17:47:12Z

cc @albertodepaola

sheli-kohan · 2024-08-26T09:32:51Z

@init27 when I used the "safe" and "unsafe" token probabilities + fixed the prompt to llama guard 3, I reached AUCPR of 50. still lower than presented 62 for llama guard 2.

tryrobbo · 2024-08-28T17:37:48Z

Thanks for raising this issue @sheli-kohan . Indeed it does seem to be an issue with the notebook. We will endeavor to work out what's going on here, and hope to update the notebook in due course.
Can I ask that you open a PR with the modifications you have made which have made an improvement so far. I'll discuss with colleagues how we can work out the issue here.
Thanks again for looking at this.
@tryrobbo (author of the LlamaGuard notebook)

sheli-kohan · 2024-08-29T09:53:13Z

Hi @tryrobbo , thanks for your assistance.

this is a PR with some bug fixes and additional category-wise evaluation code (zero shot only).
keep in mind that the suggested code still does not reach the desired metrics.

A succces criteria to solve this bug are reaching these metrics:

At least 62 AUPRC on binary classification on Llama Guard 3
Category-wise evaluation that reach the zero-shot ones presented in Fig. 2 in Llama guard paper

mlaugharn · 2024-10-23T22:24:22Z

Hi, any update on this?

init27 assigned init27 and albertodepaola Aug 15, 2024

sheli-kohan mentioned this issue Aug 29, 2024

fix that help reaching 50% over binary classification of toxic chat #653

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarification on Evaluation Results for Llama Guard 3 #633

Clarification on Evaluation Results for Llama Guard 3 #633

sheli-kohan commented Aug 15, 2024 •

edited

Loading

init27 commented Aug 15, 2024

sheli-kohan commented Aug 18, 2024 •

edited

Loading

init27 commented Aug 18, 2024

sheli-kohan commented Aug 19, 2024 •

edited

Loading

MLRadfys commented Aug 22, 2024

HamidShojanazeri commented Aug 23, 2024

sheli-kohan commented Aug 26, 2024 •

edited

Loading

tryrobbo commented Aug 28, 2024

sheli-kohan commented Aug 29, 2024 •

edited

Loading

mlaugharn commented Oct 23, 2024

Clarification on Evaluation Results for Llama Guard 3 #633

Clarification on Evaluation Results for Llama Guard 3 #633

Comments

sheli-kohan commented Aug 15, 2024 • edited Loading

System Info

Information

🐛 Describe the bug

Error logs

Expected behavior

init27 commented Aug 15, 2024

sheli-kohan commented Aug 18, 2024 • edited Loading

init27 commented Aug 18, 2024

sheli-kohan commented Aug 19, 2024 • edited Loading

MLRadfys commented Aug 22, 2024

HamidShojanazeri commented Aug 23, 2024

sheli-kohan commented Aug 26, 2024 • edited Loading

tryrobbo commented Aug 28, 2024

sheli-kohan commented Aug 29, 2024 • edited Loading

mlaugharn commented Oct 23, 2024

sheli-kohan commented Aug 15, 2024 •

edited

Loading

sheli-kohan commented Aug 18, 2024 •

edited

Loading

sheli-kohan commented Aug 19, 2024 •

edited

Loading

sheli-kohan commented Aug 26, 2024 •

edited

Loading

sheli-kohan commented Aug 29, 2024 •

edited

Loading