Cannot reproduce results with LLaMA-7B on OpenBookQA #24

Open
AkideLiu opened this issue Mar 25, 2024 · 9 comments
@AkideLiu

Full Cache Baseline
huggyllama/llama-7b
bash scripts/lm_eval/full_cache.sh openbookqa huggyllama/llama-7b llama

{
  "results": {
    "openbookqa": {
      "acc": 0.446,
      "acc_stderr": 0.022252153078595897,
      "acc_norm": 0.49,
      "acc_norm_stderr": 0.022378596989230774
    }
  },
  "versions": {
    "openbookqa": 0
  }
}

H2O
huggyllama/llama-7b
bash scripts/lm_eval/h2o.sh openbookqa huggyllama/llama-7b llama

{
  "results": {
    "openbookqa": {
      "acc": 0.412,
      "acc_stderr": 0.02203367799374087,
      "acc_norm": 0.462,
      "acc_norm_stderr": 0.022318338119870537
    }
  },
  "versions": {
    "openbookqa": 0
  }
}

As shown in the paper:

[screenshot: reported results from the paper]
@PiotrNawrot

+1, I'm getting exactly the same results

@Kyriection
Collaborator

Hi, the results in Table 6 are obtained from OPT-30B (as described in Section 5.3, Q3). For practical use, you can use the accumulated attention scores obtained from the whole prefilling stage. Since OpenBookQA only requires one decoding step, our current implementation is a simulation version that decomposes the original prefilling stage into two parts, and we treat the second part as a simulated decoding stage. In this simulation version, we only use the local statistics of the accumulated attention scores, which might be biased when the sequence length is extremely small.
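A minimal sketch of the simulation described in this comment, assuming PyTorch attention weights for the whole prompt; the function name, the 0.2 split ratio, and the tensor layout are illustrative assumptions, not the repo's code:

import torch

def simulate_decoding_split(attn_weights: torch.Tensor, prefill_ratio: float = 0.2):
    # attn_weights: (num_heads, seq_len, seq_len) causal attention weights of the full prompt.
    num_heads, seq_len, _ = attn_weights.shape
    split = max(1, int(seq_len * prefill_ratio))  # tokens treated as the "real" prefilling stage
    # Local statistic: accumulated attention each of the first `split` keys receives
    # from the prefill queries only.
    local_scores = attn_weights[:, :split, :split].sum(dim=1)  # (num_heads, split)
    # The remaining query positions are replayed one by one as simulated decoding steps,
    # with eviction decisions driven by `local_scores` (updated as decoding proceeds).
    simulated_decoding_steps = range(split, seq_len)
    return local_scores, simulated_decoding_steps

With a very short prompt, `split` covers only a handful of tokens, which is the bias the comment refers to.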

@PiotrNawrot

Hey @Kyriection - Thanks a lot for your response and the extra clarification. I'm having one more issue with reproducing Figure 8 from the latest version of the paper. I followed your setup exactly and haven't changed anything in the code - just calling the commands from the README. Below I paste an Excel screenshot with my results - in my attempt the downstream scores degrade much more quickly than reported in Figure 8. Do you have any idea why I cannot reproduce those results? I'm using huggyllama/llama-7b and equal heavy and recent ratios.

[screenshot: Excel sheet with the reproduced downstream scores]

@PiotrNawrot

Moreover, I'm also having issues with reproducing the Table 2 results from the paper for OPT-30B. Again, I believe that I'm strictly following the commands from the README. It would be of great help if you could comment on this - and congrats once again on the amazing work!

[screenshot: reproduced Table 2 results for OPT-30B]

@PiotrNawrot

"and for practical use, you can use the accumulation attention scores obtained from the whole prefilling stage"

Did you use scores from the prefilling stage for any of the downstream results reported in the paper, or did you use the simulated decoding? I believe that the implementation in the repo, at least for LM-Eval, follows the simulated decoding approach.

@Kyriection
Collaborator

Hi, we adjust the ratio of the prefilling stage that is used for the simulated decoding approach. Since some input samples only contain tens of tokens, using 20% of them for calculating the accumulated attention scores is highly biased. For simplicity, you can directly use the whole prefilling stage for calculating the scores, which is a reasonable and practical setting.
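A small self-contained check of that point (illustrative only, random attention weights, not the repo's code): on a prompt of a few dozen tokens, scores accumulated from only the first 20% of query positions cover almost none of the keys, while the whole prefilling stage covers them all.

import torch

torch.manual_seed(0)
num_heads, seq_len = 4, 24                          # "tens of tokens", as in the comment above
logits = torch.randn(num_heads, seq_len, seq_len)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
attn = torch.softmax(logits + causal_mask, dim=-1)  # causal attention weights

partial = attn[:, : int(0.2 * seq_len), :].sum(dim=1)  # scores from the first 20% of queries
full = attn.sum(dim=1)                                  # scores from the whole prefilling stage

# With only int(0.2 * 24) = 4 query positions, every key past position 3 gets a zero
# partial score, so a heavy-hitter ranking based on `partial` says nothing about them.
print((partial[0] > 0).sum().item(), "of", seq_len, "keys have a non-zero 20%-prefill score")
print((full[0] > 0).sum().item(), "of", seq_len, "keys have a non-zero whole-prefill score")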

@PiotrNawrot

Yes, I understand - is this logic implemented somewhere in the code?

Also, do you have any idea what could be the reason behind my suboptimal results?

@Kyriection
Collaborator

Hi, you can use the implementation here: https://github.com/FMInference/H2O/blob/main/h2o_hf/utils_lm_eval/modify_llama.py#L152. (I tested the current implementation with llama-1-7b on OpenBookQA; full-cache accuracy is 44.6 and H2O is 44.4.)

The previous simulation implementation directly used the first 20% of the prefilling stage for calculating the accumulated attention scores, which is biased when input samples contain only tens of tokens. This might be the reason behind the suboptimal results. By increasing the ratio of the prefilling stage used for calculating the accumulated attention scores, or by directly using the whole prefilling stage (global statistics), this bias can be largely mitigated, resulting in better performance.
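A hedged sketch of the policy this comment (and the linked modify_llama.py) describes: accumulate attention scores over the whole prefilling stage (global statistics), then keep the heavy-hitter keys plus a recent window and evict the rest. The function name, budget names, and shapes below are assumptions for illustration, not the repo's API.

import torch

def h2o_keep_mask(prefill_attn: torch.Tensor, heavy_budget: int, recent_budget: int) -> torch.Tensor:
    # prefill_attn: (num_heads, seq_len, seq_len) causal attention weights of the whole prompt.
    num_heads, seq_len, _ = prefill_attn.shape
    # Global statistic: attention mass each key receives over the entire prefilling stage.
    acc_scores = prefill_attn.sum(dim=1)                # (num_heads, seq_len)
    keep = torch.zeros(num_heads, seq_len, dtype=torch.bool)
    keep[:, max(0, seq_len - recent_budget):] = True    # always keep the recent window
    # Among the remaining keys, keep the ones with the largest accumulated scores.
    masked = acc_scores.masked_fill(keep, float("-inf"))
    heavy_idx = masked.topk(min(heavy_budget, seq_len), dim=-1).indices
    rows = torch.arange(num_heads).unsqueeze(1)
    keep[rows, heavy_idx] = True
    return keep                                         # True = key/value stays in the KV cache

This only shows the selection rule at the end of prefilling; in an actual cache the per-head budgets would be re-applied at every decoding step.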

@yinwangsong

"and for practical use, you can use the accumulation attention scores obtained from the whole prefilling stage"

Did you use scores from prefilling stage for any of the downstream results reported in the paper or did you use the simulated decoding? I believe that the implementation in the repo, at least for the LM-Eval, follows the simulated decoding approach.

Hello, did you find code of the ''simulated decoding'' in this repo? Thanks.
