Question about the reproduction of XSUM results #20

Open
SherrySwift opened this issue Feb 3, 2024 · 14 comments
@SherrySwift

Hi, thanks for your great work!
I have some questions about reproducing the XSUM results. I tried to run this command in the h2o_hf directory:

# Full baseline on XSUM
shots=5
GPU_ID=0
bash scripts/summarization/eval.sh xsum ${shots} full ${GPU_ID}

I tested on all 1000 samples in xsum_5shot.jsonl using the LLaMA-7B model, but the ROUGE-2 score I got is only about 9%.
According to Figure 4 in the paper, the full baseline for XSUM with LLaMA-7B is about 12%.
I can't figure out the reason for this. Could you please give me some advice?
Thanks a lot!
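
(For reference, a minimal sketch of how a ROUGE-2 score like the one above could be computed over the generated summaries with the `rouge` package. The output path matches the `full` branch of eval.sh quoted later in this thread, but the record field names are assumptions, not necessarily what run_summarization.py writes.)

```python
# Hypothetical post-hoc scoring sketch; the field names ("generated",
# "reference") are illustrative assumptions about the jsonl records.
import json
from rouge import Rouge

hyps, refs = [], []
with open("summary_results/xsum_5shot_full.jsonl") as f:
    for line in f:
        record = json.loads(line)
        hyps.append(record["generated"])
        refs.append(record["reference"])

scores = Rouge().get_scores(hyps, refs, avg=True)
print("rouge-2 f:", scores["rouge-2"]["f"])  # ~0.09 here vs. ~0.12 reported in the paper
```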

@Kyriection
Collaborator

Hi, thanks for your question. Did you use Llama-2-7b? The model used in the paper is "huggyllama/llama-7b".

@SherrySwift
Author

Hi, I used huggyllama/llama-7b, but I encountered the following error when I tried to run scripts/summarization/eval.sh:

Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 138, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

When I load other models such as Llama-2-7b, this error does not occur.
Do you have any idea about it? Thanks a lot!

@Kyriection
Collaborator

Hi, could you provide the exact command and the transformers version you used? I couldn't reproduce the issue on my side when using huggyllama/llama-7b.

@SherrySwift
Author

Thanks for your reply.
Here is the command:
bash scripts/summarization/eval.sh xsum 5 full 0

The contents of scripts/summarization/eval.sh are:

task=$1
shots=$2
method=$3
GPU=$4
HH_SIZE=$5
RECENT_SIZE=$6

if [[ ${method} == 'h2o' ]]; then
    CUDA_VISIBLE_DEVICES=${GPU} python -u run_summarization.py \
        --input_path data/summarization_data/${task}_${shots}shot.jsonl \
        --output_path summary_results/${task}_${shots}shot_h2o_hh${HH_SIZE}_local${RECENT_SIZE}.jsonl \
        --model_name huggyllama/llama-7b \
        --hh_size ${HH_SIZE} \
        --recent_size ${RECENT_SIZE} \
        --cache_dir ../../llm_weights \
        --enable_h2o_cache
elif [[ ${method} == 'full' ]]; then
    CUDA_VISIBLE_DEVICES=${GPU} python -u run_summarization.py \
        --input_path data/summarization_data/${task}_${shots}shot.jsonl \
        --output_path summary_results/${task}_${shots}shot_full.jsonl \
        --model_name huggyllama/llama-7b
else
    echo 'unknown argument for method'
fi

As for the transformers version, I tried both 4.33.0 and 4.35.0 and encountered the same problem.

@SherrySwift
Author

By the way, the above error also occurs in the middle of evaluation when I use other models (such as Llama-2-7b).
Here is part of the log:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
rouge-1: 0.310912, rouge-2: 0.118365, rouge-l: 0.260621
 80%|███████████████████████████████████████████████████████████████████▋                 | 796/1000 [1:12:14<18:08,  5.33s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
rouge-1: 0.310952, rouge-2: 0.118289, rouge-l: 0.260724
 80%|███████████████████████████████████████████████████████████████████▋                 | 797/1000 [1:12:19<18:08,  5.36s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 80%|███████████████████████████████████████████████████████████████████▋                 | 797/1000 [1:12:23<18:26,  5.45s/it]
Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 137, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

While searching for solutions, I found this issue. Is it possible that this error is related to the beam sampling used in the generation process?

@Kyriection
Collaborator

Hi, I tested the samples from 795 to 800, but didn't encounter the same error.
[screenshot: test run on samples 795 to 800]

Based on your error message, could you try specifying "pad_token_id=tokenizer.eos_token_id" in the model.generate() call?
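
(A minimal sketch of that suggestion, assuming a generate call roughly like the one in run_summarization.py; the prompt handling and max_new_tokens value are illustrative, not the repository's exact code.)

```python
# Illustrative sketch only; run_summarization.py's actual evaluation loop differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # one 5-shot XSUM prompt from xsum_5shot.jsonl
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_sequences = model.generate(
    **inputs,                             # includes attention_mask, silencing the warning
    max_new_tokens=64,                    # assumed generation length
    do_sample=True,
    temperature=0.3,
    top_p=1.0,                            # sampling settings mentioned later in this thread
    pad_token_id=tokenizer.eos_token_id,  # the suggested fix
)
print(tokenizer.decode(
    output_sequences[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```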

@SherrySwift
Author

Thanks for your patience, but specifying "tokenizer.pad_token_id=tokenizer.eos_token_id" still does not solve the problem.
Since I couldn't come up with a better solution, I just skipped sample 797 in the end.

Also, I noticed that you set 'temperature=0.3, top_p=1, do_sample=True' in the model.generate() call in h2o_hf/run_summarization.py. Is there any particular reason for these parameter settings? Just wondering about it.

@Kyriection
Collaborator

Hi, I followed the original HELM setup for these parameters. Generally, a larger temperature brings more diversity and makes the output less deterministic.

@SherrySwift
Author

Sorry to bother you again.
In the h2o_hf/data directory, there are several different jsonl files for the XSUM dataset.
To reproduce the result in Figure 4 of the paper (i.e., ROUGE-2 of 12 for llama-7b), which jsonl file should I use?
I noticed that the contents of xsum_5shot.jsonl and xsum.jsonl are quite different, so I got a little bit confused about that.

@ThisisBillhe

ThisisBillhe commented Mar 19, 2024

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?
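
(Not an answer from the authors, but for illustration: if a percentage budget were converted into fixed hh_size/recent_size values per prompt, it might look like the hypothetical helper below; whether the paper derives the sizes per sample or from an average prompt length is exactly the open question here.)

```python
# Hypothetical helper, NOT the authors' confirmed procedure: turn a KV cache
# budget percentage into the --hh_size / --recent_size arguments of eval.sh,
# splitting the budget evenly between heavy-hitter and recent tokens.
def budget_to_sizes(prompt_len: int, budget_pct: float, hh_fraction: float = 0.5):
    total = max(2, int(prompt_len * budget_pct / 100))
    hh_size = int(total * hh_fraction)
    recent_size = total - hh_size
    return hh_size, recent_size

# e.g. a 20% budget on a 2000-token prompt -> hh_size=200, recent_size=200
print(budget_to_sizes(2000, 20.0))
```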

@slatter666

Hi, I used huggyllama/llama-7b, but I encountered the following error when I tried to run scripts/summarization/eval.sh:

Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 138, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

When I load other models such as Llama-2-7b, this error does not occur. Do you have any idea about it? Thanks a lot!

I use Llama-2-7b but I still get this error; I use float16. I also checked this piece of data: the prompt has 6768 tokens, so I guess the prompt length is too long and the model collapses.

@zwxandy

zwxandy commented Apr 10, 2024

Thanks for your patience, but specifying "tokenizer.pad_token_id=tokenizer.eos_token_id" still does not solve the problem. Since I couldn't come up with a better solution, I just skipped sample 797 in the end.

Also, I noticed that you set 'temperature=0.3, top_p=1, do_sample=True' in the model.generate() call in h2o_hf/run_summarization.py. Is there any particular reason for these parameter settings? Just wondering about it.

Hi, I have also hit the same bug when the generation process reaches 797/1000:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

So I tried to test sample 797 by modifying line #117 to

requests = requests[795:]

As expected, the bug occurs again at 2/205.
So I went to check the dataset, i.e., xsum_5shot.jsonl, and found that this sample is marked as

Tokenization is skipped for long lines for performance reasons. This can be configured via editor.maxTokenizationLineLength.

Obviously, the reason for the model collapse is that the prompt length is too long.
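
(If the over-long prompt really is the cause, a minimal guard along these lines could avoid the crash. This is an assumed workaround, not code from the repository; llama-7b's context window is 2048 tokens and Llama-2-7b's is 4096, so the 6768-token prompt mentioned above exceeds both.)

```python
# Hypothetical workaround: skip samples whose prompt exceeds the model's
# context window, since over-long inputs can yield nan/inf logits that
# crash torch.multinomial during sampling.
def safe_generate(model, tokenizer, prompt, max_context=2048, **gen_kwargs):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    if input_ids.shape[1] >= max_context:
        print(f"skipping sample: {input_ids.shape[1]} tokens >= {max_context}")
        return None  # caller can record an empty summary for this sample
    return model.generate(
        input_ids,
        pad_token_id=tokenizer.eos_token_id,
        **gen_kwargs,
    )
```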

@zwxandy

zwxandy commented Apr 11, 2024

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?

Hi, thanks for your great work! I have some questions about reproducing the XSUM results. I tried to run this command in the h2o_hf directory:

# Full baseline on XSUM
shots=5
GPU_ID=0
bash scripts/summarization/eval.sh xsum ${shots} full ${GPU_ID}

I tested on all 1000 samples in xsum_5shot.jsonl using the LLaMA-7B model, but the ROUGE-2 score I got is only about 9%. According to Figure 4 in the paper, the full baseline for XSUM with LLaMA-7B is about 12%. I can't figure out the reason for this. Could you please give me some advice? Thanks a lot!

Hi, I also used huggyllama/llama-7b to run the XSUM task, and reached the same conclusion as yours:

rouge-1: 0.267594, rouge-2: 0.098886, rouge-l: 0.222643

Do you have any ideas about this?

@zwxandy

zwxandy commented Apr 17, 2024

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?

Hi, I would also like to know the answer to this question. Do you have any ideas?
