Question about the reproduction of XSUM results #20

Open
SherrySwift opened this issue Feb 3, 2024 · 14 comments
@SherrySwift

Hi, thanks for your great work!
I have some questions about reproducing the XSUM results. I tried to run this command in the h2o_hf directory:

# Full baseline on XSUM
shots=5
GPU_ID=0
bash scripts/summarization/eval.sh xsum ${shots} full ${GPU_ID}

I tested on all 1000 samples in xsum_5shot.jsonl using the LLaMA-7B model, but the ROUGE-2 score I got is only about 9%.
According to Figure 4 in the paper, the full baseline for XSUM with LLaMA-7B is about 12%.
I can't figure out the reason for this. Could you please give me some advice?
Thanks a lot!
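
(For reference, a minimal sketch of how a ROUGE-2 score like the one above could be computed over the generated summaries with the `rouge` package. The output path matches the `full` branch of eval.sh quoted later in this thread, but the record field names are assumptions, not necessarily what run_summarization.py writes.)

```python
# Hypothetical post-hoc scoring sketch; the field names ("generated",
# "reference") are illustrative assumptions about the jsonl records.
import json
from rouge import Rouge

hyps, refs = [], []
with open("summary_results/xsum_5shot_full.jsonl") as f:
    for line in f:
        record = json.loads(line)
        hyps.append(record["generated"])
        refs.append(record["reference"])

scores = Rouge().get_scores(hyps, refs, avg=True)
print("rouge-2 f:", scores["rouge-2"]["f"])  # ~0.09 here vs. ~0.12 reported in the paper
```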

@Kyriection
Collaborator

Hi, thanks for your question. Did you use Llama-2-7b? The model used in the paper is "huggyllama/llama-7b".

@SherrySwift
Author

Hi, I used huggyllama/llama-7b, but I encountered the following error when I tried to run scripts/summarization/eval.sh:

Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 138, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

When I load other models such as Llama-2-7b, this error does not occur.
Do you have any idea about it? Thanks a lot!

@Kyriection
Collaborator

Hi, could you provide the exact command and the transformers version you used? I couldn't reproduce the issue on my side when using huggyllama/llama-7b.

@SherrySwift
Author

Thanks for your reply.
Here is the command:
bash scripts/summarization/eval.sh xsum 5 full 0

The contents of scripts/summarization/eval.sh are:

task=$1
shots=$2
method=$3
GPU=$4
HH_SIZE=$5
RECENT_SIZE=$6

if [[ ${method} == 'h2o' ]]; then
    CUDA_VISIBLE_DEVICES=${GPU} python -u run_summarization.py \
        --input_path data/summarization_data/${task}_${shots}shot.jsonl \
        --output_path summary_results/${task}_${shots}shot_h2o_hh${HH_SIZE}_local${RECENT_SIZE}.jsonl \
        --model_name huggyllama/llama-7b \
        --hh_size ${HH_SIZE} \
        --recent_size ${RECENT_SIZE} \
        --cache_dir ../../llm_weights \
        --enable_h2o_cache
elif [[ ${method} == 'full' ]]; then
    CUDA_VISIBLE_DEVICES=${GPU} python -u run_summarization.py \
        --input_path data/summarization_data/${task}_${shots}shot.jsonl \
        --output_path summary_results/${task}_${shots}shot_full.jsonl \
        --model_name huggyllama/llama-7b
else
    echo 'unknown argument for method'
fi

As for the transformers version, I tried both 4.33.0 and 4.35.0 and encountered the same problem.

@SherrySwift
Author

By the way, the above error also occurs in the middle of evaluation when I use other models (such as Llama-2-7b).
Here is part of the log:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
rouge-1: 0.310912, rouge-2: 0.118365, rouge-l: 0.260621
 80%|███████████████████████████████████████████████████████████████████▋                 | 796/1000 [1:12:14<18:08,  5.33s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
rouge-1: 0.310952, rouge-2: 0.118289, rouge-l: 0.260724
 80%|███████████████████████████████████████████████████████████████████▋                 | 797/1000 [1:12:19<18:08,  5.36s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 80%|███████████████████████████████████████████████████████████████████▋                 | 797/1000 [1:12:23<18:26,  5.45s/it]
Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 137, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

While searching for solutions, I found this issue. Is it possible that this error is related to the beam sampling used in the generation process?

@Kyriection
Collaborator

Hi, I tested the samples from 795 to 800, but didn't encounter the same error.
[screenshot: test run on samples 795 to 800]

Based on your error message, could you try specifying "pad_token_id=tokenizer.eos_token_id" in the model.generate() call?
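
(A minimal sketch of that suggestion, assuming a generate call roughly like the one in run_summarization.py; the prompt handling and max_new_tokens value are illustrative, not the repository's exact code.)

```python
# Illustrative sketch only; run_summarization.py's actual evaluation loop differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # one 5-shot XSUM prompt from xsum_5shot.jsonl
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_sequences = model.generate(
    **inputs,                             # includes attention_mask, silencing the warning
    max_new_tokens=64,                    # assumed generation length
    do_sample=True,
    temperature=0.3,
    top_p=1.0,                            # sampling settings mentioned later in this thread
    pad_token_id=tokenizer.eos_token_id,  # the suggested fix
)
print(tokenizer.decode(
    output_sequences[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```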

@SherrySwift
Author

Thanks for your patience, but specifying "tokenizer.pad_token_id=tokenizer.eos_token_id" still does not solve the problem.
Since I couldn't come up with a better solution, I just skipped sample 797 in the end.

Also, I noticed that you set 'temperature=0.3, top_p=1, do_sample=True' in the model.generate() call in h2o_hf/run_summarization.py. Is there any particular reason for these parameter settings? Just wondering about it.

@Kyriection
Collaborator

Hi, I followed the original HELM setup for these parameters. Generally, a larger temperature brings more diversity and makes the output less deterministic.

@SherrySwift
Author

Sorry to bother you again.
In the h2o_hf/data directory, there are several different jsonl files for the XSUM dataset.
To reproduce the result in Figure 4 of the paper (i.e., ROUGE-2 of 12 for llama-7b), which jsonl file should I use?
I noticed that the contents of xsum_5shot.jsonl and xsum.jsonl are quite different, so I got a little bit confused about that.

@ThisisBillhe

ThisisBillhe commented Mar 19, 2024

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?
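
(Not an answer from the authors, but for illustration: if a percentage budget were converted into fixed hh_size/recent_size values per prompt, it might look like the hypothetical helper below; whether the paper derives the sizes per sample or from an average prompt length is exactly the open question here.)

```python
# Hypothetical helper, NOT the authors' confirmed procedure: turn a KV cache
# budget percentage into the --hh_size / --recent_size arguments of eval.sh,
# splitting the budget evenly between heavy-hitter and recent tokens.
def budget_to_sizes(prompt_len: int, budget_pct: float, hh_fraction: float = 0.5):
    total = max(2, int(prompt_len * budget_pct / 100))
    hh_size = int(total * hh_fraction)
    recent_size = total - hh_size
    return hh_size, recent_size

# e.g. a 20% budget on a 2000-token prompt -> hh_size=200, recent_size=200
print(budget_to_sizes(2000, 20.0))
```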

@slatter666

Hi, I used huggyllama/llama-7b, but I encountered the following error when I tried to run scripts/summarization/eval.sh:

Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 138, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

When I load other models such as Llama-2-7b, this error does not occur. Do you have any idea about it? Thanks a lot!

I use Llama-2-7b but I still get this error; I use float16. I also checked this piece of data: the prompt has 6768 tokens, so I guess the prompt length is too long and the model collapses.

@zwxandy

zwxandy commented Apr 10, 2024

Thanks for your patience, but specifying "tokenizer.pad_token_id=tokenizer.eos_token_id" still does not solve the problem. Since I couldn't come up with a better solution, I just skipped sample 797 in the end.

Also, I noticed that you set 'temperature=0.3, top_p=1, do_sample=True' in the model.generate() call in h2o_hf/run_summarization.py. Is there any particular reason for these parameter settings? Just wondering about it.

Hi, I have also hit the same bug when the generation process reaches 797/1000:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

So I tried to test sample 797 by modifying line #117 to

requests = requests[795:]

As expected, the bug occurs again at 2/205.
So I went to check the dataset, i.e., xsum_5shot.jsonl, and found that this sample is marked as

Tokenization is skipped for long lines for performance reasons. This can be configured via editor.maxTokenizationLineLength.

Obviously, the reason for the model collapse is that the prompt length is too long.
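
(If the over-long prompt really is the cause, a minimal guard along these lines could avoid the crash. This is an assumed workaround, not code from the repository; llama-7b's context window is 2048 tokens and Llama-2-7b's is 4096, so the 6768-token prompt mentioned above exceeds both.)

```python
# Hypothetical workaround: skip samples whose prompt exceeds the model's
# context window, since over-long inputs can yield nan/inf logits that
# crash torch.multinomial during sampling.
def safe_generate(model, tokenizer, prompt, max_context=2048, **gen_kwargs):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    if input_ids.shape[1] >= max_context:
        print(f"skipping sample: {input_ids.shape[1]} tokens >= {max_context}")
        return None  # caller can record an empty summary for this sample
    return model.generate(
        input_ids,
        pad_token_id=tokenizer.eos_token_id,
        **gen_kwargs,
    )
```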

@zwxandy

zwxandy commented Apr 11, 2024

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?

Hi, thanks for your great work! I have some questions about reproducing the XSUM results. I tried to run this command in the h2o_hf directory:

# Full baseline on XSUM
shots=5
GPU_ID=0
bash scripts/summarization/eval.sh xsum ${shots} full ${GPU_ID}

I tested on all 1000 samples in xsum_5shot.jsonl using the LLaMA-7B model, but the ROUGE-2 score I got is only about 9%. According to Figure 4 in the paper, the full baseline for XSUM with LLaMA-7B is about 12%. I can't figure out the reason for this. Could you please give me some advice? Thanks a lot!

Hi, I also used huggyllama/llama-7b to run the XSUM task, and reached the same conclusion as yours:

rouge-1: 0.267594, rouge-2: 0.098886, rouge-l: 0.222643

Do you have any ideas about this?

@zwxandy

zwxandy commented Apr 17, 2024

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?

Hi, I would also like to know the answer to this question. Do you have any ideas?
