Failed to collect training data for sparsity predictor #11

PaulWang0513 · 2023-10-07T06:58:26Z

Hi,
I was trying to train the sparsity predictor by following the instructions in the README file, but encountered some error when running ./run_infer_opt_175b_collect_sp_data.sh.

Following is the printed message:

start running ./c4_train/c4_train.jsonl
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
<get_request_processor>: None
<RequestProcessor> dir: ./c4_train
<RequestProcessor> file: c4_train.jsonl
<RequestProcessor>, output file: ./c4_train/output_c4_train.jsonl
input seq length: 2048
=======Initialize Dist Inference(Sync).=======
=======Gpipe use FP16=======
loading layer 0
loading layer 1
loading layer 2
loading layer 3
loading layer 4
loading layer 5
loading layer 6
loading layer 7
loading layer 8
loading layer 9
loading layer 10
loading layer 11
temperature is 0, should be deterministic (greedy).
temperature is 0, should be deterministic (greedy).
<inference_batch> rank-<0> Enter!
<inference_batch> rank-<0> after first barrier!
<inference_batch> rank-<0> after first _init_cached_seqs_and_attentions!
<inference_batch> rank-<0> after second barrier!
<inference_batch> rank-<0> enter computation!
Compute prompt seq< 0 >.
_copy_initial_token_emb
_copy_initial_token_emb 0/1
Traceback (most recent call last):
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 111, in <module>
    main()
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 97, in main
    distributed_inference_mask_iter(args, pipe, device, request_processor)
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/utils/dist_inference_utils.py", line 58, in distributed_inference_mask_iter
    current_iter_time = pipeline.inference_batch(input_ids, output_ids_list, attention_mask=attention_mask)
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 822, in inference_batch
    self.forward_seq_pipeline_stage(
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 652, in forward_seq_pipeline_stage
    self._forward_compute_prompt_seq(
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 470, in _forward_compute_prompt_seq
    self._generate_echo_token_logprobs(index, indices=seq)
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 485, in _generate_echo_token_logprobs
    self.ret_tokens[
RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1.  Target sizes: [1, 2048].  Tensor sizes: [2047]

I used WSL2 with Ubuntu 20.04 and CUDA 11.3.
Due to storage limitations, I changed some setting to use a smaller pretrained model (opt-125m).
In DejaVu/Decentralized_FM_alpha/c4_train/get_data.py:

data = {
    "best_of": 1,
    "echo": True,
    "logprobs": 1,
    "max_tokens": 0,
    "model": "opt-125m",
    "n": 1,
    "prompt": doc["text"],
    "request_type": "language-model-inference",
    "stop": None,
    "temperature": 0,
    "top_p": 1,
}

In DejaVu/Decentralized_FM_alpha/convert_opt_checkpoint.py:

parser.add_argument('--model-name', type=str, default='facebook/opt-125m', 
                        help='model-name')
parser.add_argument('--save-path', type=str, default='/mnt/c/Users/Paul/Documents/GitHub/DejaVu/pretrained_models/opt-125m/checkpoint', 
                        help='model-name')

In DejaVu/Decentralized_FM_alpha/run_infer_opt_175b_collect_sp_data.sh:

ARGS="--model-name /mnt/c/Users/Paul/Documents/GitHub/DejaVu/pretrained_models/opt-125m/checkpoint \
--model-type opt-save \
--seed 42 \
--fp16 \
--num-layers 12 \
--max-layers 12 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 1 \
--world-size 1 --pipeline-group-size 1 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    & \
wait)

The att_sp_x_0.mmap ~ att_sp_x_11.mmap, mlp_sp_x_0.mmap ~ mlp_sp_x_11.mmap, mlp_label_0.mmap ~ mlp_label_11.mmap, and score_norm_0 ~ score_norm_11.mmap are created successfully before the error occured.
Can you think of any possible reason or solution?
Thanks!

The text was updated successfully, but these errors were encountered:

simonsimanta · 2023-11-23T11:41:12Z

Hello Authors,

Encountered the above RuntimeError related to tensor size mismatch while trying to train the sparsity predictor as per the instructions in the README. The issue arises when running ./run_infer_opt_175b_collect_sp_data.sh. All relevant files are created successfully until the script hits a tensor size error.

RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1. Target sizes: [1, 2048]. Tensor sizes: [2047]

Can you please provide additional information on how can we resolve the issue ?

Thank You .

zhaoyang-star · 2024-01-31T07:50:30Z

I met the same error when tp=1. Looking for the author to fix it.
A tricky way to solve it is using tp=2. I have verified it and the error disappears.

ariellubonja · 2024-04-23T17:37:56Z

Hi everyone! I am facing this issue right now, after trying to use facebook/opt-125m model instead. Anyone found a fix?

@zhaoyang-star Where is this parameter tp=2?

2455DD · 2024-05-20T08:46:13Z

Same things happens to me too when I'm trying to use facebook/opt-1.7b model in colab notebook. Waiting for any possible solution.

I met the same error when tp=1. Looking for the author to fix it. A tricky way to solve it is using tp=2. I have verified it and the error disappears.

@zhaoyang-star Is tp a library or a parameter? Does it mean the top-p in c4_train/getdata.py ? Thanks for your help!

…t-1.3b(FMInference#11)

ariellubonja · 2024-05-21T10:47:02Z

@2455DD I reached out to the author earlier this month, and she replied with this "I believe there is a temporary work around suggested on GitHub by setting the top_p = 2 in get_data file."

So, you're correct, it is top-p in c4_train/getdata.py

mailonghua · 2024-05-23T08:25:57Z

I followed the instructions and modified the file DejaVu/Decentralized_FM_alpha/c4_train/getdata.py in the following way:

data = {
"best_of": 1,
"echo": True,
"logprobs": 1,
"max_tokens": 0,
"model": "opt-1.3b",
"n": 1,
"prompt": doc["text"],
"request_type": "language-model-inference",
"stop": None,
"temperature": 0,
"top_p": 2,
}
After making these changes, I deleted the local Hugging Face cache of C4 and the file c4_train.jsonl. I then re-executed getdata.py and ran the script run_infer_opt_175b_collect_sp_data.sh. However, I am still encountering a size mismatch error:
xxx
RuntimeError: The size of tensor a (2047) must match the size of tensor b (2048) at non-singleton dimension 3

mailonghua · 2024-05-23T08:53:31Z

@2455DD I reached out to the author earlier this month, and she replied with this "I believe there is a temporary work around suggested on GitHub by setting the top_p = 2 in get_data file."

So, you're correct, it is top-p in c4_train/getdata.py

Same things happens to me too when I'm trying to use facebook/opt-1.7b model in colab notebook. Waiting for any possible solution.

I met the same error when tp=1. Looking for the author to fix it. A tricky way to solve it is using tp=2. I have verified it and the error disappears.

@zhaoyang-star Is tp a library or a parameter? Does it mean the top-p in c4_train/getdata.py ? Thanks for your help!

I used the run_infer_opt_1_3_b_collect_sp_data.sh script from your forked repository, but I encountered the same size mismatch error. Did you run the script successfully?

Philippe-Guyard · 2024-05-31T15:28:54Z

@2455DD I reached out to the author earlier this month, and she replied with this "I believe there is a temporary work around suggested on GitHub by setting the top_p = 2 in get_data file."
So, you're correct, it is top-p in c4_train/getdata.py

Same things happens to me too when I'm trying to use facebook/opt-1.7b model in colab notebook. Waiting for any possible solution.

I met the same error when tp=1. Looking for the author to fix it. A tricky way to solve it is using tp=2. I have verified it and the error disappears.

@zhaoyang-star Is tp a library or a parameter? Does it mean the top-p in c4_train/getdata.py ? Thanks for your help!

I used the run_infer_opt_1_3_b_collect_sp_data.sh script from your forked repository, but I encountered the same size mismatch error. Did you run the script successfully?

Can confirm I am having a similar issue after setting "model": "opt-125m", "top_p":2. Only difference is I get

RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1.  Target sizes: [1, 2048].  Tensor sizes: [2047]

(the difference being that the mismatch happens at dimension 1 and not 3).

zhaoningyuan · 2024-10-15T15:30:30Z

it works for me if i set pipeline-group-size=2

ARGS="--model-name /data/xxx/DejaVu/Decentralized_FM_alpha/pretrained_models \
--model-type opt-save \
--seed 42 \
--fp16 \
--num-layers 16 \
--max-layers 32 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 16 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
    & \
wait)

Hello Authors,

Encountered the above RuntimeError related to tensor size mismatch while trying to train the sparsity predictor as per the instructions in the README. The issue arises when running ./run_infer_opt_175b_collect_sp_data.sh. All relevant files are created successfully until the script hits a tensor size error.

RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1. Target sizes: [1, 2048]. Tensor sizes: [2047]

Can you please provide additional information on how can we resolve the issue ?

Thank You .

2455DD added a commit to 2455DD/DejaVu that referenced this issue May 20, 2024

possible fix for tensor size mismatch during predictor training in op…

8077e56

…t-1.3b(FMInference#11)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failed to collect training data for sparsity predictor #11

Failed to collect training data for sparsity predictor #11

PaulWang0513 commented Oct 7, 2023

simonsimanta commented Nov 23, 2023

zhaoyang-star commented Jan 31, 2024 •

edited

Loading

ariellubonja commented Apr 23, 2024

2455DD commented May 20, 2024 •

edited

Loading

ariellubonja commented May 21, 2024

mailonghua commented May 23, 2024

mailonghua commented May 23, 2024

Philippe-Guyard commented May 31, 2024 •

edited

Loading

zhaoningyuan commented Oct 15, 2024

Failed to collect training data for sparsity predictor #11

Failed to collect training data for sparsity predictor #11

Comments

PaulWang0513 commented Oct 7, 2023

simonsimanta commented Nov 23, 2023

zhaoyang-star commented Jan 31, 2024 • edited Loading

ariellubonja commented Apr 23, 2024

2455DD commented May 20, 2024 • edited Loading

ariellubonja commented May 21, 2024

mailonghua commented May 23, 2024

mailonghua commented May 23, 2024

Philippe-Guyard commented May 31, 2024 • edited Loading

zhaoningyuan commented Oct 15, 2024

zhaoyang-star commented Jan 31, 2024 •

edited

Loading

2455DD commented May 20, 2024 •

edited

Loading

Philippe-Guyard commented May 31, 2024 •

edited

Loading