
Failed to collect training data for sparsity predictor #11

Open
PaulWang0513 opened this issue Oct 7, 2023 · 9 comments

@PaulWang0513

Hi,
I was trying to train the sparsity predictor by following the instructions in the README file, but encountered an error when running ./run_infer_opt_175b_collect_sp_data.sh.

Here is the printed message:

start running ./c4_train/c4_train.jsonl
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
<get_request_processor>: None
<RequestProcessor> dir: ./c4_train
<RequestProcessor> file: c4_train.jsonl
<RequestProcessor>, output file: ./c4_train/output_c4_train.jsonl
input seq length: 2048
=======Initialize Dist Inference(Sync).=======
=======Gpipe use FP16=======
loading layer 0
loading layer 1
loading layer 2
loading layer 3
loading layer 4
loading layer 5
loading layer 6
loading layer 7
loading layer 8
loading layer 9
loading layer 10
loading layer 11
temperature is 0, should be deterministic (greedy).
temperature is 0, should be deterministic (greedy).
<inference_batch> rank-<0> Enter!
<inference_batch> rank-<0> after first barrier!
<inference_batch> rank-<0> after first _init_cached_seqs_and_attentions!
<inference_batch> rank-<0> after second barrier!
<inference_batch> rank-<0> enter computation!
Compute prompt seq< 0 >.
_copy_initial_token_emb
_copy_initial_token_emb 0/1
Traceback (most recent call last):
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 111, in <module>
    main()
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 97, in main
    distributed_inference_mask_iter(args, pipe, device, request_processor)
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/utils/dist_inference_utils.py", line 58, in distributed_inference_mask_iter
    current_iter_time = pipeline.inference_batch(input_ids, output_ids_list, attention_mask=attention_mask)
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 822, in inference_batch
    self.forward_seq_pipeline_stage(
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 652, in forward_seq_pipeline_stage
    self._forward_compute_prompt_seq(
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 470, in _forward_compute_prompt_seq
    self._generate_echo_token_logprobs(index, indices=seq)
  File "/mnt/c/Users/Paul/Documents/GitHub/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 485, in _generate_echo_token_logprobs
    self.ret_tokens[
RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1.  Target sizes: [1, 2048].  Tensor sizes: [2047]
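
For what it's worth, the shape mismatch can be reproduced in isolation. The sketch below is only an illustration, assuming the echo path produces one log-prob per predicted token (i.e. seq_len - 1 values) while the destination row in ret_tokens is sized for the full seq_len; the variable names are hypothetical, not the ones used in the repo:

import torch

# Hypothetical illustration of the off-by-one behind the error above:
# 2047 echoed values are broadcast into a row that expects 2048 entries.
seq_len = 2048
ret_row = torch.empty(1, seq_len)         # target shape [1, 2048]
echo_logprobs = torch.randn(seq_len - 1)  # produced shape [2047]
ret_row[:, :] = echo_logprobs             # RuntimeError: expanded size (2048) vs. existing size (2047) at dimension 1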

I used WSL2 with Ubuntu 20.04 and CUDA 11.3.
Due to storage limitations, I changed some settings to use a smaller pretrained model (opt-125m).
In DejaVu/Decentralized_FM_alpha/c4_train/get_data.py:

data = {
    "best_of": 1,
    "echo": True,
    "logprobs": 1,
    "max_tokens": 0,
    "model": "opt-125m",
    "n": 1,
    "prompt": doc["text"],
    "request_type": "language-model-inference",
    "stop": None,
    "temperature": 0,
    "top_p": 1,
}

In DejaVu/Decentralized_FM_alpha/convert_opt_checkpoint.py:

parser.add_argument('--model-name', type=str, default='facebook/opt-125m', 
                        help='model-name')
parser.add_argument('--save-path', type=str, default='/mnt/c/Users/Paul/Documents/GitHub/DejaVu/pretrained_models/opt-125m/checkpoint', 
                        help='model-name')

In DejaVu/Decentralized_FM_alpha/run_infer_opt_175b_collect_sp_data.sh:

ARGS="--model-name /mnt/c/Users/Paul/Documents/GitHub/DejaVu/pretrained_models/opt-125m/checkpoint \
--model-type opt-save \
--seed 42 \
--fp16 \
--num-layers 12 \
--max-layers 12 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 1 \
--world-size 1 --pipeline-group-size 1 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    & \
wait)

The att_sp_x_0.mmap ~ att_sp_x_11.mmap, mlp_sp_x_0.mmap ~ mlp_sp_x_11.mmap, mlp_label_0.mmap ~ mlp_label_11.mmap, and score_norm_0.mmap ~ score_norm_11.mmap files were created successfully before the error occurred.
Can you think of any possible reason or solution?
Thanks!

@simonsimanta

Hello Authors,

I encountered the above RuntimeError related to a tensor size mismatch while trying to train the sparsity predictor as per the instructions in the README. The issue arises when running ./run_infer_opt_175b_collect_sp_data.sh. All relevant files are created successfully until the script hits the tensor size error.

RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1. Target sizes: [1, 2048]. Tensor sizes: [2047]

Can you please provide additional information on how we can resolve the issue?

Thank you.

@zhaoyang-star

zhaoyang-star commented Jan 31, 2024

I met the same error when tp=1 and am hoping the author can fix it.
A workaround is to use tp=2; I have verified that the error disappears.

@ariellubonja

Hi everyone! I am facing this issue right now, after trying to use the facebook/opt-125m model instead. Has anyone found a fix?

@zhaoyang-star Where is this tp=2 parameter?

@2455DD

2455DD commented May 20, 2024

The same thing happens to me when I try to use the facebook/opt-1.7b model in a Colab notebook. Waiting for any possible solution.

> I met the same error when tp=1 and am hoping the author can fix it. A workaround is to use tp=2; I have verified that the error disappears.

@zhaoyang-star Is tp a library or a parameter? Does it mean the top_p in c4_train/get_data.py? Thanks for your help!

2455DD added a commit to 2455DD/DejaVu that referenced this issue May 20, 2024
@ariellubonja

@2455DD I reached out to the author earlier this month, and she replied: "I believe there is a temporary work around suggested on GitHub by setting the top_p = 2 in get_data file."

So you're correct, it is top_p in c4_train/get_data.py.
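
Concretely, that means changing the top_p field in the request dict built by c4_train/get_data.py. A sketch only, mirroring the dict quoted at the top of this thread (I have not verified this myself):

data = {
    # ... other fields as in the original get_data.py (model, prompt, echo, ...)
    "temperature": 0,
    "top_p": 2,  # was 1; the temporary workaround suggested by the author
}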

@mailonghua

I followed the instructions and modified the file DejaVu/Decentralized_FM_alpha/c4_train/get_data.py in the following way:

data = {
    "best_of": 1,
    "echo": True,
    "logprobs": 1,
    "max_tokens": 0,
    "model": "opt-1.3b",
    "n": 1,
    "prompt": doc["text"],
    "request_type": "language-model-inference",
    "stop": None,
    "temperature": 0,
    "top_p": 2,
}

After making these changes, I deleted the local Hugging Face cache of C4 and the file c4_train.jsonl. I then re-executed get_data.py and ran the script run_infer_opt_175b_collect_sp_data.sh. However, I am still encountering a size mismatch error:
xxx
RuntimeError: The size of tensor a (2047) must match the size of tensor b (2048) at non-singleton dimension 3
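
(For context: a mismatch at non-singleton dimension 3 like this comes from broadcasting two 4-D tensors whose last dimensions differ, e.g. attention scores computed over 2047 key positions added to an attention mask built for 2048 positions. Whether that is the actual site in this codebase is only a guess on my part; the shapes below are hypothetical and reduced for brevity.)

import torch

# Hypothetical shapes [batch, heads, query_len, key_len], with heads and
# query_len collapsed to 1 so only the mismatched last dimension matters.
scores = torch.zeros(1, 1, 1, 2047)  # scores over 2047 key positions
mask = torch.zeros(1, 1, 1, 2048)    # mask built for 2048 key positions
scores + mask  # RuntimeError: The size of tensor a (2047) must match the size of tensor b (2048) at non-singleton dimension 3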

@mailonghua

> @2455DD I reached out to the author earlier this month, and she replied: "I believe there is a temporary work around suggested on GitHub by setting the top_p = 2 in get_data file."
>
> So you're correct, it is top_p in c4_train/get_data.py.
>
> The same thing happens to me when I try to use the facebook/opt-1.7b model in a Colab notebook. Waiting for any possible solution.
>
> I met the same error when tp=1 and am hoping the author can fix it. A workaround is to use tp=2; I have verified that the error disappears.
>
> @zhaoyang-star Is tp a library or a parameter? Does it mean the top_p in c4_train/get_data.py? Thanks for your help!

I used the run_infer_opt_1_3_b_collect_sp_data.sh script from your forked repository, but I encountered the same size mismatch error. Did you run the script successfully?

@Philippe-Guyard

Philippe-Guyard commented May 31, 2024

> @2455DD I reached out to the author earlier this month, and she replied: "I believe there is a temporary work around suggested on GitHub by setting the top_p = 2 in get_data file."
>
> So you're correct, it is top_p in c4_train/get_data.py.
>
> The same thing happens to me when I try to use the facebook/opt-1.7b model in a Colab notebook. Waiting for any possible solution.
>
> I met the same error when tp=1 and am hoping the author can fix it. A workaround is to use tp=2; I have verified that the error disappears.
>
> @zhaoyang-star Is tp a library or a parameter? Does it mean the top_p in c4_train/get_data.py? Thanks for your help!
>
> I used the run_infer_opt_1_3_b_collect_sp_data.sh script from your forked repository, but I encountered the same size mismatch error. Did you run the script successfully?

I can confirm I am having a similar issue after setting "model": "opt-125m" and "top_p": 2. The only difference is that I get

RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1.  Target sizes: [1, 2048].  Tensor sizes: [2047]

(the difference being that the mismatch happens at dimension 1 and not 3).

@zhaoningyuan

It works for me if I set pipeline-group-size=2:

ARGS="--model-name /data/xxx/DejaVu/Decentralized_FM_alpha/pretrained_models \
--model-type opt-save \
--seed 42 \
--fp16 \
--num-layers 16 \
--max-layers 32 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 16 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
    & \
wait)

> Hello Authors,
>
> I encountered the above RuntimeError related to a tensor size mismatch while trying to train the sparsity predictor as per the instructions in the README. The issue arises when running ./run_infer_opt_175b_collect_sp_data.sh. All relevant files are created successfully until the script hits the tensor size error.
>
> RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1. Target sizes: [1, 2048]. Tensor sizes: [2047]
>
> Can you please provide additional information on how we can resolve the issue?
>
> Thank you.
