
Size mismatch error in the middle of training #6699

Open

NicoZenith opened this issue Jan 18, 2025 · 1 comment
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

Comments

@NicoZenith

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Hi, I am training Qwen2-VL on a dataset containing images paired with instructions. After pre-processing and tokenizing the whole dataset, I can finally train the model. However, in the middle of training I get a size mismatch error, as if one datapoint had not been correctly truncated to the configured cutoff_len of 4096.
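
To double-check that, here is a minimal sketch I would run over the tokenized dataset (assuming --tokenized_path points to a directory saved with datasets.save_to_disk() and that the column is named input_ids; the path and split name below are placeholders):

    # Minimal sketch: scan the saved tokenized dataset for samples longer than cutoff_len.
    # Assumptions: the dataset was saved with datasets.save_to_disk() and has an
    # "input_ids" column; the path and split name are placeholders.
    from datasets import DatasetDict, load_from_disk

    CUTOFF_LEN = 4096
    ds = load_from_disk("/path/to/tokenized_dataset")  # placeholder path
    if isinstance(ds, DatasetDict):
        ds = ds["train"]  # assumed split name
    too_long = [
        (i, len(ex["input_ids"]))
        for i, ex in enumerate(ds)
        if len(ex["input_ids"]) > CUTOFF_LEN
    ]
    print(f"{len(too_long)} samples exceed {CUTOFF_LEN} tokens; first few: {too_long[:5]}")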

Reproduction

Training script:

    TORCHRUN_ARGS=\"\
        --node-rank=\${SLURM_PROCID} \
        --master-addr=\${MASTER_ADDR} \
        --master-port=\${MASTER_PORT} \
        --nnodes=\${SLURM_NNODES} \
        --nproc-per-node=4 \
    \"

    # Launch training
    torchrun \$TORCHRUN_ARGS src/train.py \
        --run_name ${RUN_NAME} \
        --deepspeed examples/deepspeed/ds_z3_config.json \
        --stage sft \
        --tokenized_path /capstor/scratch/cscs/ndeperr/checkpoints/tokenizers/${RUN_NAME} \
        --do_train \
        --model_name_or_path Qwen/Qwen2-VL-7B-Instruct \
        --dataset all_train_qwen2vl \
        --template qwen2_vl \
        --image_resolution 262144 \
        --finetuning_type full \
        --output_dir \$SCRATCH/checkpoints/${RUN_NAME} \
        --overwrite_output_dir \
        --warmup_steps 100 \
        --weight_decay 0.1 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 1 \
        --ddp_timeout 180000000 \
        --learning_rate 1e-5 \
        --lr_scheduler_type cosine \
        --logging_steps 1 \
        --cutoff_len 4096 \
        --report_to wandb \
        --save_steps 10000 \
        --plot_loss \
        --num_train_epochs 1 \
        --preprocessing_num_workers 128 \
        --bf16 \
        --overwrite_cache

And here is the error I get:

 0: {'loss': 0.688, 'grad_norm': 1.411176515565421, 'learning_rate': 9.436746524392668e-06, 'epoch': 0.16}
 0:  16%|█▌        | 1452/8963 [51:50<4:04:03,  1.95s/it]
 0: {'loss': 0.6443, 'grad_norm': 1.5193737886859888, 'learning_rate': 9.435929038441369e-06, 'epoch': 0.16}
 0:  16%|█▌        | 1453/8963 [51:52<4:04:48,  1.96s/it]
 0: [rank2]: Traceback (most recent call last):
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/train.py", line 28, in <module>
 0: [rank2]:     main()
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/train.py", line 19, in main
 0: [rank2]:     run_exp()
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 92, in run_exp
 0: [rank2]:     _training_function(config={"args": args, "callbacks": callbacks})
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
 0: [rank2]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
 0: [rank2]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/tr
 0: ainer.py", line 2122, in train
 0: [rank2]:     return inner_training_loop(
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2426, in _inner_training_loop
 0: [rank2]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 5038, in get_batch_samples
 0: [rank2]:     batch_samples += [next(epoch_iterator)]
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 561, in __iter__
 0: [rank2]:     next_batch = next(dataloader_iter)
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
 0: [rank2]:     data = self._next_data()
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 675, in _next_data
 0: [rank2]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/u
 0: tils/data/_utils/fetch.py", line 54, in fetch
 0: [rank2]:     return self.collate_fn(data)
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/data/collator.py", line 174, in __call__
 0: [rank2]:     features = super().__call__(features)
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/data/collator.py", line 138, in __call__
 0: [rank2]:     features["position_ids"], features["rope_deltas"] = self.model.get_rope_index(
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1571, in get_rope_index
 0: [rank2]:     position_ids[..., i, attention_mask[i] == 1] = llm_positions.to(position_ids.device)
 0: [rank2]: RuntimeError: shape mismatch: value tensor of shape [3, 8398] cannot be broadcast to indexing result of shape [3, 4096]

Is this an issue in the pre-processing step? Were some datapoints somehow not truncated to 4096 tokens?
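
If I read the traceback correctly, get_rope_index builds the position ids from image_grid_thw as well as from the input ids, so a single image can account for many more positions than the truncated text alone. Below is a hypothetical sketch to see how many visual tokens one image expands to under the Qwen2-VL processor (the image path is a placeholder; merge_size is read from the processor rather than hard-coded):

    # Hypothetical sketch: count the visual tokens one image expands to for Qwen2-VL.
    # The image path is a placeholder, not a real dataset file.
    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    image = Image.open("example.jpg")  # placeholder image
    inputs = processor(
        text=["<|vision_start|><|image_pad|><|vision_end|>"],
        images=[image],
        return_tensors="pt",
    )
    t, h, w = inputs["image_grid_thw"][0].tolist()
    merge = processor.image_processor.merge_size  # 2 by default
    print("visual tokens after merging:", t * h * w // (merge**2))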

Many thanks

Others

No response

@hiyouga
Owner

hiyouga commented Jan 20, 2025

Could you print the shapes of the position ids and attention mask respectively?
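
A hedged sketch of where such prints could go, right above the get_rope_index call in src/llamafactory/data/collator.py that appears in the traceback (which keys are actually present in features at that point is an assumption):

    # Hedged sketch: add immediately before the
    #   features["position_ids"], features["rope_deltas"] = self.model.get_rope_index(...)
    # line in src/llamafactory/data/collator.py (line 138 in the traceback above).
    for key in ("input_ids", "attention_mask", "image_grid_thw"):
        if key in features and hasattr(features[key], "shape"):
            print(f"[collator] {key}: {tuple(features[key].shape)}")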
