
Size mismatch error in the middle of training #6699

Open

NicoZenith opened this issue Jan 18, 2025 · 1 comment
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

Comments

@NicoZenith

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Hi, I am training Qwen2-VL on a dataset containing images paired with instructions. After pre-processing and tokenizing the whole dataset, I can finally train the model. However, in the middle of training I get a size mismatch error, as if one datapoint had not been correctly truncated to the configured cutoff_len of 4096.
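
To double-check that, here is a minimal sketch I would run over the tokenized dataset (assuming --tokenized_path points to a directory saved with datasets.save_to_disk() and that the column is named input_ids; the path and split name below are placeholders):

    # Minimal sketch: scan the saved tokenized dataset for samples longer than cutoff_len.
    # Assumptions: the dataset was saved with datasets.save_to_disk() and has an
    # "input_ids" column; the path and split name are placeholders.
    from datasets import DatasetDict, load_from_disk

    CUTOFF_LEN = 4096
    ds = load_from_disk("/path/to/tokenized_dataset")  # placeholder path
    if isinstance(ds, DatasetDict):
        ds = ds["train"]  # assumed split name
    too_long = [
        (i, len(ex["input_ids"]))
        for i, ex in enumerate(ds)
        if len(ex["input_ids"]) > CUTOFF_LEN
    ]
    print(f"{len(too_long)} samples exceed {CUTOFF_LEN} tokens; first few: {too_long[:5]}")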

Reproduction

Training script:

    TORCHRUN_ARGS=\"\
        --node-rank=\${SLURM_PROCID} \
        --master-addr=\${MASTER_ADDR} \
        --master-port=\${MASTER_PORT} \
        --nnodes=\${SLURM_NNODES} \
        --nproc-per-node=4 \
    \"

    # Launch training
    torchrun \$TORCHRUN_ARGS src/train.py \
        --run_name ${RUN_NAME} \
        --deepspeed examples/deepspeed/ds_z3_config.json \
        --stage sft \
        --tokenized_path /capstor/scratch/cscs/ndeperr/checkpoints/tokenizers/${RUN_NAME} \
        --do_train \
        --model_name_or_path Qwen/Qwen2-VL-7B-Instruct \
        --dataset all_train_qwen2vl \
        --template qwen2_vl \
        --image_resolution 262144 \
        --finetuning_type full \
        --output_dir \$SCRATCH/checkpoints/${RUN_NAME} \
        --overwrite_output_dir \
        --warmup_steps 100 \
        --weight_decay 0.1 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 1 \
        --ddp_timeout 180000000 \
        --learning_rate 1e-5 \
        --lr_scheduler_type cosine \
        --logging_steps 1 \
        --cutoff_len 4096 \
        --report_to wandb \
        --save_steps 10000 \
        --plot_loss \
        --num_train_epochs 1 \
        --preprocessing_num_workers 128 \
        --bf16 \
        --overwrite_cache

And here is the error I get:

 0: {'loss': 0.688, 'grad_norm': 1.411176515565421, 'learning_rate': 9.436746524392668e-06, 'epoch': 0.16}
 0:  16%|█▌        | 1452/8963 [51:50<4:04:03,  1.95s/it]
 0: {'loss': 0.6443, 'grad_norm': 1.5193737886859888, 'learning_rate': 9.435929038441369e-06, 'epoch': 0.16}
 0:  16%|█▌        | 1453/8963 [51:52<4:04:48,  1.96s/it]
 0: [rank2]: Traceback (most recent call last):
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/train.py", line 28, in <module>
 0: [rank2]:     main()
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/train.py", line 19, in main
 0: [rank2]:     run_exp()
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 92, in run_exp
 0: [rank2]:     _training_function(config={"args": args, "callbacks": callbacks})
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
 0: [rank2]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
 0: [rank2]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/tr
 0: ainer.py", line 2122, in train
 0: [rank2]:     return inner_training_loop(
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2426, in _inner_training_loop
 0: [rank2]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 5038, in get_batch_samples
 0: [rank2]:     batch_samples += [next(epoch_iterator)]
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 561, in __iter__
 0: [rank2]:     next_batch = next(dataloader_iter)
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
 0: [rank2]:     data = self._next_data()
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 675, in _next_data
 0: [rank2]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/u
 0: tils/data/_utils/fetch.py", line 54, in fetch
 0: [rank2]:     return self.collate_fn(data)
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/data/collator.py", line 174, in __call__
 0: [rank2]:     features = super().__call__(features)
 0: [rank2]:   File "/capstor/scratch/cscs/ndeperr/code/LLaMA-Factory/src/llamafactory/data/collator.py", line 138, in __call__
 0: [rank2]:     features["position_ids"], features["rope_deltas"] = self.model.get_rope_index(
 0: [rank2]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1571, in get_rope_index
 0: [rank2]:     position_ids[..., i, attention_mask[i] == 1] = llm_positions.to(position_ids.device)
 0: [rank2]: RuntimeError: shape mismatch: value tensor of shape [3, 8398] cannot be broadcast to indexing result of shape [3, 4096]

Is this an issue in the pre-processing step? Were some datapoints somehow not truncated to 4096 tokens?
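
If I read the traceback correctly, get_rope_index builds the position ids from image_grid_thw as well as from the input ids, so a single image can account for many more positions than the truncated text alone. Below is a hypothetical sketch to see how many visual tokens one image expands to under the Qwen2-VL processor (the image path is a placeholder; merge_size is read from the processor rather than hard-coded):

    # Hypothetical sketch: count the visual tokens one image expands to for Qwen2-VL.
    # The image path is a placeholder, not a real dataset file.
    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    image = Image.open("example.jpg")  # placeholder image
    inputs = processor(
        text=["<|vision_start|><|image_pad|><|vision_end|>"],
        images=[image],
        return_tensors="pt",
    )
    t, h, w = inputs["image_grid_thw"][0].tolist()
    merge = processor.image_processor.merge_size  # 2 by default
    print("visual tokens after merging:", t * h * w // (merge**2))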

Many thanks

Others

No response

@hiyouga
Owner

hiyouga commented Jan 20, 2025

Could you print the shapes of the position ids and attention mask respectively?
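
A hedged sketch of where such prints could go, right above the get_rope_index call in src/llamafactory/data/collator.py that appears in the traceback (which keys are actually present in features at that point is an assumption):

    # Hedged sketch: add immediately before the
    #   features["position_ids"], features["rope_deltas"] = self.model.get_rope_index(...)
    # line in src/llamafactory/data/collator.py (line 138 in the traceback above).
    for key in ("input_ids", "attention_mask", "image_grid_thw"):
        if key in features and hasattr(features[key], "shape"):
            print(f"[collator] {key}: {tuple(features[key].shape)}")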
