1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (10240) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([10240, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
dynamic ViT batch size: 30, images per sample: 7.5, dynamic token length: 1024
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (10240) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([10240, 4096])
[2025-02-08 10:58:15,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 78.92 | optimizer_gradients: 28.46 | optimizer_step: 48.12
[2025-02-08 10:58:15,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.09 | bwd_microstep: 1605.45 | bwd_inner_microstep: 1451.34 | bwd_allreduce_microstep: 154.04 | step_microstep: 188.06
[2025-02-08 10:58:15,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 496.08 | bwd: 1605.44 | bwd_inner: 1451.34 | bwd_allreduce: 154.05 | step: 188.06
1%| | 21/3306 [02:23<2:38:16, 2.89s/it]
02/08/2025 10:58:15 - WARNING - tensorboardX.x2num - NaN or Inf found in input tensor.
{'loss': 0.0, 'learning_rate': 1.0500000000000001e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': 0.0, 'logps/chosen': 0.0, 'logits/rejected': 6.465516090393066, 'logits/chosen': 6.095698833465576, 'nll_loss': nan, 'epoch': 0.02}
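One possible reading of the numbers in the log above, offered as a sketch rather than a confirmed diagnosis: if each 448 px tile yields 256 image tokens, then 7680 rows in `vit_embeds` correspond to the 30 tiles reported as "dynamic ViT batch size", spread over 4 sequences (30 / 7.5). Four sequences capped at `--max_seq_length 1024` can hold at most 4096 tokens in total, so some image placeholder tokens would have been truncated, which would explain why only 3932 remain. The patch size of 14 is an assumption that the issue does not state; the other values come from the launch flags and the log.

```python
# Values from the launch flags and log; PATCH_SIZE is an assumption.
PATCH_SIZE = 14            # assumed, not stated in the issue
IMAGE_SIZE = 448           # --force_image_size
DOWN_SAMPLE_RATIO = 0.5    # --down_sample_ratio
MAX_SEQ_LENGTH = 1024      # --max_seq_length


def tokens_per_tile(image_size, patch_size, ratio):
    """Image tokens produced by one tile after downsampling."""
    return int((image_size // patch_size) ** 2 * ratio ** 2)


def diagnose(vit_rows, selected_rows, tiles, images_per_sample):
    """Relate the shapes in the warning to the sequence-length budget."""
    per_tile = tokens_per_tile(IMAGE_SIZE, PATCH_SIZE, DOWN_SAMPLE_RATIO)
    samples = round(tiles / images_per_sample)
    return {
        "tokens_per_tile": per_tile,
        "tiles_from_vit_rows": vit_rows // per_tile,
        "samples": samples,
        "image_tokens_per_sample": images_per_sample * per_tile,
        "max_tokens_in_batch": samples * MAX_SEQ_LENGTH,
        "shapes_match": selected_rows == vit_rows,
    }


report = diagnose(vit_rows=7680, selected_rows=3932,
                  tiles=30, images_per_sample=7.5)
for key, value in report.items():
    print(f"{key}: {value}")
```

Under these assumptions an average sample needs about 1920 image tokens, which already exceeds the 1024-token sequence limit before any text is counted.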
Reproduction
set -x
GPUS=${GPUS:-4}
GPUS_PER_NODE=${GPUS_PER_NODE:-1}
NODES=$((GPUS / GPUS_PER_NODE))
CPUS_PER_TASK=${CPUS_PER_TASK:-10}
SRUN_ARGS=${SRUN_ARGS:-""}
BATCH_SIZE=${BATCH_SIZE:-8}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-2}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))
cd /mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat
source /opt/conda/bin/activate
conda activate /mnt/pfs-mc0p4k/tts/team/zgx/environment/internvl2
echo "Python path: $(which python)" >> "/mnt/pfs-mc0p4k/tts/team/zgx/workplace/shell/train_log.txt"
which python
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch
OUTPUT_DIR='/mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat/output/internvl_chat_mpo_v2/internvl2_8b_mpo_v1'
if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi
torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_dpo.py \
  --model_name_or_path "/mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat/output/merged_model/internvl2_8b_v1" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "./shell/data/adqa_mpo.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --pad2square False \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone False \
  --vision_select_layer -1 \
  --use_data_resampling False \
  --dataloader_num_workers 8 \
  --bf16 True \
  --num_train_epochs 3 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "no" \
  --save_steps 100 \
  --save_total_limit 100 \
  --learning_rate 5e-6 \
  --weight_decay 0.05 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 1024 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length False \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  --loss_type sigmoid,bco_pair \
  --sigmoid_loss_weight 0.8 \
  --bco_pair_loss_weight 0.2 \
  --rpo_alpha 1 \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
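As a rough aid for choosing `--max_seq_length` with the flags above, the sketch below estimates the smallest sequence length that can hold every image token of a sample plus some text. It assumes 256 image tokens per tile (a 448 px tile with `--down_sample_ratio 0.5` and an assumed patch size of 14); `min_seq_length` and its `text_budget` parameter are hypothetical names introduced only for illustration and are not part of InternVL.

```python
def min_seq_length(num_tiles, tokens_per_tile=256, text_budget=512):
    """Smallest sequence length that fits all image tokens plus text.

    tokens_per_tile: assumed image tokens per 448 px tile.
    text_budget: hypothetical allowance for prompt and answer tokens.
    """
    return num_tiles * tokens_per_tile + text_budget


# The log reports 7.5 tiles per sample on average, so try 7 and 8 tiles.
for tiles in (7, 8):
    needed = min_seq_length(tiles)
    print(f"{tiles} tiles need at least {needed} tokens")
```

Both estimates are well above the 1024 tokens configured in the script, so raising `--max_seq_length` or limiting the number of tiles per image may be worth testing.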
Environment
Error traceback