[Bug] InternVL MPO loss is 0 from the very first step #889

Open · 1 of 3 tasks
amoreZgx1n opened this issue Feb 8, 2025 · 0 comments
amoreZgx1n commented Feb 8, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if your bug report lacks the corresponding environment info and a minimal reproducible demo, it will be hard for us to reproduce and resolve the issue, reducing the likelihood of feedback.

Describe the bug

warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (10240) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([10240, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
dynamic ViT batch size: 30, images per sample: 7.5, dynamic token length: 1024
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (10240) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([10240, 4096])
[2025-02-08 10:58:15,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 78.92 | optimizer_gradients: 28.46 | optimizer_step: 48.12
[2025-02-08 10:58:15,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.09 | bwd_microstep: 1605.45 | bwd_inner_microstep: 1451.34 | bwd_allreduce_microstep: 154.04 | step_microstep: 188.06
[2025-02-08 10:58:15,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 496.08 | bwd: 1605.44 | bwd_inner: 1451.34 | bwd_allreduce: 154.05 | step: 188.06

1%| | 21/3306 [02:23<2:38:16, 2.89s/it]
02/08/2025 10:58:15 - WARNING - tensorboardX.x2num - NaN or Inf found in input tensor.

{'loss': 0.0, 'learning_rate': 1.0500000000000001e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': 0.0, 'logps/chosen': 0.0, 'logits/rejected': 6.465516090393066, 'logits/chosen': 6.095698833465576, 'nll_loss': nan, 'epoch': 0.02}
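
For context, these warnings come from a fallback path where InternVL writes the ViT embeddings into the <IMG_CONTEXT> placeholder positions of the text embedding sequence. Below is a minimal, runnable sketch of that step with dummy shapes; the structure is paraphrased from modeling_internvl_chat.py, and the token id and tensor sizes are illustrative only:

import torch

B, N, C = 1, 16, 4096                         # C == 4096, as in the logs above
img_context_token_id = 92546                  # placeholder id, illustrative only
input_ids = torch.ones(B * N, dtype=torch.long)
input_ids[:6] = img_context_token_id          # only 6 placeholder positions survive
input_embeds = torch.zeros(B * N, C)
vit_embeds = torch.randn(10, C)               # but the ViT produced 10 tokens

selected = input_ids == img_context_token_id
try:
    # Happy path: exactly one ViT embedding per <IMG_CONTEXT> position.
    input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
except Exception as e:
    # Fallback: truncate the ViT embeddings to the surviving placeholder
    # positions -- this is the branch that prints the warnings in the log.
    n_token = int(selected.sum())
    print(f'warning: {e}, input_embeds[selected].shape={input_embeds[selected].shape}, '
          f'vit_embeds.shape={vit_embeds.reshape(-1, C).shape}')
    input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)[:n_token]

So the mismatch in the log (3932 placeholder positions vs. 7680 ViT tokens) means most of the image tokens were cut out of the text sequence before the ViT embeddings could be inserted.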

Reproduction

set -x

GPUS=${GPUS:-4}
GPUS_PER_NODE=${GPUS_PER_NODE:-1}
NODES=$((GPUS / GPUS_PER_NODE))
CPUS_PER_TASK=${CPUS_PER_TASK:-10}
SRUN_ARGS=${SRUN_ARGS:-""}
BATCH_SIZE=${BATCH_SIZE:-8}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-2}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))
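# With the defaults above: GRADIENT_ACC = 8 / 2 / 4 = 1 accumulation step (integer division).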

cd /mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat
source /opt/conda/bin/activate

conda activate /mnt/pfs-mc0p4k/tts/team/zgx/environment/internvl2
echo "Python path: $(which python)" >> "/mnt/pfs-mc0p4k/tts/team/zgx/workplace/shell/train_log.txt"
which python

export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch

OUTPUT_DIR='/mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat/output/internvl_chat_mpo_v2/internvl2_8b_mpo_v1'

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_dpo.py \
  --model_name_or_path "/mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat/output/merged_model/internvl2_8b_v1" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "./shell/data/adqa_mpo.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --pad2square False \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone False \
  --vision_select_layer -1 \
  --use_data_resampling False \
  --dataloader_num_workers 8 \
  --bf16 True \
  --num_train_epochs 3 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "no" \
  --save_steps 100 \
  --save_total_limit 100 \
  --learning_rate 5e-6 \
  --weight_decay 0.05 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 1024 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length False \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  --loss_type sigmoid,bco_pair \
  --sigmoid_loss_weight 0.8 \
  --bco_pair_loss_weight 0.2 \
  --rpo_alpha 1 \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
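
For what it's worth, the numbers in the warnings look consistent with this configuration, assuming the standard InternVL accounting of 256 visual tokens per 448x448 tile (32x32 patches at patch size 14, pixel-shuffled by down_sample_ratio 0.5). A quick consistency check under those assumed defaults:

# Quick consistency check (assumes InternVL defaults: patch size 14,
# pixel shuffle with down_sample_ratio 0.5 => 256 tokens per 448x448 tile).
tokens_per_tile = int((448 // 14) ** 2 * 0.5 ** 2)  # 32*32 * 0.25 = 256
tiles = 30                                          # "dynamic ViT batch size: 30"
image_tokens = tiles * tokens_per_tile              # 7680, matches vit_embeds.shape[0]
                                                    # (40 tiles would give the 10240 seen once)
max_seq_length = 1024                               # from the script above
print(image_tokens, image_tokens > max_seq_length)  # 7680 True

Since the image placeholders alone would need 7680 tokens but max_seq_length is 1024, the tokenized samples get truncated and most placeholders are dropped, which matches the mismatch warnings. And with loss_type sigmoid,bco_pair plus rpo_alpha 1, the total loss is roughly 0.8 * sigmoid + 0.2 * bco_pair + 1 * nll, so a NaN nll_loss would plausibly explain the zeroed metrics from step 0. Raising max_seq_length or lowering the number of tiles per sample seems like the first thing to try, though I may be wrong about the exact mechanism.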

Environment

torch 2.0.1+cu118

Error traceback
