
loss does not decrease #120

Open
wangfengjuan opened this issue Sep 29, 2024 · 6 comments
@wangfengjuan

Hello, thank you very much for sharing your work. In TinyLLaVA_Factory-main, I ran bash ./scripts/train/train_phi.sh and noticed a problem: the loss stays around 5 and does not decrease, and the final fine-tuned model performs poorly. The TextVQA evaluation result is 7.85, which is very strange, and I don't know what went wrong. The script I ran and the results are shown in the screenshots below. Looking forward to your reply.
[Screenshots: the executed finetune script and training logs (loss around 5, grad-norm 0)]

@YingHuTsing
Collaborator

Are these pretrain-stage losses? For the Phi-2 LLM, the final loss in the pretrain stage should reach about 2.5. In the screenshot you gave above, the grad-norm is 0, which indicates the network is not learning: the gradients are 0 and the parameters are no longer being updated. Did you change any hyperparameters in pretrain.sh/finetune.sh?
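A grad-norm of 0 can also be verified directly in a training loop. Below is a minimal PyTorch sketch (not TinyLLaVA_Factory code; names are illustrative) that computes the total gradient norm after backward() to confirm whether parameters are actually receiving gradients.

```python
# Minimal sketch, not TinyLLaVA_Factory code: compute the total gradient
# norm after loss.backward() to check whether anything is being trained.
import torch

def total_grad_norm(model: torch.nn.Module) -> float:
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    if not norms:
        return 0.0  # no parameter received a gradient at all
    return torch.norm(torch.stack(norms), 2).item()

# Usage inside a training step (illustrative):
#   loss.backward()
#   print(f"grad norm: {total_grad_norm(model):.4f}")  # ~0 => nothing is learning
#   optimizer.step()
```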

@wangfengjuan
Author

This is the loss in the fine-tuning stage. The only hyperparameter I changed is batch_size, and I am using four 3090 GPUs. I don't know where the problem lies.

```bash
deepspeed --include localhost:0,1,2,3 --master_port 29501 tinyllava/train/train.py \
    --deepspeed ./scripts/zero3.json \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --is_multimodal True \
    --conv_version $CONV_VERSION \
    --model_name_or_path $LLM_VERSION \
    --vision_tower $VT_VERSION \
    --vision_tower2 '' \
    --connector_type $CN_VERSION \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio square \
    --attn_implementation flash_attention_2 \
    --fp16 True \
    --training_recipe $TRAIN_RECIPE \
    --tune_type_llm lora \
    --tune_type_vision_tower frozen \
    --tune_vision_tower_from_layer 0 \
    --tune_type_connector full \
    --group_by_modality_length True \
    --pretrained_model_path /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-pretrain-0929 \
    --output_dir /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-finetune-0929 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length $MODEL_MAX_LENGTH \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --tokenizer_use_fast False \
    --run_name /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-finetune-0929
```

@YingHuTsing
Collaborator

Hi. After pretraining, the initial loss in the finetune stage should start from about 2.5. It seems the problem comes from the pretraining stage. Please provide your params in pretrain.sh, and please also check whether the final loss in your pretrain stage decreased to about 2.5.
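One way to inspect the pretrain loss curve without rerunning anything is to read the event files that --report_to tensorboard writes. A minimal sketch, assuming the Hugging Face Trainer's default "train/loss" scalar tag and an illustrative log directory (adjust both to your run):

```python
# Minimal sketch: read the logged training loss from TensorBoard event files.
# The log directory and the "train/loss" tag are assumptions; adjust to your run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("checkpoints/llava_factory/<your-pretrain-run>/runs")
ea.Reload()
losses = [e.value for e in ea.Scalars("train/loss")]
print("first loss:", losses[0], "final loss:", losses[-1])  # final should be near 2.5
```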

@wangfengjuan
Author

> Hi. After pretraining, the initial loss in the finetune stage should start from about 2.5. It seems the problem comes from the pretraining stage. Please provide your params in pretrain.sh, and please also check whether the final loss in your pretrain stage decreased to about 2.5.

Thank you for your reply. The pre-training parameter settings are as follows. The pre-training loss is also around 5 and has not decreased.

```bash
deepspeed --include localhost:0,1,2,3 --master_port 29502 tinyllava/train/train.py \
    --deepspeed ./scripts/zero2.json \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --is_multimodal True \
    --conv_version pretrain \
    --model_name_or_path $LLM_VERSION \
    --vision_tower $VT_VERSION \
    --vision_tower2 $VT_VERSION2 \
    --connector_type $CN_VERSION \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio square \
    --attn_implementation flash_attention_2 \
    --fp16 True \
    --training_recipe $TRAIN_RECIPE \
    --tune_type_llm frozen \
    --tune_type_vision_tower frozen \
    --tune_vision_tower_from_layer 0 \
    --tune_type_connector full \
    --output_dir /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-pretrain-0929 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-1 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length $MODEL_MAX_LENGTH \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --tokenizer_use_fast False \
    --run_name /home/omnisky/userfile_2/wangfj/TinyLLaVA_Factory-main/checkpoints/llava_factory/tiny-llava-${LLM_VARIANT}-${VT_VARIANT}-${VERSION}-pretrain-0929
```

@YingHuTsing
Collaborator

Hi, your learning rate in the pretrain stage is too large... please set learning_rate to 1e-3.

And are you sure per_device_train_batch_size can be set to 32? I also ran your scripts on a machine with four 3090 GPUs, and I had to decrease per_device_train_batch_size to 16 and increase gradient_accumulation_steps to 4 to avoid OOM.
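For reference, the suggested change keeps the same effective batch size as the original pretrain settings, so only the per-GPU memory footprint changes. A quick illustrative check (assuming four GPUs, as in this thread):

```python
# Illustrative arithmetic only: effective batch size =
# per_device_train_batch_size * gradient_accumulation_steps * number of GPUs.
gpus = 4
original  = 32 * 2 * gpus   # per_device_train_batch_size=32, grad_accum=2 -> 256
suggested = 16 * 4 * gpus   # per_device_train_batch_size=16, grad_accum=4 -> 256
print(original, suggested)  # both 256
```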

@wangfengjuan
Author

Thank you very much for your reply. per_device_train_batch_size has to be set to 4 when running on a machine with four 3090 GPUs, otherwise it OOMs. I'll try again with a different learning rate. Thank you!
