
Which SFT setup is recommended now? #14

Open
tyleryzhu opened this issue Sep 12, 2024 · 1 comment

Comments

@tyleryzhu

It seems like there are three different SFT setups recommended between the code and the paper.

Paper:

  • Stage 2: 600k image instructions from ALLaVA, 240k video instructions

Code (your ckpt):

  • Stage 2.1: 600k images, 300k video captions
  • Stage 2.2: 100k images, 200k video QA

Code (new recipe I assume?):

  • Stage 2: 600k images, 240k video instruction/QA (?), 15k video captions.

I assume the new recipe is one you tested and that it gets the same or better numbers than those in the paper? If you could clarify the different settings, that would be much appreciated. Thank you!

@RifleZhang
Owner

Hello,
per the code at https://github.com/RifleZhang/LLaVA-Hound-DPO/blob/main/llava_hound_dpo/sft_scripts/video_sft_qa_240k.sh#L19, the SFT stage uses 100k image instructions + 240k video QA. A small set of 15k captions is mixed in, which was inspired by ShareGPT4V training, but we haven't tested what happens if that data is removed.
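For reference, here is a minimal sketch of how a mixture like that (100k image instructions + 240k video QA + 15k captions) could be assembled into a single shuffled training file. The file names below are hypothetical placeholders, not the repo's actual paths; the real mixture is defined in `video_sft_qa_240k.sh`:

```python
import json
import random

# Hypothetical placeholder paths -- the actual data files are set in
# video_sft_qa_240k.sh and the repo's data-prep scripts.
DATA_SOURCES = [
    "image_instruction_100k.json",  # 100k image instructions
    "video_qa_240k.json",           # 240k video QA pairs
    "video_caption_15k.json",       # 15k captions (ShareGPT4V-style mix-in)
]

def build_sft_mixture(paths, seed=42):
    """Concatenate instruction files and shuffle them into one SFT set."""
    samples = []
    for path in paths:
        with open(path) as f:
            samples.extend(json.load(f))
    random.Random(seed).shuffle(samples)
    return samples

if __name__ == "__main__":
    mixture = build_sft_mixture(DATA_SOURCES)
    with open("sft_mixture.json", "w") as f:
        json.dump(mixture, f)
    print(f"Wrote {len(mixture)} samples")
```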
