Question about GPU Usage, Training Time, and Dataset Size #6

Open
aoliao12138 opened this issue Nov 29, 2024 · 8 comments

Comments

@aoliao12138

Nice work! I have a few questions regarding the details of your experiment.

  • How many GPUs did you use for training?
  • How much time did it take to train the model?
  • Could you share the size of the dataset you used?

Thanks in advance!

@SHYuanBest
Member

Thank you for your interest in ConsisID. We used 40 NVIDIA H100 GPUs with a total batch size of 80, and trained for 1800 steps. The training dataset consists of approximately 140,000 video clips. More details can be found in our report.
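The figures above imply a few derived numbers that are easy to check. This is a minimal arithmetic sketch using only the values in the reply; the per-GPU batch assumes no gradient accumulation was used, which the thread does not state.

```python
# Figures quoted in the reply above (the dataset size is approximate).
NUM_GPUS = 40
GLOBAL_BATCH = 80
TRAIN_STEPS = 1800
DATASET_CLIPS = 140_000  # "approximately 140,000 video clips"

# Assumption: no gradient accumulation, so each GPU holds GLOBAL_BATCH / NUM_GPUS clips.
per_gpu_batch = GLOBAL_BATCH // NUM_GPUS
# Total clips processed over the whole run.
samples_seen = GLOBAL_BATCH * TRAIN_STEPS
# Rough number of passes over the dataset.
approx_epochs = samples_seen / DATASET_CLIPS

print(f"per-GPU batch: {per_gpu_batch}")
print(f"samples seen:  {samples_seen}")
print(f"approx epochs: {approx_epochs:.2f}")
```

Under these assumptions each GPU processes 2 clips per step, and 1800 steps cover 144,000 clips, i.e. roughly one pass over the dataset.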

@SHYuanBest
Member

But only a single 80 GB GPU is needed for training.

@tyrink

tyrink commented Dec 11, 2024


Since you mention above that 40 NVIDIA H100 GPUs were used for training, how long would training take on a single 80 GB GPU such as an A100? Would it affect model performance?

@SHYuanBest
Copy link
Member

For Q1, the exact speed difference would need to be measured. For Q2, ideally it will not affect performance, but in practice the difference in global batch size (40 GPUs vs. 1 GPU) may affect convergence.
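One common way to keep the global batch size at 80 on a single GPU is gradient accumulation: run several small micro-batches, average their gradients, and only then take an optimizer step. The thread does not say whether the ConsisID training script supports this, so the following is a generic sketch in plain Python on a hypothetical toy 1-D least-squares model, not the project's code. It shows that averaging gradients over equal-sized micro-batches gives the same gradient as one full batch. `MICRO_BATCH = 2` is a hypothetical value chosen to match the per-GPU load of the 40-GPU setup.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch):
    """Average gradients over equal-sized micro-batches (gradient accumulation)."""
    assert len(xs) % micro_batch == 0, "global batch must divide evenly"
    steps = len(xs) // micro_batch  # number of accumulation steps
    total = 0.0
    for i in range(steps):
        sl = slice(i * micro_batch, (i + 1) * micro_batch)
        # Scale each micro-batch gradient by 1/steps, as one would scale the loss.
        total += grad_mse(w, xs[sl], ys[sl]) / steps
    return total


GLOBAL_BATCH = 80  # total batch size reported in the thread
MICRO_BATCH = 2    # hypothetical: what fits in memory per forward/backward pass
xs = [i / GLOBAL_BATCH for i in range(GLOBAL_BATCH)]
ys = [3.0 * x + 0.5 for x in xs]
w = 1.0

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, MICRO_BATCH)
print(GLOBAL_BATCH // MICRO_BATCH, "accumulation steps")
print(abs(full - accum) < 1e-9)
```

With 40 accumulation steps, each optimizer step does the same forward/backward work that the 40 GPUs share in parallel, so the effective batch matches but wall-clock time per step grows roughly 40-fold.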

@1151368613


Roughly how long did training take on the 40 H100 GPUs?

@SHYuanBest
Member


About 7 to 8 hours.
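Combined with the 40-GPU figure, this gives a rough compute budget. The sketch below converts it to GPU-hours and a naive single-GPU duration. It assumes perfect linear scaling and identical per-GPU throughput, which the maintainers explicitly did not claim: dropping multi-GPU communication overhead would lower the single-GPU figure somewhat, while a slower card such as an A100 would raise it.

```python
NUM_GPUS = 40
HOURS_LOW, HOURS_HIGH = 7, 8  # "About 7 to 8 hours" on 40 H100s

# Total compute in H100 GPU-hours.
gpu_hours = (NUM_GPUS * HOURS_LOW, NUM_GPUS * HOURS_HIGH)
# Naive assumption: perfect linear scaling and identical per-GPU throughput,
# so one GPU needs all the GPU-hours back to back.
days_single_gpu = tuple(h / 24 for h in gpu_hours)

print(f"H100 GPU-hours: {gpu_hours[0]}-{gpu_hours[1]}")
print(f"single-GPU days (naive): {days_single_gpu[0]:.1f}-{days_single_gpu[1]:.1f}")
```

This yields 280 to 320 H100 GPU-hours, or roughly 11.7 to 13.3 days on one GPU under the stated assumptions; actual timing would need to be measured.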

@1151368613

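Great, thanks for the answer. I'd also like to ask: did you train in single precision or double precision? I want to run it on H800 GPUs, but the H800 performs relatively poorly at double precision.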

@SHYuanBest
Member


We trained with bf16.
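bf16 (bfloat16) is neither single (fp32) nor double (fp64) precision. It is a 16-bit format that keeps fp32's sign bit and 8-bit exponent but only 7 explicit mantissa bits, so it has fp32's range with much coarser precision. The stdlib-only sketch below rounds a value to the nearest bf16 (ties to even) by manipulating its fp32 bit pattern. It is an illustration of the number format only. It does not show how ConsisID sets up bf16 training (the thread gives no details), and the helper `to_bf16` is hypothetical. It handles finite inputs within fp32 range only; NaN is not handled.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float (within fp32 range) to the nearest bfloat16, ties to even.

    bf16 is the upper 16 bits of an IEEE-754 float32: 1 sign bit, 8 exponent
    bits, 7 mantissa bits. NaN inputs are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    lsb = (bits >> 16) & 1                                # for round-half-to-even
    bits = (bits + 0x7FFF + lsb) & 0xFFFFFFFF
    bf16 = bits >> 16                                     # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bf16 << 16))[0]


for v in (1.0, 1 / 3, 1 + 2**-8, 1e30):
    print(v, "->", to_bf16(v))
```

For example, 1/3 becomes 0.333984375 and 1 + 2^-8 rounds to 1.0, while 1e30 keeps its magnitude: bf16 trades precision for fp32-sized range.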


4 participants