The difference appears to come from a regression in the batch size used (2 -> 1). It could be related to the switch from the "none" bucketing mode to "block": that switch increased memory usage for other models, causing OOMs, and here it seems to have forced a smaller batch size that still fits. @kiya00, could you please take a look at this regression and find out what caused it?
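The suspected chain of events — extra per-step memory overhead pushes the previously used batch size over the device budget, so a smaller one is selected — can be illustrated with a minimal sketch. This is not the actual benchmark_litgpt.py logic; the function names, memory figures, and the linear memory model are all illustrative assumptions.

```python
# Hypothetical sketch (not the actual benchmark code): why a higher fixed
# memory overhead can silently shrink the largest batch size that fits.
# All names and numbers below are illustrative assumptions.

def fits_in_memory(batch_size: int, per_sample_mb: int,
                   overhead_mb: int, budget_mb: int) -> bool:
    """Assume step memory grows linearly with batch size plus a fixed overhead."""
    return batch_size * per_sample_mb + overhead_mb <= budget_mb

def largest_fitting_batch(candidates, per_sample_mb, overhead_mb, budget_mb):
    """Pick the largest candidate batch size that stays under the budget."""
    for bs in sorted(candidates, reverse=True):
        if fits_in_memory(bs, per_sample_mb, overhead_mb, budget_mb):
            return bs
    return None  # every candidate would OOM

# With a small fixed overhead (the assumed "none"-mode behavior), 2 fits:
print(largest_fitting_batch([1, 2], per_sample_mb=30_000,
                            overhead_mb=10_000, budget_mb=80_000))  # -> 2

# With a larger overhead (as suspected for "block" bucketing), only 1 fits:
print(largest_fitting_batch([1, 2], per_sample_mb=30_000,
                            overhead_mb=25_000, budget_mb=80_000))  # -> 1
```

Under this toy model, a modest overhead increase halves the usable batch size, which would explain the observed throughput drop without any error being reported.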
🐛 Bug
Here are recently found regressions:
To Reproduce
All parameters to benchmark_litgpt.py are visible in the attached image.
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.98.001
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.8
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.22+gitba4f7d4
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.6.0a0+gita9b4989
libraries.pip.torchao 0.6.1
libraries.pip.torchmetrics 1.5.1
libraries.pip.torchvision 0.19.0a0+d23a6e1