There are some inaccuracies in the information that the script generates, some more important than others. For example, it sets our OS to Ubuntu, since that is the operating system inside the container, even though we run Rocky outside of the container (I figure this is not a big deal). More importantly, when I generate results using the given default cm run offline script for both base and main for scc24:

https://docs.mlcommons.org/cm4mlperf-inference/

it ends up saying we are using 3x H100 NVL. Our system has 4x H100 NVL, and they are all accessible: nvidia-smi reports the correct count inside the container, and at multiple steps during runtime we can see it iterate over the CUDA devices and list indexes 0..3.

Furthermore, I'm not sure whether this is the expected behavior or an error, but by default, without making any new configurations, shouldn't it be running on only a single GPU and recording the result accordingly? I've run it manually, monitored it, and verified that it only ever uses the same GPU (index 0), so I would expect it to report only a single H100 being utilized.

Either way, I figure this should report either 1x or 4x. The cm run scripts I'm referencing are here:

https://docs.mlcommons.org/inference/benchmarks/text_to_image/reproducibility/scc24/

And note my submissions on the leaderboard here, which show the 3x H100s mentioned above:

https://docs.mlcommons.org/cm4mlperf-inference/
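For reference, this is a minimal sketch of the check I run inside the container to double-check the device count (it assumes the container ships a CUDA-enabled PyTorch; it is not part of the cm scripts themselves):

```python
# Minimal check (not part of the cm scripts): count the CUDA devices visible
# inside the container and list their names, to compare against the device
# count that ends up in the generated results summary.
import torch  # assumes the container's CUDA-enabled PyTorch build

count = torch.cuda.device_count()
print(f"visible CUDA devices: {count}")  # expect 4 on a 4x H100 NVL node
for i in range(count):
    print(i, torch.cuda.get_device_name(i))
```

If all four cards are visible the way nvidia-smi suggests, this should list indexes 0..3, which is why the 3x in the report looks wrong to me.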
Hi @rysc3, yes, that's a bug and it should be fixed here.
Running on a single GPU: this is happening with the reference implementation, right? That is actually a limitation of the reference implementation, and there will be points if you can make it run using all the GPUs and submit a PR to the inference repository.
For the Nvidia implementation, all GPUs are expected to be used.
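The usual pattern for using every GPU is one worker process per device, each pinned to its own card. The sketch below is only illustrative; the worker body is a placeholder, and wiring this into the reference implementation's actual SUT/QSL code is the part a PR would need to cover:

```python
# Illustrative sketch only: spawn one worker per visible GPU and pin each
# worker to its own device. The worker body is a placeholder, not the
# reference implementation's code.
import torch
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    torch.cuda.set_device(rank)  # pin this process to GPU `rank`
    # ...load the model on this device and serve its share of the queries...
    print(f"worker {rank} using {torch.cuda.get_device_name(rank)}")

if __name__ == "__main__":
    nprocs = torch.cuda.device_count()  # 4 on a 4x H100 NVL node
    mp.spawn(worker, args=(), nprocs=nprocs)
```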