FLOPs counter doesn't seem to work #642
cc @wukaixingxp
Hi! Can you share the command that produces this error? Can you also share the output of
@wukaixingxp Thanks for the prompt reply.
Please see the output of
Also, is there a way to get tokens per second on each GPU during training? Thanks
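On the tokens-per-second question, here is a rough sketch of one way to measure it per rank, assuming a standard PyTorch loop with an HF-style model where `batch["input_ids"]` holds this rank's tokens; the names below are illustrative and not taken from this repo:

```python
# Hypothetical sketch of measuring tokens/sec on a single rank around one
# training step. Assumes an HF-style model where batch["input_ids"] holds
# this rank's tokens; names are illustrative, not taken from this repo.
import time
import torch

def timed_step(model, optimizer, batch):
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush pending kernels so timing is honest
    start = time.perf_counter()

    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    tokens_this_rank = batch["input_ids"].numel()  # tokens this GPU processed
    print(f"tokens/sec on this rank: {tokens_this_rank / elapsed:.1f}")
    return loss
```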
Hi! I noticed that you are using an AMD Instinct MI300X, which unfortunately we have not yet had a chance to test and support, so I have to guess at possible solutions: (1) Can you run the official FlopTensorDispatchMode example to test whether ROCm actually supports this feature? (2) I noticed that your PyTorch version is 2.5.0a0+git10344d7, while the current stable version is 2.4.0; maybe installing 2.4.0 will help?
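For reference, a minimal standalone check of PyTorch's FLOP counting on a single GPU, independent of FSDP, can be done with the `FlopCounterMode` context manager from `torch.utils.flop_counter`; this is a related API, and the official FlopTensorDispatchMode example mentioned above may look different:

```python
# Standalone single-GPU check of PyTorch's FLOP counting, independent of FSDP.
# Uses torch.utils.flop_counter.FlopCounterMode; the FlopTensorDispatchMode
# example mentioned above is a separate API and may differ in details.
import torch
from torch.utils.flop_counter import FlopCounterMode

device = "cuda" if torch.cuda.is_available() else "cpu"  # ROCm builds also expose torch.cuda
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device)
x = torch.randn(8, 1024, device=device)

flop_counter = FlopCounterMode(display=True)  # prints a per-module table on exit
with flop_counter:
    model(x).sum().backward()

print(f"total FLOPs (fwd + bwd): {flop_counter.get_total_flops():,}")
```

If this runs cleanly on the MI300X but the FSDP run still fails, that would point at the interaction between the counter's module tracking and FSDP rather than at ROCm's op coverage.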
@wukaixingxp I was able to run the official FlopTensorDispatchMode example. However, I encountered the same issue when installing PyTorch via pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1. Interestingly, everything works fine without FSDP, but when I add the --enable_fsdp flag, it fails as mentioned earlier. Do you have any further insights on this?
Unfortunately, we do not have a way to test and support ROCm at the moment. We will keep you updated once that changes.
Hi, I had a similar flop counter problem while using multiple GPUs to finetune Llama 70B, but everything works fine when I use a single GPU to finetune Llama 8B.
Here is the command
🚀 The feature, motivation and pitch
I am able to run training with FSDP, but when I add the "--flop_counter" flag it fails with the errors below. Could someone take a look at this issue? Also, would it be possible to report the FLOP count by default? Thanks. (A rough sketch of per-step FLOP counting under FSDP follows the traceback below.)
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
0: :0:rocdevice.cpp :2875: 1456647898545 us: [pid:3202 tid:0x7f2309bff700] Callback: Queue 0x7ee2fba00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 234 MB
0: :0:rocdevice.cpp :2875: 1456647904587 us: [pid:3198 tid:0x7f020bbff700] Callback: Queue 0x7f0208200000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 226 MB
1: :0:rocdevice.cpp :2875: 2157204836001 us: [pid:208 tid:0x7fb03b1ff700] Callback: Queue 0x7f7031a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
1: :0:rocdevice.cpp :2875: 2157204836358 us: [pid:207 tid:0x7f0c92fff700] Callback: Queue 0x7ecc89800000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 246 MB
1: :0:rocdevice.cpp :2875: 2157204838420 us: [pid:203 tid:0x7f59a81ff700] Callback: Queue 0x7f199ea00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
0: :0:rocdevice.cpp :2875: 1456647929027 us: [pid:3201 tid:0x7f2e33bff700] Callback: Queue 0x7f2e30200000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 226 MB
0: :0:rocdevice.cpp :2875: 1456648084561 us: [pid:3203 tid:0x7fac6c1ff700] Callback: Queue 0x7f6c62a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 246 MB
1: W0823 17:42:51.540936 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 202 closing signal SIGTERM
1: W0823 17:42:51.543727 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 203 closing signal SIGTERM
1: W0823 17:42:51.544753 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 204 closing signal SIGTERM
1: W0823 17:42:51.547995 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 205 closing signal SIGTERM
1: W0823 17:42:51.549960 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 206 closing signal SIGTERM
1: W0823 17:42:51.552839 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 208 closing signal SIGTERM
1: W0823 17:42:51.553608 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 209 closing signal SIGTERM
0: :0:rocdevice.cpp :2875: 1456648361779 us: [pid:3204 tid:0x7f04efdff700] Callback: Queue 0x7ec4e6600000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
0: W0823 17:42:51.587928 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3198 closing signal SIGTERM
0: W0823 17:42:51.588275 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3199 closing signal SIGTERM
0: W0823 17:42:51.591612 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3200 closing signal SIGTERM
0: W0823 17:42:51.592847 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3203 closing signal SIGTERM
0: W0823 17:42:51.595798 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3204 closing signal SIGTERM
0: W0823 17:42:51.597895 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3205 closing signal SIGTERM
0: E0823 17:42:52.329508 3126 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -6) local_rank: 3 (pid: 3201) of binary: /opt/conda/envs/py_3.8/bin/python
0: Traceback (most recent call last):
0: File "/opt/conda/envs/py_3.8/bin/torchrun", line 33, in
0: sys.exit(load_entry_point('torch==2.5.0a0+git10344d7', 'console_scripts', 'torchrun')())
0: File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
0: return f(*args, **kwargs)
0: File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 919, in main
0: run(args)
0: File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 910, in run
0: elastic_launch(
0: File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 138, in call
0: return launch_agent(self._config, self._entrypoint, list(args))
0: File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
0: raise ChildFailedError(
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
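For context on what a --flop_counter style option has to do, here is a hedged sketch of counting FLOPs for a single training step on rank 0 while the model is FSDP-wrapped. This is not necessarily how this repo implements it, and the function and argument names below are illustrative:

```python
# Hypothetical sketch of per-step FLOP counting under FSDP: only rank 0 wraps
# the step in FlopCounterMode; other ranks run a no-op context manager.
# Not necessarily how this repo implements --flop_counter; names are illustrative.
import contextlib
import torch.distributed as dist
from torch.utils.flop_counter import FlopCounterMode

def train_step_with_flops(model, optimizer, batch, measure_flops=False):
    rank = dist.get_rank() if dist.is_initialized() else 0
    ctx = FlopCounterMode(display=False) if (measure_flops and rank == 0) else contextlib.nullcontext()

    with ctx:
        loss = model(**batch).loss
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if measure_flops and rank == 0:
        print(f"step FLOPs on rank 0: {ctx.get_total_flops():,}")
    return loss
```

The "module hierarchy tracking seems to be messed up" warnings above come from the counter's module tracking, which is exactly the part that interacts with FSDP's wrapping, so a per-step wrapper like this is where the failure would surface.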
Alternatives
No response
Additional context
No response