I'm trying to train RVRT, but it just fails like this:
python -m torch.distributed.launch --master_port=1234 main_train_vrt.py --opt options/rvrt/005_train_rvrt_videodeblurring_gopro.json --dist True
/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects `--local-rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 10408) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main_train_vrt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-09_22:57:36
  host      : dalaran
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 10408)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

[skerit@dalaran KAIR]$ python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 main_train_vrt.py --opt options/rvrt/005_train_rvrt_videodeblurring_gopro.json --dist True
/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects `--local-rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=5
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=4
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=1
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=3
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=2
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=6
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=0
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=7
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 10450) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main_train_vrt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 10451)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 10452)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 10453)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 10454)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 10455)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 10456)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 10457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 10450)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Try with the following command:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 main_train_vrt.py --opt ./options/rvrt/... --dist True
Note: set --nproc_per_node to the number of available GPUs.
Try using torchrun instead of python -m torch.distributed.launch.
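The log above shows the launcher passing --local-rank=N (with a hyphen), while the script's usage line only declares --local_rank (with an underscore), and the FutureWarning suggests reading LOCAL_RANK from the environment instead. A minimal sketch of an argument parser that tolerates both spellings and falls back to the environment variable could look like this. The option names are taken from the usage line in the log; the defaults, types, and help strings are assumptions and may not match what main_train_vrt.py actually uses.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse launcher arguments; hypothetical stand-in for the script's own parser."""
    parser = argparse.ArgumentParser()
    # Option names come from the usage line in the log; defaults are assumptions.
    parser.add_argument("--opt", type=str, help="Path to the option JSON file.")
    parser.add_argument("--launcher", default="pytorch", help="Job launcher.")
    # Accept both spellings: older launchers pass --local_rank, while the
    # launcher in the log passes --local-rank. If neither is given (as with
    # torchrun), fall back to the LOCAL_RANK environment variable.
    parser.add_argument(
        "--local_rank",
        "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    parser.add_argument("--dist", default=False)
    return parser.parse_args(argv)
```

With a parser like this, the same script should start under both python -m torch.distributed.launch and torchrun, since the local rank is taken from whichever source is present.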
This comment might be helpful: open-mmlab/mmdetection#10024 (comment)