Training RVRT just fails #165

Open
skerit opened this issue Apr 9, 2023 · 3 comments

Comments

@skerit

skerit commented Apr 9, 2023

I'm trying to train RVRT, but it just fails like this:

python -m torch.distributed.launch  --master_port=1234 main_train_vrt.py --opt options/rvrt/005_train_rvrt_videodeblurring_gopro.json  --dist True
/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 10408) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main_train_vrt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-09_22:57:36
  host      : dalaran
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 10408)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[skerit@dalaran KAIR]$ python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 main_train_vrt.py --opt options/rvrt/005_train_rvrt_videodeblurring_gopro.json  --dist True
/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=5
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=4
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=1
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=3
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=2
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=6
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=0
usage: main_train_vrt.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_vrt.py: error: unrecognized arguments: --local-rank=7
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 10450) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/skerit/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main_train_vrt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 10451)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 10452)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 10453)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 10454)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 10455)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 10456)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 10457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-09_22:57:57
  host      : dalaran
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 10450)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@fookseng

Try with the following command:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 main_train_vrt.py --opt ./options/rvrt/... --dist True

Note: set --nproc_per_node to the number of available GPUs.

@GuoCheng12

Try using torchrun instead of python -m torch.distributed.launch.
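
A minimal sketch of what the FutureWarning in the log above asks for: with torchrun, the local rank arrives in the LOCAL_RANK environment variable rather than as a command-line flag, so the script would read it from os.environ. The helper name get_local_rank and the fallback value of 0 are assumptions for illustration, not code from KAIR.

```python
import os


def get_local_rank(default: int = 0) -> int:
    """Return the per-node rank exported by the launcher as LOCAL_RANK.

    Falls back to `default` when the variable is absent, for example when
    the script is started directly without a distributed launcher.
    """
    return int(os.environ.get("LOCAL_RANK", default))


if __name__ == "__main__":
    # Hypothetical usage: print the rank this process was assigned.
    print(f"local rank: {get_local_rank()}")
```

Reading the value from the environment keeps the script independent of how the launcher spells the flag.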

@yuanzhi-zhu
Collaborator

This might be helpful: open-mmlab/mmdetection#10024 (comment)
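
Going only by the error message in the log (the script declares --local_rank, while the launcher passes --local-rank=N), one possible workaround is to register both spellings on the same argparse option. The sketch below is an assumption about how that could look, not the actual contents of main_train_vrt.py. The option names are taken from the usage line in the log, and their types and defaults are not shown in this thread, so they are left unset or guessed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the options shown in the usage line of the log."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--opt")
    parser.add_argument("--launcher")
    # Accept both the underscore and the hyphen spelling; both store
    # their value in args.local_rank. The default of 0 is an assumption.
    parser.add_argument(
        "--local_rank", "--local-rank", dest="local_rank", type=int, default=0
    )
    parser.add_argument("--dist")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.local_rank)
```

With this parser, both `--local-rank=5` and `--local_rank 5` are accepted and populate the same attribute.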
