Thank you very much for your team's valuable work. I have completed evaluation on the validation set using the pre-trained model. However, when I try to train the model with the following command:
python -m torch.distributed.run --nproc_per_node=1 --master_port=2333 tools/train.py projects/configs/VAD/VAD_base_stage_2.py --launcher pytorch --deterministic --work-dir /data03/work2_data/VAD/trained-net
an error occurs. The full error output is below. I would be very grateful for any help in solving it.
`Traceback (most recent call last):
File "tools/train.py", line 266, in <module>
main()
File "tools/train.py", line 111, in main
cfg = Config.fromfile(args.config)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 331, in fromfile
cfg_dict, cfg_text = Config._file2dict(filename,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 194, in _file2dict
Config._substitute_predefined_vars(filename,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 119, in _substitute_predefined_vars
config_file = f.read()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1289: invalid start byte
Exception ignored in: <function _TemporaryFileCloser.__del__ at 0x7fca0d6fcaf0>
Traceback (most recent call last):
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/tempfile.py", line 456, in __del__
self.close()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/tempfile.py", line 452, in close
unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpb0tjnbig/tmplkaxfpnw.py'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 119961) of binary: /data03/work2_data/tools/anaconda/envs/vad/bin/python
/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 119961 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================
Root Cause:
[0]:
time: 2024-09-04_11:32:12
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 119961)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
<NO_OTHER_FAILURES>
`
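The first exception in the log is the relevant one: `Config.fromfile` fails while reading a config file as UTF-8. Byte `0xc1` can never occur in valid UTF-8, so the file was probably saved or edited in a different encoding (for example, a comment containing non-ASCII characters). The rest of the output is the launcher reporting that the child process exited. As a rough diagnostic sketch, the helper below lists where decoding fails. The function name `find_invalid_utf8` is hypothetical and not part of mmcv or VAD, and the offending file might be the config itself or one of the `_base_` files it includes.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (byte_offset, line_number, byte_value) for each position
    where UTF-8 decoding of `data` fails.

    Hypothetical diagnostic helper, not part of mmcv or VAD.
    """
    problems = []
    start = 0
    while start < len(data):
        try:
            # Decode the remainder; success means no further bad bytes.
            data[start:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            offset = start + exc.start  # absolute offset of the bad byte
            line_number = data.count(b"\n", 0, offset) + 1  # 1-based line
            problems.append((offset, line_number, data[offset]))
            start = offset + 1  # skip the bad byte and keep scanning
    return problems
```

To check the config from the command above, read it in binary mode and pass the bytes in, for example `find_invalid_utf8(open("projects/configs/VAD/VAD_base_stage_2.py", "rb").read())`. If an offset is reported (the traceback mentions position 1289), re-saving that file as UTF-8 or removing the non-ASCII characters on the reported line is a likely fix, though this has not been verified against the repository.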