Training Problem #77

wanghuanqi · 2024-09-04T03:54:53Z

Thank you very much for your team's valuable work. I have completed evaluation on the validation set using the pre-trained model. However, when I try to train the model with the following command:

python -m torch.distributed.run --nproc_per_node=1 --master_port=2333 tools/train.py projects/configs/VAD/VAD_base_stage_2.py --launcher pytorch --deterministic --work-dir /data03/work2_data/VAD/trained-net

an error occurs. The full error output is below. I hope you can help me solve it; I would be very grateful.

`Traceback (most recent call last):
File "tools/train.py", line 266, in <module>
main()
File "tools/train.py", line 111, in main
cfg = Config.fromfile(args.config)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 331, in fromfile
cfg_dict, cfg_text = Config._file2dict(filename,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 194, in _file2dict
Config._substitute_predefined_vars(filename,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 119, in _substitute_predefined_vars
config_file = f.read()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1289: invalid start byte
Exception ignored in: <function _TemporaryFileCloser.__del__ at 0x7fca0d6fcaf0>
Traceback (most recent call last):
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/tempfile.py", line 456, in __del__
self.close()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/tempfile.py", line 452, in close
unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpb0tjnbig/tmplkaxfpnw.py'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 119961) of binary: /data03/work2_data/tools/anaconda/envs/vad/bin/python
/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:

           CHILD PROCESS FAILED WITH NO ERROR_FILE

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 119961 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
# do train

warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

     tools/train.py FAILED

=======================================
Root Cause:
[0]:
time: 2024-09-04_11:32:12
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 119961)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
<NO_OTHER_FAILURES>

`
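
The first exception in the output, `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1289`, is raised while mmcv reads a config file as UTF-8. Only one `_file2dict` frame appears in the traceback, which suggests the file in question is the config passed on the command line. The byte 0xc1 is never valid in UTF-8, so the file probably contains text saved in another encoding. The following is a minimal diagnostic sketch, not part of VAD or mmcv: the helper names `find_invalid_utf8` and `reencode_to_utf8` are hypothetical, and the idea that the stray bytes are GBK (for example, from a comment added in an editor that does not save as UTF-8) is only an assumption to be checked against the printed bytes.

```python
from pathlib import Path


def find_invalid_utf8(path, context=30):
    """Locate the first byte in `path` that is not valid UTF-8.

    Returns (byte_offset, line_number, surrounding_bytes), or None when the
    whole file decodes cleanly as UTF-8.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start
        # Line numbers are 1-based: count the newlines before the bad byte.
        line_number = data.count(b"\n", 0, offset) + 1
        snippet = data[max(0, offset - context): offset + context]
        return offset, line_number, snippet
    return None


def reencode_to_utf8(path, source_encoding):
    """Rewrite `path` as UTF-8, ASSUMING it is currently in `source_encoding`.

    The source encoding is a guess supplied by the caller (e.g. "gbk").
    Decoding raises UnicodeDecodeError if the guess cannot decode the file,
    in which case the file is left untouched. Back the file up first.
    """
    p = Path(path)
    text = p.read_bytes().decode(source_encoding)
    # Bytes in, bytes out: line endings are preserved exactly.
    p.write_bytes(text.encode("utf-8"))
```

Calling `find_invalid_utf8("projects/configs/VAD/VAD_base_stage_2.py")` should report the offending line, which can then be deleted or retyped by hand. `reencode_to_utf8(path, "gbk")` is an alternative only if the printed bytes really do look like GBK text.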
