Training Problem #77

Open
wanghuanqi opened this issue Sep 4, 2024 · 0 comments

@wanghuanqi

Thank you very much for your team's valuable work. I have completed evaluation on the validation set using the pre-trained model. However, when I try to train the model with the following command, an error occurs:

python -m torch.distributed.run --nproc_per_node=1 --master_port=2333 tools/train.py projects/configs/VAD/VAD_base_stage_2.py --launcher pytorch --deterministic --work-dir /data03/work2_data/VAD/trained-net

The full error output is below. I would be very grateful for any help in solving it.

`Traceback (most recent call last):
File "tools/train.py", line 266, in <module>
main()
File "tools/train.py", line 111, in main
cfg = Config.fromfile(args.config)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 331, in fromfile
cfg_dict, cfg_text = Config._file2dict(filename,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 194, in _file2dict
Config._substitute_predefined_vars(filename,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/mmcv/utils/config.py", line 119, in _substitute_predefined_vars
config_file = f.read()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1289: invalid start byte
Exception ignored in: <function _TemporaryFileCloser.__del__ at 0x7fca0d6fcaf0>
Traceback (most recent call last):
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/tempfile.py", line 456, in __del__
self.close()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/tempfile.py", line 452, in close
unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpb0tjnbig/tmplkaxfpnw.py'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 119961) of binary: /data03/work2_data/tools/anaconda/envs/vad/bin/python
/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 119961 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
# do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


     tools/train.py FAILED

=======================================
Root Cause:
[0]:
time: 2024-09-04_11:32:12
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 119961)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
<NO_OTHER_FAILURES>


`
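
The first traceback shows the `UnicodeDecodeError` being raised while mmcv reads a config file as UTF-8 (byte 0xc1 at position 1289). One plausible cause, which is an assumption since the issue does not confirm it, is that the config file or one of its base configs was edited and saved with non-UTF-8 text, for example a comment or path written in GBK. The following is a minimal sketch for locating such bytes. The helper names `find_invalid_utf8` and `report` are hypothetical and are not part of mmcv or VAD.

```python
from pathlib import Path
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for each position where UTF-8 decoding fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            offset = pos + err.start  # err.start is relative to the slice
            bad.append((offset, data[offset]))
            pos = offset + 1  # skip the offending byte and keep scanning
    return bad


def report(path: str) -> None:
    """Print the offset and line number of every undecodable byte in a file."""
    data = Path(path).read_bytes()
    for offset, value in find_invalid_utf8(data):
        line_no = data.count(b"\n", 0, offset) + 1
        print(f"{path}: byte 0x{value:02x} at offset {offset} (line {line_no})")
```

Calling `report("projects/configs/VAD/VAD_base_stage_2.py")` from the repository root would list the suspect lines, if any. If it reports something, re-saving that file as UTF-8 or removing the non-ASCII text on the reported lines should let `Config.fromfile` read it.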
