bug: crash when the `--output_dir` doesn't exist and multiple GPUs (processes) are used. #359

HarikrishnanBalagopal · 2024-10-01T09:33:00Z

Overview

Possibly a race condition: the trainer crashes when multiple GPUs (processes) are used and the output directory doesn't exist.

Steps to reproduce

Run a multi-GPU job with torchrun, pointing `--output_dir` at a directory that doesn't exist yet.

torchrun --nnodes=1 --nproc-per-node=8 --rdzv_id=101 --rdzv_endpoint=127.0.0.1:12345 -m tuning.sft_trainer --fsdp full_shard auto_wrap --fsdp_config=fsdp_config.json --model_name_or_path=TinyLlama/TinyLlama-1.1B-step-50K-105b --training_data_path=TIGER-Lab/SKGInstruct-skg-only --output_dir=train_output --gradient_checkpointing=false --data_formatter_template=Question:\\n\\n\\nAnswer:\\n --response_template=\\n\\nAnswer:\\n --log_level=debug --tracker=aim --aim_repo=/aim/aimrepo --experiment=test-1 --max_steps=100 --logging_strategy=steps --logging_steps=1 --save_strategy=steps --save_steps=50 --learning_rate=1e-06

Actual Output

  warnings.warn(
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/transformers/training_args.py:2007: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
WARNING:sft_trainer.py:dataset kwargs None
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': dataset_text_field, max_seq_length. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:280: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/tuning/sft_trainer.py", line 639, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/tuning/sft_trainer.py", line 418, in train
    resume_from_checkpoint = get_last_checkpoint(training_args.output_dir)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/transformers/trainer_utils.py", line 212, in get_last_checkpoint
    content = os.listdir(folder)
              ^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'train_output'

From error.log file

Unable to load file: [Errno 2] No such file or directory: 'train_output'
[stderr.txt](https://github.com/user-attachments/files/17202794/stderr.txt)
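The last frame of the traceback is a plain `os.listdir` call on the output directory. A minimal sketch of that failure outside the trainer is below. The temporary directory and the `missing_dir` name are illustrative assumptions, not part of the tuning code; only the `os.listdir` behaviour comes from the traceback.

```python
import os
import tempfile

# Hypothetical stand-in for --output_dir=train_output: a path that is never created.
with tempfile.TemporaryDirectory() as tmp:
    missing_dir = os.path.join(tmp, "train_output")
    try:
        # Same call as the last frame of the traceback.
        os.listdir(missing_dir)
        error = None
    except FileNotFoundError as exc:
        error = exc

# error.errno is errno.ENOENT, matching "[Errno 2]" in the log above.
print(type(error).__name__, error.errno)
```

So any process that reaches the checkpoint lookup before the directory exists would hit this error, regardless of what the other processes are doing.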

Expected Output

The trainer should not crash when multiple GPUs (processes) are used.
The non-existent output directory should be created, the same as in the single-GPU case.
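One way to get the expected behaviour is to create the directory before looking for checkpoints. The sketch below is only an illustration and is not necessarily what the eventual fix does. `find_last_checkpoint` is a hypothetical helper, and the `checkpoint-<step>` folder naming is an assumption about how checkpoints are laid out.

```python
import os
import re
from typing import Optional

# Assumed naming convention for checkpoint folders: "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Hypothetical helper: return the newest checkpoint path, or None."""
    # exist_ok=True keeps this safe when several ranks call it at once:
    # a rank that loses the race finds the directory already there and moves on.
    os.makedirs(output_dir, exist_ok=True)

    steps = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))

    if not steps:
        # Fresh run: nothing to resume from.
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")
```

With this shape, a fresh run returns `None` on every rank instead of raising, and a rerun against an existing directory resumes from the highest-numbered checkpoint.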

HarikrishnanBalagopal · 2024-10-01T21:19:06Z

Related #352

anhuong · 2024-10-08T15:20:27Z

This was merged in, closing the issue.

HarikrishnanBalagopal mentioned this issue Oct 2, 2024

fix: crash when output directory doesn't exist #364

Merged

2 tasks

anhuong closed this as completed Oct 8, 2024
