bug: crash when the --output_dir doesn't exist and multiple GPUs (processes) are used. #359

Closed
HarikrishnanBalagopal opened this issue Oct 1, 2024 · 2 comments

HarikrishnanBalagopal commented Oct 1, 2024

Overview

Possible race condition leading to a crash when multiple GPUs (processes) are used and the output directory doesn't exist.

Steps to reproduce

Run a multi-GPU job with torchrun and a non-existent output directory.

torchrun --nnodes=1 --nproc-per-node=8 --rdzv_id=101 --rdzv_endpoint=127.0.0.1:12345 -m tuning.sft_trainer --fsdp full_shard auto_wrap --fsdp_config=fsdp_config.json --model_name_or_path=TinyLlama/TinyLlama-1.1B-step-50K-105b --training_data_path=TIGER-Lab/SKGInstruct-skg-only --output_dir=train_output --gradient_checkpointing=false --data_formatter_template=Question:\\n\\n\\nAnswer:\\n --response_template=\\n\\nAnswer:\\n --log_level=debug --tracker=aim --aim_repo=/aim/aimrepo --experiment=test-1 --max_steps=100 --logging_strategy=steps --logging_steps=1 --save_strategy=steps --save_steps=50 --learning_rate=1e-06

Actual Output

  warnings.warn(
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/transformers/training_args.py:2007: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
WARNING:sft_trainer.py:dataset kwargs None
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': dataset_text_field, max_seq_length. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:280: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/tuning/sft_trainer.py", line 639, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/tuning/sft_trainer.py", line 418, in train
    resume_from_checkpoint = get_last_checkpoint(training_args.output_dir)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/transformers/trainer_utils.py", line 212, in get_last_checkpoint
    content = os.listdir(folder)
              ^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'train_output'

From the error.log file:

Unable to load file: [Errno 2] No such file or directory: 'train_output'
[stderr.txt](https://github.com/user-attachments/files/17202794/stderr.txt)
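
The traceback above ends in `os.listdir(folder)`, so the failure can be illustrated without torchrun or transformers. The following is a minimal sketch using only the standard library; `list_checkpoints` and `demo_missing_dir` are hypothetical names made up for illustration, not functions from transformers or this project.

```python
import errno
import os
import tempfile


def list_checkpoints(folder):
    """Hypothetical stand-in for the directory listing step in the traceback."""
    return [name for name in os.listdir(folder) if name.startswith("checkpoint-")]


def demo_missing_dir():
    """Return True if listing a directory that was never created raises ENOENT."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "train_output")  # intentionally not created
        try:
            list_checkpoints(missing)
        except FileNotFoundError as err:
            # Same error class as in the log: [Errno 2] No such file or directory
            return err.errno == errno.ENOENT
    return False
```

Calling `demo_missing_dir()` exercises the same error path as the log: any process that lists the output directory before it exists hits `FileNotFoundError`.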

Expected Output

The trainer should not crash in the multi-GPU (multi-process) case.
A non-existent output directory should be created, the same as in the single-GPU case.
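
One way to get the expected behavior is to make sure the directory exists before it is listed. The sketch below is an assumption about what such a guard could look like, not the fix that was merged for this issue; `safe_last_checkpoint` is a hypothetical helper, and the `checkpoint-<step>` naming is assumed for illustration.

```python
import os
import re

# Assumed naming scheme for checkpoint subdirectories: checkpoint-<step>
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def safe_last_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint directory, or None.

    Hypothetical helper: creates output_dir first so that listing it cannot
    fail with FileNotFoundError.
    """
    # exist_ok=True tolerates another process creating the directory first.
    os.makedirs(output_dir, exist_ok=True)

    steps = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append((int(match.group(1)), name))

    if not steps:
        return None
    # max() compares the numeric step first, so checkpoint-100 beats checkpoint-50.
    return os.path.join(output_dir, max(steps)[1])
```

Because `os.makedirs(..., exist_ok=True)` does not fail when the directory already exists, every process can call the helper independently without coordinating who creates the directory.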

HarikrishnanBalagopal (Contributor, Author) commented

Related #352

anhuong commented Oct 8, 2024

This was merged in, so closing the issue.

anhuong closed this as completed Oct 8, 2024