warnings.warn(
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/transformers/training_args.py:2007: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
warnings.warn(
WARNING:sft_trainer.py:dataset kwargs None
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': dataset_text_field, max_seq_length. Will not be supported from version '1.0.0'.
Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
warnings.warn(message, FutureWarning)
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:280: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
ERROR:sft_trainer.py:Traceback (most recent call last):
File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/tuning/sft_trainer.py", line 639, in main
trainer, additional_train_info = train(
^^^^^^
File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/tuning/sft_trainer.py", line 418, in train
resume_from_checkpoint = get_last_checkpoint(training_args.output_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/proj/data-eng/haribala/conda/envs/platform/lib/python3.11/site-packages/transformers/trainer_utils.py", line 212, in get_last_checkpoint
content = os.listdir(folder)
^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'train_output'
Overview
A likely race condition leads to a crash when multiple GPUs (processes) are used and the output directory doesn't exist.
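A minimal sketch (outside the trainer; paths are illustrative) of the underlying failure mode: `transformers.trainer_utils.get_last_checkpoint` calls `os.listdir(folder)` directly, so any rank that reaches it before the output directory exists raises `FileNotFoundError`:

```python
import os
import tempfile

# Simulate what a rank does when no process has created the output
# directory yet: os.listdir on a missing path raises immediately.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "train_output")  # never created
    try:
        os.listdir(missing)
    except FileNotFoundError as exc:
        print(f"rank would crash here: {exc}")
```

In a single-process run the directory is apparently created before this point; with multiple ranks, whichever rank scans first can lose the race.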
Steps to reproduce
Run a multi-GPU job with torchrun and a non-existent output directory.
Actual Output
From error.log file
Unable to load file: [Errno 2] No such file or directory: 'train_output'
[stderr.txt](https://github.com/user-attachments/files/17202794/stderr.txt)
Expected Output
The trainer should not crash in the multi-GPU (multi-process) case.
The non-existent output directory should be created, the same as in the single-GPU case.
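One possible fix, sketched here with a hypothetical helper (the name and structure are mine, not the repo's): create the output directory with `exist_ok=True` before scanning it. `os.makedirs(..., exist_ok=True)` is idempotent, so it is safe even when several ranks race to create the same directory:

```python
import os
import re


def safe_get_last_checkpoint(output_dir):
    """Hypothetical guard around checkpoint discovery: ensure the
    directory exists, then mimic the scan that
    transformers.trainer_utils.get_last_checkpoint performs."""
    # exist_ok=True makes this a no-op if another rank created it first,
    # so every rank can call this without a FileNotFoundError.
    os.makedirs(output_dir, exist_ok=True)
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = [
        d
        for d in os.listdir(output_dir)
        if pattern.match(d) and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not checkpoints:
        return None  # fresh directory: nothing to resume from
    latest = max(checkpoints, key=lambda d: int(pattern.match(d).group(1)))
    return os.path.join(output_dir, latest)
```

With this guard, a missing `train_output` directory yields `None` (start from scratch) instead of a crash, matching the single-GPU behavior described above.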