IndexError: pop from an empty deque after 20 seconds of training #3386

Open
JohnConnor123 opened this issue Feb 7, 2025 · 3 comments

JohnConnor123 commented Feb 7, 2025

System Info

- `Accelerate` version: 1.3.0
- Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.39
- `accelerate` bash location: /home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/accelerate
- Python version: 3.12.3
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 31.26 GB
- GPU type: NVIDIA GeForce RTX 4060 Ti
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': False, 'zero3_save_16bit_model': False, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

I'm training my LLM with PPO from the Hugging Face TRL library. The algorithm is very memory-hungry, so I'm trying to offload gradients and optimizer states to the CPU to reduce VRAM usage on my single GPU. I almost got PPO running with only 6 GB of VRAM instead of 10 GB, at almost the same speed; to do this I had to roll accelerate back from 1.3.0 to 0.34.2 and tweak the accelerate config. However, after a while (about 20 seconds of training) the error `IndexError: pop from an empty deque` appears. On the latest version of accelerate a different error appears:
[rank0]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank0]:[W207 01:12:34.028098829 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0207 01:12:35.515000 130698066202752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 661843) of binary: /home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/python
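
For reference, the `IndexError: pop from an empty deque` message from the 0.34.2 run is the one Python's standard `collections.deque` raises when `pop()` is called on a deque that has no elements left. The sketch below only reproduces that message in isolation. The `pending` queue is a hypothetical stand-in, and nothing here identifies which queue inside the training stack actually runs empty.

```python
from collections import deque

# Hypothetical work queue holding a single item (an assumption for illustration).
pending = deque(["step-0"])
pending.pop()  # succeeds: removes the only item

try:
    pending.pop()  # the deque is now empty, so this raises IndexError
except IndexError as exc:
    message = str(exc)
    print(f"IndexError: {message}")  # prints "IndexError: pop from an empty deque"
```

So the error means some component popped more entries from a queue than were ever pushed onto it; the full traceback is needed to tell which one.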

single_gpu.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
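
To make it easier to compare this file with the DeepSpeed user config JSON printed in the 0.34.2 log, here is a rough sketch of how these fields appear to correspond. `to_deepspeed_json` is a hypothetical helper written for illustration, not Accelerate's actual conversion code, and the micro batch size of 1 is an assumption taken from the log rather than from this YAML.

```python
import json

# Values copied from single_gpu.yaml above (only the fields used here).
accelerate_cfg = {
    "mixed_precision": "bf16",
    "num_processes": 1,
    "deepspeed_config": {
        "gradient_accumulation_steps": 1,
        "offload_optimizer_device": "cpu",
        "offload_param_device": "cpu",
        "zero3_save_16bit_model": False,
        "zero_stage": 3,
    },
}


def to_deepspeed_json(cfg, micro_batch_size=1):
    """Hypothetical helper: an illustrative mapping, not Accelerate's real code."""
    ds = cfg["deepspeed_config"]
    gas = ds["gradient_accumulation_steps"]
    return {
        # DeepSpeed expects: train batch = micro batch * accumulation steps * world size
        "train_batch_size": micro_batch_size * gas * cfg["num_processes"],
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": gas,
        "zero_optimization": {
            "stage": ds["zero_stage"],
            "offload_optimizer": {"device": ds["offload_optimizer_device"]},
            "offload_param": {"device": ds["offload_param_device"]},
            "stage3_gather_16bit_weights_on_model_save": ds["zero3_save_16bit_model"],
        },
        "bf16": {"enabled": cfg["mixed_precision"] == "bf16"},
        "fp16": {"enabled": cfg["mixed_precision"] == "fp16"},
    }


ds_json = to_deepspeed_json(accelerate_cfg)
print(json.dumps(ds_json, indent=4))
```

The printed dictionary should line up with the `zero_optimization`, `bf16` and `fp16` entries of the logged JSON; keys such as `gradient_clipping` and `steps_per_print` are left out because they do not come from this YAML.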

With accelerate:latest (==1.3.0):

(skoltech-llm-long-context-py3.12) calibri@devai:~/experiments/rl_finetunning$ source start-ppo-with-deepspeed.sh
[2025-02-07 01:12:20,356] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:12:23,879] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:12:24,770] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-07 01:12:24,771] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-02-07 01:12:25,947] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:12:26,727] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 0.49B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:12:27,329] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:12:27,995] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 582, num_elems = 0.99B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:12:28,826] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:12:29,486] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 873, num_elems = 1.62B
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/calibri/experiments/rl_finetunning/trl/examples/scripts/ppo/ppo.py", line 152, in <module>
[rank0]:     trainer = PPOTrainer(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   [Previous line repeated 1 more time]
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/ppo_trainer.py", line 194, in __init__
[rank0]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/accelerator.py", line 302, in __init__
[rank0]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank0]:                         ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/state.py", line 887, in __init__
[rank0]:     raise ValueError(
[rank0]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank0]:[W207 01:12:34.028098829 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E0207 01:12:35.515000 130698066202752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 661843) of binary: /home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/python
Traceback (most recent call last):
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1157, in launch_command
    deepspeed_launcher(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 845, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trl/examples/scripts/ppo/ppo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-07_01:12:35
  host      : devai
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 661843)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

With accelerate==0.34.2:

(skoltech-llm-long-context-py3.12) calibri@devai:~/experiments/rl_finetunning$ source start-ppo-with-deepspeed.sh
[2025-02-07 01:13:21,262] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:13:24,750] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:13:25,644] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-07 01:13:25,644] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-02-07 01:13:26,568] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:27,351] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 0.49B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:13:27,973] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:28,645] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 582, num_elems = 0.99B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:13:29,418] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:30,081] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 873, num_elems = 1.62B
Using /home/calibri/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Emitting ninja build file /home/calibri/.cache/torch_extensions/py312_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.179290294647217 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000003, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2025-02-07 01:13:37,954] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-02-07 01:13:37,954] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:37,973] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-02-07 01:13:37,974] [INFO] [logging.py:128:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-02-07 01:13:37,974] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-02-07 01:13:37,995] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-02-07 01:13:37,995] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-02-07 01:13:37,995] [INFO] [logging.py:128:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-02-07 01:13:37,995] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-02-07 01:13:38,133] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-02-07 01:13:38,134] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.76 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,134] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.32 GB, percent = 17.0%
[2025-02-07 01:13:38,137] [INFO] [stage3.py:169:__init__] Reduce bucket size 500000000
[2025-02-07 01:13:38,137] [INFO] [stage3.py:170:__init__] Prefetch bucket size 50000000
[2025-02-07 01:13:38,263] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-02-07 01:13:38,263] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,263] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.32 GB, percent = 17.0%
Parameter Offload: Total persistent parameters: 2306688 in 339 params
[2025-02-07 01:13:38,449] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-02-07 01:13:38,450] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,450] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.32 GB, percent = 17.0%
[2025-02-07 01:13:38,585] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-02-07 01:13:38,586] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,586] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.31 GB, percent = 17.0%
[2025-02-07 01:13:39,436] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2025-02-07 01:13:39,437] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:39,437] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.23 GB, percent = 19.9%
[2025-02-07 01:13:39,579] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-02-07 01:13:39,579] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:39,579] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.21 GB, percent = 19.9%
[2025-02-07 01:13:40,037] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-02-07 01:13:40,037] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:40,038] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 8.07 GB, percent = 25.8%
[2025-02-07 01:13:40,175] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-02-07 01:13:40,176] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:40,176] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 8.07 GB, percent = 25.8%
[2025-02-07 01:13:40,689] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-02-07 01:13:40,689] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:40,689] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 9.88 GB, percent = 31.6%
[2025-02-07 01:13:40,690] [INFO] [stage3.py:529:_setup_for_real_optimizer] optimizer state initialized
[2025-02-07 01:13:41,060] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-02-07 01:13:41,060] [INFO] [utils.py:782:see_memory_usage] MA 0.93 GB         Max_MA 1.44 GB         CA 1.7 GB         Max_CA 2 GB
[2025-02-07 01:13:41,060] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 10.65 GB, percent = 34.1%
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-06, 3e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-02-07 01:13:41,062] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   amp_enabled .................. False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   amp_params ................... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x78ba035a7d70>
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   communication_data_type ...... None
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   disable_allgather ............ False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   dump_state ................... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   fp16_enabled ................. False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   global_rank .................. 0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   graph_harvesting ............. False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   memory_breakdown ............. False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   optimizer_name ............... None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   optimizer_params ............. None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   pld_enabled .................. False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   pld_params ................... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   prescale_gradients ........... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   scheduler_name ............... None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   scheduler_params ............. None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   sparse_attention ............. None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   steps_per_print .............. inf
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   train_batch_size ............. 1
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   weight_quantization_config ... None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   world_size ................... 1
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_enabled ................. True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 3
[2025-02-07 01:13:41,065] [INFO] [config.py:989:print_user_config]   json = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "nvme_path": null
        },
        "offload_param": {
            "device": "cpu",
            "nvme_path": null
        },
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true
}
[2025-02-07 01:13:41,066] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-02-07 01:13:41,066] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:41,070] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-02-07 01:13:41,071] [INFO] [logging.py:128:log_dist] [Rank 0] Creating ZeRO Offload
[2025-02-07 01:13:41,211] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-02-07 01:13:41,212] [INFO] [utils.py:782:see_memory_usage] MA 0.93 GB         Max_MA 0.93 GB         CA 1.7 GB         Max_CA 2 GB
[2025-02-07 01:13:41,212] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 10.65 GB, percent = 34.1%
Parameter Offload: Total persistent parameters: 72448 in 122 params
[2025-02-07 01:13:41,360] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-02-07 01:13:41,361] [INFO] [utils.py:782:see_memory_usage] MA 0.93 GB         Max_MA 0.93 GB         CA 1.7 GB         Max_CA 2 GB
[2025-02-07 01:13:41,361] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 10.65 GB, percent = 34.1%
[2025-02-07 01:13:41,362] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   amp_enabled .................. False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   amp_params ................... False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x78b8ddb07260>
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   communication_data_type ...... None
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   disable_allgather ............ False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   dump_state ................... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   fp16_enabled ................. False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   global_rank .................. 0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   graph_harvesting ............. False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   memory_breakdown ............. False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   optimizer_name ............... None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   optimizer_params ............. None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   pld_enabled .................. False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   pld_params ................... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   prescale_gradients ........... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   scheduler_name ............... None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   scheduler_params ............. None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   sparse_attention ............. None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   steps_per_print .............. inf
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   train_batch_size ............. 1
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   weight_quantization_config ... None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   world_size ................... 1
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_enabled ................. True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 3
[2025-02-07 01:13:41,364] [INFO] [config.py:989:print_user_config]   json = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "nvme_path": null
        },
        "offload_param": {
            "device": "cpu",
            "nvme_path": null
        },
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true,
    "zero_optimization.reduce_bucket_size": 8.028160e+05,
    "zero_optimization.stage3_param_persistence_threshold": 8.960000e+03,
    "zero_optimization.stage3_prefetch_bucket_size": 0
}
===training policy===
wandb: Currently logged in as: ivan-eudokimoff2014 (ivan-eudokimoff2014-skolkovo-institute) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.5
wandb: Run data is saved locally in /home/calibri/experiments/rl_finetunning/wandb/run-20250207_011341-74y2rhd9
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ppo_config__42__1738890814
wandb: ⭐️ View project at https://wandb.ai/ivan-eudokimoff2014-skolkovo-institute/huggingface
wandb: 🚀 View run at https://wandb.ai/ivan-eudokimoff2014-skolkovo-institute/huggingface/runs/74y2rhd9
  0%|                                                                                                                                                                                                               | 0/1000 [00:00<?, ?it/s]From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/ppo_trainer.py:640: UserWarning: var(): degrees of freedom is <= 0. Correction should be strictly less than the reduction factor (input numel divided by output numel). (Triggered internally at ../aten/src/ATen/native/ReduceOps.cpp:1808.)
  metrics["val/ratio_var"] = self.accelerator.gather_for_metrics(ratio_stats).var().item()
{'eps': 0, 'objective/kl': 0.6529181003570557, 'objective/entropy': 58.82828903198242, 'objective/non_score_reward': -0.032645903527736664, 'objective/rlhf_reward': -2.6420209407806396, 'objective/scores': -2.609375, 'policy/approxkl_avg': 0.0022548267152160406, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.011362709105014801, 'loss/value_avg': 3.932140827178955, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.484375, 'val/ratio': 0.9899009466171265, 'val/ratio_var': nan, 'val/num_eos_tokens': 0, 'lr': 3e-06, 'episode': 1, 'epoch': 0.0}
  0%|▏                                                                                                                                                                                                    | 1/1000 [00:05<1:25:03,  5.11s/it]┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ query                                                                                                        ┃ model response                                                                                               ┃ score       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ She couldn't get the saw-player the kid had mentioned out of her mind. Sounds Hawaiian, she thought over and │  She couldn't help but giggle. Eddie was a good man, and she was glad to have him by her side. She had been  │ -1.75       │
│ over again as Eddie pushed her grimly along in the new wheelchair, weaving in and out of the stalled         │ in a car accident a few days ago, and she was in a terrible state of shock. She had been in a car            │             │
│ vehicles. Sounds Hawaiian, doesn't it? Sounds fucking Hawaiian, doesn't it.                                  │                                                                                                              │             │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ "You little piss-ant!" the girl snapped. "Don't tell me I slipped up. She died at seventeen. That's why I    │  "I'm a man. I'm not a child. I'm not a child. I'm a man."                                                   │ -0.58984375 │
│ wasn't there. I was never notified."                                                                         │                                                                                                              │             │
│                                                                                                              │ "Then why are you here?" she demanded. "Why are you here? Why are you here? Why are you here? Why are you    │             │
│ "But I don't do sixteen," he said, his voice going nasty.                                                    │ here                                                                                                         │             │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ Daniel flashed one of his own-a real one, this time.                                                         │  The room was empty. I was alone. I didn't know what to say. I didn't know what to do. I didn't know what to │ 0.44140625  │
│                                                                                                              │ do.                                                                                                          │             │
│ "I almost had a heart attack when Mom almost had a heart attack," he said, his voice quiet. Serious.         │                                                                                                              │             │
│ "I'm-I'm happy you're okay."                                                                                 │ I sat down on the floor, my hands in my lap. I didn't know what to do                                        │             │
│                                                                                                              │                                                                                                              │             │
│ I looked around the room.                                                                                    │                                                                                                              │             │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ "Lights," the Oracle announced.  Images sprang up on the panels, showing coral and the calm swirling of sea  │  I'm not sure what they are, but I'm not sure what they are."                                                │ 12.6875     │
│ particles.                                                                                                   │ Zook, a 22-year-old computer science student at the University of California, Berkeley, was watching the     │             │
│ "What am I looking at?" asked Zook.                                                                          │ Oracle's display of images on the screen. He was looking at                                                  │             │
│ "The extra eyes of technology that I dropped behind us.                                                      │                                                                                                              │             │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ These people could still smile and frown freely, they were just young. In fact, the only person I had seen   │  I was a bit nervous, but I was determined to get the story out of my head. I was going to tell the story of │ 5.71875     │
│ older than myself in the compound, was Dom. I frowned, and added that point to my agenda to discuss with the │ how I got into the compound, and how I was able to get out. I was going to tell the story of how I           │             │
│ others. Then I launched into the story.                                                                      │                                                                                                              │             │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
Traceback (most recent call last):
  File "/home/calibri/experiments/rl_finetunning/trl/examples/scripts/ppo/ppo.py", line 163, in <module>
    trainer.train()
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/ppo_trainer.py", line 556, in train
    output, vpred_temp = forward(model, mb_query_responses, processing_class.pad_token_id)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/utils.py", line 1224, in forward
    return model(
           ^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
    loss = self.module(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1592, in _call_impl
    args_result = hook(self, args)
                  ^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 241, in _start_of_forward_hook
    self.get_param_coordinator().reset_step()
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 235, in reset_step
    self.construct_parameter_trace_from_module_trace()
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 219, in construct_parameter_trace_from_module_trace
    self.record_parameters(sub_module)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 211, in record_parameters
    step_id = self.__step_id_module_fetched_for[sub_module.id].popleft()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: pop from an empty deque
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/calibri/experiments/rl_finetunning/trl/examples/scripts/ppo/ppo.py", line 163, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/ppo_trainer.py", line 556, in train
[rank0]:     output, vpred_temp = forward(model, mb_query_responses, processing_class.pad_token_id)
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/utils.py", line 1224, in forward
[rank0]:     return model(
[rank0]:            ^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1592, in _call_impl
[rank0]:     args_result = hook(self, args)
[rank0]:                   ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 241, in _start_of_forward_hook
[rank0]:     self.get_param_coordinator().reset_step()
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 235, in reset_step
[rank0]:     self.construct_parameter_trace_from_module_trace()
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 219, in construct_parameter_trace_from_module_trace
[rank0]:     self.record_parameters(sub_module)
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 211, in record_parameters
[rank0]:     step_id = self.__step_id_module_fetched_for[sub_module.id].popleft()
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: IndexError: pop from an empty deque
wandb:
wandb: 🚀 View run ppo_config__42__1738890814 at: https://wandb.ai/ivan-eudokimoff2014-skolkovo-institute/huggingface/runs/74y2rhd9
wandb: Find logs at: wandb/run-20250207_011341-74y2rhd9/logs
E0207 01:13:55.827000 138054911975552 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 662027) of binary: /home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/python
Traceback (most recent call last):
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trl/examples/scripts/ppo/ppo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-07_01:13:55
  host      : devai
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 662027)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
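
For reference, the `IndexError` at the bottom of the rank-0 traceback is the standard exception that Python's `collections.deque.popleft()` raises when the deque has no elements left. The sketch below reproduces only that behavior in isolation. The name `step_ids` is a hypothetical stand-in and is not DeepSpeed's actual data structure, and the snippet says nothing about why the deque ends up empty in this training run.

```python
from collections import deque

# Hypothetical stand-in for a per-module queue of recorded step ids.
# This is an illustration only, not DeepSpeed's real bookkeeping.
step_ids = deque([0])

# The first pop succeeds because one entry was recorded.
first = step_ids.popleft()

# A second pop finds nothing left and raises IndexError.
try:
    step_ids.popleft()
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)  # pop from an empty deque
```

Judging only from the traceback, `record_parameters` pops one step id per recorded sub-module, so the error suggests that more pops were attempted than entries had been stored; the root cause is not established here.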

P.S. I opened a similar issue in trl with more details: huggingface/trl#2784 (comment)

@JohnConnor123 JohnConnor123 changed the title from "IndexError: pop from an empty deque" to "IndexError: pop from an empty deque after 20 seconds of training" on Feb 7, 2025
@yangqiancheng-yuan

Hello, have you solved this problem? I hit the same issue with DeepSpeed ZeRO-3. If you have solved it, could you please share your solution? Thanks in advance. @JohnConnor123

@JohnConnor123
Author

JohnConnor123 commented Feb 13, 2025

> Hello, have you solved this problem? I hit the same issue with DeepSpeed ZeRO-3. If you have solved it, could you please share your solution? Thanks in advance. @JohnConnor123

Unfortunately, no. But there is some advice that has helped some people: huggingface/trl#2795

@yangqiancheng-yuan

> > Hello, have you solved this problem? I hit the same issue with DeepSpeed ZeRO-3. If you have solved it, could you please share your solution? Thanks in advance. @JohnConnor123
>
> Unfortunately, no. But there is some advice that has helped some people: huggingface/trl#2795

Unfortunately, it does not solve my problem either. Thanks for your response, though. If I find a solution somewhere else, I will let you know. @JohnConnor123
