python open_instruct/ppo_vllm_thread_ray_gtrl.py \
    --exp_name tulu-3-8b-rlvr \
    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \
    --dataset_mixer_list_splits train \
    --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \
    --dataset_mixer_eval_list_splits train \
    --max_token_length 2048 \
    --max_prompt_token_length 2048 \
    --response_length 2048 \
    --model_name_or_path allenai/Llama-3.1-Tulu-3-8B-DPO \
    --reward_model_path allenai/Llama-3.1-Tulu-3-8B-RM \
    --non_stop_penalty \
    --stop_token eos \
    --temperature 1.0 \
    --ground_truths_key ground_truth \
    --chat_template_name tulu \
    --sft_messages_key messages \
    --learning_rate 3e-7 \
    --total_episodes 10000000 \
    --penalty_reward_value -10.0 \
    --deepspeed_stage 2 \
    --per_device_train_batch_size 2 \
    --local_rollout_forward_batch_size 2 \
    --local_mini_batch_size 4 \
    --local_rollout_batch_size 4 \
    --actor_num_gpus_per_node 7 \
    --vllm_tensor_parallel_size 1 \
    --beta 0.05 \
    --apply_verifiable_reward true \
    --output_dir output/rlvr_8b \
    --seed 3 \
    --num_evals 3 \
    --save_freq 100 \
    --reward_model_multiplier 0.0 \
    --gradient_checkpointing \
    --with_tracking
[2025-02-26 21:08:01,753] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 02-26 21:08:03 [__init__.py:207] Automatically detected platform cuda.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: qzyph. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.1
wandb: Run data is saved locally in /workspace/open-instruct/wandb/run-20250226_210806-femvooh7
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run tulu-3-8b-rlvr__3__1740604084
wandb: ⭐️ View project at https://wandb.ai/qzyph/open_instruct_internal
wandb: 🚀 View run at https://wandb.ai/qzyph/open_instruct_internal/runs/femvooh7
tokenizer_config.json: 100%|█████████████████████████████████████| 50.5k/50.5k [00:00<00:00, 49.6MB/s]
tokenizer.json: 100%|████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 48.7MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████| 73.0/73.0 [00:00<00:00, 219kB/s]
by default, we will use the same split for all datasets
✅ Found cached dataset at https://huggingface.co/datasets/stijn-zyphra/dataset-mix-cached/tree/b322610826
by default, we will use the same split for all datasets
✅ Found cached dataset at https://huggingface.co/datasets/stijn-zyphra/dataset-mix-cached/tree/ee2d7afc67
[
│ Args(
│ │ dataset_mixer_list=['allenai/RLVR-GSM-MATH-IF-Mixed-Constraints', '1.0'],
│ │ dataset_mixer_eval_list=['allenai/RLVR-GSM-MATH-IF-Mixed-Constraints', '16'],
│ │ dataset_mixer_list_splits=['train'],
│ │ dataset_mixer_eval_list_splits=['train'],
│ │ exp_name='tulu-3-8b-rlvr',
│ │ seed=3,
│ │ run_name='tulu-3-8b-rlvr__3__1740604084',
│ │ eps=1e-05,
│ │ learning_rate=3e-07,
│ │ lr_scheduler_type='linear',
│ │ warm_up_steps=0,
│ │ warmup_ratio=0.0,
│ │ gradient_accumulation_steps=16,
│ │ per_device_train_batch_size=2,
│ │ per_device_eval_batch_size=1,
│ │ total_episodes=10000000,
│ │ world_size=7,
│ │ micro_batch_size=14,
│ │ local_rollout_batch_size=32,
│ │ local_total_prompts=32,
│ │ rollout_batch_size=224,
│ │ num_training_steps=44642,
│ │ num_evals=3,
│ │ eval_freq=14880,
│ │ local_dataloader_batch_size=224,
│ │ save_freq=100,
│ │ num_epochs=4,
│ │ num_mini_batches=1,
│ │ local_mini_batch_size=32,
│ │ mini_batch_size=224,
│ │ local_rollout_forward_batch_size=2,
│ │ reward_model_path='allenai/Llama-3.1-Tulu-3-8B-RM',
│ │ reward_model_revision=None,
│ │ init_value_from_scratch=False,
│ │ response_length=2048,
│ │ stop_token='eos',
│ │ stop_token_id=None,
│ │ min_response_length=0,
│ │ temperature=1.0,
│ │ penalty_reward_value=-10.0,
│ │ non_stop_penalty=True,
│ │ number_samples_per_prompt=1,
│ │ stop_strings=None,
│ │ eval_max_length=4096,
│ │ beta=0.05,
│ │ whiten_rewards=False,
│ │ cliprange=0.2,
│ │ vf_coef=0.1,
│ │ cliprange_value=0.2,
│ │ gamma=1,
│ │ lam=0.95,
│ │ kl_estimator='kl1',
│ │ apply_verifiable_reward=True,
│ │ reward_model_multiplier=0.0,
│ │ verification_reward=10.0,
│ │ add_r1_style_format_reward=False,
│ │ r1_style_format_reward=1.0,
│ │ async_mode=True,
│ │ actor_num_gpus_per_node=[7],
│ │ single_gpu_mode=False,
│ │ vllm_num_engines=1,
│ │ vllm_tensor_parallel_size=1,
│ │ vllm_enforce_eager=False,
│ │ vllm_sync_backend='nccl',
│ │ vllm_gpu_memory_utilization=0.9,
│ │ enable_prefix_caching=False,
│ │ deepspeed_stage=3,
│ │ gather_whole_model=True,
│ │ with_tracking=True,
│ │ wandb_project_name='open_instruct_internal',
│ │ wandb_entity=None,
│ │ push_to_hub=True,
│ │ hf_entity='stijn-zyphra',
│ │ hf_repo_id='stijn-zyphra/open_instruct_dev',
│ │ hf_repo_revision='tulu-3-8b-rlvr__3__1740604084',
│ │ hf_repo_url='https://huggingface.co/stijn-zyphra/open_instruct_dev/tree/tulu-3-8b-rlvr__3__1740604084',
│ │ output_dir='output/rlvr_8b',
│ │ checkpoint_output_dir=None,
│ │ cache_dataset_only=False,
│ │ save_value_model=False,
│ │ try_launch_beaker_eval_jobs=True,
│ │ try_launch_beaker_eval_jobs_on_weka=False,
│ │ try_auto_save_to_beaker=True,
│ │ oe_eval_tasks=None,
│ │ hf_metadata_dataset='allenai/tulu-3-evals'
│ ),
│ DatasetConfig(
│ │ chat_template=None,
│ │ preference_chosen_key='chosen',
│ │ preference_rejected_key='rejected',
│ │ sft_messages_key='messages',
│ │ ground_truths_key='ground_truth',
│ │ dataset_source_key='dataset',
│ │ binary_messages_key='messages',
│ │ label='binary_labels',
│ │ convert_preference_to_binary_dataset=False,
│ │ max_token_length=2048,
│ │ max_prompt_token_length=2048,
│ │ sanity_check=False,
│ │ sanity_check_max_samples=100,
│ │ batched=False,
│ │ load_from_cache_file=True,
│ │ num_proc=176,
│ │ train_only_on_prompt=False,
│ │ ncols=2
│ ),
│ ModelConfig(
│ │ model_name_or_path='meta-llama/Llama-3.1-8B',
│ │ model_revision=None,
│ │ tokenizer_name=None,
│ │ tokenizer_revision=None,
│ │ use_slow_tokenizer=False,
│ │ add_bos=False,
│ │ chat_template_name='tulu',
│ │ trust_remote_code=False,
│ │ torch_dtype=None,
│ │ attn_implementation=None,
│ │ use_cache=False,
│ │ gradient_checkpointing=True,
│ │ use_peft=False,
│ │ lora_r=16,
│ │ lora_alpha=32,
│ │ lora_dropout=0.05,
│ │ lora_target_modules=None,
│ │ lora_modules_to_save=None,
│ │ lora_task_type='CAUSAL_LM',
│ │ load_in_8bit=False,
│ │ load_in_4bit=False,
│ │ bnb_4bit_quant_type='nf4',
│ │ use_bnb_nested_quant=False
│ )
]
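For reference, the derived batch-size numbers in that dump seem to follow from the flags like this (my own back-of-the-envelope arithmetic from the printed values, not taken from the open-instruct code):

# Rough check of the derived values in the Args dump above (my own arithmetic,
# not code from the repo) - all of these reproduce the printed numbers.
world_size = 7                         # actor_num_gpus_per_node=[7]
per_device_train_batch_size = 2
local_rollout_batch_size = 32
local_mini_batch_size = 32
total_episodes = 10_000_000
num_evals = 3

micro_batch_size = per_device_train_batch_size * world_size                         # 14
rollout_batch_size = local_rollout_batch_size * world_size                          # 224
mini_batch_size = local_mini_batch_size * world_size                                # 224
gradient_accumulation_steps = local_mini_batch_size // per_device_train_batch_size  # 16
num_training_steps = total_episodes // rollout_batch_size                           # 44642
eval_freq = num_training_steps // num_evals                                         # 14880

Worth noting: these match local_rollout_batch_size=32, local_mini_batch_size=32, and deepspeed_stage=3, whereas the command at the top has 4, 4, and --deepspeed_stage 2, so this dump is presumably from the stage-3 run described below rather than from that exact command.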
<|user|>
Question: Find the domain of the expression $\frac{\sqrt{x-2}}{\sqrt{5-x}}$.}
Answer:The expressions inside each square root must be non-negative.
Therefore, $x-2 \ge 0$, so $x\ge2$, and $5 - x \ge 0$, so $x \le 5$.
Also, the denominator cannot be equal to zero, so $5-x>0$, which gives $x<5$.
Therefore, the domain of the expression is $\boxed{[2,5)}$.
Question: If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det (\mathbf{A}
\mathbf{B}).$
Answer:We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) =
\boxed{24}.$
Question: Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights
instead, how many times must Terrell lift them in order to lift the same total weight?
Answer:If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot 12\cdot20=480$
pounds of weight. If he lifts two 15-pound weights instead for $n$ times, he will lift a total of
$2\cdot15\cdot n=30n$ pounds of weight. Equating this to 480 pounds, we can solve for $n$:
\begin{align*}
30n&=480\\
\Rightarrow\qquad n&=480/30=\boxed{16}
\end{align*}
Question: If the system of equations
\begin{align*}
6x-4y&=a,\\
6y-9x &=b.
\end{align*}has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find $\frac{a}{b},$ assuming
$b$ is nonzero.
Answer:If we multiply the first equation by $-\frac{3}{2}$, we obtain
$$6y-9x=-\frac{3}{2}a.$$Since we also know that $6y-9x=b$, we have
$$-\frac{3}{2}a=b\Rightarrow\frac{a}{b}=\boxed{-\frac{2}{3}}.$$
Question: What is the value of $x$ if $|x-1| = |x-2|$? Express your answer as a common fraction.
<|assistant|>
2025-02-26 21:08:13,145 INFO worker.py:1821 -- Started a local Ray instance.
(pid=142922) [2025-02-26 21:08:19,773] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=142922) INFO 02-26 21:08:23 [__init__.py:207] Automatically detected platform cuda.
rank=1, world_size=7, rank=1, master_addr='172.27.46.59', master_port=40735
rank=2, world_size=7, rank=2, master_addr='172.27.46.59', master_port=40735
rank=3, world_size=7, rank=3, master_addr='172.27.46.59', master_port=40735
rank=4, world_size=7, rank=4, master_addr='172.27.46.59', master_port=40735
rank=5, world_size=7, rank=5, master_addr='172.27.46.59', master_port=40735
rank=6, world_size=7, rank=6, master_addr='172.27.46.59', master_port=40735
vllm: num_gpus=1, num_engines=1
(PolicyTrainerRayProcess pid=142922) [2025-02-26 21:08:24,370] [INFO] [comm.py:658:init_distributed] cdb=None
(PolicyTrainerRayProcess pid=142922) [2025-02-26 21:08:24,370] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(pid=142930) INFO 02-26 21:08:29 [__init__.py:207] Automatically detected platform cuda.
(pid=142948) [2025-02-26 21:08:30,409] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=142948) INFO 02-26 21:08:35 [__init__.py:207] Automatically detected platform cuda.
(pid=142928) [2025-02-26 21:08:31,342] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(PolicyTrainerRayProcess pid=142948) [2025-02-26 21:08:35,659] [INFO] [comm.py:658:init_distributed] cdb=None
(PolicyTrainerRayProcess pid=142948) dschf=<transformers.integrations.deepspeed.HfDeepSpeedConfig object at 0x7f327831d9c0>
Downloading shards: 0%| | 0/4 [00:00<?, ?it/s]
(LLMRayActor pid=142930) INFO 02-26 21:08:49 [config.py:569] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
(LLMRayActor pid=142930) INFO 02-26 21:08:49 [llm_engine.py:235] Initializing a V0 LLM engine (v0.7.4.dev115+g4cb6fa0a) with config: model='meta-llama/Llama-3.1-8B', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=3, served_model_name=meta-llama/Llama-3.1-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
(pid=142927) INFO 02-26 21:08:35 [__init__.py:207] Automatically detected platform cuda. [repeated 5x across cluster]
(PolicyTrainerRayProcess pid=142927) [2025-02-26 21:08:36,673] [INFO] [comm.py:658:init_distributed] cdb=None [repeated 5x across cluster]
(PolicyTrainerRayProcess pid=142922) dschf=<transformers.integrations.deepspeed.HfDeepSpeedConfig object at 0x7f14e5151420> [repeated 6x across cluster]
(LLMRayActor pid=142930) INFO 02-26 21:08:50 [cuda.py:229] Using Flash Attention backend.
(LLMRayActor pid=142930) INFO 02-26 21:08:52 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
(LLMRayActor pid=142930) INFO 02-26 21:08:52 [model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B...
(LLMRayActor pid=142930) INFO 02-26 21:08:54 [weight_utils.py:254] Using model weights format ['*.safetensors']
Downloading shards: 25%|██▌ | 1/4 [00:24<01:12, 24.19s/it]
Downloading shards: 0%| | 0/4 [00:00<?, ?it/s] [repeated 6x across cluster]
Downloading shards: 50%|█████ | 2/4 [00:51<00:52, 26.10s/it] [repeated 7x across cluster]
Downloading shards: 75%|███████▌ | 3/4 [01:17<00:26, 26.17s/it] [repeated 7x across cluster]
(LLMRayActor pid=142930) INFO 02-26 21:10:02 [weight_utils.py:270] Time spent downloading weights for meta-llama/Llama-3.1-8B: 67.931327 seconds
(PolicyTrainerRayProcess pid=142921) [2025-02-26 21:10:02,053] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 7
Downloading shards: 100%|██████████| 4/4 [01:23<00:00, 20.91s/it]
Downloading shards: 75%|███████▌ | 3/4 [01:18<00:26, 26.30s/it] [repeated 6x across cluster]
(PolicyTrainerRayProcess pid=142918) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1871, in <module>
main(*parser.parse())
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1769, in main
ray.get(inits)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 906, in get_objects
raise value.as_instanceof_cause()
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 268, in _wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4406, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4998, in _load_pretrained_model
model_to_load.load_state_dict(fixed_state_dict, strict=False, assign=assign_to_params_buffers)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([128256, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.mlp.up_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.mlp.down_proj.weight: copying a param with shape torch.Size([4096, 14336]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.o_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.mlp.gate_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.mlp.up_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.mlp.down_proj.weight: copying a param with shape torch.Size([4096, 14336]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.2.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
....
size mismatch for model.layers.8.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
Traceback (most recent call last):
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1871, in <module>
main(*parser.parse())
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1769, in main
ray.get(inits)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 906, in get_objects
raise value.as_instanceof_cause()
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 268, in _wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4406, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4998, in _load_pretrained_model
model_to_load.load_state_dict(fixed_state_dict, strict=False, assign=assign_to_params_buffers)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([128256, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
...
size mismatch for model.layers.8.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
(PolicyTrainerRayProcess pid=142922) [2025-02-26 21:10:15,533] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
(PolicyTrainerRayProcess pid=142918) [2025-02-26 21:10:02,056] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 7 [repeated 6x across cluster]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Downloading shards: 100%|██████████| 4/4 [01:23<00:00, 20.80s/it] [repeated 6x across cluster]
(PolicyTrainerRayProcess pid=142940) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [repeated 6x across cluster]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
wandb: 🚀 View run tulu-3-8b-rlvr__3__1740604084 at: https://wandb.ai/qzyph/open_instruct_internal/runs/femvooh7
wandb: Find logs at: wandb/run-20250226_210806-femvooh7/logs
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:14<00:43, 14.56s/it]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] [repeated 4x across cluster]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
Hey team! I've been trying to replicate ppo_vllm_thread_ray_gtrl.py with DeepSpeed ZeRO-3 on allenai/Llama-3.1-Tulu-3-8B-DPO with 8 H100s, but I'm hitting a frustrating issue. When loading the model with deepspeed_stage=3, I get a wall of size mismatch errors in which every model parameter appears to have shape torch.Size([0]) instead of its actual dimensions. The errors start with the embedding layer and continue through the attention, MLP, and layer norm weights of every layer:

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([128256, 4096]) from checkpoint, the shape in current model is torch.Size([0]).

Interestingly, when I switch to deepspeed_stage=2, everything works perfectly fine! Did you run into this during development? It seems to be related to ZeRO-3 parameter initialization and sharding, but I'm not sure why the parameters end up as empty tensors, and I couldn't find many existing reports about this either. Any suggestions would be super appreciated!
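In case it helps with debugging, here's a minimal sketch of what I think is happening (my own illustration, not code from open-instruct): when transformers sees a live HfDeepSpeedConfig with ZeRO stage 3, from_pretrained builds the model inside deepspeed.zero.Init, which partitions every parameter across ranks at construction time. The local tensor is emptied and the real shape only survives in DeepSpeed's ds_shape attribute, so anything that compares param.shape against the checkpoint sees torch.Size([0]). The snippet assumes it is launched with deepspeed/torchrun so the process group can be set up, and the model name is just the one from my command line.

# Minimal sketch of why ZeRO-3 parameters report torch.Size([0]) (my own
# illustration, not code from open-instruct).
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.integrations.deepspeed import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

# Keeping this object alive before from_pretrained is what switches transformers
# into ZeRO-3 mode: the model is then built inside deepspeed.zero.Init and every
# parameter is partitioned across ranks as soon as it is created.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-8B-DPO")

p = model.model.embed_tokens.weight
print(p.shape)     # torch.Size([0])            -> what the size-mismatch error compares against
print(p.ds_shape)  # torch.Size([128256, 4096]) -> the real shape, tracked by DeepSpeed

# The full tensor only materializes while the parameter is explicitly gathered:
with deepspeed.zero.GatheredParameters([p]):
    print(p.shape)  # torch.Size([128256, 4096]) inside this context

If that picture is right, the question is why the stage-3 load path in this transformers version ends up calling a plain load_state_dict on the still-partitioned parameters instead of gathering them first.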