python open_instruct/ppo_vllm_thread_ray_gtrl.py \
    --exp_name tulu-3-8b-rlvr \
    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \
    --dataset_mixer_list_splits train \
    --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \
    --dataset_mixer_eval_list_splits train \
    --max_token_length 2048 \
    --max_prompt_token_length 2048 \
    --response_length 2048 \
    --model_name_or_path allenai/Llama-3.1-Tulu-3-8B-DPO \
    --reward_model_path allenai/Llama-3.1-Tulu-3-8B-RM \
    --non_stop_penalty \
    --stop_token eos \
    --temperature 1.0 \
    --ground_truths_key ground_truth \
    --chat_template_name tulu \
    --sft_messages_key messages \
    --learning_rate 3e-7 \
    --total_episodes 10000000 \
    --penalty_reward_value -10.0 \
    --deepspeed_stage 2 \
    --per_device_train_batch_size 2 \
    --local_rollout_forward_batch_size 2 \
    --local_mini_batch_size 4 \
    --local_rollout_batch_size 4 \
    --actor_num_gpus_per_node 7 \
    --vllm_tensor_parallel_size 1 \
    --beta 0.05 \
    --apply_verifiable_reward true \
    --output_dir output/rlvr_8b \
    --seed 3 \
    --num_evals 3 \
    --save_freq 100 \
    --reward_model_multiplier 0.0 \
    --gradient_checkpointing \
    --with_tracking
[2025-02-26 21:08:01,753] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 02-26 21:08:03 [__init__.py:207] Automatically detected platform cuda.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: qzyph. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.1
wandb: Run data is saved locally in /workspace/open-instruct/wandb/run-20250226_210806-femvooh7
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run tulu-3-8b-rlvr__3__1740604084
wandb: ⭐️ View project at https://wandb.ai/qzyph/open_instruct_internal
wandb: 🚀 View run at https://wandb.ai/qzyph/open_instruct_internal/runs/femvooh7
tokenizer_config.json: 100%|█████████████████████████████████████| 50.5k/50.5k [00:00<00:00, 49.6MB/s]
tokenizer.json: 100%|████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 48.7MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████| 73.0/73.0 [00:00<00:00, 219kB/s]
by default, we will use the same split for all datasets
✅ Found cached dataset at https://huggingface.co/datasets/stijn-zyphra/dataset-mix-cached/tree/b322610826
by default, we will use the same split for all datasets
✅ Found cached dataset at https://huggingface.co/datasets/stijn-zyphra/dataset-mix-cached/tree/ee2d7afc67
[
│ Args(
│ │ dataset_mixer_list=['allenai/RLVR-GSM-MATH-IF-Mixed-Constraints', '1.0'],
│ │ dataset_mixer_eval_list=['allenai/RLVR-GSM-MATH-IF-Mixed-Constraints', '16'],
│ │ dataset_mixer_list_splits=['train'],
│ │ dataset_mixer_eval_list_splits=['train'],
│ │ exp_name='tulu-3-8b-rlvr',
│ │ seed=3,
│ │ run_name='tulu-3-8b-rlvr__3__1740604084',
│ │ eps=1e-05,
│ │ learning_rate=3e-07,
│ │ lr_scheduler_type='linear',
│ │ warm_up_steps=0,
│ │ warmup_ratio=0.0,
│ │ gradient_accumulation_steps=16,
│ │ per_device_train_batch_size=2,
│ │ per_device_eval_batch_size=1,
│ │ total_episodes=10000000,
│ │ world_size=7,
│ │ micro_batch_size=14,
│ │ local_rollout_batch_size=32,
│ │ local_total_prompts=32,
│ │ rollout_batch_size=224,
│ │ num_training_steps=44642,
│ │ num_evals=3,
│ │ eval_freq=14880,
│ │ local_dataloader_batch_size=224,
│ │ save_freq=100,
│ │ num_epochs=4,
│ │ num_mini_batches=1,
│ │ local_mini_batch_size=32,
│ │ mini_batch_size=224,
│ │ local_rollout_forward_batch_size=2,
│ │ reward_model_path='allenai/Llama-3.1-Tulu-3-8B-RM',
│ │ reward_model_revision=None,
│ │ init_value_from_scratch=False,
│ │ response_length=2048,
│ │ stop_token='eos',
│ │ stop_token_id=None,
│ │ min_response_length=0,
│ │ temperature=1.0,
│ │ penalty_reward_value=-10.0,
│ │ non_stop_penalty=True,
│ │ number_samples_per_prompt=1,
│ │ stop_strings=None,
│ │ eval_max_length=4096,
│ │ beta=0.05,
│ │ whiten_rewards=False,
│ │ cliprange=0.2,
│ │ vf_coef=0.1,
│ │ cliprange_value=0.2,
│ │ gamma=1,
│ │ lam=0.95,
│ │ kl_estimator='kl1',
│ │ apply_verifiable_reward=True,
│ │ reward_model_multiplier=0.0,
│ │ verification_reward=10.0,
│ │ add_r1_style_format_reward=False,
│ │ r1_style_format_reward=1.0,
│ │ async_mode=True,
│ │ actor_num_gpus_per_node=[7],
│ │ single_gpu_mode=False,
│ │ vllm_num_engines=1,
│ │ vllm_tensor_parallel_size=1,
│ │ vllm_enforce_eager=False,
│ │ vllm_sync_backend='nccl',
│ │ vllm_gpu_memory_utilization=0.9,
│ │ enable_prefix_caching=False,
│ │ deepspeed_stage=3,
│ │ gather_whole_model=True,
│ │ with_tracking=True,
│ │ wandb_project_name='open_instruct_internal',
│ │ wandb_entity=None,
│ │ push_to_hub=True,
│ │ hf_entity='stijn-zyphra',
│ │ hf_repo_id='stijn-zyphra/open_instruct_dev',
│ │ hf_repo_revision='tulu-3-8b-rlvr__3__1740604084',
│ │ hf_repo_url='https://huggingface.co/stijn-zyphra/open_instruct_dev/tree/tulu-3-8b-rlvr__3__1740604084',
│ │ output_dir='output/rlvr_8b',
│ │ checkpoint_output_dir=None,
│ │ cache_dataset_only=False,
│ │ save_value_model=False,
│ │ try_launch_beaker_eval_jobs=True,
│ │ try_launch_beaker_eval_jobs_on_weka=False,
│ │ try_auto_save_to_beaker=True,
│ │ oe_eval_tasks=None,
│ │ hf_metadata_dataset='allenai/tulu-3-evals'
│ ),
│ DatasetConfig(
│ │ chat_template=None,
│ │ preference_chosen_key='chosen',
│ │ preference_rejected_key='rejected',
│ │ sft_messages_key='messages',
│ │ ground_truths_key='ground_truth',
│ │ dataset_source_key='dataset',
│ │ binary_messages_key='messages',
│ │ label='binary_labels',
│ │ convert_preference_to_binary_dataset=False,
│ │ max_token_length=2048,
│ │ max_prompt_token_length=2048,
│ │ sanity_check=False,
│ │ sanity_check_max_samples=100,
│ │ batched=False,
│ │ load_from_cache_file=True,
│ │ num_proc=176,
│ │ train_only_on_prompt=False,
│ │ ncols=2
│ ),
│ ModelConfig(
│ │ model_name_or_path='meta-llama/Llama-3.1-8B',
│ │ model_revision=None,
│ │ tokenizer_name=None,
│ │ tokenizer_revision=None,
│ │ use_slow_tokenizer=False,
│ │ add_bos=False,
│ │ chat_template_name='tulu',
│ │ trust_remote_code=False,
│ │ torch_dtype=None,
│ │ attn_implementation=None,
│ │ use_cache=False,
│ │ gradient_checkpointing=True,
│ │ use_peft=False,
│ │ lora_r=16,
│ │ lora_alpha=32,
│ │ lora_dropout=0.05,
│ │ lora_target_modules=None,
│ │ lora_modules_to_save=None,
│ │ lora_task_type='CAUSAL_LM',
│ │ load_in_8bit=False,
│ │ load_in_4bit=False,
│ │ bnb_4bit_quant_type='nf4',
│ │ use_bnb_nested_quant=False
│ )
]
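For reference, the derived batch-size numbers in that dump seem to follow from the flags like this (my own back-of-the-envelope arithmetic from the printed values, not taken from the open-instruct code):

# Rough check of the derived values in the Args dump above (my own arithmetic,
# not code from the repo) - all of these reproduce the printed numbers.
world_size = 7                         # actor_num_gpus_per_node=[7]
per_device_train_batch_size = 2
local_rollout_batch_size = 32
local_mini_batch_size = 32
total_episodes = 10_000_000
num_evals = 3

micro_batch_size = per_device_train_batch_size * world_size                         # 14
rollout_batch_size = local_rollout_batch_size * world_size                          # 224
mini_batch_size = local_mini_batch_size * world_size                                # 224
gradient_accumulation_steps = local_mini_batch_size // per_device_train_batch_size  # 16
num_training_steps = total_episodes // rollout_batch_size                           # 44642
eval_freq = num_training_steps // num_evals                                         # 14880

Worth noting: these match local_rollout_batch_size=32, local_mini_batch_size=32, and deepspeed_stage=3, whereas the command at the top has 4, 4, and --deepspeed_stage 2, so this dump is presumably from the stage-3 run described below rather than from that exact command.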
<|user|>
Question: Find the domain of the expression $\frac{\sqrt{x-2}}{\sqrt{5-x}}$.}
Answer:The expressions inside each square root must be non-negative.
Therefore, $x-2 \ge 0$, so $x\ge2$, and $5 - x \ge 0$, so $x \le 5$.
Also, the denominator cannot be equal to zero, so $5-x>0$, which gives $x<5$.
Therefore, the domain of the expression is $\boxed{[2,5)}$.
Question: If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det (\mathbf{A}
\mathbf{B}).$
Answer:We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) =
\boxed{24}.$
Question: Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights
instead, how many times must Terrell lift them in order to lift the same total weight?
Answer:If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot 12\cdot20=480$
pounds of weight. If he lifts two 15-pound weights instead for $n$ times, he will lift a total of
$2\cdot15\cdot n=30n$ pounds of weight. Equating this to 480 pounds, we can solve for $n$:
\begin{align*}
30n&=480\\
\Rightarrow\qquad n&=480/30=\boxed{16}
\end{align*}
Question: If the system of equations
\begin{align*}
6x-4y&=a,\\
6y-9x &=b.
\end{align*}has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find $\frac{a}{b},$ assuming
$b$ is nonzero.
Answer:If we multiply the first equation by $-\frac{3}{2}$, we obtain
$$6y-9x=-\frac{3}{2}a.$$Since we also know that $6y-9x=b$, we have
$$-\frac{3}{2}a=b\Rightarrow\frac{a}{b}=\boxed{-\frac{2}{3}}.$$
Question: What is the value of $x$ if $|x-1| = |x-2|$? Express your answer as a common fraction.
<|assistant|>
2025-02-26 21:08:13,145 INFO worker.py:1821 -- Started a local Ray instance.
(pid=142922) [2025-02-26 21:08:19,773] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=142922) INFO 02-26 21:08:23 [__init__.py:207] Automatically detected platform cuda.
rank=1, world_size=7, rank=1, master_addr='172.27.46.59', master_port=40735
rank=2, world_size=7, rank=2, master_addr='172.27.46.59', master_port=40735
rank=3, world_size=7, rank=3, master_addr='172.27.46.59', master_port=40735
rank=4, world_size=7, rank=4, master_addr='172.27.46.59', master_port=40735
rank=5, world_size=7, rank=5, master_addr='172.27.46.59', master_port=40735
rank=6, world_size=7, rank=6, master_addr='172.27.46.59', master_port=40735
vllm: num_gpus=1, num_engines=1
(PolicyTrainerRayProcess pid=142922) [2025-02-26 21:08:24,370] [INFO] [comm.py:658:init_distributed] cdb=None
(PolicyTrainerRayProcess pid=142922) [2025-02-26 21:08:24,370] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(pid=142930) INFO 02-26 21:08:29 [__init__.py:207] Automatically detected platform cuda.
(pid=142948) [2025-02-26 21:08:30,409] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=142948) INFO 02-26 21:08:35 [__init__.py:207] Automatically detected platform cuda.
(pid=142928) [2025-02-26 21:08:31,342] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(PolicyTrainerRayProcess pid=142948) [2025-02-26 21:08:35,659] [INFO] [comm.py:658:init_distributed] cdb=None
(PolicyTrainerRayProcess pid=142948) dschf=<transformers.integrations.deepspeed.HfDeepSpeedConfig object at 0x7f327831d9c0>
Downloading shards: 0%| | 0/4 [00:00<?, ?it/s]
(LLMRayActor pid=142930) INFO 02-26 21:08:49 [config.py:569] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
(LLMRayActor pid=142930) INFO 02-26 21:08:49 [llm_engine.py:235] Initializing a V0 LLM engine (v0.7.4.dev115+g4cb6fa0a) with config: model='meta-llama/Llama-3.1-8B', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=3, served_model_name=meta-llama/Llama-3.1-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
(pid=142927) INFO 02-26 21:08:35 [__init__.py:207] Automatically detected platform cuda. [repeated 5x across cluster]
(PolicyTrainerRayProcess pid=142927) [2025-02-26 21:08:36,673] [INFO] [comm.py:658:init_distributed] cdb=None [repeated 5x across cluster]
(PolicyTrainerRayProcess pid=142922) dschf=<transformers.integrations.deepspeed.HfDeepSpeedConfig object at 0x7f14e5151420> [repeated 6x across cluster]
(LLMRayActor pid=142930) INFO 02-26 21:08:50 [cuda.py:229] Using Flash Attention backend.
(LLMRayActor pid=142930) INFO 02-26 21:08:52 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
(LLMRayActor pid=142930) INFO 02-26 21:08:52 [model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B...
(LLMRayActor pid=142930) INFO 02-26 21:08:54 [weight_utils.py:254] Using model weights format ['*.safetensors']
Downloading shards: 25%|██▌ | 1/4 [00:24<01:12, 24.19s/it]
Downloading shards: 0%| | 0/4 [00:00<?, ?it/s] [repeated 6x across cluster]
Downloading shards: 50%|█████ | 2/4 [00:51<00:52, 26.10s/it] [repeated 7x across cluster]
Downloading shards: 75%|███████▌ | 3/4 [01:17<00:26, 26.17s/it] [repeated 7x across cluster]
(LLMRayActor pid=142930) INFO 02-26 21:10:02 [weight_utils.py:270] Time spent downloading weights for meta-llama/Llama-3.1-8B: 67.931327 seconds
(PolicyTrainerRayProcess pid=142921) [2025-02-26 21:10:02,053] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 7
Downloading shards: 100%|██████████| 4/4 [01:23<00:00, 20.91s/it]
Downloading shards: 75%|███████▌ | 3/4 [01:18<00:26, 26.30s/it] [repeated 6x across cluster]
(PolicyTrainerRayProcess pid=142918) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1871, in <module>
main(*parser.parse())
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1769, in main
ray.get(inits)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 906, in get_objects
raise value.as_instanceof_cause()
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 268, in _wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4406, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4998, in _load_pretrained_model
model_to_load.load_state_dict(fixed_state_dict, strict=False, assign=assign_to_params_buffers)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([128256, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.mlp.up_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.mlp.down_proj.weight: copying a param with shape torch.Size([4096, 14336]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.self_attn.o_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.mlp.gate_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.mlp.up_proj.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.mlp.down_proj.weight: copying a param with shape torch.Size([4096, 14336]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.2.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
....
size mismatch for model.layers.8.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
Traceback (most recent call last):
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1871, in <module>
main(*parser.parse())
File "/workspace/open-instruct/open_instruct/ppo_vllm_thread_ray_gtrl.py", line 1769, in main
ray.get(inits)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 906, in get_objects
raise value.as_instanceof_cause()
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 268, in _wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4406, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4998, in _load_pretrained_model
model_to_load.load_state_dict(fixed_state_dict, strict=False, assign=assign_to_params_buffers)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([128256, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
...
size mismatch for model.layers.8.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
(PolicyTrainerRayProcess pid=142922) [2025-02-26 21:10:15,533] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
(PolicyTrainerRayProcess pid=142918) [2025-02-26 21:10:02,056] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 7 [repeated 6x across cluster]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Downloading shards: 100%|██████████| 4/4 [01:23<00:00, 20.80s/it] [repeated 6x across cluster]
(PolicyTrainerRayProcess pid=142940) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [repeated 6x across cluster]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
wandb: 🚀 View run tulu-3-8b-rlvr__3__1740604084 at: https://wandb.ai/qzyph/open_instruct_internal/runs/femvooh7
wandb: Find logs at: wandb/run-20250226_210806-femvooh7/logs
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:14<00:43, 14.56s/it]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] [repeated 4x across cluster]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
Hey team! I've been trying to replicate ppo_vllm_thread_ray_gtrl.py with DeepSpeed ZeRO-3 on allenai/Llama-3.1-Tulu-3-8B-DPO with 8 H100s, but I'm hitting a frustrating issue. When loading the model with deepspeed_stage=3, I get a wall of size mismatch errors in which every model parameter appears to have shape torch.Size([0]) instead of its actual dimensions. The errors start with the embedding layer and continue through the attention, MLP, and layer norm weights of every layer:

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([128256, 4096]) from checkpoint, the shape in current model is torch.Size([0]).

Interestingly, when I switch to deepspeed_stage=2, everything works perfectly fine! Did you run into this during development? It seems to be related to ZeRO-3 parameter initialization and sharding, but I'm not sure why the parameters end up as empty tensors, and I couldn't find many existing reports about this either. Any suggestions would be super appreciated!
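In case it helps with debugging, here's a minimal sketch of what I think is happening (my own illustration, not code from open-instruct): when transformers sees a live HfDeepSpeedConfig with ZeRO stage 3, from_pretrained builds the model inside deepspeed.zero.Init, which partitions every parameter across ranks at construction time. The local tensor is emptied and the real shape only survives in DeepSpeed's ds_shape attribute, so anything that compares param.shape against the checkpoint sees torch.Size([0]). The snippet assumes it is launched with deepspeed/torchrun so the process group can be set up, and the model name is just the one from my command line.

# Minimal sketch of why ZeRO-3 parameters report torch.Size([0]) (my own
# illustration, not code from open-instruct).
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.integrations.deepspeed import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

# Keeping this object alive before from_pretrained is what switches transformers
# into ZeRO-3 mode: the model is then built inside deepspeed.zero.Init and every
# parameter is partitioned across ranks as soon as it is created.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-8B-DPO")

p = model.model.embed_tokens.weight
print(p.shape)     # torch.Size([0])            -> what the size-mismatch error compares against
print(p.ds_shape)  # torch.Size([128256, 4096]) -> the real shape, tracked by DeepSpeed

# The full tensor only materializes while the parameter is explicitly gathered:
with deepspeed.zero.GatheredParameters([p]):
    print(p.shape)  # torch.Size([128256, 4096]) inside this context

If that picture is right, the question is why the stage-3 load path in this transformers version ends up calling a plain load_state_dict on the still-partitioned parameters instead of gathering them first.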