Reproduce/enable DeepSeek R1 Distill Llama 8B #7981
Comments
This should just work out of the box if we convert the distilled weights and use them in the existing Llama export flow.
Yeah, we have a script to do that.
Hi @mergennachin, would love to see a script exporting the DeepSeek model so we can give it a try with our delegates as well :-)
I have a distilled model, but I need help converting it from a .pth file to a .pte file. The config file required a little cleaning before ModelArgs would accept it, but now the script reports that the shapes of all layers are different. Any ideas how I can fix this?
@CypherpunkSamurai It may be related to the compressed projection used in MLA. I'm creating the interface in #8039 so that different attentions can be implemented and added.
@iseeyuan thank you very much for your work! Let me know if I can help contribute. Btw, here's what I had to remove, followed by the error logs:

# removing these json keys fixes the ModelArgs issue
unsupported_args = [
"architectures",
"attention_bias",
"attention_dropout",
"bos_token_id",
"eos_token_id",
"hidden_act",
"hidden_size",
"initializer_range",
"intermediate_size",
"max_position_embeddings",
"mlp_bias",
"model_type",
"num_attention_heads",
"num_hidden_layers",
"num_key_value_heads",
"pretraining_tp",
"rms_norm_eps",
"rope_scaling",
"tie_word_embeddings",
"torch_dtype",
"transformers_version",
"use_cache"
]

Error logs:

!cd executorch \
&& python -m examples.models.llama.export_llama \
--checkpoint "/tmp/deepseek-r1-llama-8b-pth/deepseek-r1-llama-8b.pth" \
--params "/tmp/deepseek-r1-llama-8b-pth/config_filtered.json" \
--output_name "deepseek-r1-llama-8b.pte" \
-kv \
--use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/kaggle/working/executorch/examples/models/llama/export_llama.py", line 32, in <module>
main() # pragma: no cover
File "/kaggle/working/executorch/examples/models/llama/export_llama.py", line 28, in main
export_llama(args)
File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 533, in export_llama
builder = _export_llama(args)
File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 668, in _export_llama
builder_exported = _prepare_for_llama_export(args).export()
File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 565, in _prepare_for_llama_export
_load_llama_model(
File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 940, in _load_llama_model
EagerModelFactory.create_model(
File "/kaggle/working/executorch/examples/models/model_factory.py", line 44, in create_model
model = model_class(**kwargs)
File "/kaggle/working/executorch/examples/models/llama/model.py", line 235, in __init__
missing, unexpected = self.model_.load_state_dict(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2581, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for Transformer:
size mismatch for layers.0.attention.wk.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.0.attention.wv.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.0.feed_forward.w1.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
size mismatch for layers.0.feed_forward.w2.weight: copying a param with shape torch.Size([4096, 14336]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
size mismatch for layers.0.feed_forward.w3.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[... the same five size mismatches repeat for layers 1 through 31 ...]
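A note on those shapes (my own reading, not verified against the export code): a wk/wv of [1024, 4096] is exactly what grouped-query attention with 8 KV heads gives (head_dim = 4096/32 = 128, and 8 × 128 = 1024), and 14336 is Llama 3.1 8B's FFN width. The [4096, 4096] and 11008 shapes the model built instead look like ModelArgs defaults, which suggests the filtered HF config simply didn't carry the fields the exporter reads (it expects params.json-style keys such as n_kv_heads), so everything fell back to defaults, rather than anything being wrong with the checkpoint.

# hypothetical sanity check: derive the checkpoint's wk/wv shape from the
# usual Llama 3.1 8B hyperparameters
dim, n_heads, n_kv_heads = 4096, 32, 8
head_dim = dim // n_heads               # 128
print((n_kv_heads * head_dim, dim))     # (1024, 4096) -- matches the checkpoint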
can you try using this params.json file instead of config_filtered.json in your export script? https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/original/params.json
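For reference, that params.json should look roughly like this (quoted from memory of the Llama 3.1 8B release, so treat the exact values as an assumption and prefer the downloaded file):

{
    "dim": 4096,
    "ffn_dim_multiplier": 1.3,
    "multiple_of": 1024,
    "n_heads": 32,
    "n_kv_heads": 8,
    "n_layers": 32,
    "norm_eps": 1e-05,
    "rope_theta": 500000.0,
    "use_scaled_rope": true,
    "vocab_size": 128256
}

Note that n_kv_heads and the FFN sizing (ffn_dim_multiplier/multiple_of work out to 14336) are present here, which is exactly what the filtered HF config above was missing.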
@mergennachin same error. Layer sizes are not matching.
Hmm, interesting. @CypherpunkSamurai (and @raziel, fyi) here's how I was able to export and create a .pte file just now.

Step 1: Set up ExecuTorch by following https://pytorch.org/executorch/main/getting-started-setup

Step 4: Convert the model to a .pth file by running a conversion script (a sketch follows below).
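For the conversion itself, something along these lines should work; this is a sketch assuming torchtune's checkpointer API (FullModelHFCheckpointer, convert_weights.tune_to_meta), and the safetensors filenames should be whatever the downloaded repo actually contains:

# sketch: convert the downloaded HF safetensors checkpoint to a Meta-format .pth
import torch
from torchtune.models import convert_weights
from torchtune.training import FullModelHFCheckpointer

checkpointer = FullModelHFCheckpointer(
    checkpoint_dir="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    checkpoint_files=[
        "model-00001-of-000002.safetensors",
        "model-00002-of-000002.safetensors",
    ],
    output_dir="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    model_type="LLAMA3",
)
state_dict = checkpointer.load_checkpoint()
# map torchtune parameter names back to the Meta naming the exporter expects
state_dict = convert_weights.tune_to_meta(state_dict["model"])
torch.save(state_dict, "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth")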
Step 5: Download https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/original/params.json and save it to /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/params.json

Step 6: Export to a .pte file.

You can continue with the runtime and iOS/Android integration here: https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md
@mergennachin I'm using all nightly versions of torchtune, torchao, and executorch. I'm unsure what's causing the issue, but sure, I'll try this method and update you tomorrow; it's a few hours past midnight here 🥲 Thank you for the quick reply, btw 😄👍🏻
@mergennachin the script gets killed while converting; it looks like it runs out of memory. Is this normal? Is there any way to limit memory usage when loading and converting the model? I'm using the same code:

!cd /tmp/executorch && \
python -m examples.models.llama.export_llama \
--checkpoint "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth" \
-p "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/original/params.json" \
-kv \
--use_sdpa_with_kv_cache \
-X \
-qmode 8da4w \
--group_size 128 \
-d fp32 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
--embedding-quantize 4,32 \
--output_name="/kaggle/working/DeepSeek-R1-Distill-Llama-8B.pte"
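(For readers following along, my understanding of these flags from the Llama example docs: -kv enables the KV cache, --use_sdpa_with_kv_cache swaps in the custom SDPA op that works with that cache, -X delegates to XNNPACK, -qmode 8da4w quantizes to 8-bit dynamic activations with 4-bit grouped weights, and -d sets the dtype.)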
I just realised it requires more RAM, so I switched to a different provider and converted the model to .pte successfully. Here are the files. I tried loading it along with the original Llama 3.2 Instruct tokenizer.
@CypherpunkSamurai - For reducing RAM, you can pass in "-d fp16" instead of "-d fp32" during the export_llama script.
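Concretely, that's the same export invocation as above with only the dtype flag swapped (paths elided here):

python -m examples.models.llama.export_llama \
    --checkpoint ... -p ... -kv --use_sdpa_with_kv_cache -X \
    -qmode 8da4w --group_size 128 \
    -d fp16  # was: -d fp32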
I realized yesterday that it's not going to work. The DeepSeek-R1-Distill-Llama-8B model didn't change the Llama architecture, but they did change the tokenizer a bit, according to the link above.

Currently the runner doesn't take the HF-native format. cc @tarun292 -- who may have some ideas. In the meantime, @CypherpunkSamurai - if you have some ideas, please let us know. FWIW, the torchtune repo also had similar problems: pytorch/torchtune#2287, pytorch/torchtune#2212 (cc @felipemello1, @ebsmothers)
Interesting. After writing the reply above I realized the same: the tokenizers are different, and thus I used a legacy convert.py script from llama.cpp (per ggerganov/llama.cpp/issues/2443 and ggerganov/llama.cpp/issues/7912). It is supposed to convert the Hugging Face format to BPE format.

# get the model folder
pip3 install -q "huggingface_hub[cli]"
huggingface-cli download "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
--local-dir "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
--local-dir-use-symlinks False
# convert
pip install sentencepiece==0.1.98 "gguf>=0.1.0"
# convert tokenizer
python convert.py "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" --vocab-only --outfile "/tmp/tokenizer.model" --vocab-type bpe

But this fails with the same code 18 error, which I'm guessing is an invalid-tokenizer error (I read it in the comment linked above).
One user mentioned in the torchtune thread: "I hacked torchtune to use the HF AutoTokenizer and it seems to be working now." So if unblocking is urgent, this could be a path.
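As a quick sanity check on the Python side (a sketch, assuming you have access to both HF repos; it uses only the standard transformers API), you can compare what the two tokenizers produce on plain text, since the divergence should be in the special tokens and chat template rather than the ordinary vocabulary:

# compare DeepSeek R1 Distill's HF tokenizer with the Llama 3.1 one
from transformers import AutoTokenizer

ds = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
ll = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

text = "If x+7=9, solve for x"
print(ds.encode(text, add_special_tokens=False))
print(ll.encode(text, add_special_tokens=False))  # expected to match on plain text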
@felipemello1 - yeah, it might be a bit more work for ExecuTorch, since we have to write them in C++.

Yeah, I downloaded the llama.cpp repo and ran the following command, and on my desktop ran this (instead of a phone). And it is failing around here: executorch/extension/llm/tokenizer/bpe_tokenizer.cpp, lines 92 to 105 (at 4796da7).

So either the converter is doing something wrong or our bpe_tokenizer.cpp implementation doesn't handle some weird edge cases. Will look into this more, but let us know if you find something. Since we can run on desktop, we can do some debugging directly.
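One way to debug on desktop: the runner first tries to load the file as a Tiktoken artifact (one base64-encoded token plus a rank per line, which is where the "multiple of 4" base64 error seen later comes from) before falling back to the BPE loader. A small hypothetical check of whether a converted tokenizer.model fits that layout can help localize the failure:

# check whether a tokenizer file is a tiktoken-style artifact
# (one "base64token rank" pair per line)
import base64
import sys

def looks_like_tiktoken(path):
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            token, _, rank = line.strip().partition(b" ")
            try:
                base64.b64decode(token, validate=True)
                int(rank)
            except (ValueError, TypeError) as e:
                print(f"line {lineno}: not tiktoken-style ({e})")
                return False
    return True

print(looks_like_tiktoken(sys.argv[1]))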
@CypherpunkSamurai can you try with the 3.1 tokenizer, since the base model for this is 3.1? We tried with the tokenizers from here and are able to get reasonable outputs from the model: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main/original

What you tried is from the 3.2 repo from what I can see; can you share the exact link to the tokenizer.model? We're still going to look into getting the tokenizer generated from the DeepSeek repo working.
Same resulting errors: error 18.
Are you running the CLI or the app? In the app, can you pick 3.1 from the dropdown?
I'm sorry for not mentioning. Yes, I tried on the app with the 3.1 and 3.2 configs. Both of them throw error 18. I also tried converting the tokenizer from DeepSeek R1 to a .model file; it returns error 18 as well. All output models are here:
@CypherpunkSamurai - can you try on the CLI first so that we can eliminate some other possibilities? Try with XNNPACK, but not with QNN. I was able to get this:
@mergennachin It looks like it's working with the Llama 3.1-8B tokenizer 😄

# Test With LLAMA 3.1 Tokenizer
from huggingface_hub import hf_hub_download
hf_hub_download(
repo_id="meta-llama/Llama-3.1-8B",
filename="original/tokenizer.model",
local_dir="/tmp/meta-llama/Llama-3.1-8B"
)
!cd $EXECUTORCH_ROOT; \
./cmake-out/examples/models/llama/llama_main \
--model_path="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte" \
--tokenizer_path="/tmp/meta-llama/Llama-3.1-8B/original/tokenizer.model" \
--prompt="If x+7=9, solve for x"
I 00:00:00.000671 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.000709 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.000732 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.000753 executorch:cpuinfo_utils.cpp:99] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.000763 executorch:cpuinfo_utils.cpp:115] CPU info and manual query on # of cpus dont match.
I 00:00:00.000768 executorch:main.cpp:68] Resetting threadpool with num threads = 0
I 00:00:00.000819 executorch:runner.cpp:55] Creating LLaMa runner: model_path=/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte, tokenizer_path=/tmp/meta-llama/Llama-3.1-8B/original/tokenizer.model
I 00:00:38.793702 executorch:runner.cpp:88] Reading metadata from model
I 00:00:38.793783 executorch:runner.cpp:113] Metadata: get_bos_id = 128000
I 00:00:38.793798 executorch:runner.cpp:113] Metadata: use_kv_cache = 1
I 00:00:38.793804 executorch:runner.cpp:113] Metadata: get_max_seq_len = 128
I 00:00:38.793814 executorch:runner.cpp:113] Metadata: get_vocab_size = 128256
I 00:00:38.793820 executorch:runner.cpp:113] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:38.793829 executorch:runner.cpp:113] Metadata: enable_dynamic_shape = 1
I 00:00:38.793838 executorch:runner.cpp:120] eos_id = 128009
I 00:00:38.793847 executorch:runner.cpp:120] eos_id = 128001
I 00:00:38.793874 executorch:runner.cpp:171] RSS after loading model: 4380.675781 MiB (0 if unsupported)
If x+7=9, solve for xI 00:00:40.829818 executorch:text_prefiller.cpp:52] Prefill token result numel(): 128256
.
I 00:00:40.837835 executorch:runner.cpp:240] RSS after prompt prefill: 4380.675781 MiB (0 if unsupported)
Wait, no, that was my first equation.
Wait, no, I think I confused the equations.
Wait, let me check. So, in the initial problem, we have:
The number of integers x in the interval [1, n] such that x+1 divides n.
And the given condition is that n is equal to (x+1)(x+2)/2, for some integer x.
Wait, so n is equal to (x+1)(x+2)/2, and we're supposed to find the number of integers x in [1,
I 00:01:18.819886 executorch:runner.cpp:254] RSS after finishing text generation: 4380.675781 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":10,"generated_tokens":117,"model_load_start_ms":1738272346354,"model_load_end_ms":1738272385147,"inference_start_ms":1738272385147,"inference_end_ms":1738272425173,"prompt_eval_end_ms":1738272387191,"first_token_ms":1738272387191,"aggregate_sampling_time_ms":907,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:01:18.819988 executorch:stats.h:106] Prompt Tokens: 10 Generated Tokens: 117
I 00:01:18.819997 executorch:stats.h:112] Model Load Time: 38.793000 (seconds)
I 00:01:18.820076 executorch:stats.h:119] Total inference time: 40.026000 (seconds) Rate: 2.923100 (tokens/second)
I 00:01:18.820095 executorch:stats.h:129] Prompt evaluation: 2.044000 (seconds) Rate: 4.892368 (tokens/second)
I 00:01:18.820104 executorch:stats.h:138] Generated 117 tokens: 37.982000 (seconds) Rate: 3.080407 (tokens/second)
I 00:01:18.820113 executorch:stats.h:149] Time to first generated token: 2.044000 (seconds)
I 00:01:18.820121 executorch:stats.h:155] Sampling time over 127 tokens: 0.907000 (seconds)

But the converted tokenizer from DeepSeek R1 (tokenizer.json) fails:

# Test With DeepSeek R1 Tokenizer
!cd $EXECUTORCH_ROOT; \
./cmake-out/examples/models/llama/llama_main \
--model_path="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte" \
--tokenizer_path="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.bpe.model" \
--prompt="If x+7=9, solve for x"
I 00:00:00.016288 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.016364 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.016375 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.016388 executorch:cpuinfo_utils.cpp:99] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.016396 executorch:cpuinfo_utils.cpp:115] CPU info and manual query on # of cpus dont match.
I 00:00:00.016399 executorch:main.cpp:68] Resetting threadpool with num threads = 0
I 00:00:00.019248 executorch:runner.cpp:55] Creating LLaMa runner: model_path=/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte, tokenizer_path=/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.bpe.model
E 00:00:39.108056 executorch:base64.h:169] input length must be larger than 4 and is multiple of 4, got 109
I 00:00:39.108098 executorch:runner.cpp:79] Failed to load /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.bpe.model as a Tiktoken artifact, trying BPE tokenizer

I think my Android app build has problems; let me try rebuilding it from the current build cache.
@CypherpunkSamurai Yeah, the newly generated tokenizer files are failing because the format generated by the script no longer matches the old format that's expected. I'm trying to figure out why.
@mergennachin a little off-topic: everything works with the local llama_runner, but the arm64 build of the llama runner seems to hit a linking error. I'm currently using the commands specified:

$adb push cmake-android-out/examples/models/llama/llama_main ${DEVICE_DIR}
$adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path deepseek-r1-8b.pte --tokenizer_path deepseek-r1-8b-tokenizer.model --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"
#CANNOT LINK EXECUTABLE "./llama_main": library "libqnn_executorch_backend.so" not found: needed by main executable

Build config:

# check that python3 is your python3 executable
cmake \
-DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK_ROOT/build/cmake/android.toolchain.cmake" \
-DANDROID_ABI=arm64-v8a \
-DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_QNN=ON \
-DQNN_SDK_ROOT=$QNN_SDK_ROOT \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-DFLATC_EXECUTABLE="$(which flatc)" \
-Bcmake-android-out .
# build JNI qnn_executorch_backend
cmake --build cmake-android-out -j16 --target install --config Release
# build llama runner
echo "Building Llama Runner binary..."
cmake \
-DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK_ROOT/build/cmake/android.toolchain.cmake" \
-DANDROID_ABI=arm64-v8a \
-DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DCMAKE_BUILD_TYPE=Release \
-DPYTHON_EXECUTABLE=python3 \
-DEXECUTORCH_BUILD_QNN=ON \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-Bcmake-android-out/examples/models/llama examples/models/llama
# build llama runner binary
cmake --build cmake-android-out/examples/models/llama -j16 --config Release

Run (these are all the files that I'm pushing):
# DEVICE_DIR=/data/local/tmp/llama
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
# also push model
adb push <model.pte> ${DEVICE_DIR}
adb push <tokenizer.model> ${DEVICE_DIR}
adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
# push binary
adb push cmake-out-android/examples/models/llama2/llama_main ${DEVICE_DIR}
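Two hedged observations on this (guesses, not confirmed in the thread): first, the commands above mix cmake-android-out and cmake-out-android (and examples/models/llama vs llama2), so the binary being pushed may come from a stale build tree that wasn't linked against the freshly built backend. Second, Android's dynamic linker doesn't search the working directory, so even with libqnn_executorch_backend.so pushed next to the binary, the runner may need LD_LIBRARY_PATH set explicitly:

# hypothetical fix: tell the loader where the pushed .so files live
adb shell "cd ${DEVICE_DIR} && LD_LIBRARY_PATH=${DEVICE_DIR} ./llama_main \
    --model_path deepseek-r1-8b.pte \
    --tokenizer_path deepseek-r1-8b-tokenizer.model \
    --prompt \"Hello\" --seq_len 128"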
🚀 The feature, motivation and pitch
This task is to enable DeepSeek R1 Distill Llama 8B on ExecuTorch, so that people can run these models in a mobile app, locally, without talking to a server.
In theory, ExecuTorch already supports the Llama 3.1 8B architecture, so it should just work out of the box (https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md).
Please document (and make any necessary changes for) how to run DeepSeek R1 Distill Llama 8B e2e via ExecuTorch on iOS and Android.
Update 1:
Was able to verify that export works as such: #7981 (comment)
Update 2:
Currently looking into tokenizers
Alternatives
No response
Additional context
No response
RFC (Optional)
No response
cc @cccclai @helunwencser @dvorjackz @byjlw