
Reproduce/enable DeepSeek R1 Distill Llama 8B #7981

Open
mergennachin opened this issue Jan 27, 2025 · 25 comments
Labels
good first issue - Good for newcomers
module: llm - Issues related to LLM examples and apps, and to the extensions/llm/ code
module: user experience - Issues related to reducing friction for users
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@mergennachin
Contributor

mergennachin commented Jan 27, 2025

🚀 The feature, motivation and pitch

This task is to enable DeepSeek R1 Distill Llama 8B on ExecuTorch. That way, people can run these models in a mobile app, locally, without talking to a server.

In theory, ExecuTorch already supports the Llama 3.1 8B architecture, so it should work out of the box (https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md).

Please document (and make the necessary changes for) how to run DeepSeek R1 Distill Llama 8B end-to-end via ExecuTorch on iOS and Android.

Update 1:

Was able to verify that export works; see #7981 (comment)

Update 2:

Currently looking into tokenizers

Alternatives

No response

Additional context

No response

RFC (Optional)

No response

cc @cccclai @helunwencser @dvorjackz @byjlw

@mergennachin mergennachin added the good first issue, module: llm, and triaged labels Jan 27, 2025
@jackzhxng
Contributor

This should just work out of the box if we convert the distilled weights and use them in export_llama

@mergennachin
Contributor Author

This should just work out of the box if we convert the distilled weights and use them in export_llama

Yeah, we have a script to do that

https://github.com/pytorch/executorch/blob/main/examples/models/llama/UTILS.md#download-models-from-hugging-face-and-convert-from-safetensor-format-to-state-dict

@mergennachin mergennachin changed the title from "Enable DeepSeek R1 Distill Llama 8B" to "Reproduce/enable DeepSeek R1 Distill Llama 8B" Jan 27, 2025
@raziel

raziel commented Jan 29, 2025

Hi @mergennachin, would love to see a script exporting the DeepSeek model so we can give it a try with our delegates as well :-)

@CypherpunkSamurai

CypherpunkSamurai commented Jan 29, 2025

I have a distilled model, but I need help converting it from a .pth file to a .pte file.

The config file required a little cleaning, otherwise export_llama kept complaining: ModelArgs.__init__() got an unexpected keyword argument 'architectures'.

Now the script fails with a shape mismatch error: layers.0.attention.wk.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).

The shapes of all layers are different.

Any ideas how I can fix this?

@iseeyuan
Contributor

@CypherpunkSamurai It may be related to the compressed projection used in MLA. I'm creating the interface in #8039 so that different attentions can be implemented and added.

@CypherpunkSamurai

CypherpunkSamurai commented Jan 29, 2025

@iseeyuan thank you very much for your work man. Let me know if I can help contribute.

Here are the error logs, btw:

# removing these json keys fixes the ModelArgs issue
unsupported_args = [
    "architectures",
    "attention_bias",
    "attention_dropout",
    "bos_token_id",
    "eos_token_id",
    "hidden_act",
    "hidden_size",
    "initializer_range",
    "intermediate_size",
    "max_position_embeddings",
    "mlp_bias",
    "model_type",
    "num_attention_heads",
    "num_hidden_layers",
    "num_key_value_heads",
    "pretraining_tp",
    "rms_norm_eps",
    "rope_scaling",
    "tie_word_embeddings",
    "torch_dtype",
    "transformers_version",
    "use_cache"
]
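
For anyone reproducing this, here is a minimal sketch of how that cleanup can be scripted (the paths are just from my setup and are assumptions; adjust as needed):

import json

# Keys that ModelArgs doesn't understand (same list as above)
unsupported_args = [
    "architectures", "attention_bias", "attention_dropout", "bos_token_id",
    "eos_token_id", "hidden_act", "hidden_size", "initializer_range",
    "intermediate_size", "max_position_embeddings", "mlp_bias", "model_type",
    "num_attention_heads", "num_hidden_layers", "num_key_value_heads",
    "pretraining_tp", "rms_norm_eps", "rope_scaling", "tie_word_embeddings",
    "torch_dtype", "transformers_version", "use_cache",
]

with open("/tmp/deepseek-r1-llama-8b-pth/config.json") as f:
    config = json.load(f)

for key in unsupported_args:
    config.pop(key, None)  # drop if present, ignore otherwise

with open("/tmp/deepseek-r1-llama-8b-pth/config_filtered.json", "w") as f:
    json.dump(config, f, indent=2)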

Error logs:

!cd executorch \
    && python -m examples.models.llama.export_llama \
        --checkpoint "/tmp/deepseek-r1-llama-8b-pth/deepseek-r1-llama-8b.pth" \
        --params "/tmp/deepseek-r1-llama-8b-pth/config_filtered.json" \
        --output_name "deepseek-r1-llama-8b.pte" \
        -kv \
        --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/kaggle/working/executorch/examples/models/llama/export_llama.py", line 32, in <module>
    main()  # pragma: no cover
  File "/kaggle/working/executorch/examples/models/llama/export_llama.py", line 28, in main
    export_llama(args)
  File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 533, in export_llama
    builder = _export_llama(args)
  File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 668, in _export_llama
    builder_exported = _prepare_for_llama_export(args).export()
  File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 565, in _prepare_for_llama_export
    _load_llama_model(
  File "/kaggle/working/executorch/examples/models/llama/export_llama_lib.py", line 940, in _load_llama_model
    EagerModelFactory.create_model(
  File "/kaggle/working/executorch/examples/models/model_factory.py", line 44, in create_model
    model = model_class(**kwargs)
  File "/kaggle/working/executorch/examples/models/llama/model.py", line 235, in __init__
    missing, unexpected = self.model_.load_state_dict(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2581, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for Transformer:
	size mismatch for layers.0.attention.wk.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
	size mismatch for layers.0.attention.wv.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
	size mismatch for layers.0.feed_forward.w1.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
	size mismatch for layers.0.feed_forward.w2.weight: copying a param with shape torch.Size([4096, 14336]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
	size mismatch for layers.0.feed_forward.w3.weight: copying a param with shape torch.Size([14336, 4096]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
	(the same five size mismatches, for attention.wk, attention.wv, feed_forward.w1, feed_forward.w2, and feed_forward.w3, repeat for layers.1 through layers.31)

@mergennachin
Contributor Author

@CypherpunkSamurai

Can you try using this params.json file instead of config_filtered.json in your export script?

https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/original/params.json
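
For context on why params.json matters here: the mismatched shapes in the log are what you get when ModelArgs falls back to its defaults (n_kv_heads = n_heads and the default FFN sizing) because the filtered config.json doesn't carry those fields under the names params.json uses. A small sanity check in Python, assuming the usual Llama 3.1 8B hyperparameters (dim=4096, n_heads=32, n_kv_heads=8, ffn_dim_multiplier=1.3, multiple_of=1024):

dim, n_heads, n_kv_heads = 4096, 32, 8
head_dim = dim // n_heads  # 128

# wk/wv rows: n_kv_heads * head_dim = 1024 (matches the checkpoint);
# the default n_kv_heads = n_heads gives 4096 (the "current model" shape).
print(n_kv_heads * head_dim, n_heads * head_dim)  # 1024 4096

def ffn_hidden(dim, ffn_dim_multiplier, multiple_of):
    # FFN sizing as in the reference Llama ModelArgs
    hidden = int(2 * (4 * dim) / 3)
    if ffn_dim_multiplier is not None:
        hidden = int(ffn_dim_multiplier * hidden)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

# 14336 with the 3.1 params vs. 11008 with the defaults (None, 256)
print(ffn_hidden(dim, 1.3, 1024), ffn_hidden(dim, None, 256))  # 14336 11008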

@CypherpunkSamurai

@mergennachin same error. Layer sizes are not matching.

@mergennachin
Contributor Author

mergennachin commented Jan 29, 2025

Hmm, interesting. @CypherpunkSamurai (and @raziel, FYI), here's how I was able to export and create a .pte file just now:

Step 1: Set up ExecuTorch by following the getting-started guide: https://pytorch.org/executorch/main/getting-started-setup
Step 2: Run examples/models/llama/install_requirements.sh for Llama specific requirements.
Step 3: Download the model

pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B --local-dir ~/models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B --local-dir-use-symlinks False

Step 4: Convert the model to a .pth file.

pip install torchtune

and run this Python script:

from torchtune.models import convert_weights
from torchtune.training import FullModelHFCheckpointer
import torch

# Convert from safetensors to TorchTune. Suppose the model has been downloaded from Hugging Face
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir='/Users/mnachin/models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
    checkpoint_files=['model-00001-of-000002.safetensors', 'model-00002-of-000002.safetensors'],
    output_dir='/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/' ,
    model_type='LLAMA3' # or other types that TorchTune supports
)

print("loading checkpoint")
sd = checkpointer.load_checkpoint()

# Convert from TorchTune to Meta (PyTorch native)
sd = convert_weights.tune_to_meta(sd['model'])

print("saving checkpoint")
torch.save(sd, "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth")

Step 5: Download this file https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/original/params.json and save it as /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/params.json

Step 6: Export to .pte file

python -m examples.models.llama.export_llama \
    --checkpoint /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth \
	-p /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/params.json \
	-kv \
	--use_sdpa_with_kv_cache \
	-X \
	-qmode 8da4w \
	--group_size 128 \
	-d fp16 \
	--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
	--embedding-quantize 4,32 \
	--output_name="DeepSeek-R1-Distill-Llama-8B.pte"

You can continue with the runtime and iOS/Android integration here: https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md

@CypherpunkSamurai

CypherpunkSamurai commented Jan 29, 2025

@mergennachin I'm using nightly versions of torchtune, torchao, and executorch. I'm unsure what's causing the issue, but sure, I'll try this method and update you tomorrow; it's a few hours past midnight here 🥲

Thank you for the quick reply btw 😄👍🏻

@CypherpunkSamurai

CypherpunkSamurai commented Jan 30, 2025

@mergennachin the examples.models.llama.export_llama script uses around 29 GB of RAM and no VRAM for some reason. I'm using Kaggle, and it interrupts the session when usage goes over the limit.

Is this normal? Is there any way to limit memory usage when loading and converting the model?

[screenshot: Kaggle memory usage]

I'm using the same code:

!cd /tmp/executorch && \
    python -m examples.models.llama.export_llama \
    --checkpoint "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth" \
	-p "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/original/params.json" \
	-kv \
	--use_sdpa_with_kv_cache \
	-X \
	-qmode 8da4w \
	--group_size 128 \
	-d fp32 \
	--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
	--embedding-quantize 4,32 \
	--output_name="/kaggle/working/DeepSeek-R1-Distill-Llama-8B.pte"

@CypherpunkSamurai

CypherpunkSamurai commented Jan 30, 2025

I just realised it requires more RAM, so I switched to a different provider and converted the model to .pte successfully.

Here are the files
https://huggingface.co/rakeshchow202/DeepSeek-R1-Distill-Llama-8B-pth

I tried loading it along with the original Llama 3.2 Instruct tokenizer.model in the Android app using the QNN backend, and it gives the following error.

[screenshots: error in the Android app]

@mergennachin
Contributor Author

mergennachin commented Jan 30, 2025

Is there any way to limit memory usage when loading and converting the model?

@CypherpunkSamurai - For reducing RAM, you can pass -d fp16 instead of -d fp32 to the export_llama script.
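
Concretely, that would be your earlier command with only the dtype flag changed, e.g.:

!cd /tmp/executorch && \
    python -m examples.models.llama.export_llama \
    --checkpoint "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth" \
	-p "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/original/params.json" \
	-kv \
	--use_sdpa_with_kv_cache \
	-X \
	-qmode 8da4w \
	--group_size 128 \
	-d fp16 \
	--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
	--embedding-quantize 4,32 \
	--output_name="/kaggle/working/DeepSeek-R1-Distill-Llama-8B.pte"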

@mergennachin
Contributor Author

mergennachin commented Jan 30, 2025

@CypherpunkSamurai

I tried loading it along with the original llama 3.2 instruct tokenizer.model

I realized yesterday that it's not going to work. The DeepSeek-R1-Distill-Llama-8B model didn't change the Llama architecture, but they did change the tokenizer a bit, according to the link above.

DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1. We slightly change their configs and tokenizers.

Currently the runner doesn't accept the HF-native tokenizer.json and tokenizer_config.json files. Either we need a way to convert them to the tokenizer.model format, or we need to change the runner to support HF formats.

cc @tarun292 -- who may have some ideas

In the meantime, @CypherpunkSamurai - if you have some ideas, please let us know.

FWIW, torchtune repo also had similar problems pytorch/torchtune#2287, pytorch/torchtune#2212 (cc @felipemello1, @ebsmothers)
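
If conversion turns out to be the easier path, here's one possible direction (an untested sketch, not a supported flow): the Llama 3 tokenizer.model is a tiktoken-style text file with one base64-encoded token followed by its rank per line, so the base vocab in the HF tokenizer.json could in principle be dumped into that layout by undoing the byte-level BPE character mapping. Special/added tokens are not handled here (the Llama 3 Tiktoken runner defines its special tokens separately), and the paths are hypothetical:

import base64
import json

# Byte-level BPE unicode-to-byte mapping used by GPT-2/Llama 3 style tokenizers
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

# Hypothetical paths; point these at the DeepSeek distill checkout
src = "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.json"
dst = "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.model"

with open(src) as f:
    vocab = json.load(f)["model"]["vocab"]  # token string -> rank

unicode_to_byte = {u: b for b, u in bytes_to_unicode().items()}

with open(dst, "w") as out:
    for token, rank in sorted(vocab.items(), key=lambda kv: kv[1]):
        raw = bytes(unicode_to_byte[ch] for ch in token)  # undo byte-level encoding
        out.write(f"{base64.b64encode(raw).decode()} {rank}\n")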

@mergennachin mergennachin added the module: user experience label Jan 30, 2025
@CypherpunkSamurai

@mergennachin

Interesting. After writing the reply above I realized the same: the tokenizers are different, so I used a legacy convert.py script from llama.cpp (per ggerganov/llama.cpp/issues/2443 and ggerganov/llama.cpp/issues/7912).

It is supposed to convert the Hugging Face format to BPE format.

# get the model folder
pip3 install -q "huggingface_hub[cli]"
huggingface-cli download "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --local-dir "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --local-dir-use-symlinks False

# convert
pip install sentencepiece==0.1.98 "gguf>=0.1.0"

# convert tokenizer
python convert.py "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" --vocab-only --outfile "/tmp/tokenizer.model" --vocab-type bpe

But this fails with the same code 18 error, which I'm guessing is an invalid-tokenizer error (I read it from this comment).

@mergennachin mergennachin moved this from To triage to In progress in ExecuTorch DevX improvements Jan 30, 2025
@felipemello1

One user mentioned in the torchtune thread: "I hacked torchtune to use the HF AutoTokenizer and it seems to be working now."

So if unblocking is urgent, this could be a path.

@mergennachin
Contributor Author

mergennachin commented Jan 30, 2025

@felipemello1 - yeah, it might be a bit more work for ExecuTorch, since we have to write them in C++

@CypherpunkSamurai

Yeah, I downloaded the llama.cpp repo and ran the following command:

python convert_hf_to_gguf.py "/Users/mnachin/models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" --vocab-only --outfile "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.model"

and on my desktop ran this (instead of a phone)

https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-3-run-on-your-computer-to-validate

And it is failing around here:

int32_t len;
if (fread(&len, sizeof(int32_t), 1, file) != 1) {
  ET_LOG(Error, "Failed to read the length of the word at index %d", i);
  return Error::InvalidArgument;
}
vocab_[i] = new char[len + 1];
if (fread(vocab_[i], len, 1, file) != 1) {
  ET_LOG(
      Error,
      "Failed to read the word, total length %d, index %d\n",
      len,
      i);
  return Error::InvalidArgument;
}

So either the converter is doing something wrong, or our bpe_tokenizer.cpp implementation doesn't handle some edge cases. Will look into this more, but let us know if you find something. Since we can run on desktop, we can do some debugging directly.
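
To narrow down where the load breaks, a quick sanity-check sketch (an assumption: it only mirrors the length-prefixed read loop quoted above and ignores any header or score fields the real format has, so treat it as a debugging aid rather than a description of the format):

import struct

# Hypothetical path to the converted tokenizer
path = "/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.bpe.model"

with open(path, "rb") as f:
    data = f.read()

offset, index = 0, 0
while offset + 4 <= len(data):
    (length,) = struct.unpack_from("<i", data, offset)
    if length <= 0 or length > 1024 or offset + 4 + length > len(data):
        print(f"implausible length {length} at offset {offset} (entry {index})")
        break
    offset += 4 + length
    index += 1
else:
    print(f"walked {index} length-prefixed entries cleanly")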

@tarun292
Contributor

tarun292 commented Jan 30, 2025

@CypherpunkSamurai can you try with the 3.1 tokenizer, since the base model for this is 3.1? We tried with the tokenizer from here and were able to get reasonable outputs from the model: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main/original

What you tried is from the 3.2 repo from what I can see. Can you share the exact link to the tokenizer.model?

We're still going to look into getting the tokenizer generated from the DeepSeek repo working.

@CypherpunkSamurai

can you try with the 3.1 tokenizer as the base model for this is 3.1. We tried with the tokenizer's from here and are able to get reasonable outputs from the model: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main/original

Same resulting error: error 18.

@tarun292
Contributor

Are you running the CLI or the app? In the app can you pick 3.1 from the dropdown?

@CypherpunkSamurai

CypherpunkSamurai commented Jan 30, 2025

Are you running the CLI or the app? In the app can you pick 3.1 from the dropdown?

I'm sorry for not mentioning it earlier. Yes, I tried the app with both the 3.1 and 3.2 configs. Both of them throw error 18.

I also tried converting the tokenizer from DeepSeek R1 to a .model file; it returns error 18 as well.

All output models are here:
https://huggingface.co/rakeshchow202/DeepSeek-R1-Distill-Llama-8B-pth/tree/main

@mergennachin
Contributor Author

mergennachin commented Jan 30, 2025

@CypherpunkSamurai - can you try on the CLI first so that we can eliminate some other possibilities? Try with XNNPACK, but without QNN.

https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-3-run-on-your-computer-to-validate

I was able to get this

(executorch) mnachin@mnachin-mbp executorch % cmake-out/examples/models/llama/llama_main --model_path=DeepSeek-R1-Distill-Llama-8B.pte --tokenizer_path=/Users/mnachin/Downloads/tokenizer_2.model --prompt="If x+7=9, solve for x"
I 00:00:00.000529 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.000548 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.000551 executorch:cpuinfo_utils.cpp:157] Number of efficient cores 4
I 00:00:00.000552 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.001876 executorch:runner.cpp:59] Creating LLaMa runner: model_path=DeepSeek-R1-Distill-Llama-8B.pte, tokenizer_path=/Users/mnachin/Downloads/tokenizer_2.model
I 00:00:08.544093 executorch:runner.cpp:88] Reading metadata from model
I 00:00:08.544119 executorch:runner.cpp:113] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:08.544122 executorch:runner.cpp:113] Metadata: use_kv_cache = 1
I 00:00:08.544124 executorch:runner.cpp:113] Metadata: get_vocab_size = 128256
I 00:00:08.544125 executorch:runner.cpp:113] Metadata: get_bos_id = 128000
I 00:00:08.544127 executorch:runner.cpp:113] Metadata: get_max_seq_len = 128
I 00:00:08.544129 executorch:runner.cpp:113] Metadata: enable_dynamic_shape = 1
I 00:00:08.544130 executorch:runner.cpp:120] eos_id = 128009
I 00:00:08.544132 executorch:runner.cpp:120] eos_id = 128001
I 00:00:08.544138 executorch:runner.cpp:174] RSS after loading model: 0.000000 MiB (0 if unsupported)
If x+7=9, solve for xI 00:00:08.827487 executorch:text_prefiller.cpp:53] Prefill token result numel(): 128256
.

I 00:00:08.828177 executorch:runner.cpp:243] RSS after prompt prefill: 0.000000 MiB (0 if unsupported)
First, subtract 7 from both sides to get x=2.

But wait, is there another way to solve this? Maybe by thinking of it as x + 7 equals 9, so x is the difference between 9 and 7. Yeah, that makes sense.

So, x equals 2. That seems straightforward.

But maybe I should check my answer. If x is 2, then plugging back into the original equation: 2 + 7 equals 9. Yep, that's 9, so it works.

Is there a graphical method to
I 00:00:15.167418 executorch:runner.cpp:257] RSS after finishing text generation: 0.000000 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":10,"generated_tokens":117,"model_load_start_ms":1738263526051,"model_load_end_ms":1738263534593,"inference_start_ms":1738263534593,"inference_end_ms":1738263541217,"prompt_eval_end_ms":1738263534877,"first_token_ms":1738263534877,"aggregate_sampling_time_ms":71,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:15.167445 executorch:stats.h:110] 	Prompt Tokens: 10    Generated Tokens: 117
I 00:00:15.167446 executorch:stats.h:116] 	Model Load Time:		8.542000 (seconds)
I 00:00:15.167448 executorch:stats.h:126] 	Total inference time:		6.624000 (seconds)		 Rate: 	17.663043 (tokens/second)
I 00:00:15.167450 executorch:stats.h:134] 		Prompt evaluation:	0.284000 (seconds)		 Rate: 	35.211268 (tokens/second)
I 00:00:15.167452 executorch:stats.h:145] 		Generated 117 tokens:	6.340000 (seconds)		 Rate: 	18.454259 (tokens/second)
I 00:00:15.167453 executorch:stats.h:153] 	Time to first generated token:	0.284000 (seconds)
I 00:00:15.167484 executorch:stats.h:160] 	Sampling time over 127 tokens:	0.071000 (seconds)
(executorch) mnachin@mnachin-mbp executorch %

@CypherpunkSamurai

@mergennachin It looks like it's working with the Llama 3.1-8B tokenizer 😄

# Test With LLAMA 3.1 Tokenizer
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="meta-llama/Llama-3.1-8B",
    filename="original/tokenizer.model",
    local_dir="/tmp/meta-llama/Llama-3.1-8B"
)

!cd $EXECUTORCH_ROOT; \
    ./cmake-out/examples/models/llama/llama_main \
        --model_path="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte" \
        --tokenizer_path="/tmp/meta-llama/Llama-3.1-8B/original/tokenizer.model" \
        --prompt="If x+7=9, solve for x"

I 00:00:00.000671 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.000709 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.000732 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.000753 executorch:cpuinfo_utils.cpp:99] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.000763 executorch:cpuinfo_utils.cpp:115] CPU info and manual query on # of cpus dont match.
I 00:00:00.000768 executorch:main.cpp:68] Resetting threadpool with num threads = 0
I 00:00:00.000819 executorch:runner.cpp:55] Creating LLaMa runner: model_path=/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte, tokenizer_path=/tmp/meta-llama/Llama-3.1-8B/original/tokenizer.model
I 00:00:38.793702 executorch:runner.cpp:88] Reading metadata from model
I 00:00:38.793783 executorch:runner.cpp:113] Metadata: get_bos_id = 128000
I 00:00:38.793798 executorch:runner.cpp:113] Metadata: use_kv_cache = 1
I 00:00:38.793804 executorch:runner.cpp:113] Metadata: get_max_seq_len = 128
I 00:00:38.793814 executorch:runner.cpp:113] Metadata: get_vocab_size = 128256
I 00:00:38.793820 executorch:runner.cpp:113] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:38.793829 executorch:runner.cpp:113] Metadata: enable_dynamic_shape = 1
I 00:00:38.793838 executorch:runner.cpp:120] eos_id = 128009
I 00:00:38.793847 executorch:runner.cpp:120] eos_id = 128001
I 00:00:38.793874 executorch:runner.cpp:171] RSS after loading model: 4380.675781 MiB (0 if unsupported)
If x+7=9, solve for xI 00:00:40.829818 executorch:text_prefiller.cpp:52] Prefill token result numel(): 128256
.

I 00:00:40.837835 executorch:runner.cpp:240] RSS after prompt prefill: 4380.675781 MiB (0 if unsupported)
Wait, no, that was my first equation.

Wait, no, I think I confused the equations.

Wait, let me check. So, in the initial problem, we have:

The number of integers x in the interval [1, n] such that x+1 divides n.

And the given condition is that n is equal to (x+1)(x+2)/2, for some integer x.

Wait, so n is equal to (x+1)(x+2)/2, and we're supposed to find the number of integers x in [1,
I 00:01:18.819886 executorch:runner.cpp:254] RSS after finishing text generation: 4380.675781 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":10,"generated_tokens":117,"model_load_start_ms":1738272346354,"model_load_end_ms":1738272385147,"inference_start_ms":1738272385147,"inference_end_ms":1738272425173,"prompt_eval_end_ms":1738272387191,"first_token_ms":1738272387191,"aggregate_sampling_time_ms":907,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:01:18.819988 executorch:stats.h:106] 	Prompt Tokens: 10    Generated Tokens: 117
I 00:01:18.819997 executorch:stats.h:112] 	Model Load Time:		38.793000 (seconds)
I 00:01:18.820076 executorch:stats.h:119] 	Total inference time:		40.026000 (seconds)		 Rate: 	2.923100 (tokens/second)
I 00:01:18.820095 executorch:stats.h:129] 		Prompt evaluation:	2.044000 (seconds)		 Rate: 	4.892368 (tokens/second)
I 00:01:18.820104 executorch:stats.h:138] 		Generated 117 tokens:	37.982000 (seconds)		 Rate: 	3.080407 (tokens/second)
I 00:01:18.820113 executorch:stats.h:149] 	Time to first generated token:	2.044000 (seconds)
I 00:01:18.820121 executorch:stats.h:155] 	Sampling time over 127 tokens:	0.907000 (seconds)

But the converted tokenizer from DeepSeek R1 (tokenizer.json) fails:

# Test With DeepSeek R1 Tokenizer
!cd $EXECUTORCH_ROOT; \
    ./cmake-out/examples/models/llama/llama_main \
        --model_path="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte" \
        --tokenizer_path="/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.bpe.model" \
        --prompt="If x+7=9, solve for x"

I 00:00:00.016288 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.016364 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.016375 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.016388 executorch:cpuinfo_utils.cpp:99] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.016396 executorch:cpuinfo_utils.cpp:115] CPU info and manual query on # of cpus dont match.
I 00:00:00.016399 executorch:main.cpp:68] Resetting threadpool with num threads = 0
I 00:00:00.019248 executorch:runner.cpp:55] Creating LLaMa runner: model_path=/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/converted.fp16.pte, tokenizer_path=/tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.bpe.model
E 00:00:39.108056 executorch:base64.h:169] input length must be larger than 4 and is multiple of 4, got 109
I 00:00:39.108098 executorch:runner.cpp:79] Failed to load /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tokenizer.bpe.model as a Tiktoken artifact, trying BPE tokenizer

I think my Android app build has problems; let me try rebuilding it from the current build cache.

@tarun292
Contributor

@CypherpunkSamurai Yeah, the newly generated tokenizer files are failing because the format the script now generates doesn't match the old format that's expected. I'm trying to figure out why.

@CypherpunkSamurai

CypherpunkSamurai commented Feb 1, 2025

@mergennachin A little off-topic: everything works with the local llama_runner, but the arm64 build of the llama runner seems to have a linking error. I'm currently using the specified v2.26.0.240828.zip QNN SDK and android-ndk-r26d-linux.zip for compiling ExecuTorch and the llama runner.

$adb push cmake-android-out/examples/models/llama/llama_main ${DEVICE_DIR}
$adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path deepseek-r1-8b.pte --tokenizer_path deepseek-r1-8b-tokenizer.model --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"
#CANNOT LINK EXECUTABLE "./llama_main": library "libqnn_executorch_backend.so" not found: needed by main executable

build config:

# check that python3 is your python3 executable
cmake \
    -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK_ROOT/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=$QNN_SDK_ROOT \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DFLATC_EXECUTABLE="$(which flatc)" \
    -Bcmake-android-out .
# build JNI qnn_executorch_backend
cmake --build cmake-android-out -j16 --target install --config Release

# build llama runner
echo "Building Llama Runner binary..."
cmake \
    -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK_ROOT/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DPYTHON_EXECUTABLE=python3 \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out/examples/models/llama examples/models/llama

# build llama runner binary
cmake --build cmake-android-out/examples/models/llama -j16 --config Release

run

# this is all the files that im pushing
# DEVICE_DIR=/data/local/tmp/llama
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}

# also push model
adb push <model.pte> ${DEVICE_DIR}
adb push <tokenizer.model> ${DEVICE_DIR}
adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}

# push binary
adb push cmake-android-out/examples/models/llama/llama_main ${DEVICE_DIR}
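
One guess (untested): libqnn_executorch_backend.so is pushed to ${DEVICE_DIR}, but the binary is launched without any library search path set, so the dynamic linker may simply not be looking there. Prefixing the run with LD_LIBRARY_PATH (and ADSP_LIBRARY_PATH for the HTP skel libraries) might be enough:

adb shell "cd ${DEVICE_DIR} && \
    LD_LIBRARY_PATH=${DEVICE_DIR} ADSP_LIBRARY_PATH=${DEVICE_DIR} \
    ./llama_main --model_path deepseek-r1-8b.pte \
        --tokenizer_path deepseek-r1-8b-tokenizer.model \
        --prompt \"<same prompt as above>\" --seq_len 128"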
