Significantly reduced inference speed after LoRA finetuning #1104

Open

hschaeufler opened this issue Nov 12, 2024 · 1 comment
hschaeufler (Contributor) commented Nov 12, 2024

Describe the bug
I found that after fine-tuning with LoRA, the token throughput is significantly reduced. I trained a model for unit test generation and then fused the LoRA adapter.

For my test dataset, the LoRA-tuned (fused) model took 8:55:34 h and generated a total of 246,362 tokens, a throughput of about 7.67 tokens per second.

The base model only took 2:2:17 h and generated 189,509 tokens; by my calculation, that is around 21 tokens per second.

The LoRA paper states:

Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
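
As I understand it, the merge the paper describes is just an addition of same-shaped matrices (using the paper's notation, with scaling alpha/r):

    W' = W0 + (alpha / r) * B A

so the fused weight has exactly the same shape as the base weight and the per-token compute should be identical.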

Is this reduction normal or within the expected range?

To Reproduce

mlx_lm.fuse --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
    --adapter-path "results/llama3_1_8B_instruct_lora/tuning_20/adapters" \
    --save-path "results/llama3_1_8B_instruct_lora/tuning_20/lora_fused_model/"
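
A quicker way to compare raw generation speed, independent of my test setup, would be running mlx_lm.generate on the base and on the fused model and comparing the reported generation tokens-per-sec (the prompt and --max-tokens values below are just placeholders):

mlx_lm.generate --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
    --prompt "<same test prompt>" --max-tokens 256

mlx_lm.generate --model "results/llama3_1_8B_instruct_lora/tuning_20/lora_fused_model/" \
    --prompt "<same test prompt>" --max-tokens 256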

Expected behavior
I would expect a significantly higher token rate, with only a minor impact from the LoRA tuning.

Desktop (please complete the following information):

  • MLX-LM 0.18.2


awni (Member) commented Nov 16, 2024

I tried to reproduce this but was not successful:

  • Generate with a base model at a given toks/sec
  • Fine-tune then fuse adapters
  • Generate with fused model

The fused model generation has the same toks/sec as the base model for me.
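
In outline, those steps look roughly like this (the data path, adapter/save paths, and prompt here are illustrative placeholders, not the exact setup I ran):

mlx_lm.generate --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
    --prompt "<test prompt>" --max-tokens 256

mlx_lm.lora --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
    --train --data <path/to/data> --adapter-path adapters

mlx_lm.fuse --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
    --adapter-path adapters --save-path fused_model

mlx_lm.generate --model fused_model \
    --prompt "<test prompt>" --max-tokens 256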

I think we'll need to know more about what you are doing in order to understand whether this is expected and, if not, to help debug.
