[Usage]: The TP improvement is not as expectation #274

JunxiChhen · 2024-09-12T01:46:41Z

Your current environment

Offline inference results for Llama-3-8B using benchmark_latency.py, swept over 1, 2, and 4 cards:

And the optimum-habana results:

The results show that on 1 card, vLLM outperforms optimum-habana. On multiple cards, however, the performance gain from tensor parallelism (TP) in vLLM is too small, so vLLM ends up slower than optimum-habana.
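
To put a number on "TP gain", one option is to compute speedup and parallel efficiency from the measured latencies. This is only a minimal sketch: the helper name `tp_scaling` is hypothetical, and the latencies in `example` are placeholders for illustration, not the measurements from this issue.

```python
def tp_scaling(latency_by_cards):
    """Return {cards: (speedup, efficiency)} relative to the smallest card count.

    latency_by_cards maps a card count to an end-to-end latency in seconds.
    """
    base_cards = min(latency_by_cards)
    base_latency = latency_by_cards[base_cards]
    report = {}
    for cards, latency in sorted(latency_by_cards.items()):
        # Speedup: how many times faster than the baseline run.
        speedup = base_latency / latency
        # Efficiency: speedup divided by the ideal (linear) speedup.
        efficiency = speedup / (cards / base_cards)
        report[cards] = (speedup, efficiency)
    return report


# Placeholder latencies in seconds (assumed values, NOT data from this issue).
example = {1: 4.0, 2: 2.5, 4: 2.0}
for cards, (speedup, eff) in tp_scaling(example).items():
    print(f"{cards} card(s): speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Running the same calculation on the vLLM and optimum-habana latencies would show how far each one falls from linear scaling at 2 and 4 cards.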

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

wpyszka · 2024-09-24T07:50:39Z

@JunxiChhen, vLLM performance improvements are planned for the SW1.19 release. Please stay tuned.
