
[Question] Document/examples to enable draft model speculative decoding using c++ executor API #2424

Open
ynwang007 opened this issue Nov 7, 2024 · 2 comments

@ynwang007

Hi,

I am interested in using a draft model for speculative decoding, and the only example I found is https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/draft_target_model.

We use TensorRT-LLM (C++ runtime) with the Python executor interface: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp. Can anyone provide instructions on how to support draft model speculative decoding on top of that?

If I understand correctly, we have to implement the logic ourselves to generate draft tokens at each iteration and then pass them to the target model executor? Is there an executor API that does this work for us? Thanks!

@hello-11 hello-11 added question Further information is requested triaged Issue has been triaged by maintainers labels Nov 8, 2024
@achartier

That's correct. You can find an example using ExternalDraftTokensConfig in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L628

An example using the C++ executor API will be provided in a future update.
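
In the meantime, here is a rough sketch of such a loop on top of the Python executor bindings. Treat it as untested pseudocode rather than a drop-in solution: the engine paths and tuning constants are placeholders, and keyword names such as max_tokens, output_config, and external_draft_tokens_config can differ between releases, so check model_runner_cpp.py linked above for the exact signatures in your version.

```python
# Hypothetical draft/target speculative decoding loop using the
# tensorrt_llm executor bindings. Paths and constants are placeholders.
import tensorrt_llm.bindings.executor as trtllm

DRAFT_ENGINE_DIR = "draft_engine_dir"    # placeholder engine path
TARGET_ENGINE_DIR = "target_engine_dir"  # placeholder engine path
NUM_DRAFT_TOKENS = 4                     # draft tokens proposed per iteration
MAX_TOTAL_TOKENS = 256

draft = trtllm.Executor(DRAFT_ENGINE_DIR, trtllm.ModelType.DECODER_ONLY,
                        trtllm.ExecutorConfig(max_beam_width=1))
target = trtllm.Executor(TARGET_ENGINE_DIR, trtllm.ModelType.DECODER_ONLY,
                         trtllm.ExecutorConfig(max_beam_width=1))

# Only return newly generated tokens, not the echoed prompt.
out_cfg = trtllm.OutputConfig(exclude_input_from_output=True)

tokens = [1, 2, 3, 4]  # prompt token ids (placeholder)

while len(tokens) < MAX_TOTAL_TOKENS:
    # 1) Propose a few candidate tokens with the cheap draft model.
    draft_req = trtllm.Request(input_token_ids=tokens,
                               max_tokens=NUM_DRAFT_TOKENS,
                               output_config=out_cfg)
    rid = draft.enqueue_request(draft_req)
    draft_tokens = draft.await_responses(rid)[0].result.output_token_ids[0]

    # 2) Verify them with the target model in a single request; the target
    #    accepts a prefix of the draft and emits one corrected token.
    target_req = trtllm.Request(
        input_token_ids=tokens,
        max_tokens=NUM_DRAFT_TOKENS + 1,
        output_config=out_cfg,
        external_draft_tokens_config=trtllm.ExternalDraftTokensConfig(
            draft_tokens))
    rid = target.enqueue_request(target_req)
    accepted = target.await_responses(rid)[0].result.output_token_ids[0]

    if not accepted:  # EOS handling omitted for brevity
        break
    tokens += accepted
```

Note also that the target engine has to be built with external draft token support; the draft_target_model example linked in the issue description shows the required trtllm-build options.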
