Hi,
I am interested in using a draft model for speculative decoding, and the only example I found is: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/draft_target_model
We use TensorRT-LLM (C++ runtime) with the Python executor interface: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp. Can anyone provide instructions on how to support draft-model speculative decoding on top of that?
If I understand correctly, we have to implement the logic that generates draft tokens on each iteration ourselves and then pass them to the target model's executor. Is there an executor API that does this work for us? Thanks!
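
To make the question concrete, below is a minimal sketch of the loop I believe we would have to write ourselves, in Python against the pybind executor bindings. This is an assumption-laden illustration, not a confirmed recipe: the names (`Executor`, `Request`, `OutputConfig`, `ExternalDraftTokensConfig`, and keyword arguments such as `max_tokens` and `external_draft_tokens_config`) are my best guesses at what `tensorrt_llm.bindings.executor` exposes, and the exact constructor signatures may differ by version. The `examples/draft_target_model` directory linked above is the authoritative reference.

```python
# Hypothetical sketch: manual draft/target speculative decoding with two
# executors. All names and signatures are assumptions modeled on the
# tensorrt_llm.bindings.executor pybind module; verify against your version.
import tensorrt_llm.bindings.executor as trtllm

NUM_DRAFT_TOKENS = 4   # draft tokens proposed per iteration (tunable)
MAX_NEW_TOKENS = 128

config = trtllm.ExecutorConfig()
draft = trtllm.Executor("draft_engine_dir", trtllm.ModelType.DECODER_ONLY, config)
target = trtllm.Executor("target_engine_dir", trtllm.ModelType.DECODER_ONLY, config)

# Return only newly generated tokens, not the echoed prompt.
out_cfg = trtllm.OutputConfig(exclude_input_from_output=True)
greedy = trtllm.SamplingConfig(beam_width=1)

tokens = [1, 2, 3]     # prompt token ids (placeholder)
generated = 0
while generated < MAX_NEW_TOKENS:
    # 1. Ask the draft model for a short greedy continuation.
    draft_req = trtllm.Request(input_token_ids=tokens,
                               max_tokens=NUM_DRAFT_TOKENS,
                               sampling_config=greedy,
                               output_config=out_cfg)
    rid = draft.enqueue_request(draft_req)
    draft_tokens = draft.await_responses(rid)[0].result.output_token_ids[0]

    # 2. Hand those candidates to the target model for verification;
    #    it accepts a prefix of them and appends a token of its own.
    target_req = trtllm.Request(
        input_token_ids=tokens,
        max_tokens=NUM_DRAFT_TOKENS + 1,
        sampling_config=greedy,
        output_config=out_cfg,
        external_draft_tokens_config=trtllm.ExternalDraftTokensConfig(draft_tokens))
    rid = target.enqueue_request(target_req)
    accepted = target.await_responses(rid)[0].result.output_token_ids[0]

    # 3. Keep whatever the target accepted and loop until done.
    tokens += accepted
    generated += len(accepted)
```

If that reading is right, the executor only verifies externally supplied draft tokens, and the per-iteration orchestration (running the draft model, splicing accepted tokens back in) stays on the caller's side, which matches what the linked example does.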