Hi, I'm playing around with LiteLLM Proxy for both cloud models (OpenAI) and self-hosted models (vLLM), and it's great.
I was trying to set up rate and token limits, but I'm not sure whether what I want is actually achievable: it doesn't seem to work, though I'm probably doing something wrong.
What I would like to do is limit invocations of the model. For example, with a proxy in front of a single OpenAI model, I'd like to set a maximum number of tokens (e.g. 1000) so that once that limit is reached the proxy returns an error, and the same for RPM.
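To make the intent concrete, here is a minimal sketch of the kind of `config.yaml` I have in mind (the `rpm`/`tpm` fields under `litellm_params` are my guess at where such per-model limits belong, so the exact parameter names may well be off):

```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
      # Intended per-model limits: at most 10 requests and 1000 tokens per minute.
      rpm: 10
      tpm: 1000
```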
The problem is that, with a configuration along those lines, it doesn't work as I expect: requests are never blocked, even with a much higher number of tokens or a much higher number of requests.
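To be clear about what I mean by "blocked": I expect a loop of requests like the one below to eventually fail with a rate-limit error from the proxy, but it never does (the URL, key, and model name here are just placeholders for this sketch):

```python
import openai
from openai import OpenAI

# Standard OpenAI client pointed at the LiteLLM Proxy
# (base URL, port, and key are placeholders).
client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

for i in range(50):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "ping"}],
        )
        print(i, response.choices[0].message.content)
    except openai.RateLimitError as err:
        # This is what I expect once the RPM/TPM limit is exceeded,
        # but the error never shows up.
        print(i, "blocked by the proxy:", err)
        break
```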
Did I miss something?
Thanks