How can I set it up to load and run two models simultaneously, one with some layers on the GPU and the other all CPU? #5667
-
I can share what I did. I have a shell script that executes the server py file multiple times, iterating over different API endpoints (5000, 500*). The front end has a load balancer that posts requests to each endpoint, thereby handling a few parallel sessions (while other requests wait in the queue). What you could do is load the first server py with a large model at localhost:8000 and a small model at 8001. Your front end could then have a button to choose between the two; either give the choice to the user or route each request to the appropriate endpoint based on the prompt.
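A minimal sketch of that launcher idea, written in Python rather than shell for clarity. The flag names here mirror llama-cpp-python's server module and are assumptions; the model paths, layer count, and ports are placeholders you would adjust for whatever server script you are actually running.

```python
import subprocess

# Start two OpenAI-compatible server instances on separate ports:
# a large model with some layers offloaded to the GPU, and a small
# model kept entirely on the CPU.
servers = [
    ["python", "-m", "llama_cpp.server",
     "--model", "models/large-model.gguf",
     "--n_gpu_layers", "35",     # offload part of the large model to the GPU
     "--port", "8000"],
    ["python", "-m", "llama_cpp.server",
     "--model", "models/small-model.gguf",
     "--n_gpu_layers", "0",      # small model runs on CPU only
     "--port", "8001"],
]

procs = [subprocess.Popen(cmd) for cmd in servers]
for p in procs:
    p.wait()
```

The front end (or a reverse proxy in front of it) then posts each request to port 8000 or 8001 depending on which model it needs.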
-
So just run another one, with different ports set? How much overhead is there in having all that extra stuff running twice, as opposed to some hypothetical method where you have a single server that can keep two models loaded and running at the same time and lets you address them separately?
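For what the "one server, two models" version could look like, here is a rough sketch using llama-cpp-python and Flask. This is not a supported mode of the server discussed in this thread, just an illustration of the idea; the model paths, layer counts, route, and request shape are all assumptions.

```python
# Sketch: a single process holding two models and routing by the "model"
# field of the request. Assumes llama-cpp-python and Flask are installed.
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)

# Load both models once at startup: the big one partially on the GPU,
# the small one entirely on the CPU.
models = {
    "big":   Llama(model_path="models/large-model.gguf", n_gpu_layers=35),
    "small": Llama(model_path="models/small-model.gguf", n_gpu_layers=0),
}

@app.post("/v1/completions")
def completions():
    body = request.get_json()
    # Pick the model named in the request, falling back to the small one.
    llm = models.get(body.get("model", "small"), models["small"])
    result = llm(body["prompt"], max_tokens=body.get("max_tokens", 128))
    return jsonify(result)

if __name__ == "__main__":
    app.run(port=8000)
```

The main cost of running two separate server processes instead is the duplicated Python runtime and framework overhead, which is small compared to the models themselves; the model weights have to sit in memory either way.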
-
I'm trying out some VSCodium extensions, and one of them suggests separate models, a small one for quick auto-complete and a bigger one for more advanced chat and tasks; and even if I could figure out how to make it trigger models to get loaded via the OpenAI API (which doesn't seem to be happening, should I file bug on this or is just some bad settings?), even then I would still probably wanna have two models loaded to avoid the wait unloading and reloading when needing to switch from one to the other.
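With the two-endpoint setup from the first reply, the client side of this could look roughly like the following. The ports and model names are assumptions carried over from the examples above; in practice the extension's own settings would decide which endpoint each feature talks to.

```python
# Sketch: pointing an OpenAI-compatible client at two local endpoints,
# one per model, so autocomplete and chat never compete for a single slot.
from openai import OpenAI

autocomplete = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")
chat = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Fast, small-model completion for inline suggestions.
suggestion = autocomplete.completions.create(
    model="small", prompt="def fibonacci(n):", max_tokens=32
)

# Larger model for conversational tasks.
answer = chat.chat.completions.create(
    model="big",
    messages=[{"role": "user", "content": "Explain this function."}],
)

print(suggestion.choices[0].text)
print(answer.choices[0].message.content)
```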