How can I set it up to load and run two models simultaneously, one with some layers on the GPU and the other all CPU? #5667
-
I can share what I did. I have a shell script that executes the server py file multiple times, iterating over different API endpoints (5000, 500*). The front end has a load balancer that posts requests to each endpoint, thereby handling a few parallel sessions (while other requests wait in the queue). What you could do is load the first server py with a large model at localhost:8000 and a small model at 8001. Your front end could then have a button to choose between the two; either give the choice to the user or route each request to the appropriate endpoint based on the prompt.
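A minimal sketch of that launcher idea, written in Python rather than shell for clarity. The flag names here mirror llama-cpp-python's server module and are assumptions; the model paths, layer count, and ports are placeholders you would adjust for whatever server script you are actually running.

```python
import subprocess

# Start two OpenAI-compatible server instances on separate ports:
# a large model with some layers offloaded to the GPU, and a small
# model kept entirely on the CPU.
servers = [
    ["python", "-m", "llama_cpp.server",
     "--model", "models/large-model.gguf",
     "--n_gpu_layers", "35",     # offload part of the large model to the GPU
     "--port", "8000"],
    ["python", "-m", "llama_cpp.server",
     "--model", "models/small-model.gguf",
     "--n_gpu_layers", "0",      # small model runs on CPU only
     "--port", "8001"],
]

procs = [subprocess.Popen(cmd) for cmd in servers]
for p in procs:
    p.wait()
```

The front end (or a reverse proxy in front of it) then posts each request to port 8000 or 8001 depending on which model it needs.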
-
So just run another one, with different ports set? How much overhead is there in having all that extra stuff running twice, as opposed to some hypothetical method where you have a single server that can keep two models loaded and running at the same time and lets you address them separately?
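For what the "one server, two models" version could look like, here is a rough sketch using llama-cpp-python and Flask. This is not a supported mode of the server discussed in this thread, just an illustration of the idea; the model paths, layer counts, route, and request shape are all assumptions.

```python
# Sketch: a single process holding two models and routing by the "model"
# field of the request. Assumes llama-cpp-python and Flask are installed.
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)

# Load both models once at startup: the big one partially on the GPU,
# the small one entirely on the CPU.
models = {
    "big":   Llama(model_path="models/large-model.gguf", n_gpu_layers=35),
    "small": Llama(model_path="models/small-model.gguf", n_gpu_layers=0),
}

@app.post("/v1/completions")
def completions():
    body = request.get_json()
    # Pick the model named in the request, falling back to the small one.
    llm = models.get(body.get("model", "small"), models["small"])
    result = llm(body["prompt"], max_tokens=body.get("max_tokens", 128))
    return jsonify(result)

if __name__ == "__main__":
    app.run(port=8000)
```

The main cost of running two separate server processes instead is the duplicated Python runtime and framework overhead, which is small compared to the models themselves; the model weights have to sit in memory either way.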
-
I'm trying out some VSCodium extensions, and one of them suggests separate models, a small one for quick auto-complete and a bigger one for more advanced chat and tasks; and even if I could figure out how to make it trigger models to get loaded via the OpenAI API (which doesn't seem to be happening, should I file bug on this or is just some bad settings?), even then I would still probably wanna have two models loaded to avoid the wait unloading and reloading when needing to switch from one to the other.
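With the two-endpoint setup from the first reply, the client side of this could look roughly like the following. The ports and model names are assumptions carried over from the examples above; in practice the extension's own settings would decide which endpoint each feature talks to.

```python
# Sketch: pointing an OpenAI-compatible client at two local endpoints,
# one per model, so autocomplete and chat never compete for a single slot.
from openai import OpenAI

autocomplete = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")
chat = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Fast, small-model completion for inline suggestions.
suggestion = autocomplete.completions.create(
    model="small", prompt="def fibonacci(n):", max_tokens=32
)

# Larger model for conversational tasks.
answer = chat.chat.completions.create(
    model="big",
    messages=[{"role": "user", "content": "Explain this function."}],
)

print(suggestion.choices[0].text)
print(answer.choices[0].message.content)
```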