The `serve.py` script can be used to create an inference server for any of the
supported models. Provide the HuggingFace model name and tensor-parallelism
(or use the default values and run `$ python serve.py` for a single-GPU
`mistralai/Mistral-7B-v0.1` deployment):

```bash
$ python serve.py --model "mistralai/Mistral-7B-v0.1" --tensor-parallel 1
```
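For reference, the persistent deployment can also be started directly from Python through the DeepSpeed-MII API. This is a minimal sketch, assuming the `mii` package is installed and that the call below matches what `serve.py` does with the arguments shown above:

```python
import mii

# Start a persistent DeepSpeed-MII deployment for the model.
# tensor_parallel controls how many GPUs the model is sharded across.
mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=1)
```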
Connect to the persistent deployment and generate text with `client.py`. Provide
the HuggingFace model name, maximum number of generated tokens, and prompt(s)
(or, if you are using the default values, run `$ python client.py`):

```bash
$ python client.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```
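Equivalently, a client can be created and queried from Python with the MII client API; a short sketch, with the prompts and token limit mirroring the command above:

```python
import mii

# Connect to the running deployment by model name and generate text.
client = mii.client("mistralai/Mistral-7B-v0.1")
response = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
```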
Shut down the persistent deployment with `terminate.py`. Provide the HuggingFace
model name (or, if you are using the default values, run `$ python terminate.py`):

```bash
$ python terminate.py --model "mistralai/Mistral-7B-v0.1"
```
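The deployment can likewise be torn down from Python via the client object; a minimal sketch under the same assumptions as above:

```python
import mii

# Reconnect to the deployment and ask the server to shut down.
client = mii.client("mistralai/Mistral-7B-v0.1")
client.terminate_server()
```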