A LeapfrogAI API-compatible llama-cpp-python wrapper for quantized and un-quantized model inferencing across CPU infrastructures.
See the LeapfrogAI documentation website for system requirements and dependencies.
This backend is intended to be paired with the LeapfrogAI API, which provides the fully RESTful application in front of it.
The default model shipped in this repository's officially released images is a quantization of the Synthia-7b model.
Models are pulled from the Hugging Face Hub via the scripts/model_download.py script. To change which model ships with the llama-cpp-python backend, set the following environment variables:
```bash
REPO_ID   # eg: "TheBloke/SynthIA-7B-v2.0-GGUF"
FILENAME  # eg: "synthia-7b-v2.0.Q4_K_M.gguf"
REVISION  # eg: "3f65d882253d1f15a113dabf473a7c02a004d2b5"
```
If you choose a different model, be sure to also update the default config.yaml so that it matches the model files and model card in the chosen Hugging Face model repository.
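For example, the default SynthIA-7B quantization can be selected by exporting the example values shown above before running the download script (adjust the values for a different model):

```bash
# Download the default SynthIA-7B GGUF quantization (example values from above)
export REPO_ID="TheBloke/SynthIA-7B-v2.0-GGUF"
export FILENAME="synthia-7b-v2.0.Q4_K_M.gguf"
export REVISION="3f65d882253d1f15a113dabf473a7c02a004d2b5"

python scripts/model_download.py
```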
To build and deploy the llama-cpp-python backend Zarf package into an existing UDS Kubernetes cluster:
**Important:** Execute the following commands from the root of the LeapfrogAI repository.
```bash
# Used to download the model weights from the Hugging Face Hub
pip install 'huggingface_hub[cli,hf_transfer]'

# Build the Zarf package
make build-llama-cpp-python LOCAL_VERSION=dev

# Deploy the Zarf package into the existing UDS Kubernetes cluster
uds zarf package deploy packages/llama-cpp-python/zarf-package-llama-cpp-python-*-dev.tar.zst --confirm
```
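Once the package deploys, a quick sanity check is to confirm the backend pod comes up. A minimal sketch, assuming the package lands in a `leapfrogai` namespace (an assumption; adjust to your deployment):

```bash
# Confirm the llama-cpp-python pod is running.
# The namespace is an assumption -- adjust it to match your deployment.
kubectl get pods -n leapfrogai | grep llama-cpp-python
```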
To run the llama-cpp-python backend locally:
**Important:** Execute the following commands from this sub-directory (`packages/llama-cpp-python`).
```bash
# Install dev and runtime dependencies
make install

# Clone the model
# Supply a REPO_ID, FILENAME, and REVISION, as seen in the "Model Selection" section
python scripts/model_download.py

# Rename the downloaded .gguf file to the expected model.gguf
mv .model/*.gguf .model/model.gguf

# Start the model backend
make dev
```
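With `make dev` running, requests are typically made through the LeapfrogAI API rather than against the backend directly (see the note on the RESTful API above). The sketch below assumes a locally running LeapfrogAI API at `http://localhost:8080` with an OpenAI-style chat completions route and a model named `llama-cpp-python`; all three are assumptions, so check the LeapfrogAI API documentation for the actual host, route, and model name.

```bash
# Rough smoke test via the LeapfrogAI API (host, route, and model name are
# assumptions -- adjust to your local API deployment).
curl -s http://localhost:8080/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-cpp-python", "messages": [{"role": "user", "content": "Hello!"}]}'
```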