Connect home devices into a powerful cluster to accelerate LLM inference. More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.
Supports Linux, macOS, and Windows. Optimized for ARM and x86_64 AVX2 CPUs.
News
- 12 Feb 2025 - 🚧 Merged the fundamental codebase refactor
- 9 Jan 2025 - 🍎 Llama 3.3 70B on 4 x Mac Mini M4 Pro 24GB RAM
- 28 Jul 2024 - 🌳 How to Run Llama 3.1 405B on Home Devices? Build AI Cluster!
Python 3 and a C++ compiler are required. The command will download the model and the tokenizer.
Model | Size | Command |
---|---|---|
Llama 3.1 8B Instruct Q40 | 6.32 GB | `python launch.py llama3_1_8b_instruct_q40` |
Llama 3.1 405B Instruct Q40 | 238 GB | `python launch.py llama3_1_405b_instruct_q40` |
Llama 3.2 1B Instruct Q40 | 1.7 GB | `python launch.py llama3_2_1b_instruct_q40` |
Llama 3.2 3B Instruct Q40 | 3.4 GB | `python launch.py llama3_2_3b_instruct_q40` |
Llama 3.3 70B Instruct Q40 | 40 GB | `python launch.py llama3_3_70b_instruct_q40` |
DeepSeek R1 Distill Llama 8B Q40 | 6.32 GB | `python launch.py deepseek_r1_distill_llama_8b_q40` |
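For example, to fetch the smallest model listed above:

```sh
# Downloads the Llama 3.2 1B Instruct Q40 model and its tokenizer (~1.7 GB).
python launch.py llama3_2_1b_instruct_q40
```

After the download finishes you should have a model file (`.m`) and a tokenizer file (`.t`) to pass to `dllama` via `--model` and `--tokenizer`; the exact file names and output location depend on the launch script.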
Supported architectures: Llama.
- You can run Distributed Llama only on 1, 2, 4, ... 2^n nodes.
- The maximum number of nodes is equal to the number of KV heads in the model (#70). For example, Llama 3.1 8B has 8 KV heads, so it can be split across at most 8 nodes.
The project is split into two parts:
- Root node - responsible for loading the model and weights and forwarding them to the workers. It also synchronizes the state of the neural network. The root node is itself a worker: it processes its own slice of the neural network.
- Worker node - processes its own slice of the neural network. It doesn't require any configuration related to the model.
You always need the root node, and you can add 2^n - 1 worker nodes to speed up inference. The RAM usage of the neural network is split across all nodes; the root node requires a bit more RAM than a worker node.
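As a rough illustration of the split: the 40 GB Llama 3.3 70B Q40 model on a 4-node cluster works out to roughly 10 GB of weights per node, with the root holding a bit more. A minimal two-node sketch looks like this (file names, addresses, and the port are illustrative; the full setup steps are described below):

```sh
# On the worker machine: start a worker that waits for the root node.
./dllama worker --port 9998 --nthreads 4

# On the root machine: load the model, forward slices to the worker,
# and process its own slice as well.
./dllama inference \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4 \
  --workers 192.168.0.1:9998
```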
Available commands:
- `dllama inference` - run the inference with a simple benchmark,
- `dllama chat` - run the CLI chat,
- `dllama worker` - run the worker node,
- `dllama-api` - run the API server.
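For example, a chat session on a single machine might look like this (the model and tokenizer file names are illustrative and should match the files you downloaded):

```sh
# Interactive CLI chat on a single machine (no --workers needed).
./dllama chat \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --nthreads 4
```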
🎹 Supported Arguments
Inference, Chat, API

Argument | Description | Example |
---|---|---|
`--model <path>` | Path to the model file. | `dllama_model_meta-llama-3-8b_q40.m` |
`--tokenizer <path>` | Path to the tokenizer file. | `dllama_tokenizer_llama3.t` |
`--buffer-float-type <type>` | Float precision of synchronization. | `q80` |
`--workers <workers>` | Addresses of workers (ip:port), separated by spaces. | `10.0.0.1:9991 10.0.0.2:9991` |
`--max-seq-len <n>` | The maximum sequence length; it helps to reduce RAM usage. | `4096` |
Inference, Chat, Worker, API

Argument | Description | Example |
---|---|---|
`--nthreads <n>` | Number of threads. Don't set a value higher than the number of CPU cores. | `4` |
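If you're unsure how many cores a device has, you can check before picking a value (standard system tools, not part of Distributed Llama):

```sh
nproc                # Linux: number of available CPU cores
sysctl -n hw.ncpu    # macOS: number of logical CPUs
```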
Worker, API

Argument | Description | Example |
---|---|---|
`--port <port>` | Binding port. | `9999` |
Inference

Argument | Description | Example |
---|---|---|
`--prompt <prompt>` | Initial prompt. | `"Hello World"` |
`--steps <steps>` | Number of tokens to generate. | `256` |
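Putting the tables together, the same model flags also apply to the API server; a sketch of starting it on a single machine (file names and port are illustrative) might look like this:

```sh
# Start the HTTP API server; --port and --max-seq-len come from the
# tables above. Add --workers to spread the model across a cluster.
./dllama-api \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --max-seq-len 4096 \
  --nthreads 4 \
  --port 9999
```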
Please check the discussions section, where many measurements from different configurations have been published.
Select and expand one of the sections below:
💻 macOS, Linux, or Windows
You need x86_64 CPUs with AVX2 support or ARM CPUs. Different devices in the cluster may have different CPUs.
The instructions below are for Debian-based distributions, but you can easily adapt them to your distribution, macOS, or Windows.
- Install Git and GCC:
sudo apt install git build-essential
- Clone this repository and compile Distributed Llama on all computers:
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
Continue to point 3.
On Windows:
- Install Git and Mingw (via Chocolatey):
choco install mingw
- Clone this repository and compile Distributed Llama on all computers:
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
Continue to point 3.
- Transfer weights and the tokenizer file to the root computer.
- Run worker nodes on worker computers:
./dllama worker --port 9998 --nthreads 4
- Run root node on the root computer:
./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
To add more worker nodes, just add more addresses to the `--workers` argument.
./dllama inference ... --workers 192.168.0.1:9998 192.168.0.2:9998 192.168.0.3:9998
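The root node doesn't have to run one-off inference: you can start `./dllama-api` on the root with the same `--workers` list and query the cluster over HTTP. A minimal request might look like this, assuming the server listens on port 9999 and exposes an OpenAI-compatible `/v1/chat/completions` route (both are assumptions; adjust to your setup):

```sh
# Query an API server started with: ./dllama-api ... --port 9999 --workers ...
# The /v1/chat/completions route is assumed to be OpenAI-compatible.
curl http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'
```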
📟 Raspberry Pi
- Install Raspberry Pi OS Lite (64 bit) on your Raspberry Pi devices. This OS doesn't have a desktop environment.
- Connect all devices to your switch or router.
- Connect to all devices via SSH.
ssh user@raspberrypi1.local
ssh user@raspberrypi2.local
- Install Git:
sudo apt install git
- Clone this repository and compile Distributed Llama on all devices:
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
- Transfer weights and the tokenizer file to the root device.
- Optional: assign static IP addresses.
sudo ip addr add 10.0.0.1/24 dev eth0 # 1st device
sudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device
- Run worker nodes on worker devices:
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
- Run root node on the root device:
sudo nice -n -20 ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998
To add more worker nodes, just add more addresses to the `--workers` argument.
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
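If the root node cannot reach a worker, a quick connectivity check from the root device helps narrow the problem down (the address follows the static-IP example above):

```sh
# Verify the worker device is reachable over the network.
ping -c 3 10.0.0.2
```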
Feel free to contribute to this project. For small changes, simply create a new merge request. For larger changes, please create an issue to discuss your plans. Please follow these guidelines when contributing:
- Make only minimal changes and avoid modifying files that are not necessary.
- Ensure the code is compatible across all supported systems and CPUs.
- This repository is maintained in English.
This project is released under the MIT license.
@misc{dllama,
author = {Bartłomiej Tadych},
title = {Distributed Llama},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/b4rtaz/distributed-llama}},
commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}
}