Quickstart
From zero to running inference in about five minutes.
curl -O https://tightwad.dev/downloads/tightwad-0.4.2.tar.gz
tar -xzf tightwad-0.4.2.tar.gz
cd tightwad-0.4.2
python3 -m venv .venv && source .venv/bin/activate
pip install .
Prerequisites
You need llama.cpp built with RPC support on each machine that will participate in the pool.
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON   # or -DGGML_HIP=ON for AMD
cmake --build build --config Release -j
You’ll also need a GGUF model file; Hugging Face is the obvious place to grab one.
Option A: Speculative decoding
If you have one cheap draft box and one stronger target box, this is the fastest path.
1. Start both servers
# Machine A (draft)
ollama serve
ollama pull qwen3:1.7b

# Machine B (target)
ollama serve
ollama pull qwen3:32b
2. Auto-generate config
tightwad init \
  --draft-url http://192.168.1.10:11434 \
  --draft-model qwen3:1.7b \
  --target-url http://192.168.1.20:11434 \
  --target-model qwen3:32b
3. Start the proxy
tightwad proxy start
4. Use it
Point your chat UI at http://localhost:8088/v1 and you’re off.
To confirm it’s up:

tightwad proxy status
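Under the hood, the proxy has the draft model propose a short run of tokens and the target model verify them. The loop below is a conceptual Python sketch of that idea for the greedy-decoding case — `draft_next` and `target_next` are hypothetical stand-ins for model calls, not Tightwad's actual code:

```python
def speculate_step(draft_next, target_next, context, k=4):
    """One round of speculative decoding (greedy case).

    draft_next / target_next map a token list to the next token --
    placeholders for calls to the draft and target models.
    Returns the tokens accepted this round.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target model verifies the proposal in order, accepting
    #    the longest prefix that matches its own predictions.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if expected == tok:
            accepted.append(tok)
        else:
            # 3. First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            break
    else:
        # Every draft token matched; the target adds one bonus token.
        accepted.append(target_next(context + accepted))
    return accepted
```

When the draft agrees with the target most of the time, each round yields several tokens for roughly the cost of one target forward pass — which is why pairing a small and a large model from the same family (qwen3:1.7b and qwen3:32b above) works well.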
Option B: GPU pool
1. Start RPC workers
llama-rpc-server --host 0.0.0.0 --port 50052
2. Auto-discover and generate config
tightwad init
3. Validate
tightwad doctor
4. Start the coordinator
tightwad start
Your pooled endpoint lands at http://localhost:8080/v1.
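Both endpoints speak the OpenAI-compatible chat-completions API, so any OpenAI client library works. As a sketch, here is a call using only the Python standard library (the model name must match one defined in your config; the helper name is ours, not Tightwad's):

```python
import json
import urllib.request

def chat(base_url, model, prompt):
    """POST a chat-completions request to an OpenAI-compatible endpoint
    and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# e.g. chat("http://localhost:8080/v1", "llama-70b", "Hello!")
```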
Option C: Combined mode
Set up the GPU pool, then layer speculation on top of it.
tightwad start
tightwad proxy start
That gives you the proxy endpoint at http://localhost:8088/v1.
Configuration reference
Tightwad uses a YAML config file, usually configs/cluster.yaml. Override the path with the -c flag or the TIGHTWAD_CONFIG environment variable.
coordinator:
host: 127.0.0.1
port: 8080
binary: /usr/local/bin/llama-server
extra_args: ["--flash-attn"]
workers:
- name: gpu0
host: 192.168.1.10
port: 50052
gpu:
vendor: nvidia
model: RTX 4070 Ti Super
vram_gb: 16
model_dir: /models
models:
default: llama-70b
llama-70b:
name: Llama 3.3 70B
path: /models/llama-3.3-70b-instruct-Q4_K_M.gguf
tensor_split: "16,24"
proxy:
host: 0.0.0.0
port: 8088
draft:
url: http://192.168.1.30:11434
model_name: qwen3:1.7b
backend: ollama
target:
url: http://127.0.0.1:8080
model_name: llama-70b
backend: llamacpp
max_draft_tokens: 32
| Section | Purpose |
|---|---|
| coordinator | Runs llama-server and distributes work to RPC workers. |
| workers | Machines contributing GPU or CPU compute. |
| models | Defines the GGUF models and tensor split layout. |
| proxy | Optional speculative decoding layer in front of the target. |
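One value worth a closer look is tensor_split: llama.cpp treats those numbers as proportions when dividing the model's layers across devices, so matching them to each device's VRAM (presumably a 16 GB and a 24 GB card in the example above) keeps the split balanced. The arithmetic is roughly the following — an illustrative sketch only; llama.cpp's real allocator also weighs tensor sizes and runtime overheads:

```python
def split_layers(total_layers, ratios):
    """Distribute model layers across devices in proportion to `ratios`
    (e.g. VRAM in GB). Illustrative approximation of a tensor split."""
    total = sum(ratios)
    # Ideal fractional share per device.
    shares = [total_layers * r / total for r in ratios]
    counts = [int(s) for s in shares]
    # Hand leftover layers to the devices with the largest remainders.
    for _ in range(total_layers - sum(counts)):
        i = max(range(len(ratios)), key=lambda j: shares[j] - counts[j])
        counts[i] += 1
        shares[i] = counts[i]  # zero this device's remainder
    return counts
```

For an 80-layer model and a "16,24" split, this puts 32 layers on the first device and 48 on the second.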
Validation
tightwad doctor
tightwad doctor --fix
tightwad doctor --json
Use the doctor command before you waste an hour on a bad config.
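Many config problems reduce to a worker that simply isn't reachable. If you want to probe a box independently of Tightwad, a check of that flavor is easy to script yourself — an illustrative sketch, not what doctor actually runs:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("192.168.1.10", 50052)
```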
Downloads
Get the source directly from the site: