Quickstart
From zero to running inference in about five minutes.
curl -O https://tightwad.dev/downloads/tightwad-0.4.2.tar.gz
tar -xzf tightwad-0.4.2.tar.gz
cd tightwad-0.4.2
python3 -m venv .venv && source .venv/bin/activate
pip install .
Prerequisites
You need llama.cpp built with RPC support on each machine that will participate in the pool.
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON   # or -DGGML_HIP=ON for AMD
cmake --build build --config Release -j
You’ll also need a GGUF model file; Hugging Face is the obvious place to grab one.
Option A: Speculative decoding
If you have one cheap draft box and one stronger target box, this is the fastest path.
1. Start both servers
# Machine A (draft)
ollama serve
ollama pull qwen3:1.7b

# Machine B (target)
ollama serve
ollama pull qwen3:32b
2. Auto-generate config
tightwad init \
  --draft-url http://192.168.1.10:11434 \
  --draft-model qwen3:1.7b \
  --target-url http://192.168.1.20:11434 \
  --target-model qwen3:32b
3. Start the proxy
tightwad proxy start
4. Use it
Point your chat UI at http://localhost:8088/v1 and you’re off.
To confirm it’s up:

tightwad proxy status
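Under the hood, the proxy has the draft model propose a short run of tokens and the target model verify them. The loop below is a conceptual Python sketch of that idea for the greedy-decoding case — `draft_next` and `target_next` are hypothetical stand-ins for model calls, not Tightwad's actual code:

```python
def speculate_step(draft_next, target_next, context, k=4):
    """One round of speculative decoding (greedy case).

    draft_next / target_next map a token list to the next token --
    placeholders for calls to the draft and target models.
    Returns the tokens accepted this round.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target model verifies the proposal in order, accepting
    #    the longest prefix that matches its own predictions.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if expected == tok:
            accepted.append(tok)
        else:
            # 3. First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            break
    else:
        # Every draft token matched; the target adds one bonus token.
        accepted.append(target_next(context + accepted))
    return accepted
```

When the draft agrees with the target most of the time, each round yields several tokens for roughly the cost of one target forward pass — which is why pairing a small and a large model from the same family (qwen3:1.7b and qwen3:32b above) works well.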
Option B: GPU pool
1. Start RPC workers
llama-rpc-server --host 0.0.0.0 --port 50052
2. Auto-discover and generate config
tightwad init
3. Validate
tightwad doctor
4. Start the coordinator
tightwad start
Your pooled endpoint lands at http://localhost:8080/v1.
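Both endpoints speak the OpenAI-compatible chat-completions API, so any OpenAI client library works. As a sketch, here is a call using only the Python standard library (the model name must match one defined in your config; the helper name is ours, not Tightwad's):

```python
import json
import urllib.request

def chat(base_url, model, prompt):
    """POST a chat-completions request to an OpenAI-compatible endpoint
    and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# e.g. chat("http://localhost:8080/v1", "llama-70b", "Hello!")
```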
Option C: Combined mode
Set up the GPU pool, then layer speculation on top of it.
tightwad start
tightwad proxy start
That gives you the proxy endpoint at http://localhost:8088/v1.
Configuration reference
Tightwad uses a YAML config file, usually configs/cluster.yaml. Override the path with the -c flag or the TIGHTWAD_CONFIG environment variable.
coordinator:
host: 127.0.0.1
port: 8080
binary: /usr/local/bin/llama-server
extra_args: ["--flash-attn"]
workers:
- name: gpu0
host: 192.168.1.10
port: 50052
gpu:
vendor: nvidia
model: RTX 4070 Ti Super
vram_gb: 16
model_dir: /models
models:
default: llama-70b
llama-70b:
name: Llama 3.3 70B
path: /models/llama-3.3-70b-instruct-Q4_K_M.gguf
tensor_split: "16,24"
proxy:
host: 0.0.0.0
port: 8088
draft:
url: http://192.168.1.30:11434
model_name: qwen3:1.7b
backend: ollama
target:
url: http://127.0.0.1:8080
model_name: llama-70b
backend: llamacpp
max_draft_tokens: 32
| Section | Purpose |
|---|---|
| coordinator | Runs llama-server and distributes work to RPC workers. |
| workers | Machines contributing GPU or CPU compute. |
| models | Defines the GGUF models and tensor split layout. |
| proxy | Optional speculative decoding layer in front of the target. |
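One value worth a closer look is tensor_split: llama.cpp treats those numbers as proportions when dividing the model's layers across devices, so matching them to each device's VRAM (presumably a 16 GB and a 24 GB card in the example above) keeps the split balanced. The arithmetic is roughly the following — an illustrative sketch only; llama.cpp's real allocator also weighs tensor sizes and runtime overheads:

```python
def split_layers(total_layers, ratios):
    """Distribute model layers across devices in proportion to `ratios`
    (e.g. VRAM in GB). Illustrative approximation of a tensor split."""
    total = sum(ratios)
    # Ideal fractional share per device.
    shares = [total_layers * r / total for r in ratios]
    counts = [int(s) for s in shares]
    # Hand leftover layers to the devices with the largest remainders.
    for _ in range(total_layers - sum(counts)):
        i = max(range(len(ratios)), key=lambda j: shares[j] - counts[j])
        counts[i] += 1
        shares[i] = counts[i]  # zero this device's remainder
    return counts
```

For an 80-layer model and a "16,24" split, this puts 32 layers on the first device and 48 on the second.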
Validation
tightwad doctor
tightwad doctor --fix
tightwad doctor --json
Use the doctor command before you waste an hour on a bad config.
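Many config problems reduce to a worker that simply isn't reachable. If you want to probe a box independently of Tightwad, a check of that flavor is easy to script yourself — an illustrative sketch, not what doctor actually runs:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("192.168.1.10", 50052)
```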
Downloads
Get the source directly from the site: