
Tightwad Docs

The practical docs, hosted directly on the site. Enough to install it, configure it, and stop paying cloud rates for inference.

Quickstart

From zero to running inference in about five minutes.

curl -O https://tightwad.dev/downloads/tightwad-0.4.2.tar.gz
tar -xzf tightwad-0.4.2.tar.gz
cd tightwad-0.4.2
python3 -m venv .venv && source .venv/bin/activate
pip install .

Prerequisites

You need llama.cpp built with RPC support on each machine that will participate in the pool.

cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON  # or -DGGML_HIP=ON for AMD
cmake --build build --config Release -j

You’ll also need a GGUF model file. Hugging Face is the obvious place to grab one.

Option A: Speculative decoding

If you have one cheap draft box and one stronger target box, this is the fastest path.
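The idea: the draft model cheaply proposes a batch of tokens, and the target model verifies them, keeping the longest agreeing prefix, so you pay target-model cost per batch rather than per token. A toy greedy sketch of that loop (the two `*_next` functions are stand-ins, not Tightwad APIs; the real proxy caps proposals at `max_draft_tokens`):

```python
def draft_next(ctx):
    # Stand-in for the cheap draft model's greedy next token.
    return (ctx[-1] + 1) % 50 if ctx else 0

def target_next(ctx):
    # Stand-in for the strong target model. Here it happens to agree
    # with the draft; a real draft is only right some of the time.
    return (ctx[-1] + 1) % 50 if ctx else 0

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft proposes up to k tokens cheaply.
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. Target verifies; keep the longest agreeing prefix.
        accepted, ctx = [], list(out)
        for t in proposed:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        # 3. On a mismatch, take the target's own token instead, so the
        #    output is identical to decoding with the target alone.
        if len(accepted) < len(proposed):
            accepted.append(target_next(ctx))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens]

print(speculative_decode([1, 2, 3], 5))  # → [4, 5, 6, 7, 8]
```

The more often the draft agrees with the target, the more tokens you get per target call; a bad draft degrades toward plain target-only decoding, never below it.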

1. Start both servers

# Machine A (draft)
ollama serve
ollama pull qwen3:1.7b

# Machine B (target)
ollama serve
ollama pull qwen3:32b

2. Auto-generate config

tightwad init \
  --draft-url http://192.168.1.10:11434 \
  --draft-model qwen3:1.7b \
  --target-url http://192.168.1.20:11434 \
  --target-model qwen3:32b
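Under the hood, `init` writes the proxy section of configs/cluster.yaml. Given the flags above, the generated section should look roughly like this (the `backend: ollama` value on the target is an assumption for an Ollama-served target; see the configuration reference below):

```yaml
proxy:
  host: 0.0.0.0
  port: 8088
  draft:
    url: http://192.168.1.10:11434
    model_name: qwen3:1.7b
    backend: ollama
  target:
    url: http://192.168.1.20:11434
    model_name: qwen3:32b
    backend: ollama
  max_draft_tokens: 32
```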

3. Start the proxy

tightwad proxy start

4. Use it

Point your chat UI at http://localhost:8088/v1 and you’re off.

You can check on the proxy at any time:

tightwad proxy status
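The proxy speaks the OpenAI-compatible chat completions API, so any OpenAI-style client works. A minimal stdlib-only sketch (the model name is whatever your target serves; `qwen3:32b` here follows the example above):

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    # Assemble an OpenAI-style /chat/completions request.
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8088/v1", "qwen3:32b", "Say hi")
# Requires the proxy to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```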

Option B: GPU pool

1. Start RPC workers

# On each worker machine, from your llama.cpp build:
./build/bin/rpc-server --host 0.0.0.0 --port 50052

2. Auto-discover and generate config

tightwad init

3. Validate

tightwad doctor

4. Start the coordinator

tightwad start

Your pooled endpoint lands at http://localhost:8080/v1.

Option C: Combined mode

Set up the GPU pool, then layer speculation on top of it.

tightwad start
tightwad proxy start

That gives you the proxy endpoint at http://localhost:8088/v1.
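In combined mode the proxy’s target is the coordinator itself, so the relevant wiring in configs/cluster.yaml is just (mirroring the reference config below):

```yaml
proxy:
  target:
    url: http://127.0.0.1:8080   # the coordinator from `tightwad start`
    model_name: llama-70b
    backend: llamacpp
```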

Configuration reference

Tightwad uses a YAML config file, usually configs/cluster.yaml. You can override it with -c or TIGHTWAD_CONFIG.

coordinator:
  host: 127.0.0.1
  port: 8080
  binary: /usr/local/bin/llama-server
  extra_args: ["--flash-attn"]

workers:
  - name: gpu0
    host: 192.168.1.10
    port: 50052
    gpu:
      vendor: nvidia
      model: RTX 4070 Ti Super
      vram_gb: 16
    model_dir: /models

models:
  default: llama-70b
  llama-70b:
    name: Llama 3.3 70B
    path: /models/llama-3.3-70b-instruct-Q4_K_M.gguf
    tensor_split: "16,24"

proxy:
  host: 0.0.0.0
  port: 8088
  draft:
    url: http://192.168.1.30:11434
    model_name: qwen3:1.7b
    backend: ollama
  target:
    url: http://127.0.0.1:8080
    model_name: llama-70b
    backend: llamacpp
  max_draft_tokens: 32

Section      Purpose
coordinator  Runs llama-server and distributes work to RPC workers.
workers      Machines contributing GPU or CPU compute.
models       Defines the GGUF models and tensor split layout.
proxy        Optional speculative decoding layer in front of the target.
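The `tensor_split` value controls how model layers are divided across GPUs: llama.cpp splits in proportion to the listed weights, so using each card’s VRAM in GB (as in "16,24" above, which assumes a second 24 GB card beyond the single worker shown) keeps the split proportional to memory. A rough sketch of the arithmetic (llama.cpp’s exact rounding may differ):

```python
def split_layers(n_layers, weights):
    # Divide n_layers across devices in proportion to weights;
    # the last device absorbs any rounding remainder.
    total = sum(weights)
    counts = [round(n_layers * w / total) for w in weights[:-1]]
    counts.append(n_layers - sum(counts))
    return counts

print(split_layers(80, [16, 24]))  # → [32, 48] for an 80-layer model
```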

Validation

tightwad doctor
tightwad doctor --fix
tightwad doctor --json

Use the doctor command before you waste an hour on a bad config.

Downloads

Get the source directly from the site:

/downloads/tightwad-0.4.2.tar.gz