v0.4.2


Tightwad pools your mixed CUDA + ROCm GPUs into a single OpenAI-compatible endpoint.
Speculative decoding proxy: draft fast, verify smart, stream everything.
Same output quality. 1.86× measured on 70B.* Zero cloud bill (fully local setup).

1.86× measured speedup (70B pooled)*
100% acceptance (same-family, greedy)
70B across 4 consumer GPUs over WiFi
$0 cloud bill (fully local setup)

* 1.86× measured on Llama 3.1 8B → Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi) with greedy decoding (temperature=0). 100% acceptance with same-family models under greedy decoding. Speedup depends on hardware, model pairing, network, and configuration. Your results will vary.

terminal — your junk drawer, unified
$ pip install tightwad
$ tightwad proxy start --combined
 Draft:  Llama-3.1-8B  @ localhost:8081   (M4 Metal — drafts 32 tokens/round)
 Pool:   4 GPUs / 52GB VRAM over WiFi    (4070 Ti + 3060 + 2070 + M2 Metal)
 Target: Llama-3.3-70B across pool       (too big for any single machine)
 Proxy listening on http://localhost:8088
 Acceptance rate: 100% | 1.86× speedup | 4.1 tok/s (was 2.2)
# CUDA ✓  Metal ✓  WiFi ✓  4 machines ✓  — one endpoint.

What do you actually do?

Most people don't get it at first. So here it is, dead simple. One change. That's it.

BEFORE
💬
Open WebUI
your chat app
→
🐢
Ollama :11434
Llama 3.3 70B — slow
Base URL: http://192.168.1.10:11434
⏳ Every token generated one at a time. Waiting.
→
AFTER
💬
Open WebUI
same app, no changes
→
🐷
Tightwad :8088
invisible proxy
→
⚡
Llama 3.3 70B
same output quality, 1.86× faster
Base URL: http://192.168.1.10:8088 ← only change
✓ Equivalent output quality. Just faster.
🔗

One URL change

Point your chat app at port 8088 instead of 11434. That's the entire setup from your app's perspective.

🫥

The small model is invisible

You never configure it, select it, or see it. It's like autocomplete on your phone — it suggests tokens, the big model accepts or corrects. You only see the final output.

🔬

Output quality is preserved

With greedy decoding (temperature=0), output is mathematically identical to running the large model alone. With other sampling methods, output is statistically equivalent — drawn from the same probability distribution. The big model always has final say on every token.
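Why greedy output comes out identical is easy to see in a toy sketch. The models below are hypothetical stand-ins (hash-seeded score tables, nothing like real inference), but the accept rule is the one described above: keep agreements, take the target's token at the first disagreement.

```python
import random

VOCAB = 50

def logits(model, ctx):
    # Toy stand-in for a model's next-token scores, keyed on model + context.
    rng = random.Random(hash((model, tuple(ctx))))
    return [rng.random() for _ in range(VOCAB)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def greedy(model, prompt, n):
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(argmax(logits(model, ctx)))
    return ctx[len(prompt):]

def speculative_greedy(draft, target, prompt, n, k=8):
    ctx, out = list(prompt), []
    while len(out) < n:
        proposal = greedy(draft, ctx, k)        # draft runs ahead k tokens
        for tok in proposal:
            t = argmax(logits(target, ctx))     # target verifies each slot
            ctx.append(t)
            out.append(t)                       # always keep the target's pick
            if t != tok:
                break                           # first disagreement ends the round
    return out[:n]

prompt = [1, 2, 3]
assert speculative_greedy("8b-draft", "70b-target", prompt, 12) == greedy("70b-target", prompt, 12)
```

Every emitted token is the target's argmax at its position, so the draft can only change *when* tokens arrive, never *which* tokens arrive.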

🚀

Up to 100% of tokens come from your cheap GPU

With same-family models and greedy decoding (Llama 3.1 8B → Llama 3.3 70B), we measured 100% acceptance — every single draft token accepted. A same-family pair with a much larger size gap (Qwen3-8B → Qwen3.5-397B) still hits 80%. The big model always has final say.

That's it. Change one URL. Get up to 2–3× faster responses. Same quality.

Set It Up in 20 Minutes →

Four ways to stop wasting money

Pick your poison. Stack them. Run all four. Tightwad doesn't judge — it just saves you cash.

01

RPC Cluster Mode

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8080|
+--------+----------+
         |  distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | |  AMD   |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
  70B model: covered ✓
  • Mix NVIDIA + AMD GPUs freely
  • Run 70B+ models on consumer hardware
  • Hot-swap models without restarting workers
  • Built-in benchmarking CLI
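Distributing layers in proportion to each worker's VRAM can be sketched like this. It is an illustrative heuristic, not necessarily the split llama.cpp's RPC scheduler actually computes:

```python
def split_layers(n_layers, vram_gb):
    # Assign layers proportionally to VRAM, then hand the leftover layers
    # to the workers with the largest fractional shares.
    total = sum(vram_gb)
    raw = [n_layers * v / total for v in vram_gb]
    counts = [int(x) for x in raw]
    for i in sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True):
        if sum(counts) == n_layers:
            break
        counts[i] += 1
    return counts

# An 80-layer 70B-class model across the 4-GPU pool from the benchmark:
plan = split_layers(80, [16, 12, 8, 16])   # [25, 18, 12, 25]
```

The 16GB cards carry the most layers; nothing sits idle, and the plan always sums to exactly the model's layer count.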
👑 THE KILLER FEATURE
03

Combined Mode — Speculation Over a Pool

When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.

  [Junk Hardware — P400 2GB, GTX 770, laptop CPU]
        | runs 1.7B draft, ~30 tok/s
        | sends token IDs (bytes)
        v
  [Tightwad Proxy :8088]
        | sends draft to pool for BATCH verify
        v
  [RPC GPU Pool — 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
  4.1 tok/s instead of 2.2 tok/s — 70B fits nowhere else
  • 1.86× measured speedup on Llama 3.3 70B (4 GPUs over WiFi)*
  • 100% acceptance rate with same-family draft model (greedy decoding)
  • Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
  • Pool CUDA + ROCm + Metal GPUs, speculate on top

* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM over WiFi. 519 tokens in 127s vs 512 tokens in 231s direct. Your results will vary with hardware and network conditions.

🌐 P2P DISTRIBUTION
04

Swarm Transfer — P2P Model Distribution

Models are huge. Downloading 70B from HuggingFace takes hours. Pull from every machine that already has it. Chunked transfer with piece verification — like torrents, but for GGUF files. New machine joins the cluster? It downloads the model from all your existing machines in parallel.

  [New Machine Joins Cluster]
        |
        | "I need Llama-3.3-70B-Q4_K_M.gguf"
        v
  +---------------------------+
  | Tightwad Swarm Discovery  |
  |                           |
  |  Piece 1 <--- Machine A  |  (4070 Ti — has full model)
  |  Piece 2 <--- Machine B  |  (RTX 2070 — has full model)
  |  Piece 3 <--- Machine C  |  (M2 Metal — has pieces 1-6)
  |  Piece 4 <--- Machine A  |  (parallel, rarest-first)
  |  ...                      |
  +---------------------------+
        |
        v
  SHA256-verified • ready to serve in minutes, not hours
  • Multi-source parallel download — pull from every peer simultaneously
  • SHA256 piece verification — every chunk validated before use
  • Rarest-first selection — ensures model availability across the cluster
  • Delta updates — new quantization? Only transfer the changed pieces
  • Zero central server — machines discover each other automatically
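The scheduling idea fits in a few lines. The peer map and hashes below are made up for illustration; the real swarm protocol is more involved:

```python
import hashlib
from collections import Counter

# Hypothetical peer map: which pieces of the GGUF each machine already holds.
peers = {
    "machine-a": {0, 1, 2, 3},   # full copy
    "machine-b": {0, 1, 2, 3},   # full copy
    "machine-c": {0, 1},         # partial copy
}

def rarest_first(needed, peers):
    # Fetch the pieces held by the fewest peers first, so scarce pieces
    # get replicated before common ones.
    counts = Counter(p for have in peers.values() for p in have if p in needed)
    return sorted(needed, key=lambda p: counts[p])

def piece_ok(data: bytes, expected_hex: str) -> bool:
    # Every chunk is hash-checked before it is accepted.
    return hashlib.sha256(data).hexdigest() == expected_hex

order = rarest_first({0, 1, 2, 3}, peers)   # pieces 2 and 3 are scheduled first
```

Pieces 2 and 3 exist on only two peers, so they download first; once verified, the new machine can itself serve them to the next joiner.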

Your junk drawer of compute, unified

YOUR HARDWARE (any mix works)
RTX 4070 Ti Super (16GB)
RTX 3060 (12GB)
RTX 2070 (8GB)
GTX 770 (2GB — why not)
RX 7900 XTX (24GB, AMD!)
Old Xeon (CPU only)
Laptop (M2, CPU draft)
CUDA โœ“ ROCm โœ“ CPU โœ“ Mixed โœ“
TIGHTWAD
One endpoint
Draft fast. Verify smart.
localhost:8088
OpenAI-compatible API
Without Tightwad: big model generates every token, one at a time  •  With Tightwad: all your hardware works together, big model only handles the hard tokens  •  Output quality: equivalent* • Speed: up to 2–3× faster*

The math behind the magic

🚀

Draft

Small model blazes through 32 candidate tokens at ~30+ tok/s. Fast and cheap.

🔍

Verify

Big model evaluates all 32 tokens in a single forward pass. Batch is basically free.

Accept

Keep every token both models agree on. Take the big model's token at the first disagreement.

📡

Stream

Accepted tokens stream to your app instantly. Repeat until done. Output quality is equivalent to the target model alone.
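The tokens-per-round numbers in the benchmarks can be related to acceptance rate with the standard back-of-envelope model, assuming each draft token is accepted independently with probability p (real prompts violate this, so treat it as a rough guide):

```python
def expected_tokens_per_round(p, k):
    # Expected tokens emitted per verify round: the run of accepted draft
    # tokens plus the target's own token at the stopping point.
    # Closed form: (1 - p**(k + 1)) / (1 - p) for p < 1, else k + 1.
    return sum(p**i for i in range(k + 1))

expected_tokens_per_round(1.0, 32)   # 33, matching the same-family greedy run
expected_tokens_per_round(0.64, 32)  # ~2.8 at the 64% local-pair average
```

At 100% acceptance the round always yields all 32 drafts plus the bonus token, which is exactly the "33 tokens/round" figure in the table below.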

Benchmarks that hit different

Real hardware. Real numbers. No cherry-picking. Logprobs-based batch verification is live — these acceptance rates translate directly to wall-clock speedup.

Llama 3.3 70B · 4-GPU RPC pool (52GB VRAM over WiFi) · Llama 3.1 8B draft on M4 Metal
Mode Tokens Time Speed
RPC pool direct (autoregressive) 512 231s 2.2 tok/s
RPC pool + speculation 519 127s 4.1 tok/s
⚡ Speedup 100% acceptance · 33 tokens/round 1.86×

The 70B model doesn't fit on any single machine. It's distributed across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal) over WiFi. Without speculation: painfully slow. With speculation: usable.

👑

The killer result

A 70B model across 4 consumer GPUs over WiFi — from 2.2 to 4.1 tok/s. No single machine could run this model. Speculation makes it usable.

100% acceptance

Same-family drafting (Llama 3.1 8B → Llama 3.3 70B) achieves 100% acceptance with greedy decoding. Every draft token is accepted.

⚠️

Family matters

Llama 3.2 3B → Llama 3.3 70B got only 1.6% acceptance despite sharing a tokenizer. Architecture match is critical — Llama 3.1 8B is the correct drafter.

Qwen3-32B · 4-GPU RPC pool · Qwen3-1.7B draft on M4 CPU
Mode Speed Notes
Desktop local only (4070+3060, 32B) 17.0 tok/s Best case — fits on one machine
4-GPU RPC pool (autoregressive) 3.0 tok/s Each token = full RPC round-trip
RPC pool + speculation 5.4 tok/s 32 tokens verified per batch, 100% acceptance (greedy decoding)
⚡ Pool speedup 1.8× over pool-only (3.0 → 5.4 tok/s)

RPC pooling alone is slow over WiFi (one network round-trip per token). Speculation amortizes that — 32 tokens per round-trip instead of 1. Don't pool when the model fits locally (17 tok/s local vs 5.4 tok/s pooled).

When to use combined mode

Only when the model doesn't fit on one machine. If it fits locally (17 tok/s), don't pool — just use speculation with a remote drafter.

💡

Why it works

Pool autoregressive: 1 token per network round-trip = slow. Pool + speculation: 32 tokens per round-trip = 1.8× faster. The draft model amortizes network overhead.
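A rough sketch of that amortization in Python. The RTT, verify time, and draft speed below are illustrative assumptions, not measurements:

```python
# Illustrative round-trip arithmetic; all three constants are assumed values.
rtt = 0.40         # seconds per WiFi round-trip to the pool (assumed)
verify = 0.05      # pool forward pass, roughly flat for 1 or 32 tokens (assumed)
draft_rate = 30.0  # local draft speed in tok/s (assumed)
k = 32             # draft tokens verified per round-trip

pool_only = 1 / (rtt + verify)                   # one token per round-trip
combined = k / (rtt + verify + k / draft_rate)   # k tokens per round-trip
assert combined > pool_only                      # speculation amortizes the RTT
```

With these toy numbers the pool-only rate lands near the measured 2.2 tok/s; the combined figure is an idealized ceiling, and the measured 4.1 tok/s reflects real per-round overhead and sub-100% utilization.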

Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti Super) · 130 prompts
Prompt Type Acceptance Rate Rounds Verdict
🧮 Reasoning
89%
32 Math is deterministic. Love it.
💻 Code
76%
34 Syntax is law. Both models agree.
📚 Factual
73%
18 Strong agreement on facts.
📋 List
42%
40 Phrasing varies. Still worthwhile.
🎨 Creative
39%
6 Many valid outputs. Expected.
⚡ Average
63.8%
26 64% of tokens = free.
💸

What 64% means

Nearly two-thirds of your tokens come from the cheap GPU. The expensive model only works on the hard parts.

🎯

Output quality

Equivalent to running the big model alone. With greedy decoding, mathematically identical; with other sampling, statistically equivalent.

Logprobs: live

Logprobs-based batch verification is implemented. These acceptance rates are real wall-clock speedup, not just acceptance stats.

Qwen3-8B (local GPU) → Qwen3.5-397B (API) · logprobs + whitespace normalization
Prompt Type Acceptance Rate Notes
🧮 Reasoning
88%
Highest — deterministic math
⚡ Average (normalized)
80%
Key result: 4 in 5 tokens local.
๐Ÿ†

80% acceptance: Qwen3-8B → Qwen3.5-397B

With whitespace normalization, a consumer GPU running an 8B model drafts 4 out of every 5 tokens for a 397B model. That means up to 80% fewer output tokens billed to the cloud API for the same quality output. The bigger the gap between draft and target quality, the more you save.

๐Ÿ†

Notable result

Up to 80% acceptance on a 397B model (same-family models). Your gaming PC is doing up to 80% of the work that would otherwise cost API money.

💰

API cost math

At $0.60/M output tokens (Qwen3.5-397B), 80% acceptance means you pay for roughly 20% of output tokens via the API — up to 5× reduction in output token costs. Input/prompt tokens are still processed by the API. Local GPU electricity and hardware costs not included.
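Spelled out, with an assumed monthly volume of 10M output tokens (only output tokens are counted, per the note above):

```python
# The card's cost arithmetic. The monthly volume is an assumed example.
price_per_m = 0.60   # $ per million output tokens (Qwen3.5-397B, from above)
output_m = 10        # millions of output tokens per month (assumed workload)
acceptance = 0.80    # measured same-family acceptance rate

api_only = output_m * price_per_m                       # $6.00/mo, all via API
with_drafting = output_m * (1 - acceptance) * price_per_m  # $1.20/mo billed
```

Only the 20% of tokens the draft misses get billed, which is where the "up to 5× reduction" figure comes from.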

📝

Same-family is key

Qwen3-8B + Qwen3.5-397B are from the same model family. Cross-family (e.g. Llama → Qwen) drops to ~3%. Same family = high acceptance.

Wall-clock speedup · Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti + RTX 3060) · llama-server · max_draft_tokens=32
Prompt Baseline Speculative Speedup
Capital of France 1.17s 0.90s 1.30x
Thermodynamics 12.73s 9.09s 1.40x
Prime checker 12.76s 10.15s 1.28x
Average speed 13.24s 10.95s 1.21x
TCP vs UDP 5.58s 4.88s 1.14x
Total 45.43s 35.96s 1.27x

Set max_draft_tokens: auto and Tightwad finds the sweet spot for you. Or pin it at 32 for manual control.
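One plausible shape for that auto policy, as a sketch (the heuristic Tightwad actually ships is not documented here):

```python
def tune_draft_tokens(k, acceptance, lo=4, hi=64):
    # Acceptance-driven draft-length tuning: draft longer when proposals
    # almost always land, shorter when they mostly get rejected.
    if acceptance > 0.9:
        k = min(hi, k * 2)
    elif acceptance < 0.5:
        k = max(lo, k // 2)
    return k

k = 32
k = tune_draft_tokens(k, acceptance=0.95)   # same-family greedy: go longer
```

High-acceptance pairs (the 100% same-family case) push toward long drafts that amortize round-trips; creative prompts at ~39% pull the window back down.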

⚡

Real wall-clock time

1.27x overall speedup measured end-to-end. Not theoretical โ€” actual seconds off the clock per response.

🎛️

Tune for your setup

Cross-machine HTTP overhead is the enemy. Set max_draft_tokens: auto to let Tightwad optimize round trips for you, or pin at 32 for manual control.

How much are you leaving on the table?

Slide to see your monthly cloud inference waste. Then stop doing that.

10M
$15
😭 Without Tightwad $150/mo
🐷 With Tightwad $63/mo
You save $87/mo

* Estimated savings assume the selected acceptance rate is sustained. Default uses 60% acceptance rate typical of local GPU setups. Rates vary by model pair and prompt type (58-64% local, up to 80% same-family API). Savings do not account for local electricity, hardware costs, or maintenance. Your results will vary.

Pick your archetype

Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.

🏠

The Homelab Hoarder

You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.

RPC Cluster Mode
  • Pool all your random GPUs into one endpoint
  • Run 70B models across consumer hardware
  • Zero wasted VRAM, zero cloud spend
🏗️

The Budget Builder

You want 70B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.

RPC Cluster Mode
  • Llama 3.3 70B on 4× consumer GPUs
  • No enterprise hardware required
  • Benchmark built-in to tune your setup

The Mixed Vendor Maverick

You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm run the same model together. Finally.

RPC Cluster Mode
  • CUDA + ROCm on the same model
  • llama.cpp RPC handles the hard parts
  • Coordinator distributes layers intelligently

Pick your compute configuration

Tightwad works however your hardware is set up. Consumer GPU, no GPU, cloud API — there's a config for you.

GPU → API

Slash Your API Bills

Even a GTX 1060 can draft for GPT-4. Any GPU you have — old, cheap, low VRAM — reduces your API bill. Up to 80% fewer output tokens billed to the cloud API.

🖥️
Your PC
GTX 1060 / RTX 2070 / any GPU · 8B draft
→
☁️
Cloud API
Qwen3.5-397B · pay per token
80% acceptance · up to 5× output token cost reduction · any CUDA/ROCm GPU
CPU → GPU

Zero GPU Required to Participate

Run a tiny draft model on any CPU. Verify on a remote GPU server. Your CPU-only server, your laptop, your NAS โ€” all can contribute to the cluster.

💻
Any machine
CPU only · Qwen3-1.7B draft · even a laptop
→
🖥️
GPU Server
Any GPU · any big target model
~68% acceptance · zero GPU required to participate
ENTERPRISE PLAY
CPU → API

Literally Any Computer

Data centers often run at 10–30% average utilization (industry estimates). Idle CPUs, stranded servers, that old Xeon doing nothing — put them to work drafting tokens. No GPU required, ever.

๐Ÿข
Any idle machine
Old Xeon ยท 32-core server ยท spare laptop ยท Qwen3-1.7B
โ†’
โ˜๏ธ
Cloud API
397B model ยท only for hard tokens
Stranded compute โ†’ inference revenue

Legacy GPU revival: that GTX 770 from 2013 can run Qwen3-1.7B as a drafter for a 70B target โ€” turning e-waste into productive infrastructure.

Config Draft Target Use Case Acceptance
GPU → GPU Any GPU — old, new, NVIDIA, AMD Any bigger GPU Homelab, mixed hardware ~64%
GPU → API Any GPU (even GTX 1060) Cloud API Slash API bills ~80%
CPU → GPU Any CPU, no GPU needed GPU server Zero-GPU participants ~68%
CPU → API Literally any computer Cloud API Data centers, enterprise ~68%

Homelab Setup in 30 Minutes

Four machines. One 70B model. Start with two, add machines anytime. The cluster grows.

⚡ Draft Brain
💻
MacBook Air M4
Llama 3.1 8B · Apple Silicon
Tightwad proxy :8088
Proposes 32 tokens/batch
at ~60 tok/s locally
propose →
WiFi
← verify
🖥️ GPU Pool — Target Model
Llama 3.3 70B · 52GB VRAM distributed
🖥️
Desktop
RTX 4070 Ti Super + RTX 3060
28 GB VRAM
🖥️
Gaming PC
RTX 2070
8 GB VRAM
💻
MacBook Air M2
Apple Metal
16 GB unified
3 machines · 52 GB total · rpc-server :50052
1.86× speedup
4.1 tok/s (was 2.2)
100% acceptance rate
$0 cloud spend
1
On Machines A, B, C: Start RPC workers
Pool Workers
bash (on each pool machine)
# Machine A — Desktop (4070 Ti + 3060, 28GB)
$ rpc-server -p 50052
# Machine B — Old Gaming PC (RTX 2070, 8GB)
$ rpc-server -p 50052
# Machine C — MacBook Air M2 (Metal, 16GB)
$ rpc-server -p 50052
2
On Machine D: Start the draft model
Machine D
bash
# Machine D — MacBook Air M4 (runs draft + proxy)
$ ollama run llama3.1:8b
# Confirm:
$ ollama ps
 llama3.1:8b  running
3
Install Tightwad (either machine)
Either
bash
$ git clone https://github.com/akivasolutions/tightwad.git
$ cd tightwad
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad
4
Edit configs/cluster.yaml
Either
configs/cluster.yaml — combined mode (pool + speculation)
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto            # auto-tunes based on acceptance rate
  mode: combined              # speculation over pooled GPUs
  draft:
    url: http://localhost:11434   # Machine D (M4, local draft)
    model_name: llama3.1:8b
    backend: ollama

coordinator:
  host: 0.0.0.0
  port: 8080
  model: Llama-3.3-70B-Q4_K_M.gguf

workers:
  - host: 192.168.1.10        # Machine A (4070 Ti + 3060)
    rpc_port: 50052
  - host: 192.168.1.20        # Machine B (RTX 2070)
    rpc_port: 50052
  - host: 192.168.1.30        # Machine C (M2 Metal)
    rpc_port: 50052

Find your IPs: ip addr on Linux, ipconfig on Windows, ifconfig on macOS. Add more workers anytime — the cluster grows.

5
Start the proxy
Either
bash
$ tightwad proxy start --combined
 Draft model healthy  (llama3.1:8b @ localhost:11434) — Machine D
 Pool: 3 workers online (52GB VRAM total) — A + B + C
 Target: Llama-3.3-70B distributed across pool
 Proxy listening on http://localhost:8088
6
Test it
Either
bash
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'

# Check acceptance rate
$ tightwad proxy status
 Acceptance rate: ~58% | Rounds: N | Tokens saved: N
7
Point your chat app at it
Done ✓

In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:

http://192.168.1.10:11434 → http://192.168.1.10:8088

That's it. Four machines, one endpoint. Same app. Same model name. Same output quality. Machines A, B, and C pool a 70B model that fits on no single machine. Machine D drafts and proxies. You just see 4.1 tok/s instead of 2.2.

What to expect with this setup

Same-family models (Llama 3.1 8B → Llama 3.3 70B) with greedy decoding:

Metric Result
⚡ Acceptance rate
100%
🚀 Speedup
1.86×
💬 Tokens per round
33
⏱️ Speed (pool only)
2.2 tok/s
⏱️ Speed (pool + speculation)
4.1 tok/s

Quick Start

No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.

1

Install

bash
$ git clone https://github.com/akivasolutions/tightwad.git
$ cd tightwad
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad
2

Configure your hardware

configs/cluster.yaml
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto          # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434  # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434   # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama
3

Start it & test

bash
$ tightwad proxy start
 Draft model healthy
 Target model healthy
 Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check acceptance rate stats
$ tightwad proxy status
 Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891
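The same request from Python using only the standard library; any OpenAI SDK sends an equivalent payload once its base URL points at the proxy:

```python
import json
from urllib import request

# Same call as the curl above, built by hand so the wire format is visible.
payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50,
}
req = request.Request(
    "http://localhost:8088/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the proxy running: body = request.urlopen(req).read()
# and json.loads(body)["choices"][0]["message"]["content"] holds the reply.
```

Swapping this URL for your old Ollama endpoint is the only client-side change; model name, message format, and streaming all behave as before.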
1

Build RPC workers (CUDA — Windows/Linux)

bash (worker machine)
# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052  # GPU 0
2

Configure cluster topology

configs/cluster.yaml
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24

workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
    rpc_port: 50052

models:
  llama-3.3-70b:
    path: /models/Llama-3.3-70B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
3

Start the cluster

bash
$ tightwad start
 Coordinator started
 Worker @ 192.168.1.100:50052 online
 Model llama-3.3-70b loaded across 52 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark

Everything you need, nothing you don't

Built for terminal people who hate bloat as much as they hate cloud bills.

🔁

OpenAI Compatible

Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.

🔄

Hot-Swap Models

tightwad swap model-name — swap the model while workers keep running. Zero downtime.

📡

SSE Streaming

Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering.

⌨️

CLI-First

tightwad start, tightwad proxy start, tightwad status. Simple commands for complex infrastructure.

📄

YAML Config

One file describes your entire hardware topology. Version control it. Share it. Ship it.

📊

A/B Benchmark

tightwad bench — proxy vs direct target comparison. See your exact speedup, tok/s, and per-prompt breakdown.

🧪

Dual Backends

Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.

🔒

Fallback Safety

Draft server down? fallback_on_draft_failure: true routes straight to target. Never breaks.

💻

Mixed Vendor

NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.

🧬

Family Validation

Auto-detects model architecture families. Warns you before a mismatched draft/target pair wastes hours at 1.6% acceptance.

💬

Auto Chat Templates

Detects Llama 3, Mistral, Gemma, Phi, and more. No more hardcoded Qwen3 template breaking other model families.

🎯

Auto-Tune

max_draft_tokens: auto — adjusts at runtime based on acceptance rate. Zero-config optimization.

🤝

Consensus Verification

Multiple drafters vote on tokens. When they agree, skip the target entirely. Three modes: strict, majority, any-disagree.

🌐

Peer Agent

Cross-platform cluster management without SSH. REST API on every node for version checks, GPU info, and remote control.

🛡️

Safety Checks

Version enforcement, MoE VRAM warnings, SSRF protection, bearer token auth. Production-ready out of the box.

Why not just use vLLM?

Fair question. Here's the honest answer. The other tools are good. Tightwad is for a different problem.

Comparison accurate as of March 2026. These tools evolve quickly โ€” check their docs for the latest capabilities.

vs

vLLM

Excellent production inference engine. CUDA-only. Built for ML teams.

  • Primarily CUDA-focused. ROCm support is experimental/limited. Tightwad treats CUDA and ROCm as first-class citizens in the same cluster.
  • Can't mix GPU generations. vLLM can't pool a GTX 770 with a 4070 Ti. Tightwad doesn't care what generation or vendor your hardware is from.
  • Speculative decoding, but single-machine only. Tightwad does it across your network — draft on one box, verify on another.
  • No CPU nodes. Can't add a CPU-only machine to a vLLM cluster. Tightwad: CPU drafting is fully supported.
  • Use vLLM if: you have a single powerful CUDA machine and need production-grade throughput.
vs

Ollama

The reason most people have local models. One model, one machine, beautifully simple.

  • One model, one machine. When you outgrow a single GPU, Ollama can't pool across machines. Your RTX 2070 and RTX 4070 are completely isolated from each other.
  • Can't combine machines at all. Ollama has no concept of cross-machine inference. Your hardware can't cooperate.
  • Tightwad works with Ollama. Keep Ollama on each machine — Tightwad just coordinates between them.
  • Use Ollama for getting started. Use Tightwad when you have a second machine and want them to work together.
vs

llama.cpp RPC

The low-level primitive Tightwad is built on. Powerful. Requires a lot of scripting.

  • Tightwad is built on llama.cpp RPC. We add the orchestration, YAML config, CLI, and speculative proxy on top.
  • RPC ships 100–300 MB of tensor data per network step. Tightwad's speculative proxy ships token IDs — bytes, not megabytes.
  • Use raw RPC if you want maximum control. Use Tightwad if you want it to just work.
vs

TGI (HuggingFace)

Production inference for the HuggingFace ecosystem. Great if you're already there.

  • Optimized for the HuggingFace ecosystem. Designed to work best with HuggingFace's model hub and services.
  • Tightwad is vendor-neutral. Works with your existing Ollama or llama.cpp setup. No accounts required.
  • Use TGI if you're in the HuggingFace ecosystem. Use Tightwad if you want backend-agnostic, no-strings-attached inference.

The honest summary

Single powerful CUDA machine, production workloads Use vLLM
One machine, just want to run models Use Ollama
Two or more machines — mixed GPUs, old & new, NVIDIA & AMD, or CPU-only — want them all working together 🐷 Use Tightwad