Tightwad pools your mixed CUDA + ROCm GPUs into a single OpenAI-compatible endpoint.
Speculative decoding proxy: draft fast, verify smart, stream everything.
Same output quality. 1.86× measured on 70B.* Zero cloud bill (fully local setup).
* 1.86× measured on Llama 3.1 8B → Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi) with greedy decoding (temperature=0). 100% acceptance with same-family models under greedy decoding. Speedup depends on hardware, model pairing, network, and configuration. Your results will vary.
```
$ pip install tightwad
$ tightwad proxy start --combined
✓ Draft: Llama-3.1-8B @ localhost:8081 (M4 Metal — drafts 32 tokens/round)
✓ Pool: 4 GPUs / 52GB VRAM over WiFi (4070 Ti + 3060 + 2070 + M2 Metal)
✓ Target: Llama-3.3-70B across pool (too big for any single machine)
✓ Proxy listening on http://localhost:8088
→ Acceptance rate: 100% | 1.86× speedup | 4.1 tok/s (was 2.2)

# CUDA → Metal → WiFi → 4 machines → one endpoint.
```
Most people don't get it at first. So here it is, dead simple. One change. That's it.
Point your chat app at port 8088 instead of 11434. That's the entire setup from your app's perspective.
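If you'd rather see "the entire setup" as code — a minimal stdlib sketch of the same request your app already sends, just aimed at the new port. Host and port are the README's example values:

```python
# Minimal sketch: same OpenAI-style request, new port. Stdlib only.
import json
import urllib.request

BASE = "http://localhost:8088"  # was http://localhost:11434 (direct to Ollama)

req = urllib.request.Request(
    f"{BASE}/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 50,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) sends it once the proxy is up.
print(req.full_url)
```

Any OpenAI SDK works the same way: override the base URL, keep everything else.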
You never configure it, select it, or see it. It's like autocomplete on your phone — it suggests tokens, the big model accepts or corrects. You only see the final output.
With greedy decoding (temperature=0), output is mathematically identical to running the large model alone. With other sampling methods, output is statistically equivalent — drawn from the same probability distribution. The big model always has final say on every token.
With same-family models and greedy decoding (Llama 3.1 8B → Llama 3.3 70B), we measured 100% acceptance — every single draft token accepted. Cross-family pairs (Qwen3-8B → 397B) still hit 80%. The big model always has final say.
That's it. Change one URL. Get up to 2-3× faster responses (1.86× measured here). Same quality.
Set It Up in 20 Minutes

Pick your poison. Stack them. Run all four. Tightwad doesn't judge — it just saves you cash.
Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.
```
[OpenAI Client]
        |
        v
+-------------------+
|     Tightwad      |  <-- One endpoint to rule them all
| Coordinator :8080 |
+--------+----------+
         | distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | | AMD    |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
```
70B model: covered ✓
Your cheap GPU isn't slow — it's a draft engine. A fast small model guesses tokens. A big model batch-verifies them. Same output quality as running the big model alone. Ships token IDs (bytes), not 100–300 MB of tensor data over the wire.
```
[Your App / OpenAI SDK]
          |
          v
+--------------------------+
|   Tightwad Proxy :8088   |
|                          |
| 1. Draft 32 tokens ------+--> Qwen3-8B
|    (~100 tok/s, cheap)   |    RTX 2070 (the dusty one)
|                          |
| 2. Verify batch ---------+--> Llama 3.3 70B
|    (one forward pass)    |    4070Ti / Cloud API
|                          |
| 3. Accept/reject <-------+
| 4. Stream to client      |
+--------------------------+
```
Output quality = equivalent to 70B alone ✓
When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.
```
[Junk Hardware — P400 2GB, GTX 770, laptop CPU]
        | runs 1.7B draft, ~30 tok/s
        | sends token IDs (bytes)
        v
[Tightwad Proxy :8088]
        | sends draft to pool for BATCH verify
        v
[RPC GPU Pool — 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
4.1 tok/s instead of 2.2 tok/s — 70B fits nowhere else
```
* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM over WiFi. 519 tokens in 127s vs 512 tokens in 231s direct. Your results will vary with hardware and network conditions.
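Why the batch matters, in back-of-envelope form. The per-round costs below are assumptions picked to land near the measured numbers — they are illustrations of the amortization, not measurements:

```python
# Illustrative latency model: every round pays the pool's round-trip +
# forward-pass cost once, whether it carries 1 token or a whole batch.
def pool_tps(tokens_per_round, draft_s, round_cost_s):
    """Tokens/s when each round pays one pooled round-trip + forward pass."""
    return tokens_per_round / (draft_s + round_cost_s)

# Direct: every single token pays the full WiFi trip through 4 machines.
direct = pool_tps(tokens_per_round=1, draft_s=0.0, round_cost_s=0.45)

# Speculative: ~33 tokens drafted locally (assume ~4s), then ONE batched
# verify over the pool (assume ~3.5s), all accepted under greedy decoding.
spec = pool_tps(tokens_per_round=33, draft_s=4.0, round_cost_s=3.5)

print(f"{direct:.1f} tok/s direct vs {spec:.1f} tok/s with speculation")
```

Same round-trip tax, split 33 ways instead of paid per token — that's the whole trick.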
Models are huge. Downloading 70B from HuggingFace takes hours. Pull from every machine that already has it. Chunked transfer with piece verification — like torrents, but for GGUF files. New machine joins the cluster? It downloads the model from all your existing machines in parallel.
```
[New Machine Joins Cluster]
        |
        | "I need Llama-3.3-70B-Q4_K_M.gguf"
        v
+---------------------------+
| Tightwad Swarm Discovery  |
|                           |
|  Piece 1 <--- Machine A   |  (4070 Ti — has full model)
|  Piece 2 <--- Machine B   |  (RTX 2070 — has full model)
|  Piece 3 <--- Machine C   |  (M2 Metal — has pieces 1-6)
|  Piece 4 <--- Machine A   |  (parallel, rarest-first)
|  ...                      |
+---------------------------+
        |
        v
SHA256-verified • ready to serve in minutes, not hours
```
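The per-piece verification idea, sketched with Python's hashlib. Piece size and manifest layout here are illustrative assumptions, not Tightwad's actual wire format:

```python
# Sketch: hash each received chunk against a manifest before accepting it,
# so a corrupt piece gets rejected and re-requested from another peer.
import hashlib

PIECE_SIZE = 4 * 1024 * 1024  # 4 MiB pieces (assumed)

def split_pieces(blob, size=PIECE_SIZE):
    return [blob[i:i + size] for i in range(0, len(blob), size)]

def make_manifest(blob):
    return [hashlib.sha256(p).hexdigest() for p in split_pieces(blob)]

def verify_piece(index, data, manifest):
    return hashlib.sha256(data).hexdigest() == manifest[index]

model = b"\x00" * (9 * 1024 * 1024)   # stand-in for a GGUF file
manifest = make_manifest(model)        # 3 pieces: 4 + 4 + 1 MiB
pieces = split_pieces(model)
print(all(verify_piece(i, p, manifest) for i, p in enumerate(pieces)))  # True
print(verify_piece(0, b"garbage", manifest))                            # False
```

Because each piece verifies independently, pieces can arrive from different machines in any order — that's what makes the parallel, rarest-first download safe.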
1. **Draft.** The small model blazes through 32 candidate tokens at ~30+ tok/s. Fast and cheap.
2. **Verify.** The big model evaluates all 32 tokens in a single forward pass. The batch is basically free.
3. **Accept.** Keep every token both models agree on; take the big model's token at the first disagreement.
4. **Stream.** Accepted tokens stream to your app instantly. Repeat until done. Output quality is equivalent to the target model alone.
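The draft–verify–accept loop above, in miniature. Toy stand-in functions replace real models — the point is the greedy accept/reject rule, not the models:

```python
# Toy sketch of the draft -> verify -> accept -> stream loop,
# greedy decoding throughout.
def draft_model(ctx):   # small and fast, but sometimes wrong
    return ctx[-1] + 1 if ctx[-1] != 3 else 99

def target_model(ctx):  # big and authoritative
    return ctx[-1] + 1

def speculate_round(ctx, n_draft=4):
    # 1. Draft: the small model proposes n_draft tokens autoregressively.
    drafts, c = [], list(ctx)
    for _ in range(n_draft):
        t = draft_model(c)
        drafts.append(t)
        c.append(t)
    # 2. Verify: the target scores every drafted position
    #    (one batched forward pass in the real system).
    out = list(ctx)
    for t in drafts:
        expected = target_model(out)
        if t == expected:
            out.append(t)         # 3. agree -> keep the free draft token
        else:
            out.append(expected)  #    disagree -> target's token wins, stop
            break
    return out[len(ctx):]         # 4. these tokens stream to the client

# Identical to running the target alone under greedy decoding:
print(speculate_round([0]))  # -> [1, 2, 3, 4]
```

The target's token always replaces the first disagreement, which is why the output can never be worse than running the big model by itself.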
Real hardware. Real numbers. No cherry-picking. Logprobs-based batch verification is live — these acceptance rates translate directly to wall-clock speedup.
| Mode | Tokens | Time | Speed |
|---|---|---|---|
| RPC pool direct (autoregressive) | 512 | 231s | 2.2 tok/s |
| RPC pool + speculation | 519 | 127s | 4.1 tok/s |
| ⚡ Speedup (100% acceptance · 33 tokens/round) | | | 1.86× |
The 70B model doesn't fit on any single machine. It's distributed across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal) over WiFi. Without speculation: painfully slow. With speculation: usable.
| Mode | Speed | Notes |
|---|---|---|
| Desktop local only (4070+3060, 32B) | 17.0 tok/s | Best case — fits on one machine |
| 4-GPU RPC pool (autoregressive) | 3.0 tok/s | Each token = full RPC round-trip |
| RPC pool + speculation | 5.4 tok/s | 32 tokens verified per batch, 100% acceptance (greedy decoding) |
| ⚡ Pool speedup | 1.8× | over pool-only (3.0 → 5.4 tok/s) |
RPC pooling alone is slow over WiFi (one network round-trip per token). Speculation amortizes that — 32 tokens per round-trip instead of 1. Don't pool when the model fits locally (17 tok/s local vs 5.4 tok/s pooled).
| Prompt Type | Acceptance Rate | Rounds | Verdict |
|---|---|---|---|
| Reasoning | | 32 | Math is deterministic. Love it. |
| Code | | 34 | Syntax is law. Both models agree. |
| Factual | | 18 | Strong agreement on facts. |
| List | | 40 | Phrasing varies. Still worthwhile. |
| Creative | | 6 | Many valid outputs. Expected. |
| ⚡ Average | 64% | 26 | 64% of tokens = free. |
| Prompt Type | Acceptance Rate | Notes |
|---|---|---|
| Reasoning | | Highest — deterministic math |
| ⚡ Average (normalized) | ~80% | Key result: 4 in 5 tokens local. |
With whitespace normalization, a consumer GPU running an 8B model drafts 4 out of every 5 tokens for a 397B model. That means up to 80% fewer output tokens billed to the cloud API for the same quality output. The bigger the gap between draft and target quality, the more you save.
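The arithmetic behind that claim, assuming billing scales with the output tokens the target actually has to generate (provider pricing for verification calls varies — treat this as the best case):

```python
# Hedged cost sketch: tokens the draft gets accepted for are generated
# locally; only the remainder count against the paid target's output.
def billed_output_tokens(total_tokens, acceptance_pct):
    accepted_locally = total_tokens * acceptance_pct // 100
    return total_tokens - accepted_locally

# 1M output tokens/month at 80% same-family acceptance:
print(billed_output_tokens(1_000_000, 80))  # -> 200000
```

Same query volume, one fifth of the billed output — that's where "up to 80% fewer" comes from.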
| Prompt | Baseline | Speculative | Speedup |
|---|---|---|---|
| Capital of France | 1.17s | 0.90s | 1.30x |
| Thermodynamics | 12.73s | 9.09s | 1.40x |
| Prime checker | 12.76s | 10.15s | 1.28x |
| Average speed | 13.24s | 10.95s | 1.21x |
| TCP vs UDP | 5.58s | 4.88s | 1.14x |
| Total | 45.43s | 35.96s | 1.27x |
Set `max_draft_tokens: auto` and Tightwad finds the sweet spot for you. Or pin it at `32` for manual control.
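What `auto` might look like under the hood — a hypothetical grow/shrink heuristic for illustration, not Tightwad's actual tuner:

```python
# Assumed heuristic: widen the draft window while acceptance stays high,
# shrink it when most drafts are being thrown away.
def tune_draft_window(current, acceptance_rate, lo=4, hi=64):
    if acceptance_rate > 0.9:        # almost everything accepted: draft more
        current = min(hi, current * 2)
    elif acceptance_rate < 0.5:      # most drafts rejected: draft less
        current = max(lo, current // 2)
    return current

window = 8
for rate in (0.95, 0.95, 0.4):       # two good rounds, then a bad one
    window = tune_draft_window(window, rate)
print(window)  # -> 16
```

The idea either way: wasted drafts cost draft-model time, so the window should track how often the target actually agrees.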
* Estimated savings assume the stated acceptance rate is sustained. The default estimate uses the 60% acceptance rate typical of local GPU setups. Rates vary by model pair and prompt type (58-64% local, up to 80% same-family API). Savings do not account for local electricity, hardware costs, or maintenance. Your results will vary.
Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.
You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.
You're still paying OpenAI/Anthropic for some tasks. Fine. But why let them do the easy parts? Draft locally, verify via API. 58% fewer API calls. Same answers.
You want 70B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.
You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm run the same model together. Finally.
Tightwad works however your hardware is set up. Consumer GPU, no GPU, cloud API — there's a config for you.
Draft on any GPU you have, verify on a bigger one. Mix any generations, any vendors. RTX 4070 + GTX 770 + RX 7900 XTX — all in one cluster.
Even a GTX 1060 can draft for GPT-4. Any GPU you have — old, cheap, low VRAM — reduces your API bill. Up to 80% fewer output tokens billed to the cloud API.
Run a tiny draft model on any CPU. Verify on a remote GPU server. Your CPU-only server, your laptop, your NAS — all can contribute to the cluster.
Data centers often run at 10–30% average utilization (industry estimates). Idle CPUs, stranded servers, that old Xeon doing nothing — put them to work drafting tokens. No GPU required, ever.
Legacy GPU revival: that GTX 770 from 2013 can run Qwen3-1.7B as a drafter for a 70B target — turning e-waste into productive infrastructure.
| Config | Draft | Target | Use Case | Acceptance |
|---|---|---|---|---|
| GPU → GPU | Any GPU — old, new, NVIDIA, AMD | Any bigger GPU | Homelab, mixed hardware | ~64% |
| GPU → API | Any GPU (even GTX 1060) | Cloud API | Slash API bills | ~80% |
| CPU → GPU | Any CPU, no GPU needed | GPU server | Zero-GPU participants | ~68% |
| CPU → API | Literally any computer | Cloud API | Data centers, enterprise | ~68% |
Four machines. One 70B model. Start with two, add machines anytime. The cluster grows.
```
# Machine A — Desktop (4070 Ti + 3060, 28GB)
$ rpc-server -p 50052

# Machine B — Old Gaming PC (RTX 2070, 8GB)
$ rpc-server -p 50052

# Machine C — MacBook Air M2 (Metal, 16GB)
$ rpc-server -p 50052
```
```
# Machine D — MacBook Air M4 (runs draft + proxy)
$ ollama run llama3.1:8b

# Confirm:
$ ollama ps
✓ llama3.1:8b running
```
```
$ git clone https://github.com/akivasolutions/tightwad.git
$ cd tightwad
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad
```
```yaml
# configs/cluster.yaml
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto   # auto-tunes based on acceptance rate
  mode: combined           # speculation over pooled GPUs

draft:
  url: http://localhost:11434   # Machine D (M4, local draft)
  model_name: llama3.1:8b
  backend: ollama

coordinator:
  host: 0.0.0.0
  port: 8080
  model: Llama-3.3-70B-Q4_K_M.gguf
  workers:
    - host: 192.168.1.10   # Machine A (4070 Ti + 3060)
      rpc_port: 50052
    - host: 192.168.1.20   # Machine B (RTX 2070)
      rpc_port: 50052
    - host: 192.168.1.30   # Machine C (M2 Metal)
      rpc_port: 50052
```
Find your IPs: `ip addr` on Linux, `ipconfig` on Windows, `ifconfig` on macOS. Add more workers anytime — the cluster grows.
```
$ tightwad proxy start --combined
✓ Draft model healthy (llama3.1:8b @ localhost:11434) — Machine D
✓ Pool: 3 workers online (52GB VRAM total) — A + B + C
✓ Target: Llama-3.3-70B distributed across pool
✓ Proxy listening on http://localhost:8088
```
```
$ curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'

# Check acceptance rate
$ tightwad proxy status
→ Acceptance rate: ~58% | Rounds: N | Tokens saved: N
```
In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:
http://192.168.1.10:11434
↓
http://192.168.1.10:8088
That's it. Four machines, one endpoint. Same app. Same model name. Same output quality. Machines A, B, and C pool a 70B model that fits on no single machine. Machine D drafts and proxies. You just see 4.1 tok/s instead of 2.2.
Same-family models (Llama 3.1 8B → Llama 3.3 70B) with greedy decoding:
| Metric | Result |
|---|---|
| ⚡ Acceptance rate | 100% |
| 🚀 Speedup | 1.86× |
| 💬 Tokens per round | 33 |
| ⏱️ Speed (pool only) | 2.2 tok/s |
| ⏱️ Speed (pool + speculation) | 4.1 tok/s |
No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.
```
$ git clone https://github.com/akivasolutions/tightwad.git
$ cd tightwad
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad
```
```yaml
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto           # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true

draft:
  url: http://192.168.1.50:11434   # Your cheap GPU (Ollama)
  model_name: qwen3:8b
  backend: ollama

target:
  url: http://192.168.1.100:11434  # Your big GPU (Ollama)
  model_name: qwen3:32b
  backend: ollama
```
```
$ tightwad proxy start
✓ Draft model healthy
✓ Target model healthy
✓ Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check acceptance rate stats
$ tightwad proxy status
→ Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891
```
```
# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052   # GPU 0
```
```yaml
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip   # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24
  workers:
    - host: 192.168.1.100   # NVIDIA box
      gpus:
        - name: "RTX 4070 Ti Super"
          vram_gb: 16
      rpc_port: 50052

models:
  llama-3.3-70b:
    path: /models/Llama-3.3-70B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
```
```
$ tightwad start
✓ Coordinator started
✓ Worker @ 192.168.1.100:50052 online
✓ Model llama-3.3-70b loaded across 52 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark
```
Built for terminal people who hate bloat as much as they hate cloud bills.
Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.
`tightwad swap model-name` — swap the model while workers keep running. Zero downtime.
Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering.
`tightwad start`, `tightwad proxy start`, `tightwad status`. Simple commands for complex infrastructure.
One file describes your entire hardware topology. Version control it. Share it. Ship it.
`tightwad bench` — proxy vs direct target comparison. See your exact speedup, tok/s, and per-prompt breakdown.
Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.
Draft server down? `fallback_on_draft_failure: true` routes straight to target. Never breaks.
NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.
Auto-detects model architecture families. Warns you before a mismatched draft/target pair wastes hours at 1.6% acceptance.
Detects Llama 3, Mistral, Gemma, Phi, and more. No more hardcoded Qwen3 template breaking other model families.
`max_draft_tokens: auto` — adjusts at runtime based on acceptance rate. Zero-config optimization.
Multiple drafters vote on tokens. When they agree, skip the target entirely. Three modes: strict, majority, any-disagree.
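How a vote could resolve, sketched in a few lines. The mode names come from above; their exact semantics here are assumptions based on those names:

```python
# Sketch of consensus drafting: several small drafters each propose a next
# token; if they agree (per the mode), the target model is skipped.
from collections import Counter

def consensus(drafts, mode="majority"):
    """drafts: one proposed token per drafter.
    Returns the agreed token, or None => send to the target model."""
    counts = Counter(drafts)
    token, votes = counts.most_common(1)[0]
    if mode == "strict":    # unanimous agreement required
        return token if votes == len(drafts) else None
    if mode == "majority":  # more than half is enough
        return token if votes > len(drafts) // 2 else None
    raise ValueError(f"unknown mode: {mode}")

print(consensus(["the", "the", "a"], mode="majority"))  # -> the
print(consensus(["the", "the", "a"], mode="strict"))    # -> None
```

Stricter modes skip the target less often but keep acceptance high; looser modes save more target calls at the cost of more corrections.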
Cross-platform cluster management without SSH. REST API on every node for version checks, GPU info, and remote control.
Version enforcement, MoE VRAM warnings, SSRF protection, bearer token auth. Production-ready out of the box.
Fair question. Here's the honest answer. The other tools are good. Tightwad is for a different problem.
Comparison accurate as of March 2026. These tools evolve quickly — check their docs for the latest capabilities.
Excellent production inference engine. CUDA-only. Built for ML teams.
The reason most people have local models. One model, one machine, beautifully simple.
The low-level primitive Tightwad is built on. Powerful. Requires a lot of scripting.
Production inference for the HuggingFace ecosystem. Great if you're already there.