I tried 4 LLM speedup techniques on CPU. Three made it slower.

TurboQuant — "8× faster"

The headline is a synthetic GPU-kernel number. In a real end-to-end CPU run it was 2.2× slower and cut Qwen accuracy by 17 pp. The memory savings are real; the speed wins are conditional.
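A back-of-envelope estimate shows why the memory claim survives even when the speed claim does not. The sketch below uses the standard KV-cache size formula (one K and one V tensor per layer per position). The model dimensions are hypothetical placeholders, not the configs of the benchmarked models, and the 4-bit figure is a stand-in that ignores per-block scale overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    """Approximate KV-cache size: K and V tensors for every layer and position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bits_per_elem / 8

# Hypothetical dimensions, for illustration only (not from any benchmarked model).
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)

fp16 = kv_cache_bytes(**dims, bits_per_elem=16)
q4 = kv_cache_bytes(**dims, bits_per_elem=4)  # assumed 4-bit cache, no scale overhead

print(f"FP16 KV:  {fp16 / 2**20:.0f} MiB")
print(f"4-bit KV: {q4 / 2**20:.0f} MiB")
print(f"ratio:    {fp16 / q4:.1f}x")
```

The size reduction is pure arithmetic and holds on any hardware. Whether the extra dequantization work on every attention step pays for itself is a separate question, and on this CPU it did not.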

Speculative decoding

The published 2.5–3× CPU speedup is real, but only on 7B+ targets. At 4B on a 4-core cgroup it ran 1.48× SLOWER: draft-plus-verify orchestration costs more than the accepted drafts save.
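The break-even point can be sketched with the usual idealized speculative-decoding model: with per-token acceptance rate `alpha`, `gamma` drafted tokens per step, and a draft model that costs `cost_ratio` of one target forward pass, each step yields `(1 - alpha**(gamma + 1)) / (1 - alpha)` tokens for `gamma * cost_ratio + 1` units of work. The inputs below are illustrative, not values measured in this bake-off, and the model assumes that verifying a batch costs the same as one target pass, which is optimistic on a compute-bound CPU.

```python
def specdec_speedup(alpha, gamma, cost_ratio):
    """Idealized speedup of speculative decoding over plain decoding.

    alpha:      probability a drafted token is accepted (0 <= alpha < 1)
    gamma:      number of tokens drafted per step
    cost_ratio: draft forward-pass cost relative to one target forward pass
    """
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    step_cost = gamma * cost_ratio + 1  # gamma draft passes + one verify pass
    return expected_tokens / step_cost

# Illustrative inputs only: a tiny draft next to a large target...
print(f"cheap draft:  {specdec_speedup(0.7, 4, 0.05):.2f}x")
# ...versus a draft that is half the cost of a small target.
print(f"costly draft: {specdec_speedup(0.7, 4, 0.50):.2f}x")
```

When the target is already small, the draft is proportionally expensive, so `cost_ratio` rises and the ratio falls below 1 even before any orchestration overhead is counted.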

ik_llama.cpp

1.53× faster end-to-end on Qwen via IQK matmul kernels. But parallel tool calls collapse 80% → 0% and Phi-4 will not even load. Hard veto for production.

Gemma-4-E4B-it quietly won

94.3% overall. 100% on multi-function AND parallel calls. 6.2 s p50. Beat Qwen 3.5 4B and Phi-4-mini-instruct outright. This is the ship recommendation.

Phi-4-mini ships broken

Drop-in `--jinja` tool calling passes 0.0%: llama.cpp falls back to a prose parser. A 30-character system prompt rescues 74%, but you lose parallel calls anyway.

What to actually ship

Stock `ghcr.io/ggml-org/llama.cpp:full` + Gemma-4-E4B-it Q4_K_M + FP16 KV + `--jinja --reasoning off --reasoning-budget 0`. No fork. No quantized KV. No draft model.
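As a rough sketch, the recommendation above can be assembled into a launch command like this. The image tag and the `--jinja --reasoning off --reasoning-budget 0` flags come from the recommendation itself; the model filename, mount path, port, and thread count are assumptions for illustration, and the `--server` selector and `--cache-type-k/-v` spellings should be checked against the llama.cpp build you actually pull.

```python
import shlex

IMAGE = "ghcr.io/ggml-org/llama.cpp:full"      # from the recommendation above
MODEL = "/models/gemma-4-E4B-it-Q4_K_M.gguf"   # hypothetical filename

def build_command(model_dir, threads=4, port=8080):
    """Return the docker argv for a stock llama.cpp server with FP16 KV cache."""
    docker = [
        "docker", "run", "--rm",
        "--cpus", str(threads),                # assumed: cap the container at 4 cores
        "-p", f"{port}:{port}",
        "-v", f"{model_dir}:/models",
        IMAGE,
    ]
    server = [
        "--server",                            # tool selector of the :full image
        "-m", MODEL,
        "--host", "0.0.0.0", "--port", str(port),
        "--threads", str(threads),
        "--cache-type-k", "f16",               # FP16 KV cache, no quantized KV
        "--cache-type-v", "f16",
        "--jinja", "--reasoning", "off", "--reasoning-budget", "0",
    ]
    return docker + server                     # note: no draft-model flags at all

print(shlex.join(build_command("/srv/models")))
```

Building the argv as a list keeps the "no fork, no quantized KV, no draft model" constraints easy to audit: anything not in the list is not being passed.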

In one paragraph

Three ~4B open-weight tool-calling models (Qwen 3.5 4B, Google Gemma-4-E4B-it, Microsoft Phi-4-mini), four CPU speedup techniques (TurboQuant KV quantization, speculative decoding, ik_llama.cpp, OpenVINO/vLLM as outside references), one shared Xeon E-2176G box, 35 BFCL tool-calling cases per cell, full cgroup isolation, sanitized public artifacts. Eleven cells of measured pain. The TL;DR is in the hero. The story is in the article. The data is on /results.
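With 35 cases per cell, the percentages move in coarse steps, which is worth keeping in mind when reading the headline numbers. The small sketch below only does the arithmetic; the raw pass counts are my inference from the percentages, not figures stated on this page.

```python
CASES_PER_CELL = 35  # from the setup described above

def pass_rate(passed, total=CASES_PER_CELL):
    """Pass rate in percent for a given number of passing cases."""
    return 100 * passed / total

# One case flipping moves a cell's overall score by this many percentage points.
print(f"one case = {pass_rate(1):.2f} pp")

# 94.3% overall is consistent with 33 of 35 cases passing (inferred, not stated).
print(f"33/35 = {pass_rate(33):.1f}%")
```

At this granularity a 17 pp drop corresponds to roughly six cases, so small differences between cells should be read as a handful of cases rather than a smooth trend.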

Where to go

The article — narrative version, ~6 min, the one you share.

Engine bake-off — deeper dive: stock vs specdec vs ik_llama.cpp, with numbers.

Results table — all 11 cells, sortable, no prose.

HTTP API — grab the JSON directly.

Spec: TurboQuant bake-off · Spec: Engine bake-off — methodology and why each measurement was chosen.

All MIT, all reproducible from a public repo.
