TurboQuant — "8× faster"
The headline is a synthetic GPU-kernel number. In real end-to-end CPU runs it was 2.2× slower and cut Qwen accuracy by 17 pp. Memory savings real; speed wins conditional.
94% tool-calling accuracy. 6.2 s p50. Single Xeon, no GPU. Stock llama.cpp + Gemma-4-E4B-it beat every clever trick I threw at it — TurboQuant, speculative decoding, ik_llama.cpp, the lot.
Three ~4B open-weight tool-calling models (Qwen 3.5 4B, Google Gemma-4-E4B-it, Microsoft Phi-4-mini), four CPU speedup techniques (TurboQuant KV quantization, speculative decoding, ik_llama.cpp, and OpenVINO/vLLM as outside references), one shared Xeon E-2176G box, 35 BFCL tool-calling cases per cell, full cgroup isolation, and sanitized public artifacts. Eleven cells of measured pain. The TL;DR is in the hero. The story is in the article. The data is on /results.
All MIT, all reproducible from a public repo.
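With 35 cases per cell, every accuracy figure moves in coarse steps, which is worth keeping in mind when reading the numbers above. The sketch below is a back-of-the-envelope check, not code from the repo: the helper names are hypothetical, and it assumes each headline percentage comes from a single 35-case cell.

```python
CASES_PER_CELL = 35  # from the text: 35 BFCL tool-calling cases per cell


def accuracy_pct(passed: int, total: int = CASES_PER_CELL) -> float:
    """Accuracy as a percentage of cases passed (hypothetical helper)."""
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return 100.0 * passed / total


def cases_for_delta(delta_pp: float, total: int = CASES_PER_CELL) -> float:
    """Number of cases that a percentage-point change corresponds to."""
    return delta_pp * total / 100.0


# One case is the smallest possible change in a 35-case cell.
print(f"one case = {accuracy_pct(1):.2f} pp")              # one case = 2.86 pp
# The only pass count out of 35 that rounds to 94%.
print(f"33/35    = {accuracy_pct(33):.1f}%")               # 33/35    = 94.3%
# A 17 pp drop, expressed in cases (assumption: same 35-case cell).
print(f"17 pp    ~ {cases_for_delta(17):.2f} cases")       # 17 pp    ~ 5.95 cases
```

Under that assumption, 94% is consistent with 33 of 35 cases passing, and a 17 pp drop is roughly six cases. Differences of one or two cases between cells are within a single step or two of resolution, so small gaps deserve caution.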