Gemma quietly won. I tried 4 LLM speedups on CPU. 3 made it slower.

94% tool-calling accuracy. 6.2 s p50. Single Xeon, no GPU. Stock llama.cpp + Gemma-4-E4B-it beat every clever trick I threw at it — TurboQuant, speculative decoding, ik_llama.cpp, the lot.

In one paragraph

Three ~4B open-weight tool-calling models (Qwen 3.5 4B, Google Gemma-4-E4B-it, Microsoft Phi-4-mini), four CPU speedup techniques (TurboQuant KV quantization, speculative decoding, ik_llama.cpp, OpenVINO/vLLM as outside references), one shared Xeon E-2176G box, 35 BFCL tool-calling cases per cell, full cgroup isolation, sanitized public artifacts. Eleven cells of measured pain. The TL;DR is in the hero. The story is in the article. The data is on /results.
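For readers who want to see how a per-cell headline pair like "94% accuracy, 6.2 s p50" could be derived from 35 cases, here is a minimal sketch. It is not the repo's actual harness: the function name `summarize_cell`, the `(passed, latency_s)` tuple layout, and the sample numbers are all hypothetical and exist only to illustrate the arithmetic.

```python
from statistics import median


def summarize_cell(results):
    """Summarize one benchmark cell.

    `results` is a list of (passed, latency_s) tuples, one per test case.
    This layout is an assumption for illustration, not the repo's format.
    Returns (accuracy as a fraction, median latency in seconds).
    """
    if not results:
        raise ValueError("a cell needs at least one case")
    passed = sum(1 for ok, _ in results if ok)
    accuracy = passed / len(results)
    # p50 is the median of per-case latencies.
    p50 = median([latency for _, latency in results])
    return accuracy, p50


# Hypothetical data: 35 cases, 33 passing. Not measured results.
sample = [(True, 6.0)] * 33 + [(False, 7.0)] * 2
accuracy, p50 = summarize_cell(sample)
print(f"accuracy={accuracy:.0%} p50={p50:.1f}s")
```

With only 35 cases per cell, each case moves accuracy by roughly 2.9 percentage points, so small gaps between cells are best read with that granularity in mind.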

Where to go

All MIT, all reproducible from a public repo.

Benchmarks run on a single shared CPU host · Xeon E-2176G · CPU-only