Skip to content

I tried four LLM inference techniques on a CPU box. Three made it slower.

A week of benchmarks. 11 cells. Two days fixing other people's bugs. One winner — and it's the boring default.

Published 2026-05-20 · raw data at /api · all 11 cells on /results


The setup, in one paragraph

A single Xeon E-2176G — 6c/12t, AVX2, no AVX-512, no GPU, 62 GB RAM, shared with other unrelated workloads. The kind of commodity Hetzner-class box you actually run a side project on. I picked the three best ~4B open-weight tool-callers as of May 2026 (Qwen 3.5 4B, Google gemma-4-E4B-it, Microsoft Phi-4-mini), all at Q4_K_M imatrix weights, and put them through 35 BFCL-style tool-calling cases on top of stock llama.cpp:full. That's my baseline. Then I tried to make it go faster — four different ways.

TL;DR

Ship stock llama.cpp + gemma-4-E4B-it at Q4_K_M with FP16 KV. Don't change inference engine. Don't enable TurboQuant. Don't add speculative decoding. The boring default beat everything else I tried on this hardware shape.

The scoreboard

Technique I triedThroughput vs stockWhat broke
TurboQuant tbq3_0 KV (PR #21089)2.2× sloweraccuracy −17 pp on Qwen, parallel calls 0 %
Speculative decoding (0.8B draft → 4B target)1.48× slowersmall accuracy regression
ik_llama.cpp (ikawrakow CPU fork)1.05–1.53× fasterparallel calls 0 % on Qwen + Gemma; Phi-4 doesn't load
OpenVINO backendnot run(real follow-up territory)
vLLM CPU backendnot rundifferent model format, multi-user design

Three losses. One technique-with-caveats that's faster but loses the headline feature (parallel tool calls). The boring default — ghcr.io/ggml-org/llama.cpp:full, FP16 KV, --jinja --reasoning off — is the local optimum at this hardware shape.

Surprise #1: the TurboQuant 8× speedup doesn't survive contact with real hardware

TurboQuant (Zandieh et al., ICLR 2026, from Google Research / DeepMind) is a vector-quantization technique for the KV cache. The paper's headline is 8× faster attention math. There's an active llama.cpp PR (#21089, open) bringing CPU AVX2 support. I had high expectations.

I measured four data points across published runs + my own:

SourceHardwareTurboQuant vs baseline
Google paper headlineH100 GPU, attention-kernel isolation8× faster ← marketing
PR #21089's own table4-thread CPU, Qwen3.5-4B2.1× slower vs q4_0 KV
Discussion #21829 user2× H200 GPU1.18× slower vs FP16 KV
My runXeon E-2176G AVX2 CPU2.2× slower end-to-end per BFCL turn vs FP16 KV

That 8× number is real, but only inside a specific synthetic benchmark: GPU, attention-kernel-bound workload, dequant+matmul math isolated from end-to-end inference. Move any of those variables and TurboQuant is slower. The maintainers themselves admit it in discussion #21829: at high memory bandwidth, the dequant overhead exceeds the memory savings; they recommend it only for "mid-range single GPU setups" at long context.

It's a memory-saving technique partly marketed as a speed technique. The memory savings are real and unconditional. The speed wins are conditional on a specific failure mode (memory-bandwidth-bound) that most workloads don't have. Mine especially: 60 GB free RAM at 4 K context, the KV cache is hundreds of megabytes.

And the accuracy claim — "matches FP16 within rounding distance" — was a PPL number, not a tool-calling number. On BFCL, Qwen dropped from 91.4 % to 74.3 % overall and parallel call accuracy collapsed from 80 % to 0 %.

Surprise #2: speculative decoding is anti-pattern below 7B targets

Speculative decoding is built into llama.cpp (--spec-draft-model or --lookup). The pitch: a small draft model speculates ahead, the target verifies in parallel, net 2.5–3× CPU speedup with zero quality loss.

That's the number on 7B+ targets. At my 4B-class scale on a 4-core cgroup, with a Qwen 0.8B draft against a Qwen 3.5 4B target, I measured 1.48× slower end-to-end. Same accuracy, just slower.

Why: with 4 cores allocated, the cost of running the draft model and verifying its tokens against the target adds up to more than the per-token savings, because the target isn't large enough for its per-token cost to dwarf the orchestration overhead. Speculative decoding is great when target ≫ draft. When target = 5× draft on a small CPU budget, it's noise that costs you.

Surprise #3: Phi-4-mini doesn't tool-call out of the box

I gave Phi-4-mini-instruct the same 35-case BFCL run with stock llama.cpp and --jinja. 0.0 % overall pass. Not one single tool call.

Investigation: llama.cpp's chat-format detector logs Chat format: peg-native when Phi-4 loads — meaning it didn't recognise Phi-4's tool-calling format and fell back to a generic prose parser. With no tool schemas surfaced to the model, Phi responded in English prose to tool-able queries ("you can find weather at weather.com…"). With tool_choice: required it invented Python-like syntax (get_weather_celsius("Tokyo")). Either way: zero parseable tool calls.

The fix: prepend a one-line system prompt that contains the tool schemas in JSON and tells Phi to emit {name, arguments} JSON. Same 35 cases re-run: 74.3 % overall pass. The model knows how to tool-call — the GGUF's chat template just doesn't tell it.

So: if you want Phi-4-mini and you're using llama.cpp's drop-in --jinja tool-calling, expect 0 % until either Microsoft's GGUF chat template is updated or llama.cpp adds a Phi-4 tool-format parser. If you can prepend a tools-in-system-prompt, you recover most of the way — but you've also lost parallel-call support (Phi can't emit a JSON array of calls under that prompt: 0/5 parallel cases pass).

Surprise #4: Gemma-4-E4B-it quietly wins everything

I expected Qwen to win — Qwen has dominated BFCL in its weight class for most of 2025. It did not.

Modelgen tok/sp50 msTool overall
gemma-4-E4B-it8.596,24094.3 %
Qwen 3.5 4B9.7913,73991.4 %
Phi-4-mini (drop-in)10.627,5170.0 %
Phi-4-mini (+ system prompt)n/a7,98374.3 %

Gemma is 13 % slower on raw gen_eval_tps but 2.2× faster end-to-end per turn — because it emits shorter, tighter responses around the tool call. It's the only model that hits 100 % on both multi-function selection and parallel calls in this run. The dedicated tool tokens Google added in Gemma 4 (<|tool>, <|tool_call>, <|tool_result>) and llama.cpp's matching peg-gemma4 chat format parser are doing real work.

If you're shipping a small tool-calling LLM on CPU today, you ship Gemma.

Surprise #5: ik_llama.cpp wins on prompt eval, then breaks the headline feature

ik_llama.cpp is ikawrakow's CPU-tuned fork — he's a core llama.cpp dev, and the fork ships AVX2-specific IQK matmul kernels plus IQK Flash Attention. Published benchmark: ~2× faster on AVX2 Xeon.

On my hardware, on Qwen:

Metricstockik_llamaΔ
prompt tok/s35.8558.081.62× faster
gen tok/s9.798.370.85× (slower)
p50 latency13,739 ms8,977 ms1.53× faster
overall_pass91.4 %82.9 %−8.6 pp
simple95 %100 %+5 pp
parallel80 %0 %−80 pp

The win is in prompt eval, not generation — and tool-calling prompts include the tool schemas, so prompt eval matters a lot. But parallel tool calls collapse entirely on both Qwen and Gemma (gemma same pattern, 100 % → 0 %), and Phi-4-mini doesn't load at all (tied-embeddings tensor layout not supported in ik_llama master). It's a real speedup with hard coverage gaps.

If you don't need parallel tool calls, and you only run Qwen or Gemma, ik_llama is worth a look. For our actual ship target (Gemma + parallel calls intact), it's a hard veto.

What I actually shipped

Same as where I started. Stock llama.cpp:full. Gemma-4-E4B-it at Q4_K_M. FP16 KV. --jinja --reasoning off --reasoning-budget 0. No quantized KV cache. No draft model. No fork. No alternative engine.

The interesting wins all live elsewhere:

  • A larger target model (where speculative decoding actually pays off).
  • A wider/multi-socket CPU with AMX (where vLLM's NUMA story kicks in).
  • A different hardware shape — GPU, or Apple Silicon with the PippBauda/llama.cpp-turboquant-mtp Metal fork.
  • Long contexts (32 K–128 K) where TurboQuant's KV memory savings become real.

For 4 K-context interactive tool-calling on commodity AVX2 — knowing the local optimum is the optimum is, itself, a result. Stock llama.cpp is the answer. Don't outsmart it.

What's on the page

If you find this useful or you have a counter-result on different hardware, please share.

Method (short version)

Each cell: Docker container with cgroup caps (--cpus=4 --cpuset-cpus=8-11 --memory=12g) so the benchmark can't disturb co-tenant workloads. Boot the engine, wait for /health, run llama-bench -p 256 -n 128 -r 2 for throughput, then 35 BFCL-style cases (20 simple + 10 multiple-function with distractors + 5 parallel) via harness/run_bfcl.py for tool-calling accuracy. Strict pass = format ∧ function ∧ argument. Latencies are end-to-end wall-clock from the harness, including TCP round-trip.

Spec, harness source, and per-cell JSON all in the public repo.

Benchmarks run on a single shared CPU host · Xeon E-2176G · CPU-only