Spec — llama.cpp Small-Model Tool-Calling Benchmark (Qwen × Gemma × Phi, Standard vs TurboQuant)
Repo: deemwar-products/llama-cpu-benchmarksStatus: Draft v1 · awaiting Muthu's approval Date: 2026-05-20 Driver: Muthukumaran Navaneethakrishnan Target host: a single shared CPU box (identity intentionally omitted in public docs)
1. Goal
Benchmark the three best open-weight small (~4B) edge models from Qwen, Google, and Microsoft on a single production-class CPU box, head-to-head on tool-calling accuracy, throughput (tokens/sec), and memory footprint — with and without TurboQuant (Google DeepMind, ICLR 2026) KV-cache compression. Produce a decision-grade table that says: which model + quant config we ship for edge / local-LLM workloads on commodity x86 CPU hardware.
2. Non-goals
- No GPU benchmarking. Target host has no NVIDIA GPU (Intel UHD P630 iGPU only).
- No multimodal / vision testing. Text + tool-calling only.
- No fine-tuning. Stock instruct checkpoints only.
- No model >9B. Edge-size focus.
- Not a llama.cpp upstream contribution. Internal benchmark.
3. Target Hardware (probed 2026-05-20)
| Resource | Value |
|---|---|
| CPU | Intel Xeon E-2176G — 6c/12t @ 3.7 GHz, AVX2 yes, AVX-512 no |
| RAM | 62 GB total, ~60 GB available (38 GB reclaimable from buff/cache) |
| Disk | 847 GB total, 766 GB free |
| GPU | Intel UHD P630 iGPU (no CUDA, Vulkan possible but out-of-scope) |
| OS | Ubuntu 22.04.5 LTS, kernel 5.15.0-164 |
| Other workloads | host is shared with unrelated production containers — benchmark must not starve them |
Implication: CPU-only inference. Speed bound by AVX2 throughput, not memory. Other workloads share the box — benchmark must be cgroup-isolated.
4. Model Matrix
Three best-in-class ~4B instruct models with native tool-calling, May 2026:
| Vendor | Model | Params | Context | Tool-calling | License | GGUF source |
|---|---|---|---|---|---|---|
| Alibaba | Qwen3.5-4B-Instruct | 4B | 128K | Native (Qwen tool format) | Apache 2.0 | Bartowski / Unsloth |
gemma-4-E4B-it | ~4B effective (MatFormer) | 128K | Native (6 dedicated tool tokens) | Apache 2.0 | unsloth/ggml-org/bartowski | |
| Microsoft | Phi-4-mini-instruct | 3.8B | 128K | Native (JSON schema) | MIT | Bartowski / microsoft |
Why these three:
- Qwen3.5-4B: Qwen series has led BFCL in its weight class for most of 2025-26.
- gemma-4-E4B-it: Released 2026-04-02, purpose-built for edge/mobile with dedicated tool-call special tokens (
<|tool>,<|tool_call>,<|tool_result>). TheE4Bvariant uses Google's MatFormer architecture — ~4B "effective" parameters at runtime. - Phi-4-mini: Microsoft's flagship small tool-caller — built-in function calling, JSON schema, 200K vocab, 128K context.
5. Quantization Matrix
| Cell ID | Weight quant | KV cache | Note |
|---|---|---|---|
std | Q4_K_M imatrix (Bartowski) | FP16 (default) | Baseline; what most users run today |
tbq3 | Q4_K_M imatrix (Bartowski) | TurboQuant tbq3_0 (3-bit, PR #21089 CPU AVX2) | TurboQuant arm |
Weight quant kept constant at Q4_K_M imatrix across all six runs — the variable under test is the KV-cache compression, not weight precision. This isolates TurboQuant's effect.
Why Q4_K_M imatrix: best quality-per-byte at the 4-bit weight level on CPU; widely published; reproducible via Bartowski's pipeline.
6. Full Run Matrix (6 cells)
Qwen3.5-4B gemma-4-E4B Phi-4-mini
std/Q4_K_M 1 2 3
tq/Q4_K_M 4 5 67. TurboQuant source choice (corrected)
The initial spec listed GPU-targeted community forks (atomicmilkshake, TheTom, MartinCrespoC, PippBauda). All four gate TurboQuant kernels behind GGML_CUDA=ON, so a -DGGML_CUDA=OFF build produces a binary functionally equivalent to upstream llama.cpp with no tbq* cache types registered. Documented in results/build-status.json.
The actual CPU AVX2 path is upstream PR ggml-org/llama.cpp#21089 by elusznik. It adds CPU-only cache types:
tbq3_0— 3.0625 bits/elem, ~5.19× compression vs FP16tbq4_0— 4.0625 bits/elem, ~3.94× compression vs FP16
Includes generic-C fallback + AVX2 kernel. ARM NEON added later by a community contributor. Exposed via --cache-type-k tbq3_0 --cache-type-v tbq3_0 — flag name uses tbq prefix, not the turbo prefix used by the GPU forks.
As of May 2026 the PR is open, not merged. Build from the PR branch directly. Merge tracking in discussion #20969.
8. llama.cpp Build & Run Configuration
8.1 Baseline llama.cpp (for std cells)
git clone https://github.com/ggml-org/llama.cpp.git llama.cpp-std
cd llama.cpp-std
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX2=ON -DGGML_LLAMAFILE=ON
cmake --build build --config Release -j$(nproc)8.2 TurboQuant llama.cpp (for tq cells)
git clone https://github.com/atomicmilkshake/llama-cpp-turboquant.git llama.cpp-tq
cd llama.cpp-tq
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX2=ON -DGGML_TURBOQUANT=ON
cmake --build build --config Release -j$(nproc)Falls back to forks #2/#3/#4 in order if #1 fails to build or run on AVX2-only CPU.
8.3 Host-sharing constraints (mandatory)
All benchmark runs are wrapped in Docker with strict resource caps to avoid disturbing other workloads:
docker run --rm \
--cpus=4 --cpuset-cpus=8-11 \
--memory=12g --memory-swap=12g \
--name llamabench-${cell_id} \
-v $(pwd)/models:/models:ro \
-v $(pwd)/results:/results \
llamabench:${variant} \
${cmd}- Pinned to cores 8-11 (4 cores). Cores 0-7 left to the rest of the system.
- 12 GB memory cap (~6 GB model + 6 GB headroom).
- Off-peak window: runs scheduled in low-traffic windows. Per-run runtime ≤30 min.
- Kill switch: if host
load_avg(1m)> 8.0, abort the run.
8.4 Run flags (per cell)
# Standard cell (std)
llama-server --model /models/${model}-Q4_K_M.gguf \
--threads 4 --ctx-size 8192 \
--jinja --port 11434
# TurboQuant cell (tq)
llama-server --model /models/${model}-Q4_K_M.gguf \
--threads 4 --ctx-size 8192 \
--jinja --port 11434 \
--cache-type-k tbq3_0 --cache-type-v tbq3_0(Exact TurboQuant flag spelling confirmed against chosen fork's docs in Phase 0.)
9. Two-Phase Execution
Phase 0 — Feasibility (1-2 hours, blocking gate)
- Clone & build candidate TurboQuant fork (#1) inside Docker on the benchmark host.
- Run
llama-clismoke test with Gemma-4-2B-Q4_K_M +--cache-type-k turbo3 --cache-type-v turbo3and a one-tool prompt. - Gate criteria:
- (a) Build succeeds on AVX2-only x86, no GPU/Metal required.
- (b) Smoke test completes one tool-call turn end-to-end.
- (c) Memory + CPU stay inside the cgroup caps.
- If gate fails → fall back through fork ranks 2 → 3 → 4. If all four fail, Phase 1 reduces to 3 cells (std only) and the TurboQuant arm is documented as "not feasible on commodity CPU as of 2026-05-20."
Phase 1 — Full sweep (≈ 1 day of off-peak runs)
For each of the 6 cells:
llama-bench→ raw prompt eval tok/s + gen eval tok/s + peak RSS.- Tool-calling harness (§10) → BFCL-subset accuracy %.
- Latency probe → 100 × {256-in / 128-out} turns, record p50 / p95 wall-clock.
Output → results/${cell_id}.json + aggregated results/summary.md table.
10. Tool-Calling Test Harness
10.1 Test set
Subset of Berkeley Function Calling Leaderboard (BFCL) v3, three categories:
| Category | N cases | What it tests |
|---|---|---|
simple | 50 | Single function, single arg set |
parallel | 25 | Multiple functions in one turn |
multiple_function | 25 | Pick the right function from N candidates |
Total: 100 cases per model × quant cell = 600 evaluations.
10.2 Driver
Python harness (harness/run_bfcl.py) — talks to llama-server over its OpenAI-compatible /v1/chat/completions endpoint with tools=[...]. Server is invoked with --jinja so the model's native chat template handles tool-call formatting.
10.3 Scoring
| Metric | Definition |
|---|---|
format_pass_rate | % of cases where output is a valid JSON tool call (parseable) |
function_accuracy | % where the correct function was selected (AST match) |
argument_accuracy | % where all args match expected (AST match) |
overall_pass | strict: format ∧ function ∧ argument all pass |
11. Metrics & Output Schema
Per cell, written to results/${cell_id}.json:
{
"cell_id": "qwen3.5-4b_tbq3",
"model": "Qwen3.5-4B-Instruct",
"weight_quant": "Q4_K_M",
"kv_quant": "tbq3_0",
"llamacpp_variant": "atomicmilkshake/llama-cpp-turboquant@<sha>",
"host": "shared-cpu-host",
"throughput": {
"prompt_eval_tps": 0.0,
"gen_eval_tps": 0.0
},
"memory": {
"peak_rss_mb": 0,
"kv_cache_rss_mb": 0
},
"latency_ms": {
"p50": 0,
"p95": 0
},
"tool_calling": {
"format_pass_rate": 0.0,
"function_accuracy": 0.0,
"argument_accuracy": 0.0,
"overall_pass": 0.0,
"n_cases": 100
},
"started_at": "ISO8601",
"duration_sec": 0
}Aggregated into results/summary.md as a Markdown table for human review.
12. Success Criteria
Per-model gate (any one model passes "ship for edge"):
| Metric | Threshold | Rationale |
|---|---|---|
gen_eval_tps | ≥ 10 tok/s | usable for interactive tool-use on edge |
tool_calling.overall_pass | ≥ 70% | matches BFCL "competent" tier for ~4B class |
tool_calling.format_pass_rate | ≥ 95% | reliable JSON emission is table-stakes |
memory.peak_rss_mb | ≤ 6000 | leaves headroom inside 12 GB cgroup |
⚠️ OPEN: Muthu to confirm or override these thresholds. Defaults set conservatively against what's reported in BFCL v3 for 4B-class models. Tighten or relax per your edge product requirements.
TurboQuant arm success (separate gate):
| Metric | TurboQuant must achieve | vs. std baseline |
|---|---|---|
kv_cache_rss_mb | ≥ 3× reduction | confirms compression works |
tool_calling.overall_pass | within 2 pp of std | confirms no quality regression |
gen_eval_tps | within ±10% of std | confirms no major CPU-path slowdown |
If TurboQuant fails the "no regression" gate on this hardware, the recommendation is to ship std quants — the experiment still publishable as a negative result.
13. Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| TurboQuant fork doesn't build on AVX2-only CPU | High | Phase-1 reduced to 3 cells | Phase 0 gate; try all 4 forks |
| Benchmark disturbs prod tenants | Medium | Prod incident | cgroup cap; off-peak; load_avg kill switch |
| Gemma-4 / Qwen3.5 / Phi-4 chat-template parser bug in llama.cpp | Medium | Tool-call format failures inflated | Pin to llama.cpp ≥ 2026-05-01 build with Qwen3/Gemma4 fixes; smoke-test each chat template before BFCL run |
| Bartowski-Gemma-4 imatrix GGUF not yet published | Low | Have to generate our own imatrix | Fallback to vanilla Q4_K_M; flag in results |
| BFCL v3 subset doesn't generalize to our actual edge workloads | Low | Wrong winner chosen | Document delta; if Muthu has internal tool-call traces, add as Cat-4 |
14. Deliverables
results/summary.md— markdown comparison table, all six cells.results/${cell_id}.json× 6 — raw per-cell data.results/decision.md— one-page recommendation: which model + quant config to ship.harness/run_bfcl.py— reusable test harness (so we can re-benchmark when Phi-5 / Qwen3.6 / Gemma-5 drop).- Updated
CLAUDE.mdreflecting the chosen winner (after spec approval).
15. Open Questions (Muthu must answer before Phase 0 starts)
- Success-bar numbers — accept defaults in §12 (10 tok/s, 70% overall_pass, 95% format, 6 GB RSS), or override?
- Fork preference — start Phase 0 with
atomicmilkshake/llama-cpp-turboquant(default), or another? - Off-peak window — confirm an off-peak window safe for co-tenant workloads.
- BFCL subset adequacy — do you have internal tool-calling traces from
reqsume/video-aiwe should add as a Cat-4 evaluation set? - Phi-4-mini vs Phi-4 multimodal — confirm we test the text-only
Phi-4-mini-instruct, not the multimodal variant (multimodal adds vision tokens which skew tool-call benchmarks).