Skip to content

Target hardware

All benchmarks ran on a single shared CPU box. The point of the experiment is what these models do on commodity x86 silicon without a GPU — the box's identity isn't load-bearing for any of the numbers, so it's deliberately omitted.

Probed specs (2026-05-20)

ResourceValue
CPUIntel Xeon E-2176G — 6 cores / 12 threads @ 3.7 GHz
ISA featuresAVX2 yes, AVX-512 no
RAM62 GB total (~60 GB available)
Swap31 GB
Disk847 GB, ~765 GB free
GPUIntel UHD P630 iGPU only (no NVIDIA, no CUDA)
OSUbuntu 22.04.5 LTS, kernel 5.15.0-164

Shared with other workloads

The host runs other unrelated production containers in parallel. Every benchmark container is cgroup-pinned so it can't starve them:

bash
docker run --rm \
  --cpus=4 --cpuset-cpus=8-11 \
  --memory=12g --memory-swap=12g \
  ...
  • Cores 8-11 pinned to the benchmark; cores 0-7 reserved for the rest of the system.
  • 12 GB memory cap — leaves ~50 GB for everything else.
  • Runs scheduled in low-traffic windows.
  • An automatic kill triggers if host load_avg(1m) exceeds 8.0.

What this implies for the experiment

  • CPU-only inference. No CUDA path, no Vulkan path tested.
  • Speed-bound by AVX2, not memory. TurboQuant's headline benefit is KV-cache memory reduction — interesting on GPUs with tight VRAM, much less interesting on a box with 60 GB free RAM.
  • AVX-512 absent. Some llama.cpp speedups (specific Q4_K_M matmul kernels) only kick in with AVX-512. This Coffee Lake box won't see them.

Benchmarks run on a single shared CPU host · Xeon E-2176G · CPU-only