Why mochallama (the decisions)
mochallama is the only in-process, tool-calling local LLM for the JVM — Spring-first, OpenAI-compatible, llama.cpp-backed through Project Panama FFM. No JNI, no daemon, no native-install dance.
This page is the honest version of why it is shaped the way it is. Each decision below is an ADR-shaped block: the context (the constraint we were actually staring at), the decision, and the trade-off it costs you. No hand-waving — where a number matters it's a real number, and where another tool is honestly better we say so on the Compare page.
Try it first (10 seconds, no Java install)
The fastest way to understand the decisions is to feel the result. The CLI ships its own JDK 22 runtime, so this needs nothing but npx:
# Multi-turn chat against a local, tool-capable model.
npx @deemwario/mochallama chat -m qwen2.5-1.5b
npx @deemwario/mochallama models

The adopt path is one dependency. Plain Java:
dependencies {
implementation("io.github.deemwario:mochallama-core:0.1.6")
// Pulls the right native for the host (all 5 platforms behind one aggregator POM).
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")
}

<dependency>
<groupId>io.github.deemwario</groupId>
<artifactId>mochallama-core</artifactId>
<version>0.1.6</version>
</dependency>
<dependency>
<groupId>io.github.deemwario</groupId>
<artifactId>mochallama-core-platform</artifactId>
<version>0.1.6</version>
<scope>runtime</scope>
</dependency>

Runtime requirement
JDK 22+ (FFM is GA there) with --enable-native-access=ALL-UNNAMED. The Spring path is one starter; see Architecture.
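A startup guard can make the JDK requirement explicit instead of leaving it to a linkage error. The class below is a minimal sketch and is not part of mochallama; the name `RuntimeCheck` is hypothetical, and it checks only the Java feature release, not whether native access was enabled.

```java
final class RuntimeCheck {
    /** True when the given Java feature release includes the final FFM API (JDK 22+). */
    static boolean supportsFfm(int featureRelease) {
        return featureRelease >= 22;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is the major release number, e.g. 22.
        int feature = Runtime.version().feature();
        if (!supportsFfm(feature)) {
            System.err.println("JDK " + feature + " detected; JDK 22+ is required for FFM.");
            System.exit(1);
        }
        System.out.println("JDK " + feature + " is new enough for FFM.");
    }
}
```

The native-access flag is passed at launch, for example `java --enable-native-access=ALL-UNNAMED -jar app.jar`, where `app.jar` stands in for your own application.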
The decision cascade (six whys)
Each step answers the question the step before it raises.
1. Why run the model in-process at all — not just call Ollama?
Context. Every existing local-LLM path for the JVM is a separate process reached over HTTP. Spring AI itself has no built-in capability to run models in-process; it points an HTTP client at Ollama, llama-server, LM Studio, or vLLM. A sidecar is a second thing to install, supervise, and keep alive — plus a network hop, idle resource use, and re-sending the whole conversation every call.
Decision. Run inference inside the host JVM. One dependency, the app's own lifecycle, Actuator health and Micrometer metrics on the same process, and stateful inference with no daemon.
Trade-off. You give up the things a shared daemon is genuinely good at — one model server fronting many apps, and automatic GPU offload out of the box. If that's what you want, Ollama is the better tool, and we say so on Compare.
Proven in: Architecture — the lifecycle, state machine, and Actuator wiring.
2. Why Panama FFM — not JNI?
Context. Consumers should need no C toolchain and no fragile hand-written JNI glue. And JNI has a sharp failure mode that is not theoretical: a native fault in the bound library takes down the whole JVM. The reference JNI binding (kherud/java-llama.cpp) has a real reported crash — a hard SIGILL in ggml_init that "happened outside the Java Virtual Machine in native code." A SIGSEGV in the native layer is the same story: no Java stack trace, no catch, just a dead process.
Decision. Use Project Panama FFM, the Foreign Function & Memory API, which is GA on JDK 22 (not an incubator or preview module). It carries roughly 4–5× lower call overhead than JNI and turns many mistakes into Java exceptions instead of segfaults. The bridge itself is a thin extern-C shim over llama.cpp's common_chat: about 700 LOC and 7 functions, handwritten rather than generated by jextract. A tiny, deliberate ABI surface means fewer descriptors to get wrong, which means fewer crash vectors.
Trade-off. FFM is GA only on JDK 22+. If you're pinned to an older JDK (Java 11, for example), FFM isn't available and a JNI binding is your remaining option, with the accepted risk that its native faults can SIGILL / SIGSEGV the JVM.
Proven in: Architecture — the FFM bindings and the bridge ABI.
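To show what a hand-written FFM binding looks like in general, here is a minimal sketch that calls the C library's `strlen` through the JDK 22 API. It is not mochallama's bridge: the class name is hypothetical, and `strlen` stands in for a bridge function only because every platform's C runtime exports it. It assumes a 64-bit host, where `size_t` maps to a Java `long`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class FfmStrlenSketch {
    /** Calls size_t strlen(const char*) via FFM; no JNI glue or C compiler involved. */
    static long strlen(String text) {
        Linker linker = Linker.nativeLinker();
        // Resolve the symbol from the platform's default C library.
        MemorySegment symbol = linker.defaultLookup().find("strlen").orElseThrow();
        // The descriptor is the whole "binding": return type first, then arguments.
        MethodHandle handle = linker.downcallHandle(
                symbol, FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Copies the string to native memory as NUL-terminated UTF-8;
            // the memory is released when the arena closes.
            MemorySegment cString = arena.allocateFrom(text);
            return (long) handle.invokeExact(cString);
        } catch (Throwable t) {
            // invokeExact declares Throwable; surface it as an ordinary Java exception.
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("mochallama"));
    }
}
```

Each bound function needs one descriptor like the one above, which is why a small function count keeps the number of places to get a signature wrong small.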
3. Why download prebuilt llama.cpp — and compile only the bridge?
Context. A from-source llama.cpp build is ~95 minutes and OOM-prone in CI, and "many users previously got stuck at compilation and dependency setup." Compiling what upstream already ships is wasted time and a support burden.
Decision. Consume upstream's official prebuilt release libs (tag b9371) and compile only the ~700-LOC bridge — that's ~2–11 seconds, not 95 minutes. The owner's stance, plainly: don't compile what upstream already ships. The bonus is strategic: we inherit llama.cpp's entire model zoo, quant formats, and Metal/CUDA/AVX kernels the day upstream ships them — instead of hand-porting each architecture the way a pure-Java engine must.
Trade-off. We're bound to the platforms and the tag llama.cpp prebuilds. An exotic CPU target with no upstream binary isn't covered. (A pure-Java engine like Jlama wins exactly there — Compare.)
Proven in: Architecture — the native build downloads release libs, not sources.
4. Why tool-calling-only — reject non-tool models at load?
Context. mochallama exists for agentic / function-calling workloads. The market silently degrades here: raw llama-server falls back to a "Generic" handler that is "less efficient," and Ollama tool-calling "starts breaking" on exactly the small models people run locally. A model either has a tool-aware chat template or it doesn't — and when it doesn't, most stacks hand you plausible-looking garbage with no signal.
Decision. Make tool-capability an explicit load-time contract. A non-tool-capable model is rejected at load with MODEL_NOT_TOOL_CAPABLE — fail fast, loudly, before you've shipped a silent regression. Gated Hugging Face repos likewise fail early with MODEL_GATED instead of failing deep in a download. Honesty over silent degradation.
Trade-off. You can't load a non-tool chat model "just to try it." That's deliberate: if you want an unconstrained general chat engine, this isn't the contract for you.
Proven in: Tool Calling — the load-time contract and the error codes. Model presets: Models.
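The fail-fast idea can be sketched in a few lines. Everything here except the two error-code strings is hypothetical: `ModelInfo`, `ModelLoadException`, and `validate` are illustration-only names, and how mochallama actually detects a tool-aware chat template is not shown on this page.

```java
final class LoadContractSketch {
    /** Hypothetical exception carrying one of the error codes named above. */
    static final class ModelLoadException extends RuntimeException {
        final String code;

        ModelLoadException(String code, String message) {
            super(code + ": " + message);
            this.code = code;
        }
    }

    /** Hypothetical descriptor; the two booleans stand in for real detection logic. */
    record ModelInfo(String id, boolean gated, boolean toolAwareTemplate) {}

    /** Rejects a model before any inference happens, or returns it unchanged. */
    static ModelInfo validate(ModelInfo model) {
        if (model.gated()) {
            // Checked first so the failure happens before a download is attempted.
            throw new ModelLoadException("MODEL_GATED", model.id() + " requires access approval");
        }
        if (!model.toolAwareTemplate()) {
            throw new ModelLoadException("MODEL_NOT_TOOL_CAPABLE",
                    model.id() + " has no tool-aware chat template");
        }
        return model;
    }

    public static void main(String[] args) {
        try {
            validate(new ModelInfo("org/repo", false, false));
        } catch (ModelLoadException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the shape is that the caller gets a stable code to branch on at load time, rather than discovering the problem from malformed tool calls later.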
5. Why Spring-first — plus an OpenAI-compatible wire API?
Context. OpenAI's /v1/chat/completions is the de-facto standard nobody wants to abandon — the whole promise is "take existing code written for the OpenAI API and point it at a local instance with a single line change."
Decision. Ship a @AutoConfiguration starter that is one dependency and exposes: POST /v1/chat/completions (with SSE when stream: true), GET /v1/models, and Actuator health + metrics — with no spring-ai dependency required. Tools, tool_choice, and SSE streaming all work together. Code already written against OpenAI (or Ollama's OpenAI shim) is drop-in. For Spring AI users, a separate mochallama-spring-ai adapter exposes a ChatModel/ChatClient so mochallama slots in under Spring AI as the local provider.
Trade-off. mochallama is an inference engine + wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j on top — mochallama is the local engine beneath them, not a replacement.
Proven in: Architecture — the starter, endpoints, and the Spring AI adapter.
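Because the endpoint follows the OpenAI chat-completions shape, a client needs nothing beyond the JDK's own `HttpClient`. The sketch below makes several assumptions that are not stated on this page: the app listens on `http://localhost:8080` (Spring Boot's default port), the `model` field accepts a preset name, and the class and method names are hypothetical. JSON is built by hand only to keep the example dependency-free.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class ChatCompletionsSketch {
    /** Quotes and escapes a string for embedding in a JSON document. */
    static String jsonString(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                default -> {
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
                }
            }
        }
        return out.append('"').toString();
    }

    /** Minimal OpenAI-style request body with a single user message, non-streaming. */
    static String body(String model, String userMessage) {
        return "{\"model\":" + jsonString(model)
                + ",\"messages\":[{\"role\":\"user\",\"content\":" + jsonString(userMessage) + "}]"
                + ",\"stream\":false}";
    }

    static HttpRequest request(String baseUrl, String model, String userMessage) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(model, userMessage)))
                .build();
    }

    /** Sends the request and returns the raw JSON response; needs a running server. */
    static String send(HttpRequest request) throws Exception {
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Assumed base URL and model name; adjust both to your application.
        HttpRequest req = request("http://localhost:8080", "qwen2.5-1.5b", "Hello");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(body("qwen2.5-1.5b", "Hello"));
    }
}
```

`main` only prints the request so it runs without a server; call `send(req)` once the application is up. An existing OpenAI client library pointed at the same base URL should produce an equivalent request.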
6. Why per-platform classifier jars — and a jlink npx CLI?
Context. Packaging natives is a chore every Java dev otherwise hand-rolls: a jar per platform plus a runtime loader, an LD_LIBRARY_PATH dance, and "go install a JDK first" for anyone who just wants to try the thing.
Decision. Ship the native libs as per-platform classifier jars behind one aggregator POM (mochallama-core-platform): add a single runtimeOnly dependency and the correct native for the host resolves itself — no C toolchain, no LD_LIBRARY_PATH. And the CLI bundles its own jlink JDK-22 runtime image via npm optionalDependencies (~31 MB per host), so npx @deemwario/mochallama chat needs no Java install at all — strictly less install than Ollama.
Trade-off. Per-host artifacts mean a larger total release footprint and a real release pipeline that has to produce every platform's native + jlink image. That cost lives in our CI, not in your build.
Proven in: Architecture — the classifier-jar layout and the npm CLI packaging.
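For a sense of what "the correct native for the host" involves, here is a sketch of the host detection a per-platform layout has to do somewhere. The classifier strings it produces (such as `linux-x86_64`) are hypothetical and are not mochallama's published classifier names; the class is an illustration, not the project's loader.

```java
import java.util.Locale;

final class HostClassifierSketch {
    /** Maps os.name / os.arch style values to a hypothetical "os-arch" classifier. */
    static String classifier(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.startsWith("linux")) osPart = "linux";
        else if (os.startsWith("mac")) osPart = "macos";
        else if (os.startsWith("windows")) osPart = "windows";
        else throw new UnsupportedOperationException("Unsupported OS: " + osName);

        // JVMs report the same CPU under different names, so normalize them.
        String archPart = switch (arch) {
            case "amd64", "x86_64" -> "x86_64";
            case "aarch64", "arm64" -> "aarch64";
            default -> throw new UnsupportedOperationException("Unsupported arch: " + osArch);
        };
        return osPart + "-" + archPart;
    }

    public static void main(String[] args) {
        System.out.println(classifier(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

With classifier jars behind an aggregator POM, this mapping is the packaging's job rather than something you write in your own build.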
Set CPU expectations honestly
mochallama runs on a CPU dev box by default, and that shapes the model defaults. The usable path on CPU is small, quantized presets — roughly 1.5B–4B parameters at Q4_K_M. That is exactly why the default preset is qwen2.5-1.5b: it's tool-capable and fast enough to be pleasant on a laptop CPU.
The tool-capable presets:
| Preset | Size | Notes |
|---|---|---|
| qwen2.5-1.5b | 1.5B | Default. Best CPU latency. |
| qwen2.5-3b | 3B | More capable, still CPU-friendly. |
| qwen3-4b | 4B | Top of the comfortable CPU range. |
| phi-4-mini | ~3.8B | Microsoft Phi, tool-aware. |
Or pass any tool-capable Hugging Face GGUF id (org/repo); gated repos fail early with MODEL_GATED. Models cache under ~/.chatbot_models.
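A rough back-of-the-envelope estimate shows why this size range suits a laptop. The sketch assumes about 4.85 bits per weight for Q4_K_M, a commonly quoted approximation rather than a figure from this project, and it uses the nominal parameter counts from the table. It covers weights only; the KV cache and runtime buffers come on top, and real GGUF file sizes vary by architecture.

```java
final class QuantFootprintSketch {
    // Assumption: approximate average bits per weight for Q4_K_M quantization.
    static final double ASSUMED_BITS_PER_WEIGHT = 4.85;

    /** Approximate weight size in decimal gigabytes for a given parameter count. */
    static double approxGigabytes(double parameters) {
        return parameters * ASSUMED_BITS_PER_WEIGHT / 8.0 / 1e9;
    }

    public static void main(String[] args) {
        String[] presets = {"qwen2.5-1.5b", "qwen2.5-3b", "qwen3-4b", "phi-4-mini"};
        double[] parameters = {1.5e9, 3.0e9, 4.0e9, 3.8e9}; // nominal sizes from the table
        for (int i = 0; i < presets.length; i++) {
            System.out.printf("%s: ~%.1f GB of weights%n",
                    presets[i], approxGigabytes(parameters[i]));
        }
    }
}
```

Under these assumptions the default preset needs on the order of 1 GB for weights and the 4B preset roughly 2.5 GB, which is the difference between comfortable and tight on a typical dev machine.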
Not a GPU server
If you need big models with GPU offload out of the box, a shared standalone server (Ollama, vLLM) is the honest answer. mochallama optimizes for in-process, tool-calling, on the box you already have. See Compare.
Proven in: Models — the full preset table and HF-by-id behavior.
The new CLI (multi-turn, sessions, --resume)
As of 0.1.6, mochallama chat is real multi-turn: it keeps the full conversation history instead of treating each message as an isolated single turn. Conversations persist as sessions under ~/.chatbot_models/sessions/<id>.json.
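Since sessions are plain files, you can inspect them from Java without going through the CLI. This sketch relies only on the directory layout described above, one `<id>.json` file per session; it does not parse the files, because their JSON schema is not documented on this page. The class name is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

final class SessionIdsSketch {
    /** Returns the sorted ids of all "<id>.json" files in the directory, or an empty list. */
    static List<String> sessionIds(Path sessionsDir) {
        if (!Files.isDirectory(sessionsDir)) {
            return List.of(); // nothing saved yet
        }
        try (Stream<Path> files = Files.list(sessionsDir)) {
            return files
                    .map(path -> path.getFileName().toString())
                    .filter(name -> name.endsWith(".json"))
                    .map(name -> name.substring(0, name.length() - ".json".length()))
                    .sorted()
                    .toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Path.of(System.getProperty("user.home"), ".chatbot_models", "sessions");
        sessionIds(dir).forEach(System.out::println);
    }
}
```

Each printed id is a value you could pass to `chat --resume`.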
npx @deemwario/mochallama chat -m qwen2.5-1.5b
# In-REPL slash commands: /reset /help /exit

# Continue a prior conversation by id.
npx @deemwario/mochallama chat --resume <id>

# id, model, turns, last-updated.
npx @deemwario/mochallama sessions

# Don't persist this conversation.
npx @deemwario/mochallama chat -m qwen2.5-1.5b --no-save

Where this is all proven
| Decision | See |
|---|---|
| In-process, FFM-not-JNI, prebuilt bridge | Architecture |
| Tool-calling-only contract + error codes | Tool Calling |
| Model presets, CPU expectations, HF-by-id | Models |
| Where alternatives are honestly better | Compare |
Corrections welcome — the comparison page invites PRs.