mochallama
A local, tool-calling LLM inside your JVM
The only in-process, tool-calling local LLM for the JVM — Spring-first, OpenAI-compatible, llama.cpp-backed via Project Panama FFM. No JNI, no daemon, no native-install dance. Requires JDK 22+.
The model runs inside your application's own process. No Ollama-style sidecar to install and supervise, no HTTP round-trip, no idle resource drain. Inference is stateful and rides your app's lifecycle, Actuator health, and Micrometer metrics.
No JNI — all Panama FFM
Java talks to llama.cpp through the JDK 22 Foreign Function & Memory API (GA, not incubator), over a thin ~700-LOC extern-C bridge on llama.cpp's common_chat. No hand-written JNI glue, far fewer crash vectors.
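As a rough illustration of what an FFM binding looks like (this is not mochallama's bridge code), the sketch below calls the C standard library's strlen from Java on JDK 22+. The class and method names are invented for the example, and it assumes the platform's default lookup exposes strlen; the real bridge would bind functions exported by its extern-C layer instead.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical example class; not part of mochallama.
class FfmStrlenSketch {

    // Binds size_t strlen(const char*) through the FFM API and invokes it.
    static long nativeStrlen(String text) {
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        // The confined arena frees the native string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text); // NUL-terminated UTF-8 copy
            return (long) strlen.invokeExact(cString);
        } catch (Throwable t) {
            // MethodHandle.invokeExact declares Throwable; wrap it for callers.
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(nativeStrlen("mochallama"));
    }
}
```

No JNI header, generated stub, or hand-written C glue is involved: the function signature is described in Java and the JDK produces the call stub at runtime.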
Prebuilt llama.cpp, 5 platforms, zero native-install
Consumes upstream's official prebuilt llama.cpp release libraries (tag b9371) and compiles only the bridge (~2–11 s, versus a 95-minute from-source build). Per-platform classifier jars auto-load the right native library: macOS Intel + Apple Silicon, Linux x86-64 + ARM64, Windows x86-64.
Spring autoconfig, OpenAI-compatible
One @AutoConfiguration dependency exposes POST /v1/chat/completions (with SSE when stream:true) and GET /v1/models. Tools and streaming work together. Drop-in for code already written against OpenAI or Ollama.
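For clients that consume the stream:true response by hand, a minimal sketch of reading the SSE body might look like the following. It assumes the stream follows OpenAI's convention of one JSON chunk per "data:" line, terminated by a "data: [DONE]" sentinel; the class name is hypothetical and the sample chunks are illustrative, not captured output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes OpenAI-style SSE framing.
class SseDataLines {

    // Returns the payload of every "data:" line up to (not including) "[DONE]".
    static List<String> payloads(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("data:")) {
                continue; // blank event separators and ":" comment lines carry no payload
            }
            String payload = line.substring("data:".length()).strip();
            if (payload.equals("[DONE]")) {
                break; // end-of-stream sentinel
            }
            out.add(payload);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}",
                "",
                "data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}",
                "",
                "data: [DONE]");
        payloads(sample).forEach(System.out::println);
    }
}
```

Each returned payload is a JSON chunk that still needs parsing with whatever JSON library the client already uses.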
Tool-calling-only — fail-fast
Built for agentic / function-calling work. Non-tool-capable models are rejected at load with MODEL_NOT_TOOL_CAPABLE — an explicit contract instead of silent degradation on small models.
Metrics via Actuator
The starter registers inference meters (timer, token distributions, tool-call counter, tokens/sec) and a model health indicator through Actuator + Micrometer. Prometheus is opt-in.
No Java install, no daemon, no native build — npx a tool-calling local LLM and start chatting:
bash
npx @deemwario/mochallama chat -m qwen2.5-1.5b
The CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. qwen2.5-1.5b is the default tool-capable preset; the model downloads on first run into ~/.chatbot_models.
Tell it which model to load — a Hugging Face id is the simplest (it resolves + caches the GGUF on first start). In src/main/resources/application.properties:
properties
llamacpp.model.hf-id=Qwen/Qwen2.5-1.5B-Instruct-GGUF
# or an explicit url + filename:
# llamacpp.model.url=https://.../qwen2.5-1.5b-instruct-q4_k_m.gguf
# llamacpp.model.filename=qwen2.5-1.5b-instruct-q4_k_m.gguf
Start the app (the model loads asynchronously — endpoints return 503 until state: READY), then point any OpenAI client at it. POST /v1/chat/completions handles non-streaming, stream:true SSE, and tools / tool_choice; GET /v1/models lists the loaded model.
bash
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello from local llama.cpp"}]}'
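The same call from Java, using only the JDK's built-in HTTP client, could be sketched as below. The class name, the one-second retry delay, the attempt limit, and the get_weather tool are all assumptions made for illustration; the tools payload follows the OpenAI request shape, and the retry loop simply waits out the 503 responses returned while the model is still loading.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch; not shipped with mochallama.
class ChatCompletionSketch {

    // Same body as the curl example.
    static final String PLAIN_BODY =
            "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from local llama.cpp\"}]}";

    // OpenAI-style tools request; get_weather is a made-up tool for illustration.
    static final String TOOLS_BODY = """
            {"messages":[{"role":"user","content":"What is the weather in Oslo?"}],
             "tools":[{"type":"function","function":{
               "name":"get_weather",
               "description":"Look up current weather for a city",
               "parameters":{"type":"object",
                             "properties":{"city":{"type":"string"}},
                             "required":["city"]}}}],
             "tool_choice":"auto"}
            """;

    static HttpRequest chatRequest(String baseUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    // Sends the request, retrying once per second while the server answers 503.
    static HttpResponse<String> sendWhenReady(HttpClient client, HttpRequest request, int maxAttempts)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        for (int attempt = 1; attempt < maxAttempts && response.statusCode() == 503; attempt++) {
            Thread.sleep(1000);
            response = client.send(request, HttpResponse.BodyHandlers.ofString());
        }
        return response;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                sendWhenReady(client, chatRequest("http://localhost:8080", PLAIN_BODY), 30);
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Passing TOOLS_BODY instead of PLAIN_BODY exercises the tools / tool_choice path against the same endpoint.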
mochallama chat is a stateful REPL — it keeps the full conversation history, not amnesiac single turns.
bash
# List the tool-capable presets / loaded models
npx @deemwario/mochallama models

# Start a multi-turn chat; the conversation is saved as a session
npx @deemwario/mochallama chat -m qwen2.5-1.5b

# List past sessions (id, model, turns, last-updated)
npx @deemwario/mochallama sessions

# Continue a prior conversation
npx @deemwario/mochallama chat --resume <id>
Sessions persist at ~/.chatbot_models/sessions/<id>.json. Pass --no-save for an ephemeral run. Inside the REPL, slash commands /reset, /help, and /exit are available.
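Because sessions are plain files on disk, other tooling can discover them without going through the CLI. The sketch below lists session ids by scanning the sessions directory for .json files. The class name is hypothetical, and nothing is assumed about the JSON schema inside each file, so the code only reads file names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; reads file names only, never session contents.
class SessionFiles {

    // Returns the sorted ids (file names minus ".json") found in sessionsDir.
    static List<String> sessionIds(Path sessionsDir) throws IOException {
        List<String> ids = new ArrayList<>();
        if (!Files.isDirectory(sessionsDir)) {
            return ids; // no sessions saved yet
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sessionsDir, "*.json")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                ids.add(name.substring(0, name.length() - ".json".length()));
            }
        }
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(System.getProperty("user.home"), ".chatbot_models", "sessions");
        sessionIds(dir).forEach(System.out::println);
    }
}
```

Any id printed here can be passed to chat --resume.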
Today the usual local-LLM path for the JVM reaches your app over HTTP: Ollama, llama-server, LM Studio and friends all run as separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The in-process alternatives are either not on the JVM at all or, on the JVM, come down to pure-Java Jlama (which reimplements inference on the incubating Vector API and does not use GGUF) or JNI bindings, whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install.
It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j — mochallama slots in under them as the local provider via its Spring AI ChatModel adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in Compare.