Models & profiles

Reference for which models mochallama loads, how to point it at your own GGUF, where models are cached, and the load-time error contract. The lineup is tool-callers only — non-tool models are rejected at load, by design (the reasoning lives in tool-calling-support.md).

Quickstart

Pull a model and chat with no Java and no model files of your own — the CLI ships its own JDK-22 runtime image and downloads the default GGUF on first run:

bash
# default model is qwen2.5-1.5b (~1.1 GB, downloaded once)
npx @deemwario/mochallama chat -m qwen2.5-1.5b

From a JVM app, add the core and platform artifacts to your build (Gradle or Maven), then point ChatEngine at any tool-capable GGUF:

groovy
implementation 'io.github.deemwario:mochallama-core:0.1.6'
runtimeOnly    'io.github.deemwario:mochallama-core-platform:0.1.6'

xml
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>

In a Spring Boot app the model is selected by configuration; activate a profile or set llamacpp.model.*. The id reported on GET /v1/models is derived from the model filename, so switching profiles also switches the OpenAI-compatible model id.

sh
./gradlew bootRun --args='--spring.profiles.active=qwen2.5-3b'

Built-in profiles

All four are tool-capable and quantized to Q4_K_M. The default is qwen2.5-1.5b: the smallest and fastest model retained in the lineup, and the one whose tool calling is proven by a passing native agent tool test, so first boot is quick.

| Profile | HF repo | Filename | Size on disk | RAM (approx) | Gen tok/s (CPU, approx) | Tool calling | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2.5-1.5b | Qwen/Qwen2.5-1.5B-Instruct-GGUF | qwen2.5-1.5b-instruct-q4_k_m.gguf | 1.1 GB | ~2 GB | ~12 | Yes (proven) | Default. Smallest + fastest kept model; native agent tool test passed on this exact GGUF. |
| qwen2.5-3b | Qwen/Qwen2.5-3B-Instruct-GGUF | qwen2.5-3b-instruct-q4_k_m.gguf | 2.1 GB | ~3 GB | ~7 | Yes | Quality step up; same Qwen2.5 tool template. |
| qwen3-4b | unsloth/Qwen3-4B-Instruct-2507-GGUF | Qwen3-4B-Instruct-2507-Q4_K_M.gguf | 2.5 GB | ~4.5 GB | ~9.8 (measured) | Yes | Strongest general-purpose 4B in the lineup; strong tool calling. Profile id is historical; the model is "Qwen3". |
| phi-4-mini | unsloth/Phi-4-mini-instruct-GGUF | Phi-4-mini-instruct-Q4_K_M.gguf | 2.5 GB | ~3.5 GB | ~10.6 (measured) | Yes | Fastest 4B-class in the lineup; ships a tool-capable chat template. |

To switch models in a Spring app, activate the matching profile (the names above) or set llamacpp.model.url / llamacpp.model.filename directly.
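When setting llamacpp.model.url by hand, the value is typically a direct download link. The sketch below builds the standard Hugging Face "resolve" URL for a file on a repo's main branch. The ResolveUrl helper is hypothetical (not part of mochallama), and it assumes the property takes a plain download URL of this shape.

```java
import java.net.URI;

final class ResolveUrl {
    // Standard Hugging Face download URL for a file on the "main" branch.
    // Assumption: llamacpp.model.url accepts a direct URL like this one.
    static URI of(String repoId, String filename) {
        return URI.create("https://huggingface.co/" + repoId + "/resolve/main/" + filename);
    }

    public static void main(String[] args) {
        String file = "qwen2.5-3b-instruct-q4_k_m.gguf";
        // Print the two properties as they might appear in application.properties.
        System.out.println("llamacpp.model.url=" + of("Qwen/Qwen2.5-3B-Instruct-GGUF", file));
        System.out.println("llamacpp.model.filename=" + file);
    }
}
```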

Why the defaults are small

Q4_K_M weights load fast and fit in a few GB of RAM, so a plain npx or bootRun reaches first-token quickly on a laptop CPU with no GPU. The rationale for the curated, tool-only lineup is in tool-calling-support.md — this page is the reference.

Load any model by Hugging Face id

You are not limited to the four built-in profiles. Point at any tool-capable Hugging Face GGUF repo by id (org/repo); the resolver picks the preferred quant (Q4_K_M, falling back through Q5_K_M → Q4_0 → …) via the Hub API and downloads it into the shared cache.
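The quant fallback can be pictured as a first-match scan over a preference list. This is a minimal sketch, not the actual resolver: QuantPicker is a hypothetical name, matching by filename substring is an assumption, and only the three quants named above are listed (the real order continues past Q4_0).

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class QuantPicker {
    // Preference order named in the text; the real list is longer.
    static final List<String> PREFERENCE = List.of("Q4_K_M", "Q5_K_M", "Q4_0");

    // Returns the first .gguf file matching the most-preferred available quant.
    static Optional<String> pick(List<String> repoFiles) {
        for (String quant : PREFERENCE) {
            String needle = quant.toLowerCase(Locale.ROOT);
            for (String file : repoFiles) {
                String name = file.toLowerCase(Locale.ROOT);
                if (name.endsWith(".gguf") && name.contains(needle)) {
                    return Optional.of(file);
                }
            }
        }
        return Optional.empty(); // no known quant present in the repo listing
    }

    public static void main(String[] args) {
        // Hypothetical repo listing without a Q4_K_M file.
        List<String> files = List.of("model-Q8_0.gguf", "model-Q5_K_M.gguf", "model-Q4_0.gguf");
        System.out.println(pick(files).orElse("no supported quant found"));
    }
}
```

Matching is case-insensitive because the built-in profiles already mix q4_k_m and Q4_K_M in their filenames.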

properties
# Alternative to llamacpp.model.url + .filename
llamacpp.model.hf-id=Qwen/Qwen2.5-3B-Instruct-GGUF
llamacpp.model.quant=Q4_K_M

bash
# --model accepts a profile name, an HF id, OR a local .gguf path
npx @deemwario/mochallama chat --model Qwen/Qwen2.5-3B-Instruct-GGUF
npx @deemwario/mochallama chat --model /path/to/your-model.gguf

Resolution precedence (Spring): explicit llamacpp.model.url + .filename > llamacpp.model.hf-id + .quant > the built-in default.
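The precedence rule reads naturally as a chain of guarded checks. The sketch below is illustrative only: ModelSource is a hypothetical class, the returned strings are just labels, and two behaviors are assumptions rather than documented facts (a url without a filename is treated as unset, and a missing quant falls back to Q4_K_M).

```java
final class ModelSource {
    static boolean isSet(String s) {
        return s != null && !s.isBlank();
    }

    // Mirrors: url + filename > hf-id + quant > built-in default.
    static String resolve(String url, String filename, String hfId, String quant) {
        if (isSet(url) && isSet(filename)) {
            return "url:" + url + " -> " + filename;
        }
        if (isSet(hfId)) {
            // Assumption: an unset quant means the preferred Q4_K_M.
            return "hf:" + hfId + "@" + (isSet(quant) ? quant : "Q4_K_M");
        }
        return "default:qwen2.5-1.5b";
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null, "Qwen/Qwen2.5-3B-Instruct-GGUF", null));
        System.out.println(resolve(null, null, null, null));
    }
}
```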

Both the starter and the CLI share one resolver/downloader and one model cache, so a model pulled by either is reused by the other.

Model cache

Models are cached under ~/.chatbot_models (overridable in Spring via llamacpp.model.cache-dir). The first run downloads the GGUF into that directory; later runs reuse the cached file. The CLI and the Spring app share this directory.
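A download-once cache of this kind comes down to "resolve the directory, then check whether the file is already there". The sketch below shows that check under stated assumptions: ModelCache is a hypothetical helper, and it assumes GGUF files sit directly under the cache directory by filename, which the page does not confirm.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class ModelCache {
    // An explicit override (e.g. the value of llamacpp.model.cache-dir) wins;
    // otherwise fall back to ~/.chatbot_models.
    static Path cacheDir(String override) {
        if (override != null && !override.isBlank()) {
            return Path.of(override);
        }
        return Path.of(System.getProperty("user.home"), ".chatbot_models");
    }

    // Assumption: models are stored flat, keyed by filename.
    static boolean needsDownload(Path cacheDir, String filename) {
        return !Files.isRegularFile(cacheDir.resolve(filename));
    }

    public static void main(String[] args) {
        Path dir = cacheDir(null);
        String file = "qwen2.5-1.5b-instruct-q4_k_m.gguf";
        System.out.println(dir.resolve(file) + " needs download: " + needsDownload(dir, file));
    }
}
```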

Load-time error contract

Model selection fails fast and explicitly rather than degrading silently. These are documented, stable error codes:

| Code | When it fires | Behavior |
| --- | --- | --- |
| MODEL_NOT_TOOL_CAPABLE | The GGUF's chat template does not support tool calling (e.g. Gemma GGUFs; Llama-3.2 1B/3B are only partial). | The model is rejected at load, never silently downgraded to a non-tool fallback. In Java, ChatEngine/MochallamaClient throw LlamaException(MODEL_NOT_TOOL_CAPABLE); the native bridge returns NULL. |
| MODEL_GATED | The repo is gated/private and no accepted license / HF_TOKEN is available. | Fails early at resolve/download time. Set HF_TOKEN (and accept the license on Hugging Face) to use a gated repo. |
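Because the codes are stable, an application can map them to operator-facing guidance. The sketch below does that with a plain enum. LoadError and hintFor are hypothetical names for illustration, not mochallama's API; how the code is read from LlamaException is not shown here and would depend on the library's actual accessors.

```java
enum LoadError {
    MODEL_NOT_TOOL_CAPABLE("Choose a GGUF whose chat template supports tool calling."),
    MODEL_GATED("Accept the license on Hugging Face and set HF_TOKEN.");

    private final String hint;

    LoadError(String hint) {
        this.hint = hint;
    }

    // Looks up remediation text for a documented code; unknown codes are reported as such.
    static String hintFor(String code) {
        for (LoadError e : values()) {
            if (e.name().equals(code)) {
                return e.hint;
            }
        }
        return "Unknown load error: " + code;
    }

    public static void main(String[] args) {
        System.out.println(hintFor("MODEL_GATED"));
    }
}
```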

Tool-callers only

A non-tool HF id is rejected, not silently accepted and downgraded. This is the core contract of the library — see tool-calling-support.md for the detection mechanism and the HF fetch/verify flow.

Dropped (not tool-capable / unreliable)

| Former profile | Why dropped |
| --- | --- |
| gemma-4-e4b (unsloth/gemma-3n-E4B-it-GGUF) | Gemma GGUFs generally lack a tool-calling chat template. Also the highest RAM cost in the old lineup. |
| llama-3.2-1b (mukel/Llama-3.2-1B-Instruct-GGUF) | Only partial tool support; not reliable enough to emit tool-call JSON. |
| llama-3.2-3b (mukel/Llama-3.2-3B-Instruct-GGUF) | Same partial tool support as the 1B. |

CLI: models, chat & sessions

mochallama models lists the built-in profiles. mochallama chat is a multi-turn REPL: it keeps the full conversation history across turns rather than treating each prompt as a fresh single turn, and it persists conversations as sessions.

bash
npx @deemwario/mochallama models                       # list profiles
npx @deemwario/mochallama chat -m qwen2.5-1.5b         # start a chat
npx @deemwario/mochallama sessions                      # list saved sessions
npx @deemwario/mochallama chat --resume <id>           # continue a prior chat

Sessions persist as one JSON file per session under ~/.chatbot_models/sessions/<id>.json. mochallama sessions lists them (id, model, turns, last-updated). Resume a prior conversation with chat --resume <id>; the resumed session determines which model is loaded. mochallama chat --no-save runs an ephemeral session that is never written to disk.
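Since each session is a single <id>.json file, external tooling can enumerate sessions by listing that directory. The sketch below derives session ids from filenames only. SessionFiles is a hypothetical helper, and it deliberately does not parse the JSON, because the file schema is not documented on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SessionFiles {
    static Path fileFor(Path cacheDir, String id) {
        return cacheDir.resolve("sessions").resolve(id + ".json");
    }

    // Returns sorted session ids; an absent sessions directory means no sessions yet.
    static List<String> listIds(Path cacheDir) {
        Path dir = cacheDir.resolve("sessions");
        if (!Files.isDirectory(dir)) {
            return List.of();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                    .filter(n -> n.endsWith(".json"))
                    .map(n -> n.substring(0, n.length() - ".json".length()))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path cache = Path.of(System.getProperty("user.home"), ".chatbot_models");
        System.out.println(listIds(cache));
    }
}
```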

chat options:

| Option | Default | Meaning |
| --- | --- | --- |
| -m, --model | qwen2.5-1.5b | Profile name, HF id (org/repo), or local .gguf path. |
| --resume <id> | | Continue a saved session (see mochallama sessions). |
| --no-save | off | Ephemeral session; nothing is written to disk. |
| --max-tokens | 256 | Max tokens per reply. |
| --temperature | 0.7 | Sampling temperature. |

In-REPL slash commands:

| Command | Effect |
| --- | --- |
| /help | List the commands. |
| /reset | Clear the conversation history (keeps the session id). |
| /exit | Quit (EOF / Ctrl-D also quits). |

Notes on the numbers

  • Size on disk is the HF-reported file size for the listed .gguf.
  • RAM is a rule-of-thumb: Q4 weights × ~1.3 plus the KV cache for the configured context size. Longer contexts use more.
  • Gen tok/s marked measured is single-stream on a stock AVX2 CPU build, Q4_K_M, fp16 KV. The other figures are extrapolated rules of thumb, not measured here.
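The RAM rule of thumb can be worked through numerically. The sketch below multiplies the weight size by 1.3 and adds a KV cache computed with the usual formula: keys plus values, per layer, per KV head, per head dimension, per context token. The architecture numbers in main (28 layers, 2 KV heads, head dimension 128, 4096-token context) are illustrative assumptions, not values read from any GGUF. The result is a lower bound that excludes JVM and runtime overhead, so the table's rounded figures sit above it.

```java
final class RamEstimate {
    // 2x accounts for storing both keys and values.
    static long kvCacheBytes(int layers, int kvHeads, int headDim, int contextTokens, int bytesPerElement) {
        return 2L * layers * kvHeads * headDim * contextTokens * bytesPerElement;
    }

    // Rule of thumb from the notes: Q4 weights x ~1.3, plus the KV cache (decimal GB).
    static double estimateGb(double weightsGb, long kvBytes) {
        return weightsGb * 1.3 + kvBytes / 1e9;
    }

    public static void main(String[] args) {
        // Assumed architecture values for illustration; fp16 KV means 2 bytes per element.
        long kv = kvCacheBytes(28, 2, 128, 4096, 2);
        System.out.printf("KV cache: %d bytes, estimate: %.2f GB%n", kv, estimateGb(1.1, kv));
    }
}
```

Doubling the context doubles the KV term while the weight term stays fixed, which is why longer contexts use more RAM.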

Honest CPU latency

These are small Q4 models on CPU. Expect roughly single-digit to ~12 tokens/sec on a typical laptop/desktop CPU — readable and interactive, but not GPU-fast. Your CPU, RAM, and thread count matter more than the choice of model. mochallama runs in-process via llama.cpp; it does not auto-offload to a GPU the way a standalone server might. Pick the smallest model that meets your quality bar.
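To turn tokens/sec into wall-clock expectations, divide the reply length by the generation rate. The sketch below does this for a full reply at the default --max-tokens of 256, using the approximate rates from the profile table. ReplyLatency is a hypothetical helper, and the figure covers generation only: prompt processing and model load time are not included.

```java
final class ReplyLatency {
    // Seconds to generate `tokens` at a steady `tokensPerSecond`.
    static double seconds(int tokens, double tokensPerSecond) {
        if (tokensPerSecond <= 0) {
            throw new IllegalArgumentException("tokensPerSecond must be positive");
        }
        return tokens / tokensPerSecond;
    }

    public static void main(String[] args) {
        int maxTokens = 256; // CLI default for --max-tokens
        double[] rates = {12.0, 10.6, 9.8, 7.0}; // approximate tok/s from the profile table
        for (double rate : rates) {
            System.out.printf("%.1f tok/s -> %.1f s for %d tokens%n", rate, seconds(maxTokens, rate), maxTokens);
        }
    }
}
```

At these rates a maximum-length reply takes on the order of 20 to 40 seconds, which is why choosing the smallest adequate model pays off on CPU.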

The resolve URLs of all four shipped GGUFs are HEAD-checked for anonymous download, so no built-in profile references a gated or private model.

Released under the MIT License.