Skip to content

Observability

The mochallama-spring-boot-starter ships inference visibility out of the box via Spring Boot Actuator + Micrometer. Adding the starter to a Spring Boot app auto-configures a model health indicator and a set of inference meters — no extra wiring required.

The starter declares spring-boot-starter-actuator as an implementation dependency, so Actuator (and Micrometer core) arrive transitively. This is an accepted trade-off for this project: Actuator/Micrometer are Spring infra and do not pin a Spring AI version, so the starter stays version-resilient. (To make it optional instead, switch the dependency to compileOnly and gate the metric registration on @ConditionalOnClass(MeterRegistry.class).)

Actuator endpoints

The demo app exposes these in application.properties:

properties
management.endpoints.web.exposure.include=health,info,metrics,prometheus
management.endpoint.health.show-details=always
EndpointPurpose
GET /actuator/healthAggregate health incl. mochallama group
GET /actuator/metricsList of all meter names
GET /actuator/metrics/{name}A single meter's measurements
GET /actuator/prometheusPrometheus scrape (opt-in, see below)

Health indicator

tools.deemwar.mochallama.actuate.MochallamaHealthIndicator reflects model load progress so probes and dashboards can gate on real inference availability.

Load stateHealth
DOWNLOADINGDOWN
LOADINGDOWN
READYUP
FAILEDDOWN (adds error detail = last error message)

Details on every response: model (id), state, loadDurationMs.

It is auto-configured by MochallamaActuatorAutoConfiguration, gated @ConditionalOnClass(HealthIndicator.class) and @ConditionalOnBean(LlamaCppService.class).

Meters

All inference meters are registered by LlamaCppService against the injected MeterRegistry. Naming is uniform: mochallama.<area>.<thing>, and every meter is tagged model=<modelId> (the GGUF filename without its .gguf suffix) where it makes sense (the process-wide tokens_per_second gauge is the only untagged one).

MeterTypeTagsMeaning
mochallama.inferencesCountermodel, outcome=success|error, stream=true|falseOne increment per chat/chatStream call
mochallama.tool_callsCountermodelIncremented by the number of tool calls the model emitted in a turn
mochallama.inference.durationTimermodelWall-clock time per inference call
mochallama.tokens.completionDistributionSummarymodelReal completion tokens per call (from ChatResult.usage)
mochallama.tokens.promptDistributionSummarymodelReal prompt tokens per call (from ChatResult.usage)
mochallama.tokens_per_secondGaugeThroughput of the last completed turn
mochallama.model.stateGaugemodelLoadState.ordinal(): 0=DOWNLOADING 1=LOADING 2=READY 3=FAILED
mochallama.model.load.durationTimeGauge (ms)modelElapsed time from load start until READY

Real token usage

Token counts are now the exact values reported by the native bridge, read from ChatResult.usage() (promptTokens / completionTokens / totalTokens) after every successful inference. The earlier length / 4 approximation is gone — both the mochallama.tokens.* summaries and the OpenAI endpoint's usage block carry real counts.

The stream tag on mochallama.inferences distinguishes streaming (chatStream, SSE) from non-streaming calls. mochallama.tool_calls only moves when the model actually requests tools (counted per call by the number of calls in ChatResult.toolCalls()).

mochallama.tokens_per_second is computed per successful turn as completionTokens / elapsedSeconds (real completion tokens) and held in a volatile field the gauge reads; it reflects the last turn only.

Examples

bash
# Health (DOWN while loading, UP once READY)
curl -s localhost:8080/actuator/health

# Inference timer
curl -s localhost:8080/actuator/metrics/mochallama.inference.duration

# Inference counter (filter to streaming successes)
curl -s 'localhost:8080/actuator/metrics/mochallama.inferences?tag=outcome:success&tag=stream:true'

# Tool-call counter
curl -s localhost:8080/actuator/metrics/mochallama.tool_calls

# Real prompt-token distribution
curl -s localhost:8080/actuator/metrics/mochallama.tokens.prompt

# Last-turn throughput
curl -s localhost:8080/actuator/metrics/mochallama.tokens_per_second

Sample output

GET /actuator/health once the model is READY (note the mochallama component carries model / state / loadDurationMs details):

json
{
  "status": "UP",
  "components": {
    "mochallama": {
      "status": "UP",
      "details": {
        "model": "qwen2.5-1.5b-instruct-q4_k_m",
        "state": "READY",
        "loadDurationMs": 1843
      }
    }
  }
}

While the model is still loading the same component is DOWN (so the aggregate status is DOWN), with state = DOWNLOADING or LOADING; on a load failure it is DOWN with an extra error detail.

GET /actuator/metrics/mochallama.inference.duration — the inference timer, with its count, total, and max statistics and the available model tag:

json
{
  "name": "mochallama.inference.duration",
  "baseUnit": "seconds",
  "measurements": [
    { "statistic": "COUNT", "value": 7.0 },
    { "statistic": "TOTAL_TIME", "value": 18.42 },
    { "statistic": "MAX", "value": 3.91 }
  ],
  "availableTags": [
    { "tag": "model", "values": ["qwen2.5-1.5b-instruct-q4_k_m"] }
  ]
}

GET /actuator/metrics/mochallama.tokens_per_second — the last-turn throughput gauge (untagged):

json
{
  "name": "mochallama.tokens_per_second",
  "measurements": [ { "statistic": "VALUE", "value": 12.4 } ],
  "availableTags": []
}

Enabling Prometheus

Prometheus support is opt-in. The starter declares micrometer-registry-prometheus as compileOnly, so consumers add it to light up the scrape endpoint:

groovy
runtimeOnly 'io.micrometer:micrometer-registry-prometheus'

Then GET /actuator/prometheus returns all mochallama_* meters in Prometheus text format (already exposed in the include list above).

Demo web UI

app's web UI shows a compact live footer stat that polls /actuator/metrics/mochallama.tokens_per_second and mochallama.model.state every 3s, rendering x.x tok/s · model: <id> · <state>.

Released under the MIT License.