curl
Every example below talks to the OpenAI-compatible endpoint on the demo app (POST /v1/chat/completions, GET /v1/models). Start the service first:
./gradlew :app:bootRunThe HTTP port (8080) comes up immediately, but the model loads asynchronously. Until it is READY the endpoint returns 503:
{ "error": "model loading", "state": "DOWNLOADING" }Watch the logs for state: READY (or poll /actuator/health) before sending chat requests.
List models
curl -s http://localhost:8080/v1/models{ "object": "list", "data": [ { "id": "qwen2.5-1.5b-instruct-q4_k_m", "object": "model" } ] }The id is derived from the loaded GGUF filename, so it changes when you switch model profiles.
Chat (non-streaming)
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages": [
{"role": "user", "content": "Write a haiku about Project Panama."}
],
"max_tokens": 128,
"temperature": 0.7
}'{
"id": "chatcmpl-…",
"object": "chat.completion",
"created": 1716800000,
"model": "qwen2.5-1.5b-instruct-q4_k_m",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "Bridges of memory…" },
"finish_reason": "stop"
}
],
"usage": { "prompt_tokens": 18, "completion_tokens": 22, "total_tokens": 40 }
}usage carries the real token counts reported by the native bridge.
Streaming (SSE)
Pass "stream": true and use curl -N to disable buffering. The response is text/event-stream: a role chunk, one content chunk per token, a final chunk with finish_reason, then data: [DONE].
curl -N -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages": [{"role": "user", "content": "count 1 to 5"}],
"stream": true,
"max_tokens": 32
}'data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}]}
…
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]Tool calling
Declare tools[] in OpenAI shape. When the model decides to call a function, the reply has finish_reason: "tool_calls" and the assistant message carries tool_calls[] instead of (or alongside) content.
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages": [{"role": "user", "content": "What is the weather in Paris?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": { "location": { "type": "string" } },
"required": ["location"]
}
}
}]
}'{
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_0",
"type": "function",
"function": { "name": "get_weather", "arguments": "{\"location\":\"Paris\"}" }
}]
},
"finish_reason": "tool_calls"
}],
"usage": { "prompt_tokens": 180, "completion_tokens": 22, "total_tokens": 202 }
}tool_choice accepts a string ("auto" / "none" / "required") or an object form {"type":"function","function":{"name":"get_weather"}} (mapped to required). See Tools & streaming for the full round-trip.
Full parameter set
All sampling parameters are server-side defaults (bound from llamacpp.model.*) overridable per request. A system message steers the model — it's just a messages entry with role: system.
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages": [
{"role": "system", "content": "You are terse. Answer in one short sentence."},
{"role": "user", "content": "Explain Project Panama."}
],
"max_tokens": 64,
"temperature": 0.4,
"top_k": 20,
"top_p": 0.9,
"min_p": 0.05,
"repeat_penalty": 1.1,
"seed": 42,
"stop": ["\n\n"]
}'| Field | Type | Default |
|---|---|---|
max_tokens | int | 256 |
temperature | float | 0.7 |
top_k | int | 40 |
top_p | float | 0.95 |
min_p | float | 0.05 |
repeat_penalty | float | 1.0 |
seed | long | -1 (random) |
stop | string[] | (none) |
stream | bool | false |
Pin seed to a fixed value for reproducible output (with all other sampling params held constant).