fix(extract): disable model thinking for grammar-constrained extraction

Hybrid/reasoning models (Qwen3 and similar) default to emitting reasoning tokens, which collide with Ollama's format-grammar constraint — on CPU this produced null/unparseable output and blew the latency budget (qwen3:8b: null or 300s timeouts vs ~20s with thinking off). Send think:false on the /api/chat call; Ollama ignores it for non-thinking models (verified on qwen2.5:7b), so it's safe and unlocks the stronger Qwen3 family.
This commit is contained in:
Maurice
2026-06-26 14:50:50 +02:00
committed by Maurice
parent 4abe96fe01
commit d95d26e493
@@ -54,6 +54,10 @@ export async function extractEnforced(input: EnforcedExtractInput): Promise<Reco
model: input.model,
stream: false,
format: input.schema,
// Disable "thinking" for hybrid/reasoning models (Qwen3, etc.): the reasoning tokens
// collide with the format-grammar constraint here — they produce unparseable output and
// blow the latency budget on CPU. Ollama ignores this for non-thinking models, so it's safe.
think: false,
// Keep the model resident a while so back-to-back imports don't pay the cold load.
keep_alive: '30m',
options: { temperature: 0, num_predict: input.numPredict ?? 512, num_ctx: input.numCtx ?? 8192 },