mirror of https://github.com/mauriceboe/TREK.git (synced 2026-06-30 18:46:00 +00:00)
fix(extract): disable model thinking for grammar-constrained extraction
Hybrid/reasoning models (Qwen3 and similar) emit reasoning tokens by default, and those tokens collide with Ollama's format-grammar constraint. On CPU this produced null or unparseable output and blew the latency budget: qwen3:8b returned null or ran into 300s timeouts, versus ~20s with thinking off. Send think:false on the /api/chat call. Ollama ignores the flag for non-thinking models (verified on qwen2.5:7b), so the change is safe and unlocks the stronger Qwen3 family.
@@ -54,6 +54,10 @@ export async function extractEnforced(input: EnforcedExtractInput): Promise<Reco
model: input.model,
stream: false,
format: input.schema,
// Disable "thinking" for hybrid/reasoning models (Qwen3, etc.): the reasoning tokens
// collide with the format-grammar constraint here — they produce unparseable output and
// blow the latency budget on CPU. Ollama ignores this for non-thinking models, so it's safe.
think: false,
// Keep the model resident a while so back-to-back imports don't pay the cold load.
keep_alive: '30m',
options: { temperature: 0, num_predict: input.numPredict ?? 512, num_ctx: input.numCtx ?? 8192 },
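
For context, here is a minimal sketch of how a request body like the one in this hunk might be built and its response parsed. It is not the actual `extractEnforced` implementation: the names `SketchExtractInput`, `buildChatBody`, `parseStructured`, and `extractSketch`, the `messages` field, and the `baseUrl` parameter are all assumptions made for illustration. Only the body fields shown in the diff (`model`, `stream`, `format`, `think`, `keep_alive`, `options`) come from the commit.

```typescript
// Hypothetical types for illustration; not TREK's actual definitions.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface SketchExtractInput {
  model: string;
  schema: Record<string, unknown>; // JSON Schema passed as Ollama's `format`
  messages: ChatMessage[];         // assumed; the diff does not show how messages are built
  numPredict?: number;
  numCtx?: number;
}

// Build the JSON body for POST /api/chat with thinking disabled.
function buildChatBody(input: SketchExtractInput) {
  return {
    model: input.model,
    messages: input.messages,
    stream: false,
    format: input.schema,
    // Reasoning tokens collide with the grammar constraint, so turn them off.
    think: false,
    keep_alive: "30m",
    options: {
      temperature: 0,
      num_predict: input.numPredict ?? 512,
      num_ctx: input.numCtx ?? 8192,
    },
  };
}

// Parse the assistant content defensively: return null instead of throwing
// when the model produced empty or non-JSON-object output.
function parseStructured(content: string | undefined): Record<string, unknown> | null {
  if (!content) return null;
  try {
    const parsed: unknown = JSON.parse(content);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      return parsed as Record<string, unknown>;
    }
    return null;
  } catch {
    return null;
  }
}

// Send the request to an Ollama server and return the parsed object, or null.
async function extractSketch(
  baseUrl: string,
  input: SketchExtractInput,
): Promise<Record<string, unknown> | null> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatBody(input)),
  });
  if (!res.ok) return null;
  const data = (await res.json()) as { message?: { content?: string } };
  return parseStructured(data.message?.content);
}
```

Keeping the body construction separate from the network call means the `think: false` default can be checked in a unit test without a running Ollama instance.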