mirror of
https://github.com/mauriceboe/TREK.git
synced 2026-06-30 18:46:00 +00:00
fix(extract): disable model thinking for grammar-constrained extraction
Hybrid/reasoning models (Qwen3 and similar) default to emitting reasoning tokens, which collide with Ollama's format-grammar constraint — on CPU this produced null/unparseable output and blew the latency budget (qwen3:8b: null or 300s timeouts vs ~20s with thinking off). Send think:false on the /api/chat call; Ollama ignores it for non-thinking models (verified on qwen2.5:7b), so it's safe and unlocks the stronger Qwen3 family.
@@ -54,6 +54,10 @@ export async function extractEnforced(input: EnforcedExtractInput): Promise<Reco
  model: input.model,
  stream: false,
  format: input.schema,
+ // Disable "thinking" for hybrid/reasoning models (Qwen3, etc.): the reasoning tokens
+ // collide with the format-grammar constraint here — they produce unparseable output and
+ // blow the latency budget on CPU. Ollama ignores this for non-thinking models, so it's safe.
+ think: false,
  // Keep the model resident a while so back-to-back imports don't pay the cold load.
  keep_alive: '30m',
  options: { temperature: 0, num_predict: input.numPredict ?? 512, num_ctx: input.numCtx ?? 8192 },
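
For context, the request this hunk modifies could look roughly like the sketch below. Only the body fields (`model`, `stream`, `format`, `think`, `keep_alive`, `options`) come from the diff. The `ChatRequestInput` type, the `buildChatRequest` and `chatOnce` helpers, the single-user-message layout, and the `baseUrl` parameter are hypothetical names introduced for illustration, not the actual code of `extractEnforced`.

```typescript
// Hypothetical input shape; the real EnforcedExtractInput may differ.
interface ChatRequestInput {
  model: string;
  schema: Record<string, unknown>; // JSON schema passed as Ollama's `format`
  prompt: string;                  // assumption: one user message carries the text
  numPredict?: number;
  numCtx?: number;
}

// Build the /api/chat body with thinking disabled, mirroring the fields in the diff.
function buildChatRequest(input: ChatRequestInput) {
  return {
    model: input.model,
    messages: [{ role: "user", content: input.prompt }],
    stream: false,
    format: input.schema,
    // Reasoning tokens collide with the grammar constraint, so turn them off.
    think: false,
    keep_alive: "30m",
    options: {
      temperature: 0,
      num_predict: input.numPredict ?? 512,
      num_ctx: input.numCtx ?? 8192,
    },
  };
}

// Send one non-streaming request and parse the schema-constrained JSON reply.
async function chatOnce(baseUrl: string, input: ChatRequestInput): Promise<unknown> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(input)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = (await res.json()) as { message?: { content?: string } };
  // With stream:false the reply text arrives in message.content.
  return JSON.parse(data.message?.content ?? "null");
}
```

Keeping the body construction in a pure function makes it easy to check that `think: false` is always sent, without needing a running Ollama server.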