WeKnora/internal
wizardchen 3ae3ea97c5 fix(knowledge): prevent documents from getting stuck in "processing"
Several failure modes left Knowledge.parse_status pinned at "processing"
forever, with no signal to users beyond a permanent spinner. This commit
addresses the root causes and adds a safety net.

- Asynq worker pool: explicit Concurrency (default 16, env-tunable via
  WEKNORA_ASYNQ_CONCURRENCY) so batch uploads don't queue behind a
  CPU-count-sized worker pool. Redis op timeouts raised to 500ms/1000ms
  (WEKNORA_REDIS_OP_TIMEOUT_MS) to absorb bursty multimodal counter ops.

- DocReader RPC: cap each call with WEKNORA_DOCREADER_CALL_TIMEOUT
  (default 30m). Without this, a hung docreader pinned a worker for the
  full DocumentProcessTimeout window.

- ImageMultimodal: finalize-on-last-attempt semantics. A permanently
  failing single image no longer strands the parent: asynq retries
  still run, but on the final attempt we count the image regardless
  of outcome. Redis DECR errors fall back to enqueuing the
  post-process task instead of returning silently.

- Dead-letter callback: when DocumentProcess / KnowledgePostProcess /
  ManualProcess exhausts retries, immediately mark the corresponding
  Knowledge as failed with the last error. This surfaces the failure
  in the UI without waiting for the housekeeping sweep.

- HousekeepingService: 5-minute cron that marks as failed any
  knowledge rows stuck in "processing" for longer than
  DocumentProcessTimeout + 10m, plus summary rows stuck for more than
  1h. Catches anything the other safety nets miss (worker SIGKILL
  mid-handler, etc.). Disable with WEKNORA_HOUSEKEEPING_ENABLED=false.

- Distributed startup recovery: previously the post-restart sweep was
  skipped whenever REDIS_ADDR was set, even though Asynq does not
  reschedule the task that was actively running on the dead instance.
  Now the sweep runs in distributed mode too, but only against rows
  older than 30 minutes to avoid racing peer instances.
2026-05-28 15:14:45 +08:00