mirror of
https://github.com/Tencent/WeKnora.git
synced 2026-06-04 13:30:32 +08:00
Several failure modes left Knowledge.parse_status pinned at "processing" forever, with no signal to users beyond a permanent spinner. This commit addresses the root causes and adds a safety net.

- Asynq worker pool: explicit Concurrency (default 16, env-tunable via WEKNORA_ASYNQ_CONCURRENCY) so batch uploads don't queue behind a CPU-count-sized worker pool. Redis op timeouts raised to 500ms/1000ms (WEKNORA_REDIS_OP_TIMEOUT_MS) to absorb bursty multimodal counter ops.

- DocReader RPC: cap each call with WEKNORA_DOCREADER_CALL_TIMEOUT (default 30m). Without this, a hung docreader pinned a worker for the full DocumentProcessTimeout window.

- ImageMultimodal: finalize-on-last-attempt semantics. A permanently failing single image no longer strands the parent: the asynq retry is allowed to run, but on the final attempt we count the image regardless of outcome. Redis DECR errors fall back to enqueuing the post-process task instead of returning silently.

- Dead-letter callback: when DocumentProcess / KnowledgePostProcess / ManualProcess exhausts retries, immediately mark the corresponding Knowledge as failed with the last error. This surfaces the failure in the UI without waiting for the housekeeping sweep.

- HousekeepingService: 5-minute cron that flips knowledge rows stuck in "processing" past DocumentProcessTimeout + 10m to failed, plus summary rows stuck > 1h. Catches anything the other safety nets miss (worker SIGKILL mid-handler, etc.). Disable with WEKNORA_HOUSEKEEPING_ENABLED=false.

- Distributed startup recovery: previously the post-restart sweep was skipped whenever REDIS_ADDR was set, even though Asynq does not reschedule the task that was actively running on the dead instance. Now the sweep runs in distributed mode too, but only against rows older than 30 minutes to avoid racing peer instances.