WeKnora

mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Files

wizardchen 7b1bb1054f feat(docreader): speed up scanned-PDF parsing, stream image results, isolate heavy async queues

Large scanned PDFs (hundreds of pages) were slow and fragile end-to-end.
This change addresses the parse, transport, and task-scheduling layers:

docreader (parse + transport):
- Parallelize per-page scanned rendering across processes (forkserver/fork),
  with serial fallback. ~4-7x faster on large scanned PDFs; pdfium is not
  thread-safe so we fan out across processes. Configurable via
  DOCREADER_PDF_RENDER_PARALLELISM.
- Add server-streaming ReadStream RPC: emit one meta frame then one frame per
  image, so documents with many page images are no longer capped by the unary
  gRPC message-size limit (a 874-page PDF produced ~193MiB of images, far over
  the 50MB cap) and memory is bounded on both ends. Unary Read is kept for
  backward compatibility; the Go production reader switches to ReadStream.

VLM:
- Make the VLM HTTP timeout configurable (VLM_HTTP_TIMEOUT_SECONDS) and raise
  the default 90s -> 180s so dense scanned-page OCR does not time out with
  "context deadline exceeded".

Async task queues:
- Isolate high-volume, model-heavy fan-out tasks into dedicated asynq queues so
  a single large document cannot saturate the shared worker pool and block
  user-facing document parsing:
    image:multimodal  -> "multimodal"
    chunk:extract     -> "graph"
    question:generation -> "question"
- Register the new queues in the server weight map and the cancel inspector's
  scanned-queue set (so cancelling a knowledge still purges its pending tasks).

2026-06-03 12:29:13 +08:00

cloud-image

chore: release v0.5.2

2026-05-13 15:04:15 +08:00

build_images.sh

feat: enhance Dockerfile and build scripts for customizable APT mirror

2026-03-02 21:21:49 +08:00

check-env.sh

feat(storage): 集成S3存储适配器

2026-03-09 10:39:46 +08:00

dev.sh

feat(docreader): speed up scanned-PDF parsing, stream image results, isolate heavy async queues