3 Commits

Author SHA1 Message Date
wizardchen
ef1047bf67 feat(parser): add OpenDataLoader, PaddleOCR-VL engines, and parser improvements
Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level
settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing.
Optional odl-hybrid sidecar stays local-build only and is excluded from
default dev-start and full profiles.
2026-06-03 12:29:13 +08:00
wizardchen
7b1bb1054f feat(docreader): speed up scanned-PDF parsing, stream image results, isolate heavy async queues
Large scanned PDFs (hundreds of pages) were slow and fragile end-to-end.
This change addresses the parse, transport, and task-scheduling layers:

docreader (parse + transport):
- Parallelize per-page scanned rendering across processes (forkserver/fork),
  with serial fallback. ~4-7x faster on large scanned PDFs; pdfium is not
  thread-safe so we fan out across processes. Configurable via
  DOCREADER_PDF_RENDER_PARALLELISM.
- Add server-streaming ReadStream RPC: emit one meta frame then one frame per
  image, so documents with many page images are no longer capped by the unary
  gRPC message-size limit (an 874-page PDF produced ~193MiB of images, far over
  the 50MB cap) and memory is bounded on both ends. Unary Read is kept for
  backward compatibility; the Go production reader switches to ReadStream.
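
The per-page fan-out with serial fallback described in the first bullet can be
sketched as below. This is a minimal illustration, not the docreader code:
render_page, parallelism_from_env, and the pool_factory parameter are
hypothetical names, and the way DOCREADER_PDF_RENDER_PARALLELISM is parsed
(integer, falling back to the CPU count) is an assumption.

```python
import concurrent.futures
import multiprocessing
import os


def render_page(page_index):
    # Hypothetical stand-in for the real per-page render (e.g. a pdfium call).
    return f"page-{page_index}".encode()


def parallelism_from_env(default=None):
    # Assumed semantics: an integer worker count; unset or invalid values
    # fall back to `default`, then to the CPU count.
    raw = os.environ.get("DOCREADER_PDF_RENDER_PARALLELISM", "")
    try:
        value = int(raw)
    except ValueError:
        value = default or os.cpu_count() or 1
    return max(1, value)


def _process_pool(workers):
    # Prefer forkserver, then fork; get_context raises ValueError when a
    # start method is unavailable on the platform.
    for method in ("forkserver", "fork"):
        try:
            ctx = multiprocessing.get_context(method)
        except ValueError:
            continue
        return concurrent.futures.ProcessPoolExecutor(
            max_workers=workers, mp_context=ctx
        )
    raise RuntimeError("no usable multiprocessing start method")


def render_pages(page_indices, parallelism, pool_factory=_process_pool):
    pages = list(page_indices)
    if parallelism > 1 and len(pages) > 1:
        try:
            with pool_factory(min(parallelism, len(pages))) as pool:
                # map() returns results in input order, so pages stay sorted.
                return list(pool.map(render_page, pages))
        except Exception:
            pass  # any pool failure: fall through to the serial path
    return [render_page(i) for i in pages]
```

Processes rather than threads are used because the renderer is not
thread-safe; the serial path keeps parsing working where no process pool can
be started.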

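The ReadStream framing (one meta frame, then one frame per image) can be
illustrated with a plain generator. This sketch does not use gRPC and does not
reflect the actual proto: the Frame type, its field names, and the assumption
that the meta frame carries the document text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Frame:
    kind: str                 # "meta" or "image" (hypothetical tags)
    text: str = ""            # assumed payload of the meta frame
    image_name: str = ""
    image_data: bytes = b""


def read_stream(text: str, images: Iterable[Tuple[str, bytes]]) -> Iterator[Frame]:
    # One meta frame first, then one frame per image, so no single message
    # has to hold every image at once.
    yield Frame(kind="meta", text=text)
    for name, data in images:
        yield Frame(kind="image", image_name=name, image_data=data)


def collect(frames: Iterable[Frame]) -> Tuple[str, Dict[str, bytes]]:
    # Client side: reassemble the document from the frame sequence.
    it = iter(frames)
    first = next(it, None)
    if first is None or first.kind != "meta":
        raise ValueError("stream must start with a meta frame")
    images: Dict[str, bytes] = {}
    for frame in it:
        if frame.kind != "image":
            raise ValueError(f"unexpected frame kind: {frame.kind}")
        images[frame.image_name] = frame.image_data
    return first.text, images
```

With this shape the largest message on the wire is bounded by the largest
single image rather than by the sum of all images.
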
VLM:
- Make the VLM HTTP timeout configurable (VLM_HTTP_TIMEOUT_SECONDS) and raise
  the default from 90s to 180s so dense scanned-page OCR does not time out with
  "context deadline exceeded".
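
A configurable timeout with a 180s default might be read as follows. This is
an illustrative Python sketch only: the function name and the handling of
invalid or non-positive values are assumptions, and the real setting may well
live in a different language or layer.

```python
import math
import os

DEFAULT_VLM_TIMEOUT_SECONDS = 180.0  # default stated in the commit message


def vlm_http_timeout(env=os.environ):
    # Assumed behaviour: unset, malformed, non-finite, or non-positive
    # values fall back to the default instead of failing.
    raw = env.get("VLM_HTTP_TIMEOUT_SECONDS", "").strip()
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_VLM_TIMEOUT_SECONDS
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_VLM_TIMEOUT_SECONDS
    return value
```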

Async task queues:
- Isolate high-volume, model-heavy fan-out tasks into dedicated asynq queues so
  a single large document cannot saturate the shared worker pool and block
  user-facing document parsing:
    image:multimodal  -> "multimodal"
    chunk:extract     -> "graph"
    question:generation -> "question"
- Register the new queues in the server weight map and the cancel inspector's
  scanned-queue set (so cancelling a knowledge still purges its pending tasks).
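
The task-type to queue routing listed above can be modelled as a lookup
table. This sketch is language-neutral illustration in Python, not the
server's asynq wiring: the function names, the "default" name for the shared
queue, and the "document:parse" task type in the usage note are assumptions.
Queue weights are omitted because the commit message does not state them.

```python
DEFAULT_QUEUE = "default"  # assumed name of the shared queue

# Mapping taken from the commit message: heavy fan-out task types get their
# own queue so they cannot crowd out user-facing parsing.
QUEUE_BY_TASK_TYPE = {
    "image:multimodal": "multimodal",
    "chunk:extract": "graph",
    "question:generation": "question",
}


def queue_for(task_type):
    # Anything not listed stays on the shared queue.
    return QUEUE_BY_TASK_TYPE.get(task_type, DEFAULT_QUEUE)


def queues_to_scan_on_cancel():
    # A cancel must look in every queue that may hold pending tasks,
    # including the dedicated ones; dict.fromkeys de-duplicates in order.
    return list(dict.fromkeys([DEFAULT_QUEUE, *QUEUE_BY_TASK_TYPE.values()]))
```

For example, queue_for("chunk:extract") selects "graph", while a hypothetical
"document:parse" task stays on the shared queue.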
2026-06-03 12:29:13 +08:00
wolfkill
450a5bd2dd fix(docreader): throttle heavy parser concurrency
2026-05-07 17:36:09 +08:00