5 Commits

Author SHA1 Message Date
wizardchen
7b1bb1054f feat(docreader): speed up scanned-PDF parsing, stream image results, isolate heavy async queues
Large scanned PDFs (hundreds of pages) were slow and fragile end-to-end.
This change addresses the parse, transport, and task-scheduling layers:

docreader (parse + transport):
- Parallelize per-page scanned rendering across processes (forkserver/fork),
  with serial fallback. ~4-7x faster on large scanned PDFs; pdfium is not
  thread-safe so we fan out across processes. Configurable via
  DOCREADER_PDF_RENDER_PARALLELISM.
- Add server-streaming ReadStream RPC: emit one meta frame then one frame per
  image, so documents with many page images are no longer capped by the unary
  gRPC message-size limit (a 874-page PDF produced ~193MiB of images, far over
  the 50MB cap) and memory is bounded on both ends. Unary Read is kept for
  backward compatibility; the Go production reader switches to ReadStream.

VLM:
- Make the VLM HTTP timeout configurable (VLM_HTTP_TIMEOUT_SECONDS) and raise
  the default 90s -> 180s so dense scanned-page OCR does not time out with
  "context deadline exceeded".

Async task queues:
- Isolate high-volume, model-heavy fan-out tasks into dedicated asynq queues so
  a single large document cannot saturate the shared worker pool and block
  user-facing document parsing:
    image:multimodal  -> "multimodal"
    chunk:extract     -> "graph"
    question:generation -> "question"
- Register the new queues in the server weight map and the cancel inspector's
  scanned-queue set (so cancelling a knowledge still purges its pending tasks).
2026-06-03 12:29:13 +08:00
wizardchen
d65e647f95 chore: bump Go to 1.26 and slim docreader dependencies
- Bump base image in docker/Dockerfile.app from golang:1.24 to golang:1.26
  to match `go 1.26` declared in go.mod (fixes CI build failure on
  `go mod download`).
- Drop unused docreader components and their dependencies:
  - Remove `docreader/ocr/` package (paddle/vlm/dummy backends are
    unreferenced by the main flow; OCR/VLM is handled by the Go App).
  - Remove `docreader/parser/storage.py` (dead code; image persistence
    happens in the Go App via inline ImageRef bytes).
  - Remove `docreader/scripts/download_deps.py` (PaddleOCR pre-download).
  - Drop deps: paddleocr, paddlepaddle, openai, ollama, minio,
    cos-python-sdk-v5, oss2, asyncio, pypdf2, markdown, mistletoe,
    goose3, markdownify, pdfplumber, antiword, urllib3.
- Re-lock uv.lock: 145 -> 79 packages.
- Update docreader/README.md to reflect that OCR/VLM/storage are no
  longer configured at the docreader level.
2026-05-09 13:32:40 +08:00
wizardchen
397689d2f3 feat: introduce WeKnora Lite edition with lightweight configuration and deployment
- Added a new `.env.lite.example` file for the Lite version, providing a minimal configuration template.
- Updated `.env.example` to remove deprecated variables and include new Docreader settings.
- Enhanced Docker configurations to support the Lite version, including a new Dockerfile for the Docreader service.
- Introduced a Makefile target for building and running the Lite version, along with packaging capabilities.
- Created GitHub workflows for building and releasing Lite binaries, including Homebrew formula support.
- Implemented a new service file for managing the Lite version as a system service.

This update enables a streamlined, single-binary deployment of WeKnora, reducing external dependencies and simplifying setup.
2026-03-02 21:21:49 +08:00
begoniezhao
2d66abedf0 feat: 新增文档模型类,调整配置与解析逻辑,优化日志及导入
移除日志设置与冗余代码,优化导入、类型提示及OCR后端管理
统一调整各文件模块导入路径为绝对导入
调整导入路径,移除部分导入,优化日志及注释
升级文档解析器为 Docx2Parser,优化超时与图片处理逻辑
2025-11-18 22:37:01 +08:00
begoniezhao
c1f731e026 chore(docreader): 重新组织模块文件 2025-11-05 12:07:39 +08:00