WeKnora

mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Author	SHA1	Message	Date
wizardchen	ef1047bf67	feat(parser): add OpenDataLoader, PaddleOCR-VL engines, and parser improvements Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing. Optional odl-hybrid sidecar stays local-build only and is excluded from default dev-start and full profiles.	2026-06-03 12:29:13 +08:00
wizardchen	7b1bb1054f	feat(docreader): speed up scanned-PDF parsing, stream image results, isolate heavy async queues Large scanned PDFs (hundreds of pages) were slow and fragile end-to-end. This change addresses the parse, transport, and task-scheduling layers: docreader (parse + transport): - Parallelize per-page scanned rendering across processes (forkserver/fork), with serial fallback. ~4-7x faster on large scanned PDFs; pdfium is not thread-safe so we fan out across processes. Configurable via DOCREADER_PDF_RENDER_PARALLELISM. - Add server-streaming ReadStream RPC: emit one meta frame then one frame per image, so documents with many page images are no longer capped by the unary gRPC message-size limit (an 874-page PDF produced ~193MiB of images, far over the 50MB cap) and memory is bounded on both ends. Unary Read is kept for backward compatibility; the Go production reader switches to ReadStream. VLM: - Make the VLM HTTP timeout configurable (VLM_HTTP_TIMEOUT_SECONDS) and raise the default 90s -> 180s so dense scanned-page OCR does not time out with "context deadline exceeded". Async task queues: - Isolate high-volume, model-heavy fan-out tasks into dedicated asynq queues so a single large document cannot saturate the shared worker pool and block user-facing document parsing: image:multimodal -> "multimodal", chunk:extract -> "graph", question:generation -> "question". - Register the new queues in the server weight map and the cancel inspector's scanned-queue set (so cancelling a knowledge still purges its pending tasks).	2026-06-03 12:29:13 +08:00
wolfkill	450a5bd2dd	fix(docreader): throttle heavy parser concurrency	2026-05-07 17:36:09 +08:00
wizardchen	1938094dcc	feat(parser): add PDFScannedParser for handling scanned PDFs - Introduced PDFScannedParser as a fallback parser for scanned PDFs that converts pages into images for OCR processing. - Updated PDFParser to include PDFScannedParser in the parsing chain, enhancing the document parsing capabilities for scanned content. - Improved logging for better error tracking during PDF parsing operations.	2026-04-16 18:13:20 +08:00
wizardchen	397689d2f3	feat: introduce WeKnora Lite edition with lightweight configuration and deployment - Added a new `.env.lite.example` file for the Lite version, providing a minimal configuration template. - Updated `.env.example` to remove deprecated variables and include new Docreader settings. - Enhanced Docker configurations to support the Lite version, including a new Dockerfile for the Docreader service. - Introduced a Makefile target for building and running the Lite version, along with packaging capabilities. - Created GitHub workflows for building and releasing Lite binaries, including Homebrew formula support. - Implemented a new service file for managing the Lite version as a system service. This update enables a streamlined, single-binary deployment of WeKnora, reducing external dependencies and simplifying setup.	2026-03-02 21:21:49 +08:00
begoniezhao	3e31fdeefd	style: add necessary comments to improve code quality - Add docstrings and inline comments for key functions and complex logic - Unify comment style, eliminate magic numbers and ambiguous variable names - No functional changes, only improve maintainability	2025-12-01 17:43:26 +08:00
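begoniezhao	2d66abedf0	feat: add document model classes, adjust configuration and parsing logic, optimize logging and imports; remove logging setup and redundant code; optimize imports, type hints, and OCR backend management; unify module import paths across files as absolute imports; adjust import paths, remove some imports, optimize logging and comments; upgrade the document parser to Docx2Parser, optimize timeout and image handling logic	2025-11-18 22:37:01 +08:00
begoniezhao	c1f731e026	chore(docreader): reorganize module files	2025-11-05 12:07:39 +08:00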

8 Commits