42 Commits

Author SHA1 Message Date
wizardchen
bbd3f6324a refactor(parser): reorganize Markdown parser and enhance gRPC document reading
- Moved the _SEPARATOR_CELL regex definition to a more appropriate location in the Markdown parser.
- Implemented a fallback mechanism in the gRPC document reader to handle cases where the ReadStream RPC is unimplemented, ensuring compatibility with older versions.
- Added a readUnary method to maintain backward compatibility with the legacy unary Read RPC.
- Improved cancellation handling in the MinerUCloud and PaddleOCR-VL readers to prevent excessive API calls during context cancellation.
2026-06-03 12:29:13 +08:00
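The fallback described in bbd3f6324a can be sketched as follows. This is a minimal Python illustration, not the Go reader's actual code: `UnimplementedError`, `read_stream`, `read_unary`, and `LegacyClient` are hypothetical names standing in for a gRPC UNIMPLEMENTED status, the ReadStream RPC, the legacy unary Read RPC, and an older server.

```python
class UnimplementedError(Exception):
    """Hypothetical stand-in for a gRPC UNIMPLEMENTED status."""


def read_document(client, request):
    """Prefer the streaming RPC; fall back to the unary RPC on UNIMPLEMENTED."""
    try:
        # Drain the stream inside the try block so that an error raised
        # lazily on first iteration is still caught here.
        return list(client.read_stream(request))
    except UnimplementedError:
        # Older servers only expose the unary call; wrap its single
        # response so callers always receive a list of frames.
        return [client.read_unary(request)]


class LegacyClient:
    """Hypothetical client for a server that predates ReadStream."""

    def read_stream(self, request):
        raise UnimplementedError("ReadStream")

    def read_unary(self, request):
        return {"content": f"parsed:{request}"}
```

With a `LegacyClient`, `read_document` returns the single unary response wrapped in a list, so the caller does not need to know which RPC served the request.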
wizardchen
ef1047bf67 feat(parser): add OpenDataLoader, PaddleOCR-VL engines, and parser improvements
Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level
settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing.
Optional odl-hybrid sidecar stays local-build only and is excluded from
default dev-start and full profiles.
2026-06-03 12:29:13 +08:00
wizardchen
7b1bb1054f feat(docreader): speed up scanned-PDF parsing, stream image results, isolate heavy async queues
Large scanned PDFs (hundreds of pages) were slow and fragile end-to-end.
This change addresses the parse, transport, and task-scheduling layers:

docreader (parse + transport):
- Parallelize per-page scanned rendering across processes (forkserver/fork),
  with serial fallback. ~4-7x faster on large scanned PDFs; pdfium is not
  thread-safe so we fan out across processes. Configurable via
  DOCREADER_PDF_RENDER_PARALLELISM.
- Add server-streaming ReadStream RPC: emit one meta frame then one frame per
  image, so documents with many page images are no longer capped by the unary
  gRPC message-size limit (an 874-page PDF produced ~193MiB of images, far over
  the 50MB cap) and memory is bounded on both ends. Unary Read is kept for
  backward compatibility; the Go production reader switches to ReadStream.
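
The frame layout described for ReadStream (one meta frame, then one frame per image) can be illustrated with a plain generator. This is a sketch only: the tuple-based `Frame` shape and the field names `content`, `image_count`, `index`, and `data` are assumptions, not the actual protobuf messages.

```python
from typing import Iterator, List, Tuple

# Hypothetical frame shape: (kind, payload). The real RPC uses protobuf messages.
Frame = Tuple[str, dict]


def stream_frames(markdown: str, images: List[bytes]) -> Iterator[Frame]:
    """Yield one meta frame followed by one frame per image."""
    yield ("meta", {"content": markdown, "image_count": len(images)})
    for index, data in enumerate(images):
        # Each image travels in its own frame, so the largest message is
        # bounded by the largest single image rather than by their sum.
        yield ("image", {"index": index, "data": data})
```

Because frames are produced lazily, neither side has to hold every page image in memory at once.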

VLM:
- Make the VLM HTTP timeout configurable (VLM_HTTP_TIMEOUT_SECONDS) and raise
  the default 90s -> 180s so dense scanned-page OCR does not time out with
  "context deadline exceeded".

Async task queues:
- Isolate high-volume, model-heavy fan-out tasks into dedicated asynq queues so
  a single large document cannot saturate the shared worker pool and block
  user-facing document parsing:
    image:multimodal  -> "multimodal"
    chunk:extract     -> "graph"
    question:generation -> "question"
- Register the new queues in the server weight map and the cancel inspector's
  scanned-queue set (so cancelling a knowledge still purges its pending tasks).
2026-06-03 12:29:13 +08:00
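The queue isolation in 7b1bb1054f amounts to a task-type to queue-name mapping. A minimal Python sketch of that routing follows; the three mappings come from the commit message, while `DEFAULT_QUEUE`, its value `"default"`, and `queue_for` are assumptions made for illustration (the real code registers queues with asynq in Go).

```python
# Assumed name for the shared queue that serves all other task types.
DEFAULT_QUEUE = "default"

# Task type -> dedicated queue, as listed in the commit message.
QUEUE_BY_TASK_TYPE = {
    "image:multimodal": "multimodal",
    "chunk:extract": "graph",
    "question:generation": "question",
}


def queue_for(task_type: str) -> str:
    """Return the dedicated queue for heavy task types, else the shared one."""
    return QUEUE_BY_TASK_TYPE.get(task_type, DEFAULT_QUEUE)
```

Any task type not listed in the mapping stays on the shared queue, so only the heavy fan-out work is moved off it.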
wizardchen
959eba2136 fix(doc_parser): enhance DOC to DOCX conversion reliability
- Implemented a retry mechanism for DOC to DOCX conversion to handle concurrent `soffice` invocations, ensuring each attempt uses a dedicated user profile directory.
- Added logging for each conversion attempt, including success and failure messages, to improve visibility into the conversion process.
- Adjusted the handling of temporary directories for both conversion output and user profiles, enhancing robustness against conversion failures.
2026-06-01 20:50:02 +08:00
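The retry approach in 959eba2136 (a fresh user profile directory per `soffice` attempt) can be sketched as below. This is not the project's implementation: the function names, the attempt count, the timeout, and the backoff are assumptions. The `-env:UserInstallation`, `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command-line parameters.

```python
import subprocess
import tempfile
import time
from pathlib import Path


def build_soffice_cmd(src: Path, out_dir: Path, profile_dir: Path) -> list:
    """Build a soffice command that uses its own user profile directory."""
    return [
        "soffice",
        f"-env:UserInstallation={profile_dir.resolve().as_uri()}",
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_doc_to_docx(src, out_dir, attempts=3, run=subprocess.run,
                        backoff_seconds=0.5):
    """Convert a .doc file, retrying with a new profile directory each time."""
    src, out_dir = Path(src), Path(out_dir)
    target = out_dir / (src.stem + ".docx")
    for attempt in range(1, attempts + 1):
        # A dedicated profile per attempt keeps concurrent soffice
        # invocations from contending for one shared profile.
        with tempfile.TemporaryDirectory(prefix="soffice-profile-") as profile:
            cmd = build_soffice_cmd(src, out_dir, Path(profile))
            try:
                run(cmd, check=True, timeout=120)
            except (subprocess.SubprocessError, OSError) as exc:
                print(f"conversion attempt {attempt}/{attempts} failed: {exc}")
            else:
                if target.exists():
                    print(f"conversion attempt {attempt}/{attempts} succeeded")
                    return target
                print(f"conversion attempt {attempt}/{attempts} produced no output")
        time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"failed to convert {src} after {attempts} attempts")
```

The `run` parameter exists only so the retry logic can be exercised without LibreOffice installed.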
wizardchen
13301ca026 feat(parser): enhance web parser with improved image extension handling and XPath prioritization
- Added a regex pattern for image file extensions to the utils module for better image detection.
- Updated the BODY_XPATH in the xpaths module to prioritize matching specific content structures in web pages.
- These changes aim to improve the accuracy and efficiency of content extraction from web pages using the StdWebParser class.
2026-05-25 19:15:17 +08:00
wizardchen
d65e647f95 chore: bump Go to 1.26 and slim docreader dependencies
- Bump base image in docker/Dockerfile.app from golang:1.24 to golang:1.26
  to match `go 1.26` declared in go.mod (fixes CI build failure on
  `go mod download`).
- Drop unused docreader components and their dependencies:
  - Remove `docreader/ocr/` package (paddle/vlm/dummy backends are
    unreferenced by the main flow; OCR/VLM is handled by the Go App).
  - Remove `docreader/parser/storage.py` (dead code; image persistence
    happens in the Go App via inline ImageRef bytes).
  - Remove `docreader/scripts/download_deps.py` (PaddleOCR pre-download).
  - Drop deps: paddleocr, paddlepaddle, openai, ollama, minio,
    cos-python-sdk-v5, oss2, asyncio, pypdf2, markdown, mistletoe,
    goose3, markdownify, pdfplumber, antiword, urllib3.
- Re-lock uv.lock: 145 -> 79 packages.
- Update docreader/README.md to reflect that OCR/VLM/storage are no
  longer configured at the docreader level.
2026-05-09 13:32:40 +08:00
wolfkill
450a5bd2dd fix(docreader): throttle heavy parser concurrency 2026-05-07 17:36:09 +08:00
wizardchen
d5f6c7ba21 fix(docreader): remove default 100-page limit for DOCX parsing
The default DOCREADER_DOCX_MAX_PAGES=100 silently truncates large
documents, causing users to see at most ~1000 chunks regardless of
document length. Change the default to 0 (no limit) so all pages are
processed. Operators who need a cap can still set the env var.

Fixes #719
2026-04-28 21:50:15 +08:00
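The new default in d5f6c7ba21 (0 means no limit) can be sketched as below. The environment variable name is taken from the commit message; the helper names and the choice to treat invalid or negative values as "no limit" are assumptions, not confirmed behavior.

```python
import os


def docx_max_pages(environ=os.environ) -> int:
    """Read DOCREADER_DOCX_MAX_PAGES; 0 (the default) means no limit."""
    raw = environ.get("DOCREADER_DOCX_MAX_PAGES", "0")
    try:
        value = int(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to "no limit".
        return 0
    return max(value, 0)


def limit_pages(pages, max_pages):
    """Truncate only when a positive cap is configured."""
    return pages if max_pages <= 0 else pages[:max_pages]
```

Operators who still want a cap would set, for example, `DOCREADER_DOCX_MAX_PAGES=100` in the docreader environment.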
wizardchen
3e61b91efd fix(storage): remove ListBucket permission from MinIO bucket policy 2026-04-16 18:13:21 +08:00
wizardchen
1938094dcc feat(parser): add PDFScannedParser for handling scanned PDFs
- Introduced PDFScannedParser as a fallback parser for scanned PDFs that converts pages into images for OCR processing.
- Updated PDFParser to include PDFScannedParser in the parsing chain, enhancing the document parsing capabilities for scanned content.
- Improved logging for better error tracking during PDF parsing operations.
2026-04-16 18:13:20 +08:00
bingxiang.cheng
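d11e043142 refactor(storage): optimize OSS storage initialization and URL format for better security and compatibility
- Check whether oss2 is installed and log an error if it is not
- Do not auto-create the OSS bucket when it does not exist, to avoid accidentally creating a public-read bucket
- Switch OSS download URLs to virtual-hosted style for better compatibility
- Add detailed log messages for OSS initialization errors and missing configuration
- Reorder code in the other storage classes to a consistent style for better maintainability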
2026-04-13 15:34:28 +08:00
bingxiang.cheng
adae28beb2 feat(oss): add OssStorage class to docreader
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 15:34:28 +08:00
Windfarer
54da98fc24 feat: add docx max pages env config 2026-04-02 10:31:52 +08:00
wizardchen
8e1cfaccb7 refactor(prompt_templates): improve question generation guidelines and context handling
- Updated the question generation template to clarify the role of surrounding context and main content.
- Enhanced quality rules for generated questions to better align with user search intent.
- Revised output format and added explicit instructions on what not to generate.
- Improved logging and output in the web parser for better visibility of parsed content and metadata.
2026-04-01 21:59:22 +08:00
wizardchen
c4f5db7e88 feat(metadata): enhance document processing to include metadata extraction and handling
- Updated DocReaderServicer to pass metadata in responses.
- Modified PipelineParser to accumulate and merge metadata from all parsers.
- Enhanced StdWebParser to extract and log the title from web page content.
- Implemented logic in knowledge service to update knowledge title based on extracted metadata.
2026-04-01 15:49:47 +08:00
Windfarer
bc0c2d23d5 add comment 2026-04-01 13:37:37 +08:00
Windfarer
6058f1cbe0 add monkey patch for docx parse error 2026-04-01 13:37:37 +08:00
wizardchen
1c9503b063 feat(docparser): enhance image resolution capabilities in markdown and HTML
- Updated regex patterns in MarkdownImageUtil to support alt text containing brackets and handle MIME types with hyphens.
- Implemented new functions in ImageResolver for resolving HTML <img> tags with data URIs and bare base64 content, improving image handling in markdown.
- Added comprehensive tests for various image scenarios, ensuring robust handling of data URIs and base64 images.
2026-03-25 22:08:29 +08:00
wizardchen
397689d2f3 feat: introduce WeKnora Lite edition with lightweight configuration and deployment
- Added a new `.env.lite.example` file for the Lite version, providing a minimal configuration template.
- Updated `.env.example` to remove deprecated variables and include new Docreader settings.
- Enhanced Docker configurations to support the Lite version, including a new Dockerfile for the Docreader service.
- Introduced a Makefile target for building and running the Lite version, along with packaging capabilities.
- Created GitHub workflows for building and releasing Lite binaries, including Homebrew formula support.
- Implemented a new service file for managing the Lite version as a system service.

This update enables a streamlined, single-binary deployment of WeKnora, reducing external dependencies and simplifying setup.
2026-03-02 21:21:49 +08:00
chenrui7
aba7a47b20 fix(parser): resolve chunk index mismatch in logs 2026-01-29 19:32:35 +08:00
ice
6a3c29d455 fix: correctly extract host from completion_url by handling both v1 and non-v1 endpoints 2026-01-28 14:06:43 +08:00
cc
5c7f05189e fix(parser): separate StdMinerUParser and MinerUCloudParser implementation 2026-01-23 10:28:38 +08:00
begoniezhao
88fd42cbc3 refactor: Restructure OCR module and centralize config 2026-01-16 16:05:31 +08:00
cc
74bf76eb3f fix(parser): provide explicit file_extension for markitdown to resolve #544 2026-01-16 11:50:42 +08:00
cc
3f2792dff4 fix(parser): provide explicit file_extension for markitdown to resolve #544 2026-01-16 11:50:42 +08:00
wizardchen
48a0f9b508 Merge remote-tracking branch 'github_public/main' 2026-01-15 14:39:35 +08:00
begoniezhao
1abdaa5d5c feat: Make OCR and task concurrency configurable 2026-01-15 10:56:09 +08:00
begoniezhao
a35999466c refactor: Improve error handling and remove unused code 2026-01-14 17:08:26 +08:00
orbisai0security
14ac60128b fix: resolve critical vulnerability V-001
Automatically generated security fix
2026-01-14 17:08:26 +08:00
begoniezhao
f25ed71054 fix(doc_parser): Add secure command execution with sandbox 2026-01-14 17:08:26 +08:00
begoniezhao
bd0fb766f8 refactor: Improve error handling and remove unused code 2026-01-14 14:21:32 +08:00
orbisai0security
e023c26af8 fix: resolve critical vulnerability V-001
Automatically generated security fix
2026-01-12 16:11:44 +08:00
begoniezhao
91ec651ac0 fix(doc_parser): Add secure command execution with sandbox 2026-01-08 20:24:12 +08:00
begoniezhao
907e9a5522 feat: Add DataSchema tool for retrieving schema information from CSV and Excel files 2025-12-29 20:03:51 +08:00
begoniezhao
9ef974e4cf fix: Update storage configuration handling for improved flexibility 2025-12-05 18:02:21 +08:00
begoniezhao
3e31fdeefd style: add necessary comments to improve code quality
- Add docstrings and inline comments for key functions and complex logic
- Unify comment style, eliminate magic numbers and ambiguous variable names
- No functional changes, only improve maintainability
2025-12-01 17:43:26 +08:00
begoniezhao
154025f723 refactor: optimize parser logging and API check logic, simplify exception handling 2025-11-20 15:05:53 +08:00
begoniezhao
587d1b2bd3 feat: add parsing support for CSV, XLSX, and XLS file types 2025-11-19 19:23:16 +08:00
begoniezhao
ddbdae686f feat: add MarkdownTableUtil to reduce whitespace in Markdown tables 2025-11-19 15:14:00 +08:00
begoniezhao
4fdbec17a7 feat: add web page parser class, optimize dependencies and image encoding support 2025-11-18 22:37:01 +08:00
begoniezhao
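2d66abedf0 feat: add document model classes, adjust configuration and parsing logic, optimize logging and imports
Remove logging setup and redundant code; optimize imports, type hints, and OCR backend management
Switch module imports across all files to absolute imports
Adjust import paths, remove some imports, optimize logging and comments
Upgrade the document parser to Docx2Parser; optimize timeout and image handling logic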
2025-11-18 22:37:01 +08:00
begoniezhao
c1f731e026 chore(docreader): reorganize module files 2025-11-05 12:07:39 +08:00