WeKnora

mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Author	SHA1	Message	Date
wizardchen	bbd3f6324a	refactor(parser): reorganize Markdown parser and enhance gRPC document reading - Moved the _SEPARATOR_CELL regex definition to a more appropriate location in the Markdown parser. - Implemented a fallback mechanism in the gRPC document reader to handle cases where the ReadStream RPC is unimplemented, ensuring compatibility with older versions. - Added a readUnary method to maintain backward compatibility with the legacy unary Read RPC. - Improved cancellation handling in the MinerUCloud and PaddleOCR-VL readers to prevent excessive API calls during context cancellation.	2026-06-03 12:29:13 +08:00
wizardchen	ef1047bf67	feat(parser): add OpenDataLoader, PaddleOCR-VL engines, and parser improvements Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing. Optional odl-hybrid sidecar stays local-build only and is excluded from default dev-start and full profiles.	2026-06-03 12:29:13 +08:00
wizardchen	7b1bb1054f	feat(docreader): speed up scanned-PDF parsing, stream image results, isolate heavy async queues Large scanned PDFs (hundreds of pages) were slow and fragile end-to-end. This change addresses the parse, transport, and task-scheduling layers: docreader (parse + transport): - Parallelize per-page scanned rendering across processes (forkserver/fork), with serial fallback. ~4-7x faster on large scanned PDFs; pdfium is not thread-safe so we fan out across processes. Configurable via DOCREADER_PDF_RENDER_PARALLELISM. - Add server-streaming ReadStream RPC: emit one meta frame then one frame per image, so documents with many page images are no longer capped by the unary gRPC message-size limit (a 874-page PDF produced ~193MiB of images, far over the 50MB cap) and memory is bounded on both ends. Unary Read is kept for backward compatibility; the Go production reader switches to ReadStream. VLM: - Make the VLM HTTP timeout configurable (VLM_HTTP_TIMEOUT_SECONDS) and raise the default 90s -> 180s so dense scanned-page OCR does not time out with "context deadline exceeded". Async task queues: - Isolate high-volume, model-heavy fan-out tasks into dedicated asynq queues so a single large document cannot saturate the shared worker pool and block user-facing document parsing: image:multimodal -> "multimodal" chunk:extract -> "graph" question:generation -> "question" - Register the new queues in the server weight map and the cancel inspector's scanned-queue set (so cancelling a knowledge still purges its pending tasks).	2026-06-03 12:29:13 +08:00
wizardchen	959eba2136	fix(doc_parser): enhance DOC to DOCX conversion reliability - Implemented a retry mechanism for DOC to DOCX conversion to handle concurrent `soffice` invocations, ensuring each attempt uses a dedicated user profile directory. - Added logging for each conversion attempt, including success and failure messages, to improve visibility into the conversion process. - Adjusted the handling of temporary directories for both conversion output and user profiles, enhancing robustness against conversion failures.	2026-06-01 20:50:02 +08:00
wizardchen	13301ca026	feat(parser): enhance web parser with improved image extension handling and XPath prioritization - Added a regex pattern for image file extensions to the utils module for better image detection. - Updated the BODY_XPATH in the xpaths module to prioritize matching specific content structures in web pages. - These changes aim to improve the accuracy and efficiency of content extraction from web pages using the StdWebParser class.	2026-05-25 19:15:17 +08:00
wizardchen	d65e647f95	chore: bump Go to 1.26 and slim docreader dependencies - Bump base image in docker/Dockerfile.app from golang:1.24 to golang:1.26 to match `go 1.26` declared in go.mod (fixes CI build failure on `go mod download`). - Drop unused docreader components and their dependencies: - Remove `docreader/ocr/` package (paddle/vlm/dummy backends are unreferenced by the main flow; OCR/VLM is handled by the Go App). - Remove `docreader/parser/storage.py` (dead code; image persistence happens in the Go App via inline ImageRef bytes). - Remove `docreader/scripts/download_deps.py` (PaddleOCR pre-download). - Drop deps: paddleocr, paddlepaddle, openai, ollama, minio, cos-python-sdk-v5, oss2, asyncio, pypdf2, markdown, mistletoe, goose3, markdownify, pdfplumber, antiword, urllib3. - Re-lock uv.lock: 145 -> 79 packages. - Update docreader/README.md to reflect that OCR/VLM/storage are no longer configured at the docreader level.	2026-05-09 13:32:40 +08:00
wolfkill	450a5bd2dd	fix(docreader): throttle heavy parser concurrency	2026-05-07 17:36:09 +08:00
wizardchen	d5f6c7ba21	fix(docreader): remove default 100-page limit for DOCX parsing The default DOCREADER_DOCX_MAX_PAGES=100 silently truncates large documents, causing users to see at most ~1000 chunks regardless of document length. Change the default to 0 (no limit) so all pages are processed. Operators who need a cap can still set the env var. Fixes #719	2026-04-28 21:50:15 +08:00
wizardchen	3e61b91efd	fix(storage): remove ListBucket permission from MinIO bucket policy	2026-04-16 18:13:21 +08:00
wizardchen	1938094dcc	feat(parser): add PDFScannedParser for handling scanned PDFs - Introduced PDFScannedParser as a fallback parser for scanned PDFs that converts pages into images for OCR processing. - Updated PDFParser to include PDFScannedParser in the parsing chain, enhancing the document parsing capabilities for scanned content. - Improved logging for better error tracking during PDF parsing operations.	2026-04-16 18:13:20 +08:00
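bingxiang.cheng	d11e043142	refactor(storage): improve OSS storage initialization and URL format for better security and compatibility - Check whether oss2 is installed and log an error if it is not. - Do not auto-create the OSS bucket when it does not exist, to avoid accidentally creating a public-read bucket. - Switch OSS download URLs to virtual-hosted style for better compatibility. - Add detailed log messages for OSS initialization errors and missing configuration. - Reformat the other storage classes to a consistent style for maintainability.	2026-04-13 15:34:28 +08:00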
bingxiang.cheng	adae28beb2	feat(oss): add OssStorage class to docreader Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 15:34:28 +08:00
Windfarer	54da98fc24	feat: add docx max pages env config	2026-04-02 10:31:52 +08:00
wizardchen	8e1cfaccb7	refactor(prompt_templates): improve question generation guidelines and context handling - Updated the question generation template to clarify the role of surrounding context and main content. - Enhanced quality rules for generated questions to better align with user search intent. - Revised output format and added explicit instructions on what not to generate. - Improved logging and output in the web parser for better visibility of parsed content and metadata.	2026-04-01 21:59:22 +08:00
wizardchen	c4f5db7e88	feat(metadata): enhance document processing to include metadata extraction and handling - Updated DocReaderServicer to pass metadata in responses. - Modified PipelineParser to accumulate and merge metadata from all parsers. - Enhanced StdWebParser to extract and log the title from web page content. - Implemented logic in knowledge service to update knowledge title based on extracted metadata.	2026-04-01 15:49:47 +08:00
Windfarer	bc0c2d23d5	add comment	2026-04-01 13:37:37 +08:00
Windfarer	6058f1cbe0	add monkey patch for docx parse error	2026-04-01 13:37:37 +08:00
wizardchen	1c9503b063	feat(docparser): enhance image resolution capabilities in markdown and HTML - Updated regex patterns in MarkdownImageUtil to support alt text containing brackets and handle MIME types with hyphens. - Implemented new functions in ImageResolver for resolving HTML <img> tags with data URIs and bare base64 content, improving image handling in markdown. - Added comprehensive tests for various image scenarios, ensuring robust handling of data URIs and base64 images.	2026-03-25 22:08:29 +08:00
wizardchen	397689d2f3	feat: introduce WeKnora Lite edition with lightweight configuration and deployment - Added a new `.env.lite.example` file for the Lite version, providing a minimal configuration template. - Updated `.env.example` to remove deprecated variables and include new Docreader settings. - Enhanced Docker configurations to support the Lite version, including a new Dockerfile for the Docreader service. - Introduced a Makefile target for building and running the Lite version, along with packaging capabilities. - Created GitHub workflows for building and releasing Lite binaries, including Homebrew formula support. - Implemented a new service file for managing the Lite version as a system service. This update enables a streamlined, single-binary deployment of WeKnora, reducing external dependencies and simplifying setup.	2026-03-02 21:21:49 +08:00
chenrui7	aba7a47b20	fix(parser): resolve chunk index mismatch in logs	2026-01-29 19:32:35 +08:00
ice	6a3c29d455	fix: correctly extract host from completion_url by handling both v1 and non-v1 endpoints	2026-01-28 14:06:43 +08:00
cc	5c7f05189e	fix(parser): separate StdMinerUParser and MinerUCloudParser implementation	2026-01-23 10:28:38 +08:00
begoniezhao	88fd42cbc3	refactor: Restructure OCR module and centralize config	2026-01-16 16:05:31 +08:00
cc	74bf76eb3f	fix(parser): provide explicit file_extension for markitdown to resolve #544	2026-01-16 11:50:42 +08:00
cc	3f2792dff4	fix(parser): provide explicit file_extension for markitdown to resolve #544	2026-01-16 11:50:42 +08:00
wizardchen	48a0f9b508	Merge remote-tracking branch 'github_public/main'	2026-01-15 14:39:35 +08:00
begoniezhao	1abdaa5d5c	feat: Make OCR and task concurrency configurable	2026-01-15 10:56:09 +08:00
begoniezhao	a35999466c	refactor: Improve error handling and remove unused code	2026-01-14 17:08:26 +08:00
orbisai0security	14ac60128b	fix: resolve critical vulnerability V-001 Automatically generated security fix	2026-01-14 17:08:26 +08:00
begoniezhao	f25ed71054	fix(doc_parser): Add secure command execution with sandbox	2026-01-14 17:08:26 +08:00
begoniezhao	bd0fb766f8	refactor: Improve error handling and remove unused code	2026-01-14 14:21:32 +08:00
orbisai0security	e023c26af8	fix: resolve critical vulnerability V-001 Automatically generated security fix	2026-01-12 16:11:44 +08:00
begoniezhao	91ec651ac0	fix(doc_parser): Add secure command execution with sandbox	2026-01-08 20:24:12 +08:00
begoniezhao	907e9a5522	feat: Add DataSchema tool for retrieving schema information from CSV and Excel files	2025-12-29 20:03:51 +08:00
begoniezhao	9ef974e4cf	fix: Update storage configuration handling for improved flexibility	2025-12-05 18:02:21 +08:00
begoniezhao	3e31fdeefd	style: add necessary comments to improve code quality - Add docstrings and inline comments for key functions and complex logic - Unify comment style, eliminate magic numbers and ambiguous variable names - No functional changes, only improve maintainability	2025-12-01 17:43:26 +08:00
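begoniezhao	154025f723	refactor: improve parser logging and API check logic, simplify exception handling	2025-11-20 15:05:53 +08:00
begoniezhao	587d1b2bd3	feat: add parsing support for CSV, XLSX, and XLS file types	2025-11-19 19:23:16 +08:00
begoniezhao	ddbdae686f	feat: add MarkdownTableUtil to reduce whitespace in Markdown tables	2025-11-19 15:14:00 +08:00
begoniezhao	4fdbec17a7	feat: add web page parser class, improve dependencies and image encoding support	2025-11-18 22:37:01 +08:00
begoniezhao	2d66abedf0	feat: add document model classes, adjust configuration and parsing logic, improve logging and imports; remove logging setup and redundant code, improve imports, type hints, and OCR backend management; switch module import paths in all files to absolute imports; adjust import paths, remove some imports, improve logging and comments; upgrade the document parser to Docx2Parser, improve timeout and image handling logic	2025-11-18 22:37:01 +08:00
begoniezhao	c1f731e026	chore(docreader): reorganize module files	2025-11-05 12:07:39 +08:00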

42 Commits
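
Commit 959eba2136 above describes retrying DOC to DOCX conversion with a dedicated `soffice` user profile directory per attempt. The following is a minimal sketch of that idea, not the repository's actual code: the function name `convert_doc_to_docx`, the injectable `runner` parameter, the attempt count, and the timeout are all assumptions made for illustration. The only external interface it relies on is LibreOffice's `-env:UserInstallation`, `--headless`, `--convert-to`, and `--outdir` command-line options.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence

# Hypothetical type alias: a runner takes a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code (-1 on launch failure or timeout)."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=120).returncode
    except (OSError, subprocess.TimeoutExpired):
        return -1


def convert_doc_to_docx(src: str, out_dir: str, attempts: int = 3,
                        runner: Runner = _run) -> Optional[Path]:
    """Try the conversion up to `attempts` times; return the .docx path, or None if every attempt fails."""
    target = Path(out_dir) / (Path(src).stem + ".docx")
    for attempt in range(1, attempts + 1):
        # A fresh profile directory per attempt keeps concurrent soffice
        # processes from contending for one shared user profile.
        with tempfile.TemporaryDirectory(prefix="soffice_profile_") as profile:
            cmd = [
                "soffice",
                f"-env:UserInstallation={Path(profile).as_uri()}",
                "--headless",
                "--convert-to", "docx",
                "--outdir", out_dir,
                src,
            ]
            code = runner(cmd)
        if code == 0 and target.exists():
            print(f"attempt {attempt}: converted {src} -> {target}")
            return target
        print(f"attempt {attempt}: conversion failed (exit code {code})")
    return None
```

Passing the runner as a parameter is a choice made here so the retry logic can be exercised without LibreOffice installed; a call such as `convert_doc_to_docx("report.doc", "/tmp/out")` would use the real `soffice` binary if it is on the `PATH`.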