WeKnora

mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Author	SHA1	Message	Date
wizardchen	ef1047bf67	feat(parser): add OpenDataLoader, PaddleOCR-VL engines, and parser improvements Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing. Optional odl-hybrid sidecar stays local-build only and is excluded from default dev-start and full profiles.	2026-06-03 12:29:13 +08:00
wizardchen	13301ca026	feat(parser): enhance web parser with improved image extension handling and XPath prioritization - Added a regex pattern for image file extensions to the utils module for better image detection. - Updated the BODY_XPATH in the xpaths module to prioritize matching specific content structures in web pages. - These changes aim to improve the accuracy and efficiency of content extraction from web pages using the StdWebParser class.	2026-05-25 19:15:17 +08:00
wizardchen	8e1cfaccb7	refactor(prompt_templates): improve question generation guidelines and context handling - Updated the question generation template to clarify the role of surrounding context and main content. - Enhanced quality rules for generated questions to better align with user search intent. - Revised output format and added explicit instructions on what not to generate. - Improved logging and output in the web parser for better visibility of parsed content and metadata.	2026-04-01 21:59:22 +08:00
wizardchen	c4f5db7e88	feat(metadata): enhance document processing to include metadata extraction and handling - Updated DocReaderServicer to pass metadata in responses. - Modified PipelineParser to accumulate and merge metadata from all parsers. - Enhanced StdWebParser to extract and log the title from web page content. - Implemented logic in knowledge service to update knowledge title based on extracted metadata.	2026-04-01 15:49:47 +08:00
begoniezhao	88fd42cbc3	refactor: Restructure OCR module and centralize config	2026-01-16 16:05:31 +08:00
begoniezhao	907e9a5522	feat: Add DataSchema tool for retrieving schema information from CSV and Excel files	2025-12-29 20:03:51 +08:00
begoniezhao	3e31fdeefd	style: add necessary comments to improve code quality - Add docstrings and inline comments for key functions and complex logic - Unify comment style, eliminate magic numbers and ambiguous variable names - No functional changes, only improve maintainability	2025-12-01 17:43:26 +08:00
begoniezhao	4fdbec17a7	feat: add web page parser class, optimize dependencies and image encoding support	2025-11-18 22:37:01 +08:00
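begoniezhao	2d66abedf0	feat: add document model classes, adjust configuration and parsing logic, optimize logging and imports; remove logging setup and redundant code, optimize imports, type hints, and OCR backend management; switch module import paths in all files to absolute imports; adjust import paths, remove some imports, optimize logging and comments; upgrade the document parser to Docx2Parser, optimize timeout and image handling logic	2025-11-18 22:37:01 +08:00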
begoniezhao	c1f731e026	chore(docreader): reorganize module files	2025-11-05 12:07:39 +08:00

10 Commits