10 Commits

Author SHA1 Message Date
wizardchen
ef1047bf67 feat(parser): add OpenDataLoader, PaddleOCR-VL engines, and parser improvements
Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level
settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing.
Optional odl-hybrid sidecar stays local-build only and is excluded from
default dev-start and full profiles.
2026-06-03 12:29:13 +08:00
wizardchen
13301ca026 feat(parser): enhance web parser with improved image extension handling and XPath prioritization
- Added a regex pattern for image file extensions to the utils module for better image detection.
- Updated the BODY_XPATH in the xpaths module to prioritize matching specific content structures in web pages.
- These changes aim to improve the accuracy and efficiency of content extraction from web pages using the StdWebParser class.
2026-05-25 19:15:17 +08:00
wizardchen
8e1cfaccb7 refactor(prompt_templates): improve question generation guidelines and context handling
- Updated the question generation template to clarify the role of surrounding context and main content.
- Enhanced quality rules for generated questions to better align with user search intent.
- Revised output format and added explicit instructions on what not to generate.
- Improved logging and output in the web parser for better visibility of parsed content and metadata.
2026-04-01 21:59:22 +08:00
wizardchen
c4f5db7e88 feat(metadata): enhance document processing to include metadata extraction and handling
- Updated DocReaderServicer to pass metadata in responses.
- Modified PipelineParser to accumulate and merge metadata from all parsers.
- Enhanced StdWebParser to extract and log the title from web page content.
- Implemented logic in knowledge service to update knowledge title based on extracted metadata.
2026-04-01 15:49:47 +08:00
begoniezhao
88fd42cbc3 refactor: Restructure OCR module and centralize config 2026-01-16 16:05:31 +08:00
begoniezhao
907e9a5522 feat: Add DataSchema tool for retrieving schema information from CSV and Excel files 2025-12-29 20:03:51 +08:00
begoniezhao
3e31fdeefd style: add necessary comments to improve code quality
- Add docstrings and inline comments for key functions and complex logic
- Unify comment style, eliminate magic numbers and ambiguous variable names
- No functional changes, only improve maintainability
2025-12-01 17:43:26 +08:00
begoniezhao
4fdbec17a7 feat: add web page parser class; optimize dependencies and image encoding support 2025-11-18 22:37:01 +08:00
begoniezhao
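2d66abedf0 feat: add document model classes, adjust configuration and parsing logic, optimize logging and imports
Remove logging setup and redundant code; optimize imports, type hints, and OCR backend management
Convert module import paths across all files to absolute imports
Adjust import paths, remove some imports, and improve logging and comments
Upgrade the document parser to Docx2Parser; optimize timeout and image handling logic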
2025-11-18 22:37:01 +08:00
begoniezhao
c1f731e026 chore(docreader): reorganize module files 2025-11-05 12:07:39 +08:00