Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level
settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing.
The optional odl-hybrid sidecar remains local-build only and is excluded from
the default dev-start and full profiles.
- Added a regex pattern for image file extensions to the utils module for better image detection.
- Updated the BODY_XPATH in the xpaths module to prioritize matching specific content structures in web pages.
- These changes aim to improve the accuracy and efficiency of content extraction from web pages using the StdWebParser class.
- Updated the question generation template to clarify the role of surrounding context and main content.
- Enhanced quality rules for generated questions to better align with user search intent.
- Revised output format and added explicit instructions on what not to generate.
- Improved logging and output in the web parser for better visibility of parsed content and metadata.
- Updated DocReaderServicer to pass metadata in responses.
- Modified PipelineParser to accumulate and merge metadata from all parsers.
- Enhanced StdWebParser to extract and log the title from web page content.
- Implemented logic in the knowledge service to update the knowledge title based on extracted metadata.
- Added docstrings and inline comments for key functions and complex logic.
- Unified comment style and eliminated magic numbers and ambiguous variable names.
- No functional changes; these edits only improve maintainability.
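
For reviewers, the metadata flow described above (PipelineParser merging metadata from all parsers, then the knowledge service updating the title) can be pictured with the rough sketch below. It is illustrative only and is not the code in this change: `merge_metadata` and `resolve_title` are hypothetical names, and the rule that a later parser's non-empty value overrides an earlier one is an assumption.

```python
from typing import Any, Dict, Iterable


def merge_metadata(results: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Accumulate metadata dicts from a chain of parsers (hypothetical helper).

    Assumption: later parsers win on key conflicts, but empty values
    (None or "") never overwrite a value that was already collected.
    """
    merged: Dict[str, Any] = {}
    for metadata in results:
        for key, value in metadata.items():
            if value is None or value == "":
                continue  # skip empty values so they cannot erase earlier data
            merged[key] = value
    return merged


def resolve_title(current_title: str, metadata: Dict[str, Any]) -> str:
    """Pick the extracted title if one exists, else keep the current title."""
    extracted = str(metadata.get("title") or "").strip()
    return extracted or current_title


# Example: two parsers in a pipeline, the second one found no title.
merged = merge_metadata([
    {"title": "Release notes", "lang": "en"},
    {"title": "", "source": "web"},
])
new_title = resolve_title("https://example.com/page", merged)
```

Skipping empty values keeps a parser that found no title from wiping out a title found earlier in the pipeline, which is one plausible reason to merge rather than overwrite.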