mirror of
https://github.com/Tencent/WeKnora.git
synced 2026-06-04 13:30:32 +08:00
Documents whose only payload is an embedded image (e.g. a docx with a single picture) intermittently produced the refusal line "No textual content was extractable from this document." even though the vision model had successfully extracted a caption.

Three coordinated fixes:

- Clarify the summary prompt that text inside `<image_caption>` and `<image_ocr>` is first-class extracted content, not an image reference, so the model only triggers the empty-content branch when the body is genuinely textless.
- For image-dominated documents (real text < 200 runes after stripping image markup) include OCR alongside captions so screenshots and scanned figures contribute their actual content; text-heavy documents continue to use caption-only enrichment to avoid OCR noise from incidental figures.
- Add `EnrichContentCaptionAndOCR` which embeds caption + OCR text inline next to the original Markdown image link, deliberately omitting the `<image url=...>` and `<image_original>` wrapper blocks. Those wrappers carry only opaque export hashes that consume tokens and have been observed to retrigger the LLM's "image reference with no extracted text" heuristic.