WeKnora

mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Files

wizardchen c0e4a1d2f1 fix(summary): preserve image caption/OCR text in document summaries

Documents whose only payload is an embedded image (e.g. a docx with a
single picture) intermittently produced the refusal line "No textual
content was extractable from this document." even though the vision
model had successfully extracted a caption.

Three coordinated fixes:

- Clarify in the summary prompt that text inside `<image_caption>` and
  `<image_ocr>` is first-class extracted content, not an image
  reference, so the model triggers the empty-content branch only when
  the body is genuinely textless.
- For image-dominated documents (real text < 200 runes after stripping
  image markup) include OCR alongside captions so screenshots and
  scanned figures contribute their actual content; text-heavy
  documents continue to use caption-only enrichment to avoid OCR
  noise from incidental figures.
- Add `EnrichContentCaptionAndOCR` which embeds caption + OCR text
  inline next to the original Markdown image link, deliberately
  omitting the `<image url=...>` and `<image_original>` wrapper
  blocks. Those wrappers carry only opaque export hashes that consume
  tokens and have been observed to retrigger the LLM's "image
  reference with no extracted text" heuristic.

2026-05-22 17:25:39 +08:00

conversion.go

feat(web-search): refactor web search provider

2026-03-31 20:22:34 +08:00

imageinfo.go

fix(summary): preserve image caption/OCR text in document summaries

2026-05-22 17:25:39 +08:00

normalize.go

refactor: Introduce searchutil package for web search result conversion and keyword score normalization

2025-12-01 18:27:12 +08:00

textutil.go

refactor(chat_pipeline): enhance merging logic and add partial overlap removal

2026-03-25 22:08:29 +08:00