WeKnora/internal
wizardchen 7aca1017db fix(docparser): address review feedback on PR #1404
Three follow-up fixes on top of the MinerU markdown preservation work:

- Stop applying normalizeMinerUMarkdown inside ResolveAndStore. The
  helper is already called by MinerUReader.Read, and ResolveAndStore is
  shared by every parser (docreader, session attachments, ...). Running
  the heading/image unescape regexes globally would silently rewrite
  content (including inside fenced code blocks) for non-MinerU sources.

- Recognize MinerU image references whose path contains spaces, e.g.
  "images/第 1 页.jpg". The previous regex used in
  extractImageRefsFromContent disallowed whitespace in the URL group,
  so such images were never matched and never persisted. Use a
  whitespace-tolerant pattern aligned with ResolveAndStore's own
  imgPattern.

- Deduplicate uploads when the same MinerU image is referenced under
  multiple path forms (e.g. "images/foo.png" vs "./images/foo.png").
  saveReferencedImage now caches by ref.Filename in addition to the
  raw ref path, so the second variant reuses the previously stored
  ServingURL instead of writing the same bytes to object storage
  again.
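
To illustrate the second fix, here is a minimal sketch of the difference between a whitespace-rejecting and a whitespace-tolerant image pattern. Both regexes and the extractRefs helper are assumptions written for this example; they are not copied from extractImageRefsFromContent or from ResolveAndStore's imgPattern, whose exact expressions are not shown in this commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// strictImg is a hypothetical pattern that disallows whitespace in the URL
// group, so "images/第 1 页.jpg" can never match.
var strictImg = regexp.MustCompile(`!\[([^\]]*)\]\(([^\s)]+)\)`)

// tolerantImg is a hypothetical pattern that accepts any character except
// ')' in the URL group, so paths containing spaces are matched.
var tolerantImg = regexp.MustCompile(`!\[([^\]]*)\]\(([^)]+)\)`)

// extractRefs returns the trimmed URL of every image reference matched by re.
func extractRefs(re *regexp.Regexp, content string) []string {
	var refs []string
	for _, m := range re.FindAllStringSubmatch(content, -1) {
		refs = append(refs, strings.TrimSpace(m[2]))
	}
	return refs
}

func main() {
	md := "![p1](images/第 1 页.jpg)\n![p2](images/foo.png)"
	fmt.Println(len(extractRefs(strictImg, md)))   // only the path without spaces
	fmt.Println(len(extractRefs(tolerantImg, md))) // both paths
}
```

The tolerant form trades precision for recall: it would also swallow a Markdown image title such as `(a.png "title")` into the URL group, so a real implementation may need extra handling for that case.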

Tests added:
- TestProcessImagesMatchesPathsWithSpaces
- TestResolveAndStoreDedupsSameImageRefVariants
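
The behaviour the dedup test targets can be sketched roughly as follows. The imageRef and imageStore types, the save method, and the storage.example URL are hypothetical stand-ins, not the real saveReferencedImage code; the sketch also assumes that a filename identifies one image within a document.

```go
package main

import (
	"fmt"
	"path"
)

// imageRef is a hypothetical parsed image reference.
type imageRef struct {
	RawPath  string // path exactly as written in the markdown
	Filename string // base name derived from the cleaned path
}

// newRef derives Filename from the cleaned raw path.
func newRef(raw string) imageRef {
	return imageRef{RawPath: raw, Filename: path.Base(path.Clean(raw))}
}

// imageStore is a hypothetical uploader with two caches: one keyed by the
// raw reference path and one keyed by filename.
type imageStore struct {
	byRawPath  map[string]string
	byFilename map[string]string
	uploads    int // number of simulated writes to object storage
}

func newImageStore() *imageStore {
	return &imageStore{
		byRawPath:  make(map[string]string),
		byFilename: make(map[string]string),
	}
}

// save returns a serving URL for ref and uploads only when neither cache
// already holds an entry for it.
func (s *imageStore) save(ref imageRef) string {
	if url, ok := s.byRawPath[ref.RawPath]; ok {
		return url
	}
	if url, ok := s.byFilename[ref.Filename]; ok {
		s.byRawPath[ref.RawPath] = url // remember this path variant too
		return url
	}
	s.uploads++ // stands in for the real object-storage write
	url := "https://storage.example/" + ref.Filename
	s.byRawPath[ref.RawPath] = url
	s.byFilename[ref.Filename] = url
	return url
}

func main() {
	store := newImageStore()
	a := store.save(newRef("images/foo.png"))
	b := store.save(newRef("./images/foo.png"))
	fmt.Println(a == b, store.uploads)
}
```

Keying on the filename alone would merge two distinct images that share a base name in different directories, so this sketch is only safe under the uniqueness assumption stated above.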
2026-05-21 11:42:56 +08:00