WeKnora

pub_soft/WeKnora

mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Commit Graph

Author

SHA1

Message

Date

ochan.kwon

e9980c6011

fix: deep-copy stored files and images when cloning a knowledge base

Cloning a knowledge base previously copied only the storage path strings
(knowledge.FilePath and chunk.ImageInfo.URL), so the source and the clone
shared the same physical objects in the storage backend. Once the original
file and extracted images are deleted on source removal, the clone is left
with dangling references and its document and images become unreadable —
data loss that occurs even for same-store clones.

Add a CopyFile primitive to the FileService interface and implement it in
every backend: server-side CopyObject on the object stores
(s3/obs/cos/oss/tos/ks3/minio), io.Copy on local, and a no-op on dummy.
Destinations use the knowledge-owned layout and reuse the existing
path/object-key guards; a sentinel ErrCrossBackendCopy is returned when the
source scheme does not match the backend.

Use CopyFile to deep-copy the document file in cloneKnowledge and the
extracted images in CloneChunk and cloneFAQKnowledgeBase via a shared
cloneChunkImageInfo helper that deduplicates identical image URLs per clone
and rewrites them to the new objects. Copied objects are cleaned up
best-effort if a clone fails partway through. A clone-time preflight rejects
cloning into a target bound to a different storage backend when the tenant
pins providers via StorageEngineConfig.

Add unit tests for local CopyFile (independent copy survives source
deletion, traversal rejection, cross-backend rejection), cloneChunkImageInfo
(empty/multi/dedup/parse-failure/OriginalURL handling), and the storage
provider preflight.

2026-06-03 14:45:59 +08:00

wizardchen

8103398fcf

fix(agent): materialize knowledge files to temp path for DuckDB (#1007)

Dev-mode repro from the issue reporter's logs:

  [Tool][DataAnalysis] Failed to create table from Excel: IO Error:
  GDAL Error (4): Failed to open file
  'local://10000/.../1777030910246871000.xlsx': No such file or directory

The local file service returns a custom 'local://' URL from GetFileURL,
which the DuckDB spatial/excel extensions can't resolve (they expect
plain paths or http(s)/s3:// style URIs). Presigned HTTPS URLs from
cloud backends worked by accident; the 'local://' dev path has always
been broken for the Data Analysis tool.

Instead of per-scheme adapters, stream the file through
FileService.GetFile into a temp file and hand DuckDB the resulting
filesystem path. This works uniformly across every backend (local,
OSS, S3, MinIO, COS) and survives future scheme changes without any
changes to the Data Analysis tool.

- Preserve the original file extension on the temp file so DuckDB's
  format auto-detection (csv / xlsx / xls) still kicks in.
- Clean up the temp file when LoadFromKnowledge returns, including on
  every error branch. Cleanup is idempotent (double-invoke safe).
- New tests stub FileService to lock in that (a) we never leak a
  provider:// URL to DuckDB, (b) GetFile errors propagate, (c) the
  extension survives case normalization.

Refs: https://github.com/Tencent/WeKnora/issues/1007

2026-04-24 19:57:56 +08:00

2 Commits