mirror of
https://github.com/Tencent/WeKnora.git
synced 2026-06-04 13:30:32 +08:00
Phase 3 (#1440) gate flip. PR 1 (#1445) + PR 2a (#1481) + PR 2b (#1482)
laid the type prep + driver skeleton + read/write paths as gated dead
code; this PR wires every activation surface so opensearch becomes a
registerable VectorStore engine.

Activation wiring
- internal/types: validEngineTypes / GetVectorStoreTypes (with HNSW
  bounds + knn_engine enum + Immutable hints) / retrieverEngineMapping /
  buildEnvStoreForDriver: every gated surface now recognises
  "opensearch". IndexConfig grows four omitempty HNSW fields (HNSWM /
  HNSWEFConstruction / HNSWEFSearch / KNNEngine), keeping other engines'
  serialised config byte-identical.
- internal/container: createOpenSearchEngine + the switch case in
  createEngineServiceFromStore; the RETRIEVE_DRIVER=opensearch env path
  in initRetrieveEngineRegistry; NewEngineFactory now closes over the
  AuditLogService (the EngineFactory type itself is unchanged).
- internal/application/service/vectorstore_healthcheck.go: a
  testOpenSearchConnection case so CreateStore's connectivity probe
  accepts opensearch instead of returning 400.
- internal/application/repository/retriever/opensearch/transport.go:
  NewOpenSearchClient is exported so the factory and env path can build
  the TLS-hardened client; healthcheck.go reuses the unexported
  probeVersion / probeKNNPlugin for the service-layer probe.

Service-layer validation
- validateOpenSearchIndexConfig validates the HNSW caps (m 2-100,
  ef_construction 2-4096, ef_search 1-10000, knn_engine ∈ lucene|faiss).
  Shards/replicas continue to be enforced by the flat
  ValidateIndexConfig. Create-only: UpdateStore mutates the name only.
- validateConnectionConfig requires addr for opensearch.

Sync implementations (stubs.go shrinks)
- CopyIndices (copy.go) mirrors the Elasticsearch / Qdrant pattern
  (search → BatchSave with the source_id remap for generated questions),
  so dim/keyword routing and the source_id contract come from BatchSave
  for free. embeddingMap is keyed by the *target* SourceID because
  OpenSearch's BatchSave looks up embeddings by SourceID
  (lookupEmbedding), not by chunk_id (the ES driver's convention).
  Pagination is from/size; copies larger than max_result_window
  (default 10000) need the scroll-based async path that lands later.
- BatchUpdateChunkEnabledStatus / BatchUpdateChunkTagID (bulk_update.go)
  group the input by target value and issue one _update_by_query per
  group over the cross-dim <base>_* pattern. Caller values flow through
  bound script params only, never string-interpolated into the Painless
  source, closing the script-injection surface.
- inspectByQueryResponse (byquery.go) mirrors inspectBulkResponse: the
  full failure reason goes to the debug log only; the returned error
  carries the bounded id + type.
- UpdateByQueryParams.Refresh is *bool in opensearch-go v4.6.0 (the same
  shape as DeleteByQuery's quirk), so refresh=wait_for is not
  expressible; we use refresh=true.

Driver-owned audit (DIP)
- A new opensearch.AuditSink interface (with nopSink + WithAuditSink
  functional option) lets the driver emit opensearch.index_created and
  opensearch.reindex_executed events without importing any service
  package; the service layer implements the interface. NewRepository
  takes opts, so existing 4-arg test call sites keep compiling
  unchanged.
- internal/container/audit_sink.go bridges AuditSink to AuditLogService.
  When the context carries no tenant (the env-path registration ctx
  during boot, for example) the adapter skips the emit with a warning
  rather than silently writing tenant_id=0, which would collide with the
  system-scope sentinel.

Frontend + polish
- FieldSchema (frontend/src/api/vector-store.ts) gains
  min/max/enum/immutable. VectorStoreSettings.vue is now schema-driven:
  a closed `enum` renders a t-select; number inputs use the schema's
  `:min`/`:max` and fall back to the legacy replica-vs-shard heuristic
  only when the schema does not pin them; a danger-coloured warning
  fires when insecure_skip_verify is toggled on (the switch and warning
  are wrapped in a vertical stack so the warning sits on its own row
  below the switch).
- i18n: labels for hnsw_m / hnsw_ef_construction / hnsw_ef_search /
  knn_engine / insecure_skip_verify plus the warning copy in en-US,
  ko-KR, zh-CN, ru-RU.
- docker-compose.dev.yml: an opensearch profile (single-node 3.3.2 with
  security plugin disabled for dev only). OpenSearch Dashboards lives in
  a separate, opt-in opensearch-ui profile so the heavy UI container is
  not forced up alongside the cluster (the driver e2e is fully
  curl-verifiable against :9200). The new
  docs/dev/opensearch-integration-test.md covers the end-to-end exercise
  and the single-node guidance (set replicas=0 to keep the cluster
  Green).

Gating-guard tests flipped
- The "OpenSearch is NOT in validEngineTypes / mapping / types list /
  env builder / stubs" guard tests from PR 1 / PR 2 are replaced by
  their positive counterparts in this PR. The test suite was the
  activation checklist; the activation flip is its diff.

Backward compatibility
- Additive everywhere. IndexConfig's new HNSW fields are omitempty so
  other engines' serialised config is byte-identical. Existing
  Elasticsearch / Qdrant / Milvus / Weaviate / Doris / TencentVectorDB
  stores are untouched. No migrations.

Test plan
- go build ./... clean
- go vet ./... clean
- gofmt -l clean on touched files
- go test ./...: only TestOssEnsureBucket_CreateFails (Aliyun OSS
  endpoint), the docreader gRPC tests, and the doris SQL-shape tests
  fail; all three are pre-existing on upstream/main and untouched by
  this PR.
- New tests across internal/types, opensearch, service and container,
  including a full end-to-end env-path test that exercises
  initRetrieveEngineRegistry with RETRIEVE_DRIVER=opensearch against an
  httptest cluster.
195 lines
6.4 KiB
Go
package opensearch

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"strings"

	"github.com/google/uuid"
	osapi "github.com/opensearch-project/opensearch-go/v4/opensearchapi"

	"github.com/Tencent/WeKnora/internal/logger"
	"github.com/Tencent/WeKnora/internal/types"
)

// copyBatchSize is the pagination size for the source scan. Kept under the
// BatchSave per-call cap so each copied page is a single bulk request.
const copyBatchSize = 500

// copySourceDoc is the full _source read during CopyIndices; it includes the
// embedding vector and is_recommended, which the retrieve-path hit struct
// omits because retrieval does not need them.
type copySourceDoc struct {
	Content         string    `json:"content"`
	SourceID        string    `json:"source_id"`
	SourceType      int       `json:"source_type"`
	ChunkID         string    `json:"chunk_id"`
	KnowledgeID     string    `json:"knowledge_id"`
	KnowledgeBaseID string    `json:"knowledge_base_id"`
	TagID           string    `json:"tag_id"`
	IsEnabled       bool      `json:"is_enabled"`
	IsRecommended   bool      `json:"is_recommended"`
	Embedding       []float32 `json:"embedding"`
}

// transformSourceID mirrors the sibling drivers' source_id remap:
//   - regular chunk (source_id == chunk_id) → target chunk id
//   - generated question (source_id == "<chunk>-<q>") → "<targetChunk>-<q>"
//   - anything else → a fresh uuid
func transformSourceID(sourceID, chunkID, targetChunkID string) string {
	switch {
	case sourceID == chunkID:
		return targetChunkID
	case strings.HasPrefix(sourceID, chunkID+"-"):
		return targetChunkID + "-" + strings.TrimPrefix(sourceID, chunkID+"-")
	default:
		return uuid.New().String()
	}
}

// CopyIndices copies all docs of one knowledge base into another (within the
// same store) by scanning the source and re-saving via BatchSave, mirroring
// the Elasticsearch / Qdrant drivers (search→BatchSave), which yields the
// source_id transformation and dim/keyword routing for free. Runs
// synchronously and paginates; the large-batch background-task path is a
// later change.
//
// NOTE: from/size pagination is bounded by the index's max_result_window
// (default 10000). Copies larger than that require the scroll-based async
// path (a later change).
func (r *Repository) CopyIndices(
	ctx context.Context,
	sourceKnowledgeBaseID string,
	sourceToTargetKBIDMap map[string]string, // keyed by source knowledge_id (mirrors sibling drivers)
	sourceToTargetChunkIDMap map[string]string,
	targetKnowledgeBaseID string,
	dimension int,
	knowledgeType string,
) error {
	log := logger.GetLogger(ctx)
	if len(sourceToTargetChunkIDMap) == 0 {
		log.Warn("[OpenSearch] CopyIndices: empty chunk mapping, skipping")
		return nil
	}
	if dimension <= 0 {
		return fmt.Errorf("opensearch: CopyIndices requires dim > 0, got %d: %w",
			dimension, ErrDimensionMismatch)
	}
	if err := r.ensureReady(ctx, dimension); err != nil {
		return err
	}
	alias := r.indexAlias(dimension)

	var total int64
	for from := 0; ; from += copyBatchSize {
		docs, err := r.copyScanBatch(ctx, alias, sourceKnowledgeBaseID, from, copyBatchSize)
		if err != nil {
			return err
		}
		if len(docs) == 0 {
			break
		}
		infos := make([]*types.IndexInfo, 0, len(docs))
		embMap := make(map[string][]float32, len(docs))
		enabledMap := make(map[string]bool, len(docs))
		for i := range docs {
			d := &docs[i]
			targetChunkID, ok := sourceToTargetChunkIDMap[d.ChunkID]
			if !ok {
				log.Warnf("[OpenSearch] CopyIndices: source chunk %s not mapped, skipping", d.ChunkID)
				continue
			}
			targetKnowledgeID, ok := sourceToTargetKBIDMap[d.KnowledgeID]
			if !ok {
				log.Warnf("[OpenSearch] CopyIndices: source knowledge %s not mapped, skipping", d.KnowledgeID)
				continue
			}
			targetSourceID := transformSourceID(d.SourceID, d.ChunkID, targetChunkID)
			if len(d.Embedding) > 0 {
				// BatchSave looks up embeddings by SourceID (lookupEmbedding),
				// so key by the target source id, not the chunk id, which is
				// the Elasticsearch driver's convention.
				embMap[targetSourceID] = d.Embedding
			}
			enabledMap[targetChunkID] = d.IsEnabled
			infos = append(infos, &types.IndexInfo{
				Content:         d.Content,
				SourceID:        targetSourceID,
				SourceType:      types.SourceType(d.SourceType),
				ChunkID:         targetChunkID,
				KnowledgeID:     targetKnowledgeID,
				KnowledgeBaseID: targetKnowledgeBaseID,
				KnowledgeType:   knowledgeType,
				TagID:           d.TagID,
				IsEnabled:       d.IsEnabled,
				IsRecommended:   d.IsRecommended,
			})
		}
		if len(infos) > 0 {
			params := map[string]any{
				"embedding":     embMap,
				"chunk_enabled": enabledMap,
			}
			if err := r.BatchSave(ctx, infos, params); err != nil {
				return fmt.Errorf("opensearch: CopyIndices batch save: %w", err)
			}
			total += int64(len(infos))
		}
		if len(docs) < copyBatchSize {
			break
		}
	}
	log.Infof("[OpenSearch] CopyIndices: copied %d docs (KB %s → %s, dim=%d)",
		total, sourceKnowledgeBaseID, targetKnowledgeBaseID, dimension)
	r.auditSink().EmitReindexExecuted(ctx, alias, alias, total)
	return nil
}

// copyScanBatch reads one page of docs belonging to sourceKB from the per-dim
// index, decoding the full _source (including the embedding vector).
func (r *Repository) copyScanBatch(
	ctx context.Context, index, sourceKB string, from, size int,
) ([]copySourceDoc, error) {
	body, err := json.Marshal(map[string]any{
		"from": from,
		"size": size,
		"query": map[string]any{
			"bool": map[string]any{
				"filter": []any{
					map[string]any{"term": map[string]any{"knowledge_base_id": sourceKB}},
				},
			},
		},
	})
	if err != nil {
		return nil, fmt.Errorf("opensearch: marshal copy scan body: %w", err)
	}
	req := osapi.SearchReq{Indices: []string{index}, Body: bytes.NewReader(body)}
	resp, err := r.client.Search(ctx, &req)
	if err != nil {
		if isNotFound(err) {
			return nil, fmt.Errorf("opensearch: index %s missing: %w", index, ErrIndexNotFound)
		}
		return nil, wrapTransport(err)
	}
	defer drainAndClose(resp.Inspect().Response.Body)
	var parsed struct {
		Hits struct {
			Hits []struct {
				Source copySourceDoc `json:"_source"`
			} `json:"hits"`
		} `json:"hits"`
	}
	if err := json.NewDecoder(io.LimitReader(resp.Inspect().Response.Body, 64<<20)).Decode(&parsed); err != nil {
		return nil, fmt.Errorf("opensearch: parse copy scan response: %w", ErrTransport)
	}
	out := make([]copySourceDoc, len(parsed.Hits.Hits))
	for i, h := range parsed.Hits.Hits {
		out[i] = h.Source
	}
	return out, nil
}
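
The three source_id cases documented on transformSourceID can be tried in isolation with the standalone sketch below. The name `remapSourceID` is hypothetical, and the fallback case deliberately differs from the driver: it returns a fixed placeholder string instead of a fresh UUID, purely so the example is deterministic and needs no third-party import.

```go
package main

import (
	"fmt"
	"strings"
)

// remapSourceID restates the documented remap rules. Unlike the driver's
// transformSourceID, the default branch returns a fixed placeholder rather
// than uuid.New().String(); that substitution is an assumption of this sketch.
func remapSourceID(sourceID, chunkID, targetChunkID string) string {
	switch {
	case sourceID == chunkID:
		// Regular chunk: the source id is the chunk id itself.
		return targetChunkID
	case strings.HasPrefix(sourceID, chunkID+"-"):
		// Generated question: keep the suffix, swap the chunk prefix.
		return targetChunkID + "-" + strings.TrimPrefix(sourceID, chunkID+"-")
	default:
		// Anything else: the driver would mint a fresh UUID here.
		return "<fresh-uuid>"
	}
}

func main() {
	fmt.Println(remapSourceID("c1", "c1", "t9"))    // prints "t9"
	fmt.Println(remapSourceID("c1-q2", "c1", "t9")) // prints "t9-q2"
	fmt.Println(remapSourceID("other", "c1", "t9")) // prints "<fresh-uuid>"
}
```

Because the remapped value is what ends up in IndexInfo.SourceID, it is also the key CopyIndices uses for the embedding map passed to BatchSave.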
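
The NOTE on CopyIndices about max_result_window can be made concrete with a small arithmetic sketch. It assumes the default window of 10000 mentioned in that comment and the copyBatchSize of 500 defined in this file; `pagesWithinWindow` is a hypothetical helper, not part of the driver. OpenSearch rejects a search whose from+size exceeds the window, so only pages with from+size <= window are usable.

```go
package main

import "fmt"

// pagesWithinWindow returns how many from/size pages of the given size fit
// inside the result window, i.e. the count of pages with from+size <= window
// when from advances in steps of size starting at 0.
func pagesWithinWindow(window, size int) int {
	if size <= 0 || window < size {
		return 0
	}
	return window / size
}

func main() {
	const maxResultWindow = 10000 // default quoted in the CopyIndices NOTE
	const copyBatchSize = 500     // page size used by the source scan

	pages := pagesWithinWindow(maxResultWindow, copyBatchSize)
	fmt.Println(pages)                     // prints "20"
	fmt.Println((pages - 1) * copyBatchSize) // prints "9500", the last usable from
	fmt.Println(pages * copyBatchSize)     // prints "10000", docs reachable by from/size
}
```

Read against the loop in CopyIndices, this suggests that a knowledge base with 10000 or more documents in one per-dimension index would reach from=10000 after twenty full pages and issue a request past the window, which is the case the later scroll-based path is meant to cover.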