mirror of
https://github.com/Tencent/WeKnora.git
synced 2026-06-04 13:30:32 +08:00
First half of the gated OpenSearch k-NN driver introduced in PR 1 (#1445) by way of #1440. PR 2a ships a hollow, interface-compliant shell of the `internal/application/repository/retriever/opensearch/` package: every behavioural method (Save / BatchSave / DeleteBy* / Retrieve, plus the previously-stubbed CopyIndices / BatchUpdate* / EstimateStorageSize / swapToVersion) returns `ErrFeatureNotEnabled` or a conservative sentinel value. PR 2b lands the real read/write implementations in dedicated files (`query.go` + `retrieve.go` + `crud.go`) and replaces the stubs accordingly.

Strict feature-gate (unchanged from PR 1): no entry is added to validEngineTypes / GetVectorStoreTypes / retrieverEngineMapping / BuildEnvVectorStores / container env path / engine factory switch, so the driver remains unreachable. Attempting to register an `engine_type=opensearch` VectorStore continues to fail with the existing "not a valid engine type" error.

What lands in PR 2a
-------------------

Driver skeleton (6 production files + 2 test files, ~1170 + ~1115 LoC):

- `repository.go`: Repository struct + NewRepository constructor that validates cluster reachability + OS version (2.4+ / 3.x; primary tested 3.3.2) + k-NN plugin presence on every cluster node. sync.Once-guarded ensureReady(ctx, dim) for lazy per-dimension index creation, with transient errors not cached so a momentary cluster blip does not permanently poison a dim. sanitizeIndexName enforces a strict OS-compatible name spec. probeVersion uses robust strings.Split/Atoi parsing for pre-release suffixes and missing-patch versions. EngineType returns the PR 1 constant; Support returns [keywords, vector].
- `transport.go`: newOpenSearchClient ships TLS posture (MinVersion TLS 1.2, opt-in InsecureSkipVerify, forward-secrecy-only cipher list) and transport tuning for the driver. Caller exists only in PR 3 (container.go + engine_factory.go); PR 2a remains gated dead code.
- `mapping.go`: buildIndexMapping(cfg, dim) produces the full knn_vector + HNSW + content-analyzer mapping with every *_id field as an explicit keyword and source_type as integer. buildKeywordsMapping ships the dim-less keyword-only index mapping used by the no-embedding save path. createIndexAndAlias creates <alias>_v1 and aliases <alias> to it, with best-effort orphan cleanup and mapping-drift detection.
- `config.go`: internalCfg (value type) applying OpenSearch defaults (hnsw_m=16, ef_construction=100, ef_search=100, shards=4, replicas=1, engine=lucene).
- `errors.go`: nine sentinels (ErrIndexNotFound, ErrDimensionMismatch, ErrAuth, ErrTransport, ErrVersionUnsupported, ErrConfigInvalid, ErrFeatureNotEnabled, ErrBatchTooLarge, ErrCircuitBreaker). Repository never imports apperrors; PR 3's engine factory wraps these to typed AppError 2200/2201.
- `stubs.go`: every behavioural method returns ErrFeatureNotEnabled. EstimateStorageSize returns a conservative HNSW lower-bound estimate (not 0) so the Phase 2 KB-delete guard fails-closed for non-empty KBs.

Tests (~1115 LoC, 50 cases):

- `repository_test.go`: interface satisfaction, sentinel mapping, sanitizeIndexName positive/negative matrix, semver parsing (pre-release / missing-patch), buildIndexMapping JSON shape pin (Lucene + Faiss + Keywords), probeVersion matrix (OS 1.x / 2.2 / 2.5 / 2.11 / 3.x / 3.0.0-rc1 / ES rejection), probeKNNPlugin multi-node coverage, ensureReady concurrency + per-dim isolation + transient retry, NewRepository storeID validation, all 11 stubs (CopyIndices, BatchUpdate*, EstimateStorageSize, SwapToVersion + Save / BatchSave / Retrieve / DeleteBy*), wrapTransport sentinel mapping + leak guard, isNotFound / isAlreadyExistsError, drainAndClose / limitedDecode helpers.
- `transport_test.go`: TLS defaults / opt-in InsecureSkipVerify / TLS 1.2 pinning / cipher list / transport tuning.

Single dependency addition: github.com/opensearch-project/opensearch-go/v4 v4.6.0 in go.mod/go.sum.
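
For reviewers who want a feel for the version gate described under `repository.go`, here is a minimal sketch of Split/Atoi-based parsing that tolerates pre-release suffixes and a missing patch component. It is not the driver's probeVersion: `parseVersion` and `supported` are hypothetical names, and treating anything outside 2.4+ / 3.x as unsupported is an assumption drawn from the range quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion is a hypothetical helper (not the driver's probeVersion).
// It accepts "3.3.2", "3.0.0-rc1" (pre-release) and "2.11" (missing patch).
func parseVersion(s string) (major, minor, patch int, err error) {
	// Drop a pre-release or build suffix such as "-rc1" or "+build5".
	if i := strings.IndexAny(s, "-+"); i >= 0 {
		s = s[:i]
	}
	parts := strings.Split(s, ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, 0, 0, fmt.Errorf("unexpected version %q", s)
	}
	nums := make([]int, 3) // a missing patch component stays 0
	for i, p := range parts {
		n, convErr := strconv.Atoi(p)
		if convErr != nil {
			return 0, 0, 0, fmt.Errorf("bad component %q in version %q", p, s)
		}
		nums[i] = n
	}
	return nums[0], nums[1], nums[2], nil
}

// supported applies the floor quoted above (2.4+ or 3.x). Rejecting every
// other major version is an assumption of this sketch.
func supported(major, minor int) bool {
	return (major == 2 && minor >= 4) || major == 3
}

func main() {
	for _, v := range []string{"1.3.0", "2.2.0", "2.5", "2.11.0", "3.0.0-rc1", "3.3.2"} {
		major, minor, patch, err := parseVersion(v)
		if err != nil {
			fmt.Println(v, "->", err)
			continue
		}
		fmt.Printf("%s -> %d.%d.%d supported=%v\n", v, major, minor, patch, supported(major, minor))
	}
}
```

Comparing parsed integers rather than strings is what lets 2.11 pass a 2.4 floor; a lexical comparison would order "2.11" before "2.4".
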

SDK quirks discovered (opensearch-go v4.6.0)
--------------------------------------------

PR 2a includes the workarounds for two of three SDK limitations that landed during full implementation (the third, Refresh:*bool, only affects the delete path that ships in PR 2b):

- AliasExists method passes dataPointer=nil to its internal do(), which means non-2xx responses come back as a plain *errors.errorString ("status: 404 Not Found") rather than as *opensearch.StructError. aliasExists therefore inspects resp.StatusCode directly (resp is returned even when err is non-nil) and only falls back to wrapTransport for the "no response at all" case.
- sync.OnceReset is not in the standard library; the keyword-only index uses a mutex + ready/err flag pattern so transient failures can be retried by the next caller. The per-dimension path uses the `once map[int]*sync.Once` delete-and-recreate trick.

Test fixes folded in
--------------------

While doing a full `go test ./...` against PR 1-merged main, two deterministic regressions surfaced that block a clean run-everything signal. Both are unrelated to the driver and are folded into PR 2a so the PR's own CI run is green:

(1) Follow-up to #1445: fanout test missed the new normalizer policy (internal/application/service/knowledgebase_search_fanout_test.go, +46 / -6).

#1445 changed EngineAwareNormalizer for ES / ElasticFaiss / OpenSearch / Weaviate / Postgres / SQLite / Qdrant / TencentVectorDB / Doris from (score+1)/2 to clamp01 passthrough (those engines surface non-negative cosine to the normalizer per Lucene script_score non-negative invariant for ES, k-NN plugin SpaceType.COSINESIMIL.scoreTranslation for OpenSearch, engine-internal or IR-normalized conversions for the rest). Milvus is now the only engine that still surfaces raw signed cosine in [-1, 1].
TestRetrieveFromStores_MixedEngine_Normalizes still asserted the old cosine-shift behaviour for ES (raw -0.4 → expected 0.3) which under passthrough now becomes clamp01(-0.4) = 0. The normalizer's own _test.go was updated at #1445 time, but this fan-out integration test was not.

Fix: rewrite the godoc to spell out the two engine groups; restate sub-case 2 as ES passthrough on a production-possible mid-range cosine (0.3 → 0.3, PG out-ranks ES); add sub-case 3 pinning the cosine-shift branch via Milvus -0.4 → 0.3.

(2) Pre-existing: SSRF whitelist singleton race surfaced by this run (internal/utils/security.go + internal/utils/security_test.go + internal/infrastructure/web_search/searxng_test.go, +33 / -9).

loadSSRFWhitelist in internal/utils/security.go is cached via sync.Once on first call. The internal reset helper resetSSRFWhitelistForTest was unexported, so tests in other packages could not reset and saw whatever whitelist was cached by the first sync.Once.Do() in the same test binary. In internal/infrastructure/web_search/, TestValidateProxyURL runs before TestValidateSearxngBaseURL alphabetically and exercises ValidateURLForSSRF with no SSRF_WHITELIST set, caching an empty whitelist; the later setenv in searxng_test then has no effect and 127.0.0.1 is rejected with "hostname 127.0.0.1 is restricted". Pre-existing on main; surfaced now because this PR was the first to do a full `go test ./...` run on top of #1445.

Fix: capitalize the helper to ResetSSRFWhitelistForTest (the ForTest suffix is the test-only contract); update in-package callers; in web_search/searxng_test.go import internal/utils and call ResetSSRFWhitelistForTest around the env mutation in both TestValidateSearxngBaseURL and TestSearxngProvider_Search. No production code path changes.
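
The arithmetic behind fix (1) can be checked with a small sketch of the two normalizer groups. This is not the repository's EngineAwareNormalizer: the `normalize` function and the plain-string engine names are hypothetical, and only the two formulas (clamp01 passthrough and (score+1)/2) come from the description above.

```go
package main

import "fmt"

// clamp01 restricts a score to the range [0, 1].
func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// shiftCosine maps a signed cosine in [-1, 1] onto [0, 1].
func shiftCosine(x float64) float64 {
	return (x + 1) / 2
}

// normalize is a hypothetical stand-in for the per-engine policy:
// Milvus keeps the cosine shift, every other engine is passed through.
func normalize(engine string, raw float64) float64 {
	if engine == "milvus" {
		return shiftCosine(raw)
	}
	return clamp01(raw)
}

func main() {
	cases := []struct {
		engine string
		raw    float64
	}{
		{"elasticsearch", -0.4}, // old assertion expected 0.3 here
		{"elasticsearch", 0.3},  // restated sub-case 2
		{"milvus", -0.4},        // new sub-case 3
	}
	for _, c := range cases {
		fmt.Printf("%s raw=%.2f normalized=%.2f\n", c.engine, c.raw, normalize(c.engine, c.raw))
	}
}
```
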

Roadmap
-------

- PR 2b (next, depends on this PR), read/write implementations: query.go + retrieve.go + crud.go land their real bodies; stubs for Save / BatchSave / DeleteBy* / Retrieve in stubs.go are removed; corresponding CRUD/retrieve/filter test cases (~430 LoC) join repository_test.go.
- PR 3: activation switch + async paths (CopyIndices, BatchUpdate*, large-batch async deletes) + i18n + docker-compose dev profile. After PR 3 merges, the OpenSearch driver becomes reachable via either `engine_type=opensearch` VectorStore or `RETRIEVE_DRIVER=opensearch` env.

Backward compatibility
----------------------

- New package, additive only. No existing file modified except go.mod / go.sum, the two test files in (1)/(2), and the test-only export rename in utils/security.go.
- Driver is unreachable: no registry path activates it.
- No SQL migration.
- The PR 1 normalizer case for OpenSearch remains unreachable here (no driver instance produces a result yet).

Test plan
---------

- [x] go build ./... clean
- [x] go vet ./... clean
- [x] go test -race -count=1 ./internal/application/repository/retriever/opensearch/... passes
- [x] grep -r "case types.OpenSearchRetrieverEngineType" internal/ shows only PR 1's normalizer case + this driver's EngineType() and tests; no activation path.
- [x] grep -r "case \"opensearch\"" internal/ shows no hits.