ochan.kwon 11c3236e52 feat(retriever): add OpenSearch driver skeleton + interface stubs (PR 2a of 3)
First half of the gated OpenSearch k-NN driver introduced in PR 1
(#1445) by way of #1440. PR 2a ships a hollow, interface-compliant
shell of the `internal/application/repository/retriever/opensearch/`
package — every behavioural method (Save / BatchSave / DeleteBy* /
Retrieve, plus the previously-stubbed CopyIndices / BatchUpdate* /
EstimateStorageSize / swapToVersion) returns `ErrFeatureNotEnabled`
or a conservative sentinel value. PR 2b lands the real read/write
implementations in dedicated files (`query.go` + `retrieve.go` +
`crud.go`) and replaces the stubs accordingly.
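
As a rough illustration of the stub pattern described above, the sketch below shows gated methods returning the sentinel and a caller detecting it through a wrapped error. The `Repository` stand-in, the simplified method signatures, the `trySave` and `isGated` helpers, and the sentinel's message text are assumptions for illustration; the real signatures follow the retriever interface and are not shown in this message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrFeatureNotEnabled mirrors the sentinel named above; the message text is assumed.
var ErrFeatureNotEnabled = errors.New("opensearch: feature not enabled")

// Repository is a hypothetical stand-in for the gated driver.
type Repository struct{}

// Save is stubbed: it performs no I/O and reports that the feature is gated.
func (r *Repository) Save(ctx context.Context, doc map[string]any) error {
	return ErrFeatureNotEnabled
}

// Retrieve is stubbed the same way and returns no results.
func (r *Repository) Retrieve(ctx context.Context, query string) ([]string, error) {
	return nil, ErrFeatureNotEnabled
}

// isGated reports whether err, possibly wrapped, is the gate sentinel.
func isGated(err error) bool {
	return errors.Is(err, ErrFeatureNotEnabled)
}

// trySave calls the stub and wraps its error the way a service layer might.
func trySave() error {
	r := &Repository{}
	if err := r.Save(context.Background(), map[string]any{"id": "chunk-1"}); err != nil {
		return fmt.Errorf("save chunk: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(isGated(trySave())) // prints "true"
}
```

Because the stubs return the sentinel itself, callers can match it with errors.Is even after wrapping, which is presumably what lets PR 3's engine factory map it to a typed error later.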

Strict feature-gate (unchanged from PR 1): no entry is added to
validEngineTypes / GetVectorStoreTypes / retrieverEngineMapping /
BuildEnvVectorStores / container env path / engine factory switch,
so the driver remains unreachable. Attempting to register an
`engine_type=opensearch` VectorStore continues to fail with the
existing "not a valid engine type" error.
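
The gate can be pictured as a plain registry lookup. This is a minimal sketch, not the project's code: `validEngineTypes` is named above, but the `validateEngineType` helper, the engine strings in the map, and the error wording beyond the quoted phrase are assumptions.

```go
package main

import "fmt"

// validEngineTypes lists the engine_type values the registry accepts.
// The entries are illustrative; "opensearch" is deliberately absent.
var validEngineTypes = map[string]bool{
	"elasticsearch": true,
	"postgres":      true,
	"qdrant":        true,
	"milvus":        true,
}

// validateEngineType rejects any engine that has no registry entry.
func validateEngineType(engineType string) error {
	if !validEngineTypes[engineType] {
		return fmt.Errorf("%q is not a valid engine type", engineType)
	}
	return nil
}

func main() {
	fmt.Println(validateEngineType("qdrant") == nil)     // prints "true"
	fmt.Println(validateEngineType("opensearch") == nil) // prints "false"
}
```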

What lands in PR 2a
-------------------

Driver skeleton (6 production files + 2 test files, ~1170 + ~1115 LoC):

- `repository.go` — Repository struct + NewRepository constructor
  that validates cluster reachability + OS version (2.4+ / 3.x;
  primary tested 3.3.2) + k-NN plugin presence on every cluster
  node. sync.Once-guarded ensureReady(ctx, dim) for lazy per-
  dimension index creation, with transient errors not cached so a
  momentary cluster blip does not permanently poison a dim.
  sanitizeIndexName enforces a strict OS-compatible name spec.
  probeVersion uses robust strings.Split/Atoi parsing for
  pre-release suffixes and missing-patch versions. EngineType
  returns the PR 1 constant; Support returns [keywords, vector].
- `transport.go` — newOpenSearchClient ships TLS posture
  (MinVersion TLS 1.2, opt-in InsecureSkipVerify, forward-secrecy-
  only cipher list) and transport tuning for the driver. Caller
  exists only in PR 3 (container.go + engine_factory.go); PR 2a
  remains gated dead code.
- `mapping.go` — buildIndexMapping(cfg, dim) produces the full
  knn_vector + HNSW + content-analyzer mapping with every *_id
  field as an explicit keyword and source_type as integer.
  buildKeywordsMapping ships the dim-less keyword-only index
  mapping used by the no-embedding save path. createIndexAndAlias
  creates <alias>_v1 and aliases <alias> to it, with best-effort
  orphan cleanup and mapping-drift detection.
- `config.go` — internalCfg (value type) applying OpenSearch
  defaults (hnsw_m=16, ef_construction=100, ef_search=100,
  shards=4, replicas=1, engine=lucene).
- `errors.go` — nine sentinels (ErrIndexNotFound,
  ErrDimensionMismatch, ErrAuth, ErrTransport,
  ErrVersionUnsupported, ErrConfigInvalid, ErrFeatureNotEnabled,
  ErrBatchTooLarge, ErrCircuitBreaker). Repository never imports
  apperrors; PR 3's engine factory wraps these to typed AppError
  2200/2201.
- `stubs.go` — every behavioural method returns
  ErrFeatureNotEnabled, except EstimateStorageSize, which returns a
  conservative HNSW lower-bound estimate (not 0) so the Phase 2
  KB-delete guard fails closed for non-empty KBs.
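
The version check in `repository.go` can be sketched roughly as below. The helper names `parseMajorMinor` and `supportedVersion` are hypothetical, and the real probeVersion may differ in detail; the accepted range (2.4+ or 3.x) and the handling of pre-release suffixes and missing patch components come from the description above. Parsing with strconv.Atoi matters because a string comparison would rank "2.11" below "2.4".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor components from a version
// string such as "2.11.0", "3.0.0-rc1" or "2.5" (missing patch).
func parseMajorMinor(v string) (major, minor int, err error) {
	// Drop any pre-release or build suffix ("-rc1", "-SNAPSHOT", "+build").
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("bad major in %q: %w", v, err)
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("bad minor in %q: %w", v, err)
	}
	return major, minor, nil
}

// supportedVersion accepts OpenSearch 2.4 and later 2.x releases, plus 3.x.
func supportedVersion(v string) bool {
	major, minor, err := parseMajorMinor(v)
	if err != nil {
		return false
	}
	return (major == 2 && minor >= 4) || major == 3
}

func main() {
	for _, v := range []string{"1.3.20", "2.2.0", "2.5", "2.11.0", "3.0.0-rc1"} {
		fmt.Printf("%-10s supported=%v\n", v, supportedVersion(v))
	}
}
```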

Tests (~1115 LoC, 50 cases):

- `repository_test.go` — interface satisfaction, sentinel mapping,
  sanitizeIndexName positive/negative matrix, semver parsing
  (pre-release / missing-patch), buildIndexMapping JSON shape pin
  (Lucene + Faiss + Keywords), probeVersion matrix (OS 1.x / 2.2 /
  2.5 / 2.11 / 3.x / 3.0.0-rc1 / ES rejection), probeKNNPlugin
  multi-node coverage, ensureReady concurrency + per-dim isolation
  + transient retry, NewRepository storeID validation, all 11
  stubs (CopyIndices, BatchUpdate*, EstimateStorageSize,
  SwapToVersion + Save / BatchSave / Retrieve / DeleteBy*),
  wrapTransport sentinel mapping + leak guard, isNotFound /
  isAlreadyExistsError, drainAndClose / limitedDecode helpers.
- `transport_test.go` — TLS defaults / opt-in InsecureSkipVerify /
  TLS 1.2 pinning / cipher list / transport tuning.
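
For orientation, the TLS posture that `transport_test.go` pins might look like the following. This is a sketch under assumptions: the function names `newTLSConfig` and `onlyForwardSecret` and the exact cipher list are illustrative, not taken from the driver; only the TLS 1.2 floor, the opt-in InsecureSkipVerify, and the forward-secrecy-only requirement come from the description above.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"strings"
)

// newTLSConfig builds a client TLS configuration. Certificate verification
// stays on unless the caller explicitly opts out.
func newTLSConfig(insecureSkipVerify bool) *tls.Config {
	return &tls.Config{
		MinVersion:         tls.VersionTLS12,
		InsecureSkipVerify: insecureSkipVerify,
		// This list applies to TLS 1.2 only. Go does not allow configuring
		// TLS 1.3 suites, all of which already provide forward secrecy.
		CipherSuites: []uint16{
			tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
			tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
			tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,
			tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,
		},
	}
}

// onlyForwardSecret reports whether every configured suite uses an
// ephemeral ECDHE key exchange.
func onlyForwardSecret(cfg *tls.Config) bool {
	for _, id := range cfg.CipherSuites {
		if !strings.HasPrefix(tls.CipherSuiteName(id), "TLS_ECDHE_") {
			return false
		}
	}
	return len(cfg.CipherSuites) > 0
}

func main() {
	cfg := newTLSConfig(false)
	// prints "true false true"
	fmt.Println(cfg.MinVersion == tls.VersionTLS12, cfg.InsecureSkipVerify, onlyForwardSecret(cfg))
}
```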

Single dependency addition: github.com/opensearch-project/
opensearch-go/v4 v4.6.0 in go.mod/go.sum.

SDK quirks discovered (opensearch-go v4.6.0)
--------------------------------------------

PR 2a includes the workarounds for two of the three SDK limitations
found during the full implementation (the third, Refresh:*bool,
only affects the delete path that ships in PR 2b):

- AliasExists method passes dataPointer=nil to its internal do(),
  which means non-2xx responses come back as a plain
  *errors.errorString ("status: 404 Not Found") rather than as
  *opensearch.StructError. aliasExists therefore inspects
  resp.StatusCode directly (resp is returned even when err is
  non-nil) and only falls back to wrapTransport for the "no
  response at all" case.
- sync.Once cannot be reset, and the standard library has no
  resettable variant; the keyword-only index therefore uses a
  mutex + ready/err flag pattern so transient failures can be
  retried by the next caller. The per-dimension path uses the
  `once map[int]*sync.Once` delete-and-recreate trick.
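
The delete-and-recreate idea for the per-dimension path can be sketched as follows. This is a variant under assumptions, not the driver's code: it keeps the error next to each Once in a hypothetical `dimState` struct (the driver is described as using a bare `map[int]*sync.Once`), the `readiness` type and `create` callback are invented for illustration, and it retries after any error rather than distinguishing transient from permanent failures.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBlip stands in for a transient cluster error.
var errBlip = errors.New("transient cluster error")

// dimState pairs a Once with the result of the single attempt it guards.
type dimState struct {
	once sync.Once
	err  error
}

// readiness tracks index creation per embedding dimension.
type readiness struct {
	mu     sync.Mutex
	states map[int]*dimState
	create func(dim int) error // hypothetical index-creation callback
}

func newReadiness(create func(dim int) error) *readiness {
	return &readiness{states: make(map[int]*dimState), create: create}
}

// ensureReady runs create at most once per dimension while it succeeds.
// On failure the entry is dropped so the next caller gets a fresh Once.
func (r *readiness) ensureReady(dim int) error {
	r.mu.Lock()
	st, ok := r.states[dim]
	if !ok {
		st = &dimState{}
		r.states[dim] = st
	}
	r.mu.Unlock()

	// Do returns only after the guarded call has completed, so st.err is
	// safe to read afterwards, including for callers that waited.
	st.once.Do(func() { st.err = r.create(dim) })
	if st.err != nil {
		r.mu.Lock()
		if r.states[dim] == st { // do not clobber a newer entry
			delete(r.states, dim)
		}
		r.mu.Unlock()
	}
	return st.err
}

func main() {
	attempts := 0
	r := newReadiness(func(dim int) error {
		attempts++
		if attempts == 1 {
			return errBlip
		}
		return nil
	})
	fmt.Println(r.ensureReady(768)) // prints "transient cluster error"
	fmt.Println(r.ensureReady(768)) // prints "<nil>" after a retry
	fmt.Println(attempts)           // prints "2"
}
```

Concurrent callers for the same dimension share one attempt and all observe its error; only callers arriving after the failed entry is removed trigger a new attempt.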

Test fixes folded in
--------------------

While doing a full `go test ./...` against PR 1-merged main, two
deterministic regressions surfaced that block a clean run-everything
signal. Both are unrelated to the driver and are folded into PR 2a
so the PR's own CI run is green:

(1) Follow-up to #1445 — fanout test missed the new normalizer policy
    (internal/application/service/knowledgebase_search_fanout_test.go,
    +46 / -6). #1445 changed EngineAwareNormalizer for ES /
    ElasticFaiss / OpenSearch / Weaviate / Postgres / SQLite /
    Qdrant / TencentVectorDB / Doris from (score+1)/2 to clamp01
    passthrough (those engines surface non-negative cosine to the
    normalizer per Lucene script_score non-negative invariant for
    ES, k-NN plugin SpaceType.COSINESIMIL.scoreTranslation for
    OpenSearch, engine-internal or IR-normalized conversions for
    the rest). Milvus is now the only engine that still surfaces
    raw signed cosine in [-1, 1].

    TestRetrieveFromStores_MixedEngine_Normalizes still asserted
    the old cosine-shift behaviour for ES (raw -0.4 → expected 0.3)
    which under passthrough now becomes clamp01(-0.4) = 0. The
    normalizer's own _test.go was updated at #1445 time, but this
    fan-out integration test was not.

    Fix: rewrite the godoc to spell out the two engine groups;
    restate sub-case 2 as ES passthrough on a production-possible
    mid-range cosine (0.3 → 0.3, PG out-ranks ES); add sub-case 3
    pinning the cosine-shift branch via Milvus -0.4 → 0.3.
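
The two normalizer policies in (1) and the numbers used by the test can be reproduced with a few lines. This is a simplified sketch: the `normalize` function and the engine strings are assumptions, and the real EngineAwareNormalizer is keyed on engine-type constants rather than strings.

```go
package main

import (
	"fmt"
	"math"
)

// clamp01 limits a score to the [0, 1] range.
func clamp01(x float64) float64 {
	return math.Min(1, math.Max(0, x))
}

// normalize applies the cosine shift only to engines that surface raw
// signed cosine (Milvus, per the text above); every other engine passes
// through clamp01.
func normalize(engine string, score float64) float64 {
	if engine == "milvus" {
		return (score + 1) / 2
	}
	return clamp01(score)
}

// approx compares floats with a small tolerance.
func approx(a, b float64) bool {
	return math.Abs(a-b) < 1e-9
}

func main() {
	fmt.Printf("%.2f\n", normalize("elasticsearch", -0.4)) // prints "0.00"
	fmt.Printf("%.2f\n", normalize("elasticsearch", 0.3))  // prints "0.30"
	fmt.Printf("%.2f\n", normalize("milvus", -0.4))        // prints "0.30"
}
```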

(2) Pre-existing — SSRF whitelist singleton race surfaced by this run
    (internal/utils/security.go + internal/utils/security_test.go +
    internal/infrastructure/web_search/searxng_test.go,
    +33 / -9). loadSSRFWhitelist in internal/utils/security.go is
    cached via sync.Once on first call. The internal reset helper
    resetSSRFWhitelistForTest was unexported, so tests in other
    packages could not reset and saw whatever whitelist was cached
    by the first sync.Once.Do() in the same test binary. In
    internal/infrastructure/web_search/, TestValidateProxyURL runs
    before TestValidateSearxngBaseURL alphabetically and exercises
    ValidateURLForSSRF with no SSRF_WHITELIST set, caching an empty
    whitelist; the later setenv in searxng_test then has no effect
    and 127.0.0.1 is rejected with "hostname 127.0.0.1 is restricted".
    Pre-existing on main; surfaced now because this PR was the
    first to do a full `go test ./...` run on top of #1445.

    Fix: export the helper as ResetSSRFWhitelistForTest (the
    ForTest suffix is the test-only contract); update in-package
    callers; in web_search/searxng_test.go import internal/utils
    and call ResetSSRFWhitelistForTest around the env mutation in
    both TestValidateSearxngBaseURL and TestSearxngProvider_Search.
    No production code path changes.
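
The caching problem in (2) and the effect of the reset helper can be reproduced in isolation. The sketch below is an assumption-laden miniature, not the code in internal/utils/security.go: the comma-separated format of SSRF_WHITELIST, the `isWhitelisted` and `demo` helpers, and the map-based storage are all invented for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"sync"
)

var (
	whitelistOnce sync.Once
	whitelist     map[string]bool
)

// loadSSRFWhitelist reads SSRF_WHITELIST once and caches the result for
// the lifetime of the process (or test binary).
func loadSSRFWhitelist() map[string]bool {
	whitelistOnce.Do(func() {
		whitelist = map[string]bool{}
		for _, h := range strings.Split(os.Getenv("SSRF_WHITELIST"), ",") {
			if h = strings.TrimSpace(h); h != "" {
				whitelist[h] = true
			}
		}
	})
	return whitelist
}

// ResetSSRFWhitelistForTest drops the cached whitelist so the next lookup
// re-reads the environment. Test-only; not safe to call concurrently with
// lookups.
func ResetSSRFWhitelistForTest() {
	whitelistOnce = sync.Once{}
	whitelist = nil
}

func isWhitelisted(host string) bool {
	return loadSSRFWhitelist()[host]
}

// demo replays the ordering issue: a lookup caches an empty whitelist, a
// later setenv is ignored until the cache is reset.
func demo() (stale, fresh bool) {
	ResetSSRFWhitelistForTest()
	os.Unsetenv("SSRF_WHITELIST")
	isWhitelisted("127.0.0.1") // first lookup caches the empty whitelist

	os.Setenv("SSRF_WHITELIST", "127.0.0.1")
	stale = isWhitelisted("127.0.0.1") // still served from the stale cache

	ResetSSRFWhitelistForTest()
	fresh = isWhitelisted("127.0.0.1") // re-read after the reset
	return stale, fresh
}

func main() {
	fmt.Println(demo()) // prints "false true"
}
```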

Roadmap
-------

- PR 2b (next, depends on this PR) — read/write implementations:
  query.go + retrieve.go + crud.go land their real bodies; stubs
  for Save / BatchSave / DeleteBy* / Retrieve in stubs.go are
  removed; corresponding CRUD/retrieve/filter test cases (~430
  LoC) join repository_test.go.
- PR 3 — activation switch + async paths (CopyIndices,
  BatchUpdate*, large-batch async deletes) + i18n + docker-compose
  dev profile. After PR 3 merges, the OpenSearch driver becomes
  reachable via either `engine_type=opensearch` VectorStore or
  `RETRIEVE_DRIVER=opensearch` env.

Backward compatibility
----------------------

- New package: additive only. No existing file modified except
  go.mod / go.sum, the three test files in (1)/(2), and the
  test-only export rename in utils/security.go.
- Driver is unreachable: no registry path activates it.
- No SQL migration.
- The PR 1 normalizer case for OpenSearch remains unreachable
  here (no driver instance produces a result yet).

Test plan
---------

- [x] go build ./... clean
- [x] go vet ./... clean
- [x] go test -race -count=1 ./internal/application/repository/retriever/opensearch/... passes
- [x] grep -r "case types.OpenSearchRetrieverEngineType" internal/
      shows only PR 1's normalizer case + this driver's EngineType()
      and tests — no activation path.
- [x] grep -r "case \"opensearch\"" internal/ shows no hits.
2026-05-26 20:54:58 +08:00