WeKnora

mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Files

young1lin 29820e4cac docs(chat): clarify cached-token semantics for explicit-cache providers

`cached_tokens` is reported by every OpenAI-compatible provider that
supports prompt caching, but how it becomes non-zero differs by mode:

- Implicit caching (OpenAI, Azure OpenAI, DeepSeek, …) populates the
  field automatically whenever a prompt prefix matches a previous
  request within the provider's cache TTL. No client-side opt-in.

- Explicit caching (Qwen on Aliyun, Anthropic Claude, …) only
  populates the field after the caller attaches `cache_control:
  {"type": "ephemeral"}` to the relevant message / content block.
  Until that opt-in is applied upstream of the request, the field
  stays zero even when the prefix is otherwise byte-stable.
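
The opt-in described for explicit-cache providers might look like the following Go sketch. The `ContentBlock` and `CacheControl` types and the `markEphemeral` and `encode` helpers are hypothetical and not taken from WeKnora; only the `cache_control: {"type": "ephemeral"}` payload comes from the text above. Where a given provider expects the marker should be checked against that provider's documentation.

```go
package main

import "encoding/json"

// CacheControl mirrors the opt-in marker quoted in the commit message.
type CacheControl struct {
	Type string `json:"type"`
}

// ContentBlock is a hypothetical text content block. The field names are an
// assumption for illustration, not WeKnora's actual request types.
type ContentBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text"`
	CacheControl *CacheControl `json:"cache_control,omitempty"`
}

// markEphemeral returns a copy of the block that carries the explicit-cache
// opt-in. The caller's original block is left unchanged.
func markEphemeral(b ContentBlock) ContentBlock {
	b.CacheControl = &CacheControl{Type: "ephemeral"}
	return b
}

// encode serializes a block so the presence or absence of the marker is
// visible in the outgoing JSON. It returns "" if marshaling fails.
func encode(b ContentBlock) string {
	out, err := json.Marshal(b)
	if err != nil {
		return ""
	}
	return string(out)
}
```

Without the call to `markEphemeral`, the `cache_control` key is omitted from the JSON entirely, which is the situation in which an explicit-cache provider keeps reporting zero cached tokens.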

Without this distinction documented, the previous commit reads as if
`TokenUsage.CachedTokens` will show non-zero values for Qwen / Claude
once this PR lands — which is not the case. The plumbing here is a
prerequisite (stable prefix via sorted tools) and a meter (visibility
of the field), but the explicit-cache opt-in itself is out of scope
and lives elsewhere.
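
As a rough illustration of the "stable prefix via sorted tools" prerequisite, the sketch below orders tool definitions by name before they are serialized into a request. The `Tool` type and the `sortedTools` helper are hypothetical; the actual WeKnora types and sort key may differ. The idea is that a deterministic tool order keeps the prompt prefix byte-identical across requests, for example when tools are gathered from a Go map, whose iteration order is not stable.

```go
package main

import "sort"

// Tool is a hypothetical tool definition; only Name matters for ordering.
type Tool struct {
	Name        string
	Description string
}

// sortedTools returns a copy of tools ordered by name. The input slice is
// left untouched so callers that rely on its order are not affected.
func sortedTools(tools []Tool) []Tool {
	out := make([]Tool, len(tools))
	copy(out, tools)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Name < out[j].Name
	})
	return out
}
```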

Document this on `TokenUsage.CachedTokens` and the `cachedTokens`
helper so callers do not mistake observability for activation.
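
A minimal sketch of the metering side, assuming the provider reports the count at `usage.prompt_tokens_details.cached_tokens` as in the OpenAI-style response shape. The `usage` struct and the `cachedTokensFrom` function are hypothetical names for illustration and are not the repository's `cachedTokens` helper.

```go
package main

import "encoding/json"

// promptTokensDetails and usage model only the fields this sketch reads.
// The JSON shape is an assumption based on the OpenAI-style usage object.
type promptTokensDetails struct {
	CachedTokens int `json:"cached_tokens"`
}

type usage struct {
	PromptTokens        int                  `json:"prompt_tokens"`
	PromptTokensDetails *promptTokensDetails `json:"prompt_tokens_details"`
}

// cachedTokensFrom extracts the cached-token count from a raw usage object.
// It returns 0 when the details object is absent. A zero result therefore
// does not tell the caller whether the cache missed or was never enabled.
func cachedTokensFrom(raw []byte) (int, error) {
	var u usage
	if err := json.Unmarshal(raw, &u); err != nil {
		return 0, err
	}
	if u.PromptTokensDetails == nil {
		return 0, nil
	}
	return u.PromptTokensDetails.CachedTokens, nil
}
```

The function only observes what the provider reports. For an explicit-cache provider it keeps returning zero until the `cache_control` opt-in is attached elsewhere in the request path.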

2026-05-25 16:47:14 +08:00

asr

feat(observability): integrate Langfuse for LLM token tracking and tracing

2026-04-24 10:29:19 +08:00

chat

docs(chat): clarify cached-token semantics for explicit-cache providers

2026-05-25 16:47:14 +08:00

embedding

fix(embedding): repair broken comment in Zhipu embedder

2026-05-17 21:31:54 +08:00

provider

fix(moonshot): pin temperature=1 for models that reject other values

moonshot-v1-* and kimi-k2.5/k2.6 reject any temperature ≠ 1 with HTTP 400.
Detect these models in BuildChatCompletionRequest and force Temperature=1
while leaving kimi-k2/k2-turbo/k2-thinking unaffected.
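
The detection described in this commit could be sketched in Go as below. The `requiresFixedTemperature` and `effectiveTemperature` helpers are hypothetical, and matching by model-name prefix is an assumption; the actual check inside `BuildChatCompletionRequest` may use different rules or exact model IDs.

```go
package main

import "strings"

// requiresFixedTemperature reports whether a model is in one of the families
// the commit message lists as rejecting any temperature other than 1.
// Prefix matching is an assumption made for this sketch.
func requiresFixedTemperature(model string) bool {
	m := strings.ToLower(model)
	return strings.HasPrefix(m, "moonshot-v1-") ||
		strings.HasPrefix(m, "kimi-k2.5") ||
		strings.HasPrefix(m, "kimi-k2.6")
}

// effectiveTemperature returns 1 for the restricted models and the
// caller's requested value for every other model.
func effectiveTemperature(model string, requested float64) float64 {
	if requiresFixedTemperature(model) {
		return 1
	}
	return requested
}
```

Under these assumptions, `kimi-k2`, `kimi-k2-turbo`, and `kimi-k2-thinking` fall through to the requested value because none of them starts with `kimi-k2.5` or `kimi-k2.6`.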