mirror of
https://github.com/Tencent/WeKnora.git
synced 2026-06-04 13:30:32 +08:00
`cached_tokens` is reported by every OpenAI-compatible provider that
supports prompt caching, but how it becomes non-zero differs by mode:
- Implicit caching (OpenAI, Azure OpenAI, DeepSeek, …) populates the
  field automatically whenever a prompt prefix matches a previous
  request within the provider's cache TTL. No client-side opt-in.
- Explicit caching (Qwen on Aliyun, Anthropic Claude, …) only populates
  the field after the caller attaches
  `cache_control: {"type": "ephemeral"}` to the relevant message /
  content block. Until that opt-in is applied upstream of the request,
  the field stays zero even when the prefix is otherwise byte-stable.
Without this distinction documented, the previous commit reads as if
`TokenUsage.CachedTokens` will show non-zero values for Qwen / Claude
once this PR lands, which is not the case. The plumbing here is a
prerequisite (a stable prefix via sorted tools) and a meter (visibility
of the field); the explicit-cache opt-in itself is out of scope and
lives elsewhere.
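
As a rough sketch of the "stable prefix via sorted tools" prerequisite:
the `Tool` type, the `sortedTools` helper and the choice of the tool
name as the sort key are all assumptions made for this example, not a
copy of the actual change.

```go
package main

import (
	"fmt"
	"sort"
)

// Tool is a hypothetical, minimal tool definition used only in this sketch.
type Tool struct {
	Name        string
	Description string
}

// sortedTools returns a copy of tools ordered by name, so the serialized
// tool list does not depend on registration or map-iteration order.
// Sorting by name is an assumption of this sketch.
func sortedTools(tools []Tool) []Tool {
	out := make([]Tool, len(tools))
	copy(out, tools)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	a := sortedTools([]Tool{{Name: "search"}, {Name: "calc"}})
	b := sortedTools([]Tool{{Name: "calc"}, {Name: "search"}})
	// Both inputs yield the same order, so the request prefix stays identical.
	fmt.Println(a[0].Name == b[0].Name, a[1].Name == b[1].Name)
}
```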
Document this on `TokenUsage.CachedTokens` and the `cachedTokens`
helper so callers do not mistake observability for activation.