Files
WeKnora/internal
young1lin 29820e4cac docs(chat): clarify cached-token semantics for explicit-cache providers
`cached_tokens` is reported by every OpenAI-compatible provider that
supports prompt caching, but how it becomes non-zero differs by mode:

- Implicit caching (OpenAI, Azure OpenAI, DeepSeek, …) populates the
  field automatically whenever a prompt prefix matches a previous
  request within the provider's cache TTL. No client-side opt-in.

- Explicit caching (Qwen on Aliyun, Anthropic Claude, …) only
  populates the field after the caller attaches `cache_control:
  {"type": "ephemeral"}` to the relevant message / content block.
  Until that opt-in is applied upstream of the request, the field
  stays zero even when the prefix is otherwise byte-stable.

Without this distinction documented, the previous commit reads as if
`TokenUsage.CachedTokens` will show non-zero values for Qwen / Claude
once this PR lands — which is not the case. The plumbing here is a
prerequisite (stable prefix via sorted tools) and a meter (visibility
of the field), but the explicit-cache opt-in itself is out of scope
and lives elsewhere.

Document this on `TokenUsage.CachedTokens` and the `cachedTokens`
helper so callers do not mistake observability for activation.
2026-05-25 16:47:14 +08:00
..