xet-core/xet_runtime
Rajat Arya 8b8db52be2 Raise HF_XET_CLIENT_READ_TIMEOUT to 300s + clean up query_dedup 404 log (#808)
Two connected cleanups from the [2026-04-21 Julien upload-stuck
investigation](https://www.notion.so/huggingface2/Julien-upload-stuck-upload_xorb-120s-timeouts-2026-04-21-3491384ebcac81a19d0af5394745cfff).
Closes #807. Docs PR: huggingface/hub-docs#2419.

## Change 1 — raise `HF_XET_CLIENT_READ_TIMEOUT` default 120s → 300s

**Files:** `xet_runtime/src/config/groups/client.rs`,
`xet_client/src/cas_client/remote_client.rs` (stale comment).

The 120s client read timeout was firing before legitimate `upload_xorb`
requests could complete on high-latency / transatlantic / bursty links.
Fleet-wide this produced a **chronic 30–50% xorb POST failure rate**
(1,092–4,196 `error uploading xorb` events per hour sustained over 24h,
peaking at 49.1% in the investigation window). 267 successful uploads in
the same 24h had latency > 120s (max 37 min), so 120s wasn't protecting
anything legitimate — it was only cutting off slow-but-healthy streams.

300s preserves stall-detection semantics (still an order of magnitude
under the 3600s ALB idle). The env override `HF_XET_CLIENT_READ_TIMEOUT`
is unchanged.
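
A minimal sketch of the default-plus-override behavior described above, not the actual code in `xet_runtime/src/config/groups/client.rs`. The helper names `parse_read_timeout` and `read_timeout_from_env` are hypothetical, and the sketch assumes the value is given in whole seconds (as in `HF_XET_CLIENT_READ_TIMEOUT=120`); the real config layer may accept other formats.

```rust
use std::time::Duration;

/// Default read timeout after this change (previously 120s).
const DEFAULT_READ_TIMEOUT: Duration = Duration::from_secs(300);

/// Hypothetical helper: interprets the raw env value as whole seconds and
/// falls back to the default when the value is missing or unparsable.
fn parse_read_timeout(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_READ_TIMEOUT)
}

/// Hypothetical helper: reads HF_XET_CLIENT_READ_TIMEOUT from the process
/// environment and applies the same fallback.
fn read_timeout_from_env() -> Duration {
    let raw = std::env::var("HF_XET_CLIENT_READ_TIMEOUT").ok();
    parse_read_timeout(raw.as_deref())
}
```

With the variable unset, `read_timeout_from_env()` yields 300s in this sketch; exporting `HF_XET_CLIENT_READ_TIMEOUT=120` would restore the old value.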

## Change 2 — log `query_dedup` 404 as cache miss, not "Fatal Error"

**Files:** `xet_client/src/cas_client/retry_wrapper.rs`,
`xet_client/src/cas_client/remote_client.rs`.

A 404 from `cas::query_dedup` is an expected cache miss — the caller
converts it to `Ok(None)` and proceeds to upload. Today the retry
wrapper logs it as `Fatal Error: "cas::query_dedup" api call failed
... 404 Not Found`, producing **20+ alarming-looking lines per upload
session** with no actual failure behind them (Hoyt flagged this in the
incident Slack thread).
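
The caller-side conversion mentioned above can be sketched roughly as follows. `ClientError` and `dedup_result` are hypothetical stand-ins for illustration only; the real error type and call site in `remote_client.rs` will differ.

```rust
/// Hypothetical, simplified error type: only the HTTP status is modeled.
#[derive(Debug, PartialEq)]
enum ClientError {
    Status(u16),
}

/// Maps the raw `query_dedup` response to the caller-facing result.
/// A 404 means "no dedup entry exists", so it becomes `Ok(None)` and the
/// caller proceeds to upload; every other error is passed through.
fn dedup_result(
    resp: Result<Vec<u8>, ClientError>,
) -> Result<Option<Vec<u8>>, ClientError> {
    match resp {
        Ok(body) => Ok(Some(body)),
        Err(ClientError::Status(404)) => Ok(None),
        Err(other) => Err(other),
    }
}
```

Because the 404 never reaches the upload path as an error, logging it as fatal only adds noise.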

Fix: add `RetryWrapper::with_expected_404()` — mirroring the existing
`with_expected_416()` pattern — and opt `query_dedup` into it. The 404
still short-circuits retries and surfaces as a fatal error to the caller
(preserving the existing `Ok(None)` conversion), but the log line now
reads `Not Found (cache miss): "cas::query_dedup" api call failed ...
404 Not Found`.
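
A rough sketch of the opt-in pattern, under stated assumptions: `RetryWrapperSketch`, `Disposition`, and `classify` are hypothetical names, and the set of retried statuses (429 and 5xx) is an illustrative guess rather than the wrapper's actual policy. Only the `with_expected_404()` name and the two log prefixes come from this PR.

```rust
/// Simplified outcome assigned to an HTTP status.
#[derive(Debug, PartialEq)]
enum Disposition {
    Success,
    Retry,
    /// Not retried; carries the prefix used for the log line.
    Fatal(&'static str),
}

/// Hypothetical stand-in for the real `RetryWrapper`.
#[derive(Default)]
struct RetryWrapperSketch {
    expected_404: bool,
}

impl RetryWrapperSketch {
    fn new() -> Self {
        Self::default()
    }

    /// Builder-style opt-in: marks 404 as an expected response.
    fn with_expected_404(mut self) -> Self {
        self.expected_404 = true;
        self
    }

    /// A 404 is fatal (never retried) either way; opting in changes
    /// only the log prefix. Retried statuses here are an assumption.
    fn classify(&self, status: u16) -> Disposition {
        match status {
            200..=299 => Disposition::Success,
            404 if self.expected_404 => Disposition::Fatal("Not Found (cache miss)"),
            429 | 500..=599 => Disposition::Retry,
            _ => Disposition::Fatal("Fatal Error"),
        }
    }
}
```

Keeping the opt-in per call site means other endpoints, where a 404 is a genuine failure, continue to log `Fatal Error`.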

## Test plan

- [x] `cargo +nightly fmt --all --check` clean
- [x] `cargo test -p xet-client --lib cas_client::retry_wrapper` — 5
passed (incl. new `test_404_expected_is_fatal_and_not_retried`)
- [ ] Manually verify `HF_XET_CLIENT_READ_TIMEOUT=120` still overrides
via env
- [ ] Confirm a session run produces no `Fatal Error:` lines for the
`query_dedup` 404s
- [ ] Watch the xorb POST error-rate panel on the [CAS Grafana
dashboard](https://grafana.huggingface.tech/d/dejp4w2hael1cb/cas) after
release; expect the 120s-clustered p50 to disappear

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adjusts client networking defaults (read timeout) and alters
retry-wrapper handling/logging for HTTP 404s, which can change behavior
and observability for slow uploads and cache-miss paths.
> 
> **Overview**
> Raises the default `HF_XET_CLIENT_READ_TIMEOUT` from 120s to 300s to
better tolerate slow-but-progressing transfers.
> 
> Adds `RetryWrapper::with_expected_404()` and opts `cas::query_dedup`
into it so 404 responses are still non-retried/fatal to the caller but
are logged as an expected *cache miss* (with a new unit test covering
the no-retry behavior).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
3e88f9cf8f.</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:49:51 -07:00

xet-runtime


Async runtime, configuration storage, logging, and utility infrastructure for the Hugging Face Xet storage tools. This crate is meant to be used through the API of the hf-xet package rather than directly.

Overview

xet-runtime provides the shared foundation used by all crates in the xet-core ecosystem:

  • Async runtime: Tokio-based runtime with configurable thread pools
  • Configuration: Hierarchical configuration for Xet clients
  • Structured logging: Tracing-based logging with file and console outputs
  • Error handling: RuntimeError type for the runtime layer
  • Utilities: File operations, sync primitives, and platform abstractions

This crate is part of xet-core.

License

Apache-2.0