mirror of
https://github.com/huggingface/xet-core.git
synced 2026-06-04 13:30:29 +08:00
Two connected cleanups from the [2026-04-21 Julien upload-stuck
investigation](https://www.notion.so/huggingface2/Julien-upload-stuck-upload_xorb-120s-timeouts-2026-04-21-3491384ebcac81a19d0af5394745cfff).
Closes #807. Docs PR: huggingface/hub-docs#2419.
## Change 1 — raise `HF_XET_CLIENT_READ_TIMEOUT` default 120s → 300s
**Files:** `xet_runtime/src/config/groups/client.rs`,
`xet_client/src/cas_client/remote_client.rs` (stale comment).
The 120s client read timeout was firing before legitimate `upload_xorb`
requests could complete on high-latency / transatlantic / bursty links.
Fleet-wide this produced a **chronic 30–50% xorb POST failure rate**
(1,092–4,196 `error uploading xorb` events per hour sustained over 24h,
peaking at 49.1% in the investigation window). 267 successful uploads in
the same 24h had latency > 120s (max 37 min), so 120s wasn't protecting
anything legitimate — it was only cutting off slow-but-healthy streams.
300s preserves stall-detection semantics (still an order of magnitude
under the 3600s ALB idle timeout). The env override
`HF_XET_CLIENT_READ_TIMEOUT` is unchanged.
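A minimal sketch of how the new default and the env override interact. It is not the actual code in `client.rs`: the helper names are hypothetical, and it assumes the variable holds a whole number of seconds and that an unparsable value falls back to the default. The real config layer may accept other formats or behave differently on bad input.

```rust
use std::time::Duration;

/// Default client read timeout after this change (previously 120s).
const DEFAULT_READ_TIMEOUT: Duration = Duration::from_secs(300);

/// Hypothetical helper: pick the effective timeout from the raw env value.
/// Assumption: the value is an integer number of seconds; anything that
/// does not parse falls back to the default.
fn resolve_read_timeout(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_READ_TIMEOUT)
}

/// Hypothetical wrapper that reads the real environment variable.
fn read_timeout_from_env() -> Duration {
    let raw = std::env::var("HF_XET_CLIENT_READ_TIMEOUT").ok();
    resolve_read_timeout(raw.as_deref())
}
```

Under these assumptions, setting `HF_XET_CLIENT_READ_TIMEOUT=120` restores the old behavior, and leaving it unset yields 300s.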
## Change 2 — log `query_dedup` 404 as cache miss, not "Fatal Error"
**Files:** `xet_client/src/cas_client/retry_wrapper.rs`,
`xet_client/src/cas_client/remote_client.rs`.
A 404 from `cas::query_dedup` is an expected cache miss — the caller
converts it to `Ok(None)` and proceeds to upload. Today the retry
wrapper logs it as `Fatal Error: "cas::query_dedup" api call failed
... 404 Not Found`, producing **20+ alarming-looking lines per upload
session** with no actual failure behind them (Hoyt flagged this in the
incident Slack thread).
Fix: add `RetryWrapper::with_expected_404()` — mirroring the existing
`with_expected_416()` pattern — and opt `query_dedup` into it. The 404
still short-circuits retries and surfaces as a fatal error to the caller
(preserving the existing `Ok(None)` conversion), but the log line now
reads `Not Found (cache miss): "cas::query_dedup" api call failed ...
404 Not Found`.
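The opt-in pattern can be sketched roughly as follows. This is a toy model, not the real `RetryWrapper`: only `with_expected_404()` and the two log prefixes come from this PR, while the struct fields, `should_retry`, `log_prefix`, `dedup_lookup`, the choice of retryable statuses, and the error string are illustrative assumptions.

```rust
/// Toy stand-in for the retry wrapper; only the 404 handling is modeled.
struct RetryWrapper {
    api_tag: &'static str,
    expect_404: bool,
}

impl RetryWrapper {
    fn new(api_tag: &'static str) -> Self {
        Self { api_tag, expect_404: false }
    }

    /// Builder-style opt-in: a 404 is an expected outcome for this call.
    fn with_expected_404(mut self) -> Self {
        self.expect_404 = true;
        self
    }

    /// Assumption: only 429 and 5xx are retried. A 404 never is,
    /// whether or not it was marked as expected.
    fn should_retry(&self, status: u16) -> bool {
        matches!(status, 429 | 500..=599)
    }

    /// Log prefix for a non-retried failure. The flag changes only the
    /// wording, not the control flow.
    fn log_prefix(&self, status: u16) -> &'static str {
        if status == 404 && self.expect_404 {
            "Not Found (cache miss)"
        } else {
            "Fatal Error"
        }
    }
}

/// Hypothetical caller: a 404 becomes `Ok(None)`, meaning "no dedup hit".
fn dedup_lookup(
    wrapper: &RetryWrapper,
    status: u16,
    body: Option<String>,
) -> Result<Option<String>, String> {
    match status {
        200..=299 => Ok(body),
        404 => Ok(None),
        // Illustrative message only; not the real log format.
        other => Err(format!(
            "{}: \"{}\" failed with status {}",
            wrapper.log_prefix(other),
            wrapper.api_tag,
            other
        )),
    }
}
```

In this sketch, `RetryWrapper::new("cas::query_dedup").with_expected_404()` leaves retry decisions untouched and changes only the prefix logged for a 404, which is the property the new unit test checks.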
## Test plan
- [x] `cargo +nightly fmt --all --check` clean
- [x] `cargo test -p xet-client --lib cas_client::retry_wrapper` — 5
passed (incl. new `test_404_expected_is_fatal_and_not_retried`)
- [ ] Manually verify `HF_XET_CLIENT_READ_TIMEOUT=120` still overrides
via env
- [ ] Confirm a session run produces no `Fatal Error:` lines for the
`query_dedup` 404s
- [ ] Watch the xorb POST error-rate panel on the [CAS Grafana
dashboard](https://grafana.huggingface.tech/d/dejp4w2hael1cb/cas) after
release; expect the 120s-clustered p50 to disappear
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adjusts client networking defaults (read timeout) and alters
retry-wrapper handling/logging for HTTP 404s, which can change behavior
and observability for slow uploads and cache-miss paths.
>
> **Overview**
> Raises the default `HF_XET_CLIENT_READ_TIMEOUT` from 120s to 300s to
better tolerate slow-but-progressing transfers.
>
> Adds `RetryWrapper::with_expected_404()` and opts `cas::query_dedup`
into it so 404 responses are still non-retried/fatal to the caller but
are logged as an expected *cache miss* (with a new unit test covering
the no-retry behavior).
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
3e88f9cf8f.</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# xet-runtime
Async runtime, configuration storage, logging, and utility infrastructure for the Hugging Face Xet storage tools. This is meant to be used through the API in the hf-xet package.
## Overview
xet-runtime provides the shared foundation used by all crates in the
xet-core ecosystem:
- Async runtime — Tokio-based runtime with configurable thread pools
- Configuration — Hierarchical configuration for Xet clients
- Structured logging — Tracing-based logging with file and console outputs
- Error handling — `RuntimeError` type for the runtime layer
- Utilities — File operations, sync primitives, and platform abstractions
This crate is part of xet-core.
## License
Apache-2.0