## Summary
- **Remove unused dependencies**: warp (zero imports), paste (zero
invocations), tower-service (zero imports), and the misplaced heed
dependency in xet_core_structures
- **Move mockall to dev-dependencies** in xet_client by gating
`#[automock]` with `#[cfg_attr(test, automock)]`
- **Feature-gate simulation module** behind `simulation` cargo feature
in xet_client, making axum, heed, humantime, futures-util,
human-bandwidth, and tower-http optional
- **Replace duration-str with humantime** (~2 deps vs ~78 transitive
deps) across xet_runtime, xet_client simulation, and simulation crate
## Impact
| Metric | Before | After | Change |
|---|---|---|---|
| hf-xet production deps | 371 | 321 | **-50** |
| Workspace total | 575 | 569 | -6 |
## Test plan
- [x] `cargo check --workspace` passes
- [x] `cargo check -p hf-xet` passes (without simulation feature — key
validation)
- [x] `cargo test --workspace` — all tests pass (4 pre-existing auth
test failures in git_xet unrelated to this PR)
- [x] `cargo tree -p hf-xet -e normal --prefix none | sort -u | wc -l`
confirms 321 deps
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Medium risk because it changes dependency graph and Cargo feature
gating (notably `xet-client` simulation modules and CI test features),
which can affect build/test behavior across targets despite minimal
runtime logic changes.
>
> **Overview**
> Reduces workspace dependency surface by removing `duration-str`
(replaced with `humantime`) and trimming other transitive-heavy crates;
updates lockfiles accordingly across the workspace, `hf_xet`, and WASM
builds.
>
> Introduces/propagates a `simulation` Cargo feature: `xet-client`’s
simulation server-related deps become optional and are only
compiled/exported when `feature = "simulation"` is enabled; `git_xet`
adds a `simulation` feature that forwards to dependent crates, and CI
now runs tests with `strict simulation git-xet-for-integration-test`.
>
> Minor repo hygiene updates include ignoring `.claude/` in `.gitignore`
and wiring the `simulation` crate to depend on `xet-client` with
`features = ["simulation"]` (plus swapping its duration parsing helper
to `humantime`).
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
6abc194398. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Currently, computing aggregate chunk hashes across independently
processed ranges requires recomputing over the full concatenated chunk
list. This PR introduces ChunkHashRange, a composable representation
that can hash contiguous partial ranges and merge them while preserving
equivalence with the existing xorb_hash / file_hash behavior. This
allows an intermediate representation of the hash ranges that can be
merged in arbitrary order to get the final hash. It also uses O(log(n))
storage and all operations are done in linear time. Serialization and
Deserialization are fully supported.
The main use case for this is in doing partial file edits. Previously,
to edit the middle of a large file, the client would have to know all
the hashes for the full file, even if only a few in the middle were
changed. With a large file, this can still be 100s of MB; the chunk
metadata size is roughly 1/1000 of the data size. With this change, we
can now transmit the unmodified parts of a file in O(log(n)) storage and
still build the hash of the entire file; a sequence of 10M chunks now
takes the storage equivalent of roughly 500 chunks.
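
To make the storage claim concrete, here is a minimal sketch of an append-only accumulator that keeps one hash per perfect subtree. It is not the `ChunkHashRange` algorithm: `PeakAccumulator`, `combine`, and `full_root` are hypothetical names, the tree shape is a plain binary one rather than the xorb_hash / file_hash layout, and `DefaultHasher` stands in for the real hash. It only illustrates how a logarithmic number of stored hashes can still reproduce the aggregate computed over the full list.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for the real node-combining hash (an assumption).
fn combine(left: u64, right: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (left, right).hash(&mut hasher);
    hasher.finish()
}

/// Keeps one (height, hash) entry per perfect subtree, tallest first.
#[derive(Default)]
struct PeakAccumulator {
    peaks: Vec<(u32, u64)>,
}

impl PeakAccumulator {
    /// Append one chunk hash, merging equal-height peaks like a binary counter.
    fn push(&mut self, chunk_hash: u64) {
        let mut node = (0u32, chunk_hash);
        while let Some(&(height, left)) = self.peaks.last() {
            if height != node.0 {
                break;
            }
            self.peaks.pop();
            node = (height + 1, combine(left, node.1));
        }
        self.peaks.push(node);
    }

    /// Fold the peaks right to left into a single aggregate hash.
    fn root(&self) -> Option<u64> {
        let mut hashes = self.peaks.iter().rev().map(|&(_, hash)| hash);
        let last = hashes.next()?;
        Some(hashes.fold(last, |acc, left| combine(left, acc)))
    }
}

/// Reference: recompute the same tree from the full list of leaves.
fn full_root(leaves: &[u64]) -> Option<u64> {
    match leaves.len() {
        0 => None,
        1 => Some(leaves[0]),
        n => {
            // Split at the largest power of two strictly below n.
            let mut split = 1;
            while split * 2 < n {
                split *= 2;
            }
            Some(combine(full_root(&leaves[..split])?, full_root(&leaves[split..])?))
        }
    }
}
```

In this toy version the number of stored peaks equals the number of set bits in the chunk count, so it never exceeds log2(n) + 1 entries.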
Along the way, we also added in an optimization for the merge step to
avoid an allocation, yielding a 2x speedup.
---------
Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
Previously, upload and download paths each had their own ad-hoc state
tracking, cancellation, and runtime bridging logic. TaskRuntime
consolidates this into a single type that owns a CancellationToken tree,
tracks Running/Finished/Cancelled state with recursive propagation to
children, and provides bridge_async/bridge_sync wrappers that
automatically wire up tokio::select! cancellation. Session →
commit/group → per-file handles form a parent-child token tree, so
aborting a session cancels all descendant work.
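
As a rough illustration of the parent-child propagation described above, the following std-only sketch models a cancellation tree. It is an assumption-laden stand-in: the PR uses a `CancellationToken` tree, whereas `CancelNode` and its methods here are hypothetical and carry no async wake-up support.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct Inner {
    cancelled: AtomicBool,
    children: Mutex<Vec<CancelNode>>,
}

/// Hypothetical cancellation node; cloning shares the same state.
#[derive(Clone, Default)]
struct CancelNode {
    inner: Arc<Inner>,
}

impl CancelNode {
    /// Create a child that is cancelled whenever this node is cancelled.
    fn child(&self) -> CancelNode {
        let child = CancelNode::default();
        // Hold the lock while checking the flag so a concurrent cancel()
        // either sees the new child or has already set the flag.
        let mut children = self.inner.children.lock().unwrap();
        if self.inner.cancelled.load(Ordering::SeqCst) {
            child.cancel();
        }
        children.push(child.clone());
        child
    }

    /// Cancel this node and, recursively, all of its descendants.
    fn cancel(&self) {
        let children = self.inner.children.lock().unwrap();
        self.inner.cancelled.store(true, Ordering::SeqCst);
        for child in children.iter() {
            child.cancel();
        }
    }

    fn is_cancelled(&self) -> bool {
        self.inner.cancelled.load(Ordering::SeqCst)
    }
}
```

Cancelling a commit-level node marks its per-file children as cancelled, while the session and sibling groups keep running.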
The upload path gets new UploadFileHandle and UploadStreamHandle wrapper
types (replacing the old UploadTaskHandle), with inner/wrapper pattern
for cheap cloning. UploadCommit::commit() now returns a CommitReport
containing aggregate dedup metrics, progress, and per-file FileMetadata.
The download path mirrors this structure: FileDownloadGroup uses
TaskRuntime for state gating and owns bespoke DownloadTaskHandle
instances with per-task status and result access.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **High Risk**
> High risk due to a breaking redesign of the public `xet_session` API
(new handle/report types and renamed methods) plus new
cancellation/state machinery that changes how uploads/downloads are
coordinated and terminated.
>
> **Overview**
> **Redesigns `xet_pkg::xet_session` around a new hierarchical
`TaskRuntime`** (using `tokio-util` cancellation tokens) to unify state,
bridging, and cancellation across session → commit/group → per-file
handles.
>
> **Replaces the old task-handle/result model** (`tasks.rs`,
`UploadResult`/`DownloadResult`, `TaskStatus`, group/session state
enums) with explicit handle/report types: `XetFileUpload`,
`XetStreamUpload`, `XetFileDownload`, `XetCommitReport`, and
`XetDownloadGroupReport`, and standardizes task state via
`XetTaskState`.
>
> **Adjusts APIs and error semantics**: `commit()` now returns an
aggregate report (dedup metrics + progress + per-file metadata) and no
longer consumes `self`; progress methods become infallible
(`progress()`); cancellations/errors are consolidated
(`AlreadyCompleted`, `UserCancelled`, `KeyboardInterrupt`,
`TaskError`/`PreviousTaskError`) with updated Python exception mapping.
`xet_data` now returns per-file `DeduplicationMetrics` from upload tasks
and adds a zero-copy `SingleFileCleaner::add_data_from_bytes`.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
153a3ebbbe. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Acknowledged that running "cargo bench --no-run" on every test platform
is slow. This PR
- extracts benchmark compilation verification from the Linux and macOS
build_and_test jobs into a dedicated `check-bench-compiles` job so it
runs in parallel with the cargo test jobs;
- also skips compiling "git_xet" in release mode which itself doesn't
contain benchmarks and takes the longest to compile due to optimized
linking;
- also removes unused clippy component installs from Windows and macOS
toolchain setup.
See below that the `check-bench-compiles` job finishes faster than
`build_and_test-linux` and `build_and_test-win`, so it's not introducing
extra wait time.
Currently, session runtime routing is split between XetSession and
XetRuntime. This PR centralizes runtime routing in XetRuntime, moving
all wrapper structs there. bridge_async / bridge_sync now work
uniformly from both async and sync contexts.
This PR also changes the default new() method to auto-detect whether
the process is running inside an existing tokio runtime with the
required features enabled, reusing it if so and creating a new one
otherwise. In addition, with_tokio_handle() now returns an error if the
provided tokio handle doesn't have the required features.
## Context
These changes support the hf-mount project, where FUSE streaming uploads
don't know the file size in advance.
## Summary
- Changes the `size` parameter of `FileUploadSession::start_clean()`
from `u64` to `Option<u64>`. Passing `None` signals that the final file
size is unknown (FUSE streaming uploads), which prevents `debug_assert`
panics when `completed_bytes` exceeds the initially declared
`total_bytes=0`.
- Propagates `Option<u64>` to the public API:
`UploadCommit::upload_file()` and `upload_file_blocking()` now take
`file_size: Option<u64>`.
- All existing callers are updated to wrap the size argument in
`Some(...)`.
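
A minimal sketch of why `None` avoids the assertion, assuming a simplified progress tracker. `UploadProgress` and its methods are hypothetical and not the types used in this PR; the point is only that an unknown size skips the upper-bound check until the stream ends.

```rust
/// Hypothetical tracker; `total_bytes == None` means "size not known yet".
#[derive(Debug)]
struct UploadProgress {
    total_bytes: Option<u64>,
    completed_bytes: u64,
}

impl UploadProgress {
    fn new(size: Option<u64>) -> Self {
        Self { total_bytes: size, completed_bytes: 0 }
    }

    /// Record newly processed bytes; only check the bound when a total is known.
    fn record(&mut self, bytes: u64) {
        self.completed_bytes += bytes;
        if let Some(total) = self.total_bytes {
            debug_assert!(self.completed_bytes <= total, "completed exceeds declared total");
        }
    }

    /// At end of stream, an unknown total becomes the observed byte count.
    fn finish(&mut self) {
        if self.total_bytes.is_none() {
            self.total_bytes = Some(self.completed_bytes);
        }
    }

    /// Fraction complete, or `None` while the total is unknown.
    fn fraction(&self) -> Option<f64> {
        self.total_bytes.map(|total| {
            if total == 0 { 1.0 } else { self.completed_bytes as f64 / total as f64 }
        })
    }
}
```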
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Public upload APIs now accept an optional file size, which is a
breaking signature change and could affect downstream callers and
progress tracking behavior when size is `None`. Implementation changes
are small but touch core upload session and commit interfaces.
>
> **Overview**
> Enables streaming uploads where the final file size is not known up
front by changing `FileUploadSession::start_clean` to take `Option<u64>`
and treating `None` as an unknown size for progress/completion tracking.
>
> Propagates this optional-size API through `UploadCommit::upload_file`
/ `upload_file_blocking` and updates all internal callers, examples, and
tests to pass `Some(size)` when the size is known, along with doc
updates reflecting the new `None` semantics.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
8b41e11e24. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
This PR adds ordered and unordered download streams on XetSession,
including optional byte-range support and per-stream progress reporting.
Blocking and async variants are supported.
On the reconstruction side, this introduces UnorderedWriter and
UnorderedDownloadStream in xet_data, and extends the FileDownloadSession
stream APIs to take optional source ranges. Ordered and unordered
streams now share the same session-facing access pattern for async and
blocking callers.
This PR also renames DownloadGroup to FileDownloadGroup. Stream data
uses the per-session memory pool, but streams don't count towards the
maximum number of concurrent downloads in progress.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Touches core file reconstruction/writer plumbing (including
`DataWriter` ownership and new unordered writer/stream paths) and
changes public session APIs, so regressions could impact download
correctness, cancellation, or progress reporting.
>
> **Overview**
> Adds first-class **ordered and unordered streaming download APIs** to
`xet_pkg::xet_session`, including async and blocking variants, optional
source-relative byte ranges, and per-stream progress via new
`XetDownloadStream` / `XetUnorderedDownloadStream` wrappers.
>
> On the data layer, introduces an **unordered reconstruction path**
(`UnorderedWriter` + `UnorderedDownloadStream`) and refactors streaming
to spawn reconstruction tasks immediately but gate execution behind
`start()`; stream abort callbacks are now registered per-stream and
automatically unregistered on drop to avoid callback accumulation.
>
> Updates the reconstruction writer contract by making
`DataWriter::finish` consume the writer (and shifting `DataWriter` to
`&mut self` usage), adjusts `SequentialWriter` accordingly, and adds
Criterion-based reconstruction benchmarks plus extensive
unordered-stream tests. Also renames session `DownloadGroup` to
`FileDownloadGroup` (and constructors) and updates call sites/examples.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e02890aa4b. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Currently, full test validation is rather heavy, but lighter local test
runs often miss issues that only the full-stack tests catch. This PR
adds a smoke-test path that runs a meaningful subset of the tests across
the workspace, covering most errors. It runs in about 1/8 of the time of
cargo test, which makes it useful for speeding up AI model iteration.
In addition, a few intermittent failures were also fixed.
There should be no runtime functionality change.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk since changes are limited to Cargo configuration and test
gating; no production code paths are modified. Main risk is accidentally
skipping too much coverage or misconfiguring feature flags in CI/local
workflows.
>
> **Overview**
> Adds a new `cargo smoke-test` workflow by introducing a `smoke-test`
Cargo profile and a `cargo` alias that runs `test` with per-crate
`smoke-test` features enabled.
>
> Defines `smoke-test` features across multiple crates and uses
`#[cfg_attr(feature = "smoke-test", ignore)]` / `#[cfg(... not(feature =
"smoke-test"))]` to skip long-running, concurrency-heavy, or full-stack
integration tests during smoke runs.
>
> Tightens test robustness by making `SafeFileCreator` permission
assertions umask-tolerant (require owner read/write rather than an exact
`0o644`).
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
5d53009652. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
Currently, GC simulation coverage depends on deletion-control operations
that were only partially wired through the disk-backed HTTP simulation
path, and deletion behavior in LocalClient needed to preserve shard-hash
stability across GC epochs. This PR adds shard dedup-entry cleanup to
the deletion-control surface and updates the file deletion behavior so
shard files are not rewritten.
Note: this introduces a breaking change against LocalClient in that
current LocalClient repositories won't persist across this commit.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes affect disk-backed simulation CAS behavior and persistence
(new LMDB tables + deletion semantics) and add a new deletion-control
API used by GC; regressions could break existing local test data or GC
integration paths.
>
> **Overview**
> Enables *correct* GC integration testing against the disk-backed
simulation server by switching `LocalClient::delete_file_entry` to a
**soft-delete** backed by a new LMDB `file_status_table`, and updating
listing/reconstruction/direct-file-access paths to hide and reject
deleted files without rewriting shard files.
>
> Extends the deletion-control surface with `remove_shard_dedup_entries`
(plus a new `DELETE /simulation/shards/{hash}/dedup_entries` route and
client support) and fixes `LocalTestServerBuilder` to actually wire a
`deletion_client` for disk-backed servers so `/simulation/*` deletion
routes stop returning `501`.
>
> Reworks `verify_integrity` to validate XORB references *across shards*
(global dedup aware) and skip soft-deleted files, adds targeted
unit/integration tests for the new behaviors, and tightens log cleanup
to avoid protecting stale logs on PID reuse by comparing process start
time to the log’s embedded timestamp.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
fdca297600. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Currently, download metadata assumes file_size is always known, which
forces callers to provide a size even when only a hash is available.
This PR changes XetFileInfo.file_size to Option<u64> -- with
serialization compatibility -- and propagates that through so hash-only
downloads are a supported path while known-size flows continue to work
as before.
On the download path, this updates the reconstructor setup and range
handling so progress can start without a final total and then finalize
when EOF is discovered. For known-size full-file downloads, it now
validates the reconstructed byte count and returns
DataError::SizeMismatch when expected and actual size differ. In
addition, open ended ranges (e.g. `start..` and `..end`) are now
supported through all APIs.
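
The range handling can be sketched with the standard `RangeBounds` trait. This is an illustration under assumptions, not the code in this PR: `resolve_range`, `check_size`, and the `SizeMismatch` struct below are hypothetical helpers (the PR's error is `DataError::SizeMismatch`).

```rust
use std::ops::{Bound, RangeBounds};

/// Hypothetical error type standing in for `DataError::SizeMismatch`.
#[derive(Debug, PartialEq)]
struct SizeMismatch {
    expected: u64,
    actual: u64,
}

/// Turn any range form into (start, exclusive end); the end stays `None`
/// when the range is open-ended and the file size is unknown.
fn resolve_range(range: impl RangeBounds<u64>, file_size: Option<u64>) -> (u64, Option<u64>) {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s.saturating_add(1),
        Bound::Unbounded => 0,
    };
    let requested_end = match range.end_bound() {
        Bound::Included(&e) => Some(e.saturating_add(1)),
        Bound::Excluded(&e) => Some(e),
        Bound::Unbounded => None,
    };
    let end = match (requested_end, file_size) {
        (Some(e), Some(size)) => Some(e.min(size)), // clamp to a known size
        (Some(e), None) => Some(e),
        (None, size) => size,
    };
    (start, end)
}

/// Validate the reconstructed byte count only when a size was expected.
fn check_size(expected: Option<u64>, actual: u64) -> Result<(), SizeMismatch> {
    match expected {
        Some(expected) if expected != actual => Err(SizeMismatch { expected, actual }),
        _ => Ok(()),
    }
}
```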
This also adds coverage for range-based writer/stream downloads and
unknown-size round trips in session-level tests.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Medium risk because it changes a widely used API type
(`XetFileInfo.file_size`) and adjusts download/reconstruction behavior,
which can affect progress reporting and error handling across Rust and
Python bindings.
>
> **Overview**
> Enables *hash-only downloads* by changing `XetFileInfo.file_size` from
`u64` to `Option<u64>` (serde backward-compatible) and adding
`XetFileInfo::new_hash_only`, then propagating the optional size through
`xet_pkg` and `hf_xet` (Python `PyXetDownloadInfo.file_size` and
`PyPointerFile.filesize`).
>
> Extends download APIs to accept *open-ended ranges* via
`RangeBounds<u64>` (e.g. `start..`, `..end`, `..`) and updates
reconstructor/progress behavior to handle unknown totals, while adding
`DataError::SizeMismatch` and validating reconstructed byte counts for
full downloads and bounded ranges.
>
> Adds substantial new unit/integration test coverage for range
variants, unknown-size round trips, and size-mismatch errors, plus minor
CLI output adjustments to print unknown sizes.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
4d25896c51. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
This PR performs some housecleaning and removes some technical debt
around using different error types, unifying them with the python
interface.
- Our client code tended to do a lot with anyhow errors as an artifact
of first using them before switching to thiserror. This PR cleans these
up in favor of using ClientError or other named error types directly.
- It also removes all the aliases to the old error type names present in
the packages before the refactoring, now settling into ClientError,
FormatError, DataError, and RuntimeError, with XetError being the error
type exposed publicly.
- Also, currently, xet_session exposes SessionError as an alias of
XetError, which adds an extra public type name without adding behavior.
This PR removes that alias and standardizes the public API/docs onto
XetError directly.
- It also tightens Python-facing error behavior and moves the python
handling to the XetError class directly, hidden behind a python feature
flag. Using these types, hf_xet now registers XetObjectNotFoundError and
XetAuthenticationError exception classes for authentication and the
not-found cases. These inherit from the current exception classes, so
all behavior is preserved.
- In addition, the From for PyErr mapping routes
timeout/network/auth/not-found categories to more appropriate Python
exception types than simply RuntimeError.
This is primarily an API-surface cleanup plus error-classification
alignment.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> API-breaking error-surface changes (removal of legacy alias modules
and signature changes like `CredentialHelper::fill_credential`) may
require downstream code updates, especially where errors are
matched/converted. Runtime behavior should be mostly unchanged, but
error mapping/propagation paths (including Python exceptions) are widely
touched across crates.
>
> **Overview**
> This PR **unifies error types across the workspace** by removing
legacy re-export/alias modules (e.g. `CasClientError`, `CasTypesError`,
`DataProcessingError`, `SessionError`) and updating call sites to use
canonical errors like `xet_client::ClientError`,
`xet_core_structures::CoreError`, and `xet_data::DataError` directly.
>
> It updates CAS client code to **standardize on
`crate::error::Result`/`ClientError`**, including deleting
`cas_client/error.rs`, adjusting error conversions in retry/http
middleware paths, and updating simulation/local-server code to map
`ClientError` to HTTP responses.
>
> Python bindings (`hf_xet`) now **convert failures via `XetError`**
(with `xet_pkg` built with `python` support), register custom exceptions
on module init, and refine argument-validation errors to `PyValueError`
while routing network/timeout/auth/not-found to more appropriate Python
exception classes.
>
> Misc cleanup: `git_xet` now depends on `xet-data`, simulation binaries
switch to `anyhow::Result`/`bail!`, and lockfiles are updated for
new/updated dependencies (notably `pyo3`/`inventory`).
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
f3d056a909. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Since a previous PR merged the async and blocking APIs under one struct,
the blocking APIs are now accessible from an `UploadCommit` /
`DownloadGroup` created by the async APIs. This PR adds the same
`External` runtime mode check to these blocking APIs as is done for the
`new_upload_commit_blocking` and `new_download_group_blocking`
functions, so that they return an `Err` gracefully where possible
instead of panicking.
However, this doesn't guard against users "deliberately" creating an
`UploadCommit` / `DownloadGroup` instance with `Owned` runtime mode,
sending it to an async context, and calling the blocking APIs there, in
which case it will still panic.
This PR also adds unit tests and updates docs for the above changes.
A previous PR changed the session id from a `ulid` to a `UniqueID` (a
self-incrementing in-memory u64), which is not correct.
The session id is used in CAS server logs, traces, and CDN logs to
identify a related group of activity (e.g. for debugging), so it needs
to be globally unique (hence `ulid`) rather than only locally unique.
This PR should resolve all Dependabot alerts by upgrading deps and
replacing some deprecated crates with their suggested alternatives, e.g.
`tempdir -> tempfile`. Supersedes PR #721. Fixes issue #722.
The last repo restructuring didn't update several benchmarks that are
not compiled by default as part of "cargo build". This PR fixes those
compilation errors and warnings, and adds "cargo bench --no-run" to CI,
which checks that the benchmarks compile but doesn't actually run them.
This PR adds a full integration test suite on top of the xet session
interface that mimics the integration tests in xet_data/tests/. This one
additionally tests alternate asynchronous runtimes to ensure that the
bridge to the internal tokio runtime works correctly as well.
Currently, progress tracking is split between callback-driven and
snapshot-driven paths, making session and task wiring across xet_data,
xet_pkg, hf_xet, and git_xet harder to keep consistent. This PR moves
upload/download progress to a polling snapshot model backed by atomics.
It also switches task identifiers to a UniqueID common with the progress
tracking throughout the session APIs.
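
A polling snapshot model backed by atomics could look roughly like the sketch below. `ProgressCounters` and `ProgressSnapshot` are hypothetical names, not the types introduced in this PR: writers bump atomic counters, and readers copy them into a plain struct whenever they choose to poll.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared counters updated by worker tasks (hypothetical type).
#[derive(Default)]
struct ProgressCounters {
    total_bytes: AtomicU64,
    completed_bytes: AtomicU64,
}

/// Plain-data view returned to pollers.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ProgressSnapshot {
    total_bytes: u64,
    completed_bytes: u64,
}

impl ProgressCounters {
    fn add_total(&self, bytes: u64) {
        self.total_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    fn add_completed(&self, bytes: u64) {
        self.completed_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Read both counters without taking a lock. The two loads are not one
    /// atomic unit, so a snapshot taken mid-update may be slightly stale.
    fn snapshot(&self) -> ProgressSnapshot {
        ProgressSnapshot {
            total_bytes: self.total_bytes.load(Ordering::Relaxed),
            completed_bytes: self.completed_bytes.load(Ordering::Relaxed),
        }
    }
}
```

No callback has to be registered or invoked on the hot path; a UI or binding layer simply calls `snapshot()` at its own cadence.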
This PR also updates the rate estimation to use the lighter weight
exponentially weighted moving averages model, so this can be done at a
low level.
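
For reference, an exponentially weighted moving average needs only one stored value per estimate, which is what makes it cheap at a low level. The sketch below is generic: `EwmaRate` and the smoothing factor `alpha` are illustrative assumptions, not the estimator or parameters used in this PR.

```rust
/// Hypothetical EWMA throughput estimator in bytes per second.
struct EwmaRate {
    alpha: f64,        // weight of the newest sample, in (0, 1]
    rate: Option<f64>, // None until the first sample arrives
}

impl EwmaRate {
    fn new(alpha: f64) -> Self {
        Self { alpha, rate: None }
    }

    /// Fold one observation (bytes transferred over `seconds`) into the estimate.
    fn update(&mut self, bytes: u64, seconds: f64) -> f64 {
        if seconds <= 0.0 {
            // Ignore degenerate intervals and keep the current estimate.
            return self.rate.unwrap_or(0.0);
        }
        let sample = bytes as f64 / seconds;
        let next = match self.rate {
            Some(previous) => self.alpha * sample + (1.0 - self.alpha) * previous,
            None => sample,
        };
        self.rate = Some(next);
        next
    }
}
```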
To preserve compatibility for existing callback consumers,
callback-oriented upload/download progress tracking APIs are moved under
xet_pkg::legacy and bridged from polling snapshots via callback-based
updaters. hf_xet and git_xet are updated to use that legacy bridge
layer, so current integrations keep working until everything is fully
switched over to the XetSession method.
## Summary
- Rewrites smoke tests to drive everything through the `hf` CLI rather
than the huggingface_hub Python API, covering the actual user-facing
surface area of hf-xet
- Moves smoke tests and diagnostic scripts into a `scripts/` directory
for cleaner repo layout
- Adds storage bucket test suite exercising the full bucket lifecycle
- Adds 50 MB and 100 MB files to repo upload/download tests
## Test matrix (14 tests, all passing)
**Repository tests** (`hf upload` / `hf download`)
- Upload single file, upload folder
- Download individual files + SHA-256 verify
- Download entire repo + SHA-256 verify
- Overwrite file and verify new content served
- Delete file and confirm absent
**Bucket tests** (`hf buckets`)
- `cp` upload / download + verify
- `sync` upload / download + verify
- Recursive list confirms expected paths
- Overwrite via `cp` + verify
- `sync --delete` removes extraneous remote files
- `rm` + confirm absent from listing
## Test plan
- [x] Run `HF_TOKEN=... ./scripts/smoke_tests/run.sh` and confirm all 14
tests pass
- [x] Run `./scripts/smoke_tests/run.sh --skip-buckets` for repo-only
path
- [x] Run with `--hf-xet-version <version>` to confirm PyPI cache bypass
works
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This PR moves some config values that were part of the data
configuration into XetConfig, specifically the compression_policy,
staging_subdir, session_dir_name, and global_dedup_query_enabled. This
also consolidates the remaining values into a single struct with
endpoint and authentication information.
Currently, the SHA-256 hash of uploaded file content is computed
internally during the upload pipeline but not surfaced to callers.
Downstream consumers — e.g. OpenDAL's Hugging Face backend — need the
SHA-256 to commit files to the Hub API.
This PR adds an optional sha256 field to XetFileInfo, the session-layer
FileMetadata, and the Python-exposed PyXetUploadInfo. The field is
populated from the already-computed hash when Sha256Policy::Compute or
Sha256Policy::Provided is used, and left None for downloads and when
Sha256Policy::Skip is used. Serde attributes (default,
skip_serializing_if) ensure backward-compatible serialisation — existing
serialised data without the field deserialises cleanly.
Needed for the functionality in
https://github.com/huggingface/xet-core/pull/642.
Currently, the Queued → Running status transition in spawned upload
tasks is unconditional — it overwrites whatever the current status is,
including Cancelled set by a concurrent abort() call. This creates a
race window: if abort() sets Cancelled between the semaphore acquisition
and the status write, the task overwrites it with Running, then the
completion guard (if matches!(*s, TaskStatus::Running)) passes and sets
Completed. The result is a task that was aborted but reports Completed.
This PR makes the Queued → Running transition conditional, matching the
already-guarded Running → Completed/Failed transition. If the status is
no longer Queued when the task starts, it bails early with
SessionError::Aborted. This closes the race window — all three status
transitions are now properly guarded against concurrent abort().
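
The guarded transition can be sketched as a compare-then-set under the status lock. This is a simplified model with hypothetical helpers (`try_transition`, `run_task`, `abort`) and string errors in place of `SessionError::Aborted`; it only shows why a conditional write closes the window.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskStatus {
    Queued,
    Running,
    Completed,
    Cancelled,
}

/// Move to `to` only if the status is still `from`; report whether it moved.
fn try_transition(status: &Mutex<TaskStatus>, from: TaskStatus, to: TaskStatus) -> bool {
    let mut current = status.lock().unwrap();
    if *current == from {
        *current = to;
        true
    } else {
        false
    }
}

/// Abort never overwrites a terminal state.
fn abort(status: &Mutex<TaskStatus>) {
    let mut current = status.lock().unwrap();
    if matches!(*current, TaskStatus::Queued | TaskStatus::Running) {
        *current = TaskStatus::Cancelled;
    }
}

/// Task body: both transitions are conditional, so a concurrent abort wins.
fn run_task(status: &Mutex<TaskStatus>) -> Result<(), &'static str> {
    if !try_transition(status, TaskStatus::Queued, TaskStatus::Running) {
        return Err("aborted before start");
    }
    // ... the actual upload work would happen here ...
    if try_transition(status, TaskStatus::Running, TaskStatus::Completed) {
        Ok(())
    } else {
        Err("aborted while running")
    }
}
```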
This was observed as a flaky test failure on Windows CI
(test_abort_while_state_lock_held_skips_state_update_but_drains_tasks).
This PR introduces V2 multirange URL fetching for xorbs, but optionally
splits the multirange requests into multiple single-range requests that
can be executed in parallel. This allows the reconstruction process to
generate full multirange presigned URLs, but the client effectively
performs the retrieval stage as a sequence of parallel single-range
queries.
The config variable `client.enable_multirange_fetching` controls this
behavior; by default it is set to false due to the current observed
slowness of fetching multiranged URLs.
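
The splitting idea, sketched with plain threads and an in-memory buffer standing in for the presigned URL. `fetch_range` and `fetch_multirange_split` are hypothetical; the real client issues HTTP range requests on its async runtime rather than spawning OS threads.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical single-range fetch; reads from a buffer instead of a URL.
fn fetch_range(blob: &[u8], range: Range<usize>) -> Vec<u8> {
    blob[range].to_vec()
}

/// Issue one single-range fetch per requested range, in parallel, and
/// return the parts in the same order as the requested ranges.
fn fetch_multirange_split(blob: &[u8], ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
    thread::scope(|scope| {
        let handles: Vec<_> = ranges
            .iter()
            .cloned()
            .map(|range| scope.spawn(move || fetch_range(blob, range)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("fetch thread panicked"))
            .collect()
    })
}
```

Joining the handles in spawn order keeps the output aligned with the original multirange request, regardless of which fetch finishes first.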
---------
Co-authored-by: Adrien <adrien@huggingface.co>
Currently, `UploadCommitSync` and `DownloadGroupSync` are thin wrappers
around `UploadCommit` and `DownloadGroup` that delegate every method
through `external_run_async_task`. This means two types, two sets of doc
comments, and two test suites covering the same underlying behavior.
This PR removes the separate sync types and adds `_blocking` suffixed
methods directly on `UploadCommit` and `DownloadGroup`. The session
factory methods `new_upload_commit_blocking()` and
`new_download_group_blocking()` now return the same types as their async
counterparts, and the entire `xet_session::sync` module is deleted (~680
lines removed).
This also fixes a minor bug: `UploadCommitSync::upload_from_path` did
not call `std::path::absolute()` on the file path before dispatching,
unlike the async version. The new `upload_from_path_blocking` includes
the `std::path::absolute()` call, matching the async version's behavior.
This PR adds a feature flag, "python", to the xet_runtime package such
that, when it is enabled, the XetConfig struct is built with python
getters and setters. This integrates the handling of the config struct directly
into the XetConfig struct and the macros used to register the config
values, making the handling of values in the python bindings seamless.
Stacking on top of https://github.com/huggingface/xet-core/pull/694,
this updates both the async and sync APIs: the public
`XetSession::UploadCommit` and `XetSession::UploadCommitSync` APIs now
explicitly take a `sha256: Sha256Policy` to control whether to
compute and embed a sha256 for that file in the upload info.
Resolves XET-898
---------
Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
It's a bit annoying to try to ensure our CDN routing is correct. This PR
logs the URL domain of the first fetch term download to the debug logs.
Please don't hesitate to recommend alternative approaches.
`XetSession` always created its own tokio runtime via
`XetRuntime::new_with_config`, and calling `external_run_async_task`
panics when already inside a tokio context. This blocked embedding the
session in async Rust frameworks.
Core strategy:
- `RuntimeMode` enum, with two variants:
  - `Owned`: the session created its own thread pool via
    `XetSessionBuilder::build`, or via `XetSessionBuilder::build_async`
    when outside a tokio context. Both `_blocking` and async methods are
    supported. Async methods use an internal `bridge_to_owned` bridge that
    routes futures onto the owned thread pool, so they work from any
    executor (tokio, smol, async-std).
  - `External`: the session wraps a caller-supplied tokio handle via
    `XetSessionBuilder::with_tokio_handle`, or via
    `XetSessionBuilder::build_async` when inside a qualified tokio
    context. Only async methods may be called; `_blocking` methods return
    `SessionError::WrongRuntimeMode`. No second thread pool is created.
- `XetRuntime::bridge_to_owned` — a new bridge that routes a future onto
the owned tokio thread pool from any executor (smol, async-std,
futures::executor, non-qualified tokio runtime) by delivering the result
via a `tokio::sync::oneshot` channel that can be polled by any async
executor.
- Async public API — `UploadCommit` and `DownloadGroup` methods
(`upload_from_path`, `upload_bytes`, `upload_file`, `commit`, `finish`)
are now async fn. Factory methods `XetSession::new_upload_commit` and
`new_download_group` are async.
Example:
```
let session = XetSessionBuilder::new().build_async().await?;
// Upload
let commit = session.new_upload_commit().await?;
let handle = commit.upload_from_path("file.bin".into()).await?;
let results = commit.commit().await?;
// Download
let group = session.new_download_group().await?;
let info = XetFileInfo {
hash: ...,
file_size: ...,
};
let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?;
let finish_results = group.finish().await?;
```
- Sync wrappers — New `UploadCommitSync` / `DownloadGroupSync` in
`xet_session/sync/` expose a fully blocking API for sync Rust and Python
(PyO3) callers. Returned by `new_upload_commit_blocking()` and
`new_download_group_blocking()`.
Example:
```
let session = XetSessionBuilder::new().build()?;
// Upload
let commit = session.new_upload_commit_blocking()?;
let handle = commit.upload_from_path("file.bin".into())?;
let results = commit.commit()?;
let m = results.values().next().unwrap().as_ref().as_ref().unwrap();
// Download
let group = session.new_download_group_blocking()?;
let info = XetFileInfo {
hash: ...,
file_size: ...,
};
let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?;
let finish_results = group.finish()?;
```
Additional fixes: `download_file_to_path` and `upload_from_path` now
make paths absolute with `std::path::absolute` before enqueuing; task
status is only overwritten when still `Running`, preventing a race with
concurrent abort().
Fix XET-891
---------
Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
Currently, the condition for increasing connection concurrency is gated
on the model predicting that a 64MB transmission will complete within 90
seconds. However, when the transmissions are primarily composed of small
packets, this can drastically overestimate the round-trip time,
artificially suppressing the connection concurrency.
This PR fixes this issue by also modeling the average predicted packet
size, using the 95% quantile of that (bounded by two config variables)
to predict the round trip time when considering a concurrency increase.
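
A minimal sketch of the gating idea described above, not the code in this PR. The linear latency model, the nearest-rank `quantile` helper, `may_increase_concurrency`, and the `min_bytes` / `max_bytes` bounds are all hypothetical names standing in for the real model and the two config variables.

```rust
/// Hypothetical latency model: time = fixed latency + bytes / bandwidth.
struct LatencyModel {
    base_latency_secs: f64,
    bytes_per_sec: f64,
}

impl LatencyModel {
    /// Predicted time in seconds to transmit `bytes`.
    fn predict_secs(&self, bytes: u64) -> f64 {
        self.base_latency_secs + bytes as f64 / self.bytes_per_sec
    }
}

/// Nearest-rank empirical quantile of the observed transmission sizes.
fn quantile(sizes: &[u64], q: f64) -> Option<u64> {
    if sizes.is_empty() {
        return None;
    }
    let mut sorted = sizes.to_vec();
    sorted.sort_unstable();
    let rank = ((q * sorted.len() as f64).ceil() as usize).clamp(1, sorted.len());
    Some(sorted[rank - 1])
}

/// Gate a concurrency increase on the predicted time for a reference size
/// taken from the 95% quantile of recent sizes, bounded by two limits
/// (assumed here to correspond to the two config variables).
fn may_increase_concurrency(
    model: &LatencyModel,
    recent_sizes: &[u64],
    min_bytes: u64,
    max_bytes: u64,
    target_secs: f64,
) -> bool {
    // With no observations, fall back to the conservative upper bound.
    let reference = quantile(recent_sizes, 0.95)
        .unwrap_or(max_bytes)
        .clamp(min_bytes, max_bytes);
    model.predict_secs(reference) <= target_secs
}
```

Under these assumptions, a slow link carrying mostly 1 MiB transmissions fails a fixed 64 MiB check against a 90 second target but passes the quantile-based one, so the concurrency is no longer held down.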
## Summary
- Bump hf_xet version from 1.4.1 to 1.4.2 in Cargo.toml and Cargo.lock
- Follows up on 1.4.1 release where the version bump PR was merged after
the release artifacts were built
This PR updates the interface for retrieving per-task results after
UploadCommit::commit() or DownloadGroup::finish(). The problem with the
previous interface is that commit() and finish() return a vector of
FileMetadata or DownloadResult, making it difficult for users to
associate each result with a specific task.
The new interface uses `task_id` as the key that binds each result to
its task:
## Upload per-task result access patterns
After commit() completes, there are two equivalent ways to retrieve a
per-task FileMetadata result:
1. Lookup in the global result map:
```
let commit = session.new_upload_commit()?;
let handle = commit.upload_from_path(src)?;
let results = commit.commit()?;
let result = results.get(&handle.task_id);
```
2. Direct access from the handle:
```
let commit = session.new_upload_commit()?;
let handle = commit.upload_from_path(src)?;
commit.commit()?;
// handle.result() is populated by commit() via the shared Arc.
let result = handle.result();
```
## Download per-task result access patterns
The pattern is similar to the above.
## Why not put results in a vector in the same order as tasks are registered to the commit instance?
After a commit instance is created, it can be cloned (since it is itself
an Arc wrapping an internal struct) and sent to different threads. When
multiple threads are registering tasks, there is no static registration
order that a program can observe upfront.
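
The following sketch illustrates the point with a toy model, not the real `UploadCommit`. The `Commit`, `Inner`, and `register` names and the string results are hypothetical; the only thing carried over from the PR is the idea of a cloneable Arc-backed handle whose results are looked up by task id.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

type TaskId = u64;

#[derive(Default)]
struct Inner {
    next_id: TaskId,
    results: HashMap<TaskId, String>,
}

/// Toy stand-in for a commit object: cloning shares the same inner state.
#[derive(Clone, Default)]
struct Commit {
    inner: Arc<Mutex<Inner>>,
}

impl Commit {
    /// Register a task and return its id. Which thread gets which id
    /// depends on scheduling, so the order is not known upfront.
    fn register(&self, name: &str) -> TaskId {
        let mut inner = self.inner.lock().unwrap();
        let id = inner.next_id;
        inner.next_id += 1;
        inner.results.insert(id, format!("metadata for {name}"));
        id
    }

    /// Return all results keyed by task id.
    fn commit(&self) -> HashMap<TaskId, String> {
        self.inner.lock().unwrap().results.clone()
    }
}

/// Register each name from its own thread, then commit.
fn register_from_threads(names: &[&str]) -> (Vec<(String, TaskId)>, HashMap<TaskId, String>) {
    let commit = Commit::default();
    let handles: Vec<_> = names
        .iter()
        .map(|name| {
            let commit = commit.clone();
            let name = name.to_string();
            thread::spawn(move || {
                let id = commit.register(&name);
                (name, id)
            })
        })
        .collect();
    let registered = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (registered, commit.commit())
}
```

Each thread can find its own result through the id it received, regardless of the order in which the registrations happened to interleave.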
This PR creates a folder, api_changes, in which AI agents can record
updates to the API surface that could affect downstream PRs and
dependencies. This can be scanned by AI agents to reliably perform
merges or to propagate changes. See api_changes/README.md for a
description of how this should work.
This PR is a massive rearrangement of the code base into 5 packages
intended for release on cargo. The directories and corresponding
packages are:
1. xet_runtime/ — compiles into the xet-runtime package. Contains the
runtime, config, and logging management.
2. xet_core_structures/ — compiles into the xet-core-structures package.
Contains core data structures for hashing, shards, and xorbs as well as
internal data structures that depend on these.
3. xet_client/ — compiles into the xet-client package, contains client
code for remotely connecting to the Hugging Face servers.
4. xet_data/ — compiles into the xet-data package, contains the data
processing pipeline: chunking/deduplication, file reconstruction,
clean/smudge operations, and progress tracking.
5. xet_pkg/ — compiles into the hf-xet package, provides the top-level
session-based API for file upload and download with user-facing error
categorization. This is the primary package downstream dependencies
would use. This also contains a single summary error type, XetError,
that translates cleanly into python error types.
In addition, the other tools are:
- git_xet/ — the git_xet CLI binary crate (location preserved).
- hf_xet/ -- the hf_xet python package (location preserved).
- simulation/ — the simulation crate for upload scenario benchmarking.
- wasm/ -- the wasm objects.
The full description — and information for an AI agent to use to update
downstream dependencies — is at
api_changes/update_260309_package_restructure.md.
Summary of moves:
- xet_runtime: became xet_runtime::core inside xet_runtime/.
- utils: became xet_runtime::utils inside xet_runtime/.
- xet_config: became xet_runtime::config inside xet_runtime/.
- xet_logging: became xet_runtime::logging inside xet_runtime/.
- error_printer: became xet_runtime::error_printer inside xet_runtime/.
- file_utils: became xet_runtime::file_utils inside xet_runtime/.
- merklehash: became xet_core_structures::merklehash inside
xet_core_structures/.
- mdb_shard: became xet_core_structures::metadata_shard inside
xet_core_structures/.
- xorb_object: became xet_core_structures::xorb_object inside
xet_core_structures/.
- cas_client: became xet_client::cas_client inside xet_client/.
- hub_client: became xet_client::hub_client inside xet_client/.
- cas_types: became xet_client::cas_types inside xet_client/.
- chunk_cache: became xet_client::chunk_cache inside xet_client/.
- data: became xet_data::processing inside xet_data/.
- deduplication: became xet_data::deduplication inside xet_data/.
- file_reconstruction: became xet_data::file_reconstruction inside
xet_data/.
- progress_tracking: became xet_data::progress_tracking inside
xet_data/.
- xet_session: became xet::xet_session inside xet_pkg/.
- Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level
into wasm/; internal imports updated, public APIs unchanged.
Fixes
[XET-885](https://linear.app/xet/issue/XET-885/investigate-unsloth-upload-failure-shard-upload-timeout-on-cas)
## Summary
Shard uploads to CAS can take a long time due to server-side processing
(DynamoDB writes scale with file entry count). The default
`read_timeout(120s)` on the reqwest client kills these uploads.
**Key insight:** reqwest's per-request `RequestBuilder::timeout()` does
NOT override the client-level `read_timeout()` — they are independent
mechanisms polled as separate futures. So the original approach of using
per-request timeouts was ineffective.
**Fix:** Create a dedicated `shard_upload_http_client` on `RemoteClient`
with **no `read_timeout`**, built once at construction time and reused
for all shard uploads. All other settings (connect timeout, pool config,
auth middleware) are identical to the standard client.
## Changes
### `cas_client/src/http_client.rs`
- Added `reqwest_client_no_read_timeout()` — creates a reqwest client
with no `read_timeout`
- Added `build_auth_http_client_no_read_timeout()` — public API wrapping
it with middleware
- 4 unit tests for the new builder
### `cas_client/src/remote_client.rs`
- Added `shard_upload_http_client` field to `RemoteClient` (cfg'd out on
wasm)
- `upload_shard()` uses the pre-built no-timeout client instead of
building one per request
### `cas_client/tests/test_shard_upload_timeout.rs`
- Updated: slow server test now asserts **success** (shard uploads
should wait as long as needed)
### `xet_config/src/groups/client.rs`
- Removed `shard_read_timeout` config field (no longer needed)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Fixes download stalls/deadlocks on large file reconstruction (reported
on 48.5 GB GGUF files). The root cause is a circular dependency: the
main reconstruction loop holds a buffer semaphore permit while blocking
on CAS connection permit acquisition, and xorb write locks held during
HTTP downloads cause CAS permit starvation.
### Changes
1. **Single-flight xorb downloads via `OnceCell`** (`xorb_block.rs`):
replaces `RwLock<Option<...>>` with `tokio::sync::OnceCell`. Only one
task per xorb block acquires a CAS permit and downloads the data;
concurrent callers wait on the same result without acquiring permits or
duplicating work. This eliminates duplicate downloads, prevents
double-counted transfer progress, and avoids a failing duplicate from
killing the reconstruction.
2. **Decouple CAS permit from buffer permit** (`file_term.rs`): the main
loop no longer blocks on CAS permits while holding a buffer permit. The
spawned download task delegates to `retrieve_data` which handles permit
acquisition internally via the OnceCell single-flight. This breaks the
circular dependency that causes stalls.
3. **Improve error propagation** (`sequential_writer.rs`): when the
background writer channel closes, check `RunState` for the original
error before returning a generic "channel closed" message.
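
The single-flight behavior in item 1 can be sketched with a synchronous analogue. This uses `std::sync::OnceLock` and threads rather than `tokio::sync::OnceCell` and tasks, and the `XorbBlock` struct, the `downloads` counter, and the fake payload are illustrative assumptions, not the code in `xorb_block.rs`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Toy xorb block: the data is fetched at most once and then shared.
struct XorbBlock {
    data: OnceLock<Vec<u8>>,
    downloads: AtomicUsize, // counts how many times the "download" ran
}

impl XorbBlock {
    fn new() -> Self {
        XorbBlock {
            data: OnceLock::new(),
            downloads: AtomicUsize::new(0),
        }
    }

    /// The first caller runs the initializer; concurrent callers block
    /// until it finishes and then read the same value.
    fn retrieve_data(&self) -> &[u8] {
        self.data.get_or_init(|| {
            // In the real pipeline, this is where a CAS permit would be
            // acquired and the HTTP download performed.
            self.downloads.fetch_add(1, Ordering::SeqCst);
            vec![0xAB; 1024]
        })
    }
}

/// Returns (number of downloads performed, total bytes seen by callers).
fn concurrent_retrieve(callers: usize) -> (usize, usize) {
    let block = Arc::new(XorbBlock::new());
    let handles: Vec<_> = (0..callers)
        .map(|_| {
            let block = Arc::clone(&block);
            thread::spawn(move || block.retrieve_data().len())
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (block.downloads.load(Ordering::SeqCst), total)
}
```

Because only the initializing caller enters the closure, only that caller would ever need a download permit; the others simply wait for the stored result.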
### Root cause
The reconstruction pipeline has three resource pools: buffer permits
(bounded semaphore), CAS download permits (64 concurrent), and per-xorb
write locks.
Before this fix, the main loop would:
1. Acquire a **buffer permit** (blocking if buffer full)
2. Call `get_data_task()` which acquires a **CAS permit** (blocking if
pool exhausted)
3. Inside `retrieve_data()`, hold a **write lock** during the entire
HTTP download
This creates two deadlock vectors:
- **Buffer vs CAS**: buffer fills up with terms waiting for CAS permits,
but CAS permits are held by tasks blocked behind xorb write locks, and
the writer can't drain the buffer because it's waiting for those tasks
- **CAS vs write lock**: multiple tasks sharing the same xorb each hold
a CAS permit while blocked on the write lock, starving other xorbs of
permits
## Reproduction
Reliably reproducible with a small buffer:
```
HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb \
HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT=64mb \
python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/Qwen3-Coder-Next-GGUF', 'Qwen3-Coder-Next-Q4_K_M.gguf', local_dir='/tmp/test')"
```
- **Before fix**: stalls at ~3.4 GB, no progression (deadlock)
- **After fix**: continuous progression, completes successfully
With default buffer (2 GB), the stall is intermittent depending on
network speed (consistently reproduced on slower connections).
## Summary
- Add `ShaGenerator::Skip` variant that skips SHA-256 computation
entirely
- `ShaGenerator::finalize()` now returns `Option<Sha256>` (None when
skipped)
- `SingleFileCleaner::new()` and `FileUploadSession::start_clean()`
accept a `skip_sha256` boolean
- When skipped, no `FileMetadataExt` is included in the shard
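
A rough sketch of the shape of this change, under loud assumptions: `Digest` and the `Compute` variant are hypothetical, and the running checksum is a placeholder that is NOT SHA-256. Only the `Skip` and `ProvidedValue` variant names and the `Option` return of `finalize()` come from the PR descriptions.

```rust
/// Placeholder for a real SHA-256 digest type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Digest(String);

enum ShaGenerator {
    /// Placeholder running state; a real implementation would hold a
    /// SHA-256 hasher here.
    Compute(u64),
    /// Caller supplied the hash up front; no work is done on update.
    ProvidedValue(Digest),
    /// No hash is computed at all.
    Skip,
}

impl ShaGenerator {
    fn update(&mut self, data: &[u8]) {
        // Only the computing variant touches the data.
        if let ShaGenerator::Compute(state) = self {
            for &b in data {
                *state = state.wrapping_mul(31).wrapping_add(b as u64);
            }
        }
    }

    /// Returns `None` when hashing was skipped.
    fn finalize(self) -> Option<Digest> {
        match self {
            ShaGenerator::Compute(state) => Some(Digest(format!("{state:016x}"))),
            ShaGenerator::ProvidedValue(digest) => Some(digest),
            ShaGenerator::Skip => None,
        }
    }
}

/// Feed a file through the generator in small chunks and finalize it.
fn clean_file(data: &[u8], mut sha: ShaGenerator) -> Option<Digest> {
    for chunk in data.chunks(4) {
        sha.update(chunk);
    }
    sha.finalize()
}
```

A caller that gets `None` back would omit the SHA-256 metadata entry instead of writing a dummy value.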
## Context
Bucket uploads don't need SHA-256 in the shard metadata — the
`sha_index` GSI is only used for LFS pointer resolution, which doesn't
apply to buckets. Skipping SHA-256 for bucket uploads removes the main
CPU bottleneck in the upload pipeline on non-SHA-NI instances.
## Alternative: dummy SHA-256
Instead of skipping entirely, the client could send a zeroed/dummy
`FileMetadataExt`. The server would still store it but queries would
never match. This avoids the server-side schema change (xetcas PR) but
pollutes the GSI with dummy entries.
Companion PRs:
- xetcas: huggingface-internal/xetcas#498 (make `FileIdItem.sha256`
optional server-side)
This PR replaces the previous collection of scripts for setting up
Docker containers with a lightweight set of Rust scripts and a simple,
reusable proxy that can limit bandwidth and simulate congestion. The
previous scripts are rewritten on top of these reusable components.
New tools:
- cas_client/src/simulation/network_simulation: A lightweight,
in-process network congestion simulation proxy that lives between the
LocalServer instance and the RemoteClient instance, allowing simulation
tests to run on a network with realistic congestion conditions and a
gated bandwidth. This can be controlled dynamically through a
LocalTestServer instance.
- simulation/: A new package for collecting simulation scripts and
analyzing the results.
To run the new simulation scripts for the adaptive concurrency on
upload, compile in release mode and run one of the scripts in
`simulation/src/adaptive_concurrency/scripts/`. Docker is no longer
needed to run any of the simulations.
The old `cas_client/tests/adaptive_concurrency/` paths were removed.
This PR adds interface functions to the LocalServer class that will
allow it to become a full simulation environment for testing all the
garbage collection stages.
Currently, the async stream logic silently swallows an UnexpectedEOF,
treating it the same as an EOF. This is a bug; this PR fixes it to
propagate UnexpectedEOF while handling correct EOF as the end of the
stream.
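
The distinction can be sketched with a synchronous analogue built on `std::io::Read`; the PR itself concerns async stream logic. The `drain` function and the `Truncated` reader are hypothetical names for illustration.

```rust
use std::io::{self, ErrorKind, Read};

/// Read until a clean end of stream. `Ok(0)` is the normal end;
/// any error, including `UnexpectedEof`, is returned to the caller.
fn drain<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(out),
            Ok(n) => out.extend_from_slice(&buf[..n]),
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            // Do not swallow truncation: propagate it.
            Err(e) => return Err(e),
        }
    }
}

/// Test reader that yields some bytes and then reports a truncated stream.
struct Truncated {
    remaining: &'static [u8],
}

impl Read for Truncated {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.remaining.is_empty() {
            return Err(io::Error::new(ErrorKind::UnexpectedEof, "stream truncated"));
        }
        let n = buf.len().min(self.remaining.len());
        buf[..n].copy_from_slice(&self.remaining[..n]);
        self.remaining = &self.remaining[n..];
        Ok(n)
    }
}
```

With this structure a truncated transfer surfaces as an error rather than as a short but apparently successful read.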
This PR makes the use of the `cas` and `xorb` terms consistent.
Previously, "cas" (for content addressed store) could refer to either
the remote server or the data bytes stored as a collection of chunks.
After the renames in this PR, we consistently use `xorb` to refer to the
data object and `cas` to refer to the remote server.
This renames quite a few places; to aid in rebasing current work or
updating downstream dependencies, this PR includes a file
`API_UPDATES.md` that can be fed into an AI agent to quickly and
accurately perform the renaming on any downstream dependencies.
This PR introduces a new `xet_session` crate that provides a
session-based hierarchical API: users create a XetSession to manage
runtime and configuration, then batch uploads into UploadCommit objects
and downloads into DownloadGroup objects, each of which runs its
transfers in the background on the inner XetRuntime.
All public functions are exposed as sync functions, making them easy to
use from other languages such as Python or C.
## Summary
- Add optional `sha256s` keyword parameter to the Python-exposed
`upload_files()` function
- Forward it to `data_client::upload_async()` which already supports it
## Context
### Double computation today
`huggingface_hub` computes SHA-256 on every file during
`CommitOperationAdd.__post_init__()` for LFS batch negotiation, then
`hf_xet` recomputes it internally because `upload_files()` doesn't
accept pre-computed hashes.
### Performance impact
This change eliminates the redundant computation entirely.
### Backward compatibility
- `sha256s` is a keyword-only parameter with default `None` — no change
for existing callers
- `data_client::upload_async()` already accepts `sha256s:
Option<Vec<String>>` since day one
- When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue`
and skips internal recomputation
Companion PR: huggingface/huggingface_hub#3876
## Summary
- Fix command injection vulnerability in `.github/workflows/release.yml`
(HackerOne #3581567, severity High 8.8)
- `${{ github.event.inputs.tag }}` was interpolated directly in `run:`
blocks, allowing arbitrary RCE via crafted tag input (e.g. `v0.1.0; id;
cat /etc/passwd;#`)
- Moved all 6 occurrences to `env:` variables so the value is passed as
a shell environment variable instead of being interpolated into the
script
## Jobs fixed
- `linux` — "Update version in toml" step
- `musllinux` — "Update version in toml" step
- `windows` — "Update version in toml" step
- `macos` — "Update version in toml" step
- `sdist` — "Update version in toml" step
- `github-release` — "Create GitHub Release" step (`gh release create`)
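
The difference between the two approaches can be reproduced outside GitHub Actions. The sketch below is an analogue in Rust, assuming a POSIX `sh` is available; `run_interpolated` and `run_with_env` are hypothetical helpers and the `echo` script stands in for the real workflow steps.

```rust
use std::process::Command;

/// Unsafe pattern: the untrusted value becomes part of the script text,
/// so shell metacharacters in it are executed.
fn run_interpolated(tag: &str) -> String {
    let script = format!("echo tag={tag}");
    let out = Command::new("sh")
        .arg("-c")
        .arg(&script)
        .output()
        .expect("failed to run sh");
    String::from_utf8_lossy(&out.stdout).into_owned()
}

/// Safe pattern: the script text is fixed and the value is passed through
/// an environment variable, so the shell treats it as data.
fn run_with_env(tag: &str) -> String {
    let out = Command::new("sh")
        .arg("-c")
        .arg("echo \"tag=$TAG\"")
        .env("TAG", tag)
        .output()
        .expect("failed to run sh");
    String::from_utf8_lossy(&out.stdout).into_owned()
}
```

With an input such as `v0.1.0; echo INJECTED`, the interpolated version runs the second command, while the environment-variable version prints the whole input as a single string.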
This PR adds an integrated API for streaming downloads, exposing a
DownloadStream object that is integrated with the file reconstructor.
The stream object is subject to the same buffer-limiting memory
management process.
It also introduces cancellation support to the FileReconstructor so
that tasks waiting on a long-running download or a semaphore do not
hang when an error is reported or the user drops the stream.