39 Commits

Author SHA1 Message Date
Joseph Godlewski
5a75a71727 feat: expose FileUploadSession low-level functions (#855) 2026-06-01 16:43:33 -07:00
Adrien
40f9530753 feat: range aware file write (#717)
## Summary

APIs for range-aware file writes: instead of re-uploading an entire file
when only part of it changed, compose a new CAS file from stable
segments + re-chunked dirty windows. Supports resize edits (insert /
delete / arbitrary replace) in addition to in-place rewrites.

### API: `upload_ranges`

```rust
pub async fn upload_ranges(
    config: Arc<TranslatorConfig>,
    cas_client: Arc<dyn Client>,
    original_hash: MerkleHash,
    original_size: u64,
    dirty_inputs: Vec<DirtyInput>,
) -> Result<XetFileInfo>
```

```rust
/// A single edit applied to the original file: replace `original_range` with
/// `new_length` bytes from `reader`. Edits are expressed in original-file coordinates.
pub struct DirtyInput {
    pub original_range: Range<u64>,
    pub reader: Pin<Box<dyn AsyncRead + Send>>,
    pub new_length: u64,
}
```

The output file size is **derived** from the inputs (no `total_size`
parameter): `original_size - removed + added`.
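
To make the size rule concrete, here is a minimal sketch of `original_size - removed + added`. `EditShape` and `derived_size` are hypothetical names used only for illustration (the real `DirtyInput` also carries a reader), and returning `None` for malformed input is an assumption rather than the crate's actual error handling.

```rust
use std::ops::Range;

/// Size-only view of an edit (hypothetical helper type): which original
/// bytes it replaces and how many bytes replace them.
pub struct EditShape {
    pub original_range: Range<u64>,
    pub new_length: u64,
}

impl EditShape {
    pub fn new(original_range: Range<u64>, new_length: u64) -> Self {
        Self { original_range, new_length }
    }
}

/// Returns `original_size - removed + added`, or `None` if an edit is
/// inverted, reaches past `original_size`, or the arithmetic would wrap.
pub fn derived_size(original_size: u64, edits: &[EditShape]) -> Option<u64> {
    let mut size = original_size;
    for e in edits {
        let r = &e.original_range;
        if r.start > r.end || r.end > original_size {
            return None;
        }
        let removed = r.end - r.start;
        size = size.checked_sub(removed)?.checked_add(e.new_length)?;
    }
    Some(size)
}
```

With `original_size = 3`, a replace of `0..1` by 3 bytes yields 5, a pure insert of 3 bytes yields 6, and a pure delete of `0..1` yields 2.
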

### Edit shapes (all expressible with the same struct)

| Operation | `original_range` | `new_length` |
|---|---|---|
| In-place edit | `a..b` | `b - a` |
| Resize replace | `a..b` | any |
| Pure insert | `p..p` | `> 0` |
| Pure delete | `a..b` | `0` |
| Append | `original_size..original_size` | `> 0` |
| Truncate to N | `N..original_size` | `0` |
| No-op | empty `dirty_inputs` | — |

Motivating example:

```text
abc + upload_ranges([0..1), "foo", 3) = foobc
abc + upload_ranges([0..0), "foo", 3) = fooabc
abc + upload_ranges([0..1), "",    0) = bc
```
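
The same three examples can be reproduced with an in-memory reference model. This is only a sketch of the edit semantics (replace `range` with the given bytes, in original-file coordinates); `apply_edits` is a hypothetical helper that never touches CAS, and rejecting unsorted or overlapping edits with `None` is an assumption about how validation might look.

```rust
use std::ops::Range;

/// Applies edits to an in-memory buffer. Each edit replaces `range`
/// (original-file coordinates) with `bytes`. Edits must be sorted and
/// non-overlapping; otherwise `None` is returned.
pub fn apply_edits(original: &[u8], edits: &[(Range<usize>, &[u8])]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut cursor = 0usize; // next original byte that has not been consumed
    for (range, bytes) in edits {
        if range.start < cursor || range.start > range.end || range.end > original.len() {
            return None;
        }
        out.extend_from_slice(&original[cursor..range.start]); // untouched bytes
        out.extend_from_slice(bytes); // replacement bytes
        cursor = range.end; // skip the replaced original bytes
    }
    out.extend_from_slice(&original[cursor..]); // untouched tail
    Some(out)
}
```

A model like this is handy as a test oracle: run the same edits through the composed-file path and compare the downloaded bytes against the model's output.
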

**Per-range `AsyncRead` instead of `ReadSeek` over the staging file.**
The earlier prototype took `dirty_ranges: &[(u64, u64)] + dirty_source:
&mut dyn ReadSeek`. That had a subtle bug: for truncation we silently
extended the dirty set with a boundary chunk and read those bytes from
the staging file, but if the file was never opened for write the staging
file contains zeros at those positions (real bytes are in CAS) → silent
corruption on the truncation boundary chunk. Pairing each edit with its
own reader makes that structurally impossible: any byte not provided by
the caller is fetched from CAS.

<details>
<summary>How it works</summary>

### High level

```
                           upload_ranges
   +----------------------+   |   +----------------------+
   |  original file (CAS) |---+-->|  composed file (CAS) |
   +----------------------+       +----------------------+
   only the dirty windows are re-uploaded; everything else
   is reused as whole CAS segments.
```

### Step 1 — coalesce + snap edits to segment boundaries

Edits are expressed in user coordinates (byte ranges). We snap each edit's
`original_range` to the **enclosing CAS segments** so composition can
swap whole segments instead of truncating one mid-chunk. Adjacent /
overlapping snapped ranges are then coalesced.

Pure inserts (`start == end`) snap to the segment that owns `start`; an
insert at `original_size` snaps to the last segment.
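
A minimal sketch of the snapping and coalescing described above, assuming the segment layout is available as a sorted list of boundaries (`bounds`, starting at 0 and ending at `original_size`). The function names, the boundary-list representation, and the choice that an insert exactly on a boundary belongs to the segment starting there are all assumptions for illustration.

```rust
use std::ops::Range;

/// Snaps one edit to the segment(s) enclosing it.
fn snap(bounds: &[u64], edit: &Range<u64>) -> Range<u64> {
    let last = bounds.len() - 1;
    // Segment that owns `edit.start`; an offset equal to `original_size`
    // is clamped to the last segment.
    let first_seg = bounds
        .partition_point(|&b| b <= edit.start)
        .saturating_sub(1)
        .min(last - 1);
    let end = if edit.start == edit.end {
        // Pure insert: just the owning segment.
        bounds[first_seg + 1]
    } else {
        // First boundary at or after the edit's end.
        bounds[bounds.partition_point(|&b| b < edit.end).min(last)]
    };
    bounds[first_seg]..end
}

/// Snaps sorted, in-range edits and merges adjacent or overlapping results.
pub fn snap_and_coalesce(bounds: &[u64], edits: &[Range<u64>]) -> Vec<Range<u64>> {
    assert!(bounds.len() >= 2, "need at least one segment");
    let mut out: Vec<Range<u64>> = Vec::new();
    for e in edits {
        let s = snap(bounds, e);
        match out.last_mut() {
            Some(prev) if s.start <= prev.end => prev.end = prev.end.max(s.end),
            _ => out.push(s),
        }
    }
    out
}
```

With boundaries `[0, 10, 20, 30, 40]`, the edits `5..12` and `22..23` snap to `0..20` and `20..30`, which are adjacent and therefore coalesce into the single range `0..30`.
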

### Step 2 — server returns windows + gap subtrees

Single CAS call: `GET /v2/file-chunk-hashes/{file_id}` with the
segment-aligned ranges in an `X-Range-Dirty: bytes=A-B,C-D` header.
Response shape (xetcas#987):

```rust
struct FileChunkHashesResponse {
    windows:      Vec<ChunkWindow>,         // one per dirty range
    hash_ranges:  Vec<Option<MerkleHashSubtree>>, // N+1 entries: [gap0, gap1, ..., gapN]
}
```

`windows[i].chunks` carries the chunk hashes the server actually owns
for that window (we re-upload these bytes). `hash_ranges[i]` is the
**MerkleHashSubtree** for the i-th unmodified gap, or `None` when there
is no gap there. This is the key to composing the final file hash
without touching unmodified bytes.

### Step 3 — for each window, stream `[CAS prefix | edits | CAS suffix]` through a fresh cleaner

```
window = [w_start ............................................. w_end]
edits in this window:        [edit_a]    [edit_b]
                                ^           ^
streamed input to the cleaner:
  CAS bytes [w_start, edit_a.start)
  reader bytes for edit_a (new_length bytes)
  CAS bytes [edit_a.end, edit_b.start)
  reader bytes for edit_b
  CAS bytes [edit_b.end, w_end)
```

Pure inserts contribute zero original bytes but still emit `new_length`
reader bytes. Pure deletes contribute zero reader bytes. The cleaner
produces a new `MDBFileInfo` per window and a `ChunkHashList`.
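
The streaming order can be sketched as a plan of pieces fed to the cleaner. `Piece`, `Edit`, and `plan_window` are hypothetical names; the sketch assumes the edits are sorted, non-overlapping, and contained in the window, and it only plans the byte sources without reading anything.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
pub enum Piece {
    /// Bytes fetched from CAS, in original-file coordinates.
    Cas(Range<u64>),
    /// `len` bytes pulled from the reader of edit number `edit`.
    Reader { edit: usize, len: u64 },
}

pub struct Edit {
    pub original_range: Range<u64>,
    pub new_length: u64,
}

/// Lists, in order, where the cleaner's input bytes for one window come from.
pub fn plan_window(window: Range<u64>, edits: &[Edit]) -> Vec<Piece> {
    let mut plan = Vec::new();
    let mut cursor = window.start; // next original byte not yet accounted for
    for (i, e) in edits.iter().enumerate() {
        if e.original_range.start > cursor {
            plan.push(Piece::Cas(cursor..e.original_range.start));
        }
        if e.new_length > 0 {
            // A pure delete has new_length == 0 and contributes nothing here.
            plan.push(Piece::Reader { edit: i, len: e.new_length });
        }
        cursor = e.original_range.end; // a pure insert leaves the cursor in place
    }
    if cursor < window.end {
        plan.push(Piece::Cas(cursor..window.end));
    }
    plan
}
```

Keeping the plan separate from the I/O makes it easy to unit-test the interleaving for inserts and deletes before any reader or CAS fetch is involved.
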

### Step 4 — compose the file hash via `MerkleHashSubtree::merge`

```text
merge_seq = [gap0, w0, gap1, w1, ..., w(N-1), gapN]   // N windows, N+1 gaps; skip None gaps

merged          = MerkleHashSubtree::merge(merge_seq)
aggregated_hash = merged.final_hash()
combined_hash   = aggregated_hash.hmac(zero)      // matches cleaner's file_hash
```

Special-case: if `total_size == 0` (e.g. truncate to empty) the result
is `MerkleHash::default()` *without* HMAC, mirroring `file_hash([])`.
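
The interleaving of gaps and windows can be sketched generically. This does not model `MerkleHashSubtree::merge` itself; `merge_sequence` is a hypothetical helper that only builds the ordered input, assuming N windows and N+1 optional gaps as in the response shape above.

```rust
/// Builds `[gap0, w0, gap1, w1, ..., gapN]`, skipping `None` gaps.
/// Returns `None` if the counts do not satisfy `gaps == windows + 1`.
pub fn merge_sequence<T: Clone>(gaps: &[Option<T>], windows: &[T]) -> Option<Vec<T>> {
    if gaps.len() != windows.len() + 1 {
        return None;
    }
    let mut seq = Vec::with_capacity(gaps.len() + windows.len());
    for (i, gap) in gaps.iter().enumerate() {
        if let Some(g) = gap {
            seq.push(g.clone());
        }
        // The last gap has no window after it.
        if let Some(w) = windows.get(i) {
            seq.push(w.clone());
        }
    }
    Some(seq)
}
```

For example, gaps `[g0, None, g2]` with windows `[w0, w1]` produce `[g0, w0, w1, g2]`: the two windows are back to back because nothing unmodified sits between them.
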

### Step 5 — splice segments + register

Walk the original `MDBFileInfo.segments` and replace any segment that
falls inside a window with that window's freshly-uploaded segments.
Verification entries follow segment-for-segment when present.
`metadata_ext = None` (no SHA-256, see Limitations). Then
`register_composed_file` + `finalize`.
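
A simplified sketch of the splice, assuming windows are segment-aligned, sorted, and non-overlapping. `Seg`, `Window`, and `splice` are hypothetical stand-ins: real segments carry hashes and chunk ranges, and verification entries are ignored here.

```rust
use std::ops::Range;

#[derive(Debug, Clone, PartialEq)]
pub struct Seg {
    pub id: &'static str,
    pub len: u64,
}

pub struct Window {
    /// Segment-aligned byte range of the original file.
    pub range: Range<u64>,
    /// Segments produced by re-chunking this window.
    pub new_segments: Vec<Seg>,
}

/// Replaces every original segment inside a window with that window's new segments.
pub fn splice(original: &[Seg], windows: &[Window]) -> Vec<Seg> {
    let mut out = Vec::new();
    let mut offset = 0u64; // start offset of the current original segment
    let mut w = 0usize; // index of the first window not yet passed
    for seg in original {
        let (start, end) = (offset, offset + seg.len);
        offset = end;
        while w < windows.len() && windows[w].range.end <= start {
            w += 1;
        }
        match windows.get(w) {
            Some(win) if win.range.start <= start && end <= win.range.end => {
                // Emit the replacement once, at the window's first segment.
                if start == win.range.start {
                    out.extend(win.new_segments.iter().cloned());
                }
            }
            _ => out.push(seg.clone()),
        }
    }
    out
}
```
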

### Multi-window example

Two edits: replace `[50MB, 51MB)` and `[150MB, 151MB)` on a 200MB file:

```
+-----------+-------+------------+-------+-----------+
|  GAP 0    |  W0   |   GAP 1    |  W1   |  GAP 2    |
|  reused   |upload |  reused    |upload |  reused   |
| (subtree) | ~1MB  | (subtree)  | ~1MB  | (subtree) |
+-----------+-------+------------+-------+-----------+

Wire transfer: ~2MB upload + a few hundred KB of CAS reads for window
boundary chunks. Old approach: 200MB download + 200MB upload.
```

### Empty original short-circuit

When `original_size == 0` there is nothing to compose against — every
edit's `original_range` must be `0..0` (validated). We just stream the
new bytes through a fresh cleaner (`upload_fresh_file`).
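
The validation rules named in the tests (sorted, non-overlapping, not past `original_size`) can be sketched as below. `EditError` and `validate` are hypothetical; the real error types and exact rules (for instance, whether two inserts at the same offset are accepted) may differ. Note that with `original_size == 0` the same checks already force every range to be `0..0`.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
pub enum EditError {
    /// `start > end` for the edit at this index.
    Inverted(usize),
    /// The edit at this index reaches past `original_size`.
    PastEnd(usize),
    /// The edit at this index starts before the previous edit ended.
    UnsortedOrOverlapping(usize),
}

/// Checks edit ranges against the original file size.
pub fn validate(original_size: u64, ranges: &[Range<u64>]) -> Result<(), EditError> {
    let mut prev_end = 0u64;
    for (i, r) in ranges.iter().enumerate() {
        if r.start > r.end {
            return Err(EditError::Inverted(i));
        }
        if r.end > original_size {
            return Err(EditError::PastEnd(i));
        }
        if r.start < prev_end {
            return Err(EditError::UnsortedOrOverlapping(i));
        }
        prev_end = r.end;
    }
    Ok(())
}
```
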

</details>

### Reviewer note: `chunk_window_builder` is a re-implementation of xetcas

`xet_client/src/cas_client/chunk_window_builder.rs` is a port of the
same window-building state machine that already lives in xetcas — it's
only used by the local / in-memory simulation clients (`local_client`,
`memory_client`) so the mock CAS server returns the same shape as the
real one in tests. **No need to re-review it as part of this PR**: it
mirrors logic already reviewed and merged in xetcas#987. A follow-up
xetcas PR will deduplicate by removing the server-side copy and pulling
this one in (or vice versa); the duplication is intentional and
temporary.

### Limitations

- **No SHA-256 metadata**: composed files have `metadata_ext = None`
since recomputing SHA-256 would require reading the full file. Only
suitable for contexts that don't require SHA-256 verification (HF
buckets, xet-native repos), not for Git LFS-backed repos.
- **Memory**: for very large files, the per-window in-memory state
(chunk hash list + composed segments) is bounded by the dirty regions,
not the whole file. The chunk-hashes response is paginated by the
server-defined window granularity.

### Tests (27)

Covering all edit shapes + edge cases. Notable:

| Test | Purpose |
|---|---|
| `test_resize_edits_abc` | The 3 motivating FUSE examples |
| `test_resize_large_replace_grows_file` | Replace `[a..b)` with much more data |
| `test_resize_large_replace_shrinks_file` | Replace `[a..b)` with much less data |
| `test_resize_mid_file_insert` | Pure insert in the middle |
| `test_resize_mid_file_delete` | Pure delete in the middle |
| `test_resize_multi_edit_mix` | Insert + replace + delete in one call |
| `test_resize_insert_at_segment_boundary` | Snapping correctness for inserts |
| `test_upload_ranges_mid_file_edit` | In-place edit |
| `test_upload_ranges_truncation` | Pure truncate (sub-segment) |
| `test_upload_ranges_truncation_empty_staging` | Truncate when staging is all-zero (boundary read from CAS) |
| `test_upload_ranges_truncation_with_overlapping_dirty` | Truncate + dirty range overlapping the boundary |
| `test_truncate_to_empty_matches_clean_empty` | Truncating to 0 hashes to `MerkleHash::default()` (matches a fresh empty cleaner) |
| `test_upload_ranges_append` | Pure append |
| `test_append_with_gap_before_dirty_range` | Append where reader covers a sparse gap too |
| `test_append_sparse_staging_file` | Append on a sparse staging file |
| `test_mid_edit_plus_append` | Mid-file edit *and* append in one call (P1 codex regression) |
| `test_empty_original_append` | `original_size == 0` + append falls into the fresh-file path (P2 codex regression) |
| `test_empty_original_validates_ranges` | `original_size == 0` still runs validation (reviewer regression) |
| `test_upload_ranges_at_file_start` | Edit at offset 0 (no stable prefix) |
| `test_upload_ranges_multiple_regions` | Two non-adjacent dirty windows with stable gap |
| `test_single_input_spanning_many_chunks` | One edit covering many CDC chunks |
| `test_data_integrity_scenarios` | 5 sub-scenarios covering composition correctness |
| `test_noop_returns_original_hash` | Empty `dirty_inputs` → no CAS call, original hash returned |
| `test_rejects_dirty_range_past_total_size` | Validation: range past `original_size` |
| `test_rejects_overlapping_dirty_ranges` | Validation: overlapping edits |
| `test_rejects_unsorted_dirty_ranges` | Validation: unsorted edits |
| `test_upload_ranges_small_file_mid_edit` | Small files (single segment) |

### Dependencies

- xetcas: `GET /v2/file-chunk-hashes/{file_id}` with `windows[] +
hash_ranges[]` response shape — huggingface-internal/xetcas#987
(merged).
- Consumer: huggingface-internal/hf-mount#41.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> High risk because it adds a new partial-upload composition path that
splices CAS segments and recomputes file hashes from window subtrees,
touching core data integrity and client/server chunk-boundary logic.
> 
> **Overview**
> Adds range-aware file writes via new `upload_ranges`, letting callers
apply insert/delete/replace edits and upload only re-chunked dirty
windows while reusing stable CAS segments.
> 
> Introduces a new CAS API `get_file_chunk_hashes` (`GET
/v2/file-chunk-hashes/{file_id}` with `X-Range-Dirty`) plus response
types (`FileChunkHashesResponse`, `ChunkWindow`) and simulation support
(`chunk_window_builder`) that extends dirty ranges to *stable* chunk
boundaries and returns gap `MerkleHashSubtree` summaries +
stable-segment verification.
> 
> Refactors dedup/cleaning plumbing to expose per-chunk hash lists
(`ChunkHashList`), adds detached cleaner/session completion and
`register_composed_file` to avoid orphan shard entries, and
moves/re-exports `next_stable_chunk_boundary` into `xet_core_structures`
for shared stable-window computations.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
2f4cee46df. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Arpit Jain <arpitjain099@gmail.com>
Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
Co-authored-by: tison <wander4096@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Di Xiao <seanses@users.noreply.github.com>
Co-authored-by: Arpit Jain <3242828+arpitjain099@users.noreply.github.com>
Co-authored-by: Assaf Vayner <assaf@huggingface.co>
Co-authored-by: Rajat Arya <rajatarya@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:27:59 +02:00
Hoyt Koepke
6f0cf38065 Stable chunk boundary detection (#815)
This PR adds a function, `next_stable_chunk_boundary`, that takes a list
of chunk boundary positions and a starting cut point. It returns the
next chunk boundary after the cut point that is stable: for all possible
alterations of the data up to the cut point, chunking the entire file
always produces the same chunk boundaries from the stable chunk boundary
onward.

The implication of this is that, to alter a specific range `[a, b)` of a
file, we would do the following:

1. Locate the previous chunk boundary before `a`; call this `c_start`.
2. Take the full set of chunk boundary locations and call
`next_stable_chunk_boundary` with `b` as the cut point. This returns the
next stable chunk boundary; call this `c_end`.
3. Make the replacement to `[a, b)`; prepend the original
`data[c_start, a)` and append `data[b, c_end)`; chunk this segment.
4. Use the merkle hash subtrees for `[0, c_start)`, the new
`[c_start, c_end)`, and the original `[c_end, end)` to calculate the new
file hash. This will be the same as chunking the entire new file.
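
Steps 1 to 3 can be sketched as follows. This is an illustration only: `rechunk_input` is a hypothetical helper, `boundaries` is assumed to hold sorted chunk end offsets, and the stable-boundary lookup is passed in as a closure (`next_stable`) because the real signature of `next_stable_chunk_boundary` is not shown here. The sketch panics if the closure returns an offset before `b` or past the end of `data`.

```rust
/// Returns `(c_start, c_end, bytes_to_rechunk)` for replacing `data[a..b]`
/// with `replacement`.
pub fn rechunk_input(
    data: &[u8],
    boundaries: &[usize],
    a: usize,
    b: usize,
    replacement: &[u8],
    next_stable: impl Fn(&[usize], usize) -> usize,
) -> (usize, usize, Vec<u8>) {
    // Step 1: previous chunk boundary strictly before `a` (0 if there is none).
    let idx = boundaries.partition_point(|&p| p < a);
    let c_start = if idx == 0 { 0 } else { boundaries[idx - 1] };

    // Step 2: stable boundary after the cut point `b` (caller-supplied lookup).
    let c_end = next_stable(boundaries, b);

    // Step 3: original prefix + replacement + original suffix, ready to chunk.
    let mut buf = Vec::with_capacity((a - c_start) + replacement.len() + (c_end - b));
    buf.extend_from_slice(&data[c_start..a]);
    buf.extend_from_slice(replacement);
    buf.extend_from_slice(&data[b..c_end]);
    (c_start, c_end, buf)
}
```

Step 4 (merging the subtrees for `[0, c_start)`, the re-chunked middle, and `[c_end, end)`) is left out because it depends on the merkle subtree types.
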

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adds new public chunk-boundary selection logic used to make
resumed/partial workflows deterministic; mistakes could cause
misalignment or incorrect resume behavior in deduplication/chunking
paths. Large new randomized/stress tests reduce risk but the algorithm’s
correctness assumptions are subtle.
> 
> **Overview**
> Introduces a new public helper, `next_stable_chunk_boundary`, that
computes a restart-safe/stable resume boundary *from existing
chunk-boundary metadata* (no byte access) by scanning for two
consecutive chunks that fall within a conservative size window derived
from chunking constants.
> 
> Updates `find_partitions` documentation to reflect the hash
warmup/hidden-trigger verification approach and to reference the new
helper, re-exports the function from `xet_data::deduplication`, and adds
extensive edge-case and randomized mutation/stress tests to validate
boundary stability under arbitrary prefix changes.
> 
<!-- /CURSOR_SUMMARY -->
2026-05-18 12:16:49 -07:00
Di Xiao
9e804c2dea Remove unnecessary UniqueId -> UniqueID type alias (#824)
In response to
https://github.com/huggingface/xet-core/pull/792#discussion_r3119356452,
this removes the `UniqueID` type alias that re-exported
`xet_runtime::utils::UniqueId` under a differently capitalized name from
`xet_data::progress_tracking`. This type alias is unnecessary and
caused confusion for reviewers (both human beings and agents).
2026-05-01 04:10:14 -07:00
Di Xiao
23ec2940bb Expose XetSession APIs to Python (#792)
Replaces the old `upload_files` / `download_files` / `hash_files` Python
functions with a new object-oriented API that exposes `XetSession` and
its child objects directly as PyO3 classes. This gives Python callers
full control over session lifecycle, connection pooling, and progress
reporting.

The previous module-level functions are kept under `hf_xet/src/legacy/`
and remain importable as `from hf_xet import upload_files` etc., but now
emit `DeprecationWarning`.
2026-05-01 03:05:51 -07:00
Assaf Vayner
d40f96bbea Fix spelling typos in comments and docs (#826)
## Summary
- Run codespell across tracked files in the repo and fix unambiguous
spelling typos
- All edits are in comments, doc strings, an issue template, and one log
message — no logic changes
- 22 typos fixed across 19 files (e.g. retreived→retrieved,
elegible→eligible, occurances→occurrences, gauranteed→guaranteed,
endianess→endianness, archetectures→architectures, etc.)

## Cases left for follow-up (not in this PR)
A few hits were ambiguous and need human judgment:
- `xet_core_structures/src/metadata_shard/shard_file_manager.rs:1400`
— comment "but delet" appears truncated
- `xet_core_structures/src/metadata_shard/shard_format.rs:1577` —
"invalid somes" likely meant "invalid ones"
- `xet_data/src/deduplication/chunking.rs:564` — comment trails off
("on other po")

False positives left untouched: `serde::ser::*` module paths,
"process-global statics" (Rust `static` items), "implementor(s)"
(valid alternate of "implementer"), "re-used", "unparseable".

## Test plan
- [x] `cargo check --workspace --lib --all-features` passes
- [ ] CI green on the draft PR

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: changes are limited to spelling fixes in comments/docs, an
issue template string, and a single log message, with no functional code
modifications.
> 
> **Overview**
> Fixes a set of unambiguous spelling typos across the repo (primarily
Rust comments/docstrings plus `.github/ISSUE_TEMPLATE/bug-report.yml`
and `api_changes/README.md`).
> 
> Also corrects one user-facing log line in `hf_xet` ("cofigured" ->
"configured"); otherwise behavior is unchanged.
> 
<!-- /CURSOR_SUMMARY -->
2026-04-30 13:15:18 -07:00
Adrien
145b819fc1 feat: expose CAS client factory and chunk cache re-exports (#730)
## Context

These changes support the hf-mount project, which needs direct access to
CAS client types.

## Summary

- Changed `create_remote_client` visibility from `pub(crate)` to `pub`
- Added re-exports for `CasClient`, `ChunkCache`, `CacheConfig`, and
`get_cache` in `xet_data::processing`
2026-04-26 17:30:08 +02:00
Hoyt Koepke
b43c0aec0e Move XetRuntime model away from thread-local statics (#801)
This PR moves the XetRuntime model away from using thread-local statics
and decouples the XetConfig and XetCommon structs from a single runtime.
It introduces a struct XetContext that gives the runtime context for
operations:

```rust
struct XetContext {
    pub runtime: Arc<XetRuntime>, // The current tokio runtime wrapper, minus the config and common objects.
    pub common: Arc<XetCommon>,   // The common cache objects, semaphores, rate trackers, etc.
    pub config: Arc<XetConfig>,   // The config.
}
```
 
Instead of using functions like `xet_runtime()` and `xet_config()` that examine thread-local storage, we now explicitly pass a `XetContext` instance through from session creation; it is stored in each major processing struct.

This decouples the runtime, config, and common caches. In particular, it allows:
- Running multiple config settings and/or endpoints within the same pre-existing tokio runtime.
- Running multiple runtimes that share the same XetCommon object.
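
A minimal sketch of the explicit-context pattern, using empty stand-in types. `with_config`, `Uploader`, and the `endpoint` field are hypothetical and exist only to show two contexts sharing one runtime and one set of common caches while using different configs.

```rust
use std::sync::Arc;

// Stand-in types for illustration; the real ones hold far more state.
pub struct XetRuntime;
pub struct XetCommon;
pub struct XetConfig {
    pub endpoint: String,
}

#[derive(Clone)]
pub struct XetContext {
    pub runtime: Arc<XetRuntime>,
    pub common: Arc<XetCommon>,
    pub config: Arc<XetConfig>,
}

impl XetContext {
    /// Same runtime and common caches, different config (hypothetical helper).
    pub fn with_config(&self, config: XetConfig) -> Self {
        Self {
            runtime: Arc::clone(&self.runtime),
            common: Arc::clone(&self.common),
            config: Arc::new(config),
        }
    }
}

/// A processing struct keeps the context it was built with instead of
/// looking anything up in thread-local storage.
pub struct Uploader {
    ctx: XetContext,
}

impl Uploader {
    pub fn new(ctx: XetContext) -> Self {
        Self { ctx }
    }

    pub fn endpoint(&self) -> &str {
        &self.ctx.config.endpoint
    }
}
```
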
2026-04-21 09:17:19 -07:00
Assaf Vayner
b5f7280a3b set version 1.5.2 (#805)
<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Low Risk**
> Low risk: this is a coordinated version bump across workspace
manifests and lockfiles with no functional code changes.
> 
> **Overview**
> Bumps the workspace/package version to `1.5.2` and updates internal
crate dependency pins (`xet-runtime`, `xet-core-structures`,
`xet-client`, `xet-data`, `hf-xet`) from `1.5.1` to `1.5.2`.
> 
> Regenerates lockfiles (`Cargo.lock` plus lockfiles under `hf_xet/` and
`wasm/`) to reflect the new `1.5.2` crate versions.
> 
<!-- /CURSOR_SUMMARY -->
2026-04-20 15:06:14 -07:00
Assaf Vayner
5868f64ab9 fixing some issues identified in cargo audit (#802)
CI for hf-hub runs cargo audit and found many issues through
hf-xet transitive deps. This PR attempts to solve some of them (not
necessarily all of them).

Main changes:
- dropped derivative and reqwest-retry
- replaced bincode with postcard, only used in testing
- upgraded xet-core rand usage
- added an audit CI step that ignores some issues we can't easily fix.





<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk because it removes `reqwest-retry`/`derivative` and
replaces part of the retry classification logic with an in-house
equivalent, which could subtly change HTTP retry behavior; the remaining
changes are dependency/version bumps and test-only serialization swaps.
> 
> **Overview**
> Adds a new CI `cargo audit` job and introduces `.cargo/audit.toml` to
ignore a small set of **dev-only** RustSec advisories with documented
rationale.
> 
> Reduces audit surface by dropping `derivative` (manual `Debug` impl
for `AuthConfig`) and removing `reqwest-retry`, replacing its
status-code classification with a local `Retryable` enum +
`default_on_request_success` helper in `RetryWrapper`.
> 
> Updates workspace deps (notably `rand` to `0.10` and `rand_distr` to
`0.6`) and adjusts call sites to the newer `rand` APIs (`RngExt`
imports, minor test/bench tweaks). Test-only binary serialization
switches from `bincode` to `postcard` (and updates affected tests), with
corresponding lockfile updates across crates.
> 
<!-- /CURSOR_SUMMARY -->
2026-04-20 14:49:48 -07:00
Assaf Vayner
08377eab3c Upgrade crates version to 1.5.1 (#782)
## Summary
- Bump workspace version from 1.5.0 to 1.5.1
- Update all internal dependency version references to match

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk version-only bump across workspace manifests and lockfiles
with no code/behavior changes in the diff.
> 
> **Overview**
> Bumps the workspace package version from `1.5.0` to `1.5.1` and aligns
internal crate dependency version pins (`xet-runtime`,
`xet-core-structures`, `xet-client`, `xet-data`, `hf-xet`) to match.
> 
> Updates lockfiles (`Cargo.lock` plus `hf_xet` and wasm lockfiles) so
published/embedded artifacts resolve to the `1.5.1` crate set (including
bringing wasm lockfiles up to `1.5.1`).
> 
<!-- /CURSOR_SUMMARY -->
2026-04-06 14:03:02 -07:00
Di Xiao
950807ba43 Upgrade crates version to 1.5.0 (#775)
Update workspace version to `1.5.0` and intra-workspace dependency
versions to `1.5.0`
2026-04-03 13:39:50 -07:00
Hoyt Koepke
0d9f78aaf4 Add README.md files and Cargo.toml updates needed for publishing hf-xet (#773)
This PR adds crates.io-facing metadata (homepage, readme, keywords,
categories) for the publishable crates, along with crate README files
and concise crate-level docs so crates.io and docs.rs pages have better
context.
2026-04-03 12:34:47 -07:00
Hoyt Koepke
014ff2d75b Fix for FD leak (#774)
Currently, the tests can fail intermittently due to a subtle fd leak in
how the session and the runtimes interact. This causes tests using the
sessions to quickly run out of file handles.

There were two different issues: 

1. XetSessionInner tracked active upload commits and file download
groups in strong-reference maps, and those child objects held a clone of
the session. That created a reference cycle (session -> child -> session)
that prevented cleanup of commit/download resources and the runtime
handles. This tracking has been dropped. (Note that all
abort/sigint-cancellation behavior is handled automatically through
TaskRuntime; the session classes don't need any explicit code for it.)

2. The static thread-local reference to the tokio runtime prevented the
tokio runtime from getting cleaned up when it was created explicitly and
not aborted. In addition, JoinHandle objects hold a reference back to
the runtime, so if these are not aborted or joined, then they also
prevent the runtime from shutting down.

The FD tracking code was left in but is gated behind the `fd-track`
feature.
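
The first issue is the classic `Arc` cycle. The sketch below reproduces it with toy types (all names are stand-ins, not the real session structs): when the session holds its children strongly and each child holds the session strongly, dropping the caller's handle frees nothing. Removing the session-side tracking, as modeled by the second function, lets everything drop.

```rust
use std::sync::{Arc, Mutex};

pub struct Session {
    /// Strong references to children: one half of the cycle.
    children: Mutex<Vec<Arc<Child>>>,
}

pub struct Child {
    /// Strong back-reference to the session: the other half of the cycle.
    _session: Arc<Session>,
}

/// Builds session -> child -> session, drops the caller's handle, and
/// reports whether the session is still alive (deliberately leaks it).
pub fn leaks_with_tracked_children() -> bool {
    let session = Arc::new(Session { children: Mutex::new(Vec::new()) });
    let child = Arc::new(Child { _session: Arc::clone(&session) });
    session.children.lock().unwrap().push(child);
    let probe = Arc::downgrade(&session);
    drop(session);
    probe.upgrade().is_some()
}

pub struct UntrackedSession;

pub struct UntrackedChild {
    _session: Arc<UntrackedSession>,
}

/// Same shape without the session-side map: once the caller's handle and
/// the child are dropped, the session is freed.
pub fn leaks_without_tracking() -> bool {
    let session = Arc::new(UntrackedSession);
    let child = UntrackedChild { _session: Arc::clone(&session) };
    let probe = Arc::downgrade(&session);
    drop(session);
    drop(child);
    probe.upgrade().is_some()
}
```

In the real code the leaked object owns runtime handles and file descriptors, which is why the cycle showed up as tests running out of file handles rather than as a plain memory leak.
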

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes runtime/session lifetime management (TLS runtime refs,
shutdown behavior, and session child ownership), which can affect task
cancellation and runtime teardown across the library.
> 
> **Overview**
> Fixes intermittent file-descriptor leaks by **breaking ownership
cycles** between `XetSession` and child upload/download objects and by
ensuring `XetRuntime` can actually drop/shutdown when the last external
reference is released.
> 
> Adds an opt-in `fd-track` feature with lightweight FD counting/scoped
tracing, plus new leak-focused tests, and tightens local CAS DB/shard
manager caching to avoid duplicate `redb` opens (canonicalized paths,
weak cached handles, and cleanup on drop).
> 
<!-- /CURSOR_SUMMARY -->
2026-04-02 18:28:26 -07:00
Assaf Vayner
20198a9081 Remove prometheus dependency and metrics (#769)
## Summary
- Remove the `prometheus` crate dependency from the workspace and
`xet_data`
- Delete `prometheus_metrics.rs` which defined 3 IntCounter metrics (CAS
bytes produced, bytes cleaned, bytes smudged)
- Remove metric increment calls from `file_upload_session.rs` and
`file_download_session.rs`
- Fix Windows CI flake: redb "Database already open" error in
`test_single_large`

These metrics were collected but never exposed via any HTTP endpoint or
text encoder, making them effectively dead code.

## Test plan
- [x] `cargo +nightly fmt` — clean
- [x] `cargo clippy --all-targets` — no new warnings
- [x] `cargo test -p xet-data` — 17/17 pass
- [x] `cargo test -p xet-data --features simulation --test
test_clean_smudge` — 14/14 pass (including `test_single_large`)
- [x] WASM builds (`hf_xet_wasm`, `hf_xet_thin_wasm`) — both succeed

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: this removes unused Prometheus metrics plumbing and related
dependencies without changing the core upload/download logic. Main risk
is loss of any downstream reliance on these counters at build time
(e.g., feature flags or imports).
> 
> **Overview**
> Removes the `prometheus` dependency from the workspace and `xet_data`,
and updates lockfiles accordingly (including WASM-related lockfiles).
> 
> Deletes `xet_data`’s `prometheus_metrics` module and strips the
associated counter increments from `FileUploadSession` and
`FileDownloadSession`, leaving the data processing behavior unchanged
aside from no longer recording these metrics.
> 
<!-- /CURSOR_SUMMARY -->
2026-04-01 14:56:58 -07:00
Adrien
3d377bdffb feat: optional chunk cache in download path for cross-file dedup (#731)
## Context

These changes support the hf-mount project, which needs cross-file chunk
deduplication during downloads.

## Summary

- Adds an optional `ChunkCache` to the download path
(`FileDownloadSession`, `FileReconstructor`, `XorbBlock`). When
provided, xorb blocks are looked up in cache before HTTP requests and
stored after download.
- Cache hits skip permit acquisition, so they don't consume network
concurrency slots. This enables cross-file deduplication for mount-style
workloads.
- Breaking change to `FileDownloadSession::new()` and `from_client()`
signatures (new `chunk_cache: Option<Arc<dyn ChunkCache>>` parameter).
All existing callers pass `None`.
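
The lookup order can be sketched as below. `SimpleChunkCache`, `MemoryCache`, and `retrieve_block` are hypothetical simplifications, not the crate's actual `ChunkCache` trait or `XorbBlock` code; the sketch is synchronous, whereas the real write-back is asynchronous and best-effort, and the `fetch` closure stands in for "acquire a download permit, then issue the HTTP request".

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Simplified cache interface (hypothetical; keyed by a string for brevity).
pub trait SimpleChunkCache {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&self, key: &str, data: &[u8]);
}

#[derive(Default)]
pub struct MemoryCache {
    blocks: Mutex<HashMap<String, Vec<u8>>>,
}

impl SimpleChunkCache for MemoryCache {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.blocks.lock().unwrap().get(key).cloned()
    }

    fn put(&self, key: &str, data: &[u8]) {
        self.blocks.lock().unwrap().insert(key.to_string(), data.to_vec());
    }
}

/// Returns the block bytes and whether the network path (`fetch`) was used.
pub fn retrieve_block(
    key: &str,
    cache: Option<&dyn SimpleChunkCache>,
    fetch: impl FnOnce() -> Vec<u8>,
) -> (Vec<u8>, bool) {
    // Cache hit: return before any permit would be acquired.
    if let Some(hit) = cache.and_then(|c| c.get(key)) {
        return (hit, false);
    }
    // Cache miss (or no cache configured): take the network path.
    let data = fetch();
    if let Some(c) = cache {
        c.put(key, &data);
    }
    (data, true)
}
```

Passing `None` for the cache reproduces the previous behavior, which matches how existing callers were updated.
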

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches the core download/reconstruction path and changes session
constructor signatures; cache-hit/miss behavior affects concurrency
permits and progress reporting. Risk is mitigated by being opt-in
(`None` for existing callers) but incorrect cache keys or offsets could
corrupt reconstructed output or skew progress.
> 
> **Overview**
> Adds an *optional* `ChunkCache` to the download pipeline to enable
cross-file xorb/chunk dedup during reconstruction.
> 
> `FileDownloadSession` now accepts/stores `chunk_cache` and wires it
into `FileReconstructor`, which passes it down into
`FileTerm`/`XorbBlock` retrieval. `XorbBlock::retrieve_data` now checks
the cache before acquiring CAS download permits (so cache hits avoid
consuming network concurrency), and writes downloaded blocks back to the
cache asynchronously on a best-effort basis (logging failures).
> 
> This also introduces a small refactor (`build_chunk_offsets`) and
updates all call sites/tests/examples to the new
`FileDownloadSession::new(..., chunk_cache)` / `from_client(...,
chunk_cache)` signatures (currently passing `None`).
> 
<!-- /CURSOR_SUMMARY -->
2026-04-01 17:29:28 +02:00
Hoyt Koepke
3051478cdd Allow shard expiration to be set on global dedup queries for GC simulation (#762)
Currently, simulation global dedup shard queries return full shard bytes
with no configurable shard footer expiration, and simulation control
knobs are split between partially implemented paths. This PR adds global
dedup shard expiration control to simulation clients and servers, and
extends /simulation/set_config to cover shard expiration, max range
splitting, V2 reconstruction disabling, API delay, and URL expiration in
one path. This enables rapid simulation of the GC paths by setting the
global dedup expiration to a sub-epoch value.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches simulation client/server APIs and shard serialization behavior
(including new trait methods and HTTP knobs), so downstream implementors
and tests may break if not updated. Changes are scoped to simulation/GC
tooling paths but affect how global-dedup shard bytes are produced and
validated.
> 
> **Overview**
> Adds a new simulation control to set **global-dedup shard
expiration**: `DirectAccessClient::set_global_dedup_shard_expiration`
now makes `query_for_global_dedup_shard` optionally return *minimal*
shard bytes (file section stripped) with `shard_key_expiry = now +
expiration` (sub-second durations round up).
> 
> Extends `MDBMinimalShard` serialization with
`serialize_xorb_subset_with_expiry` to write an optional
`shard_key_expiry` footer, and updates `LocalClient`/`MemoryClient` to
use it when expiration is enabled.
> 
> Unifies and expands runtime simulation knobs under
`/simulation/set_config` (global dedup expiration, max ranges per fetch,
disable V2 reconstruction, API delay, URL expiration) and updates
`SimulationControlClient` to apply them via a retried async POST. Also
moves integrity/reachability checks to `DeletionControlableClient`, adds
`verify_all_reachable`, and wires new `/simulation/verify_all_reachable`
with 501 behavior when no deletion client is configured.
> 
> Separately, introduces **simulation-only xorb cut thresholds**
(`XORB_CUT_THRESHOLD_*`) driven by new `xet_runtime` xorb config
overrides, and updates upload/dedup code paths to use these thresholds.
> 
<!-- /CURSOR_SUMMARY -->
2026-03-31 18:35:19 -07:00
Assaf Vayner
86935b4117 Move test-only deps to dev-dependencies in git_xet (#767)
## Summary
- Move `russh`, `rand_core`, and `tempfile` from regular dependencies to
dev-dependencies in `git_xet`, since they are only used in test code
- `russh` and `rand_core` are also declared as optional regular deps
activated by the `git-xet-for-integration-test` feature flag, since the
integration test SSH server is compiled into the library under that
feature
- Gate `test_utils/ssh_server` module and related exports behind
`#[cfg(any(test, feature = "git-xet-for-integration-test"))]`
- Gate `tests/test_ssh.rs` integration test file behind `#![cfg(feature
= "git-xet-for-integration-test")]`

## Test plan
- [x] `cargo check -p git_xet` passes (no features)
- [x] `cargo test -p git_xet --no-run` passes (no features)
- [x] `cargo test -p git_xet --features git-xet-for-integration-test
--no-run` passes

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: primarily Cargo dependency/feature and `cfg` gating changes,
with no production logic changes; risk is limited to build/test
configuration and feature-flagged integration test coverage.
> 
> **Overview**
> **Reduces default build dependencies for `git_xet`.** Moves `russh`,
`rand_core`, and `tempfile` into `dev-dependencies`, and keeps
`russh`/`rand_core` available as *optional* deps enabled only by the
`git-xet-for-integration-test` feature.
> 
> **Gates SSH test helpers and integration tests behind a feature
flag.** Exposes `GitLFSAuthenticateResponse*` and the local SSH test
server only under `#[cfg(test)]` or `feature =
"git-xet-for-integration-test"`, and makes `tests/test_ssh.rs` compile
only when that feature is enabled.
> 
> Separately, cleans up workspace manifests/lockfiles by moving some
crates (`half`, `regex`, `futures-util`) to dev-deps where they’re only
needed for tests/benches, and adds `.worktrees/` to `.gitignore`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
cdc30a5a8f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-31 13:31:20 -07:00
Hoyt Koepke
29acd7a981 Fix for download streams swallowing errors into generic "Channel closed" message. (#765)
Previously, when an error happens, the channel stream can close before
the error gets propagated to the user-facing iterators; when this
happens, it's random whether the channel closed error or the original
error gets surfaced. This PR ensures that the actual error causing the
shutdown gets surfaced to the user.
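
The idea can be sketched with a plain `std::sync::mpsc` channel standing in for the per-chunk channel. The `RunState` below is a hypothetical, simplified stand-in (the real type and its fields are not shown in this PR text); only the ordering matters: consult the recorded error first, then fall back to the generic message.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the shared run state: it remembers the first
/// error that caused the shutdown.
#[derive(Default)]
struct RunState {
    first_error: Mutex<Option<String>>,
}

impl RunState {
    fn set_error(&self, msg: &str) {
        let mut slot = self.first_error.lock().unwrap();
        if slot.is_none() {
            *slot = Some(msg.to_string());
        }
    }

    /// Returns the recorded error, if any.
    fn check_error(&self) -> Result<(), String> {
        let recorded = self.first_error.lock().unwrap().clone();
        match recorded {
            Some(e) => Err(e),
            None => Ok(()),
        }
    }
}

/// Receive the next item. If the channel is closed, surface the recorded
/// root cause before falling back to a generic message.
fn next_item(rx: &Receiver<Vec<u8>>, state: &RunState) -> Result<Vec<u8>, String> {
    match rx.recv() {
        Ok(data) => Ok(data),
        Err(_) => {
            state.check_error()?;
            Err("channel closed".to_string())
        }
    }
}

fn main() {
    let state = Arc::new(RunState::default());
    let (tx, rx) = channel::<Vec<u8>>();
    state.set_error("reconstruction failed");
    drop(tx); // the producer shuts down after recording its error
    println!("{:?}", next_item(&rx, &state));
}
```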

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adjusts error handling across async/blocked stream consumption and the
sequential writer thread, affecting concurrency and shutdown paths. Risk
is moderate due to potential behavior changes in edge cases when
channels close during failures.
> 
> **Overview**
> Prevents download streaming APIs from masking reconstruction failures
as generic "channel closed" errors.
> 
> When a per-chunk `oneshot` receiver is dropped/closed,
`DownloadStream::{next,blocking_next}` and the sequential writer thread
(`SyncWriterThread::next_write`) now first call
`run_state.check_error()` to surface the *actual* underlying error
before falling back to an internal writer error.
> 
> Wires `RunState` into `SyncWriterThread` so the background writer path
can perform the same error propagation check.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e33b30f076. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-30 10:07:44 -07:00
Adrien
f781498b68 fix: truncate local file on full-file download to prevent corruption (#764)
## Summary

- Fixes a data corruption bug where downloading a file smaller than an
existing local file left stale trailing bytes intact
- The file was opened with `truncate(false)` unconditionally (needed for
concurrent partial-range writes), but full-file downloads now use
`truncate(true)`
- Adds regression test `test_full_file_truncates_larger_existing_file`
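
The effect of the flag can be reproduced with the standard library alone. The sketch below is an illustration, not the PR's code: `write_file` is a hypothetical helper, and only `OpenOptions::truncate` corresponds to what the summary describes.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Hypothetical helper: write `data` at the start of `path` and return the
/// resulting file length. `truncate_file` is true for full-file writes and
/// false for partial or concurrent writes.
fn write_file(path: &Path, data: &[u8], truncate_file: bool) -> std::io::Result<u64> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(truncate_file)
        .open(path)?;
    file.write_all(data)?;
    file.flush()?;
    Ok(fs::metadata(path)?.len())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("truncate_demo.bin");
    fs::write(&path, b"0123456789")?; // existing, larger file
    let kept = write_file(&path, b"abc", false)?; // stale tail remains
    let cut = write_file(&path, b"abc", true)?; // only the new bytes remain
    println!("without truncate: {kept} bytes, with truncate: {cut} bytes");
    fs::remove_file(&path)
}
```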

Ref: https://github.com/huggingface/huggingface_hub/issues/3995

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes on-disk write semantics for reconstructed downloads by
optionally truncating the destination file, which affects data integrity
and could impact concurrent/partial-write callers if misused.
> 
> **Overview**
> Fixes a corruption case where full-file downloads could leave stale
trailing bytes when writing over an existing larger file by adding a
`truncate_file` flag to `FileReconstructor::reconstruct_to_file` and
wiring it to `OpenOptions::truncate()`.
> 
> Updates full-file download flow
(`FileDownloadSession::download_file_with_id`) to pass
`truncate_file=true`, while keeping benchmarks/tests and
range/concurrent write paths passing `false` to preserve existing
behavior for partial/concurrent writes.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
ed33dab9a1. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: di <di@huggingface.co>
2026-03-30 17:57:23 +02:00
Di Xiao
15011cb230 XetSession uses direct token refresh route instead of a callback (#751)
This PR makes two significant, breaking API redesigns:
1. Auth tokens move from session-level (shared by all operations) to
per-operation level (per `UploadCommit`, `FileDownloadGroup`, and
`DownloadStreamGroup`). This enables uploads and downloads from the same
session to carry different access-level tokens — a sensible design for
HF's write-vs-read token split.
2. Instead of letting users provide a callback to refresh tokens, the
new API lets users provide a token refresh URL and access credentials
in an HTTP header map.

### Why
1. CAS JWTs are short-lived, but `XetSession` is intended to be held for
a long time, so it makes more sense to configure CAS auth at the
operation level (`UploadCommit`, `FileDownloadGroup`, or
`DownloadStreamGroup`), where it is discarded once the operation is done.
2. For different access levels (write vs. read) and different operation
targets (repo and commit), both the CAS JWT and the token refresh URL
differ. Each `UploadCommit`, `FileDownloadGroup`, and
`DownloadStreamGroup` therefore also functions as a single auth group.
3. Providing a URL is considered easier than writing a callback, and is
safer when crossing the Python (GIL) / Rust boundary.

Examples:
```rust
// Upload token (write access)
let mut upload_headers = HeaderMap::new();
upload_headers.insert("Authorization", "Bearer hub-write-token".parse().unwrap());
let commit = session
    .new_upload_commit()?
    .with_token_info("CAS_WRITE_JWT", 900)
    .with_token_refresh_url("https://huggingface.co/api/repos/token/write", upload_headers)
    .build_blocking()?;
```
```rust
// File download token (read access)
let mut dl_headers = HeaderMap::new();
dl_headers.insert("Authorization", "Bearer hub-read-token".parse().unwrap());
let group = session
    .new_file_download_group()?
    .with_token_info("CAS_READ_JWT", 900)
    .with_token_refresh_url("https://huggingface.co/api/repos/token/read", dl_headers)
    .build_blocking()?;
```

Secondary changes include:

- `DirectRefreshRouteTokenRefresher` consolidated into
`xet_client::cas_client::auth`.
- HTTP client module moved from `cas_client` to `xet_client::common` for
shared use between `xet_client::cas_client` and
`xet_client::hub_client`.
- New `DownloadStreamGroup` type (streaming downloads moved off
`XetSession`).
- Fix Session ID type regression: this was fixed once in
https://github.com/huggingface/xet-core/pull/738 but regressed again,
seems AI agents don't learn.
- HTTP client cache key now incorporates custom headers
2026-03-30 08:39:25 -07:00
Assaf Vayner
9c0cb6e4c8 Reduce workspace dependencies (batches 1-3) (#746)
## Summary

- **Remove unused dependencies**: warp (zero imports), paste (zero
invocations), tower-service (zero imports), and heed misplacement in
xet_core_structures
- **Move mockall to dev-dependencies** in xet_client by gating
`#[automock]` with `#[cfg_attr(test, automock)]`
- **Feature-gate simulation module** behind `simulation` cargo feature
in xet_client, making axum, heed, humantime, futures-util,
human-bandwidth, and tower-http optional
- **Replace duration-str with humantime** (~2 deps vs ~78 transitive
deps) across xet_runtime, xet_client simulation, and simulation crate

## Impact

| Metric | Before | After | Change |
|---|---|---|---|
| hf-xet production deps | 371 | 321 | **-50** |
| Workspace total | 575 | 569 | -6 |

## Test plan

- [x] `cargo check --workspace` passes
- [x] `cargo check -p hf-xet` passes (without simulation feature — key
validation)
- [x] `cargo test --workspace` — all tests pass (4 pre-existing auth
test failures in git_xet unrelated to this PR)
- [x] `cargo tree -p hf-xet -e normal --prefix none | sort -u | wc -l`
confirms 321 deps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk because it changes dependency graph and Cargo feature
gating (notably `xet-client` simulation modules and CI test features),
which can affect build/test behavior across targets despite minimal
runtime logic changes.
> 
> **Overview**
> Reduces workspace dependency surface by removing `duration-str`
(replaced with `humantime`) and trimming other transitive-heavy crates;
updates lockfiles accordingly across the workspace, `hf_xet`, and WASM
builds.
> 
> Introduces/propagates a `simulation` Cargo feature: `xet-client`’s
simulation server-related deps become optional and are only
compiled/exported when `feature = "simulation"` is enabled; `git_xet`
adds a `simulation` feature that forwards to dependent crates, and CI
now runs tests with `strict simulation git-xet-for-integration-test`.
> 
> Minor repo hygiene updates include ignoring `.claude/` in `.gitignore`
and wiring the `simulation` crate to depend on `xet-client` with
`features = ["simulation"]` (plus swapping its duration parsing helper
to `humantime`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
6abc194398. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:54:36 -07:00
Hoyt Koepke
c90f0a7bd9 Session API Polish; unify task handling/cancellation behavior. (#747)
Previously, upload and download paths each had their own ad-hoc state
tracking, cancellation, and runtime bridging logic. TaskRuntime
consolidates this into a single type that owns a CancellationToken tree,
tracks Running/Finished/Cancelled state with recursive propagation to
children, and provides bridge_async/bridge_sync wrappers that
automatically wire up tokio::select! cancellation. Session →
commit/group → per-file handles form a parent-child token tree, so
aborting a session cancels all descendant work.
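
The parent-child propagation can be sketched with the standard library alone. This is a deliberately simplified, hypothetical model (`CancelNode` is not a type from the PR, which uses a CancellationToken tree); it only shows that cancelling a node reaches its descendants and never its ancestors or siblings.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified cancellation node. Cancelling a node also
/// cancels every descendant created through `child()`.
struct CancelNode {
    cancelled: AtomicBool,
    children: Mutex<Vec<Arc<CancelNode>>>,
}

impl CancelNode {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            cancelled: AtomicBool::new(false),
            children: Mutex::new(Vec::new()),
        })
    }

    /// Create a child; a child of an already-cancelled parent starts cancelled.
    fn child(self: &Arc<Self>) -> Arc<Self> {
        let child = Self::new();
        let mut children = self.children.lock().unwrap();
        if self.is_cancelled() {
            child.cancelled.store(true, Ordering::SeqCst);
        }
        children.push(Arc::clone(&child));
        child
    }

    /// Mark this node and, recursively, all of its descendants as cancelled.
    fn cancel(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
        for c in self.children.lock().unwrap().iter() {
            c.cancel();
        }
    }

    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }
}

fn main() {
    let session = CancelNode::new();
    let commit = session.child();
    let file = commit.child();
    session.cancel(); // aborting the session reaches the per-file handle
    println!("file cancelled: {}", file.is_cancelled());
}
```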

The upload path gets new UploadFileHandle and UploadStreamHandle wrapper
types (replacing the old UploadTaskHandle), with inner/wrapper pattern
for cheap cloning. UploadCommit::commit() now returns a CommitReport
containing aggregate dedup metrics, progress, and per-file FileMetadata.
The download path mirrors this structure: FileDownloadGroup uses
TaskRuntime for state gating and owns bespoke DownloadTaskHandle
instances with per-task status and result access.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> High risk due to a breaking redesign of the public `xet_session` API
(new handle/report types and renamed methods) plus new
cancellation/state machinery that changes how uploads/downloads are
coordinated and terminated.
> 
> **Overview**
> **Redesigns `xet_pkg::xet_session` around a new hierarchical
`TaskRuntime`** (using `tokio-util` cancellation tokens) to unify state,
bridging, and cancellation across session → commit/group → per-file
handles.
> 
> **Replaces the old task-handle/result model** (`tasks.rs`,
`UploadResult`/`DownloadResult`, `TaskStatus`, group/session state
enums) with explicit handle/report types: `XetFileUpload`,
`XetStreamUpload`, `XetFileDownload`, `XetCommitReport`, and
`XetDownloadGroupReport`, and standardizes task state via
`XetTaskState`.
> 
> **Adjusts APIs and error semantics**: `commit()` now returns an
aggregate report (dedup metrics + progress + per-file metadata) and no
longer consumes `self`; progress methods become infallible
(`progress()`); cancellations/errors are consolidated
(`AlreadyCompleted`, `UserCancelled`, `KeyboardInterrupt`,
`TaskError`/`PreviousTaskError`) with updated Python exception mapping.
`xet_data` now returns per-file `DeduplicationMetrics` from upload tasks
and adds a zero-copy `SingleFileCleaner::add_data_from_bytes`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
153a3ebbbe. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-27 07:54:37 -07:00
Hoyt Koepke
ffee6a978c Move session runtime decisions to XetRuntime (#742)
Currently, session runtime routing is split between XetSession and
XetRuntime. This PR centralizes runtime routing in XetRuntime, moving
all wrapper structs there. Now, bridge_async / bridge_sync work
universally to bridge from both async and sync runtimes.

This PR also changes the default behavior: the default new() method now
auto-detects whether the process can run inside an existing tokio
runtime with valid features enabled or must create a new one. In
addition, with_tokio_handle() now errors out if the provided tokio
handle doesn't have the correct features.
2026-03-20 20:23:15 -07:00
Adrien
7b33764330 feat: make start_clean size parameter optional for streaming uploads (#732)
## Context

These changes support the hf-mount project, where FUSE streaming uploads
don't know the file size in advance.

## Summary

- Changes the `size` parameter of `FileUploadSession::start_clean()`
from `u64` to `Option<u64>`. Passing `None` signals that the final file
size is unknown (FUSE streaming uploads), which prevents `debug_assert`
panics when `completed_bytes` exceeds the initially declared
`total_bytes=0`.
- Propagates `Option<u64>` to the public API:
`UploadCommit::upload_file()` and `upload_file_blocking()` now take
`file_size: Option<u64>`.
- All existing callers are updated to wrap the size argument in
`Some(...)`.
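
A minimal sketch of why an optional size helps progress tracking, assuming a hypothetical `UploadProgress` type (not the actual tracker in `xet_data`): with `None`, there is no declared total that the completed byte count could exceed.

```rust
/// Hypothetical progress record; `total_bytes` is `None` while the final
/// size is unknown (for example, a streaming upload).
struct UploadProgress {
    total_bytes: Option<u64>,
    completed_bytes: u64,
}

impl UploadProgress {
    fn new(size: Option<u64>) -> Self {
        Self { total_bytes: size, completed_bytes: 0 }
    }

    fn add_completed(&mut self, n: u64) {
        self.completed_bytes += n;
        // Only a declared size can be exceeded; an unknown size never trips this.
        if let Some(total) = self.total_bytes {
            debug_assert!(self.completed_bytes <= total, "wrote past declared size");
        }
    }

    /// Completed fraction, or `None` when the total is not known yet.
    fn fraction(&self) -> Option<f64> {
        match self.total_bytes {
            Some(0) => Some(1.0),
            Some(total) => Some(self.completed_bytes as f64 / total as f64),
            None => None,
        }
    }
}

fn main() {
    let mut streaming = UploadProgress::new(None);
    streaming.add_completed(4096);
    println!("{} bytes, fraction {:?}", streaming.completed_bytes, streaming.fraction());
}
```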

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Public upload APIs now accept an optional file size, which is a
breaking signature change and could affect downstream callers and
progress tracking behavior when size is `None`. Implementation changes
are small but touch core upload session and commit interfaces.
> 
> **Overview**
> Enables streaming uploads where the final file size is not known up
front by changing `FileUploadSession::start_clean` to take `Option<u64>`
and treating `None` as an unknown size for progress/completion tracking.
> 
> Propagates this optional-size API through `UploadCommit::upload_file`
/ `upload_file_blocking` and updates all internal callers, examples, and
tests to pass `Some(size)` when the size is known, along with doc
updates reflecting the new `None` semantics.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
8b41e11e24. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-20 23:33:18 +01:00
Hoyt Koepke
332a456e1d Add ordered and unordered download streaming to session interface (#729)
This PR adds ordered and unordered download streams on XetSession,
including optional byte-range support and per-stream progress reporting.
Blocking and async variants are supported.

On the reconstruction side, this introduces UnorderedWriter and
UnorderedDownloadStream in xet_data, and extends the FileDownloadSession
stream APIs to take optional source ranges. Ordered and unordered
streams now share the same session-facing access pattern for async and
blocking callers.

This PR also renames DownloadGroup to FileDownloadGroup; the stream data
uses the per-session memory pool but doesn't count towards the maximum
number of concurrent downloads in progress.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core file reconstruction/writer plumbing (including
`DataWriter` ownership and new unordered writer/stream paths) and
changes public session APIs, so regressions could impact download
correctness, cancellation, or progress reporting.
> 
> **Overview**
> Adds first-class **ordered and unordered streaming download APIs** to
`xet_pkg::xet_session`, including async and blocking variants, optional
source-relative byte ranges, and per-stream progress via new
`XetDownloadStream` / `XetUnorderedDownloadStream` wrappers.
> 
> On the data layer, introduces an **unordered reconstruction path**
(`UnorderedWriter` + `UnorderedDownloadStream`) and refactors streaming
to spawn reconstruction tasks immediately but gate execution behind
`start()`; stream abort callbacks are now registered per-stream and
automatically unregistered on drop to avoid callback accumulation.
> 
> Updates the reconstruction writer contract by making
`DataWriter::finish` consume the writer (and shifting `DataWriter` to
`&mut self` usage), adjusts `SequentialWriter` accordingly, and adds
Criterion-based reconstruction benchmarks plus extensive
unordered-stream tests. Also renames session `DownloadGroup` to
`FileDownloadGroup` (and constructors) and updates call sites/examples.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e02890aa4b. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-20 14:40:18 -07:00
Hoyt Koepke
602d7679f6 Add cargo smoke-test for rapid full-workspace testing. (#741)
Currently, the full test validation is rather heavy, but running local
tests often fails to catch many issues due to the tests that probe the
full stack. This PR adds a smoke-test path that runs a meaningful subset
of the tests across the workspace that covers most errors. This runs in
about 1/8 of the time as cargo test, so it's useful to use in speeding
up AI model iteration.

In addition, a few intermittent failures were also fixed. 

There should be no runtime functionality change.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk since changes are limited to Cargo configuration and test
gating; no production code paths are modified. Main risk is accidentally
skipping too much coverage or misconfiguring feature flags in CI/local
workflows.
> 
> **Overview**
> Adds a new `cargo smoke-test` workflow by introducing a `smoke-test`
Cargo profile and a `cargo` alias that runs `test` with per-crate
`smoke-test` features enabled.
> 
> Defines `smoke-test` features across multiple crates and uses
`#[cfg_attr(feature = "smoke-test", ignore)]` / `#[cfg(... not(feature =
"smoke-test"))]` to skip long-running, concurrency-heavy, or full-stack
integration tests during smoke runs.
> 
> Tightens test robustness by making `SafeFileCreator` permission
assertions umask-tolerant (require owner read/write rather than an exact
`0o644`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
5d53009652. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
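
The umask-tolerant permission check mentioned in the overview amounts to a bit-mask test rather than an exact comparison. A minimal sketch, with a hypothetical function name:

```rust
/// Hypothetical check: true when the mode grants the owner both read and
/// write access, whatever the umask left for group and others.
fn owner_can_read_write(mode: u32) -> bool {
    (mode & 0o600) == 0o600
}

fn main() {
    // 0o644 and 0o600 both pass; an exact `== 0o644` check would reject 0o600.
    println!("{} {}", owner_can_read_write(0o644), owner_can_read_write(0o600));
}
```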

---------

Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
2026-03-20 13:32:38 -07:00
Hoyt Koepke
b137daf88e Implement optional range specification for downloads. (#735)
Currently, download metadata assumes file_size is always known, which
forces callers to provide a size even when only a hash is available.
This PR changes XetFileInfo.file_size to Option<u64> -- with
serialization compatibility -- and propagates that through so hash-only
downloads are a supported path while known-size flows continue to work
as before.

On the download path, this updates the reconstructor setup and range
handling so progress can start without a final total and then finalize
when EOF is discovered. For known-size full-file downloads, it now
validates the reconstructed byte count and returns
DataError::SizeMismatch when expected and actual size differ. In
addition, open-ended ranges (e.g. `start..` and `..end`) are now
supported through all APIs.
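
One way to normalize such ranges is through `std::ops::RangeBounds`. The sketch below is illustrative only: `resolve_range` and `check_size` are hypothetical helpers, and a `String` error stands in for `DataError::SizeMismatch`.

```rust
use std::ops::{Bound, RangeBounds};

/// Hypothetical helper: resolve any `RangeBounds<u64>` to a half-open
/// `(start, end)` pair, where `end == None` means "until EOF".
fn resolve_range(range: impl RangeBounds<u64>) -> (u64, Option<u64>) {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s.saturating_add(1),
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => Some(e.saturating_add(1)),
        Bound::Excluded(&e) => Some(e),
        Bound::Unbounded => None,
    };
    (start, end)
}

/// Hypothetical size validation: only a known expected size can mismatch.
fn check_size(expected: Option<u64>, actual: u64) -> Result<(), String> {
    match expected {
        Some(e) if e != actual => Err(format!("size mismatch: expected {e}, got {actual}")),
        _ => Ok(()),
    }
}

fn main() {
    println!("{:?}", resolve_range(100..));
    println!("{:?}", check_size(Some(8), 7));
}
```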

This also adds coverage for range-based writer/stream downloads and
unknown-size round trips in session-level tests.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk because it changes a widely used API type
(`XetFileInfo.file_size`) and adjusts download/reconstruction behavior,
which can affect progress reporting and error handling across Rust and
Python bindings.
> 
> **Overview**
> Enables *hash-only downloads* by changing `XetFileInfo.file_size` from
`u64` to `Option<u64>` (serde backward-compatible) and adding
`XetFileInfo::new_hash_only`, then propagating the optional size through
`xet_pkg` and `hf_xet` (Python `PyXetDownloadInfo.file_size` and
`PyPointerFile.filesize`).
> 
> Extends download APIs to accept *open-ended ranges* via
`RangeBounds<u64>` (e.g. `start..`, `..end`, `..`) and updates
reconstructor/progress behavior to handle unknown totals, while adding
`DataError::SizeMismatch` and validating reconstructed byte counts for
full downloads and bounded ranges.
> 
> Adds substantial new unit/integration test coverage for range
variants, unknown-size round trips, and size-mismatch errors, plus minor
CLI output adjustments to print unknown sizes.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
4d25896c51. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-20 00:14:52 -07:00
Hoyt Koepke
749c28b086 Error unification and cleanup (#737)
This PR performs some housecleaning and removes some technical debt
around using different error types, unifying them with the python
interface.

- Our client code tended to do a lot with anyhow errors as an artifact
of first using them before switching to thiserror. This PR cleans these
up in favor of using ClientError or other named error types directly.
- It also removes all the aliases to the old error type names present in
the packages before the refactoring, now settling into ClientError,
FormatError, DataError, and RuntimeError, with XetError being the error
type exposed publicly.
- Also, currently, xet_session exposes SessionError as an alias of
XetError, which adds an extra public type name without adding behavior.
This PR removes that alias and standardizes the public API/docs onto
XetError directly.
- It also tightens Python-facing error behavior and moves the python
handling to the XetError class directly, hidden behind a python feature
flag. Using these types, hf_xet now registers XetObjectNotFoundError and
XetAuthenticationError exception classes for authentication and the
not-found cases. These inherit from the current exception classes, so
all behavior is preserved.
- In addition, the From for PyErr mapping routes
timeout/network/auth/not-found categories to more appropriate Python
exception types than simply RuntimeError.

This is primarily an API-surface cleanup plus error-classification
alignment.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> API-breaking error-surface changes (removal of legacy alias modules
and signature changes like `CredentialHelper::fill_credential`) may
require downstream code updates, especially where errors are
matched/converted. Runtime behavior should be mostly unchanged, but
error mapping/propagation paths (including Python exceptions) are widely
touched across crates.
> 
> **Overview**
> This PR **unifies error types across the workspace** by removing
legacy re-export/alias modules (e.g. `CasClientError`, `CasTypesError`,
`DataProcessingError`, `SessionError`) and updating call sites to use
canonical errors like `xet_client::ClientError`,
`xet_core_structures::CoreError`, and `xet_data::DataError` directly.
> 
> It updates CAS client code to **standardize on
`crate::error::Result`/`ClientError`**, including deleting
`cas_client/error.rs`, adjusting error conversions in retry/http
middleware paths, and updating simulation/local-server code to map
`ClientError` to HTTP responses.
> 
> Python bindings (`hf_xet`) now **convert failures via `XetError`**
(with `xet_pkg` built with `python` support), register custom exceptions
on module init, and refine argument-validation errors to `PyValueError`
while routing network/timeout/auth/not-found to more appropriate Python
exception classes.
> 
> Misc cleanup: `git_xet` now depends on `xet-data`, simulation binaries
switch to `anyhow::Result`/`bail!`, and lockfiles are updated for
new/updated dependencies (notably `pyo3`/`inventory`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
f3d056a909. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-19 16:34:28 -07:00
Di Xiao
fb83178d28 Fix session id regression (#738)
The session id was changed from `ulid` to `UniqueID` (a self-incrementing
u64 in memory) in a previous PR, but that is not correct.
The session id is used in CAS server logs and traces and in CDN logs to
identify a related group of activity (for debugging and similar
purposes), so it needs to be globally unique (hence `ulid`) rather than
locally unique.
2026-03-19 13:31:40 -07:00
Di Xiao
e25ee85c14 Fix a compilation failure (#740)
Fix a compilation failure on a new function introduced by
https://github.com/huggingface/xet-core/pull/726 caught by the newly
introduced CI step `cargo bench --no-run`.
2026-03-19 12:03:33 -07:00
Hoyt Koepke
506fc28291 Simplify progress tracking + Unify Task ID tracking + Legacy Interface (#726)
Currently, progress tracking is split between callback-driven and
snapshot-driven paths, making session and task wiring across xet_data,
xet_pkg, hf_xet, and git_xet harder to keep consistent. This PR moves
upload/download progress to a polling snapshot model backed by atomics.
It also switches task identifiers to a UniqueID common with the progress
tracking throughout the session APIs.
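
A polling snapshot model backed by atomics can be sketched as follows. The type and field names here are hypothetical and far simpler than the real tracker; the point is that workers only do atomic adds, and any reader can take a snapshot at any time without a callback.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical shared counters updated by worker tasks.
#[derive(Default)]
struct ProgressCounters {
    total_bytes: AtomicU64,
    completed_bytes: AtomicU64,
}

/// Plain-value copy handed to whoever polls.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ProgressSnapshot {
    total_bytes: u64,
    completed_bytes: u64,
}

impl ProgressCounters {
    fn record(&self, n: u64) {
        self.completed_bytes.fetch_add(n, Ordering::Relaxed);
    }

    fn snapshot(&self) -> ProgressSnapshot {
        ProgressSnapshot {
            total_bytes: self.total_bytes.load(Ordering::Relaxed),
            completed_bytes: self.completed_bytes.load(Ordering::Relaxed),
        }
    }
}

fn main() {
    let counters = Arc::new(ProgressCounters::default());
    counters.total_bytes.store(4_000, Ordering::Relaxed);
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&counters);
            thread::spawn(move || {
                for _ in 0..10 {
                    c.record(100);
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    println!("{:?}", counters.snapshot());
}
```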

This PR also updates the rate estimation to use the lighter weight
exponentially weighted moving averages model, so this can be done at a
low level.
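
For reference, an exponentially weighted moving average keeps a single running value, which is what makes it cheap enough for low-level use. A minimal sketch, assuming a hypothetical `EwmaRate` type and a caller-chosen smoothing factor `alpha` (the PR text does not state the actual parameters):

```rust
/// Hypothetical EWMA estimator of a transfer rate in bytes per second.
struct EwmaRate {
    alpha: f64,        // weight of the newest sample, in (0, 1]
    rate: Option<f64>, // None until the first sample arrives
}

impl EwmaRate {
    fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        Self { alpha, rate: None }
    }

    /// Fold in one observation and return the updated estimate:
    /// rate = alpha * sample + (1 - alpha) * previous_rate.
    fn update(&mut self, bytes: u64, elapsed_secs: f64) -> f64 {
        let sample = bytes as f64 / elapsed_secs;
        let next = match self.rate {
            None => sample,
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        };
        self.rate = Some(next);
        next
    }
}

fn main() {
    let mut rate = EwmaRate::new(0.5);
    rate.update(100, 1.0);
    println!("{}", rate.update(300, 1.0));
}
```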

To preserve compatibility for existing callback consumers,
callback-oriented upload/download progress tracking APIs are moved under
xet_pkg::legacy and bridged from polling snapshots via callback-based
updaters. hf_xet and git_xet are updated to use that legacy bridge
layer, so current integrations keep working until everything is fully
switched over to the XetSession method.
2026-03-18 18:07:43 -07:00
Hoyt Koepke
69c714c01d Update config groups to handle more of the data management values. (#702)
This PR moves some config values that were part of the data
configuration into XetConfig, specifically the compression_policy,
staging_subdir, session_dir_name, and global_dedup_query_enabled. This
also consolidates the remaining values into a single struct with
endpoint and authentication information.
2026-03-16 16:06:46 -07:00
Hoyt Koepke
ed182125fa Add optional Sha256 hash propagation through XetFileInfo object. (#718)
Currently, the SHA-256 hash of uploaded file content is computed
internally during the upload pipeline but not surfaced to callers.
Downstream consumers — e.g. OpenDAL's Hugging Face backend — need the
SHA-256 to commit files to the Hub API.

This PR adds an optional sha256 field to XetFileInfo, the session-layer
FileMetadata, and the Python-exposed PyXetUploadInfo. The field is
populated from the already-computed hash when Sha256Policy::Compute or
Sha256Policy::Provided is used, and left None for downloads and when
Sha256Policy::Skip is used. Serde attributes (default,
skip_serializing_if) ensure backward-compatible serialisation — existing
serialised data without the field deserialises cleanly.

Needed for the functionality in
https://github.com/huggingface/xet-core/pull/642.
2026-03-16 16:05:49 -07:00
Hoyt Koepke
9caf7fcc44 V2 reconstruction with client-side optional single range splitting (#703)
This PR introduces V2 multirange URL fetching for xorbs, but optionally
splits the multirange requests into multiple single-range requests that
can be executed in parallel. This allows the reconstruction process to
generate full multirange presigned URLs, but the client effectively
performs the retrieval stage as a sequence of parallel single-range
queries.

The config variable `client.enable_multirange_fetching` controls this
behavior; by default it is set to false due to the current observed
slowness of fetching multiranged URLs.
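
The splitting strategy can be sketched with scoped threads and an in-memory blob standing in for the remote xorb. Everything here is hypothetical (function names, the `enable_multirange` parameter, the absence of real HTTP); it only shows one request per range running in parallel, with results reassembled in request order.

```rust
use std::ops::Range;
use std::thread;

/// Stand-in for a single HTTP range request against `blob`.
fn fetch_single_range(blob: &[u8], range: Range<usize>) -> Vec<u8> {
    blob[range].to_vec()
}

/// Fetch several ranges. When `enable_multirange` is false, issue one
/// request per range in parallel and keep the results in request order.
fn fetch_ranges(blob: &[u8], ranges: &[Range<usize>], enable_multirange: bool) -> Vec<Vec<u8>> {
    if enable_multirange {
        // A real client would send one combined request; simulated here.
        return ranges.iter().cloned().map(|r| fetch_single_range(blob, r)).collect();
    }
    thread::scope(|s| {
        let handles: Vec<_> = ranges
            .iter()
            .cloned()
            .map(|r| s.spawn(move || fetch_single_range(blob, r)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let blob: Vec<u8> = (0u8..32).collect();
    let ranges = vec![0..4, 10..12, 30..32];
    println!("{:?}", fetch_ranges(&blob, &ranges, false));
}
```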

---------

Co-authored-by: Adrien <adrien@huggingface.co>
2026-03-16 14:10:50 -07:00
Adrien
820f2657c5 fix: bound file reconstruction range using file_size to prevent 416 errors (#716) 2026-03-16 07:07:03 +01:00
Brian Ronan
6232b42591 Xorb download URL debug logs (#714)
It's a bit annoying to try to ensure our CDN routing is correct. This PR
logs the URL domain for the first fetch term download to the debug logs.

Please don't hesitate to recommend alternative approaches.
2026-03-13 16:05:30 -07:00
Adrien
0fb930c8d0 feat: expose skip_sha256 parameter in Python upload API (#705)
## Summary

Add `skip_sha256` and `sha256s` parameters to `upload_bytes()` Python
binding for per-file SHA-256 policies:
- `skip_sha256: bool = False` - Skip SHA-256 computation entirely (sets
`Sha256Policy::Skip`)
- `sha256s: Optional[List[str]] = None` - Provide pre-computed SHA-256
hashes (companion to existing parameter on `upload_files()`)
- These parameters are mutually exclusive
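
The conversion at the binding boundary might look roughly like the sketch below. It is an assumption-heavy illustration: the enum mirrors the variant names mentioned in this log, but `build_policies`, `parse_sha256_hex`, the 32-byte payload, and the rule that `sha256s` must match the file count are all hypothetical.

```rust
/// Hypothetical mirror of the per-file policy described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sha256Policy {
    Compute,
    Skip,
    Provided([u8; 32]),
}

/// Parse a 64-character hex string into 32 bytes.
fn parse_sha256_hex(hex: &str) -> Result<[u8; 32], String> {
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("invalid SHA-256 hex string: {hex:?}"));
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).map_err(|e| e.to_string())?;
    }
    Ok(out)
}

/// Turn the two user-facing parameters into one policy per file.
fn build_policies(
    n_files: usize,
    skip_sha256: bool,
    sha256s: Option<&[&str]>,
) -> Result<Vec<Sha256Policy>, String> {
    match (skip_sha256, sha256s) {
        (true, Some(_)) => Err("skip_sha256 and sha256s are mutually exclusive".to_string()),
        (true, None) => Ok(vec![Sha256Policy::Skip; n_files]),
        (false, None) => Ok(vec![Sha256Policy::Compute; n_files]),
        (false, Some(hashes)) => {
            if hashes.len() != n_files {
                return Err("sha256s must have one entry per file".to_string());
            }
            hashes
                .iter()
                .map(|h| parse_sha256_hex(h).map(Sha256Policy::Provided))
                .collect()
        }
    }
}

fn main() {
    println!("{:?}", build_policies(2, true, None));
    let hash = "00".repeat(32);
    println!("{:?}", build_policies(1, true, Some(&[hash.as_str()][..])));
}
```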

## Changes

**Python binding changes:**
- Add `skip_sha256` + `sha256s` params to `upload_bytes()` /
`upload_files()`
- All policy conversion happens at Python boundary

**Internal refactoring:**
- Add `Clone`/`Copy` derives + `from_skip()`/`from_hex()` helpers to
`Sha256Policy`
- Update `upload_bytes_async`, `upload_async`, `clean_file` to use
`Vec<Sha256Policy>`
- Update all internal callers across `git_xet`, `xet_pkg`, migration
tool, tests

## Motivation

`huggingface_hub` already knows whether SHA-256 is required. This change
enables skipping expensive computation when unnecessary, or passing
pre-computed hashes for bulk operations.

Companion to #678.

---------

Co-authored-by: Wauplin <lucainp@gmail.com>
2026-03-12 18:17:12 +01:00
Hoyt Koepke
45d38a13a9 Code reorganization towards release of xet cargo package (#693)
This PR is a massive rearrangement of the code base into 5 packages
intended for release on cargo. The directories and corresponding
packages are:

1. xet_runtime/ — compiles into the xet-runtime package. Contains the
runtime, config, and logging management.
2. xet_core_structures/ — compiles into the xet-core-structures package.
Contains core data structures for hashing, shards, and xorbs as well as
internal data structures that depend on these.
3. xet_client/ — compiles into the xet-client package, contains client
code for remotely connecting to the Hugging Face servers.
4. xet_data/ — compiles into the xet-data package, contains the data
processing pipeline: chunking/deduplication, file reconstruction,
clean/smudge operations, and progress tracking.
5. xet_pkg/ — compiles into the hf-xet package, provides the top-level
session-based API for file upload and download with user-facing error
categorization. This is the primary package downstream dependencies
would use. This also contains a single summary error type, XetError,
that translates cleanly into python error types.

In addition, the other tools are: 

- git_xet/ — the git_xet CLI binary crate (location preserved). 
- hf_xet/ -- the hf_xet python package (location preserved).
- simulation/ — the simulation crate for upload scenario benchmarking.
- wasm/ -- the wasm objects. 

The full description — and information for an AI agent to use to update
downstream dependencies — is at
api_changes/update_260309_package_restructure.md.

Summary of moves:

- xet_runtime: became xet_runtime::core inside xet_runtime/.
- utils: became xet_runtime::utils inside xet_runtime/.
- xet_config: became xet_runtime::config inside xet_runtime/.
- xet_logging: became xet_runtime::logging inside xet_runtime/.
- error_printer: became xet_runtime::error_printer inside xet_runtime/.
- file_utils: became xet_runtime::file_utils inside xet_runtime/.
- merklehash: became xet_core_structures::merklehash inside
xet_core_structures/.
- mdb_shard: became xet_core_structures::metadata_shard inside
xet_core_structures/.
- xorb_object: became xet_core_structures::xorb_object inside
xet_core_structures/.
- cas_client: became xet_client::cas_client inside xet_client/.
- hub_client: became xet_client::hub_client inside xet_client/.
- cas_types: became xet_client::cas_types inside xet_client/.
- chunk_cache: became xet_client::chunk_cache inside xet_client/.
- data: became xet_data::processing inside xet_data/.
- deduplication: became xet_data::deduplication inside xet_data/.
- file_reconstruction: became xet_data::file_reconstruction inside
xet_data/.
- progress_tracking: became xet_data::progress_tracking inside
xet_data/.
- xet_session: became xet::xet_session inside xet_pkg/.

- Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level
into wasm/; internal imports updated, public APIs unchanged.
2026-03-11 12:02:38 -07:00