mirror of https://github.com/huggingface/xet-core.git synced 2026-06-04 13:30:29 +08:00
Files
Adrien 40f9530753 feat: range aware file write (#717)
## Summary

APIs for range-aware file writes: instead of re-uploading an entire file
when only part of it changed, compose a new CAS file from stable
segments + re-chunked dirty windows. Supports resize edits (insert /
delete / arbitrary replace) in addition to in-place rewrites.

### API: `upload_ranges`

```rust
pub async fn upload_ranges(
    config: Arc<TranslatorConfig>,
    cas_client: Arc<dyn Client>,
    original_hash: MerkleHash,
    original_size: u64,
    dirty_inputs: Vec<DirtyInput>,
) -> Result<XetFileInfo>
```

```rust
/// A single edit applied to the original file: replace `original_range` with
/// `new_length` bytes from `reader`. Edits are expressed in original-file coordinates.
pub struct DirtyInput {
    pub original_range: Range<u64>,
    pub reader: Pin<Box<dyn AsyncRead + Send>>,
    pub new_length: u64,
}
```

The output file size is **derived** from the inputs (no `total_size`
parameter): `original_size - removed + added`.
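
The size rule above can be sketched as a small helper. This is a minimal illustration, not the crate's code: `EditShape` and `derived_size` are hypothetical names, and the struct keeps only the two `DirtyInput` fields that affect size.

```rust
use std::ops::Range;

/// Hypothetical, reader-free view of an edit: only the fields that affect size.
struct EditShape {
    original_range: Range<u64>,
    new_length: u64,
}

/// Returns `original_size - removed + added`, or `None` if an edit is
/// reversed, reaches past `original_size`, or the arithmetic overflows.
fn derived_size(original_size: u64, edits: &[EditShape]) -> Option<u64> {
    let mut removed: u64 = 0;
    let mut added: u64 = 0;
    for e in edits {
        if e.original_range.end > original_size {
            return None;
        }
        // checked_sub fails for a reversed range (start > end).
        let len = e.original_range.end.checked_sub(e.original_range.start)?;
        removed = removed.checked_add(len)?;
        added = added.checked_add(e.new_length)?;
    }
    // checked_sub also guards against overlapping edits removing too much.
    original_size.checked_sub(removed)?.checked_add(added)
}
```

For a 3-byte original, replacing `0..1` with 3 bytes gives 5, inserting 3 bytes at `0..0` gives 6, and deleting `0..1` gives 2.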

### Edit shapes (all expressible with the same struct)

| Operation | `original_range` | `new_length` |
|---|---|---|
| In-place edit | `a..b` | `b - a` |
| Resize replace | `a..b` | any |
| Pure insert | `p..p` | `> 0` |
| Pure delete | `a..b` | `0` |
| Append | `original_size..original_size` | `> 0` |
| Truncate to N | `N..original_size` | `0` |
| No-op | empty `dirty_inputs` | — |

Motivating example:

```text
abc + upload_ranges([0..1), "foo", 3) = foobc
abc + upload_ranges([0..0), "foo", 3) = fooabc
abc + upload_ranges([0..1), "",    0) = bc
```
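
The intended semantics of these three examples can be pinned down with an in-memory reference model. This is only a sketch for reasoning about expected output: `apply_edits` is a hypothetical helper that works on byte slices, whereas the real API streams from readers and composes CAS segments.

```rust
use std::ops::Range;

/// Applies sorted, non-overlapping edits (in original-file coordinates)
/// to an in-memory copy of the original. Panics on unsorted or
/// out-of-bounds ranges, since slicing would go backwards or past the end.
fn apply_edits(original: &[u8], edits: &[(Range<usize>, &[u8])]) -> Vec<u8> {
    let mut out = Vec::with_capacity(original.len());
    let mut cursor = 0;
    for (range, replacement) in edits {
        out.extend_from_slice(&original[cursor..range.start]); // untouched bytes
        out.extend_from_slice(replacement); // caller-provided bytes
        cursor = range.end; // skip the replaced bytes
    }
    out.extend_from_slice(&original[cursor..]); // untouched tail
    out
}
```

A model like this is handy as a test oracle: the composed file should contain the same bytes as `apply_edits` run on the original contents.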

**Per-range `AsyncRead` instead of `ReadSeek` over the staging file.**
The earlier prototype took `dirty_ranges: &[(u64, u64)]` plus
`dirty_source: &mut dyn ReadSeek`. That had a subtle bug: for truncation
we silently extended the dirty set with a boundary chunk and read those
bytes from the staging file. If the file was never opened for write, the
staging file contains zeros at those positions (the real bytes are in
CAS), so the truncation boundary chunk was silently corrupted. Pairing
each edit with its own reader makes that structurally impossible: any
byte not provided by the caller is fetched from CAS.

<details>
<summary>How it works</summary>

### High level

```
                           upload_ranges
   +----------------------+   |   +----------------------+
   |  original file (CAS) |---+-->|  composed file (CAS) |
   +----------------------+       +----------------------+
   only the dirty windows are re-uploaded; everything else
   is reused as whole CAS segments.
```

### Step 1 — coalesce + snap edits to segment boundaries

Edits are expressed in user coordinates (byte ranges). We snap each edit's
`original_range` to the **enclosing CAS segments** so composition can
swap whole segments instead of truncating one mid-chunk. Adjacent /
overlapping snapped ranges are then coalesced.

Pure inserts (`start == end`) snap to the segment that owns `start`; an
insert at `original_size` snaps to the last segment.
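
The snapping and coalescing rules can be sketched as follows. This is an illustration under assumptions: segment layout is modelled as a hypothetical `ends` slice of exclusive segment end offsets, and `snap` / `snap_and_coalesce` are names invented here, not functions from the crate.

```rust
use std::ops::Range;

/// `ends[i]` is the exclusive end offset of segment i; segment i starts at
/// `ends[i - 1]` (or 0). `ends` must be non-empty and strictly increasing.
fn snap(edit: &Range<u64>, ends: &[u64]) -> Range<u64> {
    let last = ends.len() - 1;
    // Index of the segment owning byte `pos`; `pos == original_size`
    // clamps to the last segment.
    let owner = |pos: u64| ends.partition_point(|&e| e <= pos).min(last);
    let first = owner(edit.start);
    // A pure insert (empty range) stays within the segment owning `start`.
    let final_seg = if edit.is_empty() { first } else { owner(edit.end - 1) };
    let start = if first == 0 { 0 } else { ends[first - 1] };
    start..ends[final_seg]
}

/// Snaps sorted edits, then merges snapped ranges that touch or overlap.
fn snap_and_coalesce(edits: &[Range<u64>], ends: &[u64]) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for edit in edits {
        let snapped = snap(edit, ends);
        if let Some(prev) = out.last_mut() {
            if snapped.start <= prev.end {
                prev.end = prev.end.max(snapped.end);
                continue;
            }
        }
        out.push(snapped);
    }
    out
}
```

With segments ending at 10, 20 and 30, an edit at `12..13` snaps to `10..20`, an insert at `20..20` snaps to `20..30`, and an insert at `30..30` (the file end) also snaps to `20..30`.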

### Step 2 — server returns windows + gap subtrees

Single CAS call: `GET /v2/file-chunk-hashes/{file_id}` with the
segment-aligned ranges in an `X-Range-Dirty: bytes=A-B,C-D` header.
Response shape (xetcas#987):

```rust
struct FileChunkHashesResponse {
    windows:      Vec<ChunkWindow>,         // one per dirty range
    hash_ranges:  Vec<Option<MerkleHashSubtree>>, // N+1 entries: [gap0, gap1, ..., gapN]
}
```

`windows[i].chunks` carries the chunk hashes the server actually owns
for that window (we re-upload these bytes). `hash_ranges[i]` is the
**MerkleHashSubtree** for the i-th unmodified gap, or `None` when there
is no gap there. This is the key to composing the final file hash
without touching unmodified bytes.
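
As an illustration of the request side, the header value could be built like this. The sketch assumes the `A-B` pairs use inclusive end offsets, as HTTP `Range` headers do; the description above does not state this, so treat it as an assumption. `range_dirty_header` is a hypothetical helper name.

```rust
use std::ops::Range;

/// Formats half-open, segment-aligned ranges as `bytes=A-B,C-D`.
/// ASSUMPTION: B is inclusive (HTTP Range style), so `0..10` becomes `0-9`.
/// Empty ranges are skipped because they have no inclusive representation.
fn range_dirty_header(ranges: &[Range<u64>]) -> String {
    let parts: Vec<String> = ranges
        .iter()
        .filter(|r| r.end > r.start)
        .map(|r| format!("{}-{}", r.start, r.end - 1))
        .collect();
    format!("bytes={}", parts.join(","))
}
```

Since snapped ranges always cover at least one whole segment, the empty-range filter is only a safety net here.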

### Step 3 — for each window, stream `[CAS prefix | edits | CAS suffix]` through a fresh cleaner

```
window = [w_start ............................................. w_end]
edits in this window:        [edit_a]    [edit_b]
                                ^           ^
streamed input to the cleaner:
  CAS bytes [w_start, edit_a.start)
  reader bytes for edit_a (new_length bytes)
  CAS bytes [edit_a.end, edit_b.start)
  reader bytes for edit_b
  CAS bytes [edit_b.end, w_end)
```

Pure inserts contribute zero original bytes but still emit `new_length`
reader bytes. Pure deletes contribute zero reader bytes. The cleaner
produces a new `MDBFileInfo` per window and a `ChunkHashList`.
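
The interleaving of CAS bytes and reader bytes within one window can be sketched as a planning step. `Piece` and `plan_window` are hypothetical names; the sketch only computes which source feeds the cleaner in which order and performs no I/O.

```rust
use std::ops::Range;

/// One contiguous piece of cleaner input for a window.
#[derive(Debug, PartialEq)]
enum Piece {
    /// Original bytes fetched from CAS (original-file coordinates).
    Cas(Range<u64>),
    /// `len` bytes pulled from the reader of edit number `edit`.
    Reader { edit: usize, len: u64 },
}

/// `edits` are (original_range, new_length) pairs, sorted, non-overlapping,
/// and fully contained in `window`.
fn plan_window(window: Range<u64>, edits: &[(Range<u64>, u64)]) -> Vec<Piece> {
    let mut plan = Vec::new();
    let mut cursor = window.start;
    for (i, (range, new_length)) in edits.iter().enumerate() {
        if range.start > cursor {
            plan.push(Piece::Cas(cursor..range.start)); // stable bytes before the edit
        }
        if *new_length > 0 {
            plan.push(Piece::Reader { edit: i, len: *new_length }); // pure deletes emit nothing
        }
        cursor = range.end; // pure inserts leave the cursor where it was
    }
    if cursor < window.end {
        plan.push(Piece::Cas(cursor..window.end)); // stable suffix
    }
    plan
}
```

For a window `0..100` with a replace at `10..20`, an insert at `30..30` and a delete at `40..50`, the plan alternates CAS and reader pieces and ends with the CAS suffix `50..100`.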

### Step 4 — compose the file hash via `MerkleHashSubtree::merge`

```text
merge_seq = [gap0, w0, gap1, w1, ..., w(N-1), gapN]   // skip None gaps

merged          = MerkleHashSubtree::merge(merge_seq)
aggregated_hash = merged.final_hash()
combined_hash   = aggregated_hash.hmac(zero)      // matches cleaner's file_hash
```

Special case: if the derived output size is 0 (e.g. truncate to empty), the
result is `MerkleHash::default()` *without* HMAC, mirroring `file_hash([])`.
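
The ordering of the merge sequence can be sketched independently of the hashing itself. The helper below is hypothetical and generic over the element type; it only shows how N windows interleave with N+1 optional gaps, and does not reproduce `MerkleHashSubtree::merge`.

```rust
/// Interleaves gaps and windows as gap0, w0, gap1, ..., w(N-1), gapN,
/// dropping `None` gaps. Returns `None` if the lengths do not satisfy
/// `gaps.len() == windows.len() + 1`.
fn merge_sequence<T: Clone>(gaps: &[Option<T>], windows: &[T]) -> Option<Vec<T>> {
    if gaps.len() != windows.len() + 1 {
        return None;
    }
    let mut seq = Vec::with_capacity(gaps.len() + windows.len());
    for (i, gap) in gaps.iter().enumerate() {
        if let Some(g) = gap {
            seq.push(g.clone());
        }
        // The last gap has no window after it.
        if let Some(w) = windows.get(i) {
            seq.push(w.clone());
        }
    }
    Some(seq)
}
```

An edit at the start of the file yields a `None` first gap, so the sequence begins directly with the first window.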

### Step 5 — splice segments + register

Walk the original `MDBFileInfo.segments` and replace any segment that
falls inside a window with that window's freshly-uploaded segments.
Verification entries follow segment-for-segment when present.
`metadata_ext = None` (no SHA-256, see Limitations). Then
`register_composed_file` + `finalize`.
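
The segment splice can be sketched on a simplified model. Here a segment is reduced to a `(length, label)` pair and windows are assumed to be segment-aligned and sorted; `Seg` and `splice` are hypothetical names, and verification entries are left out.

```rust
use std::ops::Range;

/// A segment reduced to its byte length and a label standing in for the payload.
type Seg = (u64, &'static str);

/// Replaces every run of original segments covered by a window with that
/// window's freshly uploaded segments; all other segments are kept as-is.
fn splice(original: &[Seg], windows: &[(Range<u64>, Vec<Seg>)]) -> Vec<Seg> {
    let mut out = Vec::new();
    let mut offset = 0u64;
    let mut w = 0; // next window that may cover a segment
    let mut emitted = false; // whether windows[w]'s replacement is already in `out`
    for &(len, label) in original {
        let seg = offset..offset + len;
        offset = seg.end;
        // Move past windows that end at or before this segment.
        while w < windows.len() && windows[w].0.end <= seg.start {
            w += 1;
            emitted = false;
        }
        let inside = w < windows.len()
            && windows[w].0.start <= seg.start
            && seg.end <= windows[w].0.end;
        if !inside {
            out.push((len, label));
        } else if !emitted {
            out.extend(windows[w].1.iter().copied());
            emitted = true;
        }
    }
    out
}
```

Note that the replacement segments need not add up to the window's original length, which is how resize edits change the file size.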

### Multi-window example

Two edits: replace `[50MB, 51MB)` and `[150MB, 151MB)` on a 200MB file:

```
+-----------+-------+------------+-------+-----------+
|  GAP 0    |  W0   |   GAP 1    |  W1   |  GAP 2    |
|  reused   |upload |  reused    |upload |  reused   |
| (subtree) | ~1MB  | (subtree)  | ~1MB  | (subtree) |
+-----------+-------+------------+-------+-----------+

Wire transfer: ~2MB upload + a few hundred KB of CAS reads for window
boundary chunks. Old approach: 200MB download + 200MB upload.
```

### Empty original short-circuit

When `original_size == 0` there is nothing to compose against — every
edit's `original_range` must be `0..0` (validated). We just stream the
new bytes through a fresh cleaner (`upload_fresh_file`).
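
The validation mentioned here can be sketched as a single pass over the edit ranges. `EditError` and `validate` are hypothetical names, and the sketch assumes that ranges which merely touch (for example `1..5` followed by `5..5`) are allowed; the real rules may be stricter.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum EditError {
    Reversed(usize),
    PastEnd(usize),
    Unsorted(usize),
    Overlapping(usize),
}

/// Checks that ranges are well-formed, within `original_size`, sorted and
/// non-overlapping. With `original_size == 0` only `0..0` can pass.
fn validate(original_size: u64, ranges: &[Range<u64>]) -> Result<(), EditError> {
    let mut prev: Option<&Range<u64>> = None;
    for (i, r) in ranges.iter().enumerate() {
        if r.start > r.end {
            return Err(EditError::Reversed(i));
        }
        if r.end > original_size {
            return Err(EditError::PastEnd(i));
        }
        if let Some(p) = prev {
            if r.start < p.start {
                return Err(EditError::Unsorted(i));
            }
            if r.start < p.end {
                return Err(EditError::Overlapping(i));
            }
        }
        prev = Some(r);
    }
    Ok(())
}
```

Because the bounds check uses `original_size`, the empty-original case needs no special handling in this sketch: any non-empty range is rejected as past the end.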

</details>

### Reviewer note: `chunk_window_builder` is a re-implementation of xetcas

`xet_client/src/cas_client/chunk_window_builder.rs` is a port of the
same window-building state machine that already lives in xetcas — it's
only used by the local / in-memory simulation clients (`local_client`,
`memory_client`) so the mock CAS server returns the same shape as the
real one in tests. **No need to re-review it as part of this PR**: it
mirrors logic already reviewed and merged in xetcas#987. A follow-up
xetcas PR will deduplicate by removing the server-side copy and pulling
this one in (or vice versa); the duplication is intentional and
temporary.

### Limitations

- **No SHA-256 metadata**: composed files have `metadata_ext = None`
since recomputing SHA-256 would require reading the full file. Only
suitable for contexts that don't require SHA-256 verification (HF
buckets, xet-native repos), not for Git LFS-backed repos.
- **Memory**: for very large files, the per-window in-memory state
(chunk hash list + composed segments) is bounded by the dirty regions,
not the whole file. The chunk-hashes response is paginated by the
server-defined window granularity.

### Tests (27)

Covering all edit shapes + edge cases. Notable:

| Test | Purpose |
|---|---|
| `test_resize_edits_abc` | The 3 motivating FUSE examples |
| `test_resize_large_replace_grows_file` | Replace `[a..b)` with much more data |
| `test_resize_large_replace_shrinks_file` | Replace `[a..b)` with much less data |
| `test_resize_mid_file_insert` | Pure insert in the middle |
| `test_resize_mid_file_delete` | Pure delete in the middle |
| `test_resize_multi_edit_mix` | Insert + replace + delete in one call |
| `test_resize_insert_at_segment_boundary` | Snapping correctness for inserts |
| `test_upload_ranges_mid_file_edit` | In-place edit |
| `test_upload_ranges_truncation` | Pure truncate (sub-segment) |
| `test_upload_ranges_truncation_empty_staging` | Truncate when staging is all-zero (boundary read from CAS) |
| `test_upload_ranges_truncation_with_overlapping_dirty` | Truncate + dirty range overlapping the boundary |
| `test_truncate_to_empty_matches_clean_empty` | Truncating to 0 hashes to `MerkleHash::default()` (matches a fresh empty cleaner) |
| `test_upload_ranges_append` | Pure append |
| `test_append_with_gap_before_dirty_range` | Append where reader covers a sparse gap too |
| `test_append_sparse_staging_file` | Append on a sparse staging file |
| `test_mid_edit_plus_append` | Mid-file edit *and* append in one call (P1 codex regression) |
| `test_empty_original_append` | `original_size == 0` + append falls into the fresh-file path (P2 codex regression) |
| `test_empty_original_validates_ranges` | `original_size == 0` still runs validation (reviewer regression) |
| `test_upload_ranges_at_file_start` | Edit at offset 0 (no stable prefix) |
| `test_upload_ranges_multiple_regions` | Two non-adjacent dirty windows with stable gap |
| `test_single_input_spanning_many_chunks` | One edit covering many CDC chunks |
| `test_data_integrity_scenarios` | 5 sub-scenarios covering composition correctness |
| `test_noop_returns_original_hash` | Empty `dirty_inputs` → no CAS call, original hash returned |
| `test_rejects_dirty_range_past_total_size` | Validation: range past `original_size` |
| `test_rejects_overlapping_dirty_ranges` | Validation: overlapping edits |
| `test_rejects_unsorted_dirty_ranges` | Validation: unsorted edits |
| `test_upload_ranges_small_file_mid_edit` | Small files (single segment) |

### Dependencies

- xetcas: `GET /v2/file-chunk-hashes/{file_id}` with `windows[] +
hash_ranges[]` response shape — huggingface-internal/xetcas#987
(merged).
- Consumer: huggingface-internal/hf-mount#41.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> High risk because it adds a new partial-upload composition path that
splices CAS segments and recomputes file hashes from window subtrees,
touching core data integrity and client/server chunk-boundary logic.
> 
> **Overview**
> Adds range-aware file writes via new `upload_ranges`, letting callers
apply insert/delete/replace edits and upload only re-chunked dirty
windows while reusing stable CAS segments.
> 
> Introduces a new CAS API `get_file_chunk_hashes` (`GET
/v2/file-chunk-hashes/{file_id}` with `X-Range-Dirty`) plus response
types (`FileChunkHashesResponse`, `ChunkWindow`) and simulation support
(`chunk_window_builder`) that extends dirty ranges to *stable* chunk
boundaries and returns gap `MerkleHashSubtree` summaries +
stable-segment verification.
> 
> Refactors dedup/cleaning plumbing to expose per-chunk hash lists
(`ChunkHashList`), adds detached cleaner/session completion and
`register_composed_file` to avoid orphan shard entries, and
moves/re-exports `next_stable_chunk_boundary` into `xet_core_structures`
for shared stable-window computations.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
2f4cee46df. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Arpit Jain <arpitjain099@gmail.com>
Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
Co-authored-by: tison <wander4096@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Di Xiao <seanses@users.noreply.github.com>
Co-authored-by: Arpit Jain <3242828+arpitjain099@users.noreply.github.com>
Co-authored-by: Assaf Vayner <assaf@huggingface.co>
Co-authored-by: Rajat Arya <rajatarya@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:27:59 +02:00
src
feat: range aware file write (#717)
2026-05-21 21:27:59 +02:00
tests
Reduce workspace dependencies (batches 1-3) (#746)
2026-03-27 09:54:36 -07:00
Cargo.toml
set version 1.5.2 (#805)
2026-04-20 15:06:14 -07:00
README.md
Add README.md files and Cargo.toml updates needed for publishing hf-xet (#773)
2026-04-03 12:34:47 -07:00
README.md

xet-client

Client for communicating with Hugging Face Xet storage servers.
Overview

Upload and download data and metadata objects from the backend Hugging Face Xet storage servers. Features automatic concurrency adaptations, connection pooling, and retry resiliency. Intended to be used through the API in the hf-xet package.
This crate is part of xet-core.
License

Apache-2.0