xet-core

mirror of https://github.com/huggingface/xet-core.git synced 2026-06-04 13:30:29 +08:00

Adrien 40f9530753 feat: range aware file write (#717 )

## Summary

APIs for range-aware file writes: instead of re-uploading an entire file
when only part of it changed, compose a new CAS file from stable
segments + re-chunked dirty windows. Supports resize edits (insert /
delete / arbitrary replace) in addition to in-place rewrites.

### API: `upload_ranges`

```rust
pub async fn upload_ranges(
    config: Arc<TranslatorConfig>,
    cas_client: Arc<dyn Client>,
    original_hash: MerkleHash,
    original_size: u64,
    dirty_inputs: Vec<DirtyInput>,
) -> Result<XetFileInfo>
```

```rust
/// A single edit applied to the original file: replace `original_range` with
/// `new_length` bytes from `reader`. Edits are expressed in original-file coordinates.
pub struct DirtyInput {
    pub original_range: Range<u64>,
    pub reader: Pin<Box<dyn AsyncRead + Send>>,
    pub new_length: u64,
}
```

The output file size is **derived** from the inputs (no `total_size`
parameter): `original_size - removed + added`.
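
The size arithmetic can be sketched as below. `EditSize` and `output_size` are hypothetical names used only for illustration (they are not part of the xet-core API), and the sketch assumes the edits have already been validated as sorted and non-overlapping.

```rust
use std::ops::Range;

/// Hypothetical, reader-free stand-in for `DirtyInput`: only the fields
/// needed for the size arithmetic.
pub struct EditSize {
    pub original_range: Range<u64>,
    pub new_length: u64,
}

/// Derived output size: `original_size - removed + added`.
/// Returns `None` for an inverted range, if the edits remove more bytes
/// than the original holds, or if a sum overflows `u64`.
pub fn output_size(original_size: u64, edits: &[EditSize]) -> Option<u64> {
    let mut removed: u64 = 0;
    let mut added: u64 = 0;
    for e in edits {
        let len = e.original_range.end.checked_sub(e.original_range.start)?;
        removed = removed.checked_add(len)?;
        added = added.checked_add(e.new_length)?;
    }
    original_size.checked_sub(removed)?.checked_add(added)
}
```

Because the size is a pure function of `original_size` and the edits, a caller cannot pass a total that disagrees with the edits it supplies.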

### Edit shapes (all expressible with the same struct)

| Operation | `original_range` | `new_length` |
|---|---|---|
| In-place edit | `a..b` | `b - a` |
| Resize replace | `a..b` | any |
| Pure insert | `p..p` | `> 0` |
| Pure delete | `a..b` | `0` |
| Append | `original_size..original_size` | `> 0` |
| Truncate to N | `N..original_size` | `0` |
| No-op | empty `dirty_inputs` | — |
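
To make the table concrete, the following sketch classifies a single edit into one of the shapes above. `EditShape` and `classify` are hypothetical helpers, not part of the PR; note that "append" and "truncate" are just an insert and a delete that touch `original_size`.

```rust
use std::ops::Range;

/// Hypothetical labels for the rows of the table above.
#[derive(Debug, PartialEq, Eq)]
pub enum EditShape {
    InPlace,
    ResizeReplace,
    Insert,
    Delete,
    Append,
    Truncate,
}

/// Classify one edit. Returns `None` for an invalid range or for an edit
/// that changes nothing (empty range and no new bytes).
pub fn classify(original_size: u64, range: &Range<u64>, new_length: u64) -> Option<EditShape> {
    if range.start > range.end || range.end > original_size {
        return None;
    }
    let old_length = range.end - range.start;
    let shape = match (old_length, new_length) {
        (0, 0) => return None,
        // Empty original range: nothing removed, bytes only added.
        (0, _) if range.start == original_size => EditShape::Append,
        (0, _) => EditShape::Insert,
        // Non-empty original range with no new bytes: bytes only removed.
        (_, 0) if range.end == original_size => EditShape::Truncate,
        (_, 0) => EditShape::Delete,
        (old, new) if old == new => EditShape::InPlace,
        _ => EditShape::ResizeReplace,
    };
    Some(shape)
}
```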

Motivating example:

```text
abc + upload_ranges([0..1), "foo", 3) = foobc
abc + upload_ranges([0..0), "foo", 3) = fooabc
abc + upload_ranges([0..1), "",    0) = bc
```
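
The semantics of these examples can be captured by a small in-memory reference model. This is only a sketch of the intended result, not how `upload_ranges` works internally (it never materializes the whole file); `apply_edits` is a hypothetical name.

```rust
use std::ops::Range;

/// Reference model: apply edits to an in-memory buffer. Edits must be
/// sorted, non-overlapping, and expressed in original-file coordinates.
pub fn apply_edits(original: &[u8], edits: &[(Range<usize>, &[u8])]) -> Vec<u8> {
    let mut out = Vec::with_capacity(original.len());
    let mut cursor = 0usize; // next unread offset in `original`
    for (range, new_bytes) in edits {
        assert!(
            cursor <= range.start && range.start <= range.end && range.end <= original.len(),
            "edits must be sorted, non-overlapping and within the original"
        );
        // Keep the untouched bytes before the edit, then emit the new bytes.
        out.extend_from_slice(&original[cursor..range.start]);
        out.extend_from_slice(new_bytes);
        // Skip the replaced bytes of the original.
        cursor = range.end;
    }
    out.extend_from_slice(&original[cursor..]);
    out
}
```

A model like this is handy as a test oracle: the bytes reconstructed from the composed file should equal `apply_edits(original, edits)`.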

**Per-range `AsyncRead` instead of `ReadSeek` over the staging file.**
The earlier prototype took `dirty_ranges: &[(u64, u64)] + dirty_source:
&mut dyn ReadSeek`. That had a subtle bug: for truncation we silently
extended the dirty set with a boundary chunk and read those bytes from
the staging file, but if the file was never opened for write the staging
file contains zeros at those positions (real bytes are in CAS) → silent
corruption on the truncation boundary chunk. Pairing each edit with its
own reader makes that structurally impossible: any byte not provided by
the caller is fetched from CAS.

<details>
<summary>How it works</summary>

### High level

```
                           upload_ranges
   +----------------------+   |   +----------------------+
   |  original file (CAS) |---+-->|  composed file (CAS) |
   +----------------------+       +----------------------+
   only the dirty windows are re-uploaded; everything else
   is reused as whole CAS segments.
```

### Step 1 — coalesce + snap edits to segment boundaries

Edits are expressed in user coordinates (byte ranges). We snap each edit's
`original_range` to the **enclosing CAS segments** so composition can
swap whole segments instead of truncating one mid-chunk. Adjacent /
overlapping snapped ranges are then coalesced.

Pure inserts (`start == end`) snap to the segment that owns `start`; an
insert at `original_size` snaps to the last segment.
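
A minimal sketch of the snapping and coalescing rules, under stated assumptions: segments are modeled as a sorted list of cumulative end offsets (`segment_ends`, with the last entry equal to `original_size`), an insert exactly on a boundary is treated as owned by the segment that starts there, and both function names are hypothetical.

```rust
use std::ops::Range;

/// Index of the segment that owns byte offset `p`; an offset at or past
/// the end of the file maps to the last segment.
fn owning_segment(segment_ends: &[u64], p: u64) -> usize {
    let idx = segment_ends.partition_point(|&end| end <= p);
    idx.min(segment_ends.len() - 1)
}

/// Snap each (sorted) edit range to its enclosing segments, then merge
/// snapped ranges that touch or overlap.
pub fn snap_and_coalesce(segment_ends: &[u64], edits: &[Range<u64>]) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    if segment_ends.is_empty() {
        return out;
    }
    for e in edits {
        let first = owning_segment(segment_ends, e.start);
        // A non-empty range last touches byte `end - 1`; a pure insert
        // touches only the segment that owns `start`.
        let last = if e.end > e.start {
            owning_segment(segment_ends, e.end - 1)
        } else {
            first
        };
        let start = if first == 0 { 0 } else { segment_ends[first - 1] };
        let end = segment_ends[last];
        if let Some(prev) = out.last_mut() {
            if start <= prev.end {
                prev.end = prev.end.max(end);
                continue;
            }
        }
        out.push(start..end);
    }
    out
}
```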

### Step 2 — server returns windows + gap subtrees

Single CAS call: `GET /v2/file-chunk-hashes/{file_id}` with the
segment-aligned ranges in an `X-Range-Dirty: bytes=A-B,C-D` header.
Response shape (xetcas#987):

```rust
struct FileChunkHashesResponse {
    windows:      Vec<ChunkWindow>,         // one per dirty range
    hash_ranges:  Vec<Option<MerkleHashSubtree>>, // N+1 entries: [gap0, gap1, ..., gapN]
}
```

`windows[i].chunks` carries the chunk hashes the server actually owns
for that window (we re-upload these bytes). `hash_ranges[i]` is the
**MerkleHashSubtree** for the i-th unmodified gap, or `None` when there
is no gap there. This is the key to composing the final file hash
without touching unmodified bytes.
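
For illustration, the request header from this step could be built as follows. This sketch assumes the `bytes=A-B` syntax uses inclusive end offsets, as HTTP `Range` does; the source does not state whether `B` is inclusive, and `x_range_dirty_value` is a hypothetical helper.

```rust
use std::ops::Range;

/// Format segment-aligned, half-open ranges as an `X-Range-Dirty` value.
/// ASSUMPTION: `B` in `bytes=A-B` is inclusive (HTTP `Range` style), so a
/// half-open `a..b` is written as `a-(b-1)`.
/// Returns `None` if there are no ranges or any range is empty or inverted.
pub fn x_range_dirty_value(ranges: &[Range<u64>]) -> Option<String> {
    if ranges.is_empty() || ranges.iter().any(|r| r.start >= r.end) {
        return None;
    }
    let parts: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end - 1))
        .collect();
    Some(format!("bytes={}", parts.join(",")))
}
```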

### Step 3 — for each window, stream `[CAS prefix | edits | CAS suffix]` through a fresh cleaner

```
window = [w_start ............................................. w_end]
edits in this window:        [edit_a]    [edit_b]
                                ^           ^
streamed input to the cleaner:
  CAS bytes [w_start, edit_a.start)
  reader bytes for edit_a (new_length bytes)
  CAS bytes [edit_a.end, edit_b.start)
  reader bytes for edit_b
  CAS bytes [edit_b.end, w_end)
```

Pure inserts contribute zero original bytes but still emit `new_length`
reader bytes. Pure deletes contribute zero reader bytes. The cleaner
produces a new `MDBFileInfo` per window and a `ChunkHashList`.
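
The interleaving above can be sketched as a planning function that decides, for one window, which byte runs come from CAS and which come from an edit's reader. `Piece` and `plan_window` are hypothetical names; the real code streams bytes rather than building a plan.

```rust
use std::ops::Range;

/// One run of bytes fed to the cleaner for a window.
#[derive(Debug, PartialEq, Eq)]
pub enum Piece {
    /// Unmodified bytes fetched from CAS (original-file coordinates).
    Cas(Range<u64>),
    /// `new_length` bytes pulled from the reader of edit `edit_index`.
    Reader { edit_index: usize, new_length: u64 },
}

/// Plan the cleaner input for one window. `edits` holds
/// `(original_range, new_length)` pairs, sorted and contained in `window`.
pub fn plan_window(window: Range<u64>, edits: &[(Range<u64>, u64)]) -> Vec<Piece> {
    let mut plan = Vec::new();
    let mut cursor = window.start; // next original byte not yet accounted for
    for (i, (range, new_length)) in edits.iter().enumerate() {
        assert!(cursor <= range.start && range.start <= range.end && range.end <= window.end);
        if cursor < range.start {
            plan.push(Piece::Cas(cursor..range.start));
        }
        // Pure deletes contribute no reader bytes.
        if *new_length > 0 {
            plan.push(Piece::Reader { edit_index: i, new_length: *new_length });
        }
        // Pure inserts leave the cursor where it was (`start == end`).
        cursor = range.end;
    }
    if cursor < window.end {
        plan.push(Piece::Cas(cursor..window.end));
    }
    plan
}
```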

### Step 4 — compose the file hash via `MerkleHashSubtree::merge`

```text
merge_seq = [gap0, w0, gap1, w1, ..., wN, gapN]   // skip None gaps

merged          = MerkleHashSubtree::merge(merge_seq)
aggregated_hash = merged.final_hash()
combined_hash   = aggregated_hash.hmac(zero)      // matches cleaner's file_hash
```

Special-case: if the composed file is empty (`total_size == 0`, e.g.
truncate to empty), the result is `MerkleHash::default()` *without* HMAC,
mirroring `file_hash([])`.
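
The ordering of the merge sequence can be sketched generically. This only shows the interleaving of gaps and windows with `None` gaps skipped; it does not implement `MerkleHashSubtree::merge` or the HMAC step, and `MergeItem` / `merge_sequence` are hypothetical names.

```rust
/// One element of the merge sequence: an unmodified gap or a re-chunked window.
#[derive(Debug, PartialEq, Eq)]
pub enum MergeItem<G, W> {
    Gap(G),
    Window(W),
}

/// Build `[gap0, w0, gap1, w1, ..., wN, gapN]`, skipping `None` gaps.
/// Returns `None` unless there is exactly one more gap slot than windows.
pub fn merge_sequence<G, W>(
    gaps: Vec<Option<G>>,
    windows: Vec<W>,
) -> Option<Vec<MergeItem<G, W>>> {
    if gaps.len() != windows.len() + 1 {
        return None;
    }
    let mut seq = Vec::with_capacity(gaps.len() + windows.len());
    let mut windows = windows.into_iter();
    for gap in gaps {
        if let Some(g) = gap {
            seq.push(MergeItem::Gap(g));
        }
        // The final gap has no window after it, so `next()` yields `None`.
        if let Some(w) = windows.next() {
            seq.push(MergeItem::Window(w));
        }
    }
    Some(seq)
}
```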

### Step 5 — splice segments + register

Walk the original `MDBFileInfo.segments` and replace any segment that
falls inside a window with that window's freshly-uploaded segments.
Verification entries follow segment-for-segment when present.
`metadata_ext = None` (no SHA-256, see Limitations). Then
`register_composed_file` + `finalize`.
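
A simplified model of the splice, assuming windows are sorted, non-overlapping, and aligned to segment boundaries. `Segment` here is a hypothetical stand-in that carries only a label and a length (no hashes or verification entries), and `splice_segments` is not the real function name.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a file segment: a label plus its byte length.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Segment {
    pub label: &'static str,
    pub len: u64,
}

/// Replace every original segment that lies inside a window with that
/// window's new segments (emitted once per window); keep the rest.
pub fn splice_segments(
    original: &[Segment],
    windows: &[(Range<u64>, Vec<Segment>)],
) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut emitted = vec![false; windows.len()];
    let mut offset = 0u64; // start offset of the current original segment
    let mut w = 0usize; // index of the next window that may cover a segment
    for seg in original {
        let seg_range = offset..offset + seg.len;
        // Skip windows that end at or before this segment.
        while w < windows.len() && windows[w].0.end <= seg_range.start {
            w += 1;
        }
        let inside = w < windows.len()
            && windows[w].0.start <= seg_range.start
            && seg_range.end <= windows[w].0.end;
        if inside {
            if !emitted[w] {
                out.extend(windows[w].1.iter().cloned());
                emitted[w] = true;
            }
        } else {
            out.push(seg.clone());
        }
        offset = seg_range.end;
    }
    out
}
```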

### Multi-window example

Two edits: replace `[50MB, 51MB)` and `[150MB, 151MB)` on a 200MB file:

```
+-----------+-------+------------+-------+-----------+
|  GAP 0    |  W0   |   GAP 1    |  W1   |  GAP 2    |
|  reused   |upload |  reused    |upload |  reused   |
| (subtree) | ~1MB  | (subtree)  | ~1MB  | (subtree) |
+-----------+-------+------------+-------+-----------+

Wire transfer: ~2MB upload + a few hundred KB of CAS reads for window
boundary chunks. Old approach: 200MB download + 200MB upload.
```

### Empty original short-circuit

When `original_size == 0` there is nothing to compose against — every
edit's `original_range` must be `0..0` (validated). We just stream the
new bytes through a fresh cleaner (`upload_fresh_file`).
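
The validation mentioned here (and exercised by the `test_rejects_*` tests) might look like the sketch below. `RangeError` and `validate_ranges` are hypothetical names, and allowing two edits that merely touch (`a.end == b.start`) is an assumption of this sketch. With `original_size == 0`, the bounds check alone forces every range to be `0..0`.

```rust
use std::ops::Range;

/// Hypothetical validation errors; the payload is the offending edit index.
#[derive(Debug, PartialEq, Eq)]
pub enum RangeError {
    Inverted(usize),
    PastEnd(usize),
    UnsortedOrOverlapping(usize),
}

/// Check that edit ranges are well-formed, within `original_size`,
/// sorted by start, and non-overlapping.
pub fn validate_ranges(original_size: u64, ranges: &[Range<u64>]) -> Result<(), RangeError> {
    let mut prev_end = 0u64;
    for (i, r) in ranges.iter().enumerate() {
        if r.start > r.end {
            return Err(RangeError::Inverted(i));
        }
        if r.end > original_size {
            return Err(RangeError::PastEnd(i));
        }
        // Starting before the previous edit ended means the input is
        // either unsorted or overlapping.
        if r.start < prev_end {
            return Err(RangeError::UnsortedOrOverlapping(i));
        }
        prev_end = r.end;
    }
    Ok(())
}
```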

</details>

### Reviewer note: `chunk_window_builder` is a re-implementation of xetcas

`xet_client/src/cas_client/chunk_window_builder.rs` is a port of the
same window-building state machine that already lives in xetcas — it's
only used by the local / in-memory simulation clients (`local_client`,
`memory_client`) so the mock CAS server returns the same shape as the
real one in tests. **No need to re-review it as part of this PR**: it
mirrors logic already reviewed and merged in xetcas#987. A follow-up
xetcas PR will deduplicate by removing the server-side copy and pulling
this one in (or vice versa); the duplication is intentional and
temporary.

### Limitations

- **No SHA-256 metadata**: composed files have `metadata_ext = None`
since recomputing SHA-256 would require reading the full file. Only
suitable for contexts that don't require SHA-256 verification (HF
buckets, xet-native repos), not for Git LFS-backed repos.
- **Memory**: for very large files, the per-window in-memory state
(chunk hash list + composed segments) is bounded by the dirty regions,
not the whole file. The chunk-hashes response is paginated by the
server-defined window granularity.

### Tests (27)

Covering all edit shapes + edge cases. Notable:

| Test | Purpose |
|---|---|
| `test_resize_edits_abc` | The 3 motivating FUSE examples |
| `test_resize_large_replace_grows_file` | Replace `[a..b)` with much more data |
| `test_resize_large_replace_shrinks_file` | Replace `[a..b)` with much less data |
| `test_resize_mid_file_insert` | Pure insert in the middle |
| `test_resize_mid_file_delete` | Pure delete in the middle |
| `test_resize_multi_edit_mix` | Insert + replace + delete in one call |
| `test_resize_insert_at_segment_boundary` | Snapping correctness for inserts |
| `test_upload_ranges_mid_file_edit` | In-place edit |
| `test_upload_ranges_truncation` | Pure truncate (sub-segment) |
| `test_upload_ranges_truncation_empty_staging` | Truncate when staging is all-zero (boundary read from CAS) |
| `test_upload_ranges_truncation_with_overlapping_dirty` | Truncate + dirty range overlapping the boundary |
| `test_truncate_to_empty_matches_clean_empty` | Truncating to 0 hashes to `MerkleHash::default()` (matches a fresh empty cleaner) |
| `test_upload_ranges_append` | Pure append |
| `test_append_with_gap_before_dirty_range` | Append where reader covers a sparse gap too |
| `test_append_sparse_staging_file` | Append on a sparse staging file |
| `test_mid_edit_plus_append` | Mid-file edit *and* append in one call (P1 codex regression) |
| `test_empty_original_append` | `original_size == 0` + append falls into the fresh-file path (P2 codex regression) |
| `test_empty_original_validates_ranges` | `original_size == 0` still runs validation (reviewer regression) |
| `test_upload_ranges_at_file_start` | Edit at offset 0 (no stable prefix) |
| `test_upload_ranges_multiple_regions` | Two non-adjacent dirty windows with stable gap |
| `test_single_input_spanning_many_chunks` | One edit covering many CDC chunks |
| `test_data_integrity_scenarios` | 5 sub-scenarios covering composition correctness |
| `test_noop_returns_original_hash` | Empty `dirty_inputs` → no CAS call, original hash returned |
| `test_rejects_dirty_range_past_total_size` | Validation: range past `original_size` |
| `test_rejects_overlapping_dirty_ranges` | Validation: overlapping edits |
| `test_rejects_unsorted_dirty_ranges` | Validation: unsorted edits |
| `test_upload_ranges_small_file_mid_edit` | Small files (single segment) |

### Dependencies

- xetcas: `GET /v2/file-chunk-hashes/{file_id}` with `windows[] +
hash_ranges[]` response shape — huggingface-internal/xetcas#987
(merged).
- Consumer: huggingface-internal/hf-mount#41.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> High risk because it adds a new partial-upload composition path that
splices CAS segments and recomputes file hashes from window subtrees,
touching core data integrity and client/server chunk-boundary logic.
> 
> **Overview**
> Adds range-aware file writes via new `upload_ranges`, letting callers
apply insert/delete/replace edits and upload only re-chunked dirty
windows while reusing stable CAS segments.
> 
> Introduces a new CAS API `get_file_chunk_hashes` (`GET
/v2/file-chunk-hashes/{file_id}` with `X-Range-Dirty`) plus response
types (`FileChunkHashesResponse`, `ChunkWindow`) and simulation support
(`chunk_window_builder`) that extends dirty ranges to *stable* chunk
boundaries and returns gap `MerkleHashSubtree` summaries +
stable-segment verification.
> 
> Refactors dedup/cleaning plumbing to expose per-chunk hash lists
(`ChunkHashList`), adds detached cleaner/session completion and
`register_composed_file` to avoid orphan shard entries, and
moves/re-exports `next_stable_chunk_boundary` into `xet_core_structures`
for shared stable-window computations.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
2f4cee46df.</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Arpit Jain <arpitjain099@gmail.com>
Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
Co-authored-by: tison <wander4096@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Di Xiao <seanses@users.noreply.github.com>
Co-authored-by: Arpit Jain <3242828+arpitjain099@users.noreply.github.com>
Co-authored-by: Assaf Vayner <assaf@huggingface.co>
Co-authored-by: Rajat Arya <rajatarya@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-21 21:27:59 +02:00

| Path | Last commit | Date |
|---|---|---|
| .cargo | fixing some issues identified in cargo audit (#802) | 2026-04-20 14:49:48 -07:00 |
| .github | ci: route Linux jobs through internal HF package proxies (#844) | 2026-05-21 11:01:31 -07:00 |
| .vscode | Test suite for directory logging functionality (#536) | 2025-10-24 10:06:26 -07:00 |
| api_changes | feat: range aware file write (#717) | 2026-05-21 21:27:59 +02:00 |
| docs | Fix simulation deletion controls and soft-delete behavior for GC simulation (#736) | 2026-03-20 10:02:21 -07:00 |
| examples/xet_pkg_napi | Add napi smoke-test example for hf-xet (#835) | 2026-05-14 13:48:47 -07:00 |
| git_xet | Fix spelling typos in comments and docs (#826) | 2026-04-30 13:15:18 -07:00 |
| hf_xet | Bump openssl from 0.10.76 to 0.10.79 (#836) | 2026-05-08 11:43:17 -07:00 |
| openapi | V2 reconstruction with client-side optional single range splitting (#703) | 2026-03-16 14:10:50 -07:00 |
| scripts | feat: smoke tests using hf CLI with bucket and large-file coverage (#710) | 2026-03-17 19:07:05 -07:00 |
| simulation | Bump openssl from 0.10.76 to 0.10.79 (#836) | 2026-05-08 11:43:17 -07:00 |
| wasm | feat: range aware file write (#717) | 2026-05-21 21:27:59 +02:00 |
| xet_client | feat: range aware file write (#717) | 2026-05-21 21:27:59 +02:00 |
| xet_core_structures | feat: range aware file write (#717) | 2026-05-21 21:27:59 +02:00 |
| xet_data | feat: range aware file write (#717) | 2026-05-21 21:27:59 +02:00 |
| xet_pkg | feat: range aware file write (#717) | 2026-05-21 21:27:59 +02:00 |
| xet_runtime | feat: range aware file write (#717) | 2026-05-21 21:27:59 +02:00 |
| .gitignore | Move test-only deps to dev-dependencies in git_xet (#767) | 2026-03-31 13:31:20 -07:00 |
| Cargo.lock | Bump openssl from 0.10.76 to 0.10.79 (#836) | 2026-05-08 11:43:17 -07:00 |
| Cargo.toml | Add napi smoke-test example for hf-xet (#835) | 2026-05-14 13:48:47 -07:00 |
| CODE_OF_CONDUCT.md | Added CoC, contribution guide, and updated readme (#133) | 2025-01-09 14:55:32 -08:00 |
| CONTRIBUTING.md | Added CoC, contribution guide, and updated readme (#133) | 2025-01-09 14:55:32 -08:00 |
| LICENSE | Added CoC, contribution guide, and updated readme (#133) | 2025-01-09 14:55:32 -08:00 |
| markdownlint.toml | spec draft (#422) | 2025-09-29 10:25:25 -07:00 |
| README.md | Add README.md files and Cargo.toml updates needed for publishing hf-xet (#773) | 2026-04-03 12:34:47 -07:00 |
| rustfmt.toml | run cargo fmt on everything (#59) | 2024-10-23 17:57:45 -07:00 |

README.md

🤗 xet-core - xet client tech, used in huggingface_hub

Welcome

xet-core enables huggingface_hub to utilize xet storage for uploading and downloading to HF Hub. Xet storage provides chunk-based deduplication, efficient storage/retrieval with local disk caching, and backwards compatibility with Git LFS. This library is not meant to be used directly, and is instead intended to be used from huggingface_hub.

Key features

♻ chunk-based deduplication implementation: avoid transferring and storing chunks that are shared across binary files (models, datasets, etc).

🤗 Python bindings: bindings for huggingface_hub package.

↔ network communications: concurrent communication to HF Hub Xet backend services (CAS).

🔖 local disk caching: chunk-based cache that sits alongside the existing huggingface_hub disk cache.

Packages

This repository produces the following packages:

Rust Crates (crates.io)

| Crate | Description |
|---|---|
| `hf-xet` | High-level client library for uploading and downloading files with chunk-based deduplication |
| `xet-client` | HTTP client for communicating with Hugging Face Xet storage servers |
| `xet-data` | Data processing pipeline for chunking, deduplication, and file reconstruction |
| `xet-core-structures` | Core data structures including MerkleHash, metadata shards, and Xorb objects |
| `xet-runtime` | Async runtime, configuration, logging, and utility infrastructure |

Python Package (PyPI)

| Package | Description |
|---|---|
| `hf-xet` | Python bindings for the Xet storage system, used by huggingface_hub |

Built from the hf_xet/ directory using maturin.

CLI Binary

| Binary | Description |
|---|---|
| `git-xet` | Git LFS compatible command-line tool for Xet storage |

Built from the git_xet/ directory. Distributed via GitHub releases.

Contributions (feature requests, bugs, etc.) are encouraged & appreciated 💙💚💛💜🧡❤️

Please join us in making xet-core better. We value everyone's contributions. Code is not the only way to help. Answering questions, helping each other, improving documentation, filing issues all help immensely. If you are interested in contributing (please do!), check out the contribution guide for this repository.

Issues, Diagnostics & Debugging

If you encounter an issue with hf-xet, please collect diagnostic information and attach it when creating a new Issue.

The scripts/diag/ directory contains platform-specific scripts that download debug symbols, configure logging, and capture periodic stack traces and core dumps:

| OS | Script |
|---|---|
| Linux | `scripts/diag/hf-xet-diag-linux.sh` |
| macOS | `scripts/diag/hf-xet-diag-macos.sh` |
| Windows (Git-Bash) | `scripts/diag/hf-xet-diag-windows.sh` |

# prefix your failing command with the script for your OS, e.g.:
./scripts/diag/hf-xet-diag-macos.sh -- python my-script.py

See scripts/diag/README.md for full usage, output layout, dump analysis instructions, and how to install debug symbols manually.

Quick debugging environment variables:

RUST_BACKTRACE=full          # full Rust backtraces on panic
RUST_LOG=info                # enable hf-xet logging
HF_XET_LOG_FILE=/tmp/xet.log # write logs to a file (defaults to stdout)

Local Development

Repo Organization

xet_pkg/ (hf-xet): High-level session API for uploading and downloading files with deduplication.
xet_client/ (xet-client): HTTP client for CAS and Hub backend services.
xet_data/ (xet-data): Chunking, deduplication, and file reconstruction pipeline.
xet_core_structures/ (xet-core-structures): MerkleHash, metadata shards, Xorb objects, and shared data structures.
xet_runtime/ (xet-runtime): Async runtime, configuration, logging, and utilities.
hf_xet/: Python bindings (maturin/PyO3), produces the hf-xet PyPI package.
git_xet/: Git LFS compatible CLI tool (git-xet).
wasm/: WebAssembly builds (hf_xet_wasm, hf_xet_thin_wasm).
simulation/: Simulation and benchmarking infrastructure.

Build, Test & Benchmark

To build xet-core, check the GitHub Actions CI workflow for the Rust toolchain version to install. Follow the Rust documentation to install rustup and that version of the toolchain, then use the following steps to build, test, and benchmark.

Many of us on the team use VSCode, so we have checked in some settings in the .vscode directory. Install the rust-analyzer extension.

Build:

cargo build

Test:

cargo test

Benchmark:

cargo bench

Linting:

cargo clippy -r --verbose -- -D warnings

Formatting (requires nightly toolchain):

cargo +nightly fmt --manifest-path ./Cargo.toml --all

Building Python package and running locally (on *nix systems):

Create Python3 virtualenv: python3 -mvenv ~/venv
Activate virtualenv: source ~/venv/bin/activate
Install maturin: pip3 install maturin ipython
Go to hf_xet crate: cd hf_xet
Build: maturin develop
Test:

ipython
import hf_xet as hfxet
hfxet.upload_files()
hfxet.download_files()

Developing with tokio console

The prerequisite is installing tokio-console (cargo install tokio-console). See https://github.com/tokio-rs/console

To use tokio-console with hf-xet, compile hf_xet with the following command:

RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console

Then while hf_xet is running (via a hf cli command or huggingface_hub python code), tokio-console will be able to connect.

Ex.

# In one terminal:
pip install huggingface_hub
RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console
hf download openai/gpt-oss-20b

# In another terminal
cargo install tokio-console
tokio-console

Building universal whl for MacOS:

From hf_xet directory:

MACOSX_DEPLOYMENT_TARGET=10.9 maturin build --release --target universal2-apple-darwin --features openssl_vendored

Note: You may need to install x86_64: rustup target add x86_64-apple-darwin

Testing

Unit-tests are run with cargo test, benchmarks are run with cargo bench. Some crates have a main.rs that can be run for manual testing.

References & History

Technical Blog posts
Git is for Data (CIDR paper)
History: this repository is adapted from an earlier project also named xet-core, which contains deep git integration along with a very different backend services implementation.

Languages

Rust 96.7%

Python 1.7%

Shell 1.3%

JavaScript 0.1%

HTML 0.1%