This PR adds a function, `next_stable_chunk_boundary`, that takes a list
of chunk boundary positions and a starting cut point and returns the
next stable chunk boundary after the cut point. A boundary is stable if,
for every possible alteration of the data up to the cut point, chunking
the entire file produces the same chunk boundaries from that boundary
onward.
The implication of this is that to alter a specific range of a file `[a,
b)`, we would do the following:
1. Locate the previous chunk boundary before a; call this `c_start`.
2. Take the full set of chunk boundary locations and call
`next_stable_chunk_boundary` with `b` as the cut point. This returns the
next stable chunk boundary; call this `c_end`.
3. Make the replacement to `[a, b)`; prepend the original `data[c_start,
a)` and append `data[b, c_end)`; chunk this segment.
4. Use the Merkle hash subtrees for `[0, c_start)`, the new `[c_start,
c_end)`, and the original `[c_end, end)` to calculate the new file hash.
This will be the same as chunking the entire new file.
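
Steps 1 to 3 above can be sketched roughly as follows. This is an illustration only, not code from this PR: `rechunk_span` and `build_segment` are hypothetical helper names, and the stable-boundary lookup is passed in as a closure because the real signature of `next_stable_chunk_boundary` is not shown here. The sketch also assumes that `boundaries` holds sorted chunk end offsets and that the end of the file is used when no stable boundary exists. Step 4 (recombining Merkle subtrees) is left out.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of xet-core): the byte span `[c_start, c_end)`
/// that has to be re-chunked when `edit = [a, b)` is replaced.
/// `boundaries` is assumed to hold sorted chunk end offsets.
/// `next_stable` stands in for `next_stable_chunk_boundary`; its signature is assumed.
fn rechunk_span(
    boundaries: &[usize],
    file_len: usize,
    edit: &Range<usize>,
    next_stable: impl Fn(&[usize], usize) -> Option<usize>,
) -> Range<usize> {
    // Step 1: last chunk boundary strictly before `a`, or the start of the file.
    let idx = boundaries.partition_point(|&p| p < edit.start);
    let c_start = if idx == 0 { 0 } else { boundaries[idx - 1] };

    // Step 2: next stable boundary with `b` as the cut point.
    // Assumption: fall back to the end of the file when none is found.
    let c_end = next_stable(boundaries, edit.end).unwrap_or(file_len);

    c_start..c_end
}

/// Step 3: original `data[c_start, a)`, then the replacement, then original `data[b, c_end)`.
fn build_segment(
    data: &[u8],
    span: &Range<usize>,
    edit: &Range<usize>,
    replacement: &[u8],
) -> Vec<u8> {
    let capacity = (edit.start - span.start) + replacement.len() + (span.end - edit.end);
    let mut segment = Vec::with_capacity(capacity);
    segment.extend_from_slice(&data[span.start..edit.start]);
    segment.extend_from_slice(replacement);
    segment.extend_from_slice(&data[edit.end..span.end]);
    segment
}
```

The resulting segment is what would be handed to the chunker in step 3. The replacement may be shorter or longer than `[a, b)`, so offsets after `a` in the new file can shift relative to the original.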
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds new public chunk-boundary selection logic used to make
resumed/partial workflows deterministic; mistakes could cause
misalignment or incorrect resume behavior in deduplication/chunking
paths. Large new randomized/stress tests reduce risk but the algorithm’s
correctness assumptions are subtle.
>
> **Overview**
> Introduces a new public helper, `next_stable_chunk_boundary`, that
computes a restart-safe/stable resume boundary *from existing
chunk-boundary metadata* (no byte access) by scanning for two
consecutive chunks that fall within a conservative size window derived
from chunking constants.
>
> Updates `find_partitions` documentation to reflect the hash
warmup/hidden-trigger verification approach and to reference the new
helper, re-exports the function from `xet_data::deduplication`, and adds
extensive edge-case and randomized mutation/stress tests to validate
boundary stability under arbitrary prefix changes.
<!-- /CURSOR_SUMMARY -->
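
The scan described in the Overview (two consecutive chunks whose sizes fall within a size window, found using boundary metadata only) might look roughly like the sketch below. It is not the implementation in this PR. The function name, the inclusive `(min, max)` window parameter, the rule that candidate chunks must start at or after the cut point, and the choice to return the end of the second chunk are all assumptions made for illustration. The actual window is derived from the chunking constants, which are not given here.

```rust
/// Hypothetical sketch, not the xet-core implementation.
/// `boundaries`: sorted, strictly increasing chunk end offsets.
/// `window`: assumed inclusive (min, max) chunk-size range; the real bounds
/// come from chunking constants that this description does not list.
fn next_stable_boundary_sketch(
    boundaries: &[usize],
    cut: usize,
    window: (usize, usize),
) -> Option<usize> {
    let (lo, hi) = window;
    let in_window = |size: usize| size >= lo && size <= hi;

    // First boundary at or after the cut point. Chunks that begin at this
    // boundary or later start at or after the cut (an assumed requirement).
    let first = boundaries.partition_point(|&p| p < cut);

    // Each window of three boundaries describes two consecutive chunks.
    for w in boundaries[first..].windows(3) {
        let first_size = w[1] - w[0];
        let second_size = w[2] - w[1];
        if in_window(first_size) && in_window(second_size) {
            // Assumption: report the end of the second qualifying chunk.
            return Some(w[2]);
        }
    }

    // No qualifying pair after the cut point.
    None
}
```

Only offsets are inspected, which matches the "no byte access" property in the Overview. A caller would need to decide what to do with `None`, for example treating the end of the file as the boundary.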
🤗 xet-core - xet client tech, used in huggingface_hub
Welcome
xet-core enables huggingface_hub to utilize xet storage for uploading and downloading to HF Hub. Xet storage provides chunk-based deduplication, efficient storage/retrieval with local disk caching, and backwards compatibility with Git LFS. This library is not meant to be used directly, and is instead intended to be used from huggingface_hub.
Key features
♻ chunk-based deduplication implementation: avoid transferring and storing chunks that are shared across binary files (models, datasets, etc).
🤗 Python bindings: bindings for huggingface_hub package.
↔ network communications: concurrent communication to HF Hub Xet backend services (CAS).
🔖 local disk caching: chunk-based cache that sits alongside the existing huggingface_hub disk cache.
Packages
This repository produces the following packages:
Rust Crates (crates.io)
| Crate | Description |
|---|---|
| hf-xet | High-level client library for uploading and downloading files with chunk-based deduplication |
| xet-client | HTTP client for communicating with Hugging Face Xet storage servers |
| xet-data | Data processing pipeline for chunking, deduplication, and file reconstruction |
| xet-core-structures | Core data structures including MerkleHash, metadata shards, and Xorb objects |
| xet-runtime | Async runtime, configuration, logging, and utility infrastructure |
Python Package (PyPI)
| Package | Description |
|---|---|
| hf-xet | Python bindings for the Xet storage system, used by huggingface_hub |
Built from the hf_xet/ directory using maturin.
CLI Binary
| Binary | Description |
|---|---|
| git-xet | Git LFS compatible command-line tool for Xet storage |
Built from the git_xet/ directory. Distributed via GitHub releases.
Contributions (feature requests, bugs, etc.) are encouraged & appreciated 💙💚💛💜🧡❤️
Please join us in making xet-core better. We value everyone's contributions. Code is not the only way to help. Answering questions, helping each other, improving documentation, filing issues all help immensely. If you are interested in contributing (please do!), check out the contribution guide for this repository.
Issues, Diagnostics & Debugging
If you encounter an issue with hf-xet, please collect diagnostic information
and attach it when creating a new Issue.
The scripts/diag/ directory contains platform-specific scripts
that download debug symbols, configure logging, and capture periodic stack traces
and core dumps:
| OS | Script |
|---|---|
| Linux | scripts/diag/hf-xet-diag-linux.sh |
| macOS | scripts/diag/hf-xet-diag-macos.sh |
| Windows (Git-Bash) | scripts/diag/hf-xet-diag-windows.sh |
# prefix your failing command with the script for your OS, e.g.:
./scripts/diag/hf-xet-diag-macos.sh -- python my-script.py
See scripts/diag/README.md for full usage, output layout, dump analysis instructions, and how to install debug symbols manually.
Quick debugging environment variables:
RUST_BACKTRACE=full # full Rust backtraces on panic
RUST_LOG=info # enable hf-xet logging
HF_XET_LOG_FILE=/tmp/xet.log # write logs to a file (defaults to stdout)
Local Development
Repo Organization
- xet_pkg/ (hf-xet): High-level session API for uploading and downloading files with deduplication.
- xet_client/ (xet-client): HTTP client for CAS and Hub backend services.
- xet_data/ (xet-data): Chunking, deduplication, and file reconstruction pipeline.
- xet_core_structures/ (xet-core-structures): MerkleHash, metadata shards, Xorb objects, and shared data structures.
- xet_runtime/ (xet-runtime): Async runtime, configuration, logging, and utilities.
- hf_xet/: Python bindings (maturin/PyO3), produces the hf-xet PyPI package.
- git_xet/: Git LFS compatible CLI tool (git-xet).
- wasm/: WebAssembly builds (hf_xet_wasm, hf_xet_thin_wasm).
- simulation/: Simulation and benchmarking infrastructure.
Build, Test & Benchmark
To build xet-core, check the GitHub Actions CI workflow for the Rust toolchain version to install. Follow the Rust documentation to install rustup and that version of the toolchain. Use the following steps for building, testing, and benchmarking.
Many of us on the team use VSCode, so we have checked in some settings in the .vscode directory. Install the rust-analyzer extension.
Build:
cargo build
Test:
cargo test
Benchmark:
cargo bench
Linting:
cargo clippy -r --verbose -- -D warnings
Formatting (requires nightly toolchain):
cargo +nightly fmt --manifest-path ./Cargo.toml --all
Building Python package and running locally (on *nix systems):
- Create Python3 virtualenv: `python3 -m venv ~/venv`
- Activate virtualenv: `source ~/venv/bin/activate`
- Install maturin: `pip3 install maturin ipython`
- Go to hf_xet crate: `cd hf_xet`
- Build: `maturin develop`
- Test: start `ipython`, then run:
import hf_xet as hfxet
hfxet.upload_files()
hfxet.download_files()
Developing with tokio console
Prerequisite: install tokio-console (`cargo install tokio-console`). See https://github.com/tokio-rs/console
To use tokio-console with hf-xet, compile hf_xet with the following command:
RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console
Then, while hf_xet is running (via an hf CLI command or huggingface_hub Python code), tokio-console will be able to connect.
Ex.
# In one terminal:
pip install huggingface_hub
RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console
hf download openai/gpt-oss-20b
# In another terminal
cargo install tokio-console
tokio-console
Building universal whl for MacOS:
From hf_xet directory:
MACOSX_DEPLOYMENT_TARGET=10.9 maturin build --release --target universal2-apple-darwin --features openssl_vendored
Note: You may need to install the x86_64 target: rustup target add x86_64-apple-darwin
Testing
Unit-tests are run with cargo test, benchmarks are run with cargo bench. Some crates have a main.rs that can be run for manual testing.
References & History
- Technical Blog posts
- Git is for Data (CIDR paper)
- History: xet-core is adapted from xet-core, which contains deep git integration, along with very different backend services implementation.