Add README.md files and Cargo.toml updates needed for publishing hf-xet (#773)

This PR adds crates.io-facing metadata (homepage, readme, keywords,
categories) for the publishable crates, along with crate README files
and concise crate-level docs so crates.io and docs.rs pages have better
context.
Hoyt Koepke
2026-04-03 12:34:47 -07:00
committed by GitHub
parent 014ff2d75b
commit 0d9f78aaf4
17 changed files with 291 additions and 30 deletions

@@ -18,6 +18,7 @@ exclude = ["simulation/chunk_cache_bench", "hf_xet", "wasm/hf_xet_wasm", "wasm/h
version = "1.4.0"
edition = "2024"
license = "Apache-2.0"
homepage = "https://github.com/huggingface/xet-core"
repository = "https://github.com/huggingface/xet-core"
[profile.release]

@@ -37,6 +37,36 @@ xet-core enables huggingface_hub to utilize xet storage for uploading and downlo
🔖 **local disk caching**: chunk-based cache that sits alongside the existing [huggingface_hub disk cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
## Packages
This repository produces the following packages:
### Rust Crates (crates.io)
| Crate | Description |
|-------|-------------|
| [`hf-xet`](https://crates.io/crates/hf-xet) | High-level client library for uploading and downloading files with chunk-based deduplication |
| [`xet-client`](https://crates.io/crates/xet-client) | HTTP client for communicating with Hugging Face Xet storage servers |
| [`xet-data`](https://crates.io/crates/xet-data) | Data processing pipeline for chunking, deduplication, and file reconstruction |
| [`xet-core-structures`](https://crates.io/crates/xet-core-structures) | Core data structures including MerkleHash, metadata shards, and Xorb objects |
| [`xet-runtime`](https://crates.io/crates/xet-runtime) | Async runtime, configuration, logging, and utility infrastructure |
### Python Package (PyPI)
| Package | Description |
|---------|-------------|
| [`hf-xet`](https://pypi.org/project/hf-xet/) | Python bindings for the Xet storage system, used by [huggingface_hub](https://github.com/huggingface/huggingface_hub) |
Built from the [`hf_xet/`](./hf_xet) directory using [maturin](https://github.com/PyO3/maturin).
### CLI Binary
| Binary | Description |
|--------|-------------|
| `git-xet` | Git LFS compatible command-line tool for Xet storage |
Built from the [`git_xet/`](./git_xet) directory. Distributed via [GitHub releases](https://github.com/huggingface/xet-core/releases).
## Contributions (feature requests, bugs, etc.) are encouraged & appreciated 💙💚💛💜🧡❤️
Please join us in making xet-core better. We value everyone's contributions. Code is not the only way to help. Answering questions, helping each other, improving documentation, filing issues all help immensely. If you are interested in contributing (please do!), check out the [contribution guide](https://github.com/huggingface/xet-core/blob/main/CONTRIBUTING.md) for this repository.
@@ -73,21 +103,17 @@ HF_XET_LOG_FILE=/tmp/xet.log # write logs to a file (defaults to stdout)
## Local Development
### Repo Organization - Rust Crates
### Repo Organization
* [cas_client](./cas_client): communication with CAS backend services, which include APIs for Xorbs and Shards.
* [cas_object](./cas_object): CAS object (Xorb) format and associated APIs, including chunks (ranges within Xorbs).
* [cas_types](./cas_types): common types shared across crates in xet-core and xetcas.
* [chunk_cache](./chunk_cache): local disk cache of Xorb chunks.
* [chunk_cache_bench](./chunk_cache_bench): benchmarking crate for chunk_cache.
* [data](./data): main driver for client operations - FilePointerTranslator drives hydrating or shrinking files, chunking + deduplication here.
* [error_printer](./error_printer): utility for printing errors conveniently.
* [file_utils](./file_utils): SafeFileCreator utility, used by chunk_cache.
* [hf_xet](./hf_xet): Python integration with Rust code, uses maturin to build `hf-xet` Python package. Main integration with HF Hub Python package.
* [mdb_shard](./mdb_shard): Shard operations, including Shard format, dedupe probing, benchmarks, and utilities.
* [merklehash](./merklehash): MerkleHash type, 256-bit hash, widely used across many crates.
* [progress_reporting](./progress_reporting): offers ReportedWriter so progress for Writer operations can be displayed.
* [utils](./utils): general utilities, including singleflight, progress, serialization_utils and threadpool.
* [`xet_pkg/`](./xet_pkg) (`hf-xet`): High-level session API for uploading and downloading files with deduplication.
* [`xet_client/`](./xet_client) (`xet-client`): HTTP client for CAS and Hub backend services.
* [`xet_data/`](./xet_data) (`xet-data`): Chunking, deduplication, and file reconstruction pipeline.
* [`xet_core_structures/`](./xet_core_structures) (`xet-core-structures`): MerkleHash, metadata shards, Xorb objects, and shared data structures.
* [`xet_runtime/`](./xet_runtime) (`xet-runtime`): Async runtime, configuration, logging, and utilities.
* [`hf_xet/`](./hf_xet): Python bindings (maturin/PyO3), produces the `hf-xet` PyPI package.
* [`git_xet/`](./git_xet): Git LFS compatible CLI tool (`git-xet`).
* [`wasm/`](./wasm): WebAssembly builds (`hf_xet_wasm`, `hf_xet_thin_wasm`).
* [`simulation/`](./simulation): Simulation and benchmarking infrastructure.
### Build, Test & Benchmark

@@ -3,8 +3,12 @@ name = "xet-client"
version.workspace = true
edition.workspace = true
license.workspace = true
homepage.workspace = true
repository.workspace = true
description = "HTTPS client for communicating with Hugging Face Xet storage servers"
description = "Client library for communicating with Hugging Face Xet storage servers. Use through the hf-xet crate."
readme = "README.md"
keywords = ["huggingface"]
categories = ["artificial-intelligence", "network-programming"]
[lib]
name = "xet_client"
@@ -42,7 +46,13 @@ url = { workspace = true }
urlencoding = { workspace = true }
[target.'cfg(target_family = "wasm")'.dependencies]
tokio = { workspace = true, features = ["sync", "macros", "io-util", "rt", "time"] }
tokio = { workspace = true, features = [
"sync",
"macros",
"io-util",
"rt",
"time",
] }
web-time = { workspace = true }
[target.'cfg(not(target_family = "wasm"))'.dependencies]
@@ -65,7 +75,14 @@ rustls-tls = ["reqwest/rustls"]
native-tls = ["reqwest/native-tls"]
native-tls-vendored = ["reqwest/native-tls-vendored"]
analysis = []
simulation = ["dep:axum", "dep:humantime", "dep:futures-util", "dep:human-bandwidth", "dep:tower-http", "xet-core-structures/simulation"]
simulation = [
"dep:axum",
"dep:humantime",
"dep:futures-util",
"dep:human-bandwidth",
"dep:tower-http",
"xet-core-structures/simulation",
]
[[bin]]
name = "local_cas_server"

xet_client/README.md (new file)

@@ -0,0 +1,17 @@
# xet-client
[![crates.io](https://img.shields.io/crates/v/xet-client.svg)](https://crates.io/crates/xet-client)
[![docs.rs](https://docs.rs/xet-client/badge.svg)](https://docs.rs/xet-client)
[![License](https://img.shields.io/crates/l/xet-client.svg)](https://github.com/huggingface/xet-core/blob/main/LICENSE)
Client for communicating with Hugging Face Xet storage servers.
## Overview
Upload and download data and metadata objects from the backend Hugging Face Xet storage servers. Features automatic concurrency adaptations, connection pooling, and retry resiliency. Intended to be used through the API in the hf-xet package.
This crate is part of [xet-core](https://github.com/huggingface/xet-core).
## License
Apache-2.0

@@ -1,3 +1,9 @@
//! HTTPS client for communicating with Hugging Face Xet storage servers.
//!
//! Includes the [`cas_client`] for uploading/downloading Xorb objects and
//! metadata shards, the [`hub_client`] for Hugging Face Hub API
//! interactions, and a local [`chunk_cache`] with LRU eviction.
#![cfg_attr(feature = "strict", deny(warnings))]
pub mod error;

@@ -3,8 +3,12 @@ name = "xet-core-structures"
version.workspace = true
edition.workspace = true
license.workspace = true
homepage.workspace = true
repository.workspace = true
description = "Core data structures including MerkleHash, metadata shards, and xorb objects"
description = "Core data structures including MerkleHash, metadata shards, and Xorb objects."
readme = "README.md"
keywords = ["huggingface"]
categories = ["artificial-intelligence", "data-structures"]
[lib]
name = "xet_core_structures"
@@ -48,13 +52,27 @@ tracing = { workspace = true }
[target.'cfg(not(target_family = "wasm"))'.dependencies]
bytemuck = { workspace = true }
tokio = { workspace = true, features = ["time", "rt", "macros", "sync", "test-util", "io-util", "rt-multi-thread"] }
tokio = { workspace = true, features = [
"time",
"rt",
"macros",
"sync",
"test-util",
"io-util",
"rt-multi-thread",
] }
tokio-util = { workspace = true, features = ["io"] }
uuid = { workspace = true, features = ["v4"] }
[target.'cfg(target_family = "wasm")'.dependencies]
getrandom = { workspace = true, features = ["wasm_js"] }
tokio = { workspace = true, features = ["sync", "macros", "io-util", "rt", "time"] }
tokio = { workspace = true, features = [
"sync",
"macros",
"io-util",
"rt",
"time",
] }
uuid = { workspace = true, features = ["v4", "js"] }
web-time = { workspace = true }

@@ -0,0 +1,23 @@
# xet-core-structures
[![crates.io](https://img.shields.io/crates/v/xet-core-structures.svg)](https://crates.io/crates/xet-core-structures)
[![docs.rs](https://docs.rs/xet-core-structures/badge.svg)](https://docs.rs/xet-core-structures)
[![License](https://img.shields.io/crates/l/xet-core-structures.svg)](https://github.com/huggingface/xet-core/blob/main/LICENSE)
Core data structures for the
[Hugging Face Xet](https://github.com/huggingface/xet-core) storage system,
including Merkle hashes, metadata shards, and Xorb objects.
## Overview
- **MerkleHash** — 256-bit content-addressed hash used throughout the system
- **Metadata shards** — Compact shard format mapping file ranges to Xorb chunks
- **Xorb objects** — Content-addressed storage objects with byte-grouping compression
- **Data structures** — Specialized hash maps and utilities for deduplication
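
A minimal sketch of what byte grouping can look like, assuming a stride of four: bytes that share a position within each 4-byte word are gathered together, so a general-purpose compressor sees more regular runs in arrays of floats or integers. The function names and the fixed stride are assumptions made for illustration; the actual Xorb format may group bytes differently.

```rust
/// Rearrange `data` so that all bytes at offset 0 of each 4-byte word come
/// first, then all bytes at offset 1, and so on (hypothetical helper).
fn group_bytes(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(data.len());
    for lane in 0..4 {
        // Take every fourth byte, starting at this lane's offset.
        out.extend(data.iter().skip(lane).step_by(4));
    }
    out
}

/// Inverse of `group_bytes`: scatter each lane back to its original positions.
fn ungroup_bytes(grouped: &[u8]) -> Vec<u8> {
    let n = grouped.len();
    let mut out = vec![0u8; n];
    let mut src = 0;
    for lane in 0..4 {
        let mut dst = lane;
        while dst < n {
            out[dst] = grouped[src];
            src += 1;
            dst += 4;
        }
    }
    out
}
```

The grouped buffer has the same length as the input, so the transform is lossless and can be applied before any byte-oriented compressor and reversed after decompression.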
This crate is part of [xet-core](https://github.com/huggingface/xet-core),
the Rust backend for [huggingface_hub](https://github.com/huggingface/huggingface_hub).
## License
Apache-2.0

@@ -1,3 +1,10 @@
//! Core data structures for the Hugging Face Xet storage system.
//!
//! Provides [`merklehash::MerkleHash`] (256-bit content-addressed hashes),
//! [`metadata_shard`] (compact shard format mapping file ranges to Xorb
//! chunks), and [`xorb_object`] (content-addressed storage objects with
//! byte-grouping compression).
#![cfg_attr(feature = "strict", deny(warnings))]
pub mod error;

@@ -3,8 +3,12 @@ name = "xet-data"
version.workspace = true
edition.workspace = true
license.workspace = true
homepage.workspace = true
repository.workspace = true
description = "Data processing pipeline for chunking, deduplication, and file reconstruction; used in the Hugging Face Xet client tools"
description = "Data processing pipeline for chunking, deduplication, and file reconstruction; used in the Hugging Face Xet client tools. Intended to be used through the API in the hf-xet package."
readme = "README.md"
keywords = ["huggingface"]
categories = ["artificial-intelligence", "data-structures", "filesystem"]
[lib]
name = "xet_data"
@@ -39,7 +43,13 @@ walkdir = { workspace = true }
pyo3 = { version = "0.26", features = ["abi3-py37"], optional = true }
[target.'cfg(target_family = "wasm")'.dependencies]
tokio = { workspace = true, features = ["sync", "macros", "io-util", "rt", "time"] }
tokio = { workspace = true, features = [
"sync",
"macros",
"io-util",
"rt",
"time",
] }
[target.'cfg(not(target_family = "wasm"))'.dependencies]
tokio = { workspace = true, features = ["rt-multi-thread", "rt", "time"] }

xet_data/README.md (new file)

@@ -0,0 +1,20 @@
# xet-data
[![crates.io](https://img.shields.io/crates/v/xet-data.svg)](https://crates.io/crates/xet-data)
[![docs.rs](https://docs.rs/xet-data/badge.svg)](https://docs.rs/xet-data)
[![License](https://img.shields.io/crates/l/xet-data.svg)](https://github.com/huggingface/xet-core/blob/main/LICENSE)
Data processing pipeline for chunking, deduplication, and file reconstruction. Intended to be used through the API in the hf-xet package.
## Overview
- **Content-defined chunking** — Gear-hash based chunking for deduplication
- **Deduplication** — Probe and register chunks against metadata shards
- **File reconstruction** — Reassemble files from deduplicated chunk references
- **Progress tracking** — Hooks for upload/download progress reporting
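
As a rough illustration of the content-defined chunking item above, the sketch below cuts a byte stream wherever a rolling gear hash matches a mask. It is not the xet-data implementation: the function names, the table generator, and the `min`, `max`, and `mask` parameters are all assumptions made for this example.

```rust
/// Build a deterministic 256-entry gear table using a splitmix64-style mixer.
/// (Hypothetical: a real implementation would ship a fixed table.)
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Return the exclusive end offset of every chunk in `data`.
/// A boundary is declared when the chunk is at least `min` bytes long and the
/// rolling hash has no bits set under `mask`, or when the chunk reaches `max`.
fn chunk_boundaries(data: &[u8], min: usize, max: usize, mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut boundaries = Vec::new();
    let mut start = 0usize;
    let mut hash = 0u64;
    for (i, &byte) in data.iter().enumerate() {
        // Gear hash update: shift out old influence, mix in the new byte.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i + 1 - start;
        if (len >= min && hash & mask == 0) || len >= max {
            boundaries.push(i + 1);
            start = i + 1;
            hash = 0;
        }
    }
    // Whatever remains after the last boundary forms the final chunk.
    if start < data.len() {
        boundaries.push(data.len());
    }
    boundaries
}
```

For example, `chunk_boundaries(&data, 64, 1024, 0xFF)` yields chunks between 64 and 1024 bytes (the last one may be shorter). Because each boundary depends only on nearby bytes, an edit early in a file tends to move only the adjacent boundaries, which is what lets later chunks deduplicate against previously stored ones.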
This crate is part of [xet-core](https://github.com/huggingface/xet-core).
## License
Apache-2.0

@@ -1,3 +1,10 @@
//! Data processing pipeline for chunking, deduplication, and file
//! reconstruction, used in the Hugging Face Xet storage tools.
//!
//! Provides content-defined chunking via gear hashing, deduplication
//! against metadata shards, and file reconstruction from deduplicated
//! chunk references.
#![cfg_attr(feature = "strict", deny(warnings))]
pub mod error;

@@ -3,10 +3,23 @@ name = "hf-xet"
version.workspace = true
edition.workspace = true
license.workspace = true
homepage.workspace = true
repository.workspace = true
description = "Client library and tooling for the Hugging Face Xet data storage system"
keywords = ["huggingface", "xet", "data", "ai"]
categories = ["api-bindings"]
description = "Client library and tooling for the Hugging Face Xet data storage system."
readme = "README.md"
keywords = [
"huggingface",
"datasets",
"large-files",
"deduplication",
"cloud-storage",
]
categories = [
"artificial-intelligence",
"asynchronous",
"data-structures",
"filesystem",
]
[lib]
name = "xet"
@@ -46,6 +59,11 @@ serde_json = { workspace = true }
serial_test = { workspace = true }
smol = { workspace = true }
tempfile = { workspace = true }
tokio = { workspace = true, features = ["rt-multi-thread", "rt", "time", "macros"] }
tokio = { workspace = true, features = [
"rt-multi-thread",
"rt",
"time",
"macros",
] }
tracing-subscriber = { workspace = true }
wiremock = { workspace = true }

xet_pkg/README.md (new file)

@@ -0,0 +1,34 @@
# hf-xet
[![crates.io](https://img.shields.io/crates/v/hf-xet.svg)](https://crates.io/crates/hf-xet)
[![docs.rs](https://docs.rs/hf-xet/badge.svg)](https://docs.rs/hf-xet)
[![License](https://img.shields.io/crates/l/hf-xet.svg)](https://github.com/huggingface/xet-core/blob/main/LICENSE)
Client library for the [Hugging Face Xet](https://github.com/huggingface/xet-core)
data storage system. Provides the high-level session API for uploading and
downloading files with chunk-based deduplication.
## Overview
- **XetSession** — Top-level session managing authentication, configuration,
and concurrent file transfers
- **Upload & download** — Stream files to/from Hugging Face Hub with automatic
chunking, deduplication, and local caching
## Crate Ecosystem
`hf-xet` ties together the lower-level xet-core crates:
| Crate | Role |
|-------|------|
| [`xet-runtime`](https://crates.io/crates/xet-runtime) | Async runtime, config, logging |
| [`xet-core-structures`](https://crates.io/crates/xet-core-structures) | Merkle hashes, shards, Xorb objects |
| [`xet-client`](https://crates.io/crates/xet-client) | HTTP client for CAS and Hub APIs |
| [`xet-data`](https://crates.io/crates/xet-data) | Chunking, dedup, file reconstruction |
This crate is part of [xet-core](https://github.com/huggingface/xet-core),
the Rust backend for [huggingface_hub](https://github.com/huggingface/huggingface_hub).
## License
Apache-2.0

@@ -1,3 +1,10 @@
//! Client library for the Hugging Face Xet data storage system.
//!
//! Provides the high-level [`xet_session::XetSession`] API for uploading
//! and downloading files with chunk-based deduplication, tying together
//! the lower-level [`xet_runtime`], [`xet_core_structures`],
//! [`xet_client`], and [`xet_data`] crates.
pub mod error;
pub use error::XetError;
#[cfg(feature = "python")]

@@ -3,8 +3,12 @@ name = "xet-runtime"
version.workspace = true
edition.workspace = true
license.workspace = true
homepage.workspace = true
repository.workspace = true
description = "Async runtime, configuration, logging, and utility infrastructure for the Hugging Face Xet client tools"
description = "Async runtime, configuration, logging, and utility infrastructure for the Hugging Face Xet client tools."
readme = "README.md"
keywords = ["huggingface"]
categories = ["artificial-intelligence", "asynchronous"]
[lib]
name = "xet_runtime"
@@ -48,11 +52,25 @@ winapi = { workspace = true }
whoami = { workspace = true }
[target.'cfg(target_family = "wasm")'.dependencies]
tokio = { workspace = true, features = ["sync", "macros", "io-util", "rt", "time"] }
tokio = { workspace = true, features = [
"sync",
"macros",
"io-util",
"rt",
"time",
] }
[target.'cfg(not(target_family = "wasm"))'.dependencies]
shellexpand = { workspace = true, features = ["path"] }
tokio = { workspace = true, features = ["time", "rt", "macros", "sync", "test-util", "io-util", "rt-multi-thread"] }
tokio = { workspace = true, features = [
"time",
"rt",
"macros",
"sync",
"test-util",
"io-util",
"rt-multi-thread",
] }
tokio-util = { workspace = true, features = ["io"] }
[target.'cfg(target_os = "macos")'.dependencies]

xet_runtime/README.md (new file)

@@ -0,0 +1,25 @@
# xet-runtime
[![crates.io](https://img.shields.io/crates/v/xet-runtime.svg)](https://crates.io/crates/xet-runtime)
[![docs.rs](https://docs.rs/xet-runtime/badge.svg)](https://docs.rs/xet-runtime)
[![License](https://img.shields.io/crates/l/xet-runtime.svg)](https://github.com/huggingface/xet-core/blob/main/LICENSE)
Async runtime, configuration storage, logging, and utility infrastructure for the
[Hugging Face Xet](https://github.com/huggingface/xet-core) storage tools. This is meant to be used through the API in the hf-xet package.
## Overview
`xet-runtime` provides the shared foundation used by all crates in the
xet-core ecosystem:
- **Async runtime** — Tokio-based runtime with configurable thread pools
- **Configuration** — Hierarchical configuration for Xet clients
- **Structured logging** — Tracing-based logging with file and console outputs
- **Error handling** — `RuntimeError` type for the runtime layer
- **Utilities** — File operations, sync primitives, and platform abstractions
This crate is part of [xet-core](https://github.com/huggingface/xet-core).
## License
Apache-2.0

@@ -1,3 +1,10 @@
//! Async runtime, configuration, logging, and utility infrastructure for
//! the Hugging Face Xet storage tools.
//!
//! This crate provides the shared foundation used by all crates in the
//! xet-core ecosystem: a Tokio-based async runtime, hierarchical
//! configuration, structured tracing-based logging, and common error types.
#![cfg_attr(feature = "strict", deny(warnings))]
pub mod error;