Mirror of https://github.com/huggingface/xet-core.git (synced 2026-06-04 13:30:29 +08:00)
## Summary

- Add `ShaGenerator::Skip` variant that skips SHA-256 computation entirely
- `ShaGenerator::finalize()` now returns `Option<Sha256>` (`None` when skipped)
- `SingleFileCleaner::new()` and `FileUploadSession::start_clean()` accept a `skip_sha256` boolean
- When skipped, no `FileMetadataExt` is included in the shard

## Context

Bucket uploads don't need SHA-256 in the shard metadata: the `sha_index` GSI is only used for LFS pointer resolution, which doesn't apply to buckets. Skipping SHA-256 for bucket uploads removes the main CPU bottleneck in the upload pipeline on non-SHA-NI instances.

## Alternative: dummy SHA-256

Instead of skipping entirely, the client could send a zeroed/dummy `FileMetadataExt`. The server would still store it, but queries would never match. This avoids the server-side schema change (xetcas PR) but pollutes the GSI with dummy entries.

Companion PRs:

- xetcas: huggingface-internal/xetcas#498 (make `FileIdItem.sha256` optional server-side)
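The shape of the change described in the Summary can be sketched roughly as follows. This is a simplified, std-only model and not the crate's actual code: the `Compute` variant, the `FakeDigest` type, and the `digest_of` helper are hypothetical names, and `DefaultHasher` is only a non-cryptographic stand-in for a real SHA-256 hasher so that the example compiles without external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical stand-in for a SHA-256 digest (NOT cryptographic).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FakeDigest(pub u64);

/// Simplified model of a generator that either hashes or skips.
pub enum ShaGenerator {
    /// Hash incrementally as data arrives (variant name is an assumption).
    Compute(DefaultHasher),
    /// Do no hashing work at all.
    Skip,
}

impl ShaGenerator {
    /// Pick the variant from a `skip_sha256` flag.
    pub fn new(skip_sha256: bool) -> Self {
        if skip_sha256 {
            ShaGenerator::Skip
        } else {
            ShaGenerator::Compute(DefaultHasher::new())
        }
    }

    /// Feed a chunk of data; a no-op when skipping.
    pub fn update(&mut self, data: &[u8]) {
        if let ShaGenerator::Compute(hasher) = self {
            hasher.write(data);
        }
    }

    /// Returns `None` when hashing was skipped, so callers can
    /// omit the metadata entry instead of storing a dummy value.
    pub fn finalize(self) -> Option<FakeDigest> {
        match self {
            ShaGenerator::Compute(hasher) => Some(FakeDigest(hasher.finish())),
            ShaGenerator::Skip => None,
        }
    }
}

/// Hypothetical helper: run a generator over a sequence of chunks.
pub fn digest_of(chunks: &[&[u8]], skip_sha256: bool) -> Option<FakeDigest> {
    let mut generator = ShaGenerator::new(skip_sha256);
    for chunk in chunks {
        generator.update(chunk);
    }
    generator.finalize()
}
```

Returning `Option` from `finalize()` pushes the "no SHA" case into the type system: a caller that builds shard metadata has to handle `None` explicitly, which matches the stated behavior of including no `FileMetadataExt` when the hash is skipped.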
data
A high-level data translation layer for Xet's content-addressable storage (CAS). This crate handles:
- Cleaning (uploading) regular files into deduplicated CAS objects (xorbs + shards) and producing lightweight pointers (XetFileInfo).
- Smudging (downloading) pointer metadata back into materialized files.
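To make the clean/smudge round trip concrete, here is a toy, in-memory model of a content-addressable store. It is only an illustration under stated assumptions: `ToyCas`, `ToyPointer`, and `toy_hash` are hypothetical names, deduplication happens at whole-file granularity rather than through chunking into xorbs and shards, and `DefaultHasher` stands in for the real content hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Toy pointer: identifies content by hash and size
/// (loosely modeled on the description of XetFileInfo).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyPointer {
    pub hash: u64,
    pub size: usize,
}

/// Toy content-addressable store keyed by content hash.
#[derive(Default)]
pub struct ToyCas {
    objects: HashMap<u64, Vec<u8>>,
}

/// Non-cryptographic stand-in for a real content hash.
fn toy_hash(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}

impl ToyCas {
    /// "Clean": store the content once and return a lightweight pointer.
    pub fn clean(&mut self, data: &[u8]) -> ToyPointer {
        let hash = toy_hash(data);
        // Identical content maps to the same key, so it is stored only once.
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        ToyPointer { hash, size: data.len() }
    }

    /// "Smudge": materialize the content a pointer refers to, if present.
    pub fn smudge(&self, pointer: &ToyPointer) -> Option<Vec<u8>> {
        self.objects.get(&pointer.hash).cloned()
    }

    /// Number of distinct objects currently stored.
    pub fn object_count(&self) -> usize {
        self.objects.len()
    }
}
```

Cleaning the same bytes twice yields equal pointers and a single stored object, and smudging either pointer returns the original bytes.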
Core APIs
- High-level async functions in data::data_client:
  - upload_async(file_paths, endpoint, token_info, token_refresher, progress_updater) -> Vec<XetFileInfo>
  - download_async(files: Vec<(XetFileInfo, String)>, endpoint, token_info, token_refresher, progress_updaters) -> Vec<String>
- Sessions and primitives (re-exported at the crate root):
  - FileUploadSession – multi-file, deduplicated upload session. Handles chunking, xorb/shard production, and finalization.
  - FileDownloader – smudges files from CAS given a MerkleHash/XetFileInfo.
  - XetFileInfo – compact pointer describing a file by its hash and size.
Both high-level functions create sensible defaults (cache paths, progress aggregation, endpoint separation) via data_client::default_config and enforce bounded concurrency.
How hf_xet uses this crate
The hf_xet Python extension exposes thin wrappers around these async functions and types. In hf_xet/src/lib.rs:
- upload_files(...) calls data::data_client::upload_async.
- upload_bytes(...) calls data::data_client::upload_bytes_async.
- download_files(...) calls data::data_client::download_async.