xet-core/data
Adrien a48f1f80e4 feat: add skip_sha256 option to SingleFileCleaner (#679)
## Summary

- Add `ShaGenerator::Skip` variant that skips SHA-256 computation entirely
- `ShaGenerator::finalize()` now returns `Option<Sha256>` (`None` when skipped)
- `SingleFileCleaner::new()` and `FileUploadSession::start_clean()` accept a `skip_sha256` boolean
- When skipped, no `FileMetadataExt` is included in the shard
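
The shape of the change summarized above can be sketched as follows. This is a minimal illustration, not the code from this PR: `DigestGenerator` is a hypothetical stand-in for `ShaGenerator`, and the standard library's `DefaultHasher` (a non-cryptographic 64-bit hash) stands in for SHA-256 so the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical stand-in for `ShaGenerator`. The real type computes SHA-256;
/// `DefaultHasher` is used here only to keep the sketch dependency-free.
enum DigestGenerator {
    /// Accumulates a digest over all bytes fed to `update`.
    Compute(DefaultHasher),
    /// Does no hashing work at all.
    Skip,
}

impl DigestGenerator {
    /// Mirrors the idea of a `skip_sha256` boolean selecting the variant.
    fn new(skip: bool) -> Self {
        if skip {
            DigestGenerator::Skip
        } else {
            DigestGenerator::Compute(DefaultHasher::new())
        }
    }

    /// Feeds data into the digest; a no-op in the `Skip` variant.
    fn update(&mut self, data: &[u8]) {
        if let DigestGenerator::Compute(hasher) = self {
            hasher.write(data);
        }
    }

    /// Returns `Some(digest)` when hashing was enabled and `None` when it was
    /// skipped, so the caller can omit the metadata entry entirely.
    fn finalize(self) -> Option<u64> {
        match self {
            DigestGenerator::Compute(hasher) => Some(hasher.finish()),
            DigestGenerator::Skip => None,
        }
    }
}
```

Returning an `Option` from `finalize` pushes the "no digest" case into the type system: a caller that builds shard metadata has to handle `None` explicitly, which fits the behavior described above where no `FileMetadataExt` is written when hashing is skipped.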

## Context

Bucket uploads don't need SHA-256 in the shard metadata — the
`sha_index` GSI is only used for LFS pointer resolution, which doesn't
apply to buckets. Skipping SHA-256 for bucket uploads removes the main
CPU bottleneck in the upload pipeline on non-SHA-NI instances.

## Alternative: dummy SHA-256

Instead of skipping entirely, the client could send a zeroed/dummy
`FileMetadataExt`. The server would still store it but queries would
never match. This avoids the server-side schema change (xetcas PR) but
pollutes the GSI with dummy entries.

Companion PRs:
- xetcas: huggingface-internal/xetcas#498 (make `FileIdItem.sha256` optional server-side)
2026-03-10 17:36:09 +01:00

data

A high-level data translation layer for Xet's content-addressable storage (CAS). This crate handles:

  • Cleaning (uploading) regular files into deduplicated CAS objects (xorbs + shards) and producing lightweight pointers (XetFileInfo).
  • Smudging (downloading) pointer metadata back into materialized files.
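
The clean/smudge round trip described above can be illustrated with a toy in-memory store. Everything in this sketch is an assumption made for illustration: `ToyCas`, `Pointer`, and `hash_bytes` are hypothetical names, fixed 4-byte chunks replace whatever chunking the crate actually uses, and `DefaultHasher` replaces the real content hash. It shows only the general idea of deduplicated content-addressable storage with a small pointer.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Hypothetical lightweight pointer: a content hash plus the file size.
#[derive(Debug, Clone, PartialEq)]
struct Pointer {
    hash: u64,
    size: usize,
}

/// Toy in-memory content-addressable store (illustrative only).
#[derive(Default)]
struct ToyCas {
    /// Chunk hash -> chunk bytes; identical chunks are stored once.
    chunks: HashMap<u64, Vec<u8>>,
    /// File hash -> ordered list of chunk hashes needed to rebuild the file.
    files: HashMap<u64, Vec<u64>>,
}

/// Non-cryptographic stand-in for the real content hash.
fn hash_bytes(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}

impl ToyCas {
    /// "Clean": split data into chunks, store each unique chunk once,
    /// and return a pointer describing the whole file.
    fn clean(&mut self, data: &[u8]) -> Pointer {
        let mut chunk_hashes = Vec::new();
        for chunk in data.chunks(4) {
            let chunk_hash = hash_bytes(chunk);
            self.chunks.entry(chunk_hash).or_insert_with(|| chunk.to_vec());
            chunk_hashes.push(chunk_hash);
        }
        let file_hash = hash_bytes(data);
        self.files.insert(file_hash, chunk_hashes);
        Pointer { hash: file_hash, size: data.len() }
    }

    /// "Smudge": rebuild the file bytes from a pointer, or return `None`
    /// if the file or one of its chunks is unknown to the store.
    fn smudge(&self, pointer: &Pointer) -> Option<Vec<u8>> {
        let chunk_hashes = self.files.get(&pointer.hash)?;
        let mut out = Vec::with_capacity(pointer.size);
        for chunk_hash in chunk_hashes {
            out.extend_from_slice(self.chunks.get(chunk_hash)?);
        }
        Some(out)
    }
}
```

Cleaning the 12-byte input `abcdabcdabcd` in this toy store keeps a single 4-byte chunk, since all three chunks are identical, and smudging the returned pointer reproduces the original bytes.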

Core APIs

  • High-level async functions in data::data_client:

    • upload_async(file_paths, endpoint, token_info, token_refresher, progress_updater) -> Vec<XetFileInfo>
    • download_async(files: Vec<(XetFileInfo, String)>, endpoint, token_info, token_refresher, progress_updaters) -> Vec<String>
  • Sessions and primitives (re-exported at the crate root):

    • FileUploadSession: multi-file, deduplicated upload session. Handles chunking, xorb/shard production, and finalization.
    • FileDownloader: smudges files from CAS given a MerkleHash/XetFileInfo.
    • XetFileInfo: compact pointer describing a file by its hash and size.

Both high-level functions create sensible defaults (cache paths, progress aggregation, endpoint separation) via data_client::default_config and enforce bounded concurrency.

How hf_xet uses this crate

The hf_xet Python extension exposes thin wrappers around these async functions and types. In hf_xet/src/lib.rs:

  • upload_files(...) calls data::data_client::upload_async.
  • upload_bytes(...) calls data::data_client::upload_bytes_async.
  • download_files(...) calls data::data_client::download_async.