xet-core/data
Rajat Arya 8ac31a1eac Add hash_files() function to hf_xet Python package (#614)
## Summary
Adds a `hash_files(file_paths: List[str])` function to the hf_xet Python
package that computes xet hashes for files without uploading them. This
enables fast, local-only file hashing without requiring authentication
or server connection.

## Key Features
- **No authentication** or server connection required
- **Pure local computation** - no deduplication queries or network I/O
- **Results in same order** as input file paths
- **API consistency** - returns `PyXetUploadInfo` like `upload_files`

## Implementation
- Added `hash_single_file()` in data/src/data_client.rs for single-file
hashing
- Added `hash_files_async()` for parallel processing of multiple files
- Added Python binding `hash_files()` in hf_xet/src/lib.rs
- Reuses existing `Chunker` and `file_hash` infrastructure
- Uses `CONCURRENT_FILE_INGESTION_LIMITER` for controlled concurrency
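
The pattern described above (a single-file hasher, a parallel driver, a concurrency limit, and results in input order) can be sketched in Python. This is only an illustration under stated assumptions, not the Rust implementation: `HashInfo`, `max_concurrent`, and the use of SHA-256 are hypothetical stand-ins, and the real xet hash is computed differently.

```python
import asyncio
import hashlib
from dataclasses import dataclass


@dataclass
class HashInfo:
    """Hypothetical stand-in for the returned info object (hash + size)."""
    hash: str
    file_size: int


def hash_single_file(path: str, read_size: int = 64 * 1024) -> HashInfo:
    """Hash one file in fixed-size reads. SHA-256 is a placeholder, NOT the xet hash."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while block := f.read(read_size):
            digest.update(block)
            size += len(block)
    return HashInfo(digest.hexdigest(), size)


async def hash_files_async(paths: list[str], max_concurrent: int = 4) -> list[HashInfo]:
    """Hash files in parallel, at most `max_concurrent` at a time, in input order."""
    # The semaphore plays the role of a concurrency limiter (an assumption about
    # how such a limiter behaves, not a port of the Rust one).
    limiter = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> HashInfo:
        async with limiter:
            # Run the blocking file read off the event loop.
            return await asyncio.to_thread(hash_single_file, path)

    # asyncio.gather returns results in the order the awaitables were passed in,
    # regardless of which file finishes first.
    return await asyncio.gather(*(worker(p) for p in paths))
```

Calling `asyncio.run(hash_files_async(paths))` yields one `HashInfo` per path, so `zip(paths, results)` pairs each file with its own result even when smaller files finish earlier.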

## Usage Example
```python
import hf_xet

# Compute hashes without uploading
file_paths = ["/path/to/file1.txt", "/path/to/file2.txt"]
results = hf_xet.hash_files(file_paths)

for path, info in zip(file_paths, results):
    print(f"File: {path}, Hash: {info.hash}, Size: {info.file_size}")
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-27 12:35:56 -08:00
data

A high-level data translation layer for Xet's content-addressable storage (CAS). This crate handles:

  • Cleaning (uploading) regular files into deduplicated CAS objects (xorbs + shards) and producing lightweight pointers (XetFileInfo).
  • Smudging (downloading) pointer metadata back into materialized files.
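
To make the clean/smudge round trip concrete, here is a toy in-memory model. It is a sketch under loud assumptions: `ToyCas` and `ToyFileInfo` are hypothetical names, chunks are fixed-size rather than produced by the crate's chunker, SHA-256 stands in for the real hashing, and there are no xorbs, shards, or network calls.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFileInfo:
    """Hypothetical pointer: a file hash plus its size, loosely mirroring XetFileInfo."""
    hash: str
    file_size: int


class ToyCas:
    """In-memory content-addressable store; every name here is illustrative."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}      # chunk hash -> chunk bytes (stored once)
        self.files: dict[str, list[str]] = {}   # file hash -> ordered chunk hashes

    def clean(self, data: bytes) -> ToyFileInfo:
        """Split data into chunks, store unseen chunks, and return a pointer."""
        chunk_hashes = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # identical chunks share one entry
            chunk_hashes.append(h)
        # Assumption: the file hash is derived from the ordered chunk hashes.
        file_hash = hashlib.sha256("".join(chunk_hashes).encode()).hexdigest()
        self.files[file_hash] = chunk_hashes
        return ToyFileInfo(file_hash, len(data))

    def smudge(self, info: ToyFileInfo) -> bytes:
        """Rebuild the original bytes from a pointer."""
        data = b"".join(self.chunks[h] for h in self.files[info.hash])
        assert len(data) == info.file_size
        return data


cas = ToyCas(chunk_size=4)
a = cas.clean(b"AAAABBBBCCCC")
b = cas.clean(b"AAAABBBBDDDD")          # shares its first two chunks with `a`
assert cas.smudge(a) == b"AAAABBBBCCCC"
assert cas.smudge(b) == b"AAAABBBBDDDD"
print(len(cas.chunks))                  # 4 unique chunks stored for 6 chunks cleaned
```

The two files share two of their three chunks, so only four chunks are stored. That sharing is the point of deduplication, and the pointer stays small no matter how large the file is.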

Core APIs

  • High-level async functions in data::data_client:

    • upload_async(file_paths, endpoint, token_info, token_refresher, progress_updater) -> Vec<XetFileInfo>
    • download_async(files: Vec<(XetFileInfo, String)>, endpoint, token_info, token_refresher, progress_updaters) -> Vec<String>
  • Sessions and primitives (re-exported at the crate root):

    • FileUploadSession: multi-file, deduplicated upload session. Handles chunking, xorb/shard production, and finalization.
    • FileDownloader: smudges files from CAS given a MerkleHash/XetFileInfo.
    • XetFileInfo: compact pointer describing a file by its hash and size.

Both high-level functions create sensible defaults (cache paths, progress aggregation, endpoint separation) via data_client::default_config and enforce bounded concurrency.

How hf_xet uses this crate

The hf_xet Python extension exposes thin wrappers around these async functions and types. In hf_xet/src/lib.rs:

  • upload_files(...) calls data::data_client::upload_async.
  • upload_bytes(...) calls data::data_client::upload_bytes_async.
  • download_files(...) calls data::data_client::download_async.