xet-core

mirror of https://github.com/huggingface/xet-core.git synced 2026-06-04 13:30:29 +08:00

Author	SHA1	Message	Date
Di Xiao	fb83178d28	Fix session id regression (#738) The session id was changed from `ulid` to `UniqueID` (a self-incrementing in-memory u64) in a previous PR, but that is not correct. The session id is used in CAS server logs and traces and in CDN logs to identify a related group of activity (for debugging and similar purposes), so it needs to be globally unique (hence `ulid`) rather than locally unique.	2026-03-19 13:31:40 -07:00
Di Xiao	e25ee85c14	Fix a compilation failure (#740) Fix a compilation failure in a new function introduced by https://github.com/huggingface/xet-core/pull/726, caught by the newly introduced CI step `cargo bench --no-run`.	2026-03-19 12:03:33 -07:00
Di Xiao	9f68537319	Resolve all Dependabot alerts (#733) This PR should resolve all Dependabot alerts by upgrading deps and replacing some deprecated crates with their suggested alternatives, e.g. `tempdir -> tempfile`. Supersedes PR #721. Fixes issue #722	2026-03-19 09:33:56 -07:00
Di Xiao	4d24627180	Fix bench code compilation after repo restructuring (#728) The last repo restructuring didn't update several pieces of bench code that are not compiled by default as part of "cargo build". This PR fixes those compilation errors and warnings, and adds "cargo bench --no-run" to CI, which checks compilation but doesn't actually run the benchmarks.	2026-03-19 09:28:57 -07:00
Hoyt Koepke	6fb97241f3	Integration test suite on top of xet session interface. (#727 ) This PR adds a full integration test suite on top of the xet session interface that mimics the integration tests in xet_data/tests/. This one additionally tests alternate asynchronous runtimes to ensure that the bridge to the internal tokio runtime works correctly as well.	2026-03-18 18:08:06 -07:00
Hoyt Koepke	506fc28291	Simplify progress tracking + Unify Task ID tracking + Legacy Interface (#726) Currently, progress tracking is split between callback-driven and snapshot-driven paths, making session and task wiring across xet_data, xet_pkg, hf_xet, and git_xet harder to keep consistent. This PR moves upload/download progress to a polling snapshot model backed by atomics. It also switches task identifiers to a UniqueID shared with the progress tracking throughout the session APIs. This PR also updates the rate estimation to use the lighter-weight exponentially weighted moving average model, so that it can be done at a low level. To preserve compatibility for existing callback consumers, callback-oriented upload/download progress tracking APIs are moved under xet_pkg::legacy and bridged from polling snapshots via callback-based updaters. hf_xet and git_xet are updated to use that legacy bridge layer, so current integrations keep working until everything is fully switched over to the XetSession method.	2026-03-18 18:07:43 -07:00
Rajat Arya	c0f7980616	feat: smoke tests using hf CLI with bucket and large-file coverage (#710 ) ## Summary - Rewrites smoke tests to drive everything through the `hf` CLI rather than the huggingface_hub Python API, covering the actual user-facing surface area of hf-xet - Moves smoke tests and diagnostic scripts into a `scripts/` directory for cleaner repo layout - Adds storage bucket test suite exercising the full bucket lifecycle - Adds 50 MB and 100 MB files to repo upload/download tests ## Test matrix (14 tests, all passing) Repository tests (`hf upload` / `hf download`) - Upload single file, upload folder - Download individual files + SHA-256 verify - Download entire repo + SHA-256 verify - Overwrite file and verify new content served - Delete file and confirm absent Bucket tests (`hf buckets`) - `cp` upload / download + verify - `sync` upload / download + verify - Recursive list confirms expected paths - Overwrite via `cp` + verify - `sync --delete` removes extraneous remote files - `rm` + confirm absent from listing ## Test plan - [x] Run `HF_TOKEN=... ./scripts/smoke_tests/run.sh` and confirm all 14 tests pass - [x] Run `./scripts/smoke_tests/run.sh --skip-buckets` for repo-only path - [x] Run with `--hf-xet-version <version>` to confirm PyPI cache bypass works 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 19:07:05 -07:00
Hoyt Koepke	69c714c01d	Update config groups to handle more of the data management values. (#702 ) This PR moves some config values that were part of the data configuration into XetConfig, specifically the compression_policy, staging_subdir, session_dir_name, and global_dedup_query_enabled. This also consolidates the remaining values into a single struct with endpoint and authentication information.	2026-03-16 16:06:46 -07:00
Hoyt Koepke	ed182125fa	Add optional Sha256 hash propagation through XetFileInfo object. (#718) Currently, the SHA-256 hash of uploaded file content is computed internally during the upload pipeline but not surfaced to callers. Downstream consumers (e.g. OpenDAL's Hugging Face backend) need the SHA-256 to commit files to the Hub API. This PR adds an optional sha256 field to XetFileInfo, the session-layer FileMetadata, and the Python-exposed PyXetUploadInfo. The field is populated from the already-computed hash when Sha256Policy::Compute or Sha256Policy::Provided is used, and left None for downloads and when Sha256Policy::Skip is used. Serde attributes (default, skip_serializing_if) ensure backward-compatible serialisation: existing serialised data without the field deserialises cleanly. Needed for the functionality in https://github.com/huggingface/xet-core/pull/642.	2026-03-16 16:05:49 -07:00
Hoyt Koepke	40ba7b8911	Fix race condition in task status transitions on abort. (#720 ) Currently, the Queued → Running status transition in spawned upload tasks is unconditional — it overwrites whatever the current status is, including Cancelled set by a concurrent abort() call. This creates a race window: if abort() sets Cancelled between the semaphore acquisition and the status write, the task overwrites it with Running, then the completion guard (if matches!(*s, TaskStatus::Running)) passes and sets Completed. The result is a task that was aborted but reports Completed. This PR makes the Queued → Running transition conditional, matching the already-guarded Running → Completed/Failed transition. If the status is no longer Queued when the task starts, it bails early with SessionError::Aborted. This closes the race window — all three status transitions are now properly guarded against concurrent abort(). This was observed as a flaky test failure on Windows CI (test_abort_while_state_lock_held_skips_state_update_but_drains_tasks).	2026-03-16 14:17:57 -07:00
Hoyt Koepke	9caf7fcc44	V2 reconstruction with client-side optional single range splitting (#703 ) This PR introduces V2 multirange URL fetching for xorbs, but optionally splits the multirange requests into multiple single-range requests that can be executed in parallel. This allows the reconstruction process to generate full multirange presigned URLs, but the client effectively performs the retrieval stage as a sequence of parallel single-range queries. The config variable `client.enable_multirange_fetching` controls this behavior; by default it is set to false due to the current observed slowness of fetching multiranged URLs. --------- Co-authored-by: Adrien <adrien@huggingface.co>	2026-03-16 14:10:50 -07:00
Hoyt Koepke	79df99ad01	Unify sync and async download/upload groups in session interface. (#719 ) Currently, `UploadCommitSync` and `DownloadGroupSync` are thin wrappers around `UploadCommit` and `DownloadGroup` that delegate every method through `external_run_async_task`. This means two types, two sets of doc comments, and two test suites covering the same underlying behavior. This PR removes the separate sync types and adds `_blocking` suffixed methods directly on `UploadCommit` and `DownloadGroup`. The session factory methods `new_upload_commit_blocking()` and `new_download_group_blocking()` now return the same types as their async counterparts, and the entire `xet_session::sync` module is deleted (~680 lines removed). This also fixes a minor bug: `UploadCommitSync::upload_from_path` did not call `std::path::absolute()` on the file path before dispatching, unlike the async version. The new `upload_from_path_blocking` includes the `std::path::absolute()` call, matching the async version's behavior.	2026-03-16 13:33:46 -07:00
Hoyt Koepke	71f8570a0e	Optimize config struct for direct access in python (#706 ) This PR adds in a feature flag, "python" to the xet_runtime package such that when compiled, the XetConfig struct is built to have python getters and setters. This integrates the handling of the config struct directly into the XetConfig struct and the macros used to register the config values, making the handling of values in the python bindings seamless.	2026-03-16 12:23:43 -07:00
Di Xiao	b7cd43c8cb	Add Sha256Policy to XetSession APIs (#701 ) Stacking on top of https://github.com/huggingface/xet-core/pull/694, this updates both the async and sync APIs: update `XetSession::UploadCommit` and `XetSession::UploadCommitSync` pub APIs to explicitly take a `sha256: Sha256Policy` to control whether to compute and embed a sha256 for that file in the upload info. Resolves XET-898 --------- Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>	2026-03-16 11:03:23 -07:00
Adrien	820f2657c5	fix: bound file reconstruction range using file_size to prevent 416 errors (#716 )	2026-03-16 07:07:03 +01:00
Brian Ronan	6232b42591	Xorb download URL debug logs (#714 ) It's a bit annoying to try to ensure our CDN routing is correct. Logging the URL domain for the first fetch term download to the debug logs. Please don't hesitate to recommend alternative approaches.	2026-03-13 16:05:30 -07:00
Di Xiao	e701aeddac	Support XetSession in async context (#694 ) `XetSession` always created its own tokio runtime via `XetRuntime::new_with_config`, and calling `external_run_async_task` panics when already inside a tokio context. This blocked embedding the session in async Rust frameworks. Core strategy: - `RuntimeMode` enum — `Owned` (session created its own thread pool via `XetSessionBuilder::build` or `XetSessionBuilder::build_async` when outside tokio context. Both `_blocking` and async methods are supported. Async methods use an internal `bridge_to_owned` bridge that routes futures onto the owned thread pool, so they work from any executor (tokio, smol, async-std)) vs `External` (session wraps a caller-supplied tokio handle via `XetSessionBuilder::with_tokio_handle` or `XetSessionBuilder::build_async` when inside qualified tokio context. Only async methods may be called; `_blocking` methods return `SessionError::WrongRuntimeMode`. No second thread pool is created). - `XetRuntime::bridge_to_owned` — a new bridge that routes a future onto the owned tokio thread pool from any executor (smol, async-std, futures::executor, non-qualified tokio runtime) by delivering the result via a `tokio::sync::oneshot` channel that can be polled by any async executor. - Async public API — `UploadCommit` and `DownloadGroup` methods (`upload_from_path`, `upload_bytes`, `upload_file`, `commit`, `finish`) are now async fn. Factory methods `XetSession::new_upload_commit` and `new_download_group` are async. 
Example: ``` let session = XetSessionBuilder::new().build_async().await?; // Upload let commit = session.new_upload_commit().await?; let handle = commit.upload_from_path("file.bin".into()).await?; let results = commit.commit().await?; // Download let group = session.new_download_group().await?; let info = XetFileInfo { hash: ..., file_size: ..., }; let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?; let finish_results = group.finish().await?; ``` - Sync wrappers — New `UploadCommitSync` / `DownloadGroupSync` in `xet_session/sync/` expose a fully blocking API for sync Rust and Python (PyO3) callers. Returned by `new_upload_commit_blocking()` and `new_download_group_blocking()`. Example: ``` let session = XetSessionBuilder::new().build()?; // Upload let commit = session.new_upload_commit_blocking()?; let handle = commit.upload_from_path("file.bin".into())?; let results = commit.commit()?; let m = results.values().next().unwrap().as_ref().as_ref().unwrap(); // Download let group = session.new_download_group_blocking()?; let info = XetFileInfo { hash: ..., file_size: ..., }; let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?; let finish_results = group.finish()?; ``` Additional fixes: `download_file_to_path` and `upload_from_path` now canonicalize paths with `std::path::absolute` before enqueuing; task status is only overwritten when still `Running`, preventing a race with concurrent abort(). Fix XET-891 --------- Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>	2026-03-13 14:57:20 -07:00
Hoyt Koepke	3390bdc716	Adjust RTT prediction determining concurrency by transmission size. (#708 ) Currently, the condition for increasing connection concurrency is gated on the model predicting that a 64MB transmission will complete within 90 seconds. However, when the transmissions are primarily composed of small packets, this can drastically overestimate the round trip, artificially suppressing the connection concurrency. This PR fixes this issue by also modeling the average predicted packet size, using the 95% quantile of that (bounded by two config variables) to predict the round trip time when considering a concurrency increase.	2026-03-13 10:47:45 -07:00
Adrien	bcce76be63	chore: version bump to 1.4.2 (#712 ) ## Summary - Bump hf_xet version from 1.4.1 to 1.4.2 in Cargo.toml and Cargo.lock - Follows up on 1.4.1 release where the version bump PR was merged after the release artifacts were built v1.4.2	2026-03-13 07:46:46 +01:00
Rajat Arya	2589bf05bc	version bump to 1.4.1 (#707 )	2026-03-12 17:35:33 -07:00
Adrien	0fb930c8d0	feat: expose skip_sha256 parameter in Python upload API (#705 ) ## Summary Add `skip_sha256` and `sha256s` parameters to `upload_bytes()` Python binding for per-file SHA-256 policies: - `skip_sha256: bool = False` - Skip SHA-256 computation entirely (sets `Sha256Policy::Skip`) - `sha256s: Optional[List[str]] = None` - Provide pre-computed SHA-256 hashes (companion to existing parameter on `upload_files()`) - These parameters are mutually exclusive ## Changes Python binding changes: - Add `skip_sha256` + `sha256s` params to `upload_bytes()` / `upload_files()` - All policy conversion happens at Python boundary Internal refactoring: - Add `Clone`/`Copy` derives + `from_skip()`/`from_hex()` helpers to `Sha256Policy` - Update `upload_bytes_async`, `upload_async`, `clean_file` to use `Vec<Sha256Policy>` - Update all internal callers across `git_xet`, `xet_pkg`, migration tool, tests ## Motivation `huggingface_hub` already knows whether SHA-256 is required. This change enables skipping expensive computation when unnecessary, or passing pre-computed hashes for bulk operations. Companion to #678. --------- Co-authored-by: Wauplin <lucainp@gmail.com> v1.4.1	2026-03-12 18:17:12 +01:00
Di Xiao	cacd713218	Rework the interface for session task to get result from registered upload (#690 ) This PR updates the interface for retrieving per-task results after UploadCommit::commit() or DownloadGroup::finish(). The problem with the previous interface is that commit() and finish() return a vector of FileMetadata or DownloadResult, making it difficult for users to associate each result with a specific task. The new interface uses `task_id` as a strong binding bridge: ## Upload per-task result access patterns After commit() completes, there are two equivalent ways to retrieve a per-task FileMetadata result: 1. Lookup in the global result map: ``` let commit = session.new_upload_commit()?; let handle = commit.upload_from_path(src)?; let results = commit.commit()?; let result = results.get(&handle.task_id) ``` 2. Direct access from the handle: ``` let commit = session.new_upload_commit()?; let handle = commit.upload_from_path(src)?; commit.commit()?; // handle.result() is populated by commit() via the shared Arc. let result = handle.result() ``` ## Download per-task result access patterns The pattern is similar to the above. ## Why not put results in a vector in the same order as tasks are registered to the commit instance? After a commit instance is created, it can be cloned (since it is itself an Arc wrapping an internal struct) and sent to different threads. When multiple threads are registering tasks, there is no static registration order that a program can observe upfront.	2026-03-11 16:21:27 -07:00
Hoyt Koepke	6061debc75	Record API changes in api_changes/updates_<date>_<description>.md (#689 ) This PR creates a folder, api_changes, in which AI agents can record updates to the API surface that could affect downstream PRs and dependencies. This can be scanned by AI agents to reliably perform merges or to propagate changes. See api_changes/README.md for a description of how this should work.	2026-03-11 12:31:48 -07:00
Hoyt Koepke	45d38a13a9	Code reorganization towards release of xet cargo package (#693 ) This PR is a massive rearrangement of the code base into 5 packages intended for release on cargo. The directories and corresponding packages are: 1. xet_runtime/ — compiles into the xet-runtime package. Contains the runtime, config, and logging management. 2. xet_core_structures/ — compiles into the xet-core-structures package. Contains core data structures for hashing, shards, and xorbs as well as internal data structures that depend on these. 3. xet_client/ — compiles into the xet-client package, contains client code for remotely connecting to the Hugging Face servers. 4. xet_data/ — compiles into the xet-data package, contains the data processing pipeline: chunking/deduplication, file reconstruction, clean/smudge operations, and progress tracking. 5. xet_pkg/ — compiles into the hf-xet package, provides the top-level session-based API for file upload and download with user-facing error categorization. This is the primary package downstream dependencies would use. This also contains a single summary error type, XetError, that translates cleanly into python error types. In addition, the other tools are: - git_xet/ — the git_xet CLI binary crate (location preserved). - hf_xet/ -- the hf_xet python package (location preserved). - simulation/ — the simulation crate for upload scenario benchmarking. - wasm/ -- the wasm objects. The full description — and information for an AI agent to use to update downstream dependencies — is at api_changes/update_260309_package_restructure.md. Summary of moves: - xet_runtime: became xet_runtime::core inside xet_runtime/. - utils: became xet_runtime::utils inside xet_runtime/. - xet_config: became xet_runtime::config inside xet_runtime/. - xet_logging: became xet_runtime::logging inside xet_runtime/. - error_printer: became xet_runtime::error_printer inside xet_runtime/. - file_utils: became xet_runtime::file_utils inside xet_runtime/. 
- merklehash: became xet_core_structures::merklehash inside xet_core_structures/. - mdb_shard: became xet_core_structures::metadata_shard inside xet_core_structures/. - xorb_object: became xet_core_structures::xorb_object inside xet_core_structures/. - cas_client: became xet_client::cas_client inside xet_client/. - hub_client: became xet_client::hub_client inside xet_client/. - cas_types: became xet_client::cas_types inside xet_client/. - chunk_cache: became xet_client::chunk_cache inside xet_client/. - data: became xet_data::processing inside xet_data/. - deduplication: became xet_data::deduplication inside xet_data/. - file_reconstruction: became xet_data::file_reconstruction inside xet_data/. - progress_tracking: became xet_data::progress_tracking inside xet_data/. - xet_session: became xet::xet_session inside xet_pkg/. - Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level into wasm/; internal imports updated, public APIs unchanged.	2026-03-11 12:02:38 -07:00
Rajat Arya	02da1d233b	version bump to 1.4.0 (#699 ) v1.4.0	2026-03-11 10:48:23 -07:00
Rajat Arya	83a28271ea	fix: no timeout for shard uploads (XET-885) (#685 ) Fixes [XET-885](https://linear.app/xet/issue/XET-885/investigate-unsloth-upload-failure-shard-upload-timeout-on-cas) ## Summary Shard uploads to CAS can take a long time due to server-side processing (DynamoDB writes scale with file entry count). The default `read_timeout(120s)` on the reqwest client kills these uploads. Key insight: reqwest's per-request `RequestBuilder::timeout()` does NOT override the client-level `read_timeout()` — they are independent mechanisms polled as separate futures. So the original approach of using per-request timeouts was ineffective. Fix: Create a dedicated `shard_upload_http_client` on `RemoteClient` with no `read_timeout`, built once at construction time and reused for all shard uploads. All other settings (connect timeout, pool config, auth middleware) are identical to the standard client. ## Changes ### `cas_client/src/http_client.rs` - Added `reqwest_client_no_read_timeout()` — creates a reqwest client with no `read_timeout` - Added `build_auth_http_client_no_read_timeout()` — public API wrapping it with middleware - 4 unit tests for the new builder ### `cas_client/src/remote_client.rs` - Added `shard_upload_http_client` field to `RemoteClient` (cfg'd out on wasm) - `upload_shard()` uses the pre-built no-timeout client instead of building one per request ### `cas_client/tests/test_shard_upload_timeout.rs` - Updated: slow server test now asserts success (shard uploads should wait as long as needed) ### `xet_config/src/groups/client.rs` - Removed `shard_read_timeout` config field (no longer needed) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-11 09:05:40 -07:00
Adrien	9ba5fb3e5b	fix: prevent download stall on large file reconstruction (#698 ) ## Summary Fixes download stalls/deadlocks on large file reconstruction (reported on 48.5 GB GGUF files). The root cause is a circular dependency: the main reconstruction loop holds a buffer semaphore permit while blocking on CAS connection permit acquisition, and xorb write locks held during HTTP downloads cause CAS permit starvation. ### Changes 1. Single-flight xorb downloads via `OnceCell` (`xorb_block.rs`): replaces `RwLock<Option<...>>` with `tokio::sync::OnceCell`. Only one task per xorb block acquires a CAS permit and downloads the data; concurrent callers wait on the same result without acquiring permits or duplicating work. This eliminates duplicate downloads, prevents double-counted transfer progress, and avoids a failing duplicate from killing the reconstruction. 2. Decouple CAS permit from buffer permit (`file_term.rs`): the main loop no longer blocks on CAS permits while holding a buffer permit. The spawned download task delegates to `retrieve_data` which handles permit acquisition internally via the OnceCell single-flight. This breaks the circular dependency that causes stalls. 3. Improve error propagation (`sequential_writer.rs`): when the background writer channel closes, check `RunState` for the original error before returning a generic "channel closed" message. ### Root cause The reconstruction pipeline has three resource pools: buffer permits (bounded semaphore), CAS download permits (64 concurrent), and per-xorb write locks. Before this fix, the main loop would: 1. Acquire a buffer permit (blocking if buffer full) 2. Call `get_data_task()` which acquires a CAS permit (blocking if pool exhausted) 3. 
Inside `retrieve_data()`, hold a write lock during the entire HTTP download This creates two deadlock vectors: - Buffer vs CAS: buffer fills up with terms waiting for CAS permits, but CAS permits are held by tasks blocked behind xorb write locks, and the writer can't drain the buffer because it's waiting for those tasks - CAS vs write lock: multiple tasks sharing the same xorb each hold a CAS permit while blocked on the write lock, starving other xorbs of permits ## Reproduction Reliably reproducible with small buffer: ``` HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb \ HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT=64mb \ python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/Qwen3-Coder-Next-GGUF', 'Qwen3-Coder-Next-Q4_K_M.gguf', local_dir='/tmp/test')" ``` - Before fix: stalls at ~3.4 GB, no progression (deadlock) - After fix: continuous progression, completes successfully With default buffer (2 GB), the stall is intermittent depending on network speed (consistently reproduced on slower connections).	2026-03-11 08:37:42 -07:00
Adrien	a48f1f80e4	feat: add skip_sha256 option to SingleFileCleaner (#679 ) ## Summary - Add `ShaGenerator::Skip` variant that skips SHA-256 computation entirely - `ShaGenerator::finalize()` now returns `Option<Sha256>` (None when skipped) - `SingleFileCleaner::new()` and `FileUploadSession::start_clean()` accept a `skip_sha256` boolean - When skipped, no `FileMetadataExt` is included in the shard ## Context Bucket uploads don't need SHA-256 in the shard metadata — the `sha_index` GSI is only used for LFS pointer resolution, which doesn't apply to buckets. Skipping SHA-256 for bucket uploads removes the main CPU bottleneck in the upload pipeline on non-SHA-NI instances. ## Alternative: dummy SHA-256 Instead of skipping entirely, the client could send a zeroed/dummy `FileMetadataExt`. The server would still store it but queries would never match. This avoids the server-side schema change (xetcas PR) but pollutes the GSI with dummy entries. Companion PRs: - xetcas: huggingface-internal/xetcas#498 (make `FileIdItem.sha256` optional server-side)	2026-03-10 17:36:09 +01:00
Hoyt Koepke	6a5535bc46	Rework simulation pipeline for adaptive concurrency and connection resiliency. (#648) This PR replaces the previous collection of scripts for setting up Docker containers with a much more nimble and lightweight set of Rust scripts and a simple, reusable proxy that can limit bandwidth and simulate congestion. The previous scripts are rewritten to use more reusable components. New tools: - cas_client/src/simulation/network_simulation: A lightweight, in-process network congestion simulation proxy that lives between the LocalServer instance and the RemoteClient instance, allowing simulation tests to run on a network with realistic congestion conditions and a gated bandwidth. This can be controlled dynamically through a LocalTestServer instance. - simulation/: A new package for collecting simulation scripts and analyzing the results. To run the new simulation scripts for the adaptive concurrency on upload, compile in release mode and run one of the scripts in `simulation/src/adaptive_concurrency/scripts/`. Docker is no longer needed to run any of the simulations. The old `cas_client/tests/adaptive_concurrency/` paths were removed.	2026-03-09 10:49:36 -07:00
Hoyt Koepke	ebd780d26d	Simulation interface for LocalTestServer: supports deletion, direct access, data dumps, etc. (#681 ) This PR adds interface functions to the LocalServer class that will allow it to become a full simulation environment for testing all the garbage collection stages.	2026-03-05 12:00:52 -08:00
Hoyt Koepke	70807bf012	Fix for incorrect error propagation on truncated download stream. (#683 ) Currently, the async stream logic silently swallows an UnexpectedEOF, treating it the same as an EOF. This is a bug; this PR fixes it to propagate UnexpectedEOF while handling correct EOF as the end of the stream.	2026-03-04 17:00:08 -08:00
Hoyt Koepke	e6e0413d90	Naming clarification: A Xorb is a data object, CAS is the remote server. (#680 ) This PR makes the use of the `cas` and `xorb` terms consistent. Previously, "cas" (for content addressed store) could simultaneously refer to either the remote server or the data bytes stored as a collection of chunks. After the renames in this PR, we consistently use `xorb` to refer to the data object and cas to refer to the remote server. This renames quite a few places; to aid in rebasing current work or updating downstream dependencies, this PR includes a file `API_UPDATES.md` that can be fed into an AI agent to quickly and accurately perform the renaming on any downstream dependencies.	2026-03-04 16:05:49 -08:00
Di Xiao	c4a56f889c	XetSession API (#657) This PR introduces a new `xet_session` crate that provides a session-based hierarchical API: users create a XetSession to manage runtime and configuration, then batch uploads into UploadCommit objects and downloads into DownloadGroup objects, each of which runs its transfers in the background on the inner XetRuntime. All pub functions are exposed as sync functions, making them easy to use from other languages, e.g. Python, C, etc.	2026-03-03 20:27:39 -08:00
Adrien	40b45fb0fb	feat: accept pre-computed SHA-256 in upload_files() (#678 ) ## Summary - Add optional `sha256s` keyword parameter to the Python-exposed `upload_files()` function - Forward it to `data_client::upload_async()` which already supports it ## Context ### Double computation today `huggingface_hub` computes SHA-256 on every file during `CommitOperationAdd.__post_init__()` for LFS batch negotiation, then `hf_xet` recomputes it internally because `upload_files()` doesn't accept pre-computed hashes. ### Performance impact This change eliminates the redundant computation entirely. ### Backward compatibility - `sha256s` is a keyword-only parameter with default `None` — no change for existing callers - `data_client::upload_async()` already accepts `sha256s: Option<Vec<String>>` since day one - When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue` and skips internal recomputation Companion PR: huggingface/huggingface_hub#3876	2026-03-03 21:20:09 +01:00
Adrien	e66dcef40b	Fix command injection in release workflow (CVE) (#677 ) ## Summary - Fix command injection vulnerability in `.github/workflows/release.yml` (HackerOne #3581567, severity High 8.8) - `${{ github.event.inputs.tag }}` was interpolated directly in `run:` blocks, allowing arbitrary RCE via crafted tag input (e.g. `v0.1.0; id; cat /etc/passwd;#`) - Moved all 6 occurrences to `env:` variables so the value is passed as a shell environment variable instead of being interpolated into the script ## Jobs fixed - `linux` — "Update version in toml" step - `musllinux` — "Update version in toml" step - `windows` — "Update version in toml" step - `macos` — "Update version in toml" step - `sdist` — "Update version in toml" step - `github-release` — "Create GitHub Release" step (`gh release create`)	2026-03-02 20:10:27 +01:00
Hoyt Koepke	9b3278a510	Streaming data writer (#656 ) This PR adds an integrated API for streaming downloads, exposing a DownloadStream object that is integrated with the file reconstructor. It also uses the same memory management buffer limiting process to work with the stream object. It also introduces cancellation support to the FileReconstructor to ensure that tasks waiting on a long running download or semaphore wait don't cause things to hang when an error is reported or the user drops the stream.	2026-02-27 15:08:25 -08:00
Di Xiao	c4111eb6da	Feature to monitor client process system usage (#617) Introduces a client benchmark utility to track system resource usage (CPU, memory, disk I/O, and network I/O) of a process, so we don't need to write scripts to capture usage stats according to different OS standards. This becomes extremely helpful when I benchmark on Python notebook instances, e.g. Google Colab, where a system monitor is not easily accessible or where running a separate monitor script is not easy. # Usage # Users can enable monitoring by setting `HF_XET_SYSTEM_MONITOR_ENABLED` to true and set the usage sample interval using `HF_XET_SYSTEM_MONITOR_SAMPLE_INTERVAL`; this outputs metrics to the tracing stream at `INFO` level by default. In addition, these metrics can be redirected to a separate file by setting the sample log path using `HF_XET_SYSTEM_MONITOR_LOG_PATH`. # Output # The stats are output in JSON format, which can be queried using tools like `jq`, e.g. 1. Trace of peak memory usage: `jq '.memory.peak_used_bytes' [HF_XET_SYSTEM_MONITOR_LOG_PATH]` 2. Trace of disk write speed: `jq '.disk.average_write_speed' [HF_XET_SYSTEM_MONITOR_LOG_PATH]` 3. Trace of network receive speed: `jq '.network.average_rx_speed' [HF_XET_SYSTEM_MONITOR_LOG_PATH]`	2026-02-27 13:36:31 -08:00
Hoyt Koepke	543914dce1	Scale download buffer memory limit by number of active downloads (#666 ) Currently, the download buffer memory limit is fixed, regardless of the number of downloads currently in flight. However, as the number of downloads increases, a fixed total size could lead to waiting on individual segments that download out of order or don't have enough turnaround time to saturate the output. While writing to disk or the download itself often becomes the bottleneck before these effects appear, planned features such as streaming files and caching could be affected by this limit. This PR alleviates this by allocating an additional 512MB of buffer per file, prioritized to that specific download, and releasing that capacity when the file finishes downloading. This is done using the AdjustableSemaphore class, first introduced for the concurrent scaling, which allows the total number of permits in a semaphore to be incremented or decremented; on decrement, permits are discarded upon return until the total number of permits reaches the target. The default formula for the download buffer size is now (2GB + 512MB * number of concurrent downloads), up to a maximum of 8GB (these values are adjustable).	2026-02-27 11:35:55 -08:00
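The buffer formula and the discard-on-return behaviour described in this commit message can be sketched as follows. This is a toy, single-threaded model under stated assumptions, not the AdjustableSemaphore from the repository: `download_buffer_limit` and `AdjustablePermits` are hypothetical names, binary units are assumed for "GB" and "MB", and there is no blocking or async waiting.

```rust
// Assumption: binary units; the commit message only says "GB" and "MB".
const MB: u64 = 1 << 20;
const GB: u64 = 1 << 30;

/// 2GB + 512MB per concurrent download, capped at 8GB (the stated defaults).
pub fn download_buffer_limit(concurrent_downloads: u64) -> u64 {
    let scaled = (2 * GB).saturating_add((512 * MB).saturating_mul(concurrent_downloads));
    scaled.min(8 * GB)
}

/// Toy model of a semaphore whose total permit count can change.
#[derive(Debug)]
pub struct AdjustablePermits {
    available: u64, // permits that can be handed out right now
    in_use: u64,    // permits currently held by callers
    debt: u64,      // held permits that will be discarded when returned
}

impl AdjustablePermits {
    pub fn new(total: u64) -> Self {
        Self { available: total, in_use: 0, debt: 0 }
    }

    /// The total the semaphore is converging towards.
    pub fn target_total(&self) -> u64 {
        self.available + self.in_use - self.debt
    }

    pub fn try_acquire(&mut self) -> bool {
        if self.available == 0 {
            return false;
        }
        self.available -= 1;
        self.in_use += 1;
        true
    }

    pub fn release(&mut self) {
        assert!(self.in_use > 0, "release without a matching acquire");
        self.in_use -= 1;
        if self.debt > 0 {
            self.debt -= 1; // discard the permit instead of returning it
        } else {
            self.available += 1;
        }
    }

    pub fn increment(&mut self, n: u64) {
        let cancelled = n.min(self.debt); // first cancel pending discards
        self.debt -= cancelled;
        self.available += n - cancelled;
    }

    pub fn decrement(&mut self, n: u64) {
        let n = n.min(self.target_total()); // cannot go below zero permits
        let taken = n.min(self.available);  // remove idle permits immediately
        self.available -= taken;
        self.debt += n - taken;             // the rest is discarded on return
    }
}
```

Under these assumptions, four concurrent downloads give a 4GB limit and twelve or more hit the 8GB cap. Decrementing while all permits are held removes nothing immediately; the total shrinks as permits come back.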
Rajat Arya	e31bbb5ddb	hf-xet 1.3.2 version bump (#671 ) v1.3.2	2026-02-27 08:38:24 -08:00
Hoyt Koepke	3a4a2b8294	Fixes for intermittent test failures on windows. (#669 ) This PR addresses two rare test failures on Windows, both due to Windows' non-synchronous file system behavior: - A race condition when opening the local test database, which causes an error. - Unwanted cleanup conditions in the log preservation test, which can trigger if the test execution is stretched out long enough. It also fixes a null-termination bug in set_file_metadata that causes it to fail silently if the memory layout is such that the string passed in isn't null-terminated. This causes occasional failures in setting the metadata time on Linux.	2026-02-27 07:53:05 -08:00
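The null-termination issue comes from the fact that Rust strings carry a length rather than a trailing NUL byte, so a C API reading one keeps going until it happens to find a zero in adjacent memory. The commit message does not show the actual fix; the following is a general sketch of the usual remedy with `std::ffi::CString`, where `to_c_path` and `has_trailing_nul` are hypothetical helpers.

```rust
use std::ffi::{CString, NulError};

/// True if the byte slice ends with a NUL terminator.
pub fn has_trailing_nul(bytes: &[u8]) -> bool {
    bytes.last() == Some(&0u8)
}

/// Copies a Rust string into an owned, NUL-terminated buffer that is safe to
/// hand to a C API via `.as_ptr()`. Fails if the input has an interior NUL.
pub fn to_c_path(path: &str) -> Result<CString, NulError> {
    CString::new(path)
}
```

A plain `&str` such as `"/tmp/a.txt"` does not end in a NUL byte, whereas the bytes returned by `CString::as_bytes_with_nul` always do, regardless of what sits next to the string in memory.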
Hugo Larcher	73e531a41c	fix: wrap TrackingProgressUpdater in AggregatingProgressUpdater (#668 )
## Summary
Wrap download progress updaters in `AggregatingProgressUpdater` to eliminate GIL contention when Python callers provide per-file progress callbacks. The upload path has had this aggregation since v1.1.3 (PR #340), but the download path was missed. Without aggregation, each XORB chunk triggers a `spawn_blocking` + `Python::with_gil()` callback. With many concurrent file downloads, this causes severe GIL contention, measured as a 4x throughput reduction (3000 MB/s → 750 MB/s on a 25 Gbps link). The fix wraps the caller-provided `TrackingProgressUpdater` in an `AggregatingProgressUpdater` (200ms flush interval) inside `download_file_with_updater()`, matching the pattern already used by `FileUploadSession`. This reduces Python callback frequency from thousands/sec to ~5/sec per file while preserving progress bar feedback.
## Root cause
When `huggingface_hub` calls `hf_xet.download_files()`, it passes a per-file Python callback for progress bar updates. On the Rust side, each callback invocation goes through:
```
report_bytes_written() / report_transfer_progress()
  → tokio::spawn(register_updates())
    → spawn_blocking(Python::with_gil(callback))
```
With the detailed download progress tracking added in PR #645 (hf-xet v1.3.0), both `report_bytes_written` and `report_transfer_progress` fire per chunk, roughly doubling callback frequency. With 8+ concurrent file downloads, each spawning dozens of concurrent XORB streams, the GIL becomes a severe bottleneck.
## History
The problem has existed since xet download support was introduced, but worsened over time:

| Version | Date | Impact |
|---------|------|--------|
| `huggingface_hub v0.30.0` / `hf-xet 0.1.x` | Mar 2025 | Moderate: synchronous `with_gil()` per chunk, but hf_xet was an optional extra |
| `huggingface_hub v0.31.0` / `hf-xet >=1.1.0` | May 2025 | Moderate: hf-xet became a hard dependency on x86_64/arm64 |
| `hf-xet v1.1.3` | Jun 2025 | Upload path fixed with `AggregatingProgressUpdater` (PR #340); download path left unprotected |
| `hf-xet v1.3.0` | Feb 2026 | Severe: PR #645 added detailed per-chunk progress tracking to downloads, doubling callback frequency without aggregation |

PR #340 explicitly noted: "each [update] has to acquire a global GIL lock. This negatively affects the upload speed on fast connections". This is the same problem, but only the upload side was addressed.
## Benchmarks
Downloading 3 safetensors files (16.1 GB total) from `Qwen/Qwen3.5-35B-A3B` on a 25 Gbps machine:

| Test | Before | After |
|------|--------|-------|
| `download_files()` with `progress_updater=None` (baseline) | 3119 MB/s | 3119 MB/s |
| `download_files()` with per-file Python callbacks | 746 MB/s | 1789 MB/s |
| `snapshot_download()` (full Python CLI path with tqdm) | ~750 MB/s | 2395 MB/s |

Progress callback overhead drops from 4x slowdown to <1%.	2026-02-26 21:23:13 -08:00
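The aggregation idea in this commit, batching many small progress updates into one callback per flush interval, can be sketched without any Python or tokio machinery. This is a minimal synchronous model, not the repository's `AggregatingProgressUpdater`: the `Aggregator` type is hypothetical, and the current time is passed in explicitly so the behaviour is deterministic.

```rust
use std::time::{Duration, Instant};

/// Accumulates byte counts and forwards them to `callback` at most once per
/// `interval`, instead of once per reported chunk.
pub struct Aggregator<F: FnMut(u64)> {
    pending: u64,
    last_flush: Instant,
    interval: Duration,
    callback: F,
}

impl<F: FnMut(u64)> Aggregator<F> {
    pub fn new(start: Instant, interval: Duration, callback: F) -> Self {
        Self { pending: 0, last_flush: start, interval, callback }
    }

    /// Cheap per-chunk path: only an addition unless the interval has elapsed.
    pub fn report(&mut self, bytes: u64, now: Instant) {
        self.pending += bytes;
        if now.duration_since(self.last_flush) >= self.interval {
            self.flush(now);
        }
    }

    /// Expensive path: invoke the (possibly lock-taking) callback once with
    /// everything accumulated since the previous flush.
    pub fn flush(&mut self, now: Instant) {
        if self.pending > 0 {
            (self.callback)(self.pending);
            self.pending = 0;
        }
        self.last_flush = now;
    }
}
```

Feeding this sketch 1000 one-byte updates spaced 1 ms apart with a 200 ms interval produces five callbacks of 200 bytes each, which lines up with the "~5/sec per file" figure quoted above. A final `flush` call at the end of a transfer delivers any remainder.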
dependabot[bot]	0d0f4883ad	Bump time from 0.3.44 to 0.3.47 in /hf_xet_wasm (#654 ) Bumps [time](https://github.com/time-rs/time) from 0.3.44 to 0.3.47.
## Release notes
Sourced from [time's releases](https://github.com/time-rs/time/releases). The entries for v0.3.47, v0.3.46, and v0.3.45 each refer to the [changelog](https://github.com/time-rs/time/blob/main/CHANGELOG.md) for details.
## Changelog
Sourced from [time's changelog](https://github.com/time-rs/time/blob/main/CHANGELOG.md).
### 0.3.47 [2026-02-05]
Security
- The possibility of a stack exhaustion denial of service attack when parsing RFC 2822 has been eliminated. Previously, it was possible to craft input that would cause unbounded recursion. Now, the depth of the recursion is tracked, causing an error to be returned if it exceeds a reasonable limit. This attack vector requires parsing user-provided input, with any type, using the RFC 2822 format.

Compatibility
- Attempting to format a value with a well-known format (i.e. RFC 3339, RFC 2822, or ISO 8601) will error at compile time if the type being formatted does not provide sufficient information. This would previously fail at runtime. Similarly, attempting to format a value with ISO 8601 that is only configured for parsing (i.e. `Iso8601::PARSING`) will error at compile time.

Added
- Builder methods for format description modifiers, eliminating the need for verbose initialization when done manually.
- `date!(2026-W01-2)` is now supported. Previously, a space was required between `W` and `01`.
- `[end]` now has a `trailing_input` modifier which can either be `prohibit` (the default) or `discard`. When it is `discard`, all remaining input is ignored. Note that if there are components after `[end]`, they will still attempt to be parsed, likely resulting in an error.

Changed
- More performance gains when parsing.

Fixed
- If manually formatting a value, the number of bytes written was one short for some components. This has been fixed such that the number of bytes written is always correct.
- The possibility of integer overflow when parsing an owned format description has been effectively eliminated. This would previously wrap when overflow checks were disabled. Instead of storing the depth as `u8`, it is stored as `u32`. This would require multiple gigabytes of nested input to overflow, at which point we've got other problems and trivial mitigations are available by downstream users.
### 0.3.46 [2026-01-23]
Added
- All possible panics are now documented for the relevant methods.
- The need to use `#[serde(default)]` when using custom `serde` formats is documented. This applies only when deserializing an `Option<T>`.
- `Duration::nanoseconds_i128` has been made public, mirroring `std::time::Duration::from_nanos_u128`.

... (truncated)
## Commits
- `d5144cd` v0.3.47 release
- `f6206b0` Guard against integer overflow in release mode
- `1c63dc7` Avoid denial of service when parsing Rfc2822
- `5940df6` Add builder methods to avoid verbose construction
- `00881a4` Manually format macros everywhere
- `bb723b6` Add `trailing_input` modifier to `end`
- `31c4f8e` Permit `W12` in `date!` macro
- `490a17b` Mark error paths in well-known formats as cold
- `6cb1896` Optimize `Rfc2822` parsing
- `6d264d5` Remove erroneous `#[inline(never)]` attributes
- Additional commits viewable in [compare view](https://github.com/time-rs/time/compare/v0.3.44...v0.3.47)

Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-02-24 20:21:32 -08:00
Rajat Arya	438045a19e	Version bump for hf-xet 1.3.1 release (#665 ) v1.3.1	2026-02-24 15:36:20 -08:00
Rajat Arya	8808f9e64e	Add Windows ARM64 build support (#662 ) ## Summary Closes #588 - Add `win11-arm` runner with `aarch64-pc-windows-msvc` target to the hf-xet Python wheel release pipeline - Add `win11-arm` runner with `aarch64` target to the git-xet CLI release pipeline, parameterizing the WiX installer `-arch` flag ## Test plan - [x] Trigger a workflow_dispatch run of the Release workflow and verify `windows` matrix includes both `x64` and `aarch64` entries - [x] Verify ARM64 wheels and .pdb debug symbols are built and uploaded - [ ] Trigger a workflow_dispatch run of the git-xet Release workflow and verify ARM64 binary and MSI installer are produced 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-24 15:23:42 -08:00
Di Xiao	99105937f3	Upgrade hf-xet to 1.3.0 (#664 ) v1.3.0	2026-02-23 15:26:30 -08:00
Brian Ronan	17e900a70e	Feat: optional `request_headers` on hf_xet API calls (#661 ) Adds support for setting an optional `request_headers` map on the hf_xet upload and download API calls. This map is augmented with the hf_xet user agent string and is passed along with the requests to xetcas. This PR also adds unit tests for the map merging behavior to `hf_xet/lib.rs` and adds support for running them with cargo test and in the GitHub Actions CI step.	2026-02-23 14:43:58 -08:00
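One way the header augmentation could work is sketched below. The commit message does not specify the merge rules, so everything here is an assumption: `merge_request_headers` is a hypothetical function, the user agent value is a placeholder, and a caller-supplied `User-Agent` is assumed to be kept and followed by the hf_xet agent string rather than replaced.

```rust
use std::collections::HashMap;

/// Returns the caller's headers plus a `User-Agent` entry that always
/// includes `xet_user_agent`. Merge policy is an assumption, see lead-in.
pub fn merge_request_headers(
    caller: Option<&HashMap<String, String>>,
    xet_user_agent: &str,
) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    let mut user_agent = xet_user_agent.to_string();

    if let Some(headers) = caller {
        for (name, value) in headers {
            // HTTP header names are case-insensitive.
            if name.eq_ignore_ascii_case("user-agent") {
                // Assumed policy: keep the caller's value, append ours.
                user_agent = format!("{value} {xet_user_agent}");
            } else {
                merged.insert(name.clone(), value.clone());
            }
        }
    }

    merged.insert("User-Agent".to_string(), user_agent);
    merged
}
```

With no caller headers the result contains only the `User-Agent` entry; with a caller map, other headers pass through untouched.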
Hoyt Koepke	b3c5d05fb7	Make specifying the file size at the beginning of an upload optional. (#651 ) Currently, the progress and dependency tracking in the upload path requires that the total size of a file be specified at the start. This PR changes this so that in cases where the upload is streamed and the total size is not known, it's updated as soon as new data is processed. Both routes now work and correctly track the file sizes.	2026-02-23 10:31:09 -08:00
Hoyt Koepke	2176e5d3ed	FileDownloadGroup (#652 ) This PR adds a FileDownloadSession struct that parallels the FileUploadSession struct, replacing the FileDownloader. It's an intermediate step in preparation for a session-based API that integrates well with interfaces other than the python interface in hf_xet.	2026-02-19 17:43:35 -08:00
Hoyt Koepke	21bc6cfdc3	Removed incorrectly included AGENTS.md. (#660 ) The AGENTS.md file was incorrectly checked into the repository (part of a claude process to prepare and check a diff for PR). This PR removes that.	2026-02-19 11:32:12 -08:00
Hoyt Koepke	5d6371a296	Progress reporting for downloads. (#645 ) This PR adds detailed progress reporting to the download path. - Transfer progress is reported as soon as the download streams start; actual bytes written are reported as the reconstructed file is written out. - Currently, each call to download_file creates a separate progress tracker, but this sets up for download groups with grouped download progress tracking. To support this, the UploadProgressStream was split into three classes: a common StreamProgressReporter plus download-specific and upload-specific versions. This also allows us to simplify the API to RetryWrapper. More tracking was added to the file reconstruction paths to properly report progress.	2026-02-19 11:06:42 -08:00


512 Commits