xet-core

mirror of https://github.com/huggingface/xet-core.git synced 2026-06-04 13:30:29 +08:00

Author	SHA1	Message	Date
Assaf Vayner	5868f64ab9	fixing some issues identified in cargo audit (#802) CI for hf-hub is running cargo audit and found many issues through hf-xet transitive deps. This PR attempts to solve some of them (not necessarily all of them). Main changes: - dropped derivative and reqwest-retry - replaced bincode with postcard, only used in testing - upgraded xet-core rand usage - added an audit CI step, ignoring some issues that we can't easily fix. Note (Medium Risk): Medium risk because it removes `reqwest-retry`/`derivative` and replaces part of the retry classification logic with an in-house equivalent, which could subtly change HTTP retry behavior; the remaining changes are dependency/version bumps and test-only serialization swaps. Overview: Adds a new CI `cargo audit` job and introduces `.cargo/audit.toml` to ignore a small set of dev-only RustSec advisories with documented rationale. Reduces audit surface by dropping `derivative` (manual `Debug` impl for `AuthConfig`) and removing `reqwest-retry`, replacing its status-code classification with a local `Retryable` enum + `default_on_request_success` helper in `RetryWrapper`. Updates workspace deps (notably `rand` to `0.10` and `rand_distr` to `0.6`) and adjusts call sites to the newer `rand` APIs (`RngExt` imports, minor test/bench tweaks). Test-only binary serialization switches from `bincode` to `postcard` (and updates affected tests), with corresponding lockfile updates across crates.	2026-04-20 14:49:48 -07:00
Assaf Vayner	08377eab3c	Upgrade crates version to 1.5.1 (#782) ## Summary - Bump workspace version from 1.5.0 to 1.5.1 - Update all internal dependency version references to match. Note (Low Risk): Low risk version-only bump across workspace manifests and lockfiles with no code/behavior changes in the diff. Overview: Bumps the workspace package version from `1.5.0` to `1.5.1` and aligns internal crate dependency version pins (`xet-runtime`, `xet-core-structures`, `xet-client`, `xet-data`, `hf-xet`) to match. Updates lockfiles (`Cargo.lock` plus `hf_xet` and wasm lockfiles) so published/embedded artifacts resolve to the `1.5.1` crate set (including bringing wasm lockfiles up to `1.5.1`).	2026-04-06 14:03:02 -07:00
Di Xiao	1f7400cc4b	Drop xet-core-structure from xet-runtime dev dep (#776 )	2026-04-03 13:56:26 -07:00
Di Xiao	950807ba43	Upgrade crates version to 1.5.0 (#775 ) Update workspace version to `1.5.0` and intra-workspace dependency versions to `1.5.0`	2026-04-03 13:39:50 -07:00
Di Xiao	1f0918c33e	Refactor XetSession commit / group CAS endpoint and auth configuration (#771) There's no publicly documented Xet CAS endpoint. To interact with Xet CAS, all public clients need to obtain the CAS endpoint from the same route that issues the CAS token. Currently users need to 1. first construct a CAS token URL with respect to a certain operation ("read" or "write", targeted repo type, targeted repo, targeted revision), 2. send a request to this URL to get a CAS token and CAS endpoint, 3. use the CAS endpoint to build a `XetSession`, 4. use the `XetSession` instance and the CAS token and CAS token URL to build an upload or download group. This is a rather complicated setup. This PR addresses this blocker by eagerly "refresh"-ing the CAS token if no CAS endpoint is provided, so users can 1. build a `XetSession`, 2. construct a CAS token URL with respect to a certain operation ("read" or "write", targeted repo type, targeted repo, targeted revision), 3. use the `XetSession` instance and the CAS token URL to build an upload or download group. So effectively, there will be two common patterns: Pattern A: endpoint known ahead of time; no eager refresh, token_info is used as-is ``` let session = XetSessionBuilder::new().build()?; let commit = session .new_upload_commit()? .with_endpoint(cas_url) .with_token_info(token, expiry) .with_token_refresh_url(refresh_url, /* Auth headers */) .build_blocking()?; ``` Pattern B: endpoint unknown; the build call fetches it, and token_info is seeded from the response ``` let session = XetSessionBuilder::new().build()?; let commit = session .new_upload_commit()? .with_token_refresh_url(token_refresh_url, /* Auth headers */) .build_blocking()?; ``` Other changes: 1. `with_endpoint()` and `with_custom_headers()` configuration is moved from the `XetSession` level down to the operation level, because multiple operations with different CAS endpoints can co-exist in the same session instance. 2. The builders for the different operations `XetUploadCommit`, `XetFileDownloadGroup`, `XetDownloadStreamGroup` are refactored to share common code under `struct AuthGroupBuilder<G>`.	2026-04-02 11:07:07 -07:00
Assaf Vayner	20198a9081	Remove prometheus dependency and metrics (#769) ## Summary - Remove the `prometheus` crate dependency from the workspace and `xet_data` - Delete `prometheus_metrics.rs` which defined 3 IntCounter metrics (CAS bytes produced, bytes cleaned, bytes smudged) - Remove metric increment calls from `file_upload_session.rs` and `file_download_session.rs` - Fix Windows CI flake: redb "Database already open" error in `test_single_large` These metrics were collected but never exposed via any HTTP endpoint or text encoder, making them effectively dead code. ## Test plan - [x] `cargo +nightly fmt`: clean - [x] `cargo clippy --all-targets`: no new warnings - [x] `cargo test -p xet-data`: 17/17 pass - [x] `cargo test -p xet-data --features simulation --test test_clean_smudge`: 14/14 pass (including `test_single_large`) - [x] WASM builds (`hf_xet_wasm`, `hf_xet_thin_wasm`): both succeed. Note (Low Risk): This removes unused Prometheus metrics plumbing and related dependencies without changing the core upload/download logic. Main risk is loss of any downstream reliance on these counters at build time (e.g., feature flags or imports). Overview: Removes the `prometheus` dependency from the workspace and `xet_data`, and updates lockfiles accordingly (including WASM-related lockfiles). Deletes `xet_data`’s `prometheus_metrics` module and strips the associated counter increments from `FileUploadSession` and `FileDownloadSession`, leaving the data processing behavior unchanged aside from no longer recording these metrics.	2026-04-01 14:56:58 -07:00
Assaf Vayner	7d97aa3066	Replace heed (LMDB) with redb in local CAS simulation (#766) This is an optional change. heed pulls in a bunch of deps and uses LMDB, which may require more compilation/linking steps in tests. We use it for such a small subset of operations in testing that I thought we might try an even thinner Rust-native dep instead, and that's what redb is. ## Summary - Replace `heed` (C LMDB bindings) with `redb` (pure Rust embedded KV store) in `LocalClient` - Removes C dependency, `unsafe` block, Windows retry workaround, and custom `Drop` impl - Introduces `RedbHash` newtype wrapper for `MerkleHash` to satisfy orphan rules on redb's `Key`/`Value` traits - Net reduction of ~130 lines; all 147 existing tests pass ## Test plan - [x] `cargo check -p xet-client --features simulation`: clean - [x] `cargo test -p xet-client --features simulation`: 147 passed, 0 failed - [x] `cargo clippy -p xet-client --features simulation`: clean - [x] `cargo +nightly fmt`: clean. Note (Medium Risk): Swaps the embedded KV store used for shard dedup/deletion metadata in the local CAS simulation, which can affect test behavior and on-disk state/locking semantics (especially with concurrent clients). Scope is contained to simulation/test code and dependency graph changes. Overview: Switches `LocalClient`’s disk-backed global-dedup and file deletion status storage from `heed`/LMDB to `redb`, including new `RedbHash` serialization, `TableDefinition`s, and updated read/write transaction flows. Adds a small global database-handle cache to avoid `redb` exclusive-lock conflicts across multiple `LocalClient` instances, and removes the prior LMDB-specific open/retry logic and custom `Drop` close path. Workspace dependencies/lockfiles are updated to drop `heed`/LMDB-related crates and add `redb`, and `.gitignore` now ignores `.worktrees/`.	2026-03-31 15:23:18 -07:00
Di Xiao	15011cb230	XetSession uses direct token refresh route instead of a callback (#751) This PR makes two significant, breaking API redesigns: 1. Auth tokens move from session-level (shared by all operations) to per-operation level (per `UploadCommit`, `FileDownloadGroup`, and `DownloadStreamGroup`). This enables uploads and downloads from the same session to carry different access-level tokens, a sensible design for HF's write-vs-read token split. 2. Instead of letting users provide a callback to refresh tokens, the new API lets users provide a token refresh URL and access credentials in an HTTP header map. ### Why 1. CAS JWTs are short-lived, but a `XetSession` is intended to be held for a long time, so it makes more sense to configure CAS auth at the operation level (`UploadCommit`, `FileDownloadGroup`, or `DownloadStreamGroup`), where it is discarded once the operation is done. 2. For different access levels (write vs. read) and different operation targets (repo and commit), the CAS JWT and the token refresh URL will be different. `UploadCommit`, `FileDownloadGroup`, and `DownloadStreamGroup` each also function as a single auth group. 3. Providing a URL is considered easier than writing a callback, and is safer when crossing the Python-Rust GIL boundary. Examples: ``` // Upload token (write access) let mut upload_headers = HeaderMap::new(); upload_headers.insert("Authorization", "Bearer hub-write-token".parse().unwrap()); let commit = session .new_upload_commit()? .with_token_info("CAS_WRITE_JWT", 900) .with_token_refresh_url("https://huggingface.co/api/repos/token/write", upload_headers) .build_blocking()?; ``` ``` // File download token (read access) let mut dl_headers = HeaderMap::new(); dl_headers.insert("Authorization", "Bearer hub-read-token".parse().unwrap()); let group = session .new_file_download_group()? .with_token_info("CAS_READ_JWT", 900) .with_token_refresh_url("https://huggingface.co/api/repos/token/read", dl_headers) .build_blocking()?; ``` Secondary changes include: - `DirectRefreshRouteTokenRefresher` consolidated into `xet_client::cas_client::auth`. - HTTP client module moved from `cas_client` to `xet_client::common` for shared use between `xet_client::cas_client` and `xet_client::hub_client`. - New `DownloadStreamGroup` type (streaming downloads moved off `XetSession`). - Fix Session ID type regression: this was fixed once in https://github.com/huggingface/xet-core/pull/738 but regressed again, seems AI agents don't learn. - HTTP client cache key now incorporates custom headers	2026-03-30 08:39:25 -07:00
Assaf Vayner	9c0cb6e4c8	Reduce workspace dependencies (batches 1-3) (#746) ## Summary - Remove unused dependencies: warp (zero imports), paste (zero invocations), tower-service (zero imports), and heed misplacement in xet_core_structures - Move mockall to dev-dependencies in xet_client by gating `#[automock]` with `#[cfg_attr(test, automock)]` - Feature-gate simulation module behind `simulation` cargo feature in xet_client, making axum, heed, humantime, futures-util, human-bandwidth, and tower-http optional - Replace duration-str with humantime (~2 deps vs ~78 transitive deps) across xet_runtime, xet_client simulation, and simulation crate ## Impact (Metric: Before, After, Change) - hf-xet production deps: 371, 321, -50 - Workspace total: 575, 569, -6 ## Test plan - [x] `cargo check --workspace` passes - [x] `cargo check -p hf-xet` passes (without simulation feature, the key validation) - [x] `cargo test --workspace`: all tests pass (4 pre-existing auth test failures in git_xet unrelated to this PR) - [x] `cargo tree -p hf-xet -e normal --prefix none | sort -u | wc -l` confirms 321 deps 🤖 Generated with [Claude Code](https://claude.com/claude-code) Note (Medium Risk): Medium risk because it changes dependency graph and Cargo feature gating (notably `xet-client` simulation modules and CI test features), which can affect build/test behavior across targets despite minimal runtime logic changes. Overview: Reduces workspace dependency surface by removing `duration-str` (replaced with `humantime`) and trimming other transitive-heavy crates; updates lockfiles accordingly across the workspace, `hf_xet`, and WASM builds. Introduces/propagates a `simulation` Cargo feature: `xet-client`’s simulation server-related deps become optional and are only compiled/exported when `feature = "simulation"` is enabled; `git_xet` adds a `simulation` feature that forwards to dependent crates, and CI now runs tests with `strict simulation git-xet-for-integration-test`. Minor repo hygiene updates include ignoring `.claude/` in `.gitignore` and wiring the `simulation` crate to depend on `xet-client` with `features = ["simulation"]` (plus swapping its duration parsing helper to `humantime`). --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:54:36 -07:00
Hoyt Koepke	69962587b5	Composable Hash Functionality (#745 ) Currently, computing aggregate chunk hashes across independently processed ranges requires recomputing over the full concatenated chunk list. This PR introduces ChunkHashRange, a composable representation that can hash contiguous partial ranges and merge them while preserving equivalence with the existing xorb_hash / file_hash behavior. This allows an intermediate representation of the hash ranges that can be merged in arbitrary order to get the final hash. It also uses O(log(n)) storage and all operations are done in linear time. Serialization and Deserialization are fully supported. The main use case for this is in doing partial file edits. Previously, to edit the middle of a large file, the client would have to know all the hashes for the full file, even if only a few in the middle were changed. With a large file, this can still be 100s of MB; the chunk metadata size is roughly 1/1000 of the data size. With this change, we can now transmit the unmodified parts of a file in O(log(n)) storage but still be able to build the entire function hash; now a sequence of 10M chunks takes the equivalent storage of ~500 chunks or so. Along the way, we also added in an optimization for the merge step to avoid an allocation, yielding a 2x speedup. --------- Co-authored-by: Hoyt Koepke <hoytak@xethub.com>	2026-03-27 08:38:59 -07:00
Hoyt Koepke	c90f0a7bd9	Session API Polish; unify task handling/cancellation behavior. (#747) Previously, upload and download paths each had their own ad-hoc state tracking, cancellation, and runtime bridging logic. TaskRuntime consolidates this into a single type that owns a CancellationToken tree, tracks Running/Finished/Cancelled state with recursive propagation to children, and provides bridge_async/bridge_sync wrappers that automatically wire up tokio::select! cancellation. Session → commit/group → per-file handles form a parent-child token tree, so aborting a session cancels all descendant work. The upload path gets new UploadFileHandle and UploadStreamHandle wrapper types (replacing the old UploadTaskHandle), with inner/wrapper pattern for cheap cloning. UploadCommit::commit() now returns a CommitReport containing aggregate dedup metrics, progress, and per-file FileMetadata. The download path mirrors this structure: FileDownloadGroup uses TaskRuntime for state gating and owns bespoke DownloadTaskHandle instances with per-task status and result access. Note (High Risk): High risk due to a breaking redesign of the public `xet_session` API (new handle/report types and renamed methods) plus new cancellation/state machinery that changes how uploads/downloads are coordinated and terminated. Overview: Redesigns `xet_pkg::xet_session` around a new hierarchical `TaskRuntime` (using `tokio-util` cancellation tokens) to unify state, bridging, and cancellation across session → commit/group → per-file handles. Replaces the old task-handle/result model (`tasks.rs`, `UploadResult`/`DownloadResult`, `TaskStatus`, group/session state enums) with explicit handle/report types: `XetFileUpload`, `XetStreamUpload`, `XetFileDownload`, `XetCommitReport`, and `XetDownloadGroupReport`, and standardizes task state via `XetTaskState`. Adjusts APIs and error semantics: `commit()` now returns an aggregate report (dedup metrics + progress + per-file metadata) and no longer consumes `self`; progress methods become infallible (`progress()`); cancellations/errors are consolidated (`AlreadyCompleted`, `UserCancelled`, `KeyboardInterrupt`, `TaskError`/`PreviousTaskError`) with updated Python exception mapping. `xet_data` now returns per-file `DeduplicationMetrics` from upload tasks and adds a zero-copy `SingleFileCleaner::add_data_from_bytes`.	2026-03-27 07:54:37 -07:00
Hoyt Koepke	332a456e1d	Add ordered and unordered download streaming to session interface (#729) This PR adds ordered and unordered download streams on XetSession, including optional byte-range support and per-stream progress reporting. Blocking and async variants are supported. On the reconstruction side, this introduces UnorderedWriter and UnorderedDownloadStream in xet_data, and extends the FileDownloadSession stream APIs to take optional source ranges. Ordered and unordered streams now share the same session-facing access pattern for async and blocking callers. This PR also renames DownloadGroup to FileDownloadGroup; the stream data uses the per-session memory pool but doesn't count towards the maximum number of concurrent downloads in progress. Note (Medium Risk): Touches core file reconstruction/writer plumbing (including `DataWriter` ownership and new unordered writer/stream paths) and changes public session APIs, so regressions could impact download correctness, cancellation, or progress reporting. Overview: Adds first-class ordered and unordered streaming download APIs to `xet_pkg::xet_session`, including async and blocking variants, optional source-relative byte ranges, and per-stream progress via new `XetDownloadStream` / `XetUnorderedDownloadStream` wrappers. On the data layer, introduces an unordered reconstruction path (`UnorderedWriter` + `UnorderedDownloadStream`) and refactors streaming to spawn reconstruction tasks immediately but gate execution behind `start()`; stream abort callbacks are now registered per-stream and automatically unregistered on drop to avoid callback accumulation. Updates the reconstruction writer contract by making `DataWriter::finish` consume the writer (and shifting `DataWriter` to `&mut self` usage), adjusts `SequentialWriter` accordingly, and adds Criterion-based reconstruction benchmarks plus extensive unordered-stream tests. Also renames session `DownloadGroup` to `FileDownloadGroup` (and constructors) and updates call sites/examples.	2026-03-20 14:40:18 -07:00
Hoyt Koepke	749c28b086	Error unification and cleanup (#737) This PR performs some housecleaning and removes some technical debt around using different error types, unifying them with the python interface. - Our client code tended to do a lot with anyhow errors as an artifact of first using them before switching to thiserror. This PR cleans these up in favor of using ClientError or other named error types directly. - It also removes all the aliases to the old error type names present in the packages before the refactoring, now settling into ClientError, FormatError, DataError, and RuntimeError, with XetError being the error type exposed publicly. - Also, currently, xet_session exposes SessionError as an alias of XetError, which adds an extra public type name without adding behavior. This PR removes that alias and standardizes the public API/docs onto XetError directly. - It also tightens Python-facing error behavior and moves the python handling to the XetError class directly, hidden behind a python feature flag. Using these types, hf_xet now registers XetObjectNotFoundError and XetAuthenticationError exception classes for authentication and the not-found cases. These inherit from the current exception classes, so all behavior is preserved. - In addition, the From for PyErr mapping routes timeout/network/auth/not-found categories to more appropriate Python exception types than simply RuntimeError. This is primarily an API-surface cleanup plus error-classification alignment. Note (Medium Risk): API-breaking error-surface changes (removal of legacy alias modules and signature changes like `CredentialHelper::fill_credential`) may require downstream code updates, especially where errors are matched/converted. Runtime behavior should be mostly unchanged, but error mapping/propagation paths (including Python exceptions) are widely touched across crates. Overview: This PR unifies error types across the workspace by removing legacy re-export/alias modules (e.g. `CasClientError`, `CasTypesError`, `DataProcessingError`, `SessionError`) and updating call sites to use canonical errors like `xet_client::ClientError`, `xet_core_structures::CoreError`, and `xet_data::DataError` directly. It updates CAS client code to standardize on `crate::error::Result`/`ClientError`, including deleting `cas_client/error.rs`, adjusting error conversions in retry/http middleware paths, and updating simulation/local-server code to map `ClientError` to HTTP responses. Python bindings (`hf_xet`) now convert failures via `XetError` (with `xet_pkg` built with `python` support), register custom exceptions on module init, and refine argument-validation errors to `PyValueError` while routing network/timeout/auth/not-found to more appropriate Python exception classes. Misc cleanup: `git_xet` now depends on `xet-data`, simulation binaries switch to `anyhow::Result`/`bail!`, and lockfiles are updated for new/updated dependencies (notably `pyo3`/`inventory`).	2026-03-19 16:34:28 -07:00
Di Xiao	fb83178d28	Fix session id regression (#738) The session id was changed from a `ulid` to a `UniqueID` (a self-incrementing in-memory u64) in a previous PR, but that is not correct. The session id is used in CAS server logs and traces and in CDN logs to identify a related group of activity (for debugging and similar purposes), so it needs to be globally unique (hence `ulid`) rather than locally unique.	2026-03-19 13:31:40 -07:00
Di Xiao	9f68537319	Resolve all Dependabot alerts (#733 ) This PR should resolve all Dependabot alerts by upgrading deps and switching out some deprecated crate for suggested alternatives, e.g. `tempdir -> tempfile`. Supersede PR #721. Fix issue #722	2026-03-19 09:33:56 -07:00
Hoyt Koepke	506fc28291	Simplify progress tracking + Unify Task ID tracking + Legacy Interface (#726) Currently, progress tracking is split between callback-driven and snapshot-driven paths, making session and task wiring across xet_data, xet_pkg, hf_xet, and git_xet harder to keep consistent. This PR moves upload/download progress to a polling snapshot model backed by atomics. It also switches task identifiers to a UniqueID shared with the progress tracking throughout the session APIs. This PR also updates the rate estimation to use a lighter-weight exponentially weighted moving average model, so it can be done at a low level. To preserve compatibility for existing callback consumers, callback-oriented upload/download progress tracking APIs are moved under xet_pkg::legacy and bridged from polling snapshots via callback-based updaters. hf_xet and git_xet are updated to use that legacy bridge layer, so current integrations keep working until everything is fully switched over to the XetSession method.	2026-03-18 18:07:43 -07:00
Hoyt Koepke	71f8570a0e	Optimize config struct for direct access in python (#706 ) This PR adds in a feature flag, "python" to the xet_runtime package such that when compiled, the XetConfig struct is built to have python getters and setters. This integrates the handling of the config struct directly into the XetConfig struct and the macros used to register the config values, making the handling of values in the python bindings seamless.	2026-03-16 12:23:43 -07:00
Brian Ronan	6232b42591	Xorb download URL debug logs (#714 ) It's a bit annoying to try to ensure our CDN routing is correct. Logging the URL domain for the first fetch term download to the debug logs. Please don't hesitate to recommend alternative approaches.	2026-03-13 16:05:30 -07:00
Di Xiao	e701aeddac	Support XetSession in async context (#694 ) `XetSession` always created its own tokio runtime via `XetRuntime::new_with_config`, and calling `external_run_async_task` panics when already inside a tokio context. This blocked embedding the session in async Rust frameworks. Core strategy: - `RuntimeMode` enum — `Owned` (session created its own thread pool via `XetSessionBuilder::build` or `XetSessionBuilder::build_async` when outside tokio context. Both `_blocking` and async methods are supported. Async methods use an internal `bridge_to_owned` bridge that routes futures onto the owned thread pool, so they work from any executor (tokio, smol, async-std)) vs `External` (session wraps a caller-supplied tokio handle via `XetSessionBuilder::with_tokio_handle` or `XetSessionBuilder::build_async` when inside qualified tokio context. Only async methods may be called; `_blocking` methods return `SessionError::WrongRuntimeMode`. No second thread pool is created). - `XetRuntime::bridge_to_owned` — a new bridge that routes a future onto the owned tokio thread pool from any executor (smol, async-std, futures::executor, non-qualified tokio runtime) by delivering the result via a `tokio::sync::oneshot` channel that can be polled by any async executor. - Async public API — `UploadCommit` and `DownloadGroup` methods (`upload_from_path`, `upload_bytes`, `upload_file`, `commit`, `finish`) are now async fn. Factory methods `XetSession::new_upload_commit` and `new_download_group` are async. 
Example: ``` let session = XetSessionBuilder::new().build_async().await?; // Upload let commit = session.new_upload_commit().await?; let handle = commit.upload_from_path("file.bin".into()).await?; let results = commit.commit().await?; // Download let group = session.new_download_group().await?; let info = XetFileInfo { hash: ..., file_size: ..., }; let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?; let finish_results = group.finish().await?; ``` - Sync wrappers — New `UploadCommitSync` / `DownloadGroupSync` in `xet_session/sync/` expose a fully blocking API for sync Rust and Python (PyO3) callers. Returned by `new_upload_commit_blocking()` and `new_download_group_blocking()`. Example: ``` let session = XetSessionBuilder::new().build()?; // Upload let commit = session.new_upload_commit_blocking()?; let handle = commit.upload_from_path("file.bin".into())?; let results = commit.commit()?; let m = results.values().next().unwrap().as_ref().as_ref().unwrap(); // Download let group = session.new_download_group_blocking()?; let info = XetFileInfo { hash: ..., file_size: ..., }; let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?; let finish_results = group.finish()?; ``` Additional fixes: `download_file_to_path` and `upload_from_path` now canonicalize paths with `std::path::absolute` before enqueuing; task status is only overwritten when still `Running`, preventing a race with concurrent abort(). Fix XET-891 --------- Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>	2026-03-13 14:57:20 -07:00
Hoyt Koepke	45d38a13a9	Code reorganization towards release of xet cargo package (#693 ) This PR is a massive rearrangement of the code base into 5 packages intended for release on cargo. The directories and corresponding packages are: 1. xet_runtime/ — compiles into the xet-runtime package. Contains the runtime, config, and logging management. 2. xet_core_structures/ — compiles into the xet-core-structures package. Contains core data structures for hashing, shards, and xorbs as well as internal data structures that depend on these. 3. xet_client/ — compiles into the xet-client package, contains client code for remotely connecting to the Hugging Face servers. 4. xet_data/ — compiles into the xet-data package, contains the data processing pipeline: chunking/deduplication, file reconstruction, clean/smudge operations, and progress tracking. 5. xet_pkg/ — compiles into the hf-xet package, provides the top-level session-based API for file upload and download with user-facing error categorization. This is the primary package downstream dependencies would use. This also contains a single summary error type, XetError, that translates cleanly into python error types. In addition, the other tools are: - git_xet/ — the git_xet CLI binary crate (location preserved). - hf_xet/ -- the hf_xet python package (location preserved). - simulation/ — the simulation crate for upload scenario benchmarking. - wasm/ -- the wasm objects. The full description — and information for an AI agent to use to update downstream dependencies — is at api_changes/update_260309_package_restructure.md. Summary of moves: - xet_runtime: became xet_runtime::core inside xet_runtime/. - utils: became xet_runtime::utils inside xet_runtime/. - xet_config: became xet_runtime::config inside xet_runtime/. - xet_logging: became xet_runtime::logging inside xet_runtime/. - error_printer: became xet_runtime::error_printer inside xet_runtime/. - file_utils: became xet_runtime::file_utils inside xet_runtime/. 
- merklehash: became xet_core_structures::merklehash inside xet_core_structures/. - mdb_shard: became xet_core_structures::metadata_shard inside xet_core_structures/. - xorb_object: became xet_core_structures::xorb_object inside xet_core_structures/. - cas_client: became xet_client::cas_client inside xet_client/. - hub_client: became xet_client::hub_client inside xet_client/. - cas_types: became xet_client::cas_types inside xet_client/. - chunk_cache: became xet_client::chunk_cache inside xet_client/. - data: became xet_data::processing inside xet_data/. - deduplication: became xet_data::deduplication inside xet_data/. - file_reconstruction: became xet_data::file_reconstruction inside xet_data/. - progress_tracking: became xet_data::progress_tracking inside xet_data/. - xet_session: became xet::xet_session inside xet_pkg/. - Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level into wasm/; internal imports updated, public APIs unchanged.	2026-03-11 12:02:38 -07:00
Rajat Arya	83a28271ea	fix: no timeout for shard uploads (XET-885) (#685 ) Fixes [XET-885](https://linear.app/xet/issue/XET-885/investigate-unsloth-upload-failure-shard-upload-timeout-on-cas) ## Summary Shard uploads to CAS can take a long time due to server-side processing (DynamoDB writes scale with file entry count). The default `read_timeout(120s)` on the reqwest client kills these uploads. Key insight: reqwest's per-request `RequestBuilder::timeout()` does NOT override the client-level `read_timeout()` — they are independent mechanisms polled as separate futures. So the original approach of using per-request timeouts was ineffective. Fix: Create a dedicated `shard_upload_http_client` on `RemoteClient` with no `read_timeout`, built once at construction time and reused for all shard uploads. All other settings (connect timeout, pool config, auth middleware) are identical to the standard client. ## Changes ### `cas_client/src/http_client.rs` - Added `reqwest_client_no_read_timeout()` — creates a reqwest client with no `read_timeout` - Added `build_auth_http_client_no_read_timeout()` — public API wrapping it with middleware - 4 unit tests for the new builder ### `cas_client/src/remote_client.rs` - Added `shard_upload_http_client` field to `RemoteClient` (cfg'd out on wasm) - `upload_shard()` uses the pre-built no-timeout client instead of building one per request ### `cas_client/tests/test_shard_upload_timeout.rs` - Updated: slow server test now asserts success (shard uploads should wait as long as needed) ### `xet_config/src/groups/client.rs` - Removed `shard_read_timeout` config field (no longer needed) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-11 09:05:40 -07:00
Hoyt Koepke	6a5535bc46	Rework simulation pipeline for adaptive concurrency and connection resiliency. (#648 ) This PR replaces the previous collection of scripts around setting up docker containers with a much more nimble and lightweight set of rust scripts and a simple, reusable proxy that can limit bandwidth and simulate congestion. The previous scripts are rewritten to use more reusable components. New tools: - cas_client/src/simulation/network_simulation: A lightweight, in-process network congestion simulation proxy that lives between the LocalServer instance and the RemoteClient instance, allowing simulation tests to run on a network with realistic congestion conditions and a gated bandwidth. This can be controlled dynamically through a LocalTestServer instance. - simulation/: A new package for collecting simulation scripts and analyzing the results. To run the new simulation scripts for the adaptive concurrency on upload, compile in release mode and run one of the scripts in `simulation/src/adaptive_concurrency/scripts/`. Docker is no longer needed to run any of the simulations. The old `cas_client/tests/adaptive_concurrency/` paths were removed.	2026-03-09 10:49:36 -07:00
Hoyt Koepke	e6e0413d90	Naming clarification: A Xorb is a data object, CAS is the remote server. (#680 ) This PR makes the use of the `cas` and `xorb` terms consistent. Previously, "cas" (for content addressed store) could simultaneously refer to either the remote server or the data bytes stored as a collection of chunks. After the renames in this PR, we consistently use `xorb` to refer to the data object and cas to refer to the remote server. This renames quite a few places; to aid in rebasing current work or updating downstream dependencies, this PR includes a file `API_UPDATES.md` that can be fed into an AI agent to quickly and accurately perform the renaming on any downstream dependencies.	2026-03-04 16:05:49 -08:00
Di Xiao	c4a56f889c	XetSession API (#657 ) This PR introduces a new `xet_session` crate that provides a session-based hierarchical API: Users create a XetSession to manage runtime and configuration, then batch uploads into UploadCommit objects and downloads into DownloadGroup objects — each of which runs transfers in the background by the inner XetRuntime. All pub functions are exposed as sync functions - making them easy to use in other languages, e.g. Python, C, etc.	2026-03-03 20:27:39 -08:00
Hoyt Koepke	9b3278a510	Streaming data writer (#656 ) This PR adds an integrated API for streaming downloads, exposing a DownloadStream object that is integrated with the file reconstructor. It also uses the same memory management buffer limiting process to work with the stream object. It also introduces cancellation support to the FileReconstructor to ensure that tasks waiting on a long running download or semaphore wait don't cause things to hang when an error is reported or the user drops the stream.	2026-02-27 15:08:25 -08:00
Di Xiao	c4111eb6da	Feature to monitor client process system usage (#617 ) Introduces a client benchmark utility to track system resource usage (CPU, memory, disk I/O, and network I/O) of a process, so we don't need to write scripts to capture usage stats according to different OS standards. This becomes extremely helpful when I benchmark on Python notebook instances, e.g. Google Colab, where system monitor is not easily accessible or when running a separate monitor script is not easy. # Usage # Users can enable monitoring by setting `HF_XET_SYSTEM_MONITOR_ENABLED` to true, set usage sample interval using `HF_XET_SYSTEM_MONITOR_SAMPLE_INTERVAL`, this outputs metrics to the tracing stream at `INFO` level by default. In addition, these metrics can be redirected to a separate file by setting sample log path using `HF_XET_SYSTEM_MONITOR_LOG_PATH`. # Output # The stats are output in JSON format, which can be queried using tools like `jq`, e.g. 1. Trace of peak memory usage: `jq '.memory.peak_used_bytes' [HF_XET_SYSTEM_MONITOR_LOG_PATH]` 2. Trace of disk write speed: `jq '.disk.average_write_speed' [HF_XET_SYSTEM_MONITOR_LOG_PATH]` 3. Trace of network receive speed: `jq '.network.average_rx_speed' [HF_XET_SYSTEM_MONITOR_LOG_PATH]`	2026-02-27 13:36:31 -08:00
Hoyt Koepke	543914dce1	Scale download buffer memory limit by number of active downloads (#666 ) Currently, the download buffer memory limit is fixed, regardless of the number of downloads in flight. However, as the number of downloads increases, a fixed total size could lead to waiting on individual segments that download out-of-order or don't have enough turnaround time to saturate the output. While writing to disk or the download itself often becomes the bottleneck before these effects, planned features such as streaming files and caching could be affected by this limit. The default formula for the download buffer size now is (2GB + 512MB * number of concurrent downloads) up to a maximum of 8GB (these are adjustable). This PR alleviates this by allocating an additional 512MB of buffer per file, prioritized to the specific download, and releasing that capacity when the file finishes downloading. This is done using the AdjustableSemaphore class, first introduced for the concurrent scaling, which allows the number of total permits in a semaphore to be incremented or decremented; on decrement, permits are discarded upon return until the total permits is at the target number.	2026-02-27 11:35:55 -08:00
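The default buffer-size formula quoted in the #666 entry above can be sketched as a small function. This is only an illustration of the arithmetic: the constant and function names are hypothetical, and it assumes "GB" and "MB" mean binary units (GiB, MiB), which the commit message does not state.

```rust
// Hypothetical names; the values come from the commit message,
// interpreted as binary units (an assumption).
const BASE_BYTES: u64 = 2 * 1024 * 1024 * 1024; // 2 GB base allocation
const PER_DOWNLOAD_BYTES: u64 = 512 * 1024 * 1024; // 512 MB per active download
const MAX_BYTES: u64 = 8 * 1024 * 1024 * 1024; // 8 GB cap

/// Buffer limit for a given number of concurrent downloads:
/// min(base + per_download * n, max), using saturating arithmetic
/// so very large counts cannot overflow.
pub fn download_buffer_limit(active_downloads: u64) -> u64 {
    BASE_BYTES
        .saturating_add(PER_DOWNLOAD_BYTES.saturating_mul(active_downloads))
        .min(MAX_BYTES)
}
```

Under these assumptions the cap is reached at 12 concurrent downloads, since 2 GB + 12 * 512 MB = 8 GB.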
Brian Ronan	17e900a70e	Feat: optional `request_headers` on hf_xet API calls (#661 ) Adding support for setting an optional `request_headers` map on the hf_xet upload and download API calls. This map is augmented with the hf_xet user agent string and is passed along with the requests to xetcas. This PR also adds unit tests for the map-merging behavior to `hf_xet/lib.rs` and adds support for running them with cargo test and in a GitHub Actions CI step.	2026-02-23 14:43:58 -08:00
Hoyt Koepke	5d6371a296	Progress reporting for downloads. (#645 ) This PR adds detailed progress reporting to the download path. - Transfer progress is reported as soon as the download streams start; actual bytes written are reported as the reconstructed file is written out. - Currently, each call to download_file creates a separate progress tracker, but this sets up for download groups with grouped download progress tracking. To support this, the UploadProgressStream was split into three classes; a common StreamProgressReporter and download and upload specific versions. This also allows us to simplify the API to RetryWrapper. More tracking was added to the file reconstruction paths to properly report progress.	2026-02-19 11:06:42 -08:00
Hoyt Koepke	9d9fc72d40	XetCommon struct in the runtime to hold global counters, semaphores. (#650 ) This PR simplifies the current process of working with runtime-associated resources such as a cached Client instance or global resource semaphores. Instead of using macros, all of these are moved into a XetCommon struct that holds them explicitly. The runtime holds an instance of this, and it's initialized with a config struct. In addition, to make the logic around the memory limiting semaphore in file_reconstructor clearer, we added a ResourceLimiter struct that wraps the tokio semaphore but scales the total permits and permit requests appropriately if the total resource quantity is larger than u32::MAX, as can be the case easily. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-12 16:47:07 -08:00
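The permit scaling described for ResourceLimiter in the #650 entry above might look roughly like the following. This is a sketch of the idea only, not the repository's implementation: `PermitScale` and its methods are hypothetical names, and the rounding policy (round requests up so they never under-reserve) is an assumption.

```rust
/// Hypothetical helper that maps a u64 resource quantity (e.g. bytes)
/// onto a permit count that fits in u32.
#[derive(Debug, Clone, Copy)]
pub struct PermitScale {
    /// Resource units represented by one permit.
    pub unit: u64,
    /// Number of permits the underlying semaphore would be created with.
    pub total_permits: u32,
}

impl PermitScale {
    pub fn new(total_resource: u64) -> Self {
        let max_permits = u64::from(u32::MAX);
        // Smallest unit for which total / unit still fits in u32.
        let unit = total_resource.div_ceil(max_permits).max(1);
        let total_permits =
            u32::try_from(total_resource / unit).expect("scaled total fits in u32");
        Self { unit, total_permits }
    }

    /// Permits to request for `amount` resource units, rounded up and
    /// clamped to the total so a single request can always be satisfied.
    pub fn permits_for(&self, amount: u64) -> u32 {
        let needed = amount
            .div_ceil(self.unit)
            .min(u64::from(self.total_permits));
        needed as u32 // safe: clamped to a u32 value above
    }
}
```

When the total fits in u32 the unit is 1 and permits map one-to-one onto resource units; the scaling only takes effect for larger totals.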
Di Xiao	23f68bb798	Upgrade git-xet to 0.2.1 (#653 )	2026-02-12 15:45:34 -08:00
Di Xiao	7d7582c3dd	TemplatedPathBuf utility (#643 ) Implements a utility for configuring path-like parameters. This folds in the existing function `fn normalized_path_from_user_string`, which expands `~` to the home directory and converts to absolute paths, and evaluates a path template by substituting case-insensitive placeholders with corresponding values: - `{pid}` for process ID, - `{timestamp}` for ISO 8601 local timestamp with offset For example, ``` let template = TemplatedPathBuf::new("~/logs/app_{PID}_{TIMESTAMP}.txt"); let path = template.as_path(); // Returns an absolute path like "/home/user/logs/app_12345_2024-01-15T10-30-45-0500.txt" ``` or to be used directly in config groups: ``` crate::config_group!({ ref log_path: Option<TemplatedPathBuf> = None; }); ```	2026-02-11 14:51:16 -08:00
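The case-insensitive placeholder substitution described in the #643 entry above can be illustrated with a standalone function. This is a simplified sketch using only the standard library: `expand_template` is a hypothetical name, the PID and timestamp are passed in rather than read from the process, and `~` expansion and absolute-path conversion are left out. Leaving unknown placeholders untouched is also an assumption.

```rust
/// Replace `{pid}` and `{timestamp}` (matched case-insensitively) in
/// `template`. Unknown placeholders and unmatched braces are kept verbatim.
pub fn expand_template(template: &str, pid: u32, timestamp: &str) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;

    while let Some(open) = rest.find('{') {
        // Copy everything before the opening brace unchanged.
        out.push_str(&rest[..open]);
        let after = &rest[open + 1..];

        match after.find('}') {
            Some(close) => {
                let name = &after[..close];
                if name.eq_ignore_ascii_case("pid") {
                    out.push_str(&pid.to_string());
                } else if name.eq_ignore_ascii_case("timestamp") {
                    out.push_str(timestamp);
                } else {
                    // Not a known placeholder: emit it as written.
                    out.push('{');
                    out.push_str(name);
                    out.push('}');
                }
                rest = &after[close + 1..];
            }
            None => {
                // No closing brace: keep the brace and continue scanning.
                out.push('{');
                rest = after;
            }
        }
    }

    out.push_str(rest);
    out
}
```

In a real caller the PID would typically come from `std::process::id()`, and the timestamp from whatever time library the project uses.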
Hoyt Koepke	e443ee9260	Upgrade package dependencies (#644 ) This PR updates all the package dependencies that would not cause significant API breakages to the current version. The package versions in hf_xet_wasm and hf_xet are also updated to match the versions in the base package. There should be no functional change.	2026-02-11 12:19:29 -08:00
dependabot[bot]	c9a29ffb9e	Bump oneshot from 0.1.11 to 0.1.12 (#616 ) Bumps [oneshot](https://github.com/faern/oneshot) from 0.1.11 to 0.1.12. Changelog (sourced from [oneshot's changelog](https://github.com/faern/oneshot/blob/main/CHANGELOG.md)): [0.1.12] - 2026-01-25, Fixed: - Fix race condition that could lead to use-after-free if the `Receiver` was polled asynchronously, but then dropped before completion. [faern/oneshot#74](https://redirect.github.com/faern/oneshot/pull/74) - Fix race conditions/UB around atomic memory orderings. These were found by running tests under miri. [faern/oneshot#72](https://redirect.github.com/faern/oneshot/pull/72) Commits: - `537d5de` Bump version to 0.1.12 and fix changelog - `9cc3153` Merge branch 'improve-start_recv_ref' - `cc3d6a2` Improve start_recv_ref to be more like regular recv method - `78c7476` Merge branch 'update-documentation' - `38d7f6f` Add clarifying documentation on sender observing RECEIVING state - `21e0310` Synchronize readme with crate documentation in lib.rs - `def74fc` Fix spelling and grammar errors in documentation - `70031a4` Add documentation about how send and receive are synchronized - `d1a1506` Merge branch 'fix-async-recv-drop-use-after-free' - `f19ff7c` Fix Receiver::drop bug causing a race when dropping a polled receiver - Additional commits viewable in [compare view](https://github.com/faern/oneshot/compare/v0.1.11...v0.1.12) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-28 10:28:04 -10:00
Hoyt Koepke	a6630293bb	Hash table with pass-through hasher for MerkleHashes (#611 ) Currently, the rust HashMap uses a randomized hasher for input, which prevents hash collision attacks. However, in our code, we don't need that protection in the client, and a MerkleHash is already a cryptographic hash. This PR adds a MerkleHashMap type that just passes the hash through to the HashMap, providing a substantial speedup: ``` ================================================================= PERFORMANCE SUMMARY (times in ms, lower is better) ================================================================= Test HashMap PassThrough ----------------------------------------------------------------- --- 100K --- Insert 2.1 0.7 Lookup 2.1 1.3 Insert+Lookup 4.4 1.6 Serialize 1.6 0.9 Deserialize 4.3 1.2 --- 10M --- Insert 433.2 204.1 Lookup 615.3 255.5 Insert+Lookup 951.6 460.4 Serialize 117.2 93.4 Deserialize 599.5 89.3 ================================================================= ``` It also replaces HashMap<MerkleHash, ...> everywhere in the code to provide an across-the-board improvement.	2026-01-22 10:42:53 -10:00
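The pass-through hashing idea in the #611 entry above can be sketched with the standard library's `Hasher` and `BuildHasherDefault`. The names `DemoHash`, `PassThroughHasher`, and `DemoHashMap` are hypothetical stand-ins rather than the repository's types, and the choice of the first 8 bytes as the table hash is an assumption made for illustration.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

/// Stand-in for a 32-byte cryptographic hash used as a map key.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct DemoHash(pub [u8; 32]);

impl Hash for DemoHash {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // The key is already uniformly distributed, so feed only its
        // first 8 bytes to the hasher as a single u64.
        let mut prefix = [0u8; 8];
        prefix.copy_from_slice(&self.0[..8]);
        state.write_u64(u64::from_le_bytes(prefix));
    }
}

/// Hasher that returns the u64 it was given instead of mixing it.
#[derive(Default)]
pub struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        // Fallback for key types that write raw bytes: a cheap fold,
        // not intended to resist adversarial inputs.
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(b);
        }
    }

    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
}

/// HashMap keyed by DemoHash that skips the default randomized hasher.
pub type DemoHashMap<V> = HashMap<DemoHash, V, BuildHasherDefault<PassThroughHasher>>;
```

As the commit message notes, this gives up the collision-attack protection of the default hasher, which is only reasonable when keys are already cryptographic hashes and the inputs are not adversarial.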
Hoyt Koepke	128fb6fc42	File download and reconstruction V2 (#603 ) This PR rewrites the download and file reconstruction path. The new version: - Separates the Client connection from the reconstruction, using a new FileReconstructor class to manage the reconstruction. This FileReconstructor is now in the file_reconstructor package. The old version is still present in the client but moved to file_reconstruction_v1/; using V1 or V2 is controlled by reconstruction.use_v1_reconstructor. - Uses a global buffer memory limiter so the space used for downloading all files never exceeds a configurable limit, set to 8gb by default. - Automatically tunes the download parallelism to adapt to the connection conditions. - Automatically tunes the number of terms fetched in order to target all terms downloading within a certain window. - Uses vectored write (configurable) to speed writing to a single file. - Moves the URL refresh logic into the RetryWrapper class. - Uses a for loop with futures to make the logic behind the reconstruction process easier to understand. - Adds extensive testing against the LocalTestServer and LocalClient to cover all the code paths. - Completely removed the retry logic level from the reqwest middleware. Next steps after this: - Implement resume on partial download. - Interface to caching layer. - Add partial-term progress reporting to match the upload path.	2026-01-14 21:02:53 -08:00
Hoyt Koepke	9332ff28b7	Mock CAS server built on LocalClient for testing and simulation. (#602 ) This PR adds a fully functional CAS server built around a LocalClient instance. This allows full testing of the RemoteClient interface without hitting the actual CAS backend. For testing, it can either be run as a standalone executable, or it can be started using a LocalTestServer instance that exposes both a RemoteClient interface as a client and direct access to the state through a stored LocalClient instance. Numerous tests are added to cover existing functionality as well as the new server. (Also, this exposed that when running many tests with wiremock or this server, the testing would often hit a "Too many open files" error; this was fixed by consolidating these tests to reduce the number of separate testing servers running at once.)	2026-01-09 12:39:52 -08:00
Di Xiao	d15295eff3	Clean up dependencies (#595 ) - Remove dependencies from Cargo.toml files that are not used. - Move dependencies directly referencing crates.io from crate level Cargo.toml to the workspace Cargo.toml. - Fix using RemoteClient in WASM: AdaptiveConcurrencyController uses `tokio::time::Instant` which wraps `std::time::Instant` and is not available in WASM. - Add [cargo-machete](https://github.com/bnjbvr/cargo-machete) to CI to check unused dependencies. No functionality change.	2025-12-15 15:26:02 -08:00
Di Xiao	74d7c5926c	Clean up dead code (#593 ) A lot of dead code has been left in xet-core due to `#![allow(dead_code)]` in a couple of places. This PR removes it and fixes the corresponding linting errors. No functionality change.	2025-12-11 10:55:28 -08:00
Hoyt Koepke	9cf0e1e35e	Automatic concurrency adjustment for transfers (#410 ) Adaptive Concurrency Controller This PR introduces adaptive concurrency control for transfers based on an adaptive ML model of the network connection. It is currently implemented only for the upload path and gated behind the environment variable HF_XET_ENABLE_ADAPTIVE_CONCURRENCY, which is set to false by default. Future PRs will integrate this into the download path and then enable it by default with sufficient testing. The `AdaptiveConcurrencyController` struct dynamically adjusts concurrency for upload and download operations by continuously adapting to network conditions. It tracks two key signals: 1. Observed bandwidth via an online linear regression predictor 2. Success ratio of recent transfers using configurable success/failure thresholds Transfers are considered successful if they complete within a statistically reasonable time given the model (less than the 90% quantile) and below the configured max RTT for healthy operation (by default 90s). The model then increases the concurrency when the success ratio is high (>0.8) and the RTT prediction stays below a target RTT (60s default). It decreases the concurrency when the success ratio drops below a threshold (<0.5) or the transfers exceed a maximum healthy RTT (90s default). To prevent oscillations, it also enforces a minimum delay between adjustments, set to 500ms by default. The RTT prediction is implemented using an exponentially-weighted online linear regression model that predicts round-trip time (RTT) based on transfer size and concurrency level. The model fits: ``` duration_secs ≈ a + b * (size_bytes * concurrency) ``` Internally this is implemented using `ExpWeightedOnlineLinearRegression`, which maintains exponentially-decaying sufficient statistics to predict the mean and standard deviation of the RTT. 
The exponential decay of the process, with the half-life of an observation set to 60 data points, allows it to adapt to slowly changing network conditions. This model is used to predict whether adding concurrency will cause a large transfer of 64MB to take longer than 60s to complete, in which case no concurrency is added. Upon a successful transfer, this model is used to assess whether congestion might be causing completed transfers to take longer than expected; if the actual RTT is above the 90% quantile, it is reported as a failure to the success tracker; a statistically significant number of recent failures will prevent the concurrency from increasing, and a string of failures will cause the controller to lower the concurrency. The controller tracks the success ratio (fraction of successful transfers) using an exponentially weighted moving average with a default half-life of 8 observations. This allows us to determine whether recent transfers have hit congestion, as long RTTs are recorded as failures. More than 80% of the recent transfers have to be successes to raise the concurrency, and if less than 50% are successful, the concurrency is dropped. By default, the model starts at the minimum concurrency and increases as soon as data reliably predicts the RTT. All bounds are controlled by config variables.	2025-12-01 16:43:24 -08:00
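The success-ratio tracking and the raise/lower thresholds described in the #410 entry above can be sketched as follows. This is a simplified illustration, not the `AdaptiveConcurrencyController` itself: `SuccessTracker`, `Adjustment`, and `decide` are hypothetical names, the thresholds are the defaults quoted in the message, and the RTT regression model and the minimum delay between adjustments are omitted.

```rust
/// Exponentially weighted success ratio with a half-life measured in
/// observations (hypothetical type, for illustration only).
pub struct SuccessTracker {
    decay: f64,
    weighted_successes: f64,
    total_weight: f64,
}

impl SuccessTracker {
    pub fn new(half_life: f64) -> Self {
        // After `half_life` further observations, a sample's weight halves.
        Self { decay: 0.5f64.powf(1.0 / half_life), weighted_successes: 0.0, total_weight: 0.0 }
    }

    pub fn record(&mut self, success: bool) {
        let sample = if success { 1.0 } else { 0.0 };
        self.weighted_successes = self.weighted_successes * self.decay + sample;
        self.total_weight = self.total_weight * self.decay + 1.0;
    }

    /// None until at least one observation has been recorded.
    pub fn ratio(&self) -> Option<f64> {
        (self.total_weight > 0.0).then(|| self.weighted_successes / self.total_weight)
    }
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Adjustment {
    Increase,
    Decrease,
    Hold,
}

/// Decision rule using the default thresholds quoted in the commit message.
pub fn decide(success_ratio: f64, predicted_rtt_secs: f64, last_rtt_secs: f64) -> Adjustment {
    const INCREASE_RATIO: f64 = 0.8;
    const DECREASE_RATIO: f64 = 0.5;
    const TARGET_RTT_SECS: f64 = 60.0;
    const MAX_HEALTHY_RTT_SECS: f64 = 90.0;

    if success_ratio < DECREASE_RATIO || last_rtt_secs > MAX_HEALTHY_RTT_SECS {
        Adjustment::Decrease
    } else if success_ratio > INCREASE_RATIO && predicted_rtt_secs < TARGET_RTT_SECS {
        Adjustment::Increase
    } else {
        Adjustment::Hold
    }
}
```

With a half-life of 8, eight successes followed by eight failures leave a ratio of about 1/3, which is below the 0.5 threshold and so would lower the concurrency under this rule.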
Di Xiao	eeee211e59	Upgrade git-xet version (#574 )	2025-11-21 10:05:02 -08:00
Di Xiao	b5563ecd93	Better support of authentication through SSH (#553 ) This PR finally enables `git-xet` on Windows authenticating to remote Git server using SSH URL. This is a crucial part as access tokens to the CAS server expire every 900 s and `git-xet` needs to re-authenticate with the Git server by itself during push/pull (whereas the first authentication is handled by `git-lfs`). This uses the same SSH connect utility to authenticate over SSH repo remote URL on both *nix OS and Windows. Resolves XET-731	2025-11-20 12:09:57 -08:00
Di Xiao	5f77ffc46a	Integration test for ssh access on Windows (#566 ) This PR builds on top of https://github.com/huggingface/xet-core/pull/565 and adds an integration test for access to "ssh" and "sh" on Windows through the "git" (-> "git-lfs") -> "git-xet" call chain. Out of all the ssh variants, access to programs like "plink", "putty", "tortoiseplink" or "simple" should be given by the env var `$GIT_SSH_COMMAND` or `$GIT_SSH`, or by git config entry `core.sshCommand`. Direct access to the most commonly used utility "ssh" and indirect access to "ssh" via "sh -c" on Windows is provided by the "git" (-> "git-lfs") -> "git-xet" call chain, see git_xet/tests/test_ssh.rs for details.	2025-11-20 03:22:19 -08:00
Di Xiao	075a9c96c0	Add ssh connect utility according to git standard (#565 ) This implements a utility to help set up SSH connection according to Git standards. 1. Env vars `$GIT_SSH_COMMAND`, `$GIT_SSH` and git config entry `core.sshCommand` define which ssh executable to use for an SSH connection. `$GIT_SSH_COMMAND` takes precedence over `core.sshCommand` and both are interpreted by the shell (e.g. `GIT_SSH_COMMAND = "ssh -i ~/.ssh/key"`), which allows additional arguments to be included. They both take precedence over `$GIT_SSH`, which on the other hand must be just the path to a program (which can be a wrapper shell script, if additional arguments are needed). When none of these is given, the default ssh program to use is `ssh`. 2. Env var `$GIT_SSH_VARIANT` takes precedence over git config entry `ssh.variant` and they both define whether `$GIT_SSH`/`$GIT_SSH_COMMAND`/`core.sshCommand` refer to OpenSSH, plink/putty or tortoiseplink, or instruct git to automatically detect the ssh program type. Valid values are "ssh" (to use OpenSSH options), "plink", "putty", "tortoiseplink", "simple" (no options except the host and remote command). The default auto-detection can be explicitly requested using the value "auto". Any other value is treated as "ssh". This implementation follows the git standard and how the same functionality is handled in git-lfs (`071e19e8ea/ssh/ssh.go (L41)`).	2025-11-19 12:43:16 -08:00
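The precedence rules in the #565 entry above can be written as two pure functions. This is a sketch of the lookup order only: the type and function names are hypothetical, the caller is assumed to have already read the environment variables and git config entries, and empty values are not treated specially.

```rust
/// How the selected ssh program should be invoked (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SshProgram {
    /// From `$GIT_SSH_COMMAND` or `core.sshCommand`: interpreted by the
    /// shell, so it may carry extra arguments.
    ShellCommand(String),
    /// From `$GIT_SSH` or the default: a plain path to an executable.
    Executable(String),
}

/// `$GIT_SSH_COMMAND` > `core.sshCommand` > `$GIT_SSH` > "ssh".
pub fn resolve_ssh_program(
    git_ssh_command: Option<&str>,
    core_ssh_command: Option<&str>,
    git_ssh: Option<&str>,
) -> SshProgram {
    match git_ssh_command.or(core_ssh_command) {
        Some(cmd) => SshProgram::ShellCommand(cmd.to_string()),
        None => SshProgram::Executable(git_ssh.unwrap_or("ssh").to_string()),
    }
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SshVariant {
    Auto,
    Ssh,
    Plink,
    Putty,
    TortoisePlink,
    Simple,
}

/// `$GIT_SSH_VARIANT` takes precedence over `ssh.variant`. Unset means
/// auto-detection; unrecognized values are treated as "ssh".
pub fn resolve_ssh_variant(env_variant: Option<&str>, config_variant: Option<&str>) -> SshVariant {
    match env_variant.or(config_variant) {
        None | Some("auto") => SshVariant::Auto,
        Some("plink") => SshVariant::Plink,
        Some("putty") => SshVariant::Putty,
        Some("tortoiseplink") => SshVariant::TortoisePlink,
        Some("simple") => SshVariant::Simple,
        Some(_) => SshVariant::Ssh,
    }
}
```

Keeping the lookup free of I/O makes the precedence order easy to unit test; a caller could pass `std::env::var("GIT_SSH_COMMAND").ok().as_deref()` and the corresponding git config values.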
Hoyt Koepke	a5ea819ccb	Rework of the constant configuration system. (#564 )	2025-11-19 11:58:53 -08:00
Assaf Vayner	cd64baa6ca	separating output providers, sequential output providers (#528 ) This PR refactors how we pass the catch-all "OutputProvider" to the download mechanism. It separates the download system into supporting "Sequential" and "Seeking" operations: - Seeking e.g. opening the file multiple times and seeking to location -- this is the standard writing mechanism hf_xet uses today. - Sequential e.g. opening a file once and writing data in order -- this is to be used in a set of upcoming PRs/features to use the parallel-download/sequential-write mechanism to support writing to Stdout and to a channel buffer in memory. To support an in-memory channel with backpressure, the Channel{Writer, Stream, Reader} are introduced (re-introduced?) in utils. This could be particularly useful in the mount functionality.	2025-10-29 14:12:24 -07:00
Hoyt Koepke	2fc772e6d0	Shard utilities needed for GC pass and server-side xorb rewriting. (#532 ) This PR adds a utility that rewrites a shard to include only the relevant xorb information, dropping unreferenced file information. In addition, to preserve the global dedup tracking information associated with the files, this PR also adds a backwards-compatible flag to the chunk metadata that marks a specific chunk as global dedup eligible. This allows the global dedup information to be tracked independently of the file metadata.	2025-10-29 12:10:57 -07:00
Hoyt Koepke	3096b3f9c3	Test suite for directory logging functionality (#536 )	2025-10-24 10:06:26 -07:00
Hoyt Koepke	69f23d630e	Logging to directory + log file management; default to log directory for hf_xet (#502 ) This PR switches the default logging to log events to a file in '~/.cache/huggingface/xet/logs' (or 'xet/logs' under the specified cache directory if not `~/.cache/huggingface/`). In this directory, log files older than 2 weeks are cleaned up on process start, and if the total size of files in the directory is larger than 1gb, then log files are deleted by age to get the directory size under 1gb. Log files are named with a timestamp and PID; by default, logs newer than 1 day or logs with an active associated PID are never deleted. All of these are user configurable constants.	2025-10-20 14:35:43 +02:00
Assaf Vayner	c55fabb6bf	hashing and chunking example tools (#496 ) Adds some basic example tools (compiled with `cargo build --examples` on the `data` crate) to compute hashes and chunk boundaries.	2025-09-26 12:49:55 -07:00


148 Commits