Commit Graph

148 Commits

Author SHA1 Message Date
Assaf Vayner
5868f64ab9 fixing some issues identified in cargo audit (#802)
CI for hf-hub runs `cargo audit` and found many issues coming through
hf-xet's transitive deps. This PR attempts to solve some of them (not
necessarily all of them).

Main changes:
- dropped derivative and reqwest-retry
- replaced bincode with postcard, only used in testing
- upgraded xet-core rand usage
- added an audit CI step that ignores some issues we can't easily fix.
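
Dropping reqwest-retry means the decision "is this response worth
retrying?" has to be made locally. A minimal sketch of such a
status-code classifier, using only std: the function name
`classify_status` and the exact status ranges are assumptions for
illustration, not the code in this PR.

```rust
/// Outcome of classifying a completed HTTP request for retry purposes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Retryable {
    /// Likely temporary; retrying may succeed.
    Transient,
    /// The request itself is at fault; retrying will not help.
    Fatal,
}

/// Hypothetical helper: returns `None` when the response counts as a success.
pub fn classify_status(status: u16) -> Option<Retryable> {
    match status {
        200..=299 => None,
        // Request timeout and rate limiting are assumed to be retryable.
        408 | 429 => Some(Retryable::Transient),
        // Server-side errors are assumed to be retryable.
        500..=599 => Some(Retryable::Transient),
        // Everything else (other 4xx, unexpected codes) is treated as fatal.
        _ => Some(Retryable::Fatal),
    }
}
```

Keeping the classifier a pure function of the status code makes any
behavior change relative to the old dependency easy to pin down in a
unit test.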





<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk because it removes `reqwest-retry`/`derivative` and
replaces part of the retry classification logic with an in-house
equivalent, which could subtly change HTTP retry behavior; the remaining
changes are dependency/version bumps and test-only serialization swaps.
> 
> **Overview**
> Adds a new CI `cargo audit` job and introduces `.cargo/audit.toml` to
ignore a small set of **dev-only** RustSec advisories with documented
rationale.
> 
> Reduces audit surface by dropping `derivative` (manual `Debug` impl
for `AuthConfig`) and removing `reqwest-retry`, replacing its
status-code classification with a local `Retryable` enum +
`default_on_request_success` helper in `RetryWrapper`.
> 
> Updates workspace deps (notably `rand` to `0.10` and `rand_distr` to
`0.6`) and adjusts call sites to the newer `rand` APIs (`RngExt`
imports, minor test/bench tweaks). Test-only binary serialization
switches from `bincode` to `postcard` (and updates affected tests), with
corresponding lockfile updates across crates.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
26377f4a1c. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-04-20 14:49:48 -07:00
Assaf Vayner
08377eab3c Upgrade crates version to 1.5.1 (#782)
## Summary
- Bump workspace version from 1.5.0 to 1.5.1
- Update all internal dependency version references to match

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk version-only bump across workspace manifests and lockfiles
with no code/behavior changes in the diff.
> 
> **Overview**
> Bumps the workspace package version from `1.5.0` to `1.5.1` and aligns
internal crate dependency version pins (`xet-runtime`,
`xet-core-structures`, `xet-client`, `xet-data`, `hf-xet`) to match.
> 
> Updates lockfiles (`Cargo.lock` plus `hf_xet` and wasm lockfiles) so
published/embedded artifacts resolve to the `1.5.1` crate set (including
bringing wasm lockfiles up to `1.5.1`).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
e8563700a0. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-04-06 14:03:02 -07:00
Di Xiao
1f7400cc4b Drop xet-core-structure from xet-runtime dev dep (#776) 2026-04-03 13:56:26 -07:00
Di Xiao
950807ba43 Upgrade crates version to 1.5.0 (#775)
Update workspace version to `1.5.0` and intra-workspace dependency
versions to `1.5.0`
2026-04-03 13:39:50 -07:00
Di Xiao
1f0918c33e Refactor XetSession commit / group CAS endpoint and auth configuration (#771)
There's no publicly documented Xet CAS endpoint. To interact with Xet
CAS, all public clients need to obtain the CAS endpoint from the same
route that issues the CAS token.

Currently users need to
1. first construct a CAS token URL with respect to a certain operation
("read" or "write", targeted repo type, targeted repo, targeted
revision),
2. send a request to this URL to get a CAS token and CAS endpoint,
3. use the CAS endpoint to build a `XetSession`,
4. use the `XetSession` instance and the CAS token and CAS token URL to
build an upload or download group.

This is a rather complicated setup. This PR addresses this blocker by
eagerly "refresh"-ing the CAS token if no CAS endpoint is provided, so
users can
1. build a `XetSession`,
2. construct a CAS token URL with respect to a certain operation ("read"
or "write", targeted repo type, targeted repo, targeted revision),
3. use the `XetSession` instance and the CAS token URL to build an
upload or download group.

So effectively, there will be two common patterns:
Pattern A: endpoint known ahead of time — no eager refresh, token_info
is used as-is
```
let session = XetSessionBuilder::new().build()?;
let commit = session
    .new_upload_commit()?
    .with_endpoint(cas_url)
    .with_token_info(token, expiry)
    .with_token_refresh_url(refresh_url, /*Auth headers*/)
    .build_blocking()?;
```

Pattern B: endpoint unknown — build call fetches it; token_info seeded
from response
```
let session = XetSessionBuilder::new().build()?;
let commit = session
    .new_upload_commit()?
    .with_token_refresh_url(token_refresh_url, /*Auth headers*/)
    .build_blocking()?;
```

Other changes:
1. `with_endpoint()` and `with_custom_headers()` configuration is moved
from the `XetSession` level down to the operation level, because we can
actually have multiple operations with different CAS endpoints co-exist
in the same session instance.
2. Builder for different operations `XetUploadCommit`,
`XetFileDownloadGroup`, `XetDownloadStreamGroup` are refactored to share
common code under `struct AuthGroupBuilder<G>`.
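
The choice between Pattern A and Pattern B can be sketched as a single
resolution step at build time. This is a std-only illustration: the
types `AuthSettings`, `TokenResponse`, `ResolvedAuth` and the function
`resolve_auth` are hypothetical names, and the HTTP call is replaced by
a caller-supplied closure.

```rust
/// Assumed shape of what the token refresh route returns.
#[derive(Debug, Clone, PartialEq)]
pub struct TokenResponse {
    pub endpoint: String,
    pub token: String,
    pub expiry: u64,
}

/// Settings collected by a builder before `build` is called (hypothetical).
#[derive(Debug, Default)]
pub struct AuthSettings {
    pub endpoint: Option<String>,
    pub token_info: Option<(String, u64)>,
    pub refresh_url: Option<String>,
}

/// Resolved auth state for one upload commit or download group.
#[derive(Debug, PartialEq)]
pub struct ResolvedAuth {
    pub endpoint: String,
    pub token_info: Option<(String, u64)>,
}

/// `fetch` stands in for the HTTP request to the token refresh URL.
pub fn resolve_auth<F>(settings: AuthSettings, fetch: F) -> Result<ResolvedAuth, String>
where
    F: FnOnce(&str) -> Result<TokenResponse, String>,
{
    match settings.endpoint {
        // Pattern A: endpoint known, no eager refresh, token_info used as-is.
        Some(endpoint) => Ok(ResolvedAuth { endpoint, token_info: settings.token_info }),
        // Pattern B: endpoint unknown, fetch it and seed token_info from the response.
        None => {
            let url = settings
                .refresh_url
                .ok_or_else(|| "no endpoint and no token refresh URL".to_string())?;
            let resp = fetch(&url)?;
            Ok(ResolvedAuth {
                endpoint: resp.endpoint,
                token_info: Some((resp.token, resp.expiry)),
            })
        }
    }
}
```

With neither an endpoint nor a refresh URL the sketch returns an error,
since there is nothing to build the operation from.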
2026-04-02 11:07:07 -07:00
Assaf Vayner
20198a9081 Remove prometheus dependency and metrics (#769)
## Summary
- Remove the `prometheus` crate dependency from the workspace and
`xet_data`
- Delete `prometheus_metrics.rs` which defined 3 IntCounter metrics (CAS
bytes produced, bytes cleaned, bytes smudged)
- Remove metric increment calls from `file_upload_session.rs` and
`file_download_session.rs`
- Fix Windows CI flake: redb "Database already open" error in
`test_single_large`

These metrics were collected but never exposed via any HTTP endpoint or
text encoder, making them effectively dead code.

## Test plan
- [x] `cargo +nightly fmt` — clean
- [x] `cargo clippy --all-targets` — no new warnings
- [x] `cargo test -p xet-data` — 17/17 pass
- [x] `cargo test -p xet-data --features simulation --test
test_clean_smudge` — 14/14 pass (including `test_single_large`)
- [x] WASM builds (`hf_xet_wasm`, `hf_xet_thin_wasm`) — both succeed

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: this removes unused Prometheus metrics plumbing and related
dependencies without changing the core upload/download logic. Main risk
is loss of any downstream reliance on these counters at build time
(e.g., feature flags or imports).
> 
> **Overview**
> Removes the `prometheus` dependency from the workspace and `xet_data`,
and updates lockfiles accordingly (including WASM-related lockfiles).
> 
> Deletes `xet_data`’s `prometheus_metrics` module and strips the
associated counter increments from `FileUploadSession` and
`FileDownloadSession`, leaving the data processing behavior unchanged
aside from no longer recording these metrics.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
c6c866b7ca. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-04-01 14:56:58 -07:00
Assaf Vayner
7d97aa3066 Replace heed (LMDB) with redb in local CAS simulation (#766)
This is an optional change. Basically, heed pulls in a bunch of deps and
also uses LMDB, which may require more compilation/linking steps in
tests. We use it for such a small subset of operations in testing that I
thought we might try an even thinner Rust-native dep instead. That's
what redb is.

## Summary
- Replace `heed` (C LMDB bindings) with `redb` (pure Rust embedded KV
store) in `LocalClient`
- Removes C dependency, `unsafe` block, Windows retry workaround, and
custom `Drop` impl
- Introduces `RedbHash` newtype wrapper for `MerkleHash` to satisfy
orphan rules on redb's `Key`/`Value` traits
- Net reduction of ~130 lines; all 147 existing tests pass
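
The orphan-rule point can be shown without redb. A crate may implement
a trait for a type only if it defines the trait or the type, so a
foreign trait cannot be implemented directly on a foreign hash type. In
this std-only sketch `std::fmt::Display` stands in for the foreign
trait and `[u8; 32]` for the foreign type; `HashKey` is a hypothetical
name playing the role of `RedbHash`.

```rust
use std::fmt;

/// Stand-in for a hash type that is defined in another crate.
pub type MerkleHashBytes = [u8; 32];

/// Local newtype: because it is defined here, foreign traits may be implemented on it.
pub struct HashKey(pub MerkleHashBytes);

impl From<MerkleHashBytes> for HashKey {
    fn from(bytes: MerkleHashBytes) -> Self {
        HashKey(bytes)
    }
}

// `impl fmt::Display for [u8; 32]` would be rejected by the compiler;
// implementing it on the wrapper is allowed.
impl fmt::Display for HashKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for byte in &self.0 {
            write!(f, "{byte:02x}")?;
        }
        Ok(())
    }
}
```

The wrapper adds no runtime cost; it only gives the crate a type it
owns to hang the trait implementations on.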

## Test plan
- [x] `cargo check -p xet-client --features simulation` — clean
- [x] `cargo test -p xet-client --features simulation` — 147 passed, 0
failed
- [x] `cargo clippy -p xet-client --features simulation` — clean
- [x] `cargo +nightly fmt` — clean

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Swaps the embedded KV store used for shard dedup/deletion metadata in
the local CAS simulation, which can affect test behavior and on-disk
state/locking semantics (especially with concurrent clients). Scope is
contained to simulation/test code and dependency graph changes.
> 
> **Overview**
> Switches `LocalClient`’s disk-backed global-dedup and file deletion
status storage from `heed`/LMDB to `redb`, including new `RedbHash`
serialization, `TableDefinition`s, and updated read/write transaction
flows.
> 
> Adds a small global database-handle cache to avoid `redb`
exclusive-lock conflicts across multiple `LocalClient` instances, and
removes the prior LMDB-specific open/retry logic and custom `Drop` close
path. Workspace dependencies/lockfiles are updated to drop
`heed`/LMDB-related crates and add `redb`, and `.gitignore` now ignores
`.worktrees/`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
02d39864d9. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-31 15:23:18 -07:00
Di Xiao
15011cb230 XetSession uses direct token refresh route instead of a callback (#751)
This PR makes two significant, breaking API redesigns:
1. Auth tokens move from session-level (shared by all operations) to
per-operation level (per `UploadCommit`, `FileDownloadGroup`, and
`DownloadStreamGroup`). This enables uploads and downloads from the same
session to carry different access-level tokens — a sensible design for
HF's write-vs-read token split.
2. Instead of letting users provide a callback to refresh tokens, the
new API lets users provide a token refresh URL and access credentials
in an HTTP header map.

### Why
1. CAS JWTs are short-lived, but `XetSession` is intended to be held
for a long time, so it makes more sense to configure CAS auth at the
operation level (`UploadCommit`, `FileDownloadGroup`, or
`DownloadStreamGroup`), where it is discarded once the operation is done.
2. For different access levels (write vs. read) and different operation
targets (repo and commit), the CAS JWT and the token refresh URL will
differ. `UploadCommit`, `FileDownloadGroup`, and `DownloadStreamGroup`
each also function as a single auth group.
3. Providing a URL is considered easier than writing a callback, and is
safer when crossing the Python-Rust GIL boundary.

Examples:
```
// Upload token (write access)
let mut upload_headers = HeaderMap::new();
upload_headers.insert("Authorization", "Bearer hub-write-token".parse().unwrap());
let commit = session
    .new_upload_commit()?
    .with_token_info("CAS_WRITE_JWT", 900)
    .with_token_refresh_url("https://huggingface.co/api/repos/token/write", upload_headers)
    .build_blocking()?;
```
```
// File download token (read access)
let mut dl_headers = HeaderMap::new();
dl_headers.insert("Authorization", "Bearer hub-read-token".parse().unwrap());
let group = session
    .new_file_download_group()?
    .with_token_info("CAS_READ_JWT", 900)
    .with_token_refresh_url("https://huggingface.co/api/repos/token/read", dl_headers)
    .build_blocking()?;
```

Secondary changes include:

- `DirectRefreshRouteTokenRefresher` consolidated into
`xet_client::cas_client::auth`.
- HTTP client module moved from `cas_client` to `xet_client::common` for
shared use between `xet_client::cas_client` and
`xet_client::hub_client`.
- New `DownloadStreamGroup` type (streaming downloads moved off
`XetSession`).
- Fix Session ID type regression: this was fixed once in
https://github.com/huggingface/xet-core/pull/738 but regressed again,
seems AI agents don't learn.
- HTTP client cache key now incorporates custom headers
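
Why the cache key needs the headers: two operations on the same session
may carry different `Authorization` values, so a cached client keyed
only by its base settings could be reused with the wrong credentials. A
rough std-only sketch of a header-aware key follows; `client_cache_key`
and its normalization rules are assumptions, not the implementation in
this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical cache key: base client settings plus the custom headers.
pub fn client_cache_key(base: &str, headers: &[(&str, &str)]) -> u64 {
    // Header names are case-insensitive and insertion order should not matter,
    // so lowercase the names and sort before hashing.
    let mut normalized: Vec<(String, &str)> = headers
        .iter()
        .map(|(name, value)| (name.to_ascii_lowercase(), *value))
        .collect();
    normalized.sort();

    let mut hasher = DefaultHasher::new();
    base.hash(&mut hasher);
    normalized.hash(&mut hasher);
    hasher.finish()
}
```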
2026-03-30 08:39:25 -07:00
Assaf Vayner
9c0cb6e4c8 Reduce workspace dependencies (batches 1-3) (#746)
## Summary

- **Remove unused dependencies**: warp (zero imports), paste (zero
invocations), tower-service (zero imports), and heed misplacement in
xet_core_structures
- **Move mockall to dev-dependencies** in xet_client by gating
`#[automock]` with `#[cfg_attr(test, automock)]`
- **Feature-gate simulation module** behind `simulation` cargo feature
in xet_client, making axum, heed, humantime, futures-util,
human-bandwidth, and tower-http optional
- **Replace duration-str with humantime** (~2 deps vs ~78 transitive
deps) across xet_runtime, xet_client simulation, and simulation crate

## Impact

| Metric | Before | After | Change |
|---|---|---|---|
| hf-xet production deps | 371 | 321 | **-50** |
| Workspace total | 575 | 569 | -6 |

## Test plan

- [x] `cargo check --workspace` passes
- [x] `cargo check -p hf-xet` passes (without simulation feature — key
validation)
- [x] `cargo test --workspace` — all tests pass (4 pre-existing auth
test failures in git_xet unrelated to this PR)
- [x] `cargo tree -p hf-xet -e normal --prefix none | sort -u | wc -l`
confirms 321 deps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk because it changes dependency graph and Cargo feature
gating (notably `xet-client` simulation modules and CI test features),
which can affect build/test behavior across targets despite minimal
runtime logic changes.
> 
> **Overview**
> Reduces workspace dependency surface by removing `duration-str`
(replaced with `humantime`) and trimming other transitive-heavy crates;
updates lockfiles accordingly across the workspace, `hf_xet`, and WASM
builds.
> 
> Introduces/propagates a `simulation` Cargo feature: `xet-client`’s
simulation server-related deps become optional and are only
compiled/exported when `feature = "simulation"` is enabled; `git_xet`
adds a `simulation` feature that forwards to dependent crates, and CI
now runs tests with `strict simulation git-xet-for-integration-test`.
> 
> Minor repo hygiene updates include ignoring `.claude/` in `.gitignore`
and wiring the `simulation` crate to depend on `xet-client` with
`features = ["simulation"]` (plus swapping its duration parsing helper
to `humantime`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
6abc194398. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:54:36 -07:00
Hoyt Koepke
69962587b5 Composable Hash Functionality (#745)
Currently, computing aggregate chunk hashes across independently
processed ranges requires recomputing over the full concatenated chunk
list. This PR introduces ChunkHashRange, a composable representation
that can hash contiguous partial ranges and merge them while preserving
equivalence with the existing xorb_hash / file_hash behavior. This
allows an intermediate representation of the hash ranges that can be
merged in arbitrary order to get the final hash. It also uses O(log(n))
storage and all operations are done in linear time. Serialization and
Deserialization are fully supported.

The main use case for this is in doing partial file edits. Previously,
to edit the middle of a large file, the client would have to know all
the hashes for the full file, even if only a few in the middle were
changed. With a large file, this can still be 100s of MB; the chunk
metadata size is roughly 1/1000 of the data size. With this change, we
can now transmit the unmodified parts of a file in O(log(n)) storage but
still be able to build the entire file hash; now a sequence of 10M
chunks takes the equivalent storage of ~500 chunks or so.
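
To give a feel for how a hash over n items can be carried in O(log(n))
storage, here is a toy append-only accumulator. It is not
`ChunkHashRange` and does not reproduce xorb_hash / file_hash: it
assumes a plain binary tree, uses std's `DefaultHasher` as a stand-in
hash, and only supports appending on the right, not merging arbitrary
ranges.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for hashing two child nodes into their parent.
fn combine(left: u64, right: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (left, right).hash(&mut hasher);
    hasher.finish()
}

/// peaks[i] holds the root of a complete subtree of 2^i leaves, if present.
#[derive(Default)]
pub struct PeakAccumulator {
    peaks: Vec<Option<u64>>,
}

impl PeakAccumulator {
    /// Append one leaf hash; works like incrementing a binary counter.
    pub fn push(&mut self, leaf: u64) {
        let mut node = leaf;
        for slot in self.peaks.iter_mut() {
            match slot.take() {
                // Empty level: park the node here and stop.
                None => {
                    *slot = Some(node);
                    return;
                }
                // Occupied level: merge and carry the parent upward.
                Some(left) => node = combine(left, node),
            }
        }
        self.peaks.push(Some(node));
    }

    /// Number of hashes currently held; at most floor(log2(n)) + 1.
    pub fn stored_hashes(&self) -> usize {
        self.peaks.iter().flatten().count()
    }
}
```

After 1000 pushes the accumulator holds 6 hashes, one per set bit in
the binary representation of 1000.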

Along the way, we also added in an optimization for the merge step to
avoid an allocation, yielding a 2x speedup.

---------

Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
2026-03-27 08:38:59 -07:00
Hoyt Koepke
c90f0a7bd9 Session API Polish; unify task handling/cancellation behavior. (#747)
Previously, upload and download paths each had their own ad-hoc state
tracking, cancellation, and runtime bridging logic. TaskRuntime
consolidates this into a single type that owns a CancellationToken tree,
tracks Running/Finished/Cancelled state with recursive propagation to
children, and provides bridge_async/bridge_sync wrappers that
automatically wire up tokio::select! cancellation. Session →
commit/group → per-file handles form a parent-child token tree, so
aborting a session cancels all descendant work.

The upload path gets new UploadFileHandle and UploadStreamHandle wrapper
types (replacing the old UploadTaskHandle), with inner/wrapper pattern
for cheap cloning. UploadCommit::commit() now returns a CommitReport
containing aggregate dedup metrics, progress, and per-file FileMetadata.
The download path mirrors this structure: FileDownloadGroup uses
TaskRuntime for state gating and owns bespoke DownloadTaskHandle
instances with per-task status and result access.
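
The parent-child cancellation idea can be sketched with std only. This
is not the `TaskRuntime` or `CancellationToken` code: `CancelNode` is a
hypothetical type that only tracks a cancelled flag, with no async
wake-ups and no Running/Finished state.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct Inner {
    cancelled: AtomicBool,
    children: Mutex<Vec<CancelNode>>,
}

/// One node in the session -> commit/group -> per-file tree.
#[derive(Clone, Default)]
pub struct CancelNode {
    inner: Arc<Inner>,
}

impl CancelNode {
    pub fn new() -> Self {
        Self::default()
    }

    /// Create a child; a child of an already cancelled parent starts cancelled.
    pub fn child(&self) -> CancelNode {
        let child = CancelNode::new();
        // Hold the lock while checking the flag so a concurrent cancel()
        // either sees this child in the list or the child sees the flag.
        let mut children = self.inner.children.lock().unwrap();
        if self.is_cancelled() {
            child.cancel();
        }
        children.push(child.clone());
        child
    }

    /// Cancel this node and, recursively, every descendant.
    pub fn cancel(&self) {
        if self.inner.cancelled.swap(true, Ordering::SeqCst) {
            return; // already cancelled
        }
        let children = self.inner.children.lock().unwrap().clone();
        for child in children {
            child.cancel();
        }
    }

    pub fn is_cancelled(&self) -> bool {
        self.inner.cancelled.load(Ordering::SeqCst)
    }
}
```

Cancellation flows downward only: aborting a commit leaves the session
and its sibling groups untouched.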

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> High risk due to a breaking redesign of the public `xet_session` API
(new handle/report types and renamed methods) plus new
cancellation/state machinery that changes how uploads/downloads are
coordinated and terminated.
> 
> **Overview**
> **Redesigns `xet_pkg::xet_session` around a new hierarchical
`TaskRuntime`** (using `tokio-util` cancellation tokens) to unify state,
bridging, and cancellation across session → commit/group → per-file
handles.
> 
> **Replaces the old task-handle/result model** (`tasks.rs`,
`UploadResult`/`DownloadResult`, `TaskStatus`, group/session state
enums) with explicit handle/report types: `XetFileUpload`,
`XetStreamUpload`, `XetFileDownload`, `XetCommitReport`, and
`XetDownloadGroupReport`, and standardizes task state via
`XetTaskState`.
> 
> **Adjusts APIs and error semantics**: `commit()` now returns an
aggregate report (dedup metrics + progress + per-file metadata) and no
longer consumes `self`; progress methods become infallible
(`progress()`); cancellations/errors are consolidated
(`AlreadyCompleted`, `UserCancelled`, `KeyboardInterrupt`,
`TaskError`/`PreviousTaskError`) with updated Python exception mapping.
`xet_data` now returns per-file `DeduplicationMetrics` from upload tasks
and adds a zero-copy `SingleFileCleaner::add_data_from_bytes`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
153a3ebbbe. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-27 07:54:37 -07:00
Hoyt Koepke
332a456e1d Add ordered and unordered download streaming to session interface (#729)
This PR adds ordered and unordered download streams on XetSession,
including optional byte-range support and per-stream progress reporting.
Blocking and async variants are supported.

On the reconstruction side, this introduces UnorderedWriter and
UnorderedDownloadStream in xet_data, and extends the FileDownloadSession
stream APIs to take optional source ranges. Ordered and unordered
streams now share the same session-facing access pattern for async and
blocking callers.

This PR also renames DownloadGroup to FileDownloadGroup; the stream data
uses the per-session memory pool but doesn't count towards the maximum
number of concurrent downloads in progress.
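
The difference between the two stream kinds comes down to buffering. An
unordered stream can hand each piece to the caller as soon as it
arrives, tagged with its offset; an ordered stream has to hold back
pieces until the gap before them is filled. A small sketch of the
ordered side, with `OrderedReassembler` as a hypothetical name
unrelated to the actual writer types:

```rust
use std::collections::BTreeMap;

/// Buffers out-of-order pieces and releases bytes only in offset order.
#[derive(Default)]
pub struct OrderedReassembler {
    next_offset: u64,
    pending: BTreeMap<u64, Vec<u8>>,
}

impl OrderedReassembler {
    /// Accept a piece at `offset`; return whatever bytes are now contiguous.
    pub fn push(&mut self, offset: u64, data: Vec<u8>) -> Vec<u8> {
        self.pending.insert(offset, data);
        let mut ready = Vec::new();
        // Drain every buffered piece that starts exactly where output ended.
        while let Some(piece) = self.pending.remove(&self.next_offset) {
            self.next_offset += piece.len() as u64;
            ready.extend_from_slice(&piece);
        }
        ready
    }
}
```

An unordered consumer skips the `pending` map entirely, which is why it
can make progress with less memory held in flight.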

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core file reconstruction/writer plumbing (including
`DataWriter` ownership and new unordered writer/stream paths) and
changes public session APIs, so regressions could impact download
correctness, cancellation, or progress reporting.
> 
> **Overview**
> Adds first-class **ordered and unordered streaming download APIs** to
`xet_pkg::xet_session`, including async and blocking variants, optional
source-relative byte ranges, and per-stream progress via new
`XetDownloadStream` / `XetUnorderedDownloadStream` wrappers.
> 
> On the data layer, introduces an **unordered reconstruction path**
(`UnorderedWriter` + `UnorderedDownloadStream`) and refactors streaming
to spawn reconstruction tasks immediately but gate execution behind
`start()`; stream abort callbacks are now registered per-stream and
automatically unregistered on drop to avoid callback accumulation.
> 
> Updates the reconstruction writer contract by making
`DataWriter::finish` consume the writer (and shifting `DataWriter` to
`&mut self` usage), adjusts `SequentialWriter` accordingly, and adds
Criterion-based reconstruction benchmarks plus extensive
unordered-stream tests. Also renames session `DownloadGroup` to
`FileDownloadGroup` (and constructors) and updates call sites/examples.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e02890aa4b. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-20 14:40:18 -07:00
Hoyt Koepke
749c28b086 Error unification and cleanup (#737)
This PR performs some housecleaning and removes some technical debt
around using different error types, unifying them with the python
interface.

- Our client code tended to do a lot with anyhow errors as an artifact
of first using them before switching to thiserror. This PR cleans these
up in favor of using ClientError or other named error types directly.
- It also removes all the aliases to the old error type names present in
the packages before the refactoring, now settling into ClientError,
FormatError, DataError, and RuntimeError, with XetError being the error
type exposed publicly.
- Also, currently, xet_session exposes SessionError as an alias of
XetError, which adds an extra public type name without adding behavior.
This PR removes that alias and standardizes the public API/docs onto
XetError directly.
- It also tightens Python-facing error behavior and moves the python
handling to the XetError class directly, hidden behind a python feature
flag. Using these types, hf_xet now registers XetObjectNotFoundError and
XetAuthenticationError exception classes for authentication and the
not-found cases. These inherit from the current exception classes, so
all behavior is preserved.
- In addition, the From for PyErr mapping routes
timeout/network/auth/not-found categories to more appropriate Python
exception types than simply RuntimeError.

This is primarily an API-surface cleanup plus error-classification
alignment.
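
A compact sketch of the pattern: per-layer errors convert into one
public error via `From`, and a single function maps that error to a
Python exception name. The variants shown here and the `TimeoutError`
mapping are assumptions for illustration; only the type names
`ClientError`, `DataError`, `XetError` and the two custom exception
names come from the description above.

```rust
use std::fmt;

#[derive(Debug)]
pub enum ClientError {
    Timeout,
    Unauthorized,
    NotFound(String),
}

#[derive(Debug)]
pub enum DataError {
    Corrupt(String),
}

/// The single error type exposed publicly (variants are hypothetical).
#[derive(Debug)]
pub enum XetError {
    Timeout,
    Authentication,
    NotFound(String),
    Other(String),
}

impl From<ClientError> for XetError {
    fn from(err: ClientError) -> Self {
        match err {
            ClientError::Timeout => XetError::Timeout,
            ClientError::Unauthorized => XetError::Authentication,
            ClientError::NotFound(what) => XetError::NotFound(what),
        }
    }
}

impl From<DataError> for XetError {
    fn from(err: DataError) -> Self {
        match err {
            DataError::Corrupt(msg) => XetError::Other(msg),
        }
    }
}

impl fmt::Display for XetError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{self:?}")
    }
}

impl std::error::Error for XetError {}

/// Which Python exception class a given error would be raised as.
pub fn python_exception_name(err: &XetError) -> &'static str {
    match err {
        XetError::Timeout => "TimeoutError",
        XetError::Authentication => "XetAuthenticationError",
        XetError::NotFound(_) => "XetObjectNotFoundError",
        XetError::Other(_) => "RuntimeError",
    }
}

fn fetch(found: bool) -> Result<(), ClientError> {
    if found { Ok(()) } else { Err(ClientError::NotFound("file.bin".into())) }
}

/// `?` converts the named client error into the public error automatically.
pub fn download(found: bool) -> Result<(), XetError> {
    fetch(found)?;
    Ok(())
}
```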

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> API-breaking error-surface changes (removal of legacy alias modules
and signature changes like `CredentialHelper::fill_credential`) may
require downstream code updates, especially where errors are
matched/converted. Runtime behavior should be mostly unchanged, but
error mapping/propagation paths (including Python exceptions) are widely
touched across crates.
> 
> **Overview**
> This PR **unifies error types across the workspace** by removing
legacy re-export/alias modules (e.g. `CasClientError`, `CasTypesError`,
`DataProcessingError`, `SessionError`) and updating call sites to use
canonical errors like `xet_client::ClientError`,
`xet_core_structures::CoreError`, and `xet_data::DataError` directly.
> 
> It updates CAS client code to **standardize on
`crate::error::Result`/`ClientError`**, including deleting
`cas_client/error.rs`, adjusting error conversions in retry/http
middleware paths, and updating simulation/local-server code to map
`ClientError` to HTTP responses.
> 
> Python bindings (`hf_xet`) now **convert failures via `XetError`**
(with `xet_pkg` built with `python` support), register custom exceptions
on module init, and refine argument-validation errors to `PyValueError`
while routing network/timeout/auth/not-found to more appropriate Python
exception classes.
> 
> Misc cleanup: `git_xet` now depends on `xet-data`, simulation binaries
switch to `anyhow::Result`/`bail!`, and lockfiles are updated for
new/updated dependencies (notably `pyo3`/`inventory`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
f3d056a909. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-19 16:34:28 -07:00
Di Xiao
fb83178d28 Fix session id regression (#738)
The session id was changed from `ulid` to `UniqueID` (a self-incrementing
u64 in memory) in a previous PR, but that is not correct.
The session id is used in CAS server logs, traces, and CDN logs to
identify a related group of activity (for debugging and similar
purposes), so it needs to be globally unique (hence `ulid`) rather than
locally unique.
2026-03-19 13:31:40 -07:00
Di Xiao
9f68537319 Resolve all Dependabot alerts (#733)
This PR should resolve all Dependabot alerts by upgrading deps and
switching out some deprecated crates for suggested alternatives, e.g.
`tempdir -> tempfile`. Supersedes PR #721. Fixes issue #722.
2026-03-19 09:33:56 -07:00
Hoyt Koepke
506fc28291 Simplify progress tracking + Unify Task ID tracking + Legacy Interface (#726)
Currently, progress tracking is split between callback-driven and
snapshot-driven paths, making session and task wiring across xet_data,
xet_pkg, hf_xet, and git_xet harder to keep consistent. This PR moves
upload/download progress to a polling snapshot model backed by atomics.
It also switches task identifiers to a UniqueID common with the progress
tracking throughout the session APIs.

This PR also updates the rate estimation to use a lighter-weight
exponentially weighted moving average model, so this can be done at a
low level.

To preserve compatibility for existing callback consumers,
callback-oriented upload/download progress tracking APIs are moved under
xet_pkg::legacy and bridged from polling snapshots via callback-based
updaters. hf_xet and git_xet are updated to use that legacy bridge
layer, so current integrations keep working until everything is fully
switched over to the XetSession method.
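
A minimal sketch of the two ideas, assuming nothing about the real
types: workers bump atomic counters, readers poll a snapshot whenever
they like, and a rate estimate is kept as an exponentially weighted
moving average. `ProgressCounters`, `ProgressSnapshot`, and `EwmaRate`
are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared counters updated by workers without locks or callbacks.
#[derive(Default)]
pub struct ProgressCounters {
    total_bytes: AtomicU64,
    completed_bytes: AtomicU64,
}

/// Plain copy of the counters at one point in time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ProgressSnapshot {
    pub total_bytes: u64,
    pub completed_bytes: u64,
}

impl ProgressCounters {
    pub fn register(&self, bytes: u64) {
        self.total_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    pub fn complete(&self, bytes: u64) {
        self.completed_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Polling read; the two fields are loaded independently.
    pub fn snapshot(&self) -> ProgressSnapshot {
        ProgressSnapshot {
            total_bytes: self.total_bytes.load(Ordering::Relaxed),
            completed_bytes: self.completed_bytes.load(Ordering::Relaxed),
        }
    }
}

/// rate = alpha * sample + (1 - alpha) * previous_rate
pub struct EwmaRate {
    alpha: f64,
    rate: Option<f64>,
}

impl EwmaRate {
    pub fn new(alpha: f64) -> Self {
        EwmaRate { alpha, rate: None }
    }

    /// Fold in `bytes` transferred over `seconds`; returns bytes per second.
    pub fn update(&mut self, bytes: u64, seconds: f64) -> f64 {
        let sample = bytes as f64 / seconds;
        let rate = match self.rate {
            None => sample, // the first sample seeds the average
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        };
        self.rate = Some(rate);
        rate
    }
}
```

The estimator keeps a single float of state, which is what makes it
cheap enough to run at a low level.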
2026-03-18 18:07:43 -07:00
Hoyt Koepke
71f8570a0e Optimize config struct for direct access in python (#706)
This PR adds in a feature flag, "python" to the xet_runtime package such
that when compiled, the XetConfig struct is built to have python getters
and setters. This integrates the handling of the config struct directly
into the XetConfig struct and the macros used to register the config
values, making the handling of values in the python bindings seamless.
2026-03-16 12:23:43 -07:00
Brian Ronan
6232b42591 Xorb download URL debug logs (#714)
It's a bit annoying to try to ensure our CDN routing is correct. Logging
the URL domain for the first fetch term download to the debug logs.

Please don't hesitate to recommend alternative approaches.
2026-03-13 16:05:30 -07:00
Di Xiao
e701aeddac Support XetSession in async context (#694)
`XetSession` always created its own tokio runtime via
`XetRuntime::new_with_config`, and calling `external_run_async_task`
panics when already inside a tokio context. This blocked embedding the
session in async Rust frameworks.

Core strategy:

 - `RuntimeMode` enum, with two variants:
   - `Owned`: the session creates its own thread pool, via
     `XetSessionBuilder::build`, or via `XetSessionBuilder::build_async`
     when outside a tokio context. Both `_blocking` and async methods are
     supported. Async methods use an internal `bridge_to_owned` bridge
     that routes futures onto the owned thread pool, so they work from
     any executor (tokio, smol, async-std).
   - `External`: the session wraps a caller-supplied tokio handle, via
     `XetSessionBuilder::with_tokio_handle`, or via
     `XetSessionBuilder::build_async` when inside a qualified tokio
     context. Only async methods may be called; `_blocking` methods
     return `SessionError::WrongRuntimeMode`. No second thread pool is
     created.
- `XetRuntime::bridge_to_owned` — a new bridge that routes a future onto
the owned tokio thread pool from any executor (smol, async-std,
futures::executor, non-qualified tokio runtime) by delivering the result
via a `tokio::sync::oneshot` channel that can be polled by any async
executor.
- Async public API — `UploadCommit` and `DownloadGroup` methods
(`upload_from_path`, `upload_bytes`, `upload_file`, `commit`, `finish`)
are now async fn. Factory methods `XetSession::new_upload_commit` and
`new_download_group` are async.
Example:
```
let session = XetSessionBuilder::new().build_async().await?;
// Upload
let commit = session.new_upload_commit().await?;
let handle = commit.upload_from_path("file.bin".into()).await?;
let results = commit.commit().await?;

// Download
let group = session.new_download_group().await?;
let info = XetFileInfo {
    hash: ...,
    file_size: ...,
};
let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?;
let finish_results = group.finish().await?;
```

- Sync wrappers — New `UploadCommitSync` / `DownloadGroupSync` in
`xet_session/sync/` expose a fully blocking API for sync Rust and Python
(PyO3) callers. Returned by `new_upload_commit_blocking()` and
`new_download_group_blocking()`.
Example:
```
let session = XetSessionBuilder::new().build()?;
// Upload
let commit = session.new_upload_commit_blocking()?;
let handle = commit.upload_from_path("file.bin".into())?;
let results = commit.commit()?;
let m = results.values().next().unwrap().as_ref().as_ref().unwrap();

// Download
let group = session.new_download_group_blocking()?;
let info = XetFileInfo {
    hash: ...,
    file_size: ...,
};
let dl_handle = group.download_file_to_path(info, "out/file.bin".into())?;
let finish_results = group.finish()?;
```
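
The bridging idea (run the work on a pool the session owns, hand the
result back through a channel) can be shown with std threads. This is a
blocking analog only: the PR uses a tokio thread pool and a
`tokio::sync::oneshot` channel that can be awaited, whereas this sketch
uses one worker thread and `std::sync::mpsc`. `OwnedPool` and `bridge`
are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// A single worker thread owned by the session.
pub struct OwnedPool {
    jobs: mpsc::Sender<Job>,
}

impl OwnedPool {
    pub fn new() -> Self {
        let (jobs, queue) = mpsc::channel::<Job>();
        // The worker exits once every sender has been dropped.
        thread::spawn(move || {
            for job in queue {
                job();
            }
        });
        OwnedPool { jobs }
    }

    /// Run `work` on the owned worker; the caller receives the result
    /// through the returned channel, whatever thread it is on.
    pub fn bridge<T, F>(&self, work: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (reply, result) = mpsc::channel();
        let job: Job = Box::new(move || {
            // The caller may have stopped waiting; ignore a closed channel.
            let _ = reply.send(work());
        });
        self.jobs.send(job).expect("worker thread is alive");
        result
    }
}
```

The caller never touches the worker's runtime directly; it only waits
on the channel, which is the property that lets the real bridge work
from any executor.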



Additional fixes: `download_file_to_path` and `upload_from_path` now
canonicalize paths with `std::path::absolute` before enqueuing; task
status is only overwritten when still `Running`, preventing a race with
concurrent abort().

Fix XET-891

---------

Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
2026-03-13 14:57:20 -07:00
Hoyt Koepke
45d38a13a9 Code reorganization towards release of xet cargo package (#693)
This PR is a massive rearrangement of the code base into 5 packages
intended for release on cargo. The directories and corresponding
packages are:

1. xet_runtime/ — compiles into the xet-runtime package. Contains the
runtime, config, and logging management.
2. xet_core_structures/ — compiles into the xet-core-structures package.
Contains core data structures for hashing, shards, and xorbs as well as
internal data structures that depend on these.
3. xet_client/ — compiles into the xet-client package, contains client
code for remotely connecting to the Hugging Face servers.
4. xet_data/ — compiles into the xet-data package, contains the data
processing pipeline: chunking/deduplication, file reconstruction,
clean/smudge operations, and progress tracking.
5. xet_pkg/ — compiles into the hf-xet package, provides the top-level
session-based API for file upload and download with user-facing error
categorization. This is the primary package downstream dependencies
would use. This also contains a single summary error type, XetError,
that translates cleanly into python error types.

In addition, the other tools are: 

- git_xet/ — the git_xet CLI binary crate (location preserved). 
- hf_xet/ -- the hf_xet python package (location preserved).
- simulation/ — the simulation crate for upload scenario benchmarking.
- wasm/ -- the wasm objects. 

The full description — and information for an AI agent to use to update
downstream dependencies — is at
api_changes/update_260309_package_restructure.md.

Summary of moves:

- xet_runtime: became xet_runtime::core inside xet_runtime/.
- utils: became xet_runtime::utils inside xet_runtime/.
- xet_config: became xet_runtime::config inside xet_runtime/.
- xet_logging: became xet_runtime::logging inside xet_runtime/.
- error_printer: became xet_runtime::error_printer inside xet_runtime/.
- file_utils: became xet_runtime::file_utils inside xet_runtime/.
- merklehash: became xet_core_structures::merklehash inside
xet_core_structures/.
- mdb_shard: became xet_core_structures::metadata_shard inside
xet_core_structures/.
- xorb_object: became xet_core_structures::xorb_object inside
xet_core_structures/.
- cas_client: became xet_client::cas_client inside xet_client/.
- hub_client: became xet_client::hub_client inside xet_client/.
- cas_types: became xet_client::cas_types inside xet_client/.
- chunk_cache: became xet_client::chunk_cache inside xet_client/.
- data: became xet_data::processing inside xet_data/.
- deduplication: became xet_data::deduplication inside xet_data/.
- file_reconstruction: became xet_data::file_reconstruction inside
xet_data/.
- progress_tracking: became xet_data::progress_tracking inside
xet_data/.
- xet_session: became xet::xet_session inside xet_pkg/.

- Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level
into wasm/; internal imports updated, public APIs unchanged.
2026-03-11 12:02:38 -07:00
Rajat Arya
83a28271ea fix: no timeout for shard uploads (XET-885) (#685)
Fixes
[XET-885](https://linear.app/xet/issue/XET-885/investigate-unsloth-upload-failure-shard-upload-timeout-on-cas)

## Summary

Shard uploads to CAS can take a long time due to server-side processing
(DynamoDB writes scale with file entry count). The default
`read_timeout(120s)` on the reqwest client kills these uploads.

**Key insight:** reqwest's per-request `RequestBuilder::timeout()` does
NOT override the client-level `read_timeout()` — they are independent
mechanisms polled as separate futures. So the original approach of using
per-request timeouts was ineffective.

**Fix:** Create a dedicated `shard_upload_http_client` on `RemoteClient`
with **no `read_timeout`**, built once at construction time and reused
for all shard uploads. All other settings (connect timeout, pool config,
auth middleware) are identical to the standard client.

## Changes

### `cas_client/src/http_client.rs`
- Added `reqwest_client_no_read_timeout()` — creates a reqwest client
with no `read_timeout`
- Added `build_auth_http_client_no_read_timeout()` — public API wrapping
it with middleware
- 4 unit tests for the new builder

### `cas_client/src/remote_client.rs`
- Added `shard_upload_http_client` field to `RemoteClient` (cfg'd out on
wasm)
- `upload_shard()` uses the pre-built no-timeout client instead of
building one per request

### `cas_client/tests/test_shard_upload_timeout.rs`
- Updated: slow server test now asserts **success** (shard uploads
should wait as long as needed)

### `xet_config/src/groups/client.rs`
- Removed `shard_read_timeout` config field (no longer needed)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-11 09:05:40 -07:00
Hoyt Koepke
6a5535bc46 Rework simulation pipeline for adaptive concurrency and connection resiliency. (#648)
This PR replaces the previous collection of scripts for setting up
docker containers with a more nimble and lightweight set of Rust
scripts and a simple, reusable proxy that can limit bandwidth and
simulate congestion. The previous scripts are rewritten to use more
reusable components.

New tools: 
- cas_client/src/simulation/network_simulation: A lightweight,
in-process network congestion simulation proxy that lives between the
LocalServer instance and the RemoteClient instance, allowing simulation
tests to run on a network with realistic congestion conditions and a
gated bandwidth. This can be controlled dynamically through a
LocalTestServer instance.
- simulation/: A new package for collecting simulation scripts and
analyzing the results.

To run the new simulation scripts for the adaptive concurrency on
upload, compile in release mode and run one of the scripts in
`simulation/src/adaptive_concurrency/scripts/`. Docker is no longer
needed to run any of the simulations.

The old `cas_client/tests/adaptive_concurrency/` paths were removed.
2026-03-09 10:49:36 -07:00
Hoyt Koepke
e6e0413d90 Naming clarification: A Xorb is a data object, CAS is the remote server. (#680)
This PR makes the use of the `cas` and `xorb` terms consistent.
Previously, "cas" (for content addressed store) could simultaneously
refer to either the remote server or the data bytes stored as a
collection of chunks. After the renames in this PR, we consistently use
`xorb` to refer to the data object and `cas` to refer to the remote
server.

This renames quite a few places; to aid in rebasing current work or
updating downstream dependencies, this PR includes a file
`API_UPDATES.md` that can be fed into an AI agent to quickly and
accurately perform the renaming on any downstream dependencies.
2026-03-04 16:05:49 -08:00
Di Xiao
c4a56f889c XetSession API (#657)
This PR introduces a new `xet_session` crate that provides a
session-based hierarchical API: Users create a XetSession to manage
runtime and configuration, then batch uploads into UploadCommit objects
and downloads into DownloadGroup objects, each of which runs transfers
in the background on the inner XetRuntime.

All pub functions are exposed as sync functions - making them easy to
use in other languages, e.g. Python, C, etc.
2026-03-03 20:27:39 -08:00
Hoyt Koepke
9b3278a510 Streaming data writer (#656)
This PR adds an integrated API for streaming downloads, exposing a
DownloadStream object that is integrated with the file reconstructor. It
also uses the same memory management buffer limiting process to work
with the stream object.

It also introduces cancellation support to the FileReconstructor to
ensure that tasks waiting on a long running download or semaphore wait
don't cause things to hang when an error is reported or the user drops
the stream.
2026-02-27 15:08:25 -08:00
Di Xiao
c4111eb6da Feature to monitor client process system usage (#617)
Introduces a client benchmark utility to track system resource usage
(CPU, memory, disk I/O, and network I/O) of a process, so we don't need
to write scripts to capture usage stats according to different OS
standards. This becomes extremely helpful when I benchmark on Python
notebook instances, e.g. Google Colab, where a system monitor is not
easily accessible or when running a separate monitor script is not easy.

# Usage #
Users can enable monitoring by setting `HF_XET_SYSTEM_MONITOR_ENABLED`
to true and set the usage sample interval using
`HF_XET_SYSTEM_MONITOR_SAMPLE_INTERVAL`. By default, this outputs
metrics to the tracing stream at `INFO` level. In addition, these
metrics can be redirected to a separate file by setting the sample log
path using `HF_XET_SYSTEM_MONITOR_LOG_PATH`.

# Output #
The stats are output in JSON format, which can be queried using tools
like `jq`, e.g.
1. Trace of peak memory usage: `jq '.memory.peak_used_bytes'
[HF_XET_SYSTEM_MONITOR_LOG_PATH]`
2. Trace of disk write speed: `jq '.disk.average_write_speed'
[HF_XET_SYSTEM_MONITOR_LOG_PATH]`
3. Trace of network receive speed: `jq '.network.average_rx_speed'
[HF_XET_SYSTEM_MONITOR_LOG_PATH]`
2026-02-27 13:36:31 -08:00
Hoyt Koepke
543914dce1 Scale download buffer memory limit by number of active downloads (#666)
Currently, the total download buffer size is fixed, regardless of the
number of downloads currently in flight. However, as the number
of downloads increases, a fixed size total could lead to waiting on
individual segments that download out-of-order or don't have enough
turnaround time to saturate the output. While writing to disk or the
download itself often becomes the bottleneck before these effects,
planned features such as streaming files and caching could be affected
by this limit. The default formula for the download buffer size now is
(2GB + 512MB * number of concurrent downloads) up to a maximum of 8GB
(these are adjustable).
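
The default formula above can be sketched as follows. This is an illustrative helper, not the code in this PR: the function name is hypothetical, and it assumes binary units (1GB = 2^30 bytes) and hard-codes the stated defaults.

```rust
// Assumption: binary units; the source only says "GB" and "MB".
const MB: u64 = 1 << 20;
const GB: u64 = 1 << 30;

/// Hypothetical helper: total download buffer limit in bytes for a given
/// number of concurrent downloads, using the defaults from the description:
/// 2GB base + 512MB per download, capped at 8GB.
fn download_buffer_limit(active_downloads: u64) -> u64 {
    let base = 2 * GB;
    let per_download = 512 * MB;
    let cap = 8 * GB;
    base.saturating_add(per_download.saturating_mul(active_downloads))
        .min(cap)
}
```

With these defaults the cap is reached at 12 concurrent downloads; any further downloads share the same 8GB.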

This PR alleviates this by adding 512MB of buffer capacity per file,
prioritized to the specific download, and releasing that capacity when
the file finishes downloading. This is done using the
AdjustableSemaphore class, first introduced for the concurrent scaling,
which allows the number of total permits in a semaphore to be
incremented or decremented; on decrement, permits are discarded upon
return until the total permits is at the target number.
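
The discard-on-return behavior can be illustrated with a toy, single-threaded permit counter. This is only a sketch of the idea; the type and method names are hypothetical and it is not the AdjustableSemaphore implementation, which is a real semaphore.

```rust
/// Toy model of an adjustable permit pool (hypothetical names).
#[derive(Debug)]
struct AdjustablePermits {
    total: usize,      // target number of permits
    available: usize,  // permits that can be acquired right now
    to_discard: usize, // returned permits still to be dropped to reach the target
}

impl AdjustablePermits {
    fn new(total: usize) -> Self {
        Self { total, available: total, to_discard: 0 }
    }

    fn try_acquire(&mut self) -> bool {
        if self.available > 0 {
            self.available -= 1;
            true
        } else {
            false
        }
    }

    /// Return a permit; it is discarded if a decrement is still pending.
    fn release(&mut self) {
        if self.to_discard > 0 {
            self.to_discard -= 1;
        } else {
            self.available += 1;
        }
    }

    /// Add capacity: first cancel pending discards, then add free permits.
    fn increment(&mut self, n: usize) {
        self.total += n;
        let cancelled = n.min(self.to_discard);
        self.to_discard -= cancelled;
        self.available += n - cancelled;
    }

    /// Remove capacity: take free permits now, discard the rest on return.
    fn decrement(&mut self, n: usize) {
        let n = n.min(self.total);
        self.total -= n;
        let taken_now = n.min(self.available);
        self.available -= taken_now;
        self.to_discard += n - taken_now;
    }
}
```

In this model a decrement never blocks or revokes a permit that is in use; the pool simply shrinks as permits come back.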
2026-02-27 11:35:55 -08:00
Brian Ronan
17e900a70e Feat: optional request_headers on hf_xet API calls (#661)
Adds support for setting an optional `request_headers` map on the
hf_xet upload and download API calls. This map is augmented with the
hf_xet user agent string and is passed along with the requests to
xetcas.

This PR also adds unit tests for the map merging behavior to
`hf_xet/lib.rs` and adds support for running them with cargo test and
in the GitHub Actions CI step.
2026-02-23 14:43:58 -08:00
Hoyt Koepke
5d6371a296 Progress reporting for downloads. (#645)
This PR adds detailed progress reporting to the download path. 
- Transfer progress is reported as soon as the download streams start;
actual bytes written are reported as the reconstructed file is written
out.
- Currently, each call to download_file creates a separate progress
tracker, but this sets up for download groups with grouped download
progress tracking.
 
To support this, the UploadProgressStream was split into three classes;
a common StreamProgressReporter and download and upload specific
versions. This also allows us to simplify the API to RetryWrapper.

More tracking was added to the file reconstruction paths to properly
report progress.
2026-02-19 11:06:42 -08:00
Hoyt Koepke
9d9fc72d40 XetCommon struct in the runtime to hold global counters, semaphores. (#650)
This PR simplifies the current process of working with
runtime-associated resources such as a cached Client instance or global
resource semaphores. Instead of using macros, all of these are moved
into a XetCommon struct that holds them explicitly. The runtime holds an
instance of this, and it's initialized with a config struct.

In addition, to make the logic around the memory limiting semaphore in
file_reconstructor clearer, we added a ResourceLimiter struct that wraps
the tokio semaphore but scales the total permits and permit requests
appropriately if the total resource quantity is larger than u32::MAX, as
can easily be the case.
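
One way such scaling could work is sketched below. The names are hypothetical and this is not the ResourceLimiter code; it only shows how a u64 resource quantity can be mapped onto a permit count that fits in u32 by choosing a unit size.

```rust
/// Hypothetical sketch: maps a u64 resource quantity (e.g. bytes) onto
/// permit counts that fit in u32.
struct PermitScale {
    total: u64, // total resource quantity
    unit: u64,  // resource units represented by one permit
}

impl PermitScale {
    fn new(total: u64) -> Self {
        // Smallest unit such that total / unit fits in u32.
        let unit = total.div_ceil(u32::MAX as u64).max(1);
        Self { total, unit }
    }

    /// Number of permits the underlying semaphore would be created with.
    fn total_permits(&self) -> u32 {
        (self.total / self.unit) as u32
    }

    /// Permits to request for `amount` units of the resource, rounded up
    /// so a request never reserves less than it needs.
    fn permits_for(&self, amount: u64) -> u32 {
        amount.div_ceil(self.unit).min(u32::MAX as u64) as u32
    }
}
```

For an 8GB (2^33 byte) limit this picks a unit of 3 bytes per permit; for totals that already fit in u32 the unit is 1 and nothing changes.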

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-12 16:47:07 -08:00
Di Xiao
23f68bb798 Upgrade git-xet to 0.2.1 (#653) 2026-02-12 15:45:34 -08:00
Di Xiao
7d7582c3dd TemplatedPathBuf utility (#643)
Implements a utility for configuring path-like parameters.

This folds in the existing function `normalized_path_from_user_string`,
which expands `~` to the home directory and converts to absolute paths,
and additionally evaluates a path template by substituting
**case-insensitive** placeholders with the corresponding values:
- `{pid}` for process ID,
- `{timestamp}` for ISO 8601 local timestamp with offset

For example,
```
let template = TemplatedPathBuf::new("~/logs/app_{PID}_{TIMESTAMP}.txt");
let path = template.as_path();
// `path` is an absolute path like "/home/user/logs/app_12345_2024-01-15T10-30-45-0500.txt"
```
or used directly in config groups:
```
crate::config_group!({
    ref log_path: Option<TemplatedPathBuf> = None;
});
```
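
The placeholder substitution itself can be sketched with the standard library alone. This is not the TemplatedPathBuf implementation: `expand_template` is a hypothetical helper, the timestamp is passed in as a preformatted string, and `~` expansion and path normalization are left out.

```rust
/// Hypothetical helper: substitutes `{pid}` and `{timestamp}` placeholders,
/// matching placeholder names case-insensitively. Unknown placeholders and
/// unmatched braces are copied through unchanged.
fn expand_template(template: &str, pid: u32, timestamp: &str) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        out.push_str(&rest[..open]);
        let after = &rest[open..];
        match after.find('}') {
            Some(close) => {
                let name = &after[1..close];
                match name.to_ascii_lowercase().as_str() {
                    "pid" => out.push_str(&pid.to_string()),
                    "timestamp" => out.push_str(timestamp),
                    _ => out.push_str(&after[..=close]),
                }
                rest = &after[close + 1..];
            }
            None => {
                // No closing brace: keep the remainder verbatim.
                out.push_str(after);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

A caller would pass `std::process::id()` for the PID and whatever timestamp format it wants in file names.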
2026-02-11 14:51:16 -08:00
Hoyt Koepke
e443ee9260 Upgrade package dependencies (#644)
This PR updates all the package dependencies that would not cause
significant API breakages to the current version. The package versions
in hf_xet_wasm and hf_xet are also updated to match the versions in the
base package. There should be no functional change.
2026-02-11 12:19:29 -08:00
dependabot[bot]
c9a29ffb9e Bump oneshot from 0.1.11 to 0.1.12 (#616)
Bumps [oneshot](https://github.com/faern/oneshot) from 0.1.11 to 0.1.12.

Changelog, sourced from [oneshot's
changelog](https://github.com/faern/oneshot/blob/main/CHANGELOG.md):

## [0.1.12] - 2026-01-25

### Fixed

- Fix race condition that could lead to use-after-free if the
`Receiver` was polled asynchronously, but then dropped before
completion.
[faern/oneshot#74](https://redirect.github.com/faern/oneshot/pull/74)
- Fix race conditions/UB around atomic memory orderings. These were
found by running tests under miri.
[faern/oneshot#72](https://redirect.github.com/faern/oneshot/pull/72)

Commits:

- `537d5de` Bump version to 0.1.12 and fix changelog
- `9cc3153` Merge branch 'improve-start_recv_ref'
- `cc3d6a2` Improve start_recv_ref to be more like regular recv method
- `78c7476` Merge branch 'update-documentation'
- `38d7f6f` Add clarifying documentation on sender observing RECEIVING
state
- `21e0310` Synchronize readme with crate documentation in lib.rs
- `def74fc` Fix spelling and grammar errors in documentation
- `70031a4` Add documentation about how send and receive are
synchronized
- `d1a1506` Merge branch 'fix-async-recv-drop-use-after-free'
- `f19ff7c` Fix Receiver::drop bug causing a race when dropping a polled
receiver
- Additional commits viewable in the [compare
view](https://github.com/faern/oneshot/compare/v0.1.11...v0.1.12)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-28 10:28:04 -10:00
Hoyt Koepke
a6630293bb Hash table with pass-through hasher for MerkleHashes (#611)
Currently, Rust's HashMap hashes keys with a randomly seeded hasher,
which prevents hash collision attacks. However, in our code, we don't need
that protection in the client, and a MerkleHash is already a
cryptographic hash. This PR adds a MerkleHashMap type that just passes
the hash through to the HashMap, providing a substantial speedup:

```
=================================================================
PERFORMANCE SUMMARY (times in ms, lower is better)
=================================================================
Test                                  HashMap         PassThrough
-----------------------------------------------------------------
--- 100K ---
  Insert                                  2.1                 0.7
  Lookup                                  2.1                 1.3
  Insert+Lookup                           4.4                 1.6
  Serialize                               1.6                 0.9
  Deserialize                             4.3                 1.2


--- 10M ---
  Insert                                433.2               204.1
  Lookup                                615.3               255.5
  Insert+Lookup                         951.6               460.4
  Serialize                             117.2                93.4
  Deserialize                           599.5                89.3

=================================================================
```

It also replaces HashMap<MerkleHash, ...> everywhere in the code to
provide an across-the-board improvement.
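
A minimal sketch of the pass-through idea using only the standard library is shown below. `DemoHash`, `PassThroughHasher`, and `DemoHashMap` are hypothetical stand-ins, not the types added in this PR; in particular, using the first 8 bytes of the hash as the table key is an assumption.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

/// Stand-in for a 32-byte cryptographic hash such as MerkleHash.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct DemoHash([u8; 32]);

impl Hash for DemoHash {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // The bytes are already uniformly distributed, so feed the
        // first 8 of them to the hasher as a single u64.
        let mut prefix = [0u8; 8];
        prefix.copy_from_slice(&self.0[..8]);
        state.write_u64(u64::from_le_bytes(prefix));
    }
}

/// Hasher that returns the u64 it was given without mixing it.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        // Fallback for keys that do not call write_u64: fold bytes in.
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(b);
        }
    }

    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
}

type DemoHashMap<V> = HashMap<DemoHash, V, BuildHasherDefault<PassThroughHasher>>;
```

Skipping the mixing step is only reasonable because the keys are cryptographic hashes; with attacker-controlled or structured keys the default hasher remains the safer choice.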
2026-01-22 10:42:53 -10:00
Hoyt Koepke
128fb6fc42 File download and reconstruction V2 (#603)
This PR rewrites the download and file reconstruction path. The new
version:

- Separates the Client connection from the reconstruction, using a new
FileReconstructor class to manage the reconstruction. This
FileReconstructor is now in the file_reconstructor package. The old
version is still present in the client but moved to
file_reconstruction_v1/; using V1 or V2 is controlled by
reconstruction.use_v1_reconstructor.
- Uses a global buffer memory limiter so the space used for downloading
all files never exceeds a configurable limit, set to 8GB by default.
- Automatically tunes the download parallelism to adapt to the
connection conditions.
- Automatically tunes the number of terms fetched in order to target all
terms downloading within a certain window.
- Uses vectored writes (configurable) to speed up writing to a single file.
- Moves the URL refresh logic into the RetryWrapper class.
- Uses a for loop with futures to make the logic behind the
reconstruction process easier to understand.
- Adds extensive testing against the LocalTestServer and LocalClient to
cover all the code paths.
- Completely removes the retry logic layer from the reqwest middleware.

Next steps after this: 
- Implement resume on partial download.
- Interface to caching layer. 
- Add partial-term progress reporting to match the upload path.
2026-01-14 21:02:53 -08:00
Hoyt Koepke
9332ff28b7 Mock CAS server built on LocalClient for testing and simulation. (#602)
This PR adds a fully functional CAS server built around a LocalClient
instance. This allows full testing of the RemoteClient interface without
hitting the actual CAS backend.

For testing, it can either be run as a standalone executable, or it can
be started using a LocalTestServer instance that exposes both a
RemoteClient interface as the client and direct access to the state
through a stored LocalClient instance.

Numerous tests are added to cover existing functionality as well as the
new server.

(Also, it exposed that when running many tests with wiremock or this
server, the testing would often hit a "Too many open files" error; this
was fixed by consolidating these tests to reduce the number of separate
testing servers running at once.)
2026-01-09 12:39:52 -08:00
Di Xiao
d15295eff3 Clean up dependencies (#595)
- Remove dependencies from Cargo.toml files that are not used.
- Move dependencies directly referencing crates.io from crate level
Cargo.toml to the workspace Cargo.toml.
- Fix using RemoteClient in WASM: AdaptiveConcurrencyController uses
`tokio::time::Instant` which wraps `std::time::Instant` and is not
available in WASM.
- Add [cargo-machete](https://github.com/bnjbvr/cargo-machete) to CI to
check unused dependencies.

No functionality change.
2025-12-15 15:26:02 -08:00
Di Xiao
74d7c5926c Clean up dead code (#593)
A lot of dead code has been left in xet-core due to
`#![allow(dead_code)]` in a couple of places. This PR removes it and
fixes the corresponding linting errors. No functionality change.
2025-12-11 10:55:28 -08:00
Hoyt Koepke
9cf0e1e35e Automatic concurrency adjustment for transfers (#410)
Adaptive Concurrency Controller

This PR introduces adaptive concurrency control for transfers based on
an adaptive ML model of the network connection.

It is currently implemented only for the upload path and gated behind
the environment variable HF_XET_ENABLE_ADAPTIVE_CONCURRENCY, which is
set to false by default. Future PRs will integrate this into the
download path and then enable it by default with sufficient testing.

The `AdaptiveConcurrencyController` struct dynamically adjusts
concurrency for upload and download operations by continuously adapting
to network conditions. It tracks two key signals:

1. Observed bandwidth via an online linear regression predictor
2. Success ratio of recent transfers using configurable success/failure
thresholds

Transfers are considered successful if they complete within a
statistically reasonable time given the model (less than the 90%
quantile) and below the configured max RTT for healthy operation (by
default 90s). The model then increases the concurrency when the success
ratio is high (>0.8) and the RTT prediction stays below a target RTT
(60s default). It decreases the concurrency when the success ratio drops
below a threshold (<0.5) or the transfers exceed a maximum healthy RTT
(90s default). To prevent oscillations, it also enforces a minimum delay
between adjustments, set to 500ms by default.

The RTT prediction is implemented using an exponentially-weighted online
linear regression model that predicts round-trip time (RTT) based on
transfer size and concurrency level. The model fits:
   ```
   duration_secs ≈ a + b * (size_bytes * concurrency)
   ```
Internally this is implemented using
`ExpWeightedOnlineLinearRegression`, which maintains
exponentially-decaying sufficient statistics to predict the mean and
standard deviation of the RTT. The exponential decay of the process,
with the half-life of an observation set to 60 data points, allows it to
adapt to slowly changing network conditions. This model is used to
predict whether adding concurrency will cause a large transfer of 64MB
to take longer than 60s to complete, in which case no concurrency is
added. Upon a successful transfer, this model is used to assess whether
congestion might be causing a completed transfer to take longer than
expected; if the actual RTT exceeds the 90% quantile, then it's reported
as a failure to the success tracker; a statistically significant number
of recent failures will prevent the concurrency from increasing, and a
string of failures will cause the controller to lower the concurrency.
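
The regression described above can be sketched with exponentially decayed sufficient statistics. This is a simplified illustration, not `ExpWeightedOnlineLinearRegression` itself: `EwRegression` is a hypothetical name, it only estimates the mean prediction, and the standard deviation tracking mentioned in the description is omitted.

```rust
/// Exponentially weighted online linear regression for y ≈ a + b * x.
/// Each new observation multiplies the weight of all older ones by `decay`.
struct EwRegression {
    decay: f64,
    w: f64,   // decayed count of observations
    sx: f64,  // decayed sum of x
    sy: f64,  // decayed sum of y
    sxx: f64, // decayed sum of x * x
    sxy: f64, // decayed sum of x * y
}

impl EwRegression {
    /// `half_life` is the number of observations after which a data
    /// point's weight has halved.
    fn new(half_life: f64) -> Self {
        let decay = 0.5f64.powf(1.0 / half_life);
        Self { decay, w: 0.0, sx: 0.0, sy: 0.0, sxx: 0.0, sxy: 0.0 }
    }

    fn observe(&mut self, x: f64, y: f64) {
        let d = self.decay;
        self.w = self.w * d + 1.0;
        self.sx = self.sx * d + x;
        self.sy = self.sy * d + y;
        self.sxx = self.sxx * d + x * x;
        self.sxy = self.sxy * d + x * y;
    }

    /// Weighted least-squares fit; None until x has taken two distinct values.
    fn coefficients(&self) -> Option<(f64, f64)> {
        let denom = self.w * self.sxx - self.sx * self.sx;
        if denom.abs() <= 1e-12 * (self.w * self.sxx).abs() {
            return None;
        }
        let b = (self.w * self.sxy - self.sx * self.sy) / denom;
        let a = (self.sy - b * self.sx) / self.w;
        Some((a, b))
    }

    fn predict(&self, x: f64) -> Option<f64> {
        self.coefficients().map(|(a, b)| a + b * x)
    }
}
```

In the controller's terms, `x` would be `size_bytes * concurrency` and `y` the observed duration in seconds, with `EwRegression::new(60.0)` matching the stated half-life of 60 data points.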

The controller tracks the success ratio (fraction of successful
transfers) using an exponentially weighted moving average with a default
half-life of 8 observations. This allows us to determine whether recent
transfers have hit congestion, as long RTTs are recorded as failures.
More than 80% of the recent transfers have to be successes to raise the
concurrency, and if less than 50% are successful, the concurrency is
lowered.
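
The success-ratio tracking can be sketched as an exponentially weighted average with the thresholds quoted in this description (increase above 0.8, decrease below 0.5). The names are hypothetical, and the RTT-based conditions and the minimum delay between adjustments are left out.

```rust
#[derive(Debug, PartialEq)]
enum Adjustment {
    Increase,
    Hold,
    Decrease,
}

/// Exponentially weighted success ratio over recent transfers.
struct SuccessTracker {
    decay: f64,
    successes: f64, // decayed count of successful transfers
    weight: f64,    // decayed count of all transfers
}

impl SuccessTracker {
    fn new(half_life: f64) -> Self {
        Self { decay: 0.5f64.powf(1.0 / half_life), successes: 0.0, weight: 0.0 }
    }

    fn record(&mut self, success: bool) {
        self.successes = self.successes * self.decay + if success { 1.0 } else { 0.0 };
        self.weight = self.weight * self.decay + 1.0;
    }

    fn ratio(&self) -> Option<f64> {
        if self.weight > 0.0 {
            Some(self.successes / self.weight)
        } else {
            None
        }
    }

    /// Decision based on the success ratio alone.
    fn adjustment(&self) -> Adjustment {
        match self.ratio() {
            Some(r) if r > 0.8 => Adjustment::Increase,
            Some(r) if r < 0.5 => Adjustment::Decrease,
            _ => Adjustment::Hold,
        }
    }
}
```

With the default half-life of 8 observations, a run of eight failures after a clean history is enough to pull the ratio below 0.5 in this model.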
   
By default, the model starts at the minimum concurrency and increases as
soon as data reliably predicts the RTT. All bounds are controlled by
config variables.
2025-12-01 16:43:24 -08:00
Di Xiao
eeee211e59 Upgrade git-xet version (#574) 2025-11-21 10:05:02 -08:00
Di Xiao
b5563ecd93 Better support of authentication through SSH (#553)
This PR finally enables `git-xet` on Windows to authenticate to a
remote Git server using an SSH URL. This is crucial because access
tokens to the CAS server expire every 900 s and `git-xet` needs to
re-authenticate with the Git server by itself during push/pull (whereas
the first authentication is handled by `git-lfs`).

This uses the same SSH connect utility to authenticate over an SSH
remote URL on both *nix and Windows.

Resolves XET-731
2025-11-20 12:09:57 -08:00
Di Xiao
5f77ffc46a Integration test for ssh access on Windows (#566)
This PR builds on top of
https://github.com/huggingface/xet-core/pull/565 and builds an
integration test to test access to "ssh" and "sh" on Windows through the
"git" (-> "git-lfs") -> "git-xet" call chain.

Out of all the ssh variants, access to programs like "plink", "putty",
"tortoiseplink" or "simple" should be given by the env var
`$GIT_SSH_COMMAND` or `$GIT_SSH`, or by the git config entry
`core.sshCommand`. Direct access to the most commonly used utility,
"ssh", and indirect access to "ssh" via "sh -c" on Windows are provided
by the "git" (-> "git-lfs") -> "git-xet" call chain; see
git_xet/tests/test_ssh.rs for details.
2025-11-20 03:22:19 -08:00
Di Xiao
075a9c96c0 Add ssh connect utility according to git standard (#565)
This implements a utility to help set up an SSH connection according to
Git standards.

1. Env vars `$GIT_SSH_COMMAND`, `$GIT_SSH` and git config entry
`core.sshCommand` define which ssh executable to use for an SSH
connection. `$GIT_SSH_COMMAND` takes precedence over `core.sshCommand`
and both are interpreted by the shell (e.g. `GIT_SSH_COMMAND = "ssh -i
~/.ssh/key"`), which allows additional arguments to be included. They
both take precedence over `$GIT_SSH`, which on the other hand must be
just the path to a program (which can be a wrapper shell script, if
additional arguments are needed). When none of these is given, the
default ssh program to use is `ssh`.

2. Env var `$GIT_SSH_VARIANT` takes precedence over git config entry
`ssh.variant` and they both define whether
`$GIT_SSH`/`$GIT_SSH_COMMAND`/`core.sshCommand` refer to OpenSSH,
plink/putty or tortoiseplink, or instruct git to automatically detect
the ssh program type. Valid values are "ssh" (to use OpenSSH options),
"plink", "putty", "tortoiseplink", "simple" (no options except the host
and remote command). The default auto-detection can be explicitly
requested using the value "auto". Any other value is treated as "ssh".
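
The precedence rules in the two points above can be sketched as pure functions. The type and function names are hypothetical, not the utility's API; values are taken as already-read options, and variant names are assumed to match exactly (case-sensitive).

```rust
/// How the resolved ssh program should be invoked.
#[derive(Debug, PartialEq)]
enum SshProgram {
    /// Interpreted by the shell, may include extra arguments.
    ShellCommand(String),
    /// Plain path to an executable, no arguments.
    Path(String),
}

#[derive(Debug, PartialEq)]
enum SshVariant {
    Auto,
    Ssh,
    Plink,
    Putty,
    TortoisePlink,
    Simple,
}

/// $GIT_SSH_COMMAND, then core.sshCommand, then $GIT_SSH, then "ssh".
fn resolve_ssh_program(
    git_ssh_command: Option<&str>,
    core_ssh_command: Option<&str>,
    git_ssh: Option<&str>,
) -> SshProgram {
    if let Some(cmd) = git_ssh_command.or(core_ssh_command) {
        return SshProgram::ShellCommand(cmd.to_owned());
    }
    SshProgram::Path(git_ssh.unwrap_or("ssh").to_owned())
}

/// $GIT_SSH_VARIANT takes precedence over ssh.variant; unset means auto.
fn resolve_ssh_variant(env_variant: Option<&str>, config_variant: Option<&str>) -> SshVariant {
    match env_variant.or(config_variant) {
        None | Some("auto") => SshVariant::Auto,
        Some("plink") => SshVariant::Plink,
        Some("putty") => SshVariant::Putty,
        Some("tortoiseplink") => SshVariant::TortoisePlink,
        Some("simple") => SshVariant::Simple,
        // "ssh" and any unrecognized value are treated as OpenSSH.
        Some(_) => SshVariant::Ssh,
    }
}
```

A caller would feed these from `std::env::var(...).ok()` and from `git config` lookups; auto-detection of the program type is not shown.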


This implementation follows the git standard and how the same
functionality is handled in
git-lfs
(071e19e8ea/ssh/ssh.go (L41)).
2025-11-19 12:43:16 -08:00
Hoyt Koepke
a5ea819ccb Rework of the constant configuration system. (#564) 2025-11-19 11:58:53 -08:00
Assaf Vayner
cd64baa6ca separating output providers, sequential output providers (#528)
This PR refactors how we pass the catch-all "OutputProvider" to the
download mechanism.

It separates the download system into supporting "Sequential" and
"Seeking" operations:

- Seeking, e.g. opening the file multiple times and seeking to a location
-- this is the standard writing mechanism hf_xet uses today.
- Sequential, e.g. opening a file once and writing data in order -- this
is to be used in a set of upcoming PRs/features that use the
parallel-download/sequential-write mechanism to support writing to
stdout and to a channel buffer in memory.

To support an in-memory channel with backpressure, the Channel{Writer,
Stream, Reader} types are introduced (re-introduced?) in utils. This
could be particularly useful in the mount functionality.
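
As a rough illustration of what backpressure means here, the sketch below uses the standard library's bounded `sync_channel` as a stand-in. It is not the Channel{Writer, Stream, Reader} implementation; the function name is hypothetical and the real types may well be async.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Streams chunks through a bounded in-memory channel and concatenates
/// them in order on the consumer side. The producer blocks in `send`
/// whenever `capacity` chunks are already buffered, which is the
/// backpressure: a slow reader slows the writer down instead of letting
/// memory grow without bound.
fn stream_through_bounded_channel(chunks: Vec<Vec<u8>>, capacity: usize) -> Vec<u8> {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    let producer = thread::spawn(move || {
        for chunk in chunks {
            // An error means the receiver was dropped; stop producing.
            if tx.send(chunk).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the consumer's loop.
    });

    let mut out = Vec::new();
    for chunk in rx {
        out.extend_from_slice(&chunk);
    }
    producer.join().expect("producer thread panicked");
    out
}
```

The capacity bounds how many chunks sit in memory at once, regardless of how large the overall transfer is.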
2025-10-29 14:12:24 -07:00
Hoyt Koepke
2fc772e6d0 Shard utilities needed for GC pass and server-side xorb rewriting. (#532)
This PR adds a utility that rewrites a shard to include only the
relevant xorb information, dropping unreferenced file information.

In addition, to preserve the global dedup tracking information
associated with the files, this PR also adds a backwards-compatible flag
to the chunk metadata that marks a specific chunk as global dedup
eligible. This allows the global dedup information to be tracked
independently of the file metadata.
2025-10-29 12:10:57 -07:00
Hoyt Koepke
3096b3f9c3 Test suite for directory logging functionality (#536) 2025-10-24 10:06:26 -07:00
Hoyt Koepke
69f23d630e Logging to directory + log file management; default to log directory for hf_xet (#502)
This PR switches the default logging to log events to a file in
'~/.cache/huggingface/xet/logs' (or 'xet/logs' under the specified cache
directory if not `~/.cache/huggingface/`).

In this directory, log files older than 2 weeks are cleaned up on
process start, and if the total size of files in the directory is larger
than 1GB, then log files are deleted by age to get the directory size
under 1GB. Log files are named with a timestamp and PID; by default,
logs newer than 1 day or logs with an active associated PID are never
deleted. All of these are user-configurable constants.
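
The cleanup policy can be sketched as a pure selection function over file metadata. This is not the code in this PR: the types and names are hypothetical, and it assumes that "deleted by age" means oldest first and that protected files still count toward the directory size.

```rust
use std::time::Duration;

struct LogFile {
    name: String,
    age: Duration,
    size_bytes: u64,
    pid_alive: bool, // whether the PID in the file name is still running
}

struct CleanupPolicy {
    max_age: Duration,       // default in the description: 2 weeks
    protected_age: Duration, // default in the description: 1 day
    max_dir_bytes: u64,      // default in the description: 1GB
}

/// Returns the names of the files that would be deleted, in input order.
fn select_for_deletion(files: &[LogFile], policy: &CleanupPolicy) -> Vec<String> {
    // Recent files and files owned by a live process are never deleted.
    let protected = |f: &LogFile| f.pid_alive || f.age < policy.protected_age;
    let mut delete = vec![false; files.len()];

    // Pass 1: drop everything past the retention age.
    for (i, f) in files.iter().enumerate() {
        if !protected(f) && f.age > policy.max_age {
            delete[i] = true;
        }
    }

    // Pass 2: if still over the size budget, drop oldest files first.
    let mut total: u64 = files
        .iter()
        .enumerate()
        .filter(|(i, _)| !delete[*i])
        .map(|(_, f)| f.size_bytes)
        .sum();
    let mut candidates: Vec<usize> = (0..files.len())
        .filter(|&i| !delete[i] && !protected(&files[i]))
        .collect();
    candidates.sort_by(|&a, &b| files[b].age.cmp(&files[a].age));
    for i in candidates {
        if total <= policy.max_dir_bytes {
            break;
        }
        delete[i] = true;
        total -= files[i].size_bytes;
    }

    files
        .iter()
        .enumerate()
        .filter(|(i, _)| delete[*i])
        .map(|(_, f)| f.name.clone())
        .collect()
}
```

Because protected files are never removed, the directory can stay above the size limit when the protected files alone exceed it.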
2025-10-20 14:35:43 +02:00
Assaf Vayner
c55fabb6bf hashing and chunking example tools (#496)
Adds some basic example tools (compiled with `cargo build --examples`
in the `data` crate) to compute hashes and chunk boundaries.
2025-09-26 12:49:55 -07:00