Commit Graph

12 Commits

Author SHA1 Message Date
Hoyt Koepke
0d9f78aaf4 Add README.md files and Cargo.toml updates needed for publishing hf-xet (#773)
This PR adds crates.io-facing metadata (homepage, readme, keywords,
categories) for the publishable crates, along with crate README files
and concise crate-level docs so crates.io and docs.rs pages have better
context.
2026-04-03 12:34:47 -07:00
Hoyt Koepke
014ff2d75b Fix for FD leak (#774)
Currently, the tests can fail intermittently due to a subtle fd leak in
how the session and the runtimes interact. This causes tests using the
sessions to quickly run out of file handles.

There were two different issues: 

1. XetSessionInner tracked active upload commits and file download
groups in strong-reference maps, and those child objects held a clone of
the session. That created a second cycle (session -> child -> session)
that prevented cleanup of commit/download resources and the runtime
handles. This tracking is now dropped. (Note that all
abort/sigint-cancellation behavior is handled automatically through
TaskRuntime; the session classes don't need any explicit code for it
outside of that.)

2. The static thread-local reference to the tokio runtime prevented the
tokio runtime from getting cleaned up when it was created explicitly and
not aborted. In addition, JoinHandle objects hold a reference back to
the runtime, so if these are not aborted or joined, then they also
prevent the runtime from shutting down.

The FD tracking code was left in but is gated behind the `fd-track`
feature.
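
The first issue above is the classic strong-reference cycle. The sketch below illustrates it with hypothetical `Session` and `Child` types (not the real `XetSessionInner`), and uses `Weak` handles as one generic way to keep a registry without pinning the children. The PR itself describes dropping the maps, so this shows the general pattern, not the actual change.

```rust
use std::sync::{Arc, Mutex, Weak};

// Hypothetical stand-in for a session that keeps a registry of its children.
struct Session {
    // Weak handles: the session can still see live children,
    // but it does not keep them (or, through them, itself) alive.
    children: Mutex<Vec<Weak<Child>>>,
}

// Hypothetical stand-in for an upload commit or download group.
struct Child {
    // The child holds a strong clone of the session, as described above.
    _session: Arc<Session>,
}

fn new_child(session: &Arc<Session>) -> Arc<Child> {
    let child = Arc::new(Child { _session: Arc::clone(session) });
    session.children.lock().unwrap().push(Arc::downgrade(&child));
    child
}

fn live_children(session: &Session) -> usize {
    let children = session.children.lock().unwrap();
    children.iter().filter(|w| w.upgrade().is_some()).count()
}

fn main() {
    let session = Arc::new(Session { children: Mutex::new(Vec::new()) });
    let child = new_child(&session);
    assert_eq!(Arc::strong_count(&session), 2); // caller + child

    drop(child);
    // Had the registry held an Arc<Child>, the child would still be alive
    // here and would keep the session alive in turn.
    assert_eq!(Arc::strong_count(&session), 1);
    assert_eq!(live_children(&session), 0);
}
```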

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes runtime/session lifetime management (TLS runtime refs,
shutdown behavior, and session child ownership), which can affect task
cancellation and runtime teardown across the library.
> 
> **Overview**
> Fixes intermittent file-descriptor leaks by **breaking ownership
cycles** between `XetSession` and child upload/download objects and by
ensuring `XetRuntime` can actually drop/shutdown when the last external
reference is released.
> 
> Adds an opt-in `fd-track` feature with lightweight FD counting/scoped
tracing, plus new leak-focused tests, and tightens local CAS DB/shard
manager caching to avoid duplicate `redb` opens (canonicalized paths,
weak cached handles, and cleanup on drop).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
041426e73e. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-04-02 18:28:26 -07:00
Assaf Vayner
20198a9081 Remove prometheus dependency and metrics (#769)
## Summary
- Remove the `prometheus` crate dependency from the workspace and
`xet_data`
- Delete `prometheus_metrics.rs` which defined 3 IntCounter metrics (CAS
bytes produced, bytes cleaned, bytes smudged)
- Remove metric increment calls from `file_upload_session.rs` and
`file_download_session.rs`
- Fix Windows CI flake: redb "Database already open" error in
`test_single_large`

These metrics were collected but never exposed via any HTTP endpoint or
text encoder, making them effectively dead code.

## Test plan
- [x] `cargo +nightly fmt` — clean
- [x] `cargo clippy --all-targets` — no new warnings
- [x] `cargo test -p xet-data` — 17/17 pass
- [x] `cargo test -p xet-data --features simulation --test
test_clean_smudge` — 14/14 pass (including `test_single_large`)
- [x] WASM builds (`hf_xet_wasm`, `hf_xet_thin_wasm`) — both succeed

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: this removes unused Prometheus metrics plumbing and related
dependencies without changing the core upload/download logic. Main risk
is loss of any downstream reliance on these counters at build time
(e.g., feature flags or imports).
> 
> **Overview**
> Removes the `prometheus` dependency from the workspace and `xet_data`,
and updates lockfiles accordingly (including WASM-related lockfiles).
> 
> Deletes `xet_data`’s `prometheus_metrics` module and strips the
associated counter increments from `FileUploadSession` and
`FileDownloadSession`, leaving the data processing behavior unchanged
aside from no longer recording these metrics.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
c6c866b7ca. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-04-01 14:56:58 -07:00
Hoyt Koepke
3051478cdd Allow shard expiration to be set on global dedup queries for GC simulation (#762)
Currently, simulation global dedup shard queries return full shard bytes
with no configurable shard footer expiration, and simulation control
knobs are split between partially implemented paths. This PR adds global
dedup shard expiration control to simulation clients and servers, and
extends /simulation/set_config to cover shard expiration, max range
splitting, V2 reconstruction disabling, API delay, and URL expiration in
one path. This enables rapid simulation of the GC paths by setting the
global dedup expiration to a sub-epoch value.
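
The automated summary for this PR notes that the expiry is computed as `now + expiration`, with sub-second durations rounding up. A minimal sketch of that arithmetic follows, assuming the expiry is stored as whole epoch seconds. The helper name and signature are hypothetical and not part of the simulation API.

```rust
use std::time::Duration;

/// Hypothetical helper: expiry in whole epoch seconds.
/// Any sub-second remainder in `expiration` rounds up to the next second.
fn shard_key_expiry(now_epoch_secs: u64, expiration: Duration) -> u64 {
    let mut secs = expiration.as_secs();
    if expiration.subsec_nanos() > 0 {
        secs += 1;
    }
    now_epoch_secs + secs
}

fn main() {
    // 250 ms rounds up to a full second.
    assert_eq!(shard_key_expiry(1_000, Duration::from_millis(250)), 1_001);
    // Whole seconds are used as-is.
    assert_eq!(shard_key_expiry(1_000, Duration::from_secs(2)), 1_002);
    // A zero duration expires immediately.
    assert_eq!(shard_key_expiry(1_000, Duration::ZERO), 1_000);
}
```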

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches simulation client/server APIs and shard serialization behavior
(including new trait methods and HTTP knobs), so downstream implementors
and tests may break if not updated. Changes are scoped to simulation/GC
tooling paths but affect how global-dedup shard bytes are produced and
validated.
> 
> **Overview**
> Adds a new simulation control to set **global-dedup shard
expiration**: `DirectAccessClient::set_global_dedup_shard_expiration`
now makes `query_for_global_dedup_shard` optionally return *minimal*
shard bytes (file section stripped) with `shard_key_expiry = now +
expiration` (sub-second durations round up).
> 
> Extends `MDBMinimalShard` serialization with
`serialize_xorb_subset_with_expiry` to write an optional
`shard_key_expiry` footer, and updates `LocalClient`/`MemoryClient` to
use it when expiration is enabled.
> 
> Unifies and expands runtime simulation knobs under
`/simulation/set_config` (global dedup expiration, max ranges per fetch,
disable V2 reconstruction, API delay, URL expiration) and updates
`SimulationControlClient` to apply them via a retried async POST. Also
moves integrity/reachability checks to `DeletionControlableClient`, adds
`verify_all_reachable`, and wires new `/simulation/verify_all_reachable`
with 501 behavior when no deletion client is configured.
> 
> Separately, introduces **simulation-only xorb cut thresholds**
(`XORB_CUT_THRESHOLD_*`) driven by new `xet_runtime` xorb config
overrides, and updates upload/dedup code paths to use these thresholds.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
42bd9c3f4f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-31 18:35:19 -07:00
Assaf Vayner
86935b4117 Move test-only deps to dev-dependencies in git_xet (#767)
## Summary
- Move `russh`, `rand_core`, and `tempfile` from regular dependencies to
dev-dependencies in `git_xet`, since they are only used in test code
- `russh` and `rand_core` are also declared as optional regular deps
activated by the `git-xet-for-integration-test` feature flag, since the
integration test SSH server is compiled into the library under that
feature
- Gate `test_utils/ssh_server` module and related exports behind
`#[cfg(any(test, feature = "git-xet-for-integration-test"))]`
- Gate `tests/test_ssh.rs` integration test file behind `#![cfg(feature
= "git-xet-for-integration-test")]`
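
For readers unfamiliar with this kind of gating, here is a small sketch of the `cfg(any(test, feature = ...))` pattern. The module contents and the `has_test_utils` helper are hypothetical placeholders; only the `cfg` expression comes from the description above.

```rust
// Compiled only for unit tests or when the integration-test feature is on.
#[cfg(any(test, feature = "git-xet-for-integration-test"))]
pub mod test_utils {
    /// Hypothetical placeholder for the SSH test-server helpers.
    pub fn ssh_server_available() -> bool {
        true
    }
}

/// Reports whether the gated helpers were compiled into this build.
pub fn has_test_utils() -> bool {
    cfg!(any(test, feature = "git-xet-for-integration-test"))
}

fn main() {
    // This call only exists in builds where the module itself exists.
    #[cfg(any(test, feature = "git-xet-for-integration-test"))]
    assert!(test_utils::ssh_server_available());

    println!("test utils compiled in: {}", has_test_utils());
}
```

In a default build the module is not compiled at all, so anything it depends on can live in dev-dependencies or behind an optional dependency.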

## Test plan
- [x] `cargo check -p git_xet` passes (no features)
- [x] `cargo test -p git_xet --no-run` passes (no features)
- [x] `cargo test -p git_xet --features git-xet-for-integration-test
--no-run` passes

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: primarily Cargo dependency/feature and `cfg` gating changes,
with no production logic changes; risk is limited to build/test
configuration and feature-flagged integration test coverage.
> 
> **Overview**
> **Reduces default build dependencies for `git_xet`.** Moves `russh`,
`rand_core`, and `tempfile` into `dev-dependencies`, and keeps
`russh`/`rand_core` available as *optional* deps enabled only by the
`git-xet-for-integration-test` feature.
> 
> **Gates SSH test helpers and integration tests behind a feature
flag.** Exposes `GitLFSAuthenticateResponse*` and the local SSH test
server only under `#[cfg(test)]` or `feature =
"git-xet-for-integration-test"`, and makes `tests/test_ssh.rs` compile
only when that feature is enabled.
> 
> Separately, cleans up workspace manifests/lockfiles by moving some
crates (`half`, `regex`, `futures-util`) to dev-deps where they’re only
needed for tests/benches, and adds `.worktrees/` to `.gitignore`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
cdc30a5a8f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-31 13:31:20 -07:00
Assaf Vayner
9c0cb6e4c8 Reduce workspace dependencies (batches 1-3) (#746)
## Summary

- **Remove unused dependencies**: warp (zero imports), paste (zero
invocations), tower-service (zero imports), and heed misplacement in
xet_core_structures
- **Move mockall to dev-dependencies** in xet_client by gating
`#[automock]` with `#[cfg_attr(test, automock)]`
- **Feature-gate simulation module** behind `simulation` cargo feature
in xet_client, making axum, heed, humantime, futures-util,
human-bandwidth, and tower-http optional
- **Replace duration-str with humantime** (~2 deps vs ~78 transitive
deps) across xet_runtime, xet_client simulation, and simulation crate

## Impact

| Metric | Before | After | Change |
|---|---|---|---|
| hf-xet production deps | 371 | 321 | **-50** |
| Workspace total | 575 | 569 | -6 |

## Test plan

- [x] `cargo check --workspace` passes
- [x] `cargo check -p hf-xet` passes (without simulation feature — key
validation)
- [x] `cargo test --workspace` — all tests pass (4 pre-existing auth
test failures in git_xet unrelated to this PR)
- [x] `cargo tree -p hf-xet -e normal --prefix none | sort -u | wc -l`
confirms 321 deps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk because it changes dependency graph and Cargo feature
gating (notably `xet-client` simulation modules and CI test features),
which can affect build/test behavior across targets despite minimal
runtime logic changes.
> 
> **Overview**
> Reduces workspace dependency surface by removing `duration-str`
(replaced with `humantime`) and trimming other transitive-heavy crates;
updates lockfiles accordingly across the workspace, `hf_xet`, and WASM
builds.
> 
> Introduces/propagates a `simulation` Cargo feature: `xet-client`’s
simulation server-related deps become optional and are only
compiled/exported when `feature = "simulation"` is enabled; `git_xet`
adds a `simulation` feature that forwards to dependent crates, and CI
now runs tests with `strict simulation git-xet-for-integration-test`.
> 
> Minor repo hygiene updates include ignoring `.claude/` in `.gitignore`
and wiring the `simulation` crate to depend on `xet-client` with
`features = ["simulation"]` (plus swapping its duration parsing helper
to `humantime`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
6abc194398. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:54:36 -07:00
Hoyt Koepke
332a456e1d Add ordered and unordered download streaming to session interface (#729)
This PR adds ordered and unordered download streams on XetSession,
including optional byte-range support and per-stream progress reporting.
Blocking and async variants are supported.

On the reconstruction side, this introduces UnorderedWriter and
UnorderedDownloadStream in xet_data, and extends the FileDownloadSession
stream APIs to take optional source ranges. Ordered and unordered
streams now share the same session-facing access pattern for async and
blocking callers.

This PR also renames DownloadGroup to FileDownloadGroup; the stream data
uses the per-session memory pool but doesn't count towards the maximum
number of concurrent downloads in progress.
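
To make the ordered/unordered distinction concrete, the following sketch contrasts the two consumption patterns on chunks that arrive out of order. The `Chunk` type and both functions are hypothetical illustrations, not the `UnorderedWriter` or `XetSession` APIs.

```rust
use std::collections::BTreeMap;

/// A reconstructed piece of a file: its byte offset and its data.
/// (Hypothetical type for illustration only.)
type Chunk = (u64, Vec<u8>);

/// Unordered consumer: place each chunk at its offset as soon as it arrives.
fn write_unordered(chunks: &[Chunk], total_len: usize) -> Vec<u8> {
    let mut out = vec![0u8; total_len];
    for (offset, data) in chunks {
        let start = *offset as usize;
        out[start..start + data.len()].copy_from_slice(data);
    }
    out
}

/// Ordered consumer: hold early chunks back until the next expected
/// offset is available, then emit everything that is contiguous.
fn write_ordered(chunks: &[Chunk]) -> Vec<u8> {
    let mut pending: BTreeMap<u64, &Vec<u8>> = BTreeMap::new();
    let mut out = Vec::new();
    for (offset, data) in chunks {
        pending.insert(*offset, data);
        while let Some(next) = pending.remove(&(out.len() as u64)) {
            out.extend_from_slice(next);
        }
    }
    out
}

fn main() {
    // Chunks arrive out of order, as they might from concurrent fetches.
    let arrivals: Vec<Chunk> = vec![
        (6, b"world".to_vec()),
        (0, b"hello ".to_vec()),
    ];
    assert_eq!(write_unordered(&arrivals, 11), b"hello world");
    assert_eq!(write_ordered(&arrivals), b"hello world");
}
```

The ordered path has to buffer anything that arrives early, while the unordered path can hand data to the caller immediately, provided the caller can handle arbitrary offsets.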

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core file reconstruction/writer plumbing (including
`DataWriter` ownership and new unordered writer/stream paths) and
changes public session APIs, so regressions could impact download
correctness, cancellation, or progress reporting.
> 
> **Overview**
> Adds first-class **ordered and unordered streaming download APIs** to
`xet_pkg::xet_session`, including async and blocking variants, optional
source-relative byte ranges, and per-stream progress via new
`XetDownloadStream` / `XetUnorderedDownloadStream` wrappers.
> 
> On the data layer, introduces an **unordered reconstruction path**
(`UnorderedWriter` + `UnorderedDownloadStream`) and refactors streaming
to spawn reconstruction tasks immediately but gate execution behind
`start()`; stream abort callbacks are now registered per-stream and
automatically unregistered on drop to avoid callback accumulation.
> 
> Updates the reconstruction writer contract by making
`DataWriter::finish` consume the writer (and shifting `DataWriter` to
`&mut self` usage), adjusts `SequentialWriter` accordingly, and adds
Criterion-based reconstruction benchmarks plus extensive
unordered-stream tests. Also renames session `DownloadGroup` to
`FileDownloadGroup` (and constructors) and updates call sites/examples.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e02890aa4b. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-20 14:40:18 -07:00
Hoyt Koepke
602d7679f6 Add cargo smoke-test for rapid full-workspace testing. (#741)
Currently, the full test validation is rather heavy, but running local
tests often fails to catch many issues due to the tests that probe the
full stack. This PR adds a smoke-test path that runs a meaningful subset
of the tests across the workspace that covers most errors. This runs in
about 1/8 of the time of cargo test, so it's useful for speeding up AI
model iteration.

In addition, a few intermittent failures were fixed.

There should be no runtime functionality change.
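
The automated summary for this PR mentions `#[cfg_attr(feature = "smoke-test", ignore)]` as the mechanism for skipping slow tests. A minimal sketch of that pattern is below; the `checksum` function and both tests are hypothetical examples, and the crate is assumed to declare a `smoke-test` feature.

```rust
/// Hypothetical function under test.
pub fn checksum(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u32))
}

#[cfg(test)]
mod tests {
    use super::checksum;

    // Fast test: runs in both the full and the smoke-test configurations.
    #[test]
    fn checksum_small() {
        assert_eq!(checksum(b"ab"), 97 * 31 + 98);
    }

    // Slow test: becomes `#[ignore]` only when `smoke-test` is enabled,
    // so a plain `cargo test` still runs it.
    #[test]
    #[cfg_attr(feature = "smoke-test", ignore)]
    fn checksum_large_input() {
        let data = vec![0u8; 10_000_000];
        assert_eq!(checksum(&data), 0);
    }
}

fn main() {
    println!("{}", checksum(b"ab"));
}
```

Ignored tests remain compiled, so they can still be run on demand with `cargo test -- --ignored`.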

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk since changes are limited to Cargo configuration and test
gating; no production code paths are modified. Main risk is accidentally
skipping too much coverage or misconfiguring feature flags in CI/local
workflows.
> 
> **Overview**
> Adds a new `cargo smoke-test` workflow by introducing a `smoke-test`
Cargo profile and a `cargo` alias that runs `test` with per-crate
`smoke-test` features enabled.
> 
> Defines `smoke-test` features across multiple crates and uses
`#[cfg_attr(feature = "smoke-test", ignore)]` / `#[cfg(... not(feature =
"smoke-test"))]` to skip long-running, concurrency-heavy, or full-stack
integration tests during smoke runs.
> 
> Tightens test robustness by making `SafeFileCreator` permission
assertions umask-tolerant (require owner read/write rather than an exact
`0o644`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
5d53009652. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
2026-03-20 13:32:38 -07:00
Di Xiao
fb83178d28 Fix session id regression (#738)
The session id was changed from `ulid` to `UniqueID` (a self-incrementing
u64 in memory) in a previous PR, but that is not correct.
The session id is used in CAS server logs and traces and in CDN logs to
identify a related group of activity (for debugging and similar
purposes), so it needs to be globally unique (hence `ulid`) rather than
locally unique.
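
The difference between locally and globally unique ids can be sketched with the standard library alone. `LocalIdGen` and `global_id` below are hypothetical stand-ins: real code would use the `ulid` crate, and `RandomState` is used here only as a convenient entropy source so the example has no dependencies.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Process-local counter: unique within one process,
/// but every process starts again at 0.
struct LocalIdGen(AtomicU64);

impl LocalIdGen {
    fn new() -> Self {
        LocalIdGen(AtomicU64::new(0))
    }
    fn next_id(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

/// ULID-shaped id: 48-bit millisecond timestamp followed by 80 random bits.
/// This is an illustrative stand-in, not the `ulid` crate's implementation.
fn global_id() -> u128 {
    let ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the Unix epoch")
        .as_millis();
    let r1 = RandomState::new().build_hasher().finish() as u128;
    let r2 = RandomState::new().build_hasher().finish() as u128;
    let random80 = ((r1 << 64) | r2) & ((1u128 << 80) - 1);
    ((ms & ((1u128 << 48) - 1)) << 80) | random80
}

fn main() {
    // Two independent counters model two client processes:
    // both hand out the same first id.
    let (a, b) = (LocalIdGen::new(), LocalIdGen::new());
    assert_eq!(a.next_id(), b.next_id());

    // Timestamp-plus-randomness ids are overwhelmingly unlikely to collide.
    assert_ne!(global_id(), global_id());
}
```

Because server-side logs aggregate activity from many clients, ids from a per-process counter would overlap as soon as two clients are active.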
2026-03-19 13:31:40 -07:00
Hoyt Koepke
506fc28291 Simplify progress tracking + Unify Task ID tracking + Legacy Interface (#726)
Currently, progress tracking is split between callback-driven and
snapshot-driven paths, making session and task wiring across xet_data,
xet_pkg, hf_xet, and git_xet harder to keep consistent. This PR moves
upload/download progress to a polling snapshot model backed by atomics.
It also switches task identifiers to a UniqueID shared with the progress
tracking throughout the session APIs.
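
As a rough illustration of a polling snapshot model backed by atomics, consider the sketch below. The `Progress` and `Snapshot` types are hypothetical and much simpler than the real progress tracking structures.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical counters shared between worker tasks and a poller.
#[derive(Default)]
struct Progress {
    bytes_done: AtomicU64,
    bytes_total: AtomicU64,
}

/// Plain-value copy returned to whoever polls.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Snapshot {
    bytes_done: u64,
    bytes_total: u64,
}

impl Progress {
    /// Workers record progress without locks or callbacks.
    fn add_done(&self, n: u64) {
        self.bytes_done.fetch_add(n, Ordering::Relaxed);
    }

    /// The poller reads the current values whenever it wants.
    /// The two loads are independent, so mid-transfer snapshots are
    /// approximate rather than jointly consistent.
    fn snapshot(&self) -> Snapshot {
        Snapshot {
            bytes_done: self.bytes_done.load(Ordering::Relaxed),
            bytes_total: self.bytes_total.load(Ordering::Relaxed),
        }
    }
}

fn main() {
    let progress = Arc::new(Progress::default());
    progress.bytes_total.store(4_000, Ordering::Relaxed);

    let workers: Vec<_> = (0..4)
        .map(|_| {
            let p = Arc::clone(&progress);
            thread::spawn(move || {
                for _ in 0..10 {
                    p.add_done(100);
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }

    let done = Snapshot { bytes_done: 4_000, bytes_total: 4_000 };
    assert_eq!(progress.snapshot(), done);
}
```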

This PR also updates the rate estimation to use the lighter-weight
exponentially weighted moving average model, so this can be done at a
low level.
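
An exponentially weighted moving average needs only one stored value per estimate, which is why it is cheap enough for low-level use. The sketch below assumes a fixed smoothing factor `alpha` and one update per polling interval; the actual estimator in the PR may be parameterized differently.

```rust
/// Hypothetical EWMA estimator for a transfer rate in bytes per second.
struct EwmaRate {
    alpha: f64,        // weight of the newest sample, in (0, 1]
    rate: Option<f64>, // None until the first sample arrives
}

impl EwmaRate {
    fn new(alpha: f64) -> Self {
        EwmaRate { alpha, rate: None }
    }

    /// Fold one interval's observation into the running estimate:
    /// rate = alpha * sample + (1 - alpha) * previous_rate.
    fn update(&mut self, bytes: u64, elapsed_secs: f64) -> f64 {
        let sample = bytes as f64 / elapsed_secs;
        let next = match self.rate {
            None => sample,
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        };
        self.rate = Some(next);
        next
    }
}

fn main() {
    let mut rate = EwmaRate::new(0.5);
    assert_eq!(rate.update(100, 1.0), 100.0); // first sample seeds the estimate
    assert_eq!(rate.update(300, 1.0), 200.0); // 0.5 * 300 + 0.5 * 100
    assert_eq!(rate.update(200, 2.0), 150.0); // sample is 100 B/s: 0.5 * 100 + 0.5 * 200
}
```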

To preserve compatibility for existing callback consumers,
callback-oriented upload/download progress tracking APIs are moved under
xet_pkg::legacy and bridged from polling snapshots via callback-based
updaters. hf_xet and git_xet are updated to use that legacy bridge
layer, so current integrations keep working until everything is fully
switched over to the XetSession method.
2026-03-18 18:07:43 -07:00
Brian Ronan
6232b42591 Xorb download URL debug logs (#714)
It's a bit annoying to try to ensure our CDN routing is correct. This PR
logs the URL domain of the first fetch term download to the debug logs.

Please don't hesitate to recommend alternative approaches.
2026-03-13 16:05:30 -07:00
Hoyt Koepke
45d38a13a9 Code reorganization towards release of xet cargo package (#693)
This PR is a massive rearrangement of the code base into 5 packages
intended for release on cargo. The directories and corresponding
packages are:

1. xet_runtime/ — compiles into the xet-runtime package. Contains the
runtime, config, and logging management.
2. xet_core_structures/ — compiles into the xet-core-structures package.
Contains core data structures for hashing, shards, and xorbs as well as
internal data structures that depend on these.
3. xet_client/ — compiles into the xet-client package, contains client
code for remotely connecting to the Hugging Face servers.
4. xet_data/ — compiles into the xet-data package, contains the data
processing pipeline: chunking/deduplication, file reconstruction,
clean/smudge operations, and progress tracking.
5. xet_pkg/ — compiles into the hf-xet package, provides the top-level
session-based API for file upload and download with user-facing error
categorization. This is the primary package downstream dependencies
would use. This also contains a single summary error type, XetError,
that translates cleanly into python error types.

In addition, the other tools are: 

- git_xet/ — the git_xet CLI binary crate (location preserved). 
- hf_xet/ -- the hf_xet python package (location preserved).
- simulation/ — the simulation crate for upload scenario benchmarking.
- wasm/ -- the wasm objects. 

The full description — and information for an AI agent to use to update
downstream dependencies — is at
api_changes/update_260309_package_restructure.md.

Summary of moves:

- xet_runtime: became xet_runtime::core inside xet_runtime/.
- utils: became xet_runtime::utils inside xet_runtime/.
- xet_config: became xet_runtime::config inside xet_runtime/.
- xet_logging: became xet_runtime::logging inside xet_runtime/.
- error_printer: became xet_runtime::error_printer inside xet_runtime/.
- file_utils: became xet_runtime::file_utils inside xet_runtime/.
- merklehash: became xet_core_structures::merklehash inside
xet_core_structures/.
- mdb_shard: became xet_core_structures::metadata_shard inside
xet_core_structures/.
- xorb_object: became xet_core_structures::xorb_object inside
xet_core_structures/.
- cas_client: became xet_client::cas_client inside xet_client/.
- hub_client: became xet_client::hub_client inside xet_client/.
- cas_types: became xet_client::cas_types inside xet_client/.
- chunk_cache: became xet_client::chunk_cache inside xet_client/.
- data: became xet_data::processing inside xet_data/.
- deduplication: became xet_data::deduplication inside xet_data/.
- file_reconstruction: became xet_data::file_reconstruction inside
xet_data/.
- progress_tracking: became xet_data::progress_tracking inside
xet_data/.
- xet_session: became xet::xet_session inside xet_pkg/.

- Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level
into wasm/; internal imports updated, public APIs unchanged.
2026-03-11 12:02:38 -07:00