Commit Graph

8 Commits

Author SHA1 Message Date
Hoyt Koepke
0d9f78aaf4 Add README.md files and Cargo.toml updates needed for publishing hf-xet (#773)
This PR adds crates.io-facing metadata (homepage, readme, keywords,
categories) for the publishable crates, along with crate README files
and concise crate-level docs so crates.io and docs.rs pages have better
context.
2026-04-03 12:34:47 -07:00
Hoyt Koepke
3051478cdd Allow shard expiration to be set on global dedup queries for GC simulation (#762)
Currently, simulation global dedup shard queries return full shard bytes
with no configurable shard footer expiration, and simulation control
knobs are split between partially implemented paths. This PR adds global
dedup shard expiration control to simulation clients and servers, and
extends /simulation/set_config to cover shard expiration, max range
splitting, V2 reconstruction disabling, API delay, and URL expiration in
one path. This enables rapid simulation of the GC paths by setting the
global dedup expiration to a sub-epoch value.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches simulation client/server APIs and shard serialization behavior
(including new trait methods and HTTP knobs), so downstream implementors
and tests may break if not updated. Changes are scoped to simulation/GC
tooling paths but affect how global-dedup shard bytes are produced and
validated.
> 
> **Overview**
> Adds a new simulation control to set **global-dedup shard
expiration**: `DirectAccessClient::set_global_dedup_shard_expiration`
now makes `query_for_global_dedup_shard` optionally return *minimal*
shard bytes (file section stripped) with `shard_key_expiry = now +
expiration` (sub-second durations round up).
> 
> Extends `MDBMinimalShard` serialization with
`serialize_xorb_subset_with_expiry` to write an optional
`shard_key_expiry` footer, and updates `LocalClient`/`MemoryClient` to
use it when expiration is enabled.
> 
> Unifies and expands runtime simulation knobs under
`/simulation/set_config` (global dedup expiration, max ranges per fetch,
disable V2 reconstruction, API delay, URL expiration) and updates
`SimulationControlClient` to apply them via a retried async POST. Also
moves integrity/reachability checks to `DeletionControlableClient`, adds
`verify_all_reachable`, and wires new `/simulation/verify_all_reachable`
with 501 behavior when no deletion client is configured.
> 
> Separately, introduces **simulation-only xorb cut thresholds**
(`XORB_CUT_THRESHOLD_*`) driven by new `xet_runtime` xorb config
overrides, and updates upload/dedup code paths to use these thresholds.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
42bd9c3f4f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
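
The expiry rule described in the overview (`shard_key_expiry = now + expiration`, with sub-second durations rounding up) could be sketched roughly as follows. The helper names `shard_key_expiry_secs` and `unix_now_secs`, and the choice of whole Unix seconds as the unit, are assumptions for illustration and are not taken from the actual `MDBMinimalShard` code.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical helper: compute an expiry timestamp in whole seconds.
/// Any fractional part of `expiration` rounds up to the next second,
/// so a sub-second expiration still yields an expiry strictly after `now_secs`.
pub fn shard_key_expiry_secs(now_secs: u64, expiration: Duration) -> u64 {
    let whole = expiration.as_secs();
    // 1 if there is any sub-second remainder, otherwise 0.
    let round_up = u64::from(expiration.subsec_nanos() > 0);
    now_secs.saturating_add(whole).saturating_add(round_up)
}

/// Hypothetical helper: current time as seconds since the Unix epoch.
/// Falls back to 0 if the system clock is set before the epoch.
pub fn unix_now_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

Under these assumptions, an expiration of 1 ms at `now = 100` gives an expiry of 101, which is what makes a sub-epoch expiration usable for quickly exercising the GC paths.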
2026-03-31 18:35:19 -07:00
Assaf Vayner
86935b4117 Move test-only deps to dev-dependencies in git_xet (#767)
## Summary
- Move `russh`, `rand_core`, and `tempfile` from regular dependencies to
dev-dependencies in `git_xet`, since they are only used in test code
- `russh` and `rand_core` are also declared as optional regular deps
activated by the `git-xet-for-integration-test` feature flag, since the
integration test SSH server is compiled into the library under that
feature
- Gate `test_utils/ssh_server` module and related exports behind
`#[cfg(any(test, feature = "git-xet-for-integration-test"))]`
- Gate `tests/test_ssh.rs` integration test file behind `#![cfg(feature
= "git-xet-for-integration-test")]`
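
A minimal sketch of the gating pattern described above. Only the feature name and the `cfg` expression come from this PR; `parse_ssh_port` and `FakeSshServer` are hypothetical stand-ins, not items from `git_xet`.

```rust
/// Always compiled: part of the normal library surface.
/// (Hypothetical function, used only to contrast with the gated module.)
pub fn parse_ssh_port(s: &str) -> Option<u16> {
    s.trim().parse().ok()
}

/// Compiled only for unit tests, or when the integration-test feature is
/// enabled. In a default build this module does not exist at all, so its
/// dependencies can live in dev-dependencies or behind optional deps.
#[cfg(any(test, feature = "git-xet-for-integration-test"))]
pub mod test_utils {
    /// Hypothetical stand-in for the local SSH test server.
    pub struct FakeSshServer {
        pub port: u16,
    }

    impl FakeSshServer {
        pub fn start(port: u16) -> Self {
            Self { port }
        }
    }
}
```

An integration test file can apply the same idea at file level with an inner attribute such as `#![cfg(feature = "git-xet-for-integration-test")]`, so the whole file compiles to nothing unless the feature is on.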

## Test plan
- [x] `cargo check -p git_xet` passes (no features)
- [x] `cargo test -p git_xet --no-run` passes (no features)
- [x] `cargo test -p git_xet --features git-xet-for-integration-test
--no-run` passes

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: primarily Cargo dependency/feature and `cfg` gating changes,
with no production logic changes; risk is limited to build/test
configuration and feature-flagged integration test coverage.
> 
> **Overview**
> **Reduces default build dependencies for `git_xet`.** Moves `russh`,
`rand_core`, and `tempfile` into `dev-dependencies`, and keeps
`russh`/`rand_core` available as *optional* deps enabled only by the
`git-xet-for-integration-test` feature.
> 
> **Gates SSH test helpers and integration tests behind a feature
flag.** Exposes `GitLFSAuthenticateResponse*` and the local SSH test
server only under `#[cfg(test)]` or `feature =
"git-xet-for-integration-test"`, and makes `tests/test_ssh.rs` compile
only when that feature is enabled.
> 
> Separately, cleans up workspace manifests/lockfiles by moving some
crates (`half`, `regex`, `futures-util`) to dev-deps where they’re only
needed for tests/benches, and adds `.worktrees/` to `.gitignore`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
cdc30a5a8f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-31 13:31:20 -07:00
Assaf Vayner
9c0cb6e4c8 Reduce workspace dependencies (batches 1-3) (#746)
## Summary

- **Remove unused dependencies**: warp (zero imports), paste (zero
invocations), tower-service (zero imports), and heed misplacement in
xet_core_structures
- **Move mockall to dev-dependencies** in xet_client by gating
`#[automock]` with `#[cfg_attr(test, automock)]`
- **Feature-gate simulation module** behind `simulation` cargo feature
in xet_client, making axum, heed, humantime, futures-util,
human-bandwidth, and tower-http optional
- **Replace duration-str with humantime** (~2 deps vs ~78 transitive
deps) across xet_runtime, xet_client simulation, and simulation crate

## Impact

| Metric | Before | After | Change |
|---|---|---|---|
| hf-xet production deps | 371 | 321 | **-50** |
| Workspace total | 575 | 569 | -6 |

## Test plan

- [x] `cargo check --workspace` passes
- [x] `cargo check -p hf-xet` passes (without simulation feature — key
validation)
- [x] `cargo test --workspace` — all tests pass (4 pre-existing auth
test failures in git_xet unrelated to this PR)
- [x] `cargo tree -p hf-xet -e normal --prefix none | sort -u | wc -l`
confirms 321 deps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk because it changes dependency graph and Cargo feature
gating (notably `xet-client` simulation modules and CI test features),
which can affect build/test behavior across targets despite minimal
runtime logic changes.
> 
> **Overview**
> Reduces workspace dependency surface by removing `duration-str`
(replaced with `humantime`) and trimming other transitive-heavy crates;
updates lockfiles accordingly across the workspace, `hf_xet`, and WASM
builds.
> 
> Introduces/propagates a `simulation` Cargo feature: `xet-client`’s
simulation server-related deps become optional and are only
compiled/exported when `feature = "simulation"` is enabled; `git_xet`
adds a `simulation` feature that forwards to dependent crates, and CI
now runs tests with `strict simulation git-xet-for-integration-test`.
> 
> Minor repo hygiene updates include ignoring `.claude/` in `.gitignore`
and wiring the `simulation` crate to depend on `xet-client` with
`features = ["simulation"]` (plus swapping its duration parsing helper
to `humantime`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
6abc194398. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:54:36 -07:00
Hoyt Koepke
69962587b5 Composable Hash Functionality (#745)
Currently, computing aggregate chunk hashes across independently
processed ranges requires recomputing over the full concatenated chunk
list. This PR introduces ChunkHashRange, a composable representation
that can hash contiguous partial ranges and merge them while preserving
equivalence with the existing xorb_hash / file_hash behavior. This
allows an intermediate representation of the hash ranges that can be
merged in arbitrary order to get the final hash. It also uses O(log(n))
storage, and all operations run in linear time. Serialization and
deserialization are fully supported.
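
To illustrate what "composable" means here, the sketch below uses a toy polynomial hash as a stand-in. It is not the `ChunkHashRange` algorithm: it is not cryptographic, it uses O(1) rather than O(log(n)) storage, and it does not reproduce `xorb_hash` / `file_hash`. The type `RangeDigest` and the constants are hypothetical. It only shows the property that adjacent range summaries can be merged in any grouping and still match the summary of the full list.

```rust
/// Toy composable summary of a contiguous run of chunk hashes.
/// Hypothetical type for illustration; not the real ChunkHashRange.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RangeDigest {
    value: u64, // polynomial hash of the run, modulo P
    shift: u64, // BASE^len modulo P, needed to append another run
}

const P: u64 = (1 << 61) - 1; // Mersenne prime modulus
const BASE: u64 = 1_000_003;

fn mul_mod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

impl RangeDigest {
    /// Digest of an empty run; the identity element for `merge`.
    pub fn empty() -> Self {
        Self { value: 0, shift: 1 }
    }

    /// Digest a contiguous slice of (toy, 64-bit) chunk hashes.
    pub fn from_chunk_hashes(hashes: &[u64]) -> Self {
        let mut d = Self::empty();
        for &h in hashes {
            d.value = (mul_mod(d.value, BASE) + h % P) % P;
            d.shift = mul_mod(d.shift, BASE);
        }
        d
    }

    /// Combine this run with the run that immediately follows it.
    /// The operation is associative, so adjacent runs can be merged
    /// in any grouping.
    pub fn merge(self, right: Self) -> Self {
        Self {
            value: (mul_mod(self.value, right.shift) + right.value) % P,
            shift: mul_mod(self.shift, right.shift),
        }
    }
}
```

With three adjacent runs `a`, `b`, `c`, both `a.merge(b).merge(c)` and `a.merge(b.merge(c))` equal the digest of the concatenated list, so independently processed ranges never need to be rehashed from scratch.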

The main use case for this is partial file edits. Previously, to edit
the middle of a large file, the client had to know all the hashes for
the full file, even if only a few in the middle changed. For a large
file, this can still be hundreds of MB, since the chunk metadata size is
roughly 1/1000 of the data size. With this change, we can now transmit
the unmodified parts of a file in O(log(n)) storage and still build the
full file hash; a sequence of 10M chunks now takes the equivalent
storage of ~500 chunks or so.

Along the way, we also added an optimization to the merge step that
avoids an allocation, yielding a 2x speedup.

---------

Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
2026-03-27 08:38:59 -07:00
Hoyt Koepke
602d7679f6 Add cargo smoke-test for rapid full-workspace testing. (#741)
Currently, full test validation is rather heavy, but running only local
tests often misses issues that are caught by the tests probing the full
stack. This PR adds a smoke-test path that runs a meaningful subset of
the tests across the workspace and covers most errors. It runs in about
1/8 of the time of cargo test, so it is useful for speeding up AI model
iteration.

In addition, a few intermittent failures were fixed.

There should be no runtime functionality change.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk since changes are limited to Cargo configuration and test
gating; no production code paths are modified. Main risk is accidentally
skipping too much coverage or misconfiguring feature flags in CI/local
workflows.
> 
> **Overview**
> Adds a new `cargo smoke-test` workflow by introducing a `smoke-test`
Cargo profile and a `cargo` alias that runs `test` with per-crate
`smoke-test` features enabled.
> 
> Defines `smoke-test` features across multiple crates and uses
`#[cfg_attr(feature = "smoke-test", ignore)]` / `#[cfg(... not(feature =
"smoke-test"))]` to skip long-running, concurrency-heavy, or full-stack
integration tests during smoke runs.
> 
> Tightens test robustness by making `SafeFileCreator` permission
assertions umask-tolerant (require owner read/write rather than an exact
`0o644`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
5d53009652. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
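
A minimal sketch of the two test-side techniques mentioned in the overview: skipping a slow test under the `smoke-test` feature, and checking file permissions in a umask-tolerant way. The function `owner_can_read_write` and both test names are hypothetical and are not the actual `SafeFileCreator` tests.

```rust
/// Returns true when the Unix permission bits grant the owner both read
/// and write access. Unlike an exact comparison with 0o644, this holds
/// regardless of which group/other bits the process umask removed.
pub fn owner_can_read_write(mode: u32) -> bool {
    (mode & 0o600) == 0o600
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Fast test: runs in both full and smoke-test runs.
    #[test]
    fn quick_permission_check() {
        assert!(owner_can_read_write(0o644));
        assert!(owner_can_read_write(0o600));
    }

    /// Stand-in for a long-running test: marked `ignore` only when the
    /// `smoke-test` feature is enabled, so a normal `cargo test` still
    /// runs it.
    #[test]
    #[cfg_attr(feature = "smoke-test", ignore)]
    fn exhaustive_permission_sweep() {
        for mode in 0..=0o777u32 {
            let owner_bits = (mode >> 6) & 0b111;
            let expected = (owner_bits & 0b110) == 0b110;
            assert_eq!(owner_can_read_write(mode), expected);
        }
    }
}
```

Using `cfg_attr(..., ignore)` rather than removing the test keeps it visible in test listings during smoke runs, where it is reported as ignored instead of silently disappearing.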

---------

Co-authored-by: Hoyt Koepke <hoytak@xethub.com>
2026-03-20 13:32:38 -07:00
Hoyt Koepke
749c28b086 Error unification and cleanup (#737)
This PR performs some housekeeping and removes technical debt around the
use of different error types, unifying them with the Python
interface.

- Our client code leaned heavily on anyhow errors, an artifact of using
them before switching to thiserror. This PR cleans these up in favor of
using ClientError or other named error types directly.
- It also removes all the aliases to the old error type names present in
the packages before the refactoring, now settling into ClientError,
FormatError, DataError, and RuntimeError, with XetError being the error
type exposed publicly.
- Also, currently, xet_session exposes SessionError as an alias of
XetError, which adds an extra public type name without adding behavior.
This PR removes that alias and standardizes the public API/docs onto
XetError directly.
- It also tightens Python-facing error behavior and moves the Python
handling to the XetError class directly, hidden behind a python feature
flag. Using these types, hf_xet now registers XetObjectNotFoundError and
XetAuthenticationError exception classes for authentication and the
not-found cases. These inherit from the current exception classes, so
all behavior is preserved.
- In addition, the From for PyErr mapping routes
timeout/network/auth/not-found categories to more appropriate Python
exception types than simply RuntimeError.
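
The category-based routing described in the last bullet might look roughly like the sketch below. It avoids `pyo3` so that it stays self-contained, and it returns exception class names as strings. `ErrorCategory` and `python_exception_name` are hypothetical. The choice of `TimeoutError` and `ConnectionError` for the timeout and network categories is an assumption, since the description only says "more appropriate" types. `XetAuthenticationError`, `XetObjectNotFoundError`, `ValueError` (via `PyValueError`), and `RuntimeError` are the names mentioned in this PR.

```rust
/// Hypothetical coarse error categories; the real XetError may be
/// structured differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    InvalidArgument,
    Timeout,
    Network,
    Authentication,
    NotFound,
    Other,
}

/// Map a category to the name of the Python exception class it would be
/// raised as. Strings are used here instead of pyo3 exception types.
pub fn python_exception_name(category: ErrorCategory) -> &'static str {
    match category {
        ErrorCategory::InvalidArgument => "ValueError",
        // Assumption: builtin Python classes chosen for illustration.
        ErrorCategory::Timeout => "TimeoutError",
        ErrorCategory::Network => "ConnectionError",
        // Custom classes registered by hf_xet according to this PR.
        ErrorCategory::Authentication => "XetAuthenticationError",
        ErrorCategory::NotFound => "XetObjectNotFoundError",
        // Everything else keeps the previous catch-all behavior.
        ErrorCategory::Other => "RuntimeError",
    }
}
```

Keeping the mapping in a single exhaustive `match` means that adding a new category fails to compile until its Python-side exception is chosen.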

This is primarily an API-surface cleanup plus error-classification
alignment.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> API-breaking error-surface changes (removal of legacy alias modules
and signature changes like `CredentialHelper::fill_credential`) may
require downstream code updates, especially where errors are
matched/converted. Runtime behavior should be mostly unchanged, but
error mapping/propagation paths (including Python exceptions) are widely
touched across crates.
> 
> **Overview**
> This PR **unifies error types across the workspace** by removing
legacy re-export/alias modules (e.g. `CasClientError`, `CasTypesError`,
`DataProcessingError`, `SessionError`) and updating call sites to use
canonical errors like `xet_client::ClientError`,
`xet_core_structures::CoreError`, and `xet_data::DataError` directly.
> 
> It updates CAS client code to **standardize on
`crate::error::Result`/`ClientError`**, including deleting
`cas_client/error.rs`, adjusting error conversions in retry/http
middleware paths, and updating simulation/local-server code to map
`ClientError` to HTTP responses.
> 
> Python bindings (`hf_xet`) now **convert failures via `XetError`**
(with `xet_pkg` built with `python` support), register custom exceptions
on module init, and refine argument-validation errors to `PyValueError`
while routing network/timeout/auth/not-found to more appropriate Python
exception classes.
> 
> Misc cleanup: `git_xet` now depends on `xet-data`, simulation binaries
switch to `anyhow::Result`/`bail!`, and lockfiles are updated for
new/updated dependencies (notably `pyo3`/`inventory`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
f3d056a909. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-03-19 16:34:28 -07:00
Hoyt Koepke
45d38a13a9 Code reorganization towards release of xet cargo package (#693)
This PR is a massive rearrangement of the code base into 5 packages
intended for release on crates.io. The directories and corresponding
packages are:

1. xet_runtime/: compiles into the xet-runtime package. Contains the
runtime, config, and logging management.
2. xet_core_structures/: compiles into the xet-core-structures package.
Contains core data structures for hashing, shards, and xorbs as well as
internal data structures that depend on these.
3. xet_client/: compiles into the xet-client package. Contains client
code for remotely connecting to the Hugging Face servers.
4. xet_data/: compiles into the xet-data package. Contains the data
processing pipeline: chunking/deduplication, file reconstruction,
clean/smudge operations, and progress tracking.
5. xet_pkg/: compiles into the hf-xet package. Provides the top-level
session-based API for file upload and download with user-facing error
categorization. This is the primary package downstream dependencies
would use. It also contains a single summary error type, XetError,
that translates cleanly into Python error types.

In addition, the other tools are:

- git_xet/: the git_xet CLI binary crate (location preserved).
- hf_xet/: the hf_xet Python package (location preserved).
- simulation/: the simulation crate for upload scenario benchmarking.
- wasm/: the wasm objects.

The full description — and information for an AI agent to use to update
downstream dependencies — is at
api_changes/update_260309_package_restructure.md.

Summary of moves:

- xet_runtime: became xet_runtime::core inside xet_runtime/.
- utils: became xet_runtime::utils inside xet_runtime/.
- xet_config: became xet_runtime::config inside xet_runtime/.
- xet_logging: became xet_runtime::logging inside xet_runtime/.
- error_printer: became xet_runtime::error_printer inside xet_runtime/.
- file_utils: became xet_runtime::file_utils inside xet_runtime/.
- merklehash: became xet_core_structures::merklehash inside
xet_core_structures/.
- mdb_shard: became xet_core_structures::metadata_shard inside
xet_core_structures/.
- xorb_object: became xet_core_structures::xorb_object inside
xet_core_structures/.
- cas_client: became xet_client::cas_client inside xet_client/.
- hub_client: became xet_client::hub_client inside xet_client/.
- cas_types: became xet_client::cas_types inside xet_client/.
- chunk_cache: became xet_client::chunk_cache inside xet_client/.
- data: became xet_data::processing inside xet_data/.
- deduplication: became xet_data::deduplication inside xet_data/.
- file_reconstruction: became xet_data::file_reconstruction inside
xet_data/.
- progress_tracking: became xet_data::progress_tracking inside
xet_data/.
- xet_session: became xet::xet_session inside xet_pkg/.

- Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level
into wasm/; internal imports updated, public APIs unchanged.
2026-03-11 12:02:38 -07:00