Commit Graph

368 Commits

Author SHA1 Message Date
Di Xiao
75952ae618 Better support "xet-write-token" API authorization model and LFS Batch Api change (#498)
1. This PR updates the hub client xet access token request to use custom
rev in addition to the default "main". This better supports the
"xet-write-token" API authorization model:
Clients can get a xet write token, if
- the "rev" is a regular branch, with a HF write token;
- the "rev" is a pr branch with a corresponding open PR, with a HF
write or read token;
- it intends to create a pr and repo is enabled for discussion, with a
HF write or read token.

2. Fixed a bug when getting the current branch name in a repo, which
didn't parse branch names with "/" correctly: change
`refs_heads_branch.rsplit('/').next()` to
`refs_heads_branch.strip_prefix("refs/heads/")`.

3. Also updated xet transfer agent to use the refresh route in the LFS
Batch Api
[response](e3be2b3c8f/server/app/gitHostingRoutes.ts (L1713)).

4. Use the session id in the LFS Batch Api
[response](e3be2b3c8f/server/app/gitHostingRoutes.ts (L1657))
for token refresh and CAS requests.
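
The branch-name fix in item 2 can be illustrated with a minimal sketch. The function names below are hypothetical and only the two string operations come from the commit message:

```rust
/// Extract the branch name from a full ref such as "refs/heads/feature/login".
/// Returns None when the ref does not live under "refs/heads/".
/// (Hypothetical helper name; mirrors the `strip_prefix` call from the fix.)
fn branch_name(refs_heads_branch: &str) -> Option<&str> {
    refs_heads_branch.strip_prefix("refs/heads/")
}

/// The previous approach: keep only the last '/'-separated component.
/// For a branch called "feature/login" this loses the "feature/" part.
fn branch_name_last_component(refs_heads_branch: &str) -> Option<&str> {
    refs_heads_branch.rsplit('/').next()
}
```

With the ref `refs/heads/feature/login`, the first function yields `feature/login` while the second yields only `login`.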
2025-09-23 16:07:54 -07:00
Di Xiao
15942e295e Fix git xet release bug (#504)
`gh release create` creates tags and thus requires repo checkout
2025-09-22 16:57:35 -07:00
Di Xiao
55234c489b Build and release git-xet (#499)
This defines the workflow to build git-xet on Linux (amd64 & arm64),
macOS (amd64 & arm64) and Windows (amd64). For the macOS build the
compiled binary is signed and notarized in place.
2025-09-22 15:44:02 -07:00
Hoyt Koepke
610874ab04 Allow Duration and byte sizes in constants for easier use. (#495)
- Duration: Currently, we use a lot of _MS and _SEC suffixes in
constants to denote duration. This PR allows std::time::Duration to be
used directly, with values such as "10sec" or "100ms" or "1d" translated
directly into std::time::Duration.

- ByteSize: It also introduces a new utility type, ByteSize, that simply
wraps a u64 but allows the user to specify "1mb" or "45gb" as the value
when setting constant values. The suffixes mb, mib, kb, kib, gb, gib, b,
etc. are all supported, with the default being the raw value.
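
A rough sketch of how such suffixed values might be parsed follows. It is not the parser from this PR: the function names, the exact set of accepted duration suffixes, and the choice of decimal multipliers for kb/mb/gb versus binary multipliers for kib/mib/gib are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Split a value such as "100ms" into its leading integer and lowercase suffix.
fn split_number_suffix(s: &str) -> Option<(u64, String)> {
    let s = s.trim();
    let digits_end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let value = s[..digits_end].parse::<u64>().ok()?;
    Some((value, s[digits_end..].trim().to_ascii_lowercase()))
}

/// Parse "10sec", "100ms" or "1d" into a Duration (assumed suffix set).
fn parse_duration(s: &str) -> Option<Duration> {
    let (value, suffix) = split_number_suffix(s)?;
    match suffix.as_str() {
        "ms" => Some(Duration::from_millis(value)),
        "s" | "sec" => Some(Duration::from_secs(value)),
        "d" => value.checked_mul(86_400).map(Duration::from_secs),
        _ => None,
    }
}

/// Parse "1mb", "45gib" or a raw number into a byte count.
/// Assumption: kb/mb/gb are powers of 1000, kib/mib/gib powers of 1024.
fn parse_byte_size(s: &str) -> Option<u64> {
    let (value, suffix) = split_number_suffix(s)?;
    let multiplier: u64 = match suffix.as_str() {
        "" | "b" => 1, // no suffix means the raw value
        "kb" => 1_000,
        "kib" => 1 << 10,
        "mb" => 1_000_000,
        "mib" => 1 << 20,
        "gb" => 1_000_000_000,
        "gib" => 1 << 30,
        _ => return None,
    };
    value.checked_mul(multiplier)
}
```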
2025-09-19 10:59:11 -07:00
Rajat Arya
f612564c25 MacOS diag scripts (#497)
Adds corresponding Diagnostics script for MacOS.

Lightly tested with hf-xet 1.1.10 and MacOS 15.6.1 - correctly takes
stack traces on interval and writes out to diagnostics folder as
expected.
2025-09-17 15:07:35 -07:00
Di Xiao
fa030edcd5 upgrade rust edition to 2024; upgrade rustc to 1.89 (#494)
- Upgrade Rust edition and rustc version to bring in some nice features,
e.g. let chains instead of nested if block.
- Fix clippy and format due to the upgrade.
- Fix a bug identified by the new rustc:
6cb0a7fb4e/xet_runtime/src/runtime.rs (L195)
```
#[cfg(not(target_family = "wasm"))]
{
    // A new multithreaded runtime with a capped number of threads
    TokioRuntimeBuilder::new_multi_thread().worker_threads(get_num_tokio_worker_threads())
}
```
Here the closing curly bracket drops the temporary builder while a `&mut
Self` to the dropped value is returned. (This may be due to a difference
between compilers in how they treat the scope of the "{...}" in
`#[cfg(...)] {...}`.)
2025-09-17 10:28:50 -07:00
Hoyt Koepke
6cb0a7fb4e Improved user-configurable constant handling (#493)
This PR adds two upgrades to the current configurable_constants! macro,
which allows users to specify the values of configuration constants
using environment variables and the like:
- Allows bool values to be parsed by 0, 1, true, false, on, off, etc.
configurable_bool_constants! is no longer needed.
- Allows Option<T> to be a specified type with a default value of None,
which parses the environment value as type T but puts it in Some(Value)
if it's present and None if it's not specified. This allows us to
determine if a value has been specified, e.g. in the case where the
default depends on other things but can be overridden.
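
The two behaviors can be sketched as plain functions. This is an illustration only, not the macro's implementation: the function names are hypothetical, and accepting "yes"/"no" is an assumption covered by the "etc." above.

```rust
/// Parse common boolean spellings, case-insensitively.
/// Returns None for anything unrecognized.
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "on" | "yes" => Some(true),
        "0" | "false" | "off" | "no" => Some(false),
        _ => None,
    }
}

/// Optional constant: None when the variable is unset (or unparsable),
/// Some(value) when it is set and parses as T.
fn parse_optional<T: std::str::FromStr>(raw: Option<&str>) -> Option<T> {
    raw.and_then(|s| s.trim().parse::<T>().ok())
}
```

A caller whose default depends on other settings could then write something like `parse_optional::<usize>(env_value).unwrap_or(derived_default)`.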
2025-09-16 12:43:33 -07:00
Hoyt Koepke
a715926cc7 Rename Threadpool class name to XetRuntime to reflect usage (#491)
The Threadpool class does quite a bit more than just manage a
threadpool; this PR simply changes the name to reflect this usage.
2025-09-15 11:30:28 -07:00
Assaf Vayner
22f86db343 Adding README to few crates for documentation (#492)
Added a README to a few crates so that we can link to the crate
directory when referring to an individual crate and have something
rendered as the reference implementation when linking to a specific file
doesn't make sense.
2025-09-15 11:15:05 -07:00
Assaf Vayner
81b0833965 hf_xet 1.1.10 (#490)
update version of hf_xet
v1.1.10-rc0 v1.1.10
2025-09-11 14:57:19 -07:00
Rajat Arya
c762c681ef Diagnostic Scripts + README changes (#489)
- Adds diagnostic scripts to root of repo and references them in README.
- Also reorganizes README to make diagnostics & debugging more visible.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-11 14:54:22 -07:00
Hoyt Koepke
fe71e3dd54 Updated chunker to eliminate spurious boundary triggering. (#487)
We must enforce that the next boundary is actually past the minimum
chunk size. In rare cases, a boundary could be triggered before the
minimum chunk size has passed, and this triggering would be based not on
the content of the file but on the previous state of the rolling hash
function. This PR fixes this case.
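
The idea can be sketched with a toy content-defined chunker. The hash below is a stand-in and not the chunker's real rolling hash; the function name, parameters, and the choice to reset the hash at each chunk start are assumptions. The point is only that a hash-triggered boundary is accepted solely once the current chunk has reached the minimum size.

```rust
/// Return the end offsets of the chunks found in `data`.
/// Assumes `min_size <= max_size`. Toy hash, illustrative only.
fn chunk_boundaries(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let mut start = 0usize;
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // Toy rolling update: shift out old state, mix in the new byte.
        hash = (hash << 1).wrapping_add((byte as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15));
        let len = i + 1 - start;
        // A hash hit only counts once the chunk is past the minimum size.
        let hash_hit = len >= min_size && (hash & mask) == 0;
        if hash_hit || len >= max_size {
            boundaries.push(i + 1);
            start = i + 1;
            // Reset so state from the previous chunk cannot trigger
            // the next boundary.
            hash = 0;
        }
    }
    if start < data.len() {
        boundaries.push(data.len()); // final, possibly short, chunk
    }
    boundaries
}
```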
2025-09-10 16:52:48 -07:00
Joseph Godlewski
f24c97eedb Adding retry for unhandled io errors when sending requests (#468)
Allows us to retry errors when we receive I/O errors when sending
requests. For example, on macOS, when we see:
```
No buffer space available (os error 55)
```
when downloading from S3, we should wait and retry once the system has
more network resources.
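
A simplified, synchronous sketch of the wait-and-retry pattern is shown below. The real client is asynchronous and uses its own retry utility; the function name, parameters, and exponential backoff policy here are assumptions.

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

/// Run `op`, retrying on I/O errors up to `max_attempts` times in total,
/// doubling the wait between attempts (hypothetical policy).
fn retry_io<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(_) if attempt < max_attempts => {
                // Give the system time to free network resources.
                let factor = 2u32.saturating_pow(attempt - 1);
                sleep(base_delay.saturating_mul(factor));
                attempt += 1;
            }
            // Out of attempts: surface the last error to the caller.
            Err(err) => return Err(err),
        }
    }
}
```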
2025-09-10 16:44:05 -07:00
Hoyt Koepke
263020646f Fix wheel upload for linux for dev/alpha/beta tags (#379)
When using beta tags, the upload process doesn't have the wheels in the
correct directory. This fixes that.
2025-09-09 10:29:27 -07:00
Di Xiao
e2f7861809 Drop "GaiResolverWithAbsolute" (#486)
Removes this custom DNS resolver that forces domain names to be
absolute. Using relative domain names is a valid practice in Kubernetes
clusters to configure proxy servers, e.g.
https://github.com/huggingface/huggingface_hub/issues/3323

The original [PR](https://github.com/huggingface/xet-core/pull/383) that
added this resolver was trying to resolve
https://github.com/huggingface/huggingface_hub/issues/3155, which later
appeared to be a local misconfiguration on the user's side.
2025-09-09 10:26:56 -07:00
Di Xiao
0e1f9f4cf0 Git-Xet: LFS custom transfer agent with Xet protocol (#425)
This PR builds a Git integration called `git-xet` that enables users to
upload files using the Xet protocol as part of a standard git push.

This integration builds on the Git LFS custom transfer adapter protocol,
the same mechanism we now use to handle Git LFS uploads for files larger
than 5 GB through multipart PUT.
To enable uploads to Xet, users run `git-xet install`, which writes the
following configuration to the Git config file at a selected scope
[`--system`, `--global` (default), or `--local`]:
```
[lfs "customtransfer.xet"]
	path = git-xet
	args = transfer
	concurrent = true
```
This setup registers a new transfer adapter named xet, allowing Git to
delegate LFS file transfers to the git-xet binary when applicable.

On the server side, support is rolled out in two stages:

Stage 1 (Upload): The Git LFS batch API for the "upload" operation is
updated.

- If a repo is Xet enabled but users didn't run git-xet install,
moon-landing rejects the request when users initiate git push and
returns an instruction to install git-xet.

- If a repo is Xet enabled and users have git-xet configured correctly,
moon-landing accepts the request and replies with the CAS server URL and
an access token, which git-xet will use to upload files to Xet.

- If a repo is NOT Xet enabled, upload goes through the LFS path.
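
The three Stage 1 cases amount to a small decision table, sketched here for illustration. The type and function names are hypothetical, this is not the moon-landing server code, and how the server detects a configured git-xet client is left as a boolean input.

```rust
/// Possible outcomes for an LFS batch "upload" request (hypothetical names).
#[derive(Debug, PartialEq)]
enum UploadRoute {
    /// Reply with the CAS server URL and access token.
    Xet,
    /// Reject and tell the user to install git-xet.
    RejectWithInstallHint,
    /// Regular LFS upload path.
    Lfs,
}

/// Pick the route from the two conditions described above.
fn route_upload(repo_xet_enabled: bool, client_has_git_xet: bool) -> UploadRoute {
    match (repo_xet_enabled, client_has_git_xet) {
        (true, true) => UploadRoute::Xet,
        (true, false) => UploadRoute::RejectWithInstallHint,
        (false, _) => UploadRoute::Lfs,
    }
}
```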
2025-09-08 16:08:50 -07:00
Assaf Vayner
e01896e074 use u64 rather than usize in file hashing paths (#485)
Using the file hashing components in WASM exposed a bug: a 32-bit
usize causes errors when hashing the file.

This PR enforces the use of u64 everywhere along that path (and also
pins the wasm-bindgen version)
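
One way a 32-bit usize can go wrong is sketched below. The commit does not say exactly how the errors arose, so this only illustrates the general hazard: byte counts above `u32::MAX` do not fit in the usize of a 32-bit target such as wasm32. The function names are hypothetical.

```rust
/// Sum chunk lengths into a total file size, using u64 on every target.
fn total_len(chunk_lens: &[u64]) -> u64 {
    chunk_lens.iter().sum()
}

/// Would this length fit in the usize of a 32-bit target?
/// (u32 stands in for a 32-bit usize here.)
fn fits_in_32_bit_usize(len: u64) -> bool {
    u32::try_from(len).is_ok()
}
```

Two 3 GB chunks give a 6 GB total, which `u64` holds without trouble but which a 32-bit `usize` cannot represent.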
2025-09-08 14:27:58 -07:00
Hoyt Koepke
4d948d1a76 Rename xet_threadpool to xet_runtime to reflect usage (#484)
The xet_threadpool subdirectory is increasingly the place for utilities
related to the runtime, managing file handle limits, etc. This PR simply
renames this directory to reflect this switch.
2025-09-08 13:32:48 -07:00
Assaf Vayner
6203653ecf update api paths to use plural nouns (#482)
Updates paths used by the clients to use latest CAS paths as defined in
the spec.

All paths now use plural nouns, and shard upload no longer uses the hash;
this PR also removes the prefix and hash from the client trait's
upload_shard function.
2025-09-08 13:02:49 -07:00
Eliott C.
3ff4eb2d56 Thin wasm: do not automatically set is_dedup to true for first chunk (#481)
Related to https://github.com/huggingface/huggingface.js/pull/1718

We'll want to edit parts of a file while loading old data's dedup info.

In those cases we don't always want to load dedup info for the first
chunk (since it may not be at the beginning of the file).

So the is_dedup = true for the first chunk is handled client side.
2025-09-06 09:31:58 +02:00
Rajat Arya
cc247a9d5a Add input params to Run name in GH Workflow UI (#478) 2025-08-29 09:21:24 -07:00
Rajat Arya
7f53907434 Bumping version to 1.1.9 (#476) v1.1.9-rc1 v1.1.9 1.1.9-rc1 2025-08-27 15:35:45 -07:00
Rajat Arya
003b154284 Update hf_xet/README.md for hf_xet project (#475)
- wrote hf_xet/README.md about hf_xet
- verified sdist build is successful
- moved docs from hf_xet/README.md to xet-core/README.md
2025-08-27 15:27:56 -07:00
Rajat Arya
50ced6cb65 Update PyPI package metadata for hf-xet (#472)
Fixes #465. Adapts #464.

Verified locally using `pip-licenses` and manual inspection of metadata.

Once merged will verify with PyPI through RC build.

@jsulz, @hoytak, @seanses, @assafvayner: Let me know if you want a
different email address as a maintainer - these tie into your PyPI user
profile.

---------

Co-authored-by: Jared Sulzdorf <j.sulzdorf@gmail.com>
2025-08-26 11:15:52 -07:00
Di Xiao
740887a453 CI test on macos (#473)
We test on Ubuntu and Windows, so it seems reasonable to test on macOS
too. This also gets CI prepared for git-xet tests.
2025-08-26 10:46:34 -07:00
Erik Cederstrand
9c20ec6f43 Use a valid SPDX identifier as license classifier (#464)
This helps automatic license checkers like pip-licenses to identify the
right license for this project
2025-08-25 10:14:16 -07:00
Assaf Vayner
3865e945d1 run_and_extract_custom: remove use of explicit tokio_retry without utility (#460)
We had one case of using "raw" tokio_retry rather than the retry utility.
This was due to using special custom parsing logic for chunks rather
than the built-in JSON functionality. This PR adds a
run_and_extract_custom that lets a user specify the function to parse
the response body.
v1.1.9-rc0 v0.1.9-rc0
2025-08-21 10:19:36 -07:00
Hoyt Koepke
df1145f9ad Raise soft file handle limits to hard limits on OSX. (#453)
On OSX, raise the soft file handle limits to the hard limits (which
cannot be changed in the process).

---------

Co-authored-by: Assaf Vayner <assaf@huggingface.co>
2025-08-19 13:24:41 -07:00
Assaf Vayner
6beab3b197 enforce linting on hf_xet (#462)
This PR adds an explicit lint command on the hf_xet directory. This is
necessary because it is excluded from the workspace. Other excluded
directories aren't touched very often and are less important for now.
2025-08-18 16:55:05 -07:00
Assaf Vayner
1578af406c tokio console setup (#458)
adds feature to enable deps for use with tokio console and documents how
to compile and use along with tokio console
2025-08-18 15:46:56 -07:00
Assaf Vayner
39b85696b9 parutils makeover remove async_scoped (#454)
removing async_scoped dep and creating new parallel util to replace
tokio_par_for_each that relied on async_scoped.

Every usage of tokio_par_for_each, the only fn used out of parutils, has
been replaced with the new
`tokio_run_max_concurrency_fold_result_with_semaphore`

TODO:
- [x] add more tests
- [x] use semaphore acquired from the global semaphore provider where/if
relevant.
2025-08-18 15:35:43 -07:00
Assaf Vayner
48be7b08ab update version (#461) v1.1.8 2025-08-18 14:35:23 -07:00
Hoyt Koepke
e484a6b150 Limit number of idle connections (#459)
Limit the number of idle connections maintained in the reqwest
connection pool.

---------

Co-authored-by: Assaf Vayner <assaf@huggingface.co>
v1.1.8-rc1
2025-08-18 13:39:43 -07:00
Di Xiao
04c1e30079 Cache and reuse reqwest Client (#457)
This PR caches the reqwest::Client in a runtime if it exists and re-uses
it in all wrapped clients in the `RemoteClient` object. This effectively
shares the connection pool and thus reduces the number of sockets opened.

Fix XET-704
2025-08-15 16:39:06 -07:00
Di Xiao
f4800863d3 Clean up dependencies (no functionality change) (#456)
Update the entries in cas_client/cargo.toml that specify dependencies
from crates.io so that they inherit from the workspace. Also sort all
dependency entries.
2025-08-15 11:06:54 -07:00
Di Xiao
4870043d87 Fix DataHash hex string serde to little endian (#445)
Different programming languages and platforms may lay out the bytes of a
u64 in memory in different orders. This is not an issue when we typecast
a [u8;32] to [u64;4] (which is essentially a pointer typecast and thus
doesn't reorder bytes at all), until these bytes are actually used as
u64s. For such cases we explicitly use little-endian order.
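
A minimal sketch of making the byte order explicit follows. The function names are hypothetical, and rendering each u64 as 16 hex digits is an assumption about the hex format rather than something stated in this commit.

```rust
/// Interpret 32 hash bytes as four u64 words, always little-endian,
/// regardless of the host platform's native byte order.
fn bytes_to_u64s_le(bytes: &[u8; 32]) -> [u64; 4] {
    let mut out = [0u64; 4];
    for (i, chunk) in bytes.chunks_exact(8).enumerate() {
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        out[i] = u64::from_le_bytes(word);
    }
    out
}

/// Render the four words as a 64-character hex string
/// (assumed format: 16 zero-padded hex digits per word).
fn u64s_to_hex(words: &[u64; 4]) -> String {
    words.iter().map(|w| format!("{w:016x}")).collect()
}
```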
2025-08-14 15:05:40 -07:00
Di Xiao
36f1138a6f Add back retry for connection setup and sending request (#455)
This PR fixes a regression in which part of the retry logic for
downloading was accidentally removed. The restored retry logic
complements the retry for deserializing the data stream of responses.
2025-08-14 15:01:09 -07:00
Joseph Godlewski
6553ade9cb fix: singleflight owner task not removing Call from Group if dropped (#447)
When a singleflight `Group` is called by many tasks for a particular
key, one of these tasks is chosen as the `owner` to actually perform the
work. The other tasks are considered `waiters` and will wait until the
`owner` completes the work.

In the normal case, the `owner` runs the work, takes the result and
provides it to any `waiter` tasks. It is also the responsibility of the
`owner` to clean up the state it added to the singleflight `Group`,
namely, the `Call` record in the `Group`'s `CallMap`.

However, if the `owner` is `drop`'ed while waiting for the work to
complete (e.g. by its task being canceled), then the work will still
finish and any waiter tasks will be notified (as those actions are part
of a separately spawned task), but the state the owner added to the
`Group` is not cleaned up (i.e. there is still a lingering mapping of
key -> `Call`).

What this means is that if the `owner` is dropped before it can remove
the mapping, then the results of the work are permanently "cached" in
the singleflight `Group` and any subsequent tasks looking for the key
will always get back those results until the `Group` is dropped (usually
on process restart).

Since much of the work we use singleflight for is downloading immutable
blobs of data, "caching" the `Call` results doesn't sound that terrible
(we already cache some stuff outside of singleflight). However, this is
caching the `Result`, not just the `Ok(…)` variant, so if the work
returned an `Error`, then that becomes what is permanently cached,
which, for intermittent networking issues can cause problems.

This PR fixes the issue by adding a RAII guard struct when the `Call` is
added to the `CallMap` so that the `work()` function will remove the
`Call` when the owner exits the function (either successfully, due to
error/panic, or if it is dropped).

See the `test_dropped_owner` test for an example of how this situation
can occur.
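
The RAII guard pattern can be sketched in isolation as follows. This is not the singleflight code: `CallMap` is reduced to a map of plain integers and the guard's name and constructor are hypothetical. It only shows that cleanup placed in `Drop` runs on every exit path.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the Group's CallMap (key -> Call).
type CallMap = Arc<Mutex<HashMap<String, u32>>>;

/// Guard that owns the responsibility of removing one key from the map.
struct CallGuard {
    map: CallMap,
    key: String,
}

impl CallGuard {
    /// Insert the call record and return a guard that will remove it.
    fn register(map: &CallMap, key: &str, call: u32) -> CallGuard {
        map.lock().unwrap().insert(key.to_string(), call);
        CallGuard { map: Arc::clone(map), key: key.to_string() }
    }
}

impl Drop for CallGuard {
    fn drop(&mut self) {
        // Runs on normal return, early return, panic unwinding, and when
        // a future holding the guard is dropped before completing.
        if let Ok(mut calls) = self.map.lock() {
            calls.remove(&self.key);
        }
    }
}
```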
2025-08-11 17:11:15 -07:00
Hoyt Koepke
9bbc0c65ab Updated version to v1.1.7 (#443) v1.1.7 2025-08-05 17:25:31 -07:00
Hoyt Koepke
3260e0852c Changed default number of parallel downloads from 64 to 48. (#442)
Some more testing found that 64 parallel range gets can still possibly
exhaust the existing file handles on OSX; this PR addresses the issue
without (hopefully) impacting the download performance.
2025-08-05 17:24:31 -07:00
Hoyt Koepke
0c810fa3d0 Remove telemetry code; eliminate Mutex on logging setup. (#441)
This PR removes the unused telemetry code from hf_xet.  

In addition, it also removes the Mutex around the logging setup, which
appears to cause an intermittent hang when os.fork() gets involved.
v1.1.7-rc0
2025-08-05 16:41:01 -07:00
Hoyt Koepke
7becae3fde Updated version to 1.1.6. (#440) v1.1.6 2025-08-05 15:40:15 -07:00
Hoyt Koepke
d575b593d4 Make hf_xet safe(ish) across python os.fork() (#437)
This PR ensures that none of the tokio thread state persists across a
call to python's os.fork() as used in the multiprocessing library. For
an explanation of the issue, see
https://github.com/vllm-project/vllm/blob/main/docs/design/multiprocessing.md#tradeoffs.

It does this by offloading all the async calls to a separate and
transient OS thread, which would not exist after the spawn process. Thus
any possible restart of the tokio runtime due to a spawn would occur in
a clean environment and without thread-local storage causing issues.

To accomplish this, this PR refactors the hf_xet logging layer to
separate it out from the python runtime, as the python runtime is not
Send/Sync. This also simplifies this layer somewhat and isolates the
telemetry reporting logic so that only the background sending thread of
the telemetry logic is restarted after a spawn.

In addition, this PR removes the use of parking_lot, both in
singleflight.rs and as part of tokio. The library is not safe across
fork(); in particular, note
9c810e4a11/core/src/parking_lot.rs (L51).
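
The offloading idea, reduced to its simplest form, looks like the sketch below. The helper name is hypothetical and the real change drives the tokio runtime from the transient thread; this only shows work running on a short-lived OS thread that is gone by the time the caller continues.

```rust
use std::thread;

/// Run `work` on a fresh OS thread and wait for it to finish.
/// The thread no longer exists once this function returns, so it cannot
/// carry thread-local state into a later fork of the calling thread.
fn run_on_transient_thread<T, F>(work: F) -> thread::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(work).join()
}
```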
2025-08-05 15:23:48 -07:00
Assaf Vayner
2645b96eb5 only debug log 416 on get recon (#439) 2025-08-05 11:56:35 -07:00
Assaf Vayner
fdfff55726 halve num concurrent range gets (#438) 2025-08-05 11:44:56 -07:00
Eliott C.
663c3a7c7d Remove logging from wasm lib (#434) 2025-08-01 19:10:32 +02:00
Hoyt Koepke
a2562c9476 Associate static semaphores with runtime (#433)
If a parent process spawns a child process while permits from a static
semaphore are issued, the number of permits available to the child
process will be reduced for the entirety of the child's lifetime, even
when the parent process permits are returned. This could potentially
cause a deadlock or painful slowdown on upload or download. This PR
moves all our static semaphores to ones associated with the runtime, so
after a spawn they are reset.
2025-07-31 11:03:14 -07:00
Hoyt Koepke
1f80b9ec7b Respect XDG_CACHE_HOME and ~/ when setting cache directory. (#426)
Currently, we default the cache directory to `home_dir()/.cache`, but as
pointed out in https://github.com/huggingface/xet-core/issues/417, this
is the incorrect behavior. This PR switches this behavior to use the
[cache_dir()](https://docs.rs/dirs/latest/dirs/fn.cache_dir.html)
function to properly determine this.

The side effect of this, however, is that on OSX the cache dir will
change to the more standard `$HOME/Library/Caches/huggingface/xet`,
which means nothing in `~/.cache/huggingface/xet` will be valid anymore.

Fixes https://github.com/huggingface/xet-core/issues/417.
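
For Linux-style systems, the lookup order can be sketched without the `dirs` crate as below. This is a simplified illustration, not what the PR ships: the function name is hypothetical, the environment value and home directory are passed in as arguments, and macOS and Windows conventions are not modeled.

```rust
use std::path::{Path, PathBuf};

/// Resolve the xet cache directory the way an XDG-aware lookup would:
/// use XDG_CACHE_HOME when it holds an absolute path, else `~/.cache`.
fn xet_cache_dir(xdg_cache_home: Option<&str>, home: &Path) -> PathBuf {
    let base = match xdg_cache_home {
        // The XDG spec treats relative paths as invalid; ignore them.
        Some(dir) if Path::new(dir).is_absolute() => PathBuf::from(dir),
        _ => home.join(".cache"),
    };
    base.join("huggingface").join("xet")
}
```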
2025-07-30 19:13:26 -07:00
Hoyt Koepke
90036fbbc2 Limit number of async worker threads on large CPUs (#431)
Currently, tokio spins up async worker threads equal to the number of
cores, which can be quite large on huge machines, e.g. 128. This isn't
needed to keep everything running; we already offload much of the
compute to blocking threads and so here the number is a significant
overkill, especially if hf_xet is used for downloading only a few files.

This PR limits the number of async worker threads to 32 unless
TOKIO_WORKER_THREADS is set, in which case that value is used. It also
removes the cap on the number of blocking threads tokio can spin up as
needed as there is no real reason to not use tokio's default value
there.
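
The selection rule can be sketched as a pure function. The function and constant names are hypothetical, and ignoring an unparsable or zero override is an assumption; the cap of 32 and the TOKIO_WORKER_THREADS override come from the description above.

```rust
/// Cap on async worker threads when no override is given.
const MAX_ASYNC_WORKER_THREADS: usize = 32;

/// Choose the worker thread count: an explicit override wins,
/// otherwise use the core count capped at 32.
fn num_worker_threads(env_override: Option<&str>, available_cores: usize) -> usize {
    let parsed = env_override
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0);
    if let Some(n) = parsed {
        return n;
    }
    available_cores.clamp(1, MAX_ASYNC_WORKER_THREADS)
}
```

A caller might pass `std::env::var("TOKIO_WORKER_THREADS").ok().as_deref()` and the value of `std::thread::available_parallelism()`.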

---------

Co-authored-by: Di Xiao <di@huggingface.co>
2025-07-30 18:12:56 -07:00
Assaf Vayner
725973bccc revert use of v1 api paths (#432)
Avoiding using the v1 api paths in the next release until the spec is
fixed.

reverting to the old API paths.
2025-07-30 15:07:36 -07:00