328 Commits

Author SHA1 Message Date
Hoyt Koepke
0c810fa3d0 Remove telemetry code; eliminate Mutex on logging setup. (#441)
This PR removes the unused telemetry code from hf_xet.  

In addition, it also removes the Mutex around the logging setup, which
appears to cause an intermittent hang when os.fork() gets involved.
v1.1.7-rc0
2025-08-05 16:41:01 -07:00
Hoyt Koepke
7becae3fde Updated version to 1.1.6. (#440) v1.1.6 2025-08-05 15:40:15 -07:00
Hoyt Koepke
d575b593d4 Make hf_xet safe(ish) across python os.fork() (#437)
This PR ensures that none of the tokio thread state exists through a
call to python's os.fork() as used in the multiprocessing library. For
an explanation of the issue, see
https://github.com/vllm-project/vllm/blob/main/docs/design/multiprocessing.md#tradeoffs.

It does this by offloading all the async calls to a separate and
transient OS thread, which would not exist after the spawn process. Thus
any possible restart of the tokio runtime due to a spawn would occur in
a clean environment and without thread-local storage causing issues.
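
The offloading idea can be sketched in Python as follows. This is only an illustration of the pattern, not the hf_xet code (which is Rust on tokio); `run_in_transient_thread` and `fetch_size` are hypothetical names, and `asyncio` stands in for the tokio runtime.

```python
import asyncio
import threading


def run_in_transient_thread(coro_factory):
    """Run an async call on a short-lived OS thread and return its result."""
    outcome = {}

    def worker():
        try:
            # asyncio.run creates a fresh event loop and closes it on exit,
            # so no loop state lives on the calling thread.
            outcome["value"] = asyncio.run(coro_factory())
        except BaseException as exc:  # hand failures back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, name="transient-async")
    thread.start()
    thread.join()  # the worker thread is gone before control returns
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def fetch_size(n):
    # Stand-in for an async upload or download call.
    await asyncio.sleep(0)
    return n * 2


result = run_in_transient_thread(lambda: fetch_size(21))
```

Because the worker thread is joined before the call returns, a later `os.fork()` on the calling thread finds no leftover async worker thread from this call.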

To accomplish this, this PR refactors the hf_xet logging layer to
separate it out from the python runtime, as the python runtime is not
Send/Sync. This also simplifies this layer somewhat and isolates the
telemetry reporting logic so that only the background sending thread of
the telemetry logic is restarted after a spawn.

In addition, this PR removes the use of parking_lot, both in
singleflight.rs and as part of tokio. The library is not safe across
fork(); in particular, note
9c810e4a11/core/src/parking_lot.rs (L51).
2025-08-05 15:23:48 -07:00
Assaf Vayner
2645b96eb5 only debug log 416 on get recon (#439) 2025-08-05 11:56:35 -07:00
Assaf Vayner
fdfff55726 halve num concurrent range gets (#438) 2025-08-05 11:44:56 -07:00
Eliott C.
663c3a7c7d Remove logging from wasm lib (#434) 2025-08-01 19:10:32 +02:00
Hoyt Koepke
a2562c9476 Associate static semaphores with runtime (#433)
If a parent process spawns a child process while permits from a static
semaphore are issued, the number of permits available to the child
process will be reduced for the entirety of the child's lifetime, even
when the parent process permits are returned. This could potentially
cause a deadlock or painful slowdown on upload or download. This PR
moves all our static semaphores to ones associated with the runtime, so
after a spawn they are reset.
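
A toy Python simulation of the problem and the fix, under stated assumptions: `PermitPool`, `Runtime`, and `simulate_fork` are hypothetical names, and `copy.deepcopy` stands in for the memory snapshot a forked child receives. It is not the hf_xet code.

```python
import copy


class PermitPool:
    """Toy counting semaphore; only tracks how many permits are free."""

    def __init__(self, permits):
        self.available = permits

    def acquire(self):
        if self.available == 0:
            raise RuntimeError("no permits left")
        self.available -= 1

    def release(self):
        self.available += 1


class Runtime:
    """Hypothetical runtime object that owns its own permit pool."""

    def __init__(self, max_permits):
        self.upload_permits = PermitPool(max_permits)


def simulate_fork(obj):
    # A forked child gets a snapshot of the parent's memory at fork time.
    return copy.deepcopy(obj)


# Static pool: the parent holds 3 of 4 permits at the moment of the fork.
static_pool = PermitPool(4)
for _ in range(3):
    static_pool.acquire()
child_view = simulate_fork(static_pool)
for _ in range(3):
    static_pool.release()  # the parent gets its permits back...
# ...but the child's copy never sees them returned.

# Runtime-owned pool: the child builds a new Runtime after the fork,
# so it starts with the full permit count.
child_runtime = Runtime(max_permits=4)
```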
2025-07-31 11:03:14 -07:00
Hoyt Koepke
1f80b9ec7b Respect XDG_CACHE_HOME and ~/ when setting cache directory. (#426)
Currently, we default the cache directory to `home_dir()/.cache`, but as
pointed out in https://github.com/huggingface/xet-core/issues/417, this
is the incorrect behavior. This PR switches this behavior to use the
[cache_dir()](https://docs.rs/dirs/latest/dirs/fn.cache_dir.html)
function to properly determine this.

The side effect of this, however, is that on OSX the cache dir will
change to the more standard `$HOME/Library/Caches/huggingface/xet`,
which means nothing in `~/.cache/huggingface/xet` will be valid anymore.
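
A rough Python sketch of the resolution order, as an approximation of what `cache_dir()` does on Linux and macOS (Windows is omitted). The function names are hypothetical, and the `huggingface/xet` suffix is taken from the paths quoted above.

```python
from pathlib import PurePosixPath


def default_cache_root(platform, env, home):
    """Approximate the platform cache root for Linux and macOS."""
    home = PurePosixPath(home)
    if platform == "darwin":
        return home / "Library" / "Caches"
    # On Linux, XDG_CACHE_HOME wins when it holds an absolute path.
    xdg = env.get("XDG_CACHE_HOME", "")
    if xdg.startswith("/"):
        return PurePosixPath(xdg)
    return home / ".cache"


def xet_cache_dir(platform, env, home):
    return default_cache_root(platform, env, home) / "huggingface" / "xet"


linux_default = str(xet_cache_dir("linux", {}, "/home/u"))
linux_xdg = str(xet_cache_dir("linux", {"XDG_CACHE_HOME": "/tmp/c"}, "/home/u"))
mac_default = str(xet_cache_dir("darwin", {}, "/Users/u"))
```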

Fixes https://github.com/huggingface/xet-core/issues/417.
2025-07-30 19:13:26 -07:00
Hoyt Koepke
90036fbbc2 Limit number of async worker threads on large CPUs (#431)
Currently, tokio spins up async worker threads equal to the number of
cores, which can be quite large on huge machines, e.g. 128. This isn't
needed to keep everything running; we already offload much of the
compute to blocking threads and so here the number is a significant
overkill, especially if hf_xet is used for downloading only a few files.

This PR limits the number of async worker threads to 32 unless
TOKIO_WORKER_THREADS is set, in which case that value is used. It also
removes the cap on the number of blocking threads tokio can spin up as
needed as there is no real reason to not use tokio's default value
there.
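
The selection rule can be written down in a few lines of Python. This is a sketch of the rule as described, not the Rust code; `async_worker_threads` is a hypothetical name, and how an invalid TOKIO_WORKER_THREADS value is handled is not covered here.

```python
def async_worker_threads(cpu_count, env, cap=32):
    """Pick the async worker thread count: explicit override, else capped core count."""
    override = env.get("TOKIO_WORKER_THREADS")
    if override is not None:
        return int(override)  # an explicit setting is used as given
    return min(cpu_count, cap)


big_machine = async_worker_threads(128, {})
laptop = async_worker_threads(8, {})
overridden = async_worker_threads(128, {"TOKIO_WORKER_THREADS": "64"})
```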

---------

Co-authored-by: Di Xiao <di@huggingface.co>
2025-07-30 18:12:56 -07:00
Assaf Vayner
725973bccc revert use of v1 api paths (#432)
Avoiding using the v1 api paths in the next release until the spec is
fixed.

reverting to the old API paths.
2025-07-30 15:07:36 -07:00
Hoyt Koepke
b86692fc2e Make hf_xet fork-exec safe (#429)
Currently, some environments (e.g. vllm) use spawn or fork-exec to
create a child process. However, this is known to cause issues within
the tokio runtime and lead to hangs, as only the calling thread survives
across spawns but the runtime assumes these threads exist and are
accessible. This PR detects when a fork-exec has occurred and silently
discards the old runtime, creating a new one and reinstalling the signal
handlers and restarting the background threads.
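
One way to detect that the process has changed is to remember the process id the runtime was created under. The Python sketch below assumes that mechanism for illustration only; the PR does not state how detection is done, and `Runtime` and `get_runtime` are hypothetical names.

```python
import os


class Runtime:
    """Stand-in for a runtime with background threads and signal handlers."""

    def __init__(self, pid):
        self.pid = pid
        self.background_started = True


_current = None


def get_runtime(pid=None):
    """Return the runtime for this process, replacing one inherited from a parent."""
    global _current
    pid = os.getpid() if pid is None else pid
    if _current is None or _current.pid != pid:
        # The old runtime's threads do not exist in this process, so it is
        # dropped without any attempt to shut it down cleanly.
        _current = Runtime(pid)
    return _current


first = get_runtime(100)
same = get_runtime(100)
after_fork = get_runtime(101)
```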

Possible fix for https://github.com/huggingface/xet-core/issues/415;
also issue in https://github.com/vllm-project/vllm/pull/21539.
2025-07-30 13:37:32 -07:00
Eliott C.
7087d68aaf Export hmac function in thin wasm (#427)
For global dedup, since we download shards with an hmac key, client-side
we need to hmac local hashes to compare them to those in the shards
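
A small Python sketch of the comparison step. HMAC-SHA256 from the standard library is used purely as a stand-in, since the keyed hash actually used is not stated here; `keyed_hash` and `find_known_chunks` are hypothetical names.

```python
import hashlib
import hmac


def keyed_hash(key, chunk_hash):
    # Stand-in keyed hash; the real function may differ.
    return hmac.new(key, chunk_hash, hashlib.sha256).digest()


def find_known_chunks(local_hashes, shard_keyed_hashes, key):
    """Return the local chunk hashes whose keyed form appears in the shard."""
    return [h for h in local_hashes if keyed_hash(key, h) in shard_keyed_hashes]


key = b"k" * 32
local = [b"chunk-hash-1", b"chunk-hash-2"]
shard_entries = {keyed_hash(key, b"chunk-hash-1")}
matches = find_known_chunks(local, shard_entries, key)
```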
2025-07-30 18:53:31 +02:00
Hoyt Koepke
803e6b7bcf Logging improvements (#428)
Added logging improvements to hf_xet: 
- When HF_XET_LOG_FILE is set to a valid file path, then events are
logged to that file in a non-blocking manner.
- When HF_XET_LOG_FORMAT is set to either "json" or "text", it overrides
the default log format. By default, logs written to a file use json, and
logs written to the console use pretty printing.
- Added in debug log statements in the upload and download paths to help
trace activity.
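
The precedence between the two variables can be sketched in Python as below. `resolve_log_config` is a hypothetical helper, not part of hf_xet, and it assumes that the console's pretty printing corresponds to the "text" format and that unrecognized format values fall back to the default.

```python
def resolve_log_config(env):
    """Work out where logs go and in which format, from environment variables."""
    log_file = env.get("HF_XET_LOG_FILE") or None
    # Default format depends on the destination.
    default_format = "json" if log_file else "text"
    requested = env.get("HF_XET_LOG_FORMAT")
    log_format = requested if requested in ("json", "text") else default_format
    return {"destination": log_file or "console", "format": log_format}


console_default = resolve_log_config({})
file_default = resolve_log_config({"HF_XET_LOG_FILE": "/tmp/xet.log"})
file_as_text = resolve_log_config(
    {"HF_XET_LOG_FILE": "/tmp/xet.log", "HF_XET_LOG_FORMAT": "text"}
)
```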
2025-07-29 14:44:47 -07:00
Eliott C.
e78df8ccda add whether chunk should be checked against global dedup (#423)
https://huggingface.slack.com/archives/C07KCK52LBY/p1753369740405649

It's not trivial in JS - need to either create a dataview on the
Uint8Arr (we have string) or make little endian conversion manually on
the specific two bytes involved from the hash

At the moment it's simpler to output this information from the WASM

---------

Co-authored-by: Assaf Vayner <assafvayner@gmail.com>
2025-07-28 14:35:37 -07:00
Hoyt Koepke
d0bb17c44c Errors on shard reading are now logged and ignored. (#424)
Reading the lookup hashes from a shard could cause an error to be
propagated when it should simply be ignored; this PR logs the error but
ignores it.
2025-07-24 16:02:41 -07:00
Assaf Vayner
5e6284b628 remove footer from upload shard payload (#419)
Since the upload shard API hardening, the server no longer uses the shard
footer. This PR truncates the footer from the body of the upload shard
API call when sending to the CAS server.
2025-07-24 13:41:45 -07:00
Assaf Vayner
84ce436aa5 set shard size limit as max, not target min (#420)
Enforce that shards are cut at <= target_shard_max_size (64 MiB)

Previously we were enforcing that shards are cut after exceeding this
limit, which in theory allowed shards < 128 MiB. After this PR, all
shards will be <= 64 MiB.
2025-07-24 13:41:36 -07:00
Assaf Vayner
66edd71e8e use v1 api paths (#421)
Using API paths as specified in new API updates with `/v1` prefix and no
shard hash.


https://docs.google.com/document/d/14CkiKiwX_y6m4oboh9rUrRCvVZtNqUTZbGzK8M5fh3o/edit?usp=sharing

- removes use of salts all around
2025-07-23 13:57:38 -07:00
Assaf Vayner
85869f4fe8 add verification hash and file hash functions (#416)
Adding more functions to the thin wasm export, needed for forming shards
in particular

CC @coyotte508

After the first draft, binary size is 97K for me
2025-07-17 11:03:07 -07:00
Hoyt Koepke
5afd1eeef1 Move MDB v1 to reference test code; add standalone hash functions (#414)
This PR removes the old merkledb/ code, extracting the remaining
functions still used for calculating the aggregate hash functions and
moves them to merklehash. This provides a significant gain in code
clarity while also providing a speedup to hash computation as the entire
merkle tree is not built to compute the hash. Existing tests ensure
correctness.
2025-07-16 09:46:15 -07:00
Assaf Vayner
b2fc01d479 thin wasm (#411)
Re-adding a thin wasm crate for JS client development.

checks build in ci job build_and_test-wasm

only includes a wrapper over a chunker and function to compute xorb hash
at the moment.
2025-07-15 10:11:38 -07:00
Hoyt Koepke
9439193f17 Enabling proxy support for reqwest (#413)
Adds the system-proxy feature for reqwest in order to handle proxies
specified on the system. This required a minor version upgrade to
support. Hopefully this is a fix for
https://github.com/huggingface/xet-core/issues/400.
2025-07-14 15:04:24 -07:00
Hoyt Koepke
5f13f91d1c Add correctness tests for aggregate hash functions. (#412)
This PR adds reference correctness tests for the cas and file node
aggregate test functions, including corner cases. It hardcodes the
values in a way that allows other implementations to test correctness and
guards against future changes altering the hash values.
2025-07-14 13:32:40 -07:00
Hoyt Koepke
225f4b0e9b Simplified Client interface. (#408)
Currently, the Client trait has numerous small traits underneath it, but
only the umbrella Client trait is ever used. This PR simplifies this
interface by dropping the unneeded fine-grained traits.

It also consolidates the multiple routes for the global dedup query into
a single function that returns an in-memory shard to further reduce the
complexity.

There should be no functionality change, just code moving around and
trait simplification.
2025-07-10 11:13:04 -07:00
Jared Sulzdorf
8fbe9684fc Updating chunk and shard cache default sizes (#406)
Updating the chunk and shard cache default size to use powers of 10
instead of powers of 2. See
https://github.com/huggingface/huggingface_hub/pull/3190#discussion_r2176840066
for background.

Note: This is my first real PR to the repo! I didn't do anything outside
of:
1. Change the constants to these new values
2. Verify the existing docs
3. Run `cargo test` and check to make sure there were no failing tests. 

If there were other steps I should've taken to validate the change, let
me know!

Sidenote: Noticed while working on this that I need to update the
existing huggingface_hub docs as
https://linear.app/xet/issue/XET-602/document-hf-xet-shard-cache-size-limit
grew out of date with the default shard cache size.
2025-07-07 17:34:29 -07:00
Joseph Godlewski
948c7b6920 Adding buffer to JWT token expiration check (#405)
Should help reduce the number of HTTP 401 responses users see by
refreshing the tokens earlier than their expiry time.
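
The check amounts to treating a token as expired slightly before it really is. A minimal Python sketch, where `needs_refresh` is a hypothetical name and the 30 second buffer is an assumed value chosen only for illustration:

```python
import time

REFRESH_BUFFER_SECS = 30  # assumed value; the actual buffer is not stated here


def needs_refresh(expires_at, now=None, buffer_secs=REFRESH_BUFFER_SECS):
    """Return True once the token is within buffer_secs of its expiry."""
    now = time.time() if now is None else now
    return now >= expires_at - buffer_secs


still_fresh = needs_refresh(expires_at=1000, now=900)
inside_buffer = needs_refresh(expires_at=1000, now=980)
```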
2025-07-02 12:25:50 -07:00
Hoyt Koepke
45c90b566f Fix for retry failure due to non-clonability (#402)
The current main has an issue where the retry versions of the client are
passed into the retry wrapper code, causing a failure for cloning.
2025-07-01 16:39:46 -06:00
Hoyt Koepke
642e8b7e52 Generic retry wrapper to consolidate and streamline retry logic. (#397)
This PR consolidates the retry logic for the http connections around a
single utility, RetryWrapper, and integrates it more cleanly with the
logging and parsing logic. The goal is to replace the retry middleware
with this, which doesn't work with streaming connections.

This PR introduces this and replaces the simpler connection paths in
RemoteClient with this wrapper but leaves the previous logic intact. The
next step is to fully switch the remaining cases over to this wrapper to
clean up the code.
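
The general shape of such a wrapper, sketched in Python. This is not the RetryWrapper API; `with_retries` and `RetryableError` are hypothetical names and the backoff schedule is an assumption. The point it illustrates is that the whole operation is invoked again on each attempt, so a streaming request is rebuilt rather than replayed by middleware.

```python
import time


class RetryableError(Exception):
    """Marks failures that are worth retrying (e.g. a transient server error)."""


def with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}
delays = []


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RetryableError("transient failure")
    return "ok"


outcome = with_retries(flaky_request, max_attempts=5, base_delay=1, sleep=delays.append)
```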

---------

Co-authored-by: Assaf Vayner <assaf@huggingface.co>
2025-07-01 15:23:47 -06:00
Di Xiao
9fbd234328 wasm poc (#272)
This implements uploading through Xet protocol in WASM environment, and
makes necessary changes to make dependent crates WASM compatible.
1. Uploading through Xet protocol is done in hf_xet_wasm crate;
2. Separate Cas Client trait definitions into upload and download
functionality groups and disable download for WASM;
3. Disable Cas Client request retry in WASM environment, which isn't
critical for a POC (until we have a retry strategy that doesn't depend
on time);
4. Disable async CasObject deserialization;
5. Enable in-memory global dedup;

---------

Co-authored-by: Assaf Vayner <assaf@huggingface.co>
2025-06-25 12:08:48 -07:00
Assaf Vayner
8061c2c1fd streaming shard interface updates (#392)
Changes the streaming_shard/MDBMinimalShard implementation with the
following changes:

- minimal shard now holds vec's of MDBFileInfoView's and
MDBCASInfoView's
- Each View type holds a bytes::Bytes object instead of `Arc<[u8]>` and
an offset
- enables passing in a custom callback taking a reference when
deserializing a MinimalShard from an `AsyncRead`
  - this will be used by CAS server.


Also removed deserialize_async functions defined in #382.
2025-06-24 09:22:23 -07:00
Rajat Arya
d55c6a27e9 Cargo.toml+lock version update (#395) v1.1.5 v1.1.5-rc0 2025-06-20 14:09:14 -07:00
Hoyt Koepke
2cdc186775 Switch cert loading to use load_native_certs(); (#393) 2025-06-20 10:29:09 -07:00
Assaf Vayner
7d6301ff2b fix MDBFileInfo::deserialize_async in case of no verification entries (#388)
fixes issue in #382, where if a file info has no verification info then
deserialization would be incorrect.
2025-06-17 12:57:54 -07:00
Hoyt Koepke
7f89855147 Background loading for shards (#384)
Currently, loading all of the shards is done in a blocking manner, which
means that a large number of shards causes the call to upload_files to
take a long time to get started. This PR optimizes this path by loading
the lookup table sections of the shard directories in the background
while the chunking and file reading can get started.

It also introduces a new utility class, RwTaskLock, which provides a
RwLock-like interface around a value that can either be specified by the
value itself or by a future that resolves to the value. This makes it
easy to background tasks when values like lookup tables are held behind
an rwlock-like interface. This utility is self-contained and unit tests
are provided.
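
A loose Python analogue of the idea, using `asyncio`. It is read-only and much simpler than RwTaskLock, and `TaskBackedValue` and `load_lookup_table` are hypothetical names rather than anything in the codebase.

```python
import asyncio


class TaskBackedValue:
    """Holds either a ready value or a background task that will produce it."""

    def __init__(self, value=None, task=None):
        self._value = value
        self._task = task

    @classmethod
    def from_value(cls, value):
        return cls(value=value)

    @classmethod
    def from_coroutine(cls, coro):
        # Must be called inside a running event loop; loading starts right away.
        return cls(task=asyncio.ensure_future(coro))

    async def read(self):
        if self._task is not None:
            self._value = await self._task  # wait only if still loading
            self._task = None
        return self._value


async def load_lookup_table():
    await asyncio.sleep(0.01)  # stand-in for reading shard lookup tables
    return {"chunk-a": 1}


async def main():
    table = TaskBackedValue.from_coroutine(load_lookup_table())
    ready = TaskBackedValue.from_value({"chunk-b": 2})
    # Chunking and file reading could proceed here while the table loads.
    return await table.read(), await ready.read()


loaded, immediate = asyncio.run(main())
```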
2025-06-17 11:00:07 -07:00
Assaf Vayner
8c2bbaa8d0 Shard interface updates (#382)
Changes to be used potentially in a CAS server PR.

- consistent usage of futures::io::AsyncRead and import pattern
- add deserialize_async variants to cas info and file info used structs.
The only difference is the use of async readers, but we still just read
the whole struct's worth (except top level) of data and deserialize from
a slice.
- constants exports
2025-06-17 10:01:50 -07:00
Assaf Vayner
8f7e9c8d47 hf_xet Cargo.toml 1.1.4 (#387)
https://www.notion.so/huggingface2/CM-20250616-hf_xet-1-1-4-release-2141384ebcac80d69926e0203f55ee08?source=copy_link
v1.1.4-rc0 v1.1.4
2025-06-16 13:51:58 -07:00
Hoyt Koepke
5d94e17813 Change download concurrency limit from local to global. (#385)
This PR changes the limit on the number of simultaneous range fetches
from a per-file limit to a global limit. It defaults to 128.

Currently, each TCP connection creates a socket, which uses up a file
handle. On OSX, this is limited to 256 by default, which means that
downloading multiple large files would quickly exhaust this.

This is also the cause of the dns resolver errors in
https://github.com/huggingface/xet-core/issues/373.

The tokio semaphore is fair, which means that permits are released in
the order in which they were requested. Thus this shouldn't have any
behavior change from the existing, except for the cap on the number of
simultaneous fetches.
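
A Python `asyncio` sketch of a single limit shared across all files. The names are hypothetical, `asyncio.Semaphore` stands in for the tokio semaphore, and a small limit is used so the effect is easy to see.

```python
import asyncio


async def download_all(files, global_limit):
    """Fetch every range of every file under one shared concurrency limit."""
    limiter = asyncio.Semaphore(global_limit)  # one limiter for all files
    stats = {"in_flight": 0, "peak": 0}

    async def fetch_range(name, index):
        async with limiter:
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            await asyncio.sleep(0)  # stand-in for the network request
            stats["in_flight"] -= 1
        return (name, index)

    tasks = [
        fetch_range(name, i)
        for name, range_count in files.items()
        for i in range(range_count)
    ]
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


results, peak = asyncio.run(download_all({"a.bin": 10, "b.bin": 10}, global_limit=4))
```

With a per-file limit, two files would allow up to twice as many simultaneous fetches; with the shared limiter the peak stays at or below the single global value no matter how many files are in flight.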
v1.1.4-rc5
2025-06-12 18:31:00 -07:00
Hugo Larcher
99e27b69c1 Fix/dns resolution (#383)
It seems that some DNS resolvers struggle to correctly resolve CAS
server address and fallback to local search domain
(https://github.com/huggingface/huggingface_hub/issues/3155).
This may be due to a wrong DNS resolver configuration (possibly
`ndots>2`).
This PR implements a custom DNS resolver to force absolute DNS name
resolution.
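
In DNS, a name with a trailing dot is absolute, so resolvers do not try search domains for it. The Python sketch below shows that normalization as one possible approach; the PR does not say how its resolver forces absolute names, and `to_absolute_name` and the example host name are hypothetical.

```python
import ipaddress


def to_absolute_name(host):
    """Append a trailing dot so the name is treated as fully qualified."""
    try:
        ipaddress.ip_address(host)
        return host  # IP literals are left untouched
    except ValueError:
        pass
    return host if host.endswith(".") else host + "."


absolute = to_absolute_name("cas-server.example")
unchanged = to_absolute_name("cas-server.example.")
literal = to_absolute_name("127.0.0.1")
```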
v1.1.4-rc3
2025-06-12 11:11:06 -07:00
Hoyt Koepke
713164b419 Remove hickory-dns and use system dns provider (#380)
On custom environments, hickory-dns doesn't access a system-configured
dns server, which means it's not going to work universally.
2025-06-11 10:33:25 -07:00
Hoyt Koepke
62a6739452 Update shard cache default size. (#381)
After some back-of-the-envelope calculations and looking at some of our
users and what they're uploading, I think a 16GB limit on the shard
cache size is more appropriate. This effectively allows dedup against 16
TB of data while not being a huge burden relative to other aspects of
the Hugging Face cache.
2025-06-11 10:33:11 -07:00
Assaf Vayner
80c0a7ffc9 add ci steps to check cargo.lock is up to date (#377)
We keep having an out-of-date hf_xet/Cargo.lock, likely because people
are not building hf_xet every time they push to the repo. This PR
enforces that hf_xet/Cargo.lock and the root Cargo.lock are up to
date; a CI job will fail if they are not.
2025-06-11 10:25:28 -07:00
Hoyt Koepke
c42173ecfc Switch reqwest to rustls-tls from default; use hickory-dns for dns resolution. (#378)
This PR switches reqwest to use the native-tls package, which provides a
more reliable abstraction over linux openssl implementation oddities. In
addition, it enables the trust-dns feature to embed a more robust dns
resolution path, which hopefully will fix the dns name resolution errors
such as

Now, the cas_client package exposes three features. The native-tls and
native-tls-vendored are passed on to hf_xet for building wheels, while
rustls-tls is the default.

```
# rustls-tls embeds all ssl stuff in a rust package.  This is the most portable option, but also may not respect local
# network configurations.   Default.
rustls-tls = ["reqwest/rustls-tls"]

# Uses native tls in the reqwest package; this uses the native-tls package to wrap openssl, which is a more robust and portable
# way of ensuring that tls just works.
native-tls = ["reqwest/native-tls"]

# This uses the above, but statically compiles in openssl, which makes the result more portable at the expense of 
# library size. 
native-tls-vendored = ["reqwest/native-tls-vendored"]
```

To enable this, some dependencies were moved around so that the hf_xet
package now depends directly on cas_client, which is the only package to
depend on reqwest.

---------

Co-authored-by: marked23 <@marked23>
v1.1.4-rc1
2025-06-10 16:03:50 -07:00
Hoyt Koepke
07a6507272 Small optimizations for chunking / upload path (#371)
This PR converts the fundamental data type in the Chunk class from
Arc<[u8]> to bytes::Bytes. The latter functions like the former, but
allows us to avoid a copy of the data on creation as conversion from a
vector is trivial. In addition, it allows creating slices by reference,
so chunks can just refer to their original data without a copy. This is
now okay, as we convert the xorbs to their compressed form quickly so
chunks do not hang around long, meaning the extra memory overhead of
this is negligible.
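
Python's `memoryview` gives a rough analogue of slicing by reference. This only illustrates the zero-copy idea; it is not how `bytes::Bytes` is implemented, and `split_chunks` is a hypothetical name.

```python
def split_chunks(buffer, boundaries):
    """Return chunk views into buffer, one per boundary, without copying data."""
    view = memoryview(buffer)
    chunks, start = [], 0
    for end in boundaries:
        chunks.append(view[start:end])  # a slice of a view shares the buffer
        start = end
    return chunks


data = bytes(range(10))
chunks = split_chunks(data, [4, 10])
first_chunk = bytes(chunks[0])   # copying happens only when asked for
second_chunk = bytes(chunks[1])
```

Each view keeps the original buffer alive, which is the memory overhead the message refers to: it only matters if chunks outlive the data they were cut from for long.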
2025-06-09 11:47:44 -07:00
Jared Sulzdorf
dd9541299b Adding issue templates to repo (#374)
This PR adds: 
* bug-report.yml - an issue form for bug reports (will likely need to be
extended/changed as new clients are added; for now it is very
Python-centric)
* feature-request.yml - an issue form for feature requests 
* config.yml - additional links that people will see when clicking on
"New Issue" - see https://github.com/huggingface/huggingface_hub/issues
-> `New Issue` for an example

The `.yml` files and directory structure follow the issue form
syntax/structure described here
https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/syntax-for-issue-forms

A lot of this was lifted from the way the `huggingface_hub` repository
structures these templates - see here:
https://github.com/huggingface/huggingface_hub/tree/main/.github/ISSUE_TEMPLATE
- and then modified for some of the more common things we ask (machine
info; HF_XET envvars, etc)

This symmetry between `huggingface_hub` and this repo was intentional so
community members aren't startled by a completely different format for
providing information when opening an issue/asking for a feature.
2025-06-06 07:34:12 -07:00
Assaf Vayner
7c83812242 remove footer serialized from upload xorb payload on remote_client (#372)
Updates the xorb serialized format to not include the xorb footer when
specified.

This is not the cleanest solution necessarily, because the cas_object
tests still require that we serialize the footer (those tests are useful
for the cas server where xorbs will later actually contain the footer)
but are not relevant for the remote_client upload path. Likewise the
LocalClient relies on the xorb footer being serialized.

We may want to refactor this code to remove the footer from xet-core
entirely. We would then have to re-implement LocalClient somehow (which
might be reason enough to keep the footer in xet-core).
2025-06-05 14:59:15 -07:00
Hoyt Koepke
cf03296027 Update chunker to separate out calculation of next boundary (#368)
Currently, the chunker has the logic for calculating the next boundary
woven in with building the next chunk. This code, anticipating several
planned optimizations, separates this functionality into a separate
function.
2025-06-04 10:28:25 -07:00
Rajat Arya
437f5fcc09 Release 1.1.3 version bump (#370)
Confirmed Cargo.lock updated as well.
v1.1.3-rc0 v1.1.3
2025-06-03 17:18:31 -07:00
Hoyt Koepke
e770ac79c9 Reference correctness tests for chunker (#366)
This PR adds in a collection of correctness tests for the base
gearhash-based chunker code. It's meant to also provide reference values
for other implementations of the core algorithm and is written in a way
that can be easily ported.
2025-06-03 15:00:30 -07:00
Hoyt Koepke
5f611bd7de Rename Chunk in the MerkleDB implementation to ChunkInfo. (#367)
This PR renames the struct Chunk that only contained the hash and length
used in the old MerkleDB implementation to ChunkInfo to avoid confusion
with the Chunk class in the Chunker. No functionality change.
2025-06-03 14:48:11 -07:00
Brian Ronan
b39e0a02ab Debug Symbol cleanup and instructions (#348)
A few adjustments to the debug symbol build and release process:

1. Consolidating the debug symbols into a single zip file to clean up
the release asset list and the contents of each zip.
2. Baking the platforms into the names, so we don't have heterogeneous
layers of zip files.
3. Added instructions for how to use the debug symbols to the top-level
readme.

Note: for Linux, since we use the fully qualified platform name in the
symbol linking phase, this will work. Mac doesn't do any name matching
for dSYM files, it only matters that the relocation file is correct.
Windows is unchanged.

Resolves XET-582 and XET-571.
2025-06-03 14:46:58 -07:00