This PR removes the unused telemetry code from hf_xet.
It also removes the Mutex around the logging setup, which
appears to cause an intermittent hang when os.fork() gets involved.
This PR ensures that none of the tokio thread state exists through a
call to python's os.fork() as used in the multiprocessing library. For
an explanation of the issue, see
https://github.com/vllm-project/vllm/blob/main/docs/design/multiprocessing.md#tradeoffs.
It does this by offloading all the async calls to a separate, transient
OS thread, which no longer exists once the spawn happens. Thus any
possible restart of the tokio runtime due to a spawn occurs in a clean
environment, without thread-local storage causing issues.
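
A minimal sketch of the transient-thread idea, using only the standard
library. The helper name `run_on_transient_thread` is hypothetical, and a
plain closure stands in for the async work that the real code drives
through tokio:

```rust
use std::thread;

/// Runs `work` on a short-lived OS thread and blocks until it finishes.
/// Thread-local state created while `work` runs is dropped when that
/// thread exits, so none of it lingers on the calling thread.
/// (Hypothetical helper; not the actual hf_xet function.)
pub fn run_on_transient_thread<T, F>(work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(work)
        .join()
        .expect("transient worker thread panicked")
}
```

Because the calling thread only waits on `join()`, it never touches the
worker's thread-local storage, which is the property the description
relies on.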
To accomplish this, this PR refactors the hf_xet logging layer to
separate it out from the python runtime, as the python runtime is not
Send/Sync. This also simplifies this layer somewhat and isolates the
telemetry reporting logic so that only the background sending thread of
the telemetry logic is restarted after a spawn.
In addition, this PR removes the use of parking_lot, both in
singleflight.rs and as part of tokio. The library is not safe across
fork(); in particular, note
9c810e4a11/core/src/parking_lot.rs (L51).
If a parent process spawns a child process while permits from a static
semaphore are issued, the number of permits available to the child
process will be reduced for the entirety of the child's lifetime, even
when the parent process permits are returned. This could potentially
cause a deadlock or painful slowdown on upload or download. This PR
moves all our static semaphores to ones associated with the runtime, so
after a spawn they are reset.
Currently, tokio spins up async worker threads equal to the number of
cores, which can be quite large on huge machines, e.g. 128. This isn't
needed to keep everything running; we already offload much of the
compute to blocking threads, so one worker per core is significant
overkill, especially if hf_xet is used for downloading only a few files.
This PR limits the number of async worker threads to 32 unless
TOKIO_WORKER_THREADS is set, in which case that value is used. It also
removes the cap on the number of blocking threads tokio can spin up, as
there is no real reason not to use tokio's default value there.
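
The selection rule can be sketched as a pure function. The names below
are hypothetical, and two details are assumptions not stated above: the
default is taken as min(cores, 32), and an unparsable or zero override is
ignored rather than treated as an error.

```rust
/// Cap on async worker threads described above (constant name is hypothetical).
const MAX_ASYNC_WORKER_THREADS: usize = 32;

/// `env_override` is the raw value of TOKIO_WORKER_THREADS, if it is set.
/// `cores` is the number of logical cores reported by the OS.
pub fn async_worker_threads(env_override: Option<&str>, cores: usize) -> usize {
    // Assumption: an explicit, valid override always wins.
    let parsed = env_override
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|n| *n > 0);
    if let Some(n) = parsed {
        return n;
    }
    // Assumption: otherwise use one worker per core, capped at 32.
    cores.clamp(1, MAX_ASYNC_WORKER_THREADS)
}
```

For example, a 128-core machine with no override would get 32 workers,
while an 8-core machine would keep 8.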
---------
Co-authored-by: Di Xiao <di@huggingface.co>
Currently, some environments (e.g. vllm) use spawn or fork-exec to
create a child process. However, this is known to cause issues within
the tokio runtime and lead to hangs, as only the calling thread survives
across spawns but the runtime assumes these threads exist and are
accessible. This PR detects when a fork-exec has occurred and silently
discards the old runtime, creating a new one and reinstalling the signal
handlers and restarting the background threads.
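
One common way to detect that the process has changed is to remember the
PID that created the runtime and compare it on each entry. The sketch
below assumes that approach; the description does not say how detection
is actually done, and `RuntimeState` is a hypothetical stand-in for the
runtime plus its background threads.

```rust
use std::process;

/// Hypothetical stand-in for the runtime and its background threads.
pub struct RuntimeState {
    owner_pid: u32,
    /// Incremented every time the state is rebuilt.
    pub generation: u64,
}

impl RuntimeState {
    pub fn new() -> Self {
        Self::for_pid(process::id())
    }

    pub fn for_pid(pid: u32) -> Self {
        Self { owner_pid: pid, generation: 0 }
    }

    /// Rebuilds the state if `current_pid` differs from the PID that
    /// created it. Returns true when a rebuild happened.
    pub fn ensure_current(&mut self, current_pid: u32) -> bool {
        if self.owner_pid == current_pid {
            return false;
        }
        // Real code would discard the old runtime without joining its
        // (no longer existing) threads, build a new one, reinstall the
        // signal handlers, and restart the background threads here.
        self.owner_pid = current_pid;
        self.generation += 1;
        true
    }
}
```

A caller would invoke `state.ensure_current(std::process::id())` before
handing out the runtime.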
Possible fix for https://github.com/huggingface/xet-core/issues/415;
also issue in https://github.com/vllm-project/vllm/pull/21539.
Added logging improvements to hf_xet:
- When HF_XET_LOG_FILE is set to a valid file path, then events are
logged to that file in a non-blocking manner.
- When HF_XET_LOG_FORMAT is set to either "json" or "text", it overrides
the default log format. By default, logging to a file uses json and
logging to the console uses pretty printing.
- Added debug log statements in the upload and download paths to help
trace activity.
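
The precedence between the two variables can be sketched as follows. The
type and function names are hypothetical, and several details are
assumptions: "text" is treated as the same human-readable format used on
the console, values are matched case-insensitively, unrecognized values
fall back to the default, and "valid file path" is simplified to
"non-empty".

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogFormat {
    Json,
    /// Assumption: "text" means the human-readable console format.
    Text,
}

#[derive(Debug, PartialEq, Eq)]
pub struct LogConfig {
    pub file: Option<PathBuf>,
    pub format: LogFormat,
}

/// `log_file` and `log_format` are the raw values of HF_XET_LOG_FILE and
/// HF_XET_LOG_FORMAT, if set.
pub fn resolve_log_config(log_file: Option<&str>, log_format: Option<&str>) -> LogConfig {
    let file = log_file.filter(|p| !p.is_empty()).map(PathBuf::from);
    // Default depends on the destination: json for files, text for console.
    let default_format = if file.is_some() { LogFormat::Json } else { LogFormat::Text };
    // An explicit, recognized format overrides the default.
    let format = match log_format.map(str::to_ascii_lowercase).as_deref() {
        Some("json") => LogFormat::Json,
        Some("text") => LogFormat::Text,
        _ => default_format,
    };
    LogConfig { file, format }
}
```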
https://huggingface.slack.com/archives/C07KCK52LBY/p1753369740405649
It's not trivial in JS: we would need to either create a DataView on the
Uint8Array (we have a string) or do the little-endian conversion manually
on the specific two bytes of the hash that are involved.
At the moment it's simpler to output this information from the WASM.
---------
Co-authored-by: Assaf Vayner <assafvayner@gmail.com>
Since the upload shard API hardening, the server no longer uses the
shard footer. This PR truncates the footer from the body of the upload
shard API call when sending to the CAS server.
Enforce that shards are cut at <= target_shard_max_size (64 MiB).
Previously we only enforced that shards were cut after exceeding this
limit, which in theory allowed shards of up to just under 128 MiB. After
this PR, all shards will be <= 64 MiB.
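
The difference comes down to checking the limit before adding an entry
rather than after. A minimal sketch, with a hypothetical `cut_shards`
helper that works on entry sizes only and assumes no single entry is
larger than the limit:

```rust
/// Groups entry sizes into shards so that no shard exceeds `max_size`.
/// Returns the total size of each shard. (Hypothetical helper; the real
/// code operates on shard contents, not bare sizes.)
pub fn cut_shards(entry_sizes: &[u64], max_size: u64) -> Vec<u64> {
    let mut shards = Vec::new();
    let mut current = 0u64;
    for &size in entry_sizes {
        // Cut *before* adding an entry that would push the shard past
        // the limit, instead of cutting once the limit was exceeded.
        if current > 0 && current + size > max_size {
            shards.push(current);
            current = 0;
        }
        current += size;
    }
    if current > 0 {
        shards.push(current);
    }
    shards
}
```

With the old "cut after exceeding" rule, a shard sitting just under the
limit could still accept one more large entry, which is where the
theoretical near-128 MiB case came from.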
This PR removes the old merkledb/ code, extracting the remaining
functions still used for calculating the aggregate hash functions and
moves them to merklehash. This provides a significant gain in code
clarity while also providing a speedup to hash computation as the entire
merkle tree is not built to compute the hash. Existing tests ensure
correctness.
Re-adding a thin wasm crate for JS client development.
The build is checked in the CI job build_and_test-wasm.
At the moment it only includes a wrapper over a chunker and a function
to compute the xorb hash.
Adds the system-proxy feature for reqwest in order to handle proxies
specified on the system. This required a minor version upgrade to
support. Hopefully this is a fix for
https://github.com/huggingface/xet-core/issues/400.
This PR adds reference correctness tests for the cas and file node
aggregate test functions, including corner cases. It hardcodes the
values in a way that allows other implementations to test correctness
and guards against future changes altering the hash values.
Currently, the Client trait has numerous small traits underneath it, but
only the umbrella Client trait is ever used. This PR simplifies this
interface by dropping the unneeded fine-grained traits.
It also consolidates the multiple routes for the global dedup query into
a single function that returns an in-memory shard to further reduce the
complexity.
There should be no functionality change, just code moving around and
trait simplification.
Updating the chunk and shard cache default size to use powers of 10
instead of powers of 2. See
https://github.com/huggingface/huggingface_hub/pull/3190#discussion_r2176840066
for background.
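
For reference, the two unit conventions differ by about 7% at the
gigabyte scale. The constant names below are hypothetical and the actual
default sizes are not restated here:

```rust
/// Decimal gigabyte: a power of 10.
pub const GB: u64 = 1_000_000_000;
/// Binary gibibyte: a power of 2.
pub const GIB: u64 = 1 << 30;

/// Size of `n` decimal gigabytes in bytes.
pub fn gigabytes(n: u64) -> u64 {
    n * GB
}
```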
Note: This is my first real PR to the repo! I didn't do anything outside
of:
1. Change the constants to these new values
2. Verify the existing docs
3. Run `cargo test` and check to make sure there were no failing tests.
If there were other steps I should've taken to validate the change, let
me know!
Sidenote: Noticed while working on this that I need to update the
existing huggingface_hub docs as
https://linear.app/xet/issue/XET-602/document-hf-xet-shard-cache-size-limit
grew out of date with the default shard cache size.
This PR consolidates the retry logic for the http connections around a
single utility, RetryWrapper, and integrates it more cleanly with the
logging and parsing logic. The goal is to replace the retry middleware
with this, which doesn't work with streaming connections.
This PR introduces this and replaces the simpler connection paths in
RemoteClient with this wrapper but leaves the previous logic intact. The
next step is to fully switch the remaining cases over to this wrapper to
clean up the code.
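
To illustrate the general shape of such a wrapper, here is a small
synchronous sketch. It is not the RetryWrapper API: the real utility is
async and tied into the HTTP, logging, and parsing layers, while every
name below (`RetryPolicy`, `Attempt`) is hypothetical.

```rust
use std::thread;
use std::time::Duration;

/// Outcome of a single attempt, as classified by the caller.
pub enum Attempt<T, E> {
    Done(T),
    /// Transient failure; worth trying again.
    Retry(E),
    /// Permanent failure; give up immediately.
    Fatal(E),
}

pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
}

impl RetryPolicy {
    /// Calls `op` with the 1-based attempt number until it succeeds,
    /// fails permanently, or `max_attempts` is reached.
    pub fn run<T, E>(&self, mut op: impl FnMut(u32) -> Attempt<T, E>) -> Result<T, E> {
        let mut attempt = 1;
        loop {
            match op(attempt) {
                Attempt::Done(value) => return Ok(value),
                Attempt::Fatal(err) => return Err(err),
                Attempt::Retry(err) if attempt >= self.max_attempts => return Err(err),
                Attempt::Retry(_) => {
                    // Exponential backoff: base, 2x base, 4x base, ...
                    let factor = 1u32 << (attempt - 1).min(16);
                    thread::sleep(self.base_delay * factor);
                    attempt += 1;
                }
            }
        }
    }
}
```

Because the operation is re-invoked from scratch on every attempt, this
shape also works when the request body is a stream that has to be
recreated, which is the case the middleware cannot handle.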
---------
Co-authored-by: Assaf Vayner <assaf@huggingface.co>
This implements uploading through the Xet protocol in a WASM environment,
and makes the changes necessary to make dependent crates WASM compatible.
1. Uploading through the Xet protocol is done in the hf_xet_wasm crate;
2. Separate Cas Client trait definitions into upload and download
functionality groups and disable download for WASM;
3. Disable Cas Client request retry in the WASM environment, which isn't
critical for a POC (until we have a retry strategy that doesn't depend
on time);
4. Disable async CasObject deserialization;
5. Enable in-memory global dedup;
---------
Co-authored-by: Assaf Vayner <assaf@huggingface.co>
Updates the streaming_shard/MDBMinimalShard implementation as follows:
- minimal shard now holds Vecs of MDBFileInfoView's and
MDBCASInfoView's
- Each View type holds a bytes::Bytes object instead of `Arc<[u8]>` and
an offset
- enables passing in a custom callback taking a reference when
deserializing a MinimalShard from an `AsyncRead`
- this will be used by CAS server.
Also removed deserialize_async functions defined in #382.
Currently, loading all of the shards is done in a blocking manner, which
means that a large number of shards causes the call to upload_files to
take a long time to get started. This PR optimizes this path by loading
the lookup table sections of the shard directories in the background
while the chunking and file reading can get started.
It also introduces a new utility class, RwTaskLock, which provides a
RwLock-like interface around a value that can either be specified by the
value itself or by a future that resolves to the value. This makes it
easy to background tasks when values like lookup tables are held behind
an rwlock-like interface. This utility is self-contained and unit tests
are provided.
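
The idea can be sketched with standard-library threads. This is not the
RwTaskLock implementation: the real utility resolves a future, while this
hypothetical `TaskLock` joins a background thread on first access.

```rust
use std::sync::RwLock;
use std::thread::{self, JoinHandle};

enum Slot<T> {
    Ready(T),
    Pending(JoinHandle<T>),
    /// Only observable if the background task panicked.
    Failed,
}

/// RwLock-like wrapper whose value is either given directly or produced
/// by a background task. (Thread-based stand-in for a future.)
pub struct TaskLock<T> {
    slot: RwLock<Slot<T>>,
}

impl<T: Send + 'static> TaskLock<T> {
    pub fn from_value(value: T) -> Self {
        Self { slot: RwLock::new(Slot::Ready(value)) }
    }

    pub fn from_task(task: impl FnOnce() -> T + Send + 'static) -> Self {
        Self { slot: RwLock::new(Slot::Pending(thread::spawn(task))) }
    }

    /// Waits for the background task, if any, and stores its result.
    fn resolve(&self) {
        let ready = matches!(*self.slot.read().unwrap(), Slot::Ready(_));
        if ready {
            return;
        }
        let mut slot = self.slot.write().unwrap();
        // Re-check under the write lock; another caller may have resolved it.
        if matches!(*slot, Slot::Pending(_)) {
            if let Slot::Pending(handle) = std::mem::replace(&mut *slot, Slot::Failed) {
                *slot = Slot::Ready(handle.join().expect("background task panicked"));
            }
        }
    }

    pub fn read<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        self.resolve();
        let guard = self.slot.read().unwrap();
        match &*guard {
            Slot::Ready(value) => f(value),
            _ => panic!("value unavailable"),
        }
    }

    pub fn write<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        self.resolve();
        let mut guard = self.slot.write().unwrap();
        match &mut *guard {
            Slot::Ready(value) => f(value),
            _ => panic!("value unavailable"),
        }
    }
}
```

Construction returns immediately in both cases, so the caller only pays
for the load when it first reads or writes the value.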
Changes to be used potentially in a CAS server PR.
- consistent usage of futures::io::AsyncRead and import pattern
- add deserialize_async variants to the structs used by cas info and
file info. The only difference is the use of async readers, but we still
just read the whole struct's worth (except the top level) of data and
deserialize from a slice.
- constants exports
This PR changes the limit on the number of simultaneous range fetches
from a per-file limit to a global limit. It defaults to 128.
Currently, each TCP connection creates a socket, which uses up a file
handle. On OSX, this is limited to 256 by default, which means that
downloading multiple large files would quickly exhaust this.
This is also the cause of the dns resolver errors in
https://github.com/huggingface/xet-core/issues/373.
The tokio semaphore is fair, which means that permits are released in
the order in which they were requested. Thus this shouldn't have any
behavior change from the existing, except for the cap on the number of
simultaneous fetches.
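
A process-wide limit of this kind can be sketched with a counting
semaphore. The version below uses only the standard library and
hypothetical names; unlike the tokio semaphore mentioned above, a
Mutex/Condvar semaphore makes no fairness guarantee.

```rust
use std::sync::{Condvar, Mutex};

/// Counting semaphore shared by every download in the process.
/// (Std-only stand-in for the tokio semaphore; not fair.)
pub struct FetchLimiter {
    available: Mutex<usize>,
    freed: Condvar,
}

/// Held for the duration of one range fetch; returns the permit on drop.
pub struct FetchPermit<'a> {
    limiter: &'a FetchLimiter,
}

impl FetchLimiter {
    pub fn new(permits: usize) -> Self {
        Self { available: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    pub fn acquire(&self) -> FetchPermit<'_> {
        let mut available = self.available.lock().unwrap();
        while *available == 0 {
            available = self.freed.wait(available).unwrap();
        }
        *available -= 1;
        FetchPermit { limiter: self }
    }
}

impl Drop for FetchPermit<'_> {
    fn drop(&mut self) {
        *self.limiter.available.lock().unwrap() += 1;
        self.limiter.freed.notify_one();
    }
}
```

Sharing one `FetchLimiter::new(128)` across all files bounds the number
of open sockets regardless of how many files are downloading at once.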
It seems that some DNS resolvers struggle to correctly resolve the CAS
server address and fall back to the local search domain
(https://github.com/huggingface/huggingface_hub/issues/3155).
This may be due to a wrong DNS resolver configuration (possibly
`ndots>2`).
This PR implements a custom DNS resolver to force absolute DNS name
resolution.
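
In DNS, a name with a trailing dot is treated as absolute, so search
domains are never appended to it. The sketch below assumes the resolver
relies on that convention, which the description does not confirm; the
function name is hypothetical.

```rust
use std::net::IpAddr;

/// Returns `host` as an absolute DNS name by appending a trailing dot.
/// IP literals, empty strings, and names that are already absolute are
/// returned unchanged. (Hypothetical helper.)
pub fn to_absolute_name(host: &str) -> String {
    if host.is_empty() || host.ends_with('.') || host.parse::<IpAddr>().is_ok() {
        host.to_string()
    } else {
        format!("{}.", host)
    }
}
```

For example, `cas.example.com` (a placeholder hostname) becomes
`cas.example.com.` before being handed to the system resolver.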
After some back-of-the-envelope calculations and looking at some of our
users and what they're uploading, I think a 16GB limit on the shard
cache size is more appropriate. This effectively allows dedup against 16
TB of data while not being a huge burden relative to other aspects of
the Hugging Face cache.
We keep ending up with an out-of-date hf_xet/Cargo.lock, likely because
people are not always building hf_xet before pushing to the repo. This PR
enforces that hf_xet/Cargo.lock and the root Cargo.lock are up to date;
a CI job will fail if they are not.
This PR switches reqwest to use the native-tls package, which provides a
more reliable abstraction over linux openssl implementation oddities. In
addition, it enables the trust-dns feature to embed a more robust dns
resolution path, which hopefully will fix the dns name resolution errors
such as
Now, the cas_client package exposes three features. The native-tls and
native-tls-vendored are passed on to hf_xet for building wheels, while
rustls-tls is the default.
```
# rustls-tls embeds all ssl stuff in a rust package. This is the most portable option, but also may not respect local
# network configurations. Default.
rustls-tls = ["reqwest/rustls-tls"]
# Uses native tls in the reqwest package; this uses the native-tls package to wrap openssl, which is a more robust and portable
# way of ensuring that tls just works.
native-tls = ["reqwest/native-tls"]
# This uses the above, but statically compiles in openssl, which makes the result more portable at the expense of
# library size.
native-tls-vendored = ["reqwest/native-tls-vendored"]
```
To enable this, some dependencies were moved around so that the hf_xet
package now depends directly on cas_client, which is the only package to
depend on reqwest.
---------
Co-authored-by: marked23 <@marked23>
This PR converts the fundamental data type in the Chunk class from
Arc<[u8]> to bytes::Bytes. The latter functions like the former, but
allows us to avoid a copy of the data on creation as conversion from a
vector is trivial. In addition, it allows creating slices by reference,
so chunks can just refer to their original data without a copy. This is
now okay, as we convert the xorbs to their compressed form quickly so
chunks do not hang around long, meaning the extra memory overhead of
this is negligible.
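
The "slice by reference" behaviour can be illustrated with a std-only
stand-in: a shared buffer plus a byte range. This is not how bytes::Bytes
is implemented and `SharedSlice` is a hypothetical type; it only shows
why chunks can point into their parent buffer without copying.

```rust
use std::ops::Range;
use std::sync::Arc;

/// A cheaply clonable view into a shared, immutable buffer.
#[derive(Clone)]
pub struct SharedSlice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    pub fn from_vec(data: Vec<u8>) -> Self {
        let len = data.len();
        // Note: converting a Vec into Arc<[u8]> copies the bytes once.
        // That copy is what the switch to bytes::Bytes avoids.
        Self { buf: Arc::from(data), range: 0..len }
    }

    /// Returns a sub-view relative to this view; no bytes are copied.
    pub fn slice(&self, range: Range<usize>) -> Self {
        assert!(
            range.start <= range.end && range.end <= self.range.len(),
            "slice out of bounds"
        );
        let start = self.range.start + range.start;
        Self { buf: Arc::clone(&self.buf), range: start..start + range.len() }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}
```

The trade-off mentioned above follows directly: any live view keeps the
whole parent buffer allocated, which is acceptable only because chunks
are short-lived.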
This PR adds:
* bug-report.yml - an issue form for bug reports (will likely need to be
extended/changed as new clients are added; for now it is very
Python-centric)
* feature-request.yml - an issue form for feature requests
* config.yml - additional links that people will see when clicking on
"New Issue" - see https://github.com/huggingface/huggingface_hub/issues
-> `New Issue` for an example
The `.yml` files and directory structure follow the issue form
syntax/structure described here
https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/syntax-for-issue-forms
A lot of this was lifted from the way the `huggingface_hub` repository
structures these templates - see here:
https://github.com/huggingface/huggingface_hub/tree/main/.github/ISSUE_TEMPLATE
- and then modified for some of the more common things we ask (machine
info; HF_XET envvars, etc)
This symmetry between `huggingface_hub` and this repo was intentional so
community members aren't startled by a completely different format for
providing information when opening an issue/asking for a feature.
Updates the xorb serialized format to not include the xorb footer when
specified.
This is not necessarily the cleanest solution, because the cas_object
tests still require that we serialize the footer (those tests are useful
for the cas server, where xorbs will later actually contain the footer)
but are not relevant for the remote_client upload path. Likewise, the
LocalClient relies on the xorb footer being serialized.
We may want to refactor this code to remove the footer from xet-core
entirely. We would then have to re-implement LocalClient somehow (which
might be reason enough to keep the footer in xet-core).
Currently, the chunker has the logic for calculating the next boundary
woven in with building the next chunk. This code, anticipating several
planned optimizations, separates this functionality into a separate
function.
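
The separation can be sketched as below. Everything here is
illustrative: the parameter struct, the generated table, and the hash
update are stand-ins, not the values or exact algorithm used by the real
chunker. The sketch assumes `1 <= min_len <= max_len`.

```rust
/// Illustrative parameters only; not the real chunker's values.
pub struct ChunkerParams {
    pub min_len: usize,
    pub max_len: usize,
    pub mask: u64,
}

/// Stand-in gear table filled with a splitmix64-style mixer.
/// The real chunker uses its own fixed table.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Finds the length of the next chunk at the front of `data`.
/// Returns None when no boundary is found and `data` is shorter than
/// `max_len`, i.e. more input is needed (or this is the final chunk).
pub fn next_boundary(data: &[u8], params: &ChunkerParams, table: &[u64; 256]) -> Option<usize> {
    let mut hash = 0u64;
    for (i, &byte) in data.iter().enumerate().take(params.max_len) {
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i + 1;
        if len >= params.min_len && hash & params.mask == 0 {
            return Some(len);
        }
    }
    if data.len() >= params.max_len { Some(params.max_len) } else { None }
}

/// Chunk building is now a thin loop around `next_boundary`.
pub fn chunk_lengths(data: &[u8], params: &ChunkerParams) -> Vec<usize> {
    let table = gear_table();
    let mut lengths = Vec::new();
    let mut rest = data;
    while !rest.is_empty() {
        // Whatever remains at the end of the input becomes the last chunk.
        let len = next_boundary(rest, params, &table).unwrap_or(rest.len());
        lengths.push(len);
        rest = &rest[len..];
    }
    lengths
}
```

Keeping boundary detection in its own function means it can be optimized
or swapped out without touching the code that assembles chunks.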
This PR adds a collection of correctness tests for the base
gearhash-based chunker code. It is also meant to provide reference values
for other implementations of the core algorithm and is written in a way
that can be easily ported.
This PR renames the struct Chunk that only contained the hash and length
used in the old MerkleDB implementation to ChunkInfo to avoid confusion
with the Chunk class in the Chunker. No functionality change.
A few adjustments to the debug symbol build and release process:
1. Consolidating the debug symbols into a single zip file to clean up
the release asset list and the contents of each zip.
2. Baking the platforms into the names, so we don't have heterogenous
layers of zip files.
3. Added instructions for how to use the debug symbols to the top-level
readme.
Note: for Linux, since we use the fully qualified platform name in the
symbol linking phase, this will work. Mac doesn't do any name matching
for dSYM files; it only matters that the relocation file is correct.
Windows is unchanged.
Resolves XET-582 and XET-571.