xet-core

mirror of https://github.com/huggingface/xet-core.git synced 2026-06-04 13:30:29 +08:00

Author	SHA1	Message	Date
Hoyt Koepke	0c810fa3d0	Remove telemetry code; eliminate Mutex on logging setup. (#441 ) This PR removes the unused telemetry code from hf_xet. In addition, it also removes the Mutex around the logging setup, which appears to cause an intermittent hang when os.fork() gets involved. v1.1.7-rc0	2025-08-05 16:41:01 -07:00
Hoyt Koepke	7becae3fde	Updated version to 1.1.6. (#440 ) v1.1.6	2025-08-05 15:40:15 -07:00
Hoyt Koepke	d575b593d4	Make hf_xet safe(ish) across python os.fork() (#437 ) This PR ensures that none of the tokio thread state exists through a call to python's os.fork() as used in the multiprocessing library. For an explanation of the issue, see https://github.com/vllm-project/vllm/blob/main/docs/design/multiprocessing.md#tradeoffs. It does this by offloading all the async calls to a separate and transient OS thread, which would not exist after the spawn process. Thus any possible restart of the tokio runtime due to a spawn would occur in a clean environment and without thread-local storage causing issues. To accomplish this, this PR refactors the hf_xet logging layer to separate it out from the python runtime, as the python runtime is not Send/Sync. This also simplifies this layer somewhat and isolates the telemetry reporting logic so that only the background sending thread of the telemetry logic is restarted after a spawn. In addition, this PR removes the use of parking_lot, both in singleflight.rs and as part of tokio. The library is not safe across fork(); in particular, note `9c810e4a11/core/src/parking_lot.rs (L51)`.	2025-08-05 15:23:48 -07:00
Assaf Vayner	2645b96eb5	only debug log 416 on get recon (#439 )	2025-08-05 11:56:35 -07:00
Assaf Vayner	fdfff55726	halve num concurrent range gets (#438 )	2025-08-05 11:44:56 -07:00
Eliott C.	663c3a7c7d	Remove logging from wasm lib (#434 )	2025-08-01 19:10:32 +02:00
Hoyt Koepke	a2562c9476	Associate static semaphores with runtime (#433 ) If a parent process spawns a child process while permits from a static semaphore are issued, the number of permits available to the child process will be reduced for the entirety of the child's lifetime, even when the parent process permits are returned. This could potentially cause a deadlock or painful slowdown on upload or download. This PR moves all our static semaphores to ones associated with the runtime, so after a spawn they are reset.	2025-07-31 11:03:14 -07:00
Hoyt Koepke	1f80b9ec7b	Respect XDG_CACHE_HOME and ~/ when setting cache directory. (#426 ) Currently, we default the cache directory to `home_dir()/.cache`, but as pointed out in https://github.com/huggingface/xet-core/issues/417, this is the incorrect behavior. This PR switches this behavior to use the [cache_dir()](https://docs.rs/dirs/latest/dirs/fn.cache_dir.html) function to properly determine this. The side effect of this, however, is that on OSX the cache dir will change to the more standard `$HOME/Library/Caches/huggingface/xet`, which means nothing in `~/.cache/huggingface/xet` will be valid anymore. Fixes https://github.com/huggingface/xet-core/issues/417.	2025-07-30 19:13:26 -07:00
Hoyt Koepke	90036fbbc2	Limit number of async worker threads on large CPUs (#431 ) Currently, tokio spins up async worker threads equal to the number of cores, which can be quite large on huge machines, e.g. 128. This isn't needed to keep everything running; we already offload much of the compute to blocking threads and so here the number is a significant overkill, especially if hf_xet is used for downloading only a few files. This PR limits the number of async worker threads to 32 unless TOKIO_WORKER_THREADS is set, in which case that value is used. It also removes the cap on the number of blocking threads tokio can spin up as needed as there is no real reason to not use tokio's default value there. --------- Co-authored-by: Di Xiao <di@huggingface.co>	2025-07-30 18:12:56 -07:00
Assaf Vayner	725973bccc	revert use of v1 api paths (#432 ) Avoiding using the v1 api paths in the next release until the spec is fixed. reverting to the old API paths.	2025-07-30 15:07:36 -07:00
Hoyt Koepke	b86692fc2e	Make hf_xet fork-exec safe (#429 ) Currently, some environments (e.g. vllm) use spawn or fork-exec to create a child process. However, this is known to cause issues within the tokio runtime and lead to hangs, as only the calling thread survives across spawns but the runtime assumes these threads exist and are accessible. This PR detects when a fork-exec has occurred and silently discards the old runtime, creating a new one and reinstalling the signal handlers and restarting the background threads. Possible fix for https://github.com/huggingface/xet-core/issues/415; also issue in https://github.com/vllm-project/vllm/pull/21539.	2025-07-30 13:37:32 -07:00
Eliott C.	7087d68aaf	Export hmac function in thin wasm (#427 ) For global dedup, since we download shards with an hmac key, client-side we need to hmac local hashes to compare them to those in the shards	2025-07-30 18:53:31 +02:00
Hoyt Koepke	803e6b7bcf	Logging improvements (#428 ) Added logging improvements to hf_xet: - When HF_XET_LOG_FILE is set to a valid file path, then events are logged to that file in a non-blocking manner. - When HF_XET_LOG_FORMAT is set to either "json" or "text", it overrides the log format default. By default, when done to a file, logging is done in json, and to console it is done using pretty printing. - Added in debug log statements in the upload and download paths to help trace activity.	2025-07-29 14:44:47 -07:00
Eliott C.	e78df8ccda	add whether chunk should be checked against global dedup (#423 ) https://huggingface.slack.com/archives/C07KCK52LBY/p1753369740405649 It's not trivial in JS: we need to either create a DataView on the Uint8Array (we have a string) or do the little-endian conversion manually on the specific two bytes involved from the hash. At the moment it's simpler to output this information from the WASM. --------- Co-authored-by: Assaf Vayner <assafvayner@gmail.com>	2025-07-28 14:35:37 -07:00
Hoyt Koepke	d0bb17c44c	Errors on shard reading are now logged and ignored. (#424 ) Reading the lookup hashes from a shard could cause an error to be propagated when it should simply be ignored; This PR logs the error but ignores it.	2025-07-24 16:02:41 -07:00
Assaf Vayner	5e6284b628	remove footer from upload shard payload (#419 ) Since the upload shard API hardening the server no longer uses the shard footer. This PR truncates the footer from the body of the upload shard api call when sending to CAS server.	2025-07-24 13:41:45 -07:00
Assaf Vayner	84ce436aa5	set shard size limit as max, not target min (#420 ) Enforce that shards are cut at <= target_shard_max_size (64 MiB). Previously we were enforcing that shards are cut after exceeding this limit, which in theory allowed shards <128 MiB. All shards after this PR will be <= 64 MiB.	2025-07-24 13:41:36 -07:00
Assaf Vayner	66edd71e8e	use v1 api paths (#421 ) Using API paths as specified in new API updates with `/v1` prefix and no shard hash. https://docs.google.com/document/d/14CkiKiwX_y6m4oboh9rUrRCvVZtNqUTZbGzK8M5fh3o/edit?usp=sharing - removes use of salts all around	2025-07-23 13:57:38 -07:00
Assaf Vayner	85869f4fe8	add verification hash and file hash functions (#416 ) Adding more functions to the thin wasm export, needed for forming shards in particular CC @coyotte508 After the first draft, binary size is 97K for me	2025-07-17 11:03:07 -07:00
Hoyt Koepke	5afd1eeef1	Move MDB v1 to reference test code; add standalone hash functions (#414 ) This PR removes the old merkledb/ code, extracting the remaining functions still used for calculating the aggregate hash functions and moves them to merklehash. This provides a significant gain in code clarity while also providing a speedup to hash computation as the entire merkle tree is not built to compute the hash. Existing tests ensure correctness.	2025-07-16 09:46:15 -07:00
Assaf Vayner	b2fc01d479	thin wasm (#411 ) Re-adding a thin wasm crate for JS client development. Checks the build in the CI job build_and_test-wasm. Only includes a wrapper over a chunker and a function to compute the xorb hash at the moment.	2025-07-15 10:11:38 -07:00
Hoyt Koepke	9439193f17	Enabling proxy support for reqwest (#413 ) Adds the system-proxy feature for reqwest in order to handle proxies specified on the system. This required a minor version upgrade to support. Hopefully this is a fix for https://github.com/huggingface/xet-core/issues/400.	2025-07-14 15:04:24 -07:00
Hoyt Koepke	5f13f91d1c	Add correctness tests for aggregate hash functions. (#412 ) This PR adds reference correctness tests for the cas and file node aggregate test functions, including corner cases. It hardcodes the values in a way that allow other implementations to test correctness and guards against future changes altering the hash values.	2025-07-14 13:32:40 -07:00
Hoyt Koepke	225f4b0e9b	Simplified Client interface. (#408 ) Currently, the Client trait has numerous small traits underneath it, but only the umbrella Client trait is ever used. This PR simplifies this interface by dropping the unneeded fine-grained traits. It also consolidates the multiple routes for the global dedup query into a single function that returns an in-memory shard to further reduce the complexity. There should be no functionality change, just code moving around and trait simplification.	2025-07-10 11:13:04 -07:00
Jared Sulzdorf	8fbe9684fc	Updating chunk and shard cache default sizes (#406 ) Updating the chunk and shard cache default size to use powers of 10 instead of powers of 2. See https://github.com/huggingface/huggingface_hub/pull/3190#discussion_r2176840066 for background. Note: This is my first real PR to the repo! I didn't do anything outside of: 1. Change the constants to these new values 2. Verify the existing docs 3. Run `cargo test` and check to make sure there were no failing tests. If there were other steps I should've taken to validate the change, let me know! Sidenote: Noticed while working on this that I need to update the existing huggingface_hub docs as https://linear.app/xet/issue/XET-602/document-hf-xet-shard-cache-size-limit grew out of date with the default shard cache size.	2025-07-07 17:34:29 -07:00
Joseph Godlewski	948c7b6920	Adding buffer to JWT token expiration check (#405 ) Should help reduce the number of HTTP-401 users see by refreshing the tokens earlier than their expiry time.	2025-07-02 12:25:50 -07:00
Hoyt Koepke	45c90b566f	Fix for retry failure due to non-clonability (#402 ) The current main has an issue where the retry versions of the client are passed into the retry wrapper code, causing a failure for cloning.	2025-07-01 16:39:46 -06:00
Hoyt Koepke	642e8b7e52	Generic retry wrapper to consolidate and streamline retry logic. (#397 ) This PR consolidates the retry logic for the http connections around a single utility, RetryWrapper, and integrates it more cleanly with the logging and parsing logic. The goal is to replace the retry middleware with this, which doesn't work with streaming connections. This PR introduces this and replaces the simpler connection paths in RemoteClient with this wrapper but leaves the previous logic intact. The next step is to fully switch the remaining cases over to this wrapper to clean up the code. --------- Co-authored-by: Assaf Vayner <assaf@huggingface.co>	2025-07-01 15:23:47 -06:00
Di Xiao	9fbd234328	wasm poc (#272 ) This implements uploading through Xet protocol in WASM environment, and makes necessary changes to make dependent crates WASM compatible. 1. Uploading through Xet protocol is done in hf_xet_wasm crate; 2. Separate Cas Client trait definitions into upload and download functionality groups and disable download for WASM; 3. Disable Cas Client request retry in WASM environment, which isn't critical for a POC (until we have a retry strategy that doesn't depend on time); 4. Disable async CasObject deserialization; 5. Enable in-memory global dedup; --------- Co-authored-by: Assaf Vayner <assaf@huggingface.co>	2025-06-25 12:08:48 -07:00
Assaf Vayner	8061c2c1fd	streaming shard interface updates (#392 ) Changes the streaming_shard/MDBMinimalShard implementation with the following changes: - minimal shard now holds vec's of MDBFileInfoView's and MDBCASInfoView's - Each View type holds a bytes::Bytes object instead of `Arc<[u8]>` and an offset - enables passing in a custom callback taking a reference when deserializing a MinimalShard from an `AsyncRead` - this will be used by CAS server. Also removed deserialize_async functions defined in #382 .	2025-06-24 09:22:23 -07:00
Rajat Arya	d55c6a27e9	Cargo.toml+lock version update (#395 ) v1.1.5 v1.1.5-rc0	2025-06-20 14:09:14 -07:00
Hoyt Koepke	2cdc186775	Switch cert loading to use load_native_certs(); (#393 )	2025-06-20 10:29:09 -07:00
Assaf Vayner	7d6301ff2b	fix MDBFileInfo::deserialize_async in case of no verification entries (#388 ) Fixes an issue in #382, where if a file info has no verification info then deserialization would be incorrect.	2025-06-17 12:57:54 -07:00
Hoyt Koepke	7f89855147	Background loading for shards (#384 ) Currently, loading all of the shards is done in a blocking manner, which means that a large number of shards causes the call to upload_files to take a long time to get started. This PR optimizes this path by loading the lookup table sections of the shard directories in the background while the chunking and file reading can get started. It also introduces a new utility class, RwTaskLock, which provides a RwLock-like interface around a value that can either be specified by the value itself or by a future that resolves to the value. This makes it easy to background tasks when values like lookup tables are held behind an rwlock-like interface. This utility is self-contained and unit tests are provided.	2025-06-17 11:00:07 -07:00
Assaf Vayner	8c2bbaa8d0	Shard interface updates (#382 ) Changes to be used potentially in a CAS server PR. - consistent usage of futures::io::AsyncRead and import pattern - add deserialize_async variants to cas info and file info used structs. The only difference is the use of async readers, but we still just read the whole struct worth (except top level) of data and deserialize from slice. - constants exports	2025-06-17 10:01:50 -07:00
Assaf Vayner	8f7e9c8d47	hf_xet Cargo.toml 1.1.4 (#387 ) https://www.notion.so/huggingface2/CM-20250616-hf_xet-1-1-4-release-2141384ebcac80d69926e0203f55ee08?source=copy_link v1.1.4-rc0 v1.1.4	2025-06-16 13:51:58 -07:00
Hoyt Koepke	5d94e17813	Change download concurrency limit from local to global. (#385 ) This PR changes the limit on the number of simultaneous range fetches from a per-file limit to a global limit. It defaults to 128. Currently, each TCP connection creates a socket, which uses up a file handle. On OSX, this is limited to 256 by default, which means that downloading multiple large files would quickly exhaust this. This is also the cause of the dns resolver errors in https://github.com/huggingface/xet-core/issues/373. The tokio semaphore is fair, which means that permits are released in the order in which they were requested. Thus this shouldn't have any behavior change from the existing behavior, except for the cap on the number of simultaneous fetches. v1.1.4-rc5	2025-06-12 18:31:00 -07:00
Hugo Larcher	99e27b69c1	Fix/dns resolution (#383 ) It seems that some DNS resolvers struggle to correctly resolve the CAS server address and fall back to the local search domain (https://github.com/huggingface/huggingface_hub/issues/3155). This may be due to a wrong DNS resolver configuration (possibly `ndots>2`). This PR implements a custom DNS resolver to force absolute DNS name resolution. v1.1.4-rc3	2025-06-12 11:11:06 -07:00
Hoyt Koepke	713164b419	Remove hickory-dns and use system dns provider (#380 ) On custom environments, hickory-dns doesn't access a system-configured dns server, which means it's not going to work universally.	2025-06-11 10:33:25 -07:00
Hoyt Koepke	62a6739452	Update shard cache default size. (#381 ) After some back-of-the-envelope calculations and looking at some of our users and what they're uploading, I think a 16GB limit on the shard cache size is more appropriate. This effectively allows dedup against 16 TB of data while not being a huge burden relative to other aspects of the hugging face cache.	2025-06-11 10:33:11 -07:00
Assaf Vayner	80c0a7ffc9	add ci steps to check cargo.lock is up to date (#377 ) We keep having out of date hf_xet/Cargo.lock, likely people are not building hf_xet 100% of the time they are pushing to the repo. This PR enforces that hf_xet/Cargo.lock and the root Cargo.lock must be up to date, a CI job will fail if this is not true.	2025-06-11 10:25:28 -07:00
Hoyt Koepke	c42173ecfc	Switch reqwest to rustls-tls from default; use hickory-dns for dns resolution. (#378 ) This PR switches reqwest to use the native-tls package, which provides a more reliable abstraction over linux openssl implementation oddities. In addition, it enables the trust-dns feature to embed a more robust dns resolution path, which hopefully will fix the dns name resolution errors such as Now, the cas_client package exposes three features. The native-tls and native-tls-vendored are passed on to hf_xet for building wheels, while rustls-tls is the default. ``` # rustls-ssl embeds all ssl stuff in a rust package. This is the most portable option, but also may not respect local # network configurations. Default. rustls-tls = ["reqwest/rustls-tls"] # Uses native tls in the request package; this uses the native-tls package to wrap openssl, which is a more robust and portable # way of ensuring that tls just works. native-tls = ["reqwest/native-tls"] # This uses the above, but statically compiles in openssl, which makes the result more portable at the expense of # library size. native-tls-vendored = ["reqwest/native-tls-vendored"] ``` To enable this, some dependencies were moved around so that the hf_xet package now depends directly on cas_client, which is the only package to depend on reqwest. --------- Co-authored-by: marked23 <@marked23> v1.1.4-rc1	2025-06-10 16:03:50 -07:00
Hoyt Koepke	07a6507272	Small optimizations for chunking / upload path (#371 ) This PR converts the fundamental data type in the Chunk class from Arc<[u8]> to bytes::Bytes. The latter functions like the former, but allows us to avoid a copy of the data on creation as conversion from a vector is trivial. In addition, it allows creating slices by reference, so chunks can just refer to their original data without a copy. This is now okay, as we convert the xorbs to their compressed form quickly so chunks do not hang around long, meaning the extra memory overhead of this is negligible.	2025-06-09 11:47:44 -07:00
Jared Sulzdorf	dd9541299b	Adding issue templates to repo (#374 ) This PR adds: * bug-report.yml - an issue form for bug reports (will likely need to be extended/changed as new clients are added; for now it is very Python-centric) * feature-request.yml - an issue form for feature requests * config.yml - additional links that people will see when clicking on "New Issue" - see https://github.com/huggingface/huggingface_hub/issues -> `New Issue` for an example The `.yml` files and directory structure follow the issue form syntax/structure described here https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/syntax-for-issue-forms A lot of this was lifted from the way the `huggingface_hub` repository structures these templates - see here: https://github.com/huggingface/huggingface_hub/tree/main/.github/ISSUE_TEMPLATE - and then modified for some of the more common things we ask (machine info; HF_XET envvars, etc) This symmetry between `huggingface_hub` and this repo was intentional so community members aren't startled by a completely different format for providing information when opening an issue/asking for a feature.	2025-06-06 07:34:12 -07:00
Assaf Vayner	7c83812242	remove footer serialized from upload xorb payload on remote_client (#372 ) Updates the xorb serialized format to not include the xorb footer when specified. This is not the cleanest solution necessarily, because the cas_object tests still require that we serialize the footer (those tests are useful for the cas server where xorbs will later actually contain the footer) but are not relevant for the remote_client upload path. Likewise the LocalClient relies on the xorb footer being serialized. We may want to refactor this code to remove the footer from xet-core entirely. We would then have to re-implement LocalClient somehow (might be worth enough reason to keep the footer in xet-core).	2025-06-05 14:59:15 -07:00
Hoyt Koepke	cf03296027	Update chunker to separate out calculation of next boundary (#368 ) Currently, the chunker has the logic for calculating the next boundary woven in with building the next chunk. This code, anticipating several planned optimizations, separates this functionality into a separate function.	2025-06-04 10:28:25 -07:00
Rajat Arya	437f5fcc09	Release 1.1.3 version bump (#370 ) Confirmed Cargo.lock updated as well. v1.1.3-rc0 v1.1.3	2025-06-03 17:18:31 -07:00
Hoyt Koepke	e770ac79c9	Reference correctness tests for chunker (#366 ) This PR adds in a collection of correctness tests for the base gearhash-based chunker code. It's meant to also provide reference values for other implementations of the core algorithm and is written in a way that can be easily ported.	2025-06-03 15:00:30 -07:00
Hoyt Koepke	5f611bd7de	Rename Chunk in the MerkleDB implementation to ChunkInfo. (#367 ) This PR renames the struct Chunk that only contained the hash and length used in the old MerkleDB implementation to ChunkInfo to avoid confusion with the Chunk class in the Chunker. No functionality change.	2025-06-03 14:48:11 -07:00
Brian Ronan	b39e0a02ab	Debug Symbol cleanup and instructions (#348 ) A few adjustments to the debug symbol build and release process: 1. Consolidating the debug symbols into a single zip file to clean up the release asset list and the contents of each zip. 2. Baking the platforms into the names, so we don't have heterogeneous layers of zip files. 3. Added instructions for how to use the debug symbols to the top-level readme. Note: for Linux, since we use the fully qualified platform name in the symbol linking phase, this will work. Mac doesn't do any name matching for dSYM files; it only matters that the relocation file is correct. Windows is unchanged. Resolves XET-582 and XET-571.	2025-06-03 14:46:58 -07:00
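
The worker-thread policy described in #431 (cap async workers at 32 unless TOKIO_WORKER_THREADS is set) can be illustrated with a small sketch. This is not the hf_xet implementation, which is Rust code configuring the tokio runtime; the function name `async_worker_threads` and the `env` parameter are hypothetical and exist only to make the rule testable.

```python
import os

# Cap stated in the #431 commit message.
MAX_ASYNC_WORKER_THREADS = 32


def async_worker_threads(cpu_count, env=None):
    """Return the number of async worker threads to start.

    Hypothetical helper: an explicit TOKIO_WORKER_THREADS value wins;
    otherwise the core count is used, capped at 32.
    """
    env = os.environ if env is None else env
    override = env.get("TOKIO_WORKER_THREADS")
    if override is not None:
        # Assumption: the value is a valid positive integer.
        return int(override)
    return min(cpu_count, MAX_ASYNC_WORKER_THREADS)
```

Under this rule, a 128-core machine would start 32 async workers by default, while an explicit environment override is used as-is even when it exceeds the cap.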

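The difference between the two shard-cutting rules in #420 (cut after exceeding the limit versus cut so the limit is never exceeded) is easy to see on a toy example. The sketch below is an illustration only: the function names are hypothetical, sizes are plain integers rather than real shard entries, and it assumes no single entry is larger than the limit.

```python
def cut_after_exceeding(entry_sizes, limit):
    """Old rule: close a shard only once it has grown past the limit."""
    shards, current = [], 0
    for size in entry_sizes:
        current += size
        if current > limit:
            shards.append(current)
            current = 0
    if current:
        shards.append(current)
    return shards


def cut_at_or_below(entry_sizes, limit):
    """New rule: close the shard before an entry would push it past the limit."""
    shards, current = [], 0
    for size in entry_sizes:
        if current and current + size > limit:
            shards.append(current)
            current = 0
        current += size
    if current:
        shards.append(current)
    return shards
```

With a limit of 64 and three entries of size 40, the old rule yields a first shard of size 80, whereas the new rule keeps every shard at or below 64.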

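The expiry buffer from #405 amounts to treating a token as expired slightly before its real expiry time, so a refresh happens before the server starts returning HTTP 401. A minimal sketch, assuming Unix timestamps in seconds; the function name and the 30-second default are hypothetical, since the commit message does not state the actual buffer length.

```python
def token_needs_refresh(expiry_unix, now_unix, buffer_secs=30):
    """Return True once the token is within buffer_secs of its expiry.

    buffer_secs=30 is an illustrative default, not the value used in xet-core.
    """
    return now_unix + buffer_secs >= expiry_unix
```

With a buffer of zero this reduces to a plain expiry check; a positive buffer trades a slightly earlier refresh for fewer requests sent with a token that expires in flight.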
328 Commits