xet-core

mirror of https://github.com/huggingface/xet-core.git synced 2026-06-04 13:30:29 +08:00

Author	SHA1	Message	Date
Di Xiao	75952ae618	Better support "xet-write-token" API authorization model and LFS Batch Api change (#498 ) 1. This PR updates the hub client xet access token request to use custom rev in addition to the default "main". This better supports the "xet-write-token" API authorization model: Clients can get a xet write token, if - the "rev" is a regular branch, with a HF write token; - the "rev" is a pr branch with a corresponding open PR, with a HF write or read token; - it intends to create a pr and repo is enabled for discussion, with a HF write or read token. 2. Fixed a bug when getting the current branch name in a repo, which didn't parse branch names with "/" correctly: change `refs_heads_branch.rsplit('/').next()` to `refs_heads_branch.strip_prefix("refs/heads/")`. 3. Also updated xet transfer agent to use the refresh route in the LFS Batch Api [response](`e3be2b3c8f/server/app/gitHostingRoutes.ts (L1713)`). 4. Use the session id in the LFS Batch Api [response](`e3be2b3c8f/server/app/gitHostingRoutes.ts (L1657)`) for token refresh and CAS requests.	2025-09-23 16:07:54 -07:00
Di Xiao	15942e295e	Fix git xet release bug (#504 ) `gh release create` creates tags and thus requires repo checkout	2025-09-22 16:57:35 -07:00
Di Xiao	55234c489b	Build and release git-xet (#499 ) This defines the workflow to build git-xet on Linux (amd64 & arm64), macOS (amd64 & arm64) and Windows (amd64). For the macOS build the compiled binary is signed and notarized in place.	2025-09-22 15:44:02 -07:00
Hoyt Koepke	610874ab04	Allow Duration and byte sizes in constants for easier use. (#495 ) - Duration: Currently, we use a lot of _MS and _SEC suffixes in constants to denote duration. This PR allows std::time::Duration to be used directly, with values such as "10sec" or "100ms" or "1d" translated directly into std::time::Duration. - ByteSize: It also introduces a new utility type, ByteSize, that simply wraps a u64 but allows the user to specify "1mb" or "45gb" as the value when setting constant values. The suffixes mb, mib, kb, kib, gb, gib, b, etc. are all supported, with the default being the raw value.	2025-09-19 10:59:11 -07:00
Rajat Arya	f612564c25	MacOS diag scripts (#497 ) Adds corresponding Diagnostics script for MacOS. Lightly tested with hf-xet 1.1.10 and MacOS 15.6.1 - correctly takes stack traces on interval and writes out to diagnostics folder as expected.	2025-09-17 15:07:35 -07:00
Di Xiao	fa030edcd5	upgrade rust edition to 2024; upgrade rustc to 1.89 (#494 ) - Upgrade Rust edition and rustc version to bring in some nice features, e.g. let chains instead of nested if blocks. - Fix clippy and format due to the upgrade. - Fix a bug identified by the new rustc: `6cb0a7fb4e/xet_runtime/src/runtime.rs (L195)` ``` #[cfg(not(target_family = "wasm"))] { // A new multithreaded runtime with a capped number of threads TokioRuntimeBuilder::new_multi_thread().worker_threads(get_num_tokio_worker_threads()) } ``` Here the closing curly bracket drops the temporary builder while a `&mut Self` to the dropped value is returned. (This may be due to a difference between compilers regarding how they treat the scope of "{...}" of `#[cfg(...)] {...}`?)	2025-09-17 10:28:50 -07:00
Hoyt Koepke	6cb0a7fb4e	Improved user-configurable constant handling (#493 ) This PR adds two upgrades to the current configurable_constants! macro, which allows users to specify the values of configuration constants using environment variables and the like. It adds two things: - Allows bool values to be parsed by 0, 1, true, false, on, off, etc. configurable_bool_constants! is no longer needed. - Allows Option<T> to be a specified type with a default value of None, which parses the environment value as type T but puts it in Some(Value) if it's present and None if it's not specified. This allows us to determine if a value has been specified, e.g. in the case where the default depends on other things but can be overridden.	2025-09-16 12:43:33 -07:00
Hoyt Koepke	a715926cc7	Rename Threadpool class name to XetRuntime to reflect usage (#491 ) The Threadpool class does quite a bit more than just manage a threadpool; this PR simply changes the name to reflect this usage.	2025-09-15 11:30:28 -07:00
Assaf Vayner	22f86db343	Adding README to few crates for documentation (#492 ) Added README to a few crates so that we can link to the crate directory to link to individual crates and have something rendered for reference implementation when linking to a specific file doesn't make sense.	2025-09-15 11:15:05 -07:00
Assaf Vayner	81b0833965	hf_xet 1.1.10 (#490 ) update version of hf_xet v1.1.10-rc0 v1.1.10	2025-09-11 14:57:19 -07:00
Rajat Arya	c762c681ef	Diagnostic Scripts + README changes (#489 ) - Adds diagnostic scripts to root of repo and references them in README. - Also reorganizes README to make diagnostics & debugging more visible. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-09-11 14:54:22 -07:00
Hoyt Koepke	fe71e3dd54	Updated chunker to eliminate spurious boundary triggering. (#487 ) We must enforce that the next boundary is actually past the minimum chunk size. In rare cases, a boundary could be triggered before the minimum chunk size has passed, and this triggering would be based not on the content of the file but on the previous state of the rolling hash function. This PR fixes this case.	2025-09-10 16:52:48 -07:00
Joseph Godlewski	f24c97eedb	Adding retry for unhandled io errors when sending requests (#468 ) Allows us to retry errors when we receive I/O errors when sending requests. For example, on macOS, when we see: ``` No buffer space available (os error 55) ``` when downloading from S3, we should wait and retry once the system has more network resources.	2025-09-10 16:44:05 -07:00
Hoyt Koepke	263020646f	Fix wheel upload for linux for dev/alpha/beta tags (#379 ) When using beta tags, the upload process doesn't have the wheels in the correct directory. This fixes that.	2025-09-09 10:29:27 -07:00
Di Xiao	e2f7861809	Drop "GaiResolverWithAbsolute" (#486 ) Removes this custom DNS resolver that forces domain names to be absolute. Using relative domain names is a valid practice in Kubernetes clusters to configure proxy servers, e.g. https://github.com/huggingface/huggingface_hub/issues/3323 The original [PR](https://github.com/huggingface/xet-core/pull/383) that added this resolver was trying to resolve https://github.com/huggingface/huggingface_hub/issues/3155, which later appeared to be a problem of local user misconfiguration.	2025-09-09 10:26:56 -07:00
Di Xiao	0e1f9f4cf0	Git-Xet: LFS custom transfer agent with Xet protocol (#425 ) This PR builds a Git integration called `git-xet` that enables users to upload files using the Xet protocol as part of a standard git push. This integration builds on the Git LFS custom transfer adapter protocol, the same mechanism we now use to handle Git LFS uploads for files larger than 5 GB through multipart PUT. To enable uploads to Xet, users run `git-xet install`, which writes the following configuration to the Git config file at a selected scope [`--system`, `--global` (default), or `--local`]: ``` [lfs "customtransfer.xet"] path = git-xet args = transfer concurrent = true ``` This setup registers a new transfer adapter named xet, allowing Git to delegate LFS file transfers to the git-xet binary when applicable. On the server side, support is rolled out in two stages: Stage 1 (Upload): The Git LFS batch API for the "upload" operation is updated. - If a repo is Xet enabled but users didn't run git-xet install, moon-landing rejects the request when users initiated git push and returns an instruction to install git-xet. - If a repo is Xet enabled and users have git-xet configured correctly, moon-landing accepts the request and replies with CAS server URL and access token, which git-xet will use to upload files to Xet. - If a repo is NOT Xet enabled, upload goes through the LFS path.	2025-09-08 16:08:50 -07:00
Assaf Vayner	e01896e074	use u64 rather than usize in file hashing paths (#485 ) Using the file hashing components in WASM revealed a bug: a 32-bit usize causes errors when hashing the file. This PR enforces the use of u64 everywhere along that path (and also pins the wasm-bindgen version).	2025-09-08 14:27:58 -07:00
Hoyt Koepke	4d948d1a76	Rename xet_threadpool to xet_runtime to reflect usage (#484 ) The xet_threadpool subdirectory is increasingly the place for utilities related to the runtime, managing file handle limits, etc. This PR simply renames this directory to reflect this switch.	2025-09-08 13:32:48 -07:00
Assaf Vayner	6203653ecf	update api paths to use plural nouns (#482 ) Updates paths used by the clients to use latest CAS paths as defined in the spec. All paths now use plural nouns and shard upload no longer uses the hash, removes the prefix and hash from the client trait upload_shard function.	2025-09-08 13:02:49 -07:00
Eliott C.	3ff4eb2d56	Thin wasm: do not automatically set is_dedup to true for first chunk (#481 ) Related to https://github.com/huggingface/huggingface.js/pull/1718. We'll want to edit parts of a file while loading the old data's dedup info. In those cases we don't always want to load dedup info for the first chunk (since it may not be at the beginning of the file), so the is_dedup = true for the first chunk is handled client side.	2025-09-06 09:31:58 +02:00
Rajat Arya	cc247a9d5a	Add input params to Run name in GH Workflow UI (#478 )	2025-08-29 09:21:24 -07:00
Rajat Arya	7f53907434	Bumping version to 1.1.9 (#476 ) v1.1.9-rc1 v1.1.9 1.1.9-rc1	2025-08-27 15:35:45 -07:00
Rajat Arya	003b154284	Update hf_xet/README.md for hf_xet project (#475 ) - wrote hf_xet/README.md about hf_xet - verified sdist build is successful - moved docs from hf_xet/README.md to xet-core/README.md	2025-08-27 15:27:56 -07:00
Rajat Arya	50ced6cb65	Update PyPI package metadata for `hf-xet` (#472 ) Fixes #465. Adapts #464. Verified locally using `pip-licenses` and manual inspection of metadata. Once merged will verify with PyPI through RC build. @jsulz, @hoytak, @seanses, @assafvayner: Let me know if you want a different email address as a maintainer - these tie into your PyPI user profile. --------- Co-authored-by: Jared Sulzdorf <j.sulzdorf@gmail.com>	2025-08-26 11:15:52 -07:00
Di Xiao	740887a453	CI test on macos (#473 ) We test on Ubuntu and Windows, so it seems reasonable to test on macOS too. This also gets CI prepared for git-xet tests.	2025-08-26 10:46:34 -07:00
Erik Cederstrand	9c20ec6f43	Use a valid SPDX identifier as license classifier (#464 ) This helps automatic license checkers like pip-licenses to identify the right license for this project	2025-08-25 10:14:16 -07:00
Assaf Vayner	3865e945d1	run_and_extract_custom: remove use of explicit tokio_retry without utility (#460 ) We had 1 case of using "raw" tokio_retry rather than the retry utility. This was due to using special custom parsing logic for chunks, rather than the built-in json functionality. This PR adds a run_and_extract_custom that lets a user specify the function to parse the response body. v1.1.9-rc0 v0.1.9-rc0	2025-08-21 10:19:36 -07:00
Hoyt Koepke	df1145f9ad	Raise soft file handle limits to hard limits on OSX. (#453 ) On OSX, raise the soft file handle limits to the hard limits (which cannot be changed in the process). --------- Co-authored-by: Assaf Vayner <assaf@huggingface.co>	2025-08-19 13:24:41 -07:00
Assaf Vayner	6beab3b197	enforce linting on hf_xet (#462 ) This PR adds an explicit lint command on the hf_xet directory. This is necessary because it is excluded from the workspace. Other excluded directories aren't touched very often and are less important for now.	2025-08-18 16:55:05 -07:00
Assaf Vayner	1578af406c	tokio console setup (#458 ) adds feature to enable deps for use with tokio console and documents how to compile and use along with tokio console	2025-08-18 15:46:56 -07:00
Assaf Vayner	39b85696b9	parutils makeover remove async_scoped (#454 ) removing async_scoped dep and creating new parallel util to replace tokio_par_for_each that relied on async_scoped. Any usage of tokio_par_for_each which is the only fn used out of parutils has been replaced with the new `tokio_run_max_concurrency_fold_result_with_semaphore` TODO: - [x] add more tests - [x] use semaphore acquired from the global semaphore provider where/if relevant.	2025-08-18 15:35:43 -07:00
Assaf Vayner	48be7b08ab	update version (#461 ) v1.1.8	2025-08-18 14:35:23 -07:00
Hoyt Koepke	e484a6b150	Limit number of idle connections (#459 ) Limit the number of idle connections maintained in the reqwest connection pool. --------- Co-authored-by: Assaf Vayner <assaf@huggingface.co> v1.1.8-rc1	2025-08-18 13:39:43 -07:00
Di Xiao	04c1e30079	Cache and reuse reqwest Client (#457 ) This PR caches the reqwest::Client in a runtime if it exists and re-uses it in all wrapped clients in the `RemoteClient` object. This effectively shares the connection pool and thus reduces opening sockets. Fix XET-704	2025-08-15 16:39:06 -07:00
Di Xiao	f4800863d3	Clean up dependencies (no functionality change) (#456 ) Update entries in cas_client/cargo.toml that specify dependencies from crates.io to inherit from the workspace. Also sort all dependency entries.	2025-08-15 11:06:54 -07:00
Di Xiao	4870043d87	Fix DataHash hex string serde to little endian (#445 ) Different programming languages and platforms may have different byte order layouts of a u64 in memory. This is not an issue when we typecast a [u8;32] to [u64;4] (which is essentially a pointer typecast and thus doesn't reorder bytes at all) until these bytes are actually used as u64s. For such cases we make it explicit to use the little-endian order.	2025-08-14 15:05:40 -07:00
Di Xiao	36f1138a6f	Add back retry for connection setup and sending request (#455 ) This PR fixes a regression in which part of the retry logic for downloading was accidentally removed. The restored retry logic complements the retry for deserializing the data stream of responses.	2025-08-14 15:01:09 -07:00
Joseph Godlewski	6553ade9cb	fix: singleflight owner task not removing Call from Group if dropped (#447 ) When a singleflight `Group` is called by many tasks for a particular key, one of these tasks is chosen as the `owner` to actually perform the work. The other tasks are considered `waiters` and will wait until the `owner` completes the work. In the normal case, the `owner` runs the work, takes the result and provides it to any `waiter` tasks. It is also the responsibility of the `owner` to clean up the state it added to the singleflight `Group`, namely, the `Call` record in the `Group`'s `CallMap`. However, if the `owner` is `drop`'ed while waiting for the work to complete (e.g. by its task being canceled), then the work will still finish and any waiter tasks will be notified (as those actions are part of a separately spawned task), but the state the owner added to the `Group` is not cleaned up (i.e. there is still a lingering mapping of key -> `Call`). What this means is that if the `owner` is dropped before it can remove the mapping, then the results of the work are permanently "cached" in the singleflight `Group` and any subsequent tasks looking for the key will always get back those results until the `Group` is dropped (usually on process restart). Since much of the work we use singleflight for is downloading immutable blobs of data, "caching" the `Call` results doesn't sound that terrible (we already cache some stuff outside of singleflight). However, this is caching the `Result`, not just the `Ok(…)` variant, so if the work returned an `Error`, then that becomes what is permanently cached, which, for intermittent networking issues can cause problems. This PR fixes the issue by adding a RAII guard struct when the `Call` is added to the `CallMap` so that the `work()` function will remove the `Call` when the owner exits the function (either successfully, due to error/panic, or if it is dropped). 
See the `test_dropped_owner` test for an example of how this situation can occur.	2025-08-11 17:11:15 -07:00
Hoyt Koepke	9bbc0c65ab	Updated version to v1.1.7 (#443 ) v1.1.7	2025-08-05 17:25:31 -07:00
Hoyt Koepke	3260e0852c	Changed default number of parallel downloads from 64 to 48. (#442 ) Some more testing found that 64 parallel range gets can still possibly exhaust the existing file handles on OSX; this change addresses that issue without (hopefully) impacting the download performance.	2025-08-05 17:24:31 -07:00
Hoyt Koepke	0c810fa3d0	Remove telemetry code; eliminate Mutex on logging setup. (#441 ) This PR removes the unused telemetry code from hf_xet. In addition, it also removes the Mutex around the logging setup, which appears to cause an intermittent hang when os.fork() gets involved. v1.1.7-rc0	2025-08-05 16:41:01 -07:00
Hoyt Koepke	7becae3fde	Updated version to 1.1.6. (#440 ) v1.1.6	2025-08-05 15:40:15 -07:00
Hoyt Koepke	d575b593d4	Make hf_xet safe(ish) across python os.fork() (#437 ) This PR ensures that none of the tokio thread state exists through a call to python's os.fork() as used in the multiprocessing library. For an explanation of the issue, see https://github.com/vllm-project/vllm/blob/main/docs/design/multiprocessing.md#tradeoffs. It does this by offloading all the async calls to a separate and transient OS thread, which would not exist after the spawn process. Thus any possible restart of the tokio runtime due to a spawn would occur in a clean environment and without thread-local storage causing issues. To accomplish this, this PR refactors the hf_xet logging layer to separate it out from the python runtime, as the python runtime is not Send/Sync. This also simplifies this layer somewhat and isolates the telemetry reporting logic so that only the background sending thread of the telemetry logic is restarted after a spawn. In addition, this PR removes the use of parking_lot, both in singleflight.rs and as part of tokio. The library is not safe across fork(); in particular, note `9c810e4a11/core/src/parking_lot.rs (L51)`.	2025-08-05 15:23:48 -07:00
Assaf Vayner	2645b96eb5	only debug log 416 on get recon (#439 )	2025-08-05 11:56:35 -07:00
Assaf Vayner	fdfff55726	halve num concurrent range gets (#438 )	2025-08-05 11:44:56 -07:00
Eliott C.	663c3a7c7d	Remove logging from wasm lib (#434 )	2025-08-01 19:10:32 +02:00
Hoyt Koepke	a2562c9476	Associate static semaphores with runtime (#433 ) If a parent process spawns a child process while permits from a static semaphore are issued, the number of permits available to the child process will be reduced for the entirety of the child's lifetime, even when the parent process permits are returned. This could potentially cause a deadlock or painful slowdown on upload or download. This PR moves all our static semaphores to ones associated with the runtime, so after a spawn they are reset.	2025-07-31 11:03:14 -07:00
Hoyt Koepke	1f80b9ec7b	Respect XDG_CACHE_HOME and ~/ when setting cache directory. (#426 ) Currently, we default the cache directory to `home_dir()/.cache`, but as pointed out in https://github.com/huggingface/xet-core/issues/417, this is the incorrect behavior. This PR switches this behavior to use the [cache_dir()](https://docs.rs/dirs/latest/dirs/fn.cache_dir.html) function to properly determine this. The side effect of this, however, is that on OSX the cache dir will change to the more standard `$HOME/Library/Caches/huggingface/xet`, which means nothing in `~/.cache/huggingface/xet` will be valid anymore. Fixes https://github.com/huggingface/xet-core/issues/417.	2025-07-30 19:13:26 -07:00
Hoyt Koepke	90036fbbc2	Limit number of async worker threads on large CPUs (#431 ) Currently, tokio spins up async worker threads equal to the number of cores, which can be quite large on huge machines, e.g. 128. This isn't needed to keep everything running; we already offload much of the compute to blocking threads and so here the number is a significant overkill, especially if hf_xet is used for downloading only a few files. This PR limits the number of async worker threads to 32 unless TOKIO_WORKER_THREADS is set, in which case that value is used. It also removes the cap on the number of blocking threads tokio can spin up as needed as there is no real reason to not use tokio's default value there. --------- Co-authored-by: Di Xiao <di@huggingface.co>	2025-07-30 18:12:56 -07:00
Assaf Vayner	725973bccc	revert use of v1 api paths (#432 ) Avoiding using the v1 api paths in the next release until the spec is fixed. reverting to the old API paths.	2025-07-30 15:07:36 -07:00
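
The branch-name fix described in commit 75952ae618 (#498) can be illustrated with a minimal sketch. The two expressions come from the commit message; the wrapper function names `branch_name_old` and `branch_name_fixed` are hypothetical and exist only for this comparison, not in xet-core.

```rust
/// Old behaviour from the commit message: keep only the last
/// '/'-separated component of the ref, which truncates branch
/// names that themselves contain '/'.
pub fn branch_name_old(refs_heads_branch: &str) -> Option<&str> {
    refs_heads_branch.rsplit('/').next()
}

/// Fixed behaviour from the commit message: strip the fixed
/// "refs/heads/" prefix and keep everything after it.
/// Returns None when the ref is not under refs/heads/.
pub fn branch_name_fixed(refs_heads_branch: &str) -> Option<&str> {
    refs_heads_branch.strip_prefix("refs/heads/")
}
```

For a ref such as `refs/heads/feature/login`, the old expression yields `login` while the fixed one yields `feature/login`. Both agree on simple names such as `refs/heads/main`.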

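Commit 610874ab04 (#495) describes constants written as "10sec", "100ms", "1d", "1mb" or "45gb". The following is a rough sketch of how such strings could be parsed, not the xet-core implementation: the function names `parse_byte_size` and `parse_duration`, the exact set of accepted suffixes, and the choice of decimal multipliers for `kb`/`mb`/`gb` versus binary ones for `kib`/`mib`/`gib` are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Split "100ms" into the leading integer (100) and a lowercase suffix ("ms").
fn split_number_suffix(s: &str) -> Option<(u64, String)> {
    let s = s.trim();
    // Index of the first non-digit character, or the end of the string.
    let idx = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let value = s[..idx].parse::<u64>().ok()?;
    Some((value, s[idx..].trim().to_ascii_lowercase()))
}

/// Hypothetical byte-size parser; a bare number is taken as a raw byte count.
/// Assumption: kb/mb/gb are powers of 1000, kib/mib/gib are powers of 1024.
pub fn parse_byte_size(s: &str) -> Option<u64> {
    let (value, suffix) = split_number_suffix(s)?;
    let multiplier: u64 = match suffix.as_str() {
        "" | "b" => 1,
        "kb" => 1_000,
        "kib" => 1 << 10,
        "mb" => 1_000_000,
        "mib" => 1 << 20,
        "gb" => 1_000_000_000,
        "gib" => 1 << 30,
        _ => return None,
    };
    value.checked_mul(multiplier)
}

/// Hypothetical duration parser covering the suffixes named in the commit
/// message ("ms", "sec", "d") plus a few assumed ones.
pub fn parse_duration(s: &str) -> Option<Duration> {
    let (value, suffix) = split_number_suffix(s)?;
    let secs_per_unit: u64 = match suffix.as_str() {
        "ms" => return Some(Duration::from_millis(value)),
        "s" | "sec" => 1,
        "m" | "min" => 60,
        "h" => 3_600,
        "d" => 86_400,
        _ => return None,
    };
    Some(Duration::from_secs(value.checked_mul(secs_per_unit)?))
}
```

Returning `Option` keeps the sketch small; a real configuration layer would more likely report which value failed to parse.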

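The RAII guard approach described in commit 6553ade9cb (#447) can be sketched roughly as follows. This is a simplified, synchronous illustration rather than the actual singleflight code: the `Call` placeholder, the `CallGuard` type, the `run_as_owner` helper and the use of `std::sync::Mutex` are assumptions introduced here. The point is only that the `Drop` impl removes the key -> `Call` mapping however the owner leaves the function.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Placeholder for the per-key call record (hypothetical, simplified).
#[derive(Default)]
pub struct Call;

pub type CallMap = Arc<Mutex<HashMap<String, Call>>>;

/// Guard that registers a Call on creation and removes it again on drop.
pub struct CallGuard {
    key: String,
    map: CallMap,
}

impl CallGuard {
    pub fn register(map: &CallMap, key: &str) -> Self {
        map.lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner())
            .insert(key.to_string(), Call);
        CallGuard { key: key.to_string(), map: Arc::clone(map) }
    }
}

impl Drop for CallGuard {
    fn drop(&mut self) {
        // Runs on normal return, early return with an error, panic unwinding,
        // or when the owning future/task is dropped.
        let mut map = self.map.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
        map.remove(&self.key);
    }
}

/// The owner performs the work while the guard is alive; the mapping is
/// cleaned up when `_guard` goes out of scope, whatever `work` returns.
pub fn run_as_owner<T>(map: &CallMap, key: &str, work: impl FnOnce() -> T) -> T {
    let _guard = CallGuard::register(map, key);
    work()
}
```

Because cleanup lives in `Drop`, an `Err` result is never left behind in the map, which is the failure mode the commit message describes.
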
368 Commits