xet-core

mirror of https://github.com/huggingface/xet-core.git synced 2026-06-04 13:30:29 +08:00

Author	SHA1	Message	Date
dependabot[bot]	0d0f4883ad	Bump time from 0.3.44 to 0.3.47 in /hf_xet_wasm (#654 ) Bumps [time](https://github.com/time-rs/time) from 0.3.44 to 0.3.47. Release notes (sourced from [time's releases](https://github.com/time-rs/time/releases)): for v0.3.47, v0.3.46, and v0.3.45, see the [changelog](https://github.com/time-rs/time/blob/main/CHANGELOG.md) for details. Changelog (sourced from [time's changelog](https://github.com/time-rs/time/blob/main/CHANGELOG.md), truncated): ## 0.3.47 [2026-02-05] ### Security - The possibility of a stack exhaustion denial of service attack when parsing RFC 2822 has been eliminated. Previously, it was possible to craft input that would cause unbounded recursion. Now, the depth of the recursion is tracked, causing an error to be returned if it exceeds a reasonable limit. This attack vector requires parsing user-provided input, with any type, using the RFC 2822 format. ### Compatibility - Attempting to format a value with a well-known format (i.e. RFC 3339, RFC 2822, or ISO 8601) will error at compile time if the type being formatted does not provide sufficient information. This would previously fail at runtime. Similarly, attempting to format a value with ISO 8601 that is only configured for parsing (i.e. `Iso8601::PARSING`) will error at compile time. ### Added - Builder methods for format description modifiers, eliminating the need for verbose initialization when done manually. - `date!(2026-W01-2)` is now supported. Previously, a space was required between `W` and `01`. - `[end]` now has a `trailing_input` modifier which can either be `prohibit` (the default) or `discard`. When it is `discard`, all remaining input is ignored. Note that if there are components after `[end]`, they will still attempt to be parsed, likely resulting in an error. ### Changed - More performance gains when parsing. ### Fixed - If manually formatting a value, the number of bytes written was one short for some components. This has been fixed such that the number of bytes written is always correct. - The possibility of integer overflow when parsing an owned format description has been effectively eliminated. This would previously wrap when overflow checks were disabled. Instead of storing the depth as `u8`, it is stored as `u32`. This would require multiple gigabytes of nested input to overflow, at which point we've got other problems and trivial mitigations are available by downstream users. ## 0.3.46 [2026-01-23] ### Added - All possible panics are now documented for the relevant methods. - The need to use `#[serde(default)]` when using custom `serde` formats is documented. This applies only when deserializing an `Option<T>`. - `Duration::nanoseconds_i128` has been made public, mirroring `std::time::Duration::from_nanos_u128`. ... (truncated) Commits: - `d5144cd` v0.3.47 release - `f6206b0` Guard against integer overflow in release mode - `1c63dc7` Avoid denial of service when parsing Rfc2822 - `5940df6` Add builder methods to avoid verbose construction - `00881a4` Manually format macros everywhere - `bb723b6` Add `trailing_input` modifier to `end` - `31c4f8e` Permit `W12` in `date!` macro - `490a17b` Mark error paths in well-known formats as cold - `6cb1896` Optimize `Rfc2822` parsing - `6d264d5` Remove erroneous `#[inline(never)]` attributes - Additional commits viewable in [compare view](https://github.com/time-rs/time/compare/v0.3.44...v0.3.47) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-02-24 20:21:32 -08:00
Rajat Arya	438045a19e	Version bump for hf-xet 1.3.1 release (#665 ) v1.3.1	2026-02-24 15:36:20 -08:00
Rajat Arya	8808f9e64e	Add Windows ARM64 build support (#662 ) ## Summary Closes #588 - Add `win11-arm` runner with `aarch64-pc-windows-msvc` target to the hf-xet Python wheel release pipeline - Add `win11-arm` runner with `aarch64` target to the git-xet CLI release pipeline, parameterizing the WiX installer `-arch` flag ## Test plan - [x] Trigger a workflow_dispatch run of the Release workflow and verify `windows` matrix includes both `x64` and `aarch64` entries - [x] Verify ARM64 wheels and .pdb debug symbols are built and uploaded - [ ] Trigger a workflow_dispatch run of the git-xet Release workflow and verify ARM64 binary and MSI installer are produced 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-24 15:23:42 -08:00
Di Xiao	99105937f3	Upgrade hf-xet to 1.3.0 (#664 ) v1.3.0	2026-02-23 15:26:30 -08:00
Brian Ronan	17e900a70e	Feat: optional `request_headers` on hf_xet API calls (#661 ) Adds support for setting an optional `request_headers` map on the hf_xet upload and download API calls. This map is augmented with the hf_xet user agent string and is passed along with the requests to xetcas. This PR also adds unit tests for the map merging behavior to `hf_xet/lib.rs` and adds support for running them with cargo test and in the GitHub Actions CI step.	2026-02-23 14:43:58 -08:00
Hoyt Koepke	b3c5d05fb7	Make specifying the file size at the beginning of an upload optional. (#651 ) Currently, the progress and dependency tracking in the upload path requires that the total size of a file be specified at the start. This PR changes this so that in cases where the upload is streamed and the total size is not known, it's updated as soon as new data is processed. Both routes now work and correctly track the file sizes.	2026-02-23 10:31:09 -08:00
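The behaviour described in #651 can be illustrated with a minimal sketch. The `UploadProgress` type and its methods below are hypothetical names chosen for illustration, not the xet-core API; the sketch only assumes that the total is either declared up front or grows as data is processed.

```rust
/// Hypothetical progress tracker for illustration; not the xet-core type.
#[derive(Debug, Default)]
pub struct UploadProgress {
    /// Total size, if the caller supplied it at the start of the upload.
    declared_total: Option<u64>,
    /// Bytes processed so far.
    processed: u64,
}

impl UploadProgress {
    pub fn new(declared_total: Option<u64>) -> Self {
        Self { declared_total, processed: 0 }
    }

    /// Record newly processed bytes.
    pub fn record(&mut self, n: u64) {
        self.processed += n;
    }

    /// Best current estimate of the total: the declared size when known
    /// (never less than what was already processed), otherwise the number
    /// of bytes seen so far.
    pub fn total(&self) -> u64 {
        match self.declared_total {
            Some(t) => t.max(self.processed),
            None => self.processed,
        }
    }

    pub fn processed(&self) -> u64 {
        self.processed
    }
}
```

With a declared size the total stays fixed while bytes are recorded; for a streamed upload (`UploadProgress::new(None)`) the total simply follows the processed byte count.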
Hoyt Koepke	2176e5d3ed	FileDownloadGroup (#652 ) This PR adds a FileDownloadSession struct that parallels the FileUploadSession struct, replacing the FileDownloader. It's an intermediate step in preparation for a session-based API that integrates well with interfaces other than the python interface in hf_xet.	2026-02-19 17:43:35 -08:00
Hoyt Koepke	21bc6cfdc3	Removed incorrectly included AGENTS.md. (#660 ) The AGENTS.md file was incorrectly checked into the repository (part of a claude process to prepare and check a diff for PR). This PR removes that.	2026-02-19 11:32:12 -08:00
Hoyt Koepke	5d6371a296	Progress reporting for downloads. (#645 ) This PR adds detailed progress reporting to the download path. - Transfer progress is reported as soon as the download streams start; actual bytes written are reported as the reconstructed file is written out. - Currently, each call to download_file creates a separate progress tracker, but this sets up for download groups with grouped download progress tracking. To support this, the UploadProgressStream was split into three classes; a common StreamProgressReporter and download and upload specific versions. This also allows us to simplify the API to RetryWrapper. More tracking was added to the file reconstruction paths to properly report progress.	2026-02-19 11:06:42 -08:00
Rajat Arya	a7661a7e63	Removing pyproject.toml from repo root (#659 ) The file is not used to build the hf-xet package, and its presence confuses the pip wheel command. Fixes #658	2026-02-17 15:13:14 -08:00
Hoyt Koepke	9d9fc72d40	XetCommon struct in the runtime to hold global counters, semaphores. (#650 ) This PR simplifies the current process of working with runtime-associated resources such as a cached Client instance or global resource semaphores. Instead of using macros, all of these are moved into a XetCommon struct that holds them explicitly. The runtime holds an instance of this, and it's initialized with a config struct. In addition, to make the logic around the memory limiting semaphore in file_reconstructor clearer, we added a ResourceLimiter struct that wraps the tokio semaphore but scales the total permits and permit requests appropriately if the total resource quantity is larger than u32::MAX, as can be the case easily. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-12 16:47:07 -08:00
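The permit scaling described for ResourceLimiter in #650 can be sketched as plain arithmetic. This is an illustration under assumptions, not the xet-core implementation: the `PermitScale` name, the round-up policy for requests, and the cap at the total are all choices made here, and the semaphore itself is left out.

```rust
/// Hypothetical helper: maps large resource quantities (e.g. bytes) onto a
/// permit count that fits in u32. Not the xet-core `ResourceLimiter`.
#[derive(Debug, Clone, Copy)]
pub struct PermitScale {
    /// How many resource units one permit represents (always >= 1).
    units_per_permit: u64,
    /// Number of permits the underlying semaphore would be created with.
    total_permits: u32,
}

impl PermitScale {
    pub fn new(total_units: u64) -> Self {
        let max = u32::MAX as u64;
        // Smallest divisor that brings the total within u32 range.
        let units_per_permit = total_units.div_ceil(max).max(1);
        let total_permits = (total_units / units_per_permit) as u32;
        Self { units_per_permit, total_permits }
    }

    /// Permits to request for `units`, rounded up so a request is never
    /// under-reserved, and capped at the total so it can always be granted.
    pub fn permits_for(&self, units: u64) -> u32 {
        let permits = units.div_ceil(self.units_per_permit);
        permits.min(self.total_permits as u64) as u32
    }

    pub fn units_per_permit(&self) -> u64 {
        self.units_per_permit
    }

    pub fn total_permits(&self) -> u32 {
        self.total_permits
    }
}
```

For totals below `u32::MAX` the scale is 1 and permits map one-to-one to units; for an 8 GiB total each permit stands for 3 bytes in this sketch.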
Di Xiao	87d2ac9bcd	Update git-xet install URLs (#655 )	2026-02-12 16:40:40 -08:00
Di Xiao	23f68bb798	Upgrade git-xet to 0.2.1 (#653 ) git-xet-v0.2.1	2026-02-12 15:45:34 -08:00
Di Xiao	7d7582c3dd	TemplatedPathBuf utility (#643 ) Implements a utility for configuring path-like parameters. This folds in the existing function `fn normalized_path_from_user_string`, which expands `~` to the home directory and converts to absolute paths, and evaluates a path template by substituting case-insensitive placeholders with corresponding values: - `{pid}` for process ID, - `{timestamp}` for ISO 8601 local timestamp with offset For example,
```
let template = TemplatedPathBuf::new("~/logs/app_{PID}_{TIMESTAMP}.txt");
let path = template.as_path();
// Returns an absolute path like "/home/user/logs/app_12345_2024-01-15T10-30-45-0500.txt"
```
or to be used directly in config groups:
```
crate::config_group!({
    ref log_path: Option<TemplatedPathBuf> = None;
});
```
	2026-02-11 14:51:16 -08:00
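The placeholder substitution described in #643 can be sketched as follows. The `expand_template` function is a hypothetical stand-in, not the `TemplatedPathBuf` implementation: it takes the process ID and an already formatted timestamp as arguments and does not perform `~` expansion or path normalization.

```rust
/// Replace `{pid}` and `{timestamp}` placeholders, matched case-insensitively.
/// Hypothetical helper for illustration only.
pub fn expand_template(template: &str, pid: u32, timestamp: &str) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        // Copy everything before the opening brace unchanged.
        out.push_str(&rest[..open]);
        let after = &rest[open..];
        match after.find('}') {
            Some(close) => {
                let name = &after[1..close];
                if name.eq_ignore_ascii_case("pid") {
                    out.push_str(&pid.to_string());
                } else if name.eq_ignore_ascii_case("timestamp") {
                    out.push_str(timestamp);
                } else {
                    // Unknown placeholder: keep it verbatim.
                    out.push_str(&after[..=close]);
                }
                rest = &after[close + 1..];
            }
            None => {
                // Unmatched '{': copy the remainder unchanged and stop.
                out.push_str(after);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Keeping unknown placeholders verbatim is a design choice of this sketch; the commit message does not say how the real utility treats them.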
Hoyt Koepke	cca8f39699	Clippy / fmt / test cleanup. (#649 ) - Skip install/uninstall tests when git-lfs unavailable; formatting fixes in wasm crates - Add `git_lfs_available()` helper to skip install/uninstall tests in environments where git-lfs is not installed - Apply latest nightly rustfmt formatting fixes and clippy fixes. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-11 14:19:42 -08:00
Hoyt Koepke	e443ee9260	Upgrade package dependencies (#644 ) This PR updates all the package dependencies that would not cause significant API breakages to the current version. The package versions in hf_xet_wasm and hf_xet are also updated to match the versions in the base package. There should be no functional change.	2026-02-11 12:19:29 -08:00
Di Xiao	59219bfcfc	Fix writev exceeding limits (#641 ) On macOS and Linux, `writev(int fildes, const struct iovec *iov, int iovcnt)` may return EINVAL if - the sum of the iov_len values in the iov array overflows a 32-bit integer (macOS) or an ssize_t value (Linux); - iovcnt is less than or equal to 0, or greater than UIO_MAXIOV (POSIX standard IOV_MAX, value 1024). In particular, on Linux the glibc wrapper functions do some extra work if they detect that the underlying kernel system call failed because this limit was exceeded: the wrapper allocates a temporary buffer large enough for all of the items specified by iov, copies the data from iov into this buffer, and passes the buffer in a call to write(). To avoid these potential syscall failures or performance degradation, we limit the total number of bytes and the number of slices passed to each `writev()` call. This PR also adds unit tests for these two limits.	2026-02-10 19:17:32 -08:00
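The two limits described in #641 can be enforced by batching slices before each vectored write. The sketch below is an illustration, not the xet-core code: the `batch_slices` function and the constant names are hypothetical, and the constant values are assumptions taken from the figures quoted in the commit message (1024 slices, a 32-bit byte total).

```rust
/// Assumed limits for illustration; the values xet-core actually uses are
/// not stated in the commit message.
pub const MAX_SLICES_PER_CALL: usize = 1024;
pub const MAX_BYTES_PER_CALL: usize = i32::MAX as usize;

/// Split `slices` into consecutive batches such that each batch holds at
/// most `max_slices` slices and at most `max_bytes` bytes in total.
/// A slice that does not fit is split across batches; empty slices are dropped.
pub fn batch_slices<'a>(
    slices: &[&'a [u8]],
    max_slices: usize,
    max_bytes: usize,
) -> Vec<Vec<&'a [u8]>> {
    assert!(max_slices > 0 && max_bytes > 0, "limits must be positive");
    let mut batches = Vec::new();
    let mut current: Vec<&'a [u8]> = Vec::new();
    let mut current_bytes = 0usize;
    for &slice in slices {
        let mut remaining = slice;
        while !remaining.is_empty() {
            // Start a new batch once either limit has been reached.
            if current.len() == max_slices || current_bytes == max_bytes {
                batches.push(std::mem::take(&mut current));
                current_bytes = 0;
            }
            let room = max_bytes - current_bytes;
            let take = remaining.len().min(room);
            let (head, tail) = remaining.split_at(take);
            current.push(head);
            current_bytes += take;
            remaining = tail;
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

Each resulting batch could then be passed to one vectored write call (for example via `std::io::Write::write_vectored`), with the usual handling of partial writes left to the caller.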
Di Xiao	3a17d667dc	Change default download buffer size (#637 ) Reduces the default download buffer size to a more friendly number. ### Benchmark ### Download benchmark with different default memory configs on three scenarios: • Comparable disk write speed and network ingress speed (i4i.xlarge, up to 10 Gbps ingress, stable 550 MB/s write SSD) • Faster disk write (i4i.xlarge, ingress limited to 1 Gbps using "tc" command, stable 550 MB/s write SSD) • Faster network ingress (m5d.xlarge, up to 10 Gbps ingress, stable 150 MB/s write SSD) Benchmark results are at https://docs.google.com/spreadsheets/d/1ozpk0kU7uM8SGODXxXtXQauc3l5CEoWG/edit?usp=sharing&ouid=108235600614994105911&rtpof=true&sd=true, implying no substantial improvement with buffer size over 2 GB, with the default download parallelism of 8 set by huggingface-hub. Setting the total download buffer size to 2 GB gives each parallel download task on average `2 GB / 8 = 256 MB` of pending bytes to write, or `256 MB / 64 MB = 4` or even more pending terms if the network speed is comparable to that of the disk, which keeps the disk writer always busy.	2026-02-10 18:49:39 -08:00
Di Xiao	a336df4d02	Bake in openssl for git-xet macOS built in Github Action (#626 ) Fix issue https://github.com/huggingface/xet-core/issues/621. Fix XET-819. The script https://github.com/huggingface/xet-core/blob/main/git_xet/install.sh installs the git-xet built in GitHub Actions, and when git-xet is built for macOS it is linked to `homebrew/openssl@3` because the `git2` crate depends on openssl. For users who don't have homebrew and `homebrew/openssl` installed (which is why they would prefer the installation script), running this git-xet on their system immediately crashes. This PR - adds a feature "git2-vendored-openssl" that enables "git2/vendored-openssl", which bakes openssl statically into git-xet, - updates GitHub Actions CI to build git-xet with this feature for the macOS version. This increases the git-xet binary size from ~9MB to ~13MB and drops the `homebrew/openssl` linkage (comparing output of `otool -L git-xet`, left is from `git-xet` before this change): [Screenshot 2026-01-28 at 3 33 39 PM: https://github.com/user-attachments/assets/5c779d78-a042-45d8-99e5-95394db6e774] The official homebrew bottles for git-xet will not be affected and still use `homebrew/openssl` because they build from source code (the above feature not enabled).	2026-02-04 11:03:07 -08:00
Hoyt Koepke	0ddc268757	CTRL-C interruption for spawn_blocking threads. (#632 ) This PR enables easy checking of CTRL-C cancellation in spawn_blocking threads, such as the background writer in the file reconstruction path for downloads. It also adds that capability in two places that would hold up CTRL-C interruption, namely the background loading of shard files and the serial writer in the new adaptive concurrency file reconstruction path.	2026-02-03 17:26:06 -08:00
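The kind of cancellation check described in #632 can be sketched with a shared atomic flag that a blocking loop polls between units of work. The names `write_chunks` and `Interrupted` are hypothetical, and the sketch assumes a signal handler elsewhere sets the flag; it does not show how xet-core wires the flag to CTRL-C.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct Interrupted;

/// Process `chunks` on the current (blocking) thread, checking the shared
/// flag before each chunk so a CTRL-C handler can stop the loop promptly.
pub fn write_chunks(
    chunks: &[Vec<u8>],
    cancelled: &AtomicBool,
    sink: &mut Vec<u8>,
) -> Result<usize, Interrupted> {
    let mut written = 0;
    for chunk in chunks {
        if cancelled.load(Ordering::SeqCst) {
            return Err(Interrupted);
        }
        sink.extend_from_slice(chunk);
        written += chunk.len();
    }
    Ok(written)
}
```

In a real program the flag would typically live in an `Arc<AtomicBool>` shared between the signal handler and the closure passed to `spawn_blocking`.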
Di Xiao	4c7046e387	Fix wheel pack renaming issue (#629 ) When we `wheel unpack` a wheel to strip debug symbols and then `wheel pack` to re-pack the wheel, the resulting wheel name can differ from the one generated by `maturin build`, leading to two wheel files published to PyPI for one Python-platform combination. Depending on their package manager version, people may accidentally install the larger wheel. We only want to publish one wheel for a specific Python-platform combination, so this PR removes all wheels before re-packing.	2026-01-29 17:52:09 -08:00
Hoyt Koepke	5cf2bf4b3b	Full fix for session resume test failure. (#625 ) There is an issue in main where test_multiple_resume fails due to a LocalClient trying to open a database tied to a directory at the same time another client is dropping that database on the same thread. This fixes it by switching to LocalTestServer that is persistent through the whole test.	2026-01-28 12:58:03 -10:00
Hoyt Koepke	396e11745c	Unix Socket Proxy support. (#618 ) This PR updates https://github.com/huggingface/xet-core/pull/598, which adds Unix domain socket support for the RemoteClient interface. This version adds extensive testing using the LocalTestServer interface and creating a local proxy for it. Use the environment variable HF_XET_CLIENT_UNIX_SOCKET_PATH=/path/to/socket to route all traffic through a Unix socket. This is useful when running a XET client in a sandbox that doesn't have direct access to the network or for using tools like noxious_client to simulate bad network conditions. --------- Co-authored-by: Frank Denis <github@pureftpd.org>	2026-01-28 12:27:41 -10:00
dependabot[bot]	c9a29ffb9e	Bump oneshot from 0.1.11 to 0.1.12 (#616 ) Bumps [oneshot](https://github.com/faern/oneshot) from 0.1.11 to 0.1.12. Changelog (sourced from [oneshot's changelog](https://github.com/faern/oneshot/blob/main/CHANGELOG.md)): ## [0.1.12] - 2026-01-25 ### Fixed - Fix race condition that could lead to use-after-free if the `Receiver` was polled asynchronously, but then dropped before completion. [faern/oneshot#74](https://redirect.github.com/faern/oneshot/pull/74) - Fix race conditions/UB around atomic memory orderings. These were found by running tests under miri. [faern/oneshot#72](https://redirect.github.com/faern/oneshot/pull/72) Commits: - `537d5de` Bump version to 0.1.12 and fix changelog - `9cc3153` Merge branch 'improve-start_recv_ref' - `cc3d6a2` Improve start_recv_ref to be more like regular recv method - `78c7476` Merge branch 'update-documentation' - `38d7f6f` Add clarifying documentation on sender observing RECEIVING state - `21e0310` Synchronize readme with crate documentation in lib.rs - `def74fc` Fix spelling and grammar errors in documentation - `70031a4` Add documentation about how send and receive are synchronized - `d1a1506` Merge branch 'fix-async-recv-drop-use-after-free' - `f19ff7c` Fix Receiver::drop bug causing a race when dropping a polled receiver - Additional commits viewable in [compare view](https://github.com/faern/oneshot/compare/v0.1.11...v0.1.12) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-28 10:28:04 -10:00
Hoyt Koepke	d9998ee131	Update environment variables to fully enable adaptive concurrency. (#623 ) Previous PRs enabled the adaptive concurrency by default (with a longer beta rollout), but the config variables didn't get updated to reflect that. This PR removes the deprecated config variables. Adaptive concurrency can still be switched to fixed mode by setting the min and max connection ranges to the same value in the config struct, or by setting the environment variables HF_XET_FIXED_UPLOAD_CONCURRENCY or HF_XET_FIXED_DOWNLOAD_CONCURRENCY to a set value.	2026-01-28 10:19:29 -10:00
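The rule in #623 that equal min and max values pin the concurrency to a fixed level can be sketched as below. The `concurrency_range` function, its fallback behaviour for unparsable or zero values, and the default range used in the usage note are assumptions for illustration; only the environment variable names come from the commit message.

```rust
/// Hypothetical helper: resolve the (min, max) connection range.
/// `fixed` is the raw value of an environment variable such as
/// HF_XET_FIXED_DOWNLOAD_CONCURRENCY, if it is set.
pub fn concurrency_range(
    fixed: Option<&str>,
    default_min: usize,
    default_max: usize,
) -> (usize, usize) {
    match fixed.and_then(|v| v.trim().parse::<usize>().ok()) {
        // min == max switches the adaptive controller to fixed mode.
        Some(n) if n > 0 => (n, n),
        // Unset, unparsable, or zero: keep the adaptive range (an assumption
        // of this sketch, not documented xet-core behaviour).
        _ => (default_min, default_max),
    }
}
```

A caller might use it as `concurrency_range(std::env::var("HF_XET_FIXED_DOWNLOAD_CONCURRENCY").ok().as_deref(), 1, 64)`, where the `1..64` default range is purely illustrative.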
Di Xiao	cbdd8534df	Wait for heed db closing (#622 ) Unit test `test_multiple_resume()` contains a loop where each iteration internally opens and closes a heed DB that is used for simulating a global dedup table. If the DB is not fully closed before any call to open it again, an error is returned leading this unit test to fail: ``` thread 'tests::test_multiple_resume' panicked at data/tests/test_session_resume.rs:140:18: called `Result::unwrap()` on an `Err` value: CasClientError(Other("Error opening db at \"/var/folders/kg/7q73ww8s3llgyl61c9z_j5g40000gn/T/.tmp09NYmC/xet/xorbs/global_dedup_lookup.db\": database is in a closing phase, you can't open it at the same time")) ``` This PR actually waits for the DB to close in the simulation client's Drop function.	2026-01-28 11:27:08 -08:00
Hoyt Koepke	cdef2cef9e	Updates to exports of pass-through hashing capabilities (#620 ) This PR updates the exports of the pass-through hashing.	2026-01-28 07:15:51 -10:00
Rajat Arya	8ac31a1eac	Add hash_files() function to hf_xet Python package (#614 ) ## Summary Adds a `hash_files(file_paths: List[str])` function to the hf_xet Python package that computes xet hashes for files without uploading them. This enables fast, local-only file hashing without requiring authentication or server connection. ## Key Features - No authentication or server connection required - Pure local computation - no deduplication queries or network I/O - Results in same order as input file paths - API consistency - returns `PyXetUploadInfo` like `upload_files` ## Implementation - Added `hash_single_file()` in data/src/data_client.rs for single file hashing - Added `hash_files_async()` for parallel processing of multiple files - Added Python binding `hash_files()` in hf_xet/src/lib.rs - Reuses existing `Chunker` and `file_hash` infrastructure - Uses `CONCURRENT_FILE_INGESTION_LIMITER` for controlled concurrency ## Usage Example
```python
import hf_xet

# Compute hashes without uploading
file_paths = ["/path/to/file1.txt", "/path/to/file2.txt"]
results = hf_xet.hash_files(file_paths)
for path, info in zip(file_paths, results):
    print(f"File: {path}, Hash: {info.hash}, Size: {info.file_size}")
```
🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-27 12:35:56 -08:00
Salman Chishti	adbd4fa433	Upgrade GitHub Actions to latest versions (#615 ) ## Summary This PR upgrades GitHub Actions to their latest versions for Node.js 24 compatibility and security updates. ## Changes - actions/attest-build-provenance: old version v1, new version v3, in release.yml ## Why these changes? - Keeps actions up to date with latest stable releases - Updated actions include security fixes and new features ## Testing These changes only update action versions and don't modify workflow logic. --------- Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>	2026-01-26 10:22:45 -10:00
Di Xiao	513c380a2a	Fold "git lfs install" into "git xet install" (#613 ) This PR - folds the "git lfs install" commands into the "git xet install" command, so if users install git-xet using `brew install git-xet`, they only need to run "git xet install", skipping "git lfs install". Context: https://github.com/huggingface-internal/moon-landing/pull/16402 - updates the "install.sh" script so that if "git-lfs" is not installed, it downloads it from the git-lfs releases on GitHub, then installs and configures it properly.	2026-01-23 11:05:52 -08:00
Hoyt Koepke	a6630293bb	Hash table with pass-through hasher for MerkleHashes (#611 ) Currently, the Rust HashMap uses a randomized hasher for input, which prevents hash collision attacks. However, in our code, we don't need that protection in the client, and a MerkleHash is already a cryptographic hash. This PR adds a MerkleHashMap type that just passes the hash through to the HashMap, providing a substantial speedup:
```
=================================================================
PERFORMANCE SUMMARY (times in ms, lower is better)
=================================================================
Test              HashMap    PassThrough
-----------------------------------------------------------------
--- 100K ---
Insert                2.1            0.7
Lookup                2.1            1.3
Insert+Lookup         4.4            1.6
Serialize             1.6            0.9
Deserialize           4.3            1.2
--- 10M ---
Insert              433.2          204.1
Lookup              615.3          255.5
Insert+Lookup       951.6          460.4
Serialize           117.2           93.4
Deserialize         599.5           89.3
=================================================================
```
It also replaces HashMap<MerkleHash, ...> everywhere in the code to provide an across-the-board improvement.	2026-01-22 10:42:53 -10:00
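The pass-through idea from #611 can be sketched with the standard library alone. The `Digest`, `PassThroughHasher`, and `DigestMap` names are hypothetical stand-ins, not the xet-core `MerkleHash` or `MerkleHashMap` types, and taking the first 8 bytes of the key as the hash value is an assumption of this sketch.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

/// Stand-in for a 32-byte cryptographic hash; hypothetical type.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Digest(pub [u8; 32]);

impl Hash for Digest {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed the raw bytes once, with no length prefix.
        state.write(&self.0);
    }
}

/// Hasher that uses the first 8 bytes of its input as the hash value
/// instead of mixing them. Only sensible for keys that are already
/// uniformly distributed, such as cryptographic hashes.
#[derive(Default)]
pub struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn write(&mut self, bytes: &[u8]) {
        let mut buf = [0u8; 8];
        let n = bytes.len().min(8);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.0 = u64::from_le_bytes(buf);
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// HashMap keyed by `Digest` that skips the default randomized hasher.
pub type DigestMap<V> = HashMap<Digest, V, BuildHasherDefault<PassThroughHasher>>;
```

The manual `Hash` impl matters here: a derived impl on a byte array would also write a length prefix, which this simple hasher would then overwrite or mis-read.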
Hoyt Koepke	db2a0e722b	Remove need for xorb and shard footers in testing code. (#610 ) Currently, the remote server rebuilds the footers for xorbs and shards from streaming versions of the uploaded data, but the testing code doesn't actually follow the same pattern. This updates the local testing code to do the same, unifying the API.	2026-01-21 13:47:30 -10:00
Hoyt Koepke	128fb6fc42	File download and reconstruction V2 (#603 ) This PR rewrites the download and file reconstruction path. The new version: - Separates the Client connection from the reconstruction, using a new FileReconstructor class to manage the reconstruction. This FileReconstructor is now in the file_reconstructor package. The old version is still present in the client but moved to file_reconstruction_v1/; using V1 or V2 is controlled by reconstruction.use_v1_reconstructor. - Uses a global buffer memory limiter so the space used for downloading all files never exceeds a configurable limit, set to 8 GB by default. - Automatically tunes the download parallelism to adapt to the connection conditions. - Automatically tunes the number of terms fetched in order to target all terms downloading within a certain window. - Uses vectored write (configurable) to speed up writing to a single file. - Moves the URL refresh logic into the RetryWrapper class. - Uses a for loop with futures to make the logic behind the reconstruction process easier to understand. - Adds extensive testing against the LocalTestServer and LocalClient to cover all the code paths. - Completely removes the retry logic level from the reqwest middleware. Next steps after this: - Implement resume on partial download. - Interface to caching layer. - Add partial-term progress reporting to match the upload path.	2026-01-14 21:02:53 -08:00
Rajat Arya	8a01f4c311	SIGINT handler registration controller (supports deregister) (#587 ) - Removes ctrlc crate dependency and directly creates Windows signal handler using winapi crate - This ensures in Windows that the Python signal handler will continue to get called after CTRL+C signal is processed. Fixes #585 Coded with assistance from AI (Cursor/Composer).	2026-01-13 04:03:11 +05:30
Hoyt Koepke	aa1d6977e8	In-memory CAS server for testing and benchmarking (#607 ) This PR adds a MemoryClient alongside the LocalClient that simulates an extremely lightweight CAS client for testing and benchmarking the download paths. The LocalServer also now optionally uses the MemoryClient as the backend. The goal here is to allow for simple stress-testing of the download paths to ensure efficiency, error recovery, and correctness.	2026-01-12 14:10:59 -08:00
Hoyt Koepke	9332ff28b7	Mock CAS server built on LocalClient for testing and simulation. (#602 ) This PR adds a fully functional CAS server built around a LocalClient instance. This allows full testing of the RemoteClient interface without hitting the actual CAS backend. For testing, it can either be run as a standalone executable, or it can be started using a LocalTestServer instance that exposes both a RemoteClient interface as the client and direct access to the state through a stored LocalClient instance. Numerous tests are added to cover existing functionality as well as the new server. (Also, it exposed that when using a lot of tests with wiremock or this server, the testing would often hit a "Too many open files" error; this was fixed by consolidating these tests to reduce the number of separate testing servers running at once.)	2026-01-09 12:39:52 -08:00
Salman Chishti	8ae8501cea	Upgrade GitHub Actions for Node 24 compatibility (#600 ) Upgrade GitHub Actions to their latest versions to ensure compatibility with Node 24, as Node 20 will reach end-of-life in April 2026, per [GitHub's announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/). --------- Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com> Co-authored-by: di <di@huggingface.co>	2026-01-06 10:11:06 -08:00
Hoyt Koepke	9c8d947416	Renamed cas_client/test to cas_client/tests per rust convention. (#601 ) The adaptive concurrency tests were in cas_client/test, but the rust convention is to use tests/ (plural). This PR simply performs the rename.	2025-12-22 13:30:20 -07:00
Hoyt Koepke	b370cbea10	Move get_file_term_data and get_reconstruction into Client interface. (#591 ) This PR adds get_file_term_data and get_reconstruction to the Client interface, including adding those functions to the LocalClient. It also rearranges some things in the RemoteClient struct to call the methods from there. This is the first step in a larger effort to separate out the file writer methods from the RemoteClient class in order to allow for more thorough testing and simulation using the LocalClient backend.	2025-12-15 20:24:22 -07:00
Di Xiao	d15295eff3	Clean up dependencies (#595 ) - Remove dependencies from Cargo.toml files that are not used. - Move dependencies directly referencing crates.io from crate level Cargo.toml to the workspace Cargo.toml. - Fix using RemoteClient in WASM: AdaptiveConcurrencyController uses `tokio::time::Instant` which wraps `std::time::Instant` and is not available in WASM. - Add [cargo-machete](https://github.com/bnjbvr/cargo-machete) to CI to check unused dependencies. No functionality change.	2025-12-15 15:26:02 -08:00
Di Xiao	b719b6e2ff	Mute chatty xet in migration service (#594 ) The repo scanner migration service calls xet-core with the tracing subscriber set to log at `INFO` level. As a result, xet-core has been too chatty for the repo scanner, because we log at `INFO` level in many places to help diagnose problems in hf-xet. This PR employs a build feature `elevated_information_level`, which is enabled only for hf-xet, such that the information we want to log in hf-xet is still emitted at `INFO` level, but for other consumers it is emitted at `DEBUG` level, so as not to clutter the repo scanner log.	2025-12-12 14:46:27 -08:00
Di Xiao	74d7c5926c	Clean up dead code (#593 ) A lot of dead code has been left in xet-core due to `#![allow(dead_code)]` in a couple of places. This PR removes it and fixes the corresponding linting errors. No functionality change.	2025-12-11 10:55:28 -08:00
Di Xiao	23f691b26d	Fix hf-xet release workflow (#586 ) The hf-xet release workflow has been failing (see the last two commits into `main`) due to an outdated package list. This updates the package list before installing a package.	2025-12-03 11:40:49 -08:00
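A minimal sketch of how a build feature can pick the level at compile time. The feature name `elevated_information_level` comes from the commit message above; the `Level` enum and the `information_level` and `should_emit` helpers are hypothetical illustrations and are not xet-core's actual logging code.

```rust
/// Hypothetical log levels, ordered from least to most important.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Debug,
    Info,
}

/// Level used for "informational" messages in this build.
/// `cfg!` expands to a boolean literal at compile time, so a build with the
/// feature enabled (assumed here to be hf-xet only) gets `Info`, and every
/// other consumer gets `Debug`.
pub const fn information_level() -> Level {
    if cfg!(feature = "elevated_information_level") {
        Level::Info
    } else {
        Level::Debug
    }
}

/// A subscriber configured with a minimum level drops anything below it.
pub fn should_emit(message_level: Level, subscriber_min: Level) -> bool {
    message_level >= subscriber_min
}
```

With the feature disabled, a subscriber set to `Info` would drop these messages, which matches the intent described for the repo scanner.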
Rupesh	e94270f709	Update RUSTFLAGS for wasm build script (#584 ) Shared memory with Wasm atomics is disabled by default since https://github.com/rust-lang/rust/pull/147225. This adds the necessary flags to enable it.	2025-12-02 16:09:47 -08:00
Di Xiao	5ac92d49e1	Update git-xet install instructions (#583 ) Removes the custom tap after acceptance into core: https://github.com/Homebrew/homebrew-core/pull/255977 Adds winget install after acceptance into winget-pkgs: https://github.com/microsoft/winget-pkgs/pull/316392	2025-12-02 16:04:20 -08:00
Hoyt Koepke	9cf0e1e35e	Automatic concurrency adjustment for transfers (#410 ) Adaptive Concurrency Controller This PR introduces adaptive concurrency control for transfers based on an adaptive ML model of the network connection. It is currently implemented only for the upload path and gated behind the environment variable HF_XET_ENABLE_ADAPTIVE_CONCURRENCY, which is set to false by default. Future PRs will integrate this into the download path and then enable it by default with sufficient testing. The `AdaptiveConcurrencyController` struct dynamically adjusts concurrency for upload and download operations by continuously adapting to network conditions. It tracks two key signals: 1. Observed bandwidth via an online linear regression predictor 2. Success ratio of recent transfers using configurable success/failure thresholds Transfers are considered successful if they complete within a statistically reasonable time given the model (less than the 90% quantile) and below the configured max RTT for healthy operation (by default 90s). The model then increases the concurrency when the success ratio is high (>0.8) and the RTT prediction stays below a target RTT (60s default). It decreases the concurrency when the success ratio drops below a threshold (<0.5) or the transfers exceed a maximum healthy RTT (90s default). To prevent oscillations, it also enforces a minimum delay between adjustments, set to 500ms by default. The RTT prediction is implemented using an exponentially-weighted online linear regression model that predicts round-trip time (RTT) based on transfer size and concurrency level. The model fits: ``` duration_secs ≈ a + b * (size_bytes * concurrency) ``` Internally this is implemented using `ExpWeightedOnlineLinearRegression`, which maintains exponentially-decaying sufficient statistics to predict the mean and standard deviation of the RTT. 
The exponential decay of the process, with the half-life of an observation set to 60 data points, allows it to adapt to slowly changing network conditions. This model is used to predict whether adding concurrency will cause a large transfer of 64MB to take longer than 60s to complete, in which case no concurrency is added. Upon a successful transfer, this model is used to assess whether congestion might be causing a completed transfer to take longer than expected; if the actual RTT is above the 90% quantile, then it's reported as a failure to the success tracker; a statistically significant number of recent failures will prevent the concurrency from increasing, and a string of failures will cause the controller to lower the concurrency. The controller tracks the success ratio (fraction of successful transfers) using an exponentially weighted moving average with a default half-life of 8 observations. This allows us to determine whether recent transfers have hit congestion, as long RTTs are recorded as failures. More than 80% of the recent transfers have to be successes to raise the concurrency, and if less than 50% are successful, the concurrency is dropped. By default, the model starts at the minimum concurrency and increases as soon as data reliably predicts the RTT. All bounds are controlled by config variables.	2025-12-01 16:43:24 -08:00
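The adjustment rule described in this commit message can be sketched roughly as follows. This is an illustrative reading of the text, not the `AdaptiveConcurrencyController` implementation: `SuccessTracker`, `Thresholds`, `Adjustment`, `predict_rtt_secs`, and `decide` are hypothetical names, the default values are the ones quoted in the message, and the minimum delay between adjustments (500ms) and the quantile test are left out.

```rust
/// Hypothetical outcome of one adjustment decision.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Adjustment {
    Increase,
    Decrease,
    Hold,
}

/// Exponentially weighted success ratio; half-life is counted in observations.
pub struct SuccessTracker {
    decay: f64, // per-observation decay factor: 0.5^(1 / half_life)
    weighted_successes: f64,
    weighted_total: f64,
}

impl SuccessTracker {
    pub fn new(half_life: f64) -> Self {
        Self { decay: 0.5_f64.powf(1.0 / half_life), weighted_successes: 0.0, weighted_total: 0.0 }
    }

    /// Decay the older observations, then add the new one with weight 1.
    pub fn record(&mut self, success: bool) {
        let s = if success { 1.0 } else { 0.0 };
        self.weighted_successes = self.weighted_successes * self.decay + s;
        self.weighted_total = self.weighted_total * self.decay + 1.0;
    }

    /// Returns None until at least one observation has been recorded.
    pub fn ratio(&self) -> Option<f64> {
        if self.weighted_total > 0.0 { Some(self.weighted_successes / self.weighted_total) } else { None }
    }
}

/// Thresholds quoted in the commit message (assumed defaults).
pub struct Thresholds {
    pub increase_ratio: f64,
    pub decrease_ratio: f64,
    pub target_rtt_secs: f64,
    pub max_rtt_secs: f64,
}

impl Default for Thresholds {
    fn default() -> Self {
        Self { increase_ratio: 0.8, decrease_ratio: 0.5, target_rtt_secs: 60.0, max_rtt_secs: 90.0 }
    }
}

/// Linear model from the message: duration_secs ≈ a + b * (size_bytes * concurrency).
pub fn predict_rtt_secs(a: f64, b: f64, size_bytes: f64, concurrency: f64) -> f64 {
    a + b * (size_bytes * concurrency)
}

/// Decrease on a low success ratio or an unhealthy RTT; increase only when the
/// success ratio is high and the predicted RTT stays under the target.
pub fn decide(t: &Thresholds, success_ratio: f64, predicted_rtt_secs: f64, observed_rtt_secs: f64) -> Adjustment {
    if success_ratio < t.decrease_ratio || observed_rtt_secs > t.max_rtt_secs {
        Adjustment::Decrease
    } else if success_ratio > t.increase_ratio && predicted_rtt_secs < t.target_rtt_secs {
        Adjustment::Increase
    } else {
        Adjustment::Hold
    }
}
```

With a half-life of 8 observations, a run of eight failures after a long run of successes halves the weight of the earlier successes, which is enough to pull the ratio below the 0.5 threshold and trigger a decrease.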
Di Xiao	a4c7f2c9eb	Update git-xet README (#577 ) - Change "git-xet [command]" to "git xet [command]" - Update Windows binary and installer URL to v0.2.0	2025-11-25 10:31:54 -08:00
Di Xiao	3cbd8b10c7	Fix chunk cache disabling regression (#579 ) The chunk cache was silently turned on by default because the TranslatorConfig chunk cache size now uses `xet_config().chunk_cache.size_bytes` which is set to 10 GB by default.	2025-11-21 18:14:32 -08:00
Di Xiao	76fd5dab6e	Update git-xet install script (#575 ) - Update download URLs to release version 0.2.0 - Make the script `sh` compliant	2025-11-21 15:39:40 -08:00
Di Xiao	ec51474691	Upgrade hf-xet version to 1.2.1 (#578 ) patch release to bring in two bug fixes due to disabling disk cache at 1.2.0	2025-11-21 15:16:23 -08:00

571 Commits