Commit Graph

571 Commits

Author SHA1 Message Date
Di Xiao
ada0476c01 Update git-xet Windows installer package version (#576) 2025-11-21 12:56:27 -08:00
Di Xiao
eeee211e59 Upgrade git-xet version (#574) git-xet-v0.2.0 2025-11-21 10:05:02 -08:00
Di Xiao
a0632bd3e5 git-xet install script checks commands and git-lfs are available (#573)
Update the git-xet install script to check that
- required commands are available, quitting otherwise
- git-lfs is installed; otherwise the script asks the user to install it but
still finishes correctly
2025-11-21 09:56:42 -08:00
Di Xiao
4baae2f006 No re-computation of sha256 if provided (#570)
This PR lets the File cleaner skip re-computing the sha256 of a file if it is
provided -- sha256 computation isn't blazing fast, and there's no need to
re-compute. The sha256 is not verified anywhere, only serving as a
foreign key in the Xet file table to the file id in moon-landing
lfs-files table.

- Updates git-xet and migration utility to pass in the sha256 which is
already available.
- Will update huggingface_hub for the `upload_large_folder` case later.
- Related PR in repo-scanner:
https://github.com/huggingface-internal/repository-scanner/pull/368
(update the commit id when this merges).

Fix XET-200
2025-11-20 14:21:15 -08:00
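The pass-through described in #570 above could look roughly like the following. This is a minimal sketch and not the actual File cleaner code: `resolve_sha256` and its `compute` closure are hypothetical names introduced only for illustration.

```rust
/// Returns the caller-supplied SHA-256 if there is one; otherwise runs the
/// (comparatively slow) `compute` closure.
/// Hypothetical helper, not part of the xet-core API.
fn resolve_sha256<F>(provided: Option<String>, compute: F) -> String
where
    F: FnOnce() -> String,
{
    // `unwrap_or_else` only calls `compute` when `provided` is `None`,
    // so a provided hash is neither re-computed nor verified.
    provided.unwrap_or_else(compute)
}
```

A caller that already knows the hash passes `Some(hash)`; any other caller passes `None` and pays for the computation.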
Di Xiao
b5563ecd93 Better support of authentication through SSH (#553)
This PR finally enables `git-xet` on Windows to authenticate to a remote
Git server using an SSH URL. This is crucial because access tokens to the
CAS server expire every 900 s and `git-xet` needs to re-authenticate
with the Git server by itself during push/pull (whereas the first
authentication is handled by `git-lfs`).

This uses the same SSH connect utility to authenticate over an SSH repo
remote URL on both *nix OSes and Windows.

Resolves XET-731
2025-11-20 12:09:57 -08:00
Di Xiao
5f77ffc46a Integration test for ssh access on Windows (#566)
This PR builds on top of
https://github.com/huggingface/xet-core/pull/565 and builds an
integration test to test access to "ssh" and "sh" on Windows through the
"git" (-> "git-lfs") -> "git-xet" call chain.

Out of all the ssh variants, access to programs like "plink", "putty",
"tortoiseplink" or "simple" should be given by the env var
`$GIT_SSH_COMMAND` or `$GIT_SSH`, or by git config entry
`core.sshCommand`. Direct access to the most commonly used utility "ssh" and
indirect access to "ssh" via "sh -c" on Windows is provided by the
"git" (-> "git-lfs") -> "git-xet" call chain, see
git_xet/tests/test_ssh.rs for details.
2025-11-20 03:22:19 -08:00
Di Xiao
075a9c96c0 Add ssh connect utility according to git standard (#565)
This implements a utility to help set up an SSH connection according to
Git standards.

1. Env vars `$GIT_SSH_COMMAND`, `$GIT_SSH` and git config entry
`core.sshCommand` define
which ssh executable to use for an SSH connection. `$GIT_SSH_COMMAND`
takes precedence over `core.sshCommand` and both are interpreted by the
shell (e.g. `GIT_SSH_COMMAND = "ssh -i ~/.ssh/key"`), which allows
additional arguments to be included. They both take precedence over
`$GIT_SSH`, which on the other hand must be just the path to a program
(which can be a wrapper shell script, if additional arguments are
needed). When none of these is given, the default ssh program to use is
`ssh`.

2. Env var `$GIT_SSH_VARIANT` takes precedence over git config entry
`ssh.variant` and they both define whether
`$GIT_SSH`/`$GIT_SSH_COMMAND`/`core.sshCommand` refer to OpenSSH,
plink/putty or tortoiseplink, or instruct git to automatically detect
the ssh program type. Valid values are "ssh" (to use OpenSSH options),
"plink", "putty",
"tortoiseplink", "simple" (no options except the host and remote
command). The default auto-detection can be
explicitly requested using the value "auto". Any other value is treated
as "ssh".


This implementation follows the git standard and how the same
functionality is handled in
git-lfs
(071e19e8ea/ssh/ssh.go (L41)).
2025-11-19 12:43:16 -08:00
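The precedence rules listed in #565 above can be sketched as two pure functions. This is an illustration under stated assumptions, not the utility added by the PR: `SshProgram`, `resolve_ssh_program` and `resolve_ssh_variant` are hypothetical names, and the inputs are passed in as plain options instead of being read from the environment or git config.

```rust
/// How the ssh program was specified (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum SshProgram {
    /// Interpreted by the shell, so it may carry extra arguments.
    ShellCommand(String),
    /// Must be just the path to a program.
    Path(String),
}

/// Picks the ssh program: `$GIT_SSH_COMMAND`, then `core.sshCommand`,
/// then `$GIT_SSH`, then the default `ssh`.
fn resolve_ssh_program(
    git_ssh_command: Option<&str>,  // value of $GIT_SSH_COMMAND
    core_ssh_command: Option<&str>, // value of git config core.sshCommand
    git_ssh: Option<&str>,          // value of $GIT_SSH
) -> SshProgram {
    if let Some(cmd) = git_ssh_command.or(core_ssh_command) {
        return SshProgram::ShellCommand(cmd.to_string());
    }
    SshProgram::Path(git_ssh.unwrap_or("ssh").to_string())
}

/// Picks the ssh variant: `$GIT_SSH_VARIANT` wins over `ssh.variant`;
/// nothing set means auto-detection; unknown values are treated as "ssh".
fn resolve_ssh_variant(
    env_variant: Option<&str>,    // value of $GIT_SSH_VARIANT
    config_variant: Option<&str>, // value of git config ssh.variant
) -> &'static str {
    match env_variant.or(config_variant) {
        None | Some("auto") => "auto",
        Some("plink") => "plink",
        Some("putty") => "putty",
        Some("tortoiseplink") => "tortoiseplink",
        Some("simple") => "simple",
        Some(_) => "ssh",
    }
}
```

Keeping the resolution free of I/O makes the precedence easy to unit-test; how empty values or letter case are handled is left out of this sketch.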
Hoyt Koepke
a5ea819ccb Rework of the constant configuration system. (#564) 2025-11-19 11:58:53 -08:00
Assaf Vayner
2d25452eea add v1 to api paths (#568)
Not sure how this got reverted or if it wasn't added during the spec
writing.

All API paths used by xet-core clients should use the latest API path
documented. This includes adding the version to the path.
2025-11-18 11:05:52 -08:00
Di Xiao
37f35fb827 "git xet track" command (#567)
As requested, this implements the `git xet track` command to replace the
`git lfs track` command (but actually calls it underneath with all args
passed over) to unify branding.
Context:
https://huggingface.slack.com/archives/C087TU2FE3G/p1763392346782079?thread_ts=1763392297.946059&cid=C087TU2FE3G,
and already used in https://github.com/huggingface/blog/pull/3170.

Tests:
```
di@di-mbp ~/tt % git xet track --help
Start tracking the given patterns(s) through Git LFS. This directly calls the "git lfs track" command with the following options and args

Usage: git-xet track [ARGS_TO_GIT_LFS_TRACK]...

Arguments:
  [ARGS_TO_GIT_LFS_TRACK]...  

Options:
  -h, --help     Print help
  -V, --version  Print version

di@di-mbp ~/tt % git xet track "*.gif"
Tracking "*.gif"

di@di-mbp ~/tt % git xet track --lockable "*.psd"
Tracking "*.psd"

di@di-mbp ~/tt % git xet track --filename "project [1].psd"
Tracking "project \\[1\\].psd"

di@di-mbp ~/tt % git xet track
Listing tracked patterns
    *.7z (.gitattributes)
    *.arrow (.gitattributes)
    *.bin (.gitattributes)
    *.bz2 (.gitattributes)
    *.ckpt (.gitattributes)
    *.ftz (.gitattributes)
    *.gz (.gitattributes)
    *.h5 (.gitattributes)
    *.joblib (.gitattributes)
    *.lfs.* (.gitattributes)
    *.mlmodel (.gitattributes)
    *.model (.gitattributes)
    *.msgpack (.gitattributes)
    *.npy (.gitattributes)
    *.npz (.gitattributes)
    *.onnx (.gitattributes)
    *.ot (.gitattributes)
    *.parquet (.gitattributes)
    *.pb (.gitattributes)
    *.pickle (.gitattributes)
    *.pkl (.gitattributes)
    *.pt (.gitattributes)
    *.pth (.gitattributes)
    *.rar (.gitattributes)
    *.safetensors (.gitattributes)
    saved_model/**/* (.gitattributes)
    *.tar.* (.gitattributes)
    *.tar (.gitattributes)
    *.tflite (.gitattributes)
    *.tgz (.gitattributes)
    *.wasm (.gitattributes)
    *.xz (.gitattributes)
    *.zip (.gitattributes)
    *.zst (.gitattributes)
    *tfevents* (.gitattributes)
    zz (.gitattributes)
    zz1 (.gitattributes)
    *.gif (.gitattributes)
    *.psd [lockable] (.gitattributes)
    project[[:space:]]\[1\].psd (.gitattributes)
Listing excluded patterns
```

Fix XET-785
2025-11-18 09:20:27 -08:00
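The delegation described in #567 above (forwarding all arguments to `git lfs track`) can be sketched with the standard library alone. `build_lfs_track_command` is a hypothetical helper name; the actual git-xet implementation may be structured differently.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Builds the `git lfs track ...` invocation that a `git xet track`
/// wrapper could delegate to. Hypothetical helper for illustration.
fn build_lfs_track_command<I, S>(args: I) -> Command
where
    I: IntoIterator<Item = S>,
    S: AsRef<OsStr>,
{
    let mut cmd = Command::new("git");
    // Every user-supplied argument is passed through unchanged.
    cmd.arg("lfs").arg("track").args(args);
    cmd
}
```

Calling `.status()` on the returned `Command` would run it with inherited stdio, so the output of `git lfs track` reaches the user directly.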
Di Xiao
f9f944064a Rename temp.rs to temp_home.rs (#563)
This file contains a test utility "TempHome" to create a temporary HOME
environment. "temp" is a really bad file name.
2025-11-13 12:30:33 -08:00
Assaf Vayner
c86550d6ef deduplicate downloads for sequential output mode (#560)
This PR ensures that we only download each fetch info term once no
matter how many separate ranges it fulfills in sequential output mode.
The data is interned in memory until the last usage but is dropped after
the last reference to it is dropped.

CC @co42 I confirmed this reduces the 10GB downloaded for the 6GB file
(on my instance this wasn't faster but I was writing to EBS so that was
likely my bottleneck).
2025-11-13 12:29:56 -08:00
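The idea in #560 above, downloading each term once and releasing it after its last use, can be sketched with a use-counted cache. This is a simplified single-threaded illustration; `TermCache`, its fields and the `u64` term ids are assumptions, not the data structures used in xet-core.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical in-memory cache: each term is fetched at most once and its
/// entry is removed once every range that needs it has been served.
struct TermCache {
    // term id -> (remaining uses, data if already downloaded)
    entries: HashMap<u64, (usize, Option<Arc<Vec<u8>>>)>,
    // number of real downloads performed, kept for illustration
    downloads: usize,
}

impl TermCache {
    /// `uses_per_term` says how many ranges each term fulfills.
    fn new(uses_per_term: HashMap<u64, usize>) -> Self {
        Self {
            entries: uses_per_term
                .into_iter()
                .filter(|(_, uses)| *uses > 0)
                .map(|(id, uses)| (id, (uses, None)))
                .collect(),
            downloads: 0,
        }
    }

    /// Returns the term data, calling `fetch` only on first use.
    /// Returns `None` for unknown or already exhausted terms.
    fn get<F>(&mut self, term: u64, fetch: F) -> Option<Arc<Vec<u8>>>
    where
        F: FnOnce() -> Vec<u8>,
    {
        let (remaining, slot) = self.entries.get_mut(&term)?;
        if slot.is_none() {
            *slot = Some(Arc::new(fetch()));
            self.downloads += 1;
        }
        let data = slot.clone();
        *remaining -= 1;
        if *remaining == 0 {
            // Last use: drop the cache's reference. The bytes are freed
            // once callers drop their own `Arc` clones.
            self.entries.remove(&term);
        }
        data
    }
}
```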
Di Xiao
fd6db3b106 Augment process wrapping helpers (#552)
- Extends process wrapping helpers to run any program; this prepares
the utility for the upcoming SSH authentication PR.
- Add unit tests for the helper functions.
- Rename the process wrapping module.
- Update references and error names accordingly.
2025-11-13 09:06:16 -08:00
Hoyt Koepke
ad9d18ccc0 Bookkeeping: Move env and cwd guard code out of file_paths.rs (#561)
This PR simply moves the EnvVarGuard and CwdGuard structs out of
file_paths.rs and into their own utils/src/guard.rs file as they are
used in more places than just the parts dealing with file paths. It also
adds tests and comments. No functionality change.
2025-11-12 15:29:41 -08:00
Hoyt Koepke
3904178351 Disable disk cache in hf_xet by default through cargo system. (#559)
This PR disables the disk cache by default in hf_xet using cargo
features instead of in-code logic.

Reverts https://github.com/huggingface/xet-core/pull/535
2025-11-10 13:52:09 -08:00
Hoyt Koepke
10e9f50058 Detect local:// in endpoint string and direct to local CAS server. (#557)
For testing, we use a mock CAS server instance running in a local
directory. This is a fully functional server, but currently used only
for testing. This PR has the regular config path detect local:// as a
prefix, allowing a directory to be passed with "local://<dir>" as the
endpoint to use for testing and simulation.
2025-11-06 09:24:06 -08:00
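The prefix detection described in #557 above can be sketched as follows. `CasEndpoint` and `parse_endpoint` are hypothetical names used for illustration; only the `local://<dir>` convention comes from the commit message.

```rust
use std::path::PathBuf;

/// Where CAS requests should go (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum CasEndpoint {
    /// A mock CAS server backed by a local directory.
    Local(PathBuf),
    /// Any other endpoint string, kept unchanged.
    Remote(String),
}

/// Treats `local://<dir>` as a local directory and anything else as remote.
fn parse_endpoint(endpoint: &str) -> CasEndpoint {
    match endpoint.strip_prefix("local://") {
        Some(dir) => CasEndpoint::Local(PathBuf::from(dir)),
        None => CasEndpoint::Remote(endpoint.to_string()),
    }
}
```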
Assaf Vayner
499d9a1dc8 refactor to rewrite shard header with footer_len set to 0 (#551)
related to https://github.com/huggingface/xet-core/issues/547

When xet-core uploads the shard, the header field that specifies the footer
length is set to 200, whereas it should be 0 according to the
specification.

Note: this value is ignored by the server today, but ideally we would
set this right since it can be useful to know if there is a footer on
the shard when reading the shard as a whole in a non-streaming fashion.
2025-11-04 13:33:49 -08:00
Rajat Arya
ffa9faac68 Adds User-Agent header in cas_client requests to CAS (#546)
Adds User-Agent when making requests to CAS.

* sets to (project) / (version)
* version is picked from Cargo.toml
* project is hf-xet crates, git-xet crates, (also hard-coded xtool)

The reason for this change is to add better observability on the server
- so we can segment requests by client and understand client versions in
the wild.
2025-11-03 15:12:47 -08:00
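The `(project) / (version)` format from #546 above could be produced by a helper like the one below. This is a sketch: `user_agent` is a hypothetical name, and the exact separator and spacing used by cas_client are an assumption (the conventional `product/version` token form is shown).

```rust
/// Formats a User-Agent value as `project/version`.
/// In a real crate the version could come from `env!("CARGO_PKG_VERSION")`,
/// which Cargo sets from Cargo.toml at compile time.
fn user_agent(project: &str, version: &str) -> String {
    format!("{project}/{version}")
}
```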
Di Xiao
aac19ee383 Remove incorrect clippy allow (#550)
The attribute `#[allow(clippy::new_ret_no_self)]` is not appropriate for
functions that return `Arc<Self>`.
2025-11-03 13:55:18 -08:00
Assaf Vayner
cd64baa6ca separating output providers, sequential output providers (#528)
This PR does a refactor of how we pass in the catch-all "OutputProvider"
to the download mechanism.

It separates the download system into supporting "Sequential" and
"Seeking" operations:

- Seeking e.g. opening the file multiple times and seeking to location
-- this is the standard writing mechanism hf_xet uses today.
- Sequential e.g. opening a file once and writing data in order -- this
is to be used in a set of upcoming PRs/features to use the
parallel-download/sequential-write mechanism to support writing to
Stdout and to a channel buffer in memory.

To support an in-memory channel with backpressure, the Channel{Writer,
Stream, Reader} are introduced (re-introduced?) in utils. This
particularly could be useful in the mount functionality.
2025-10-29 14:12:24 -07:00
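The "in-memory channel with backpressure" mentioned in #528 above can be illustrated with the standard library's bounded channel. This is not the Channel{Writer, Stream, Reader} code from the PR; `stream_in_order` is a hypothetical function that only shows the mechanism.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Sends chunks, in order, through a bounded channel from a producer thread.
/// Hypothetical sketch of sequential output with backpressure.
fn stream_in_order(chunks: Vec<Vec<u8>>, capacity: usize) -> Receiver<Vec<u8>> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for chunk in chunks {
            // `send` blocks while the buffer already holds `capacity` chunks,
            // so a slow consumer throttles the producer (backpressure).
            if tx.send(chunk).is_err() {
                break; // the receiver was dropped; stop producing
            }
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });
    rx
}
```

The consumer reads with `rx.iter()` and writes each chunk to stdout, a file or another buffer in the order it arrives.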
Hoyt Koepke
2fc772e6d0 Shard utilities needed for GC pass and server-side xorb rewriting. (#532)
This PR adds a utility that rewrites a shard to include only the
relevant xorb information, dropping unreferenced file information.

In addition, to preserve the global dedup tracking information
associated with the files, this PR also adds a backwards-compatible flag
to the chunk metadata that marks a specific chunk as global dedup
eligible. This allows the global dedup information to be tracked
independently of the file metadata.
2025-10-29 12:10:57 -07:00
Rajat Arya
03c190325f Updated diagnostics scripts to collect logs (#542)
- also updated README
- added analysis script to load latest dump collected

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-29 08:48:47 -07:00
Joseph Godlewski
85b5ba5fa7 Force cas_client to only use http1 when uploading/downloading (#543)
While investigating perf issues server-side, our usage of http2 with
internal systems led to a throughput bottleneck as all requests would
get multiplexed on a single connection. CAS (and the backing S3 storage)
are designed to be able to accept/produce large volumes of data that
should be spread across multiple connections (see [S3 perf
considerations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-scale)).
Although there are changes going into the backend services to ensure
http1.1 communication (i.e. to spread requests across separate
connections), it would be good to also have cas_client only use http1.1
when uploading and downloading to maximize network throughput.
2025-10-28 12:01:28 -07:00
Di Xiao
ce3d6128c0 Update git-xet brew installation instruction (#541)
Users need to run `git-xet install` after `brew install git-xet`. This
instruction is printed out at the end of the `brew install git-xet` step
but people may not pay attention to that.
2025-10-25 10:06:39 -07:00
Rajat Arya
50c69409c5 Update Python classifiers (#540) v1.2.0-rc1 v1.2.0 2025-10-24 11:01:31 -07:00
Rajat Arya
193f300db9 version bump to 1.2.0 for release (#539) 2025-10-24 10:39:53 -07:00
crStiv
7e483970c8 fix: small typo (#534)
nothing fancy, just fixed a small typo
2025-10-24 10:26:00 -07:00
Hoyt Koepke
3096b3f9c3 Test suite for directory logging functionality (#536) 2025-10-24 10:06:26 -07:00
Hoyt Koepke
54aef8095f Improved logging for cas_client crate (#537) 2025-10-24 10:05:57 -07:00
Rajat Arya
f2d1587b82 Disable DiskCache in hf_xet, continue to use it in git_xet (#535)
1. hf_xet : disables DiskCache by default.
2. git_xet : continues to use DiskCache by default, set to 10GB as
before.

Tested manually locally.
2025-10-23 16:28:15 -07:00
Hoyt Koepke
f0777fcf2c Fix clippy issues with new rust version. (#533)
This PR fixes the clippy issues detected with rust 1.90 and
1.92+nightly.
2025-10-23 11:43:45 -07:00
omahs
84c9f4d412 Fix typos (#508)
Fix typos
2025-10-23 07:38:04 -07:00
Rajat Arya
7315f6d6da Add fallback if unable to get git commit (#531)
Specifies the fallback option on the `git_version!` macro, as
written in the docs for git-version-macro crate.
2025-10-22 08:46:42 -07:00
Rajat Arya
500ee6165d Adding python-version 3.13t and 3.14t to builds (#524)
Add Python 3.13t and 3.14t to Release builds; they don't follow the standard ABI, i.e. "abi3", so they need to be built separately.
Update the build and debug symbol generation and publishing workflow to allow different debug symbols for builds targeting different Python versions.
Update the diagnosis scripts so they find the correct debug symbols.

---------

Co-authored-by: di <di@huggingface.co>
2025-10-20 17:14:54 -07:00
Hoyt Koepke
9fde4b72c0 Fix breaking build changes due to git safety checks. (#530)
Fix build break due to git safety checks. 

The recent logging PR added the automatic logging of the git repository
commit hash that was used to build the wheel. However, this requires
querying for this hash at build time, which requires running git. It
turns out that in the CI, the user checking out the repository and the
user building the wheel are different, which makes git upset. This PR
adds code to tell git everything will be ok.

---------

Co-authored-by: di <di@huggingface.co>
2025-10-20 11:56:15 -07:00
Hoyt Koepke
69f23d630e Logging to directory + log file management; default to log directory for hf_xet (#502)
This PR switches the default logging to log events to a file in
'~/.cache/huggingface/xet/logs' (or 'xet/logs' under the specified cache
directory if not `~/.cache/huggingface/`).

In this directory, log files older than 2 weeks are cleaned up on
process start, and if the total size of files in the directory is larger
than 1gb, then log files are deleted by age to get the directory size
under 1gb. Log files are named with a timestamp and PID; by default,
logs newer than 1 day or logs with an active associated PID are never
deleted. All of these are user configurable constants.
2025-10-20 14:35:43 +02:00
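The cleanup rules described in #502 above can be sketched as a pure function over file metadata. This is an illustration, not the PR's implementation: `LogFile`, `Policy` and `files_to_delete` are hypothetical names, and the order in which the age and size rules are applied is an assumption.

```rust
use std::time::Duration;

/// Metadata for one log file (hypothetical type for illustration).
struct LogFile {
    name: String,
    age: Duration,
    size: u64,
    pid_active: bool, // the process that wrote this log is still running
}

/// User-configurable limits; the commit message gives 2 weeks, 1 day and
/// 1 GB as defaults.
struct Policy {
    max_age: Duration,     // older files are deleted
    protect_age: Duration, // newer files are never deleted
    max_total_size: u64,   // directory size budget in bytes
}

/// Returns the names of the files to delete, oldest first.
fn files_to_delete(files: &[LogFile], policy: &Policy) -> Vec<String> {
    let mut total: u64 = files.iter().map(|f| f.size).sum();

    // Files with an active PID or newer than `protect_age` are never deleted.
    let mut candidates: Vec<&LogFile> = files
        .iter()
        .filter(|f| !f.pid_active && f.age >= policy.protect_age)
        .collect();
    candidates.sort_by(|a, b| b.age.cmp(&a.age)); // oldest first

    let mut deleted = Vec::new();
    for f in candidates {
        let too_old = f.age > policy.max_age;
        let over_budget = total > policy.max_total_size;
        if too_old || over_budget {
            total -= f.size;
            deleted.push(f.name.clone());
        }
    }
    deleted
}
```

Because candidates are visited oldest first, expired files go first, and the size rule then removes the next-oldest files only until the directory is back under budget.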
Di Xiao
2eec20baf1 Create README.md for Git-Xet (#529)
Added a README file for Git-Xet, detailing installation and
functionality.

---------

Co-authored-by: Jared Sulzdorf <j.sulzdorf@gmail.com>
2025-10-16 15:08:13 -07:00
Sam Horradarn
89e549089a fix: explicitly specify main branch for hub client in migration utility (#522)
The current hub client does not pass a revision as an argument, which
causes the moon-landing call to append the `create_pr=1` query param to the
token API and return a 403 error.
2025-10-03 12:16:19 -07:00
Assaf Vayner
9f69239322 openapi spec and Makefile for it (#518)
fix XET-741

Creates an openapi specification for all CAS APIs following the first
version of the protocol specification.

Makefile to generate different language clients for CAS APIs.
2025-10-03 10:24:44 -07:00
Di Xiao
a31df60741 Upgrade macos-13 to macos-15-intel due to closing down (#521)
Upgrade the Intel macos runner due to [GitHub Actions: macOS 13 runner
image is closing
down](https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/).
2025-10-02 14:55:18 -07:00
Qubitium-ModelCloud
975e867b96 Fix python 314 compat (#520)
Fix python 3.14 build compat

1. `pyo3` dependency updated to `0.26`: this is required or else it can't be
compiled for python 3.14
2. version update to `1.1.11-dev0`
```
# removed
Fix compat with Rustc 1.86.0
3. change rust conditions that was throwing `unstable` `errors` in `rustc 1.86.0 (05f9846f8 2025-03-31)` (fairly new version, not the latest) _
```
@hoytak  @seanses  @assafvayner
2025-10-02 14:25:10 -07:00
Di Xiao
6fbde98e5e git-xet Windows installer and code signing (#519)
This PR
- extends the git-xet release workflow to code-sign the Windows executable
"git-xet.exe" using the Microsoft Trusted Signing Service;
- builds a Windows installer for git-xet that places "git-xet.exe" in the
system, modifies the system PATH environment variable, and runs the command
"git-xet install" to configure git-xet. On uninstallation from "Control
Panel\Programs\Programs and Features", it first runs "git-xet uninstall
--all" so that git-xet is deregistered from git-lfs custom transfer;
- signs the built Windows installer msi file.
git-xet-v0.1.0
2025-10-02 12:40:09 -07:00
Assaf Vayner
28dd760892 rm all docs files (#517)
Removes docs in xet-core in favor of docs in hub-docs for the xet
protocol: https://github.com/huggingface/hub-docs/pull/1963
2025-09-30 18:12:03 -07:00
Assaf Vayner
5565b37e03 integrate docs debugging (#516)
change docs upload to xet-protocol rather than xet-upload
2025-09-29 15:38:13 -07:00
Assaf Vayner
f0895142cb move spec to docs (#515)
publish to hub docs out of xet-core for xet-spec. Need to merge this
first before iterating to get the github workflows working right.
2025-09-29 12:37:21 -07:00
Hoyt Koepke
4176674a7e Added lazy evaluation functionality to error printer. (#510)
Currently, the full message given to a function in error_printer must
always be created. When the message contains additional data, this causes
extra work even when there are no errors.

This PR introduces `_fn` versions of the existing functions that call a
given function on demand to obtain the message. Thus `.warn_error_fn(||
format!("Error processing {context}"))` would only allocate and create
the error string when an error needs to be logged.
2025-09-29 11:23:39 -07:00
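The lazy variant described in #510 above can be sketched as an extension trait on `Result`. Only the method name `warn_error_fn` comes from the commit message; the trait name `WarnErrorFn`, the signature and the use of `eprintln!` in place of a real logger are assumptions.

```rust
use std::fmt::Debug;

/// Hypothetical extension trait: logs a warning on `Err`, building the
/// message only when it is actually needed.
trait WarnErrorFn<T, E> {
    fn warn_error_fn<F: FnOnce() -> String>(self, message: F) -> Result<T, E>;
}

impl<T, E: Debug> WarnErrorFn<T, E> for Result<T, E> {
    fn warn_error_fn<F: FnOnce() -> String>(self, message: F) -> Result<T, E> {
        if let Err(e) = &self {
            // `message()` runs only on this branch, so the `Ok` path never
            // allocates or formats the string.
            eprintln!("WARN: {}: {e:?}", message());
        }
        self
    }
}
```

The result is passed through unchanged, so the call can be inserted into an existing `?` chain.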
Assaf Vayner
0958579c40 spec draft (#422)
fix XET-681

XET protocol specification initial draft

- documentation of core procedures required for file uploads and
downloads
- format specifications for shards and xorbs
2025-09-29 10:25:25 -07:00
SuperKenVery
94fa9449bb Enable socks5 proxy support (#474)
Tested on user's machine with the socks5 proxy specified in `all_proxy` env var.

Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
2025-09-26 14:25:23 -07:00
Assaf Vayner
c55fabb6bf hashing and chunking example tools (#496)
Adds some basic example tools (compiled with `cargo build --examples`
on `data` crate) to compute hashes and chunk boundaries.
2025-09-26 12:49:55 -07:00
Di Xiao
8ee0a5c958 Cache rust build in actions (#513)
In response to [A Joint Statement on Sustainable
Stewardship](https://openssf.org/blog/2025/09/23/open-infrastructure-is-not-free-a-joint-statement-on-sustainable-stewardship/)
and [Rust Foundation Signs Joint Statement on Open Source Infrastructure
Stewardship](https://rustfoundation.org/media/rust-foundation-signs-joint-statement-on-open-source-infrastructure-stewardship/),
implements caching of dependencies and build artifacts, and reduces some CI
runtime. Cache entry keys are formed by `os_type`-`arch_type`-`hash of
Cargo.lock`; the cache configuration is adapted from
https://docs.github.com/en/actions/tutorials/build-and-test-code/rust#caching-dependencies.
2025-09-26 11:16:04 -07:00