Update the git-xet install script to check that:
- required commands are available, and quit if they are not
- git-lfs is installed; if it is not, the script asks the user to install
it but still finishes correctly
This PR lets the file cleaner skip re-computing the sha256 when one is
provided for a file. Computing a sha256 isn't blazing fast, and there's
no need to re-compute it. The sha256 is not verified anywhere; it only
serves as a foreign key in the Xet file table to the file id in the
moon-landing lfs-files table.
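A minimal sketch of the skip logic, assuming the caller hands over an optional, already-known hash. The function name `sha256_or_compute` and the closure standing in for the hashing pass are hypothetical and not taken from the actual file cleaner:

```rust
/// Returns the sha256 to record for a file: the caller-supplied value when
/// one is present, otherwise the result of running `compute`.
/// `compute` stands in for the comparatively slow hashing pass.
fn sha256_or_compute<F>(provided: Option<String>, compute: F) -> String
where
    F: FnOnce() -> String,
{
    match provided {
        // The provided value is used as-is; it is not verified.
        Some(hash) => hash,
        // Only hash the file when nothing was passed in.
        None => compute(),
    }
}
```

Taking the fallback as a closure means the hashing work is never started when a hash is supplied.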
- Updates git-xet and migration utility to pass in the sha256 which is
already available.
- Will update huggingface_hub for the `upload_large_folder` case later.
- Related PR in repo-scanner:
https://github.com/huggingface-internal/repository-scanner/pull/368
(update the commit id when this merges).
Fix XET-200
This PR finally enables `git-xet` on Windows to authenticate to a remote
Git server over an SSH URL. This is crucial because access tokens for the
CAS server expire every 900 s, so `git-xet` needs to re-authenticate
with the Git server by itself during push/pull (whereas the first
authentication is handled by `git-lfs`).
This uses the same SSH connect utility to authenticate over an SSH repo
remote URL on both *nix and Windows.
Resolves XET-731
This PR builds on top of
https://github.com/huggingface/xet-core/pull/565 and builds an
integration test to test access to "ssh" and "sh" on Windows through the
"git" (-> "git-lfs") -> "git-xet" call chain.
Of all the ssh variants, access to programs like "plink", "putty",
"tortoiseplink" or "simple" should be given by the env var
`$GIT_SSH_COMMAND` or `$GIT_SSH`, or by the git config entry
`core.sshCommand`. Direct access to the most commonly used utility,
"ssh", and indirect access to "ssh" via "sh -c" on Windows are provided
by the "git" (-> "git-lfs") -> "git-xet" call chain; see
git_xet/tests/test_ssh.rs for details.
This implements a utility to help set up an SSH connection according to
Git standards.
1. Env vars `$GIT_SSH_COMMAND`, `$GIT_SSH` and git config entry
`core.sshCommand` define which ssh executable to use for an SSH
connection. `$GIT_SSH_COMMAND` takes precedence over `core.sshCommand`
and both are interpreted by the shell (e.g.
`GIT_SSH_COMMAND = "ssh -i ~/.ssh/key"`), which allows additional
arguments to be included. They both take precedence over `$GIT_SSH`,
which on the other hand must be just the path to a program (which can be
a wrapper shell script, if additional arguments are needed). When none
of these is given, the default ssh program to use is `ssh`.
2. Env var `$GIT_SSH_VARIANT` takes precedence over git config entry
`ssh.variant` and they both define whether
`$GIT_SSH`/`$GIT_SSH_COMMAND`/`core.sshCommand` refer to OpenSSH,
plink/putty or tortoiseplink, or instruct git to automatically detect
the ssh program type. Valid values are "ssh" (to use OpenSSH options),
"plink", "putty",
"tortoiseplink", "simple" (no options except the host and remote
command). The default auto-detection can be
explicitly requested using the value "auto". Any other value is treated
as "ssh".
This implementation follows the git standard and how the same
functionality is handled in git-lfs (071e19e8ea/ssh/ssh.go (L41)).
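The precedence rules above can be sketched as two small pure functions. This is an illustration only: the type and function names are hypothetical, the inputs are assumed to have been read already from the environment and git config, and the auto-detection step itself is not modeled.

```rust
/// How the resolved SSH program should be launched.
#[derive(Debug, PartialEq)]
enum SshProgram {
    /// Interpreted by the shell, so it may carry extra arguments.
    ShellCommand(String),
    /// Just the path to a program, without arguments.
    Path(String),
}

/// `$GIT_SSH_COMMAND` > `core.sshCommand` > `$GIT_SSH` > default `ssh`.
fn resolve_ssh_program(
    git_ssh_command: Option<&str>,
    core_ssh_command: Option<&str>,
    git_ssh: Option<&str>,
) -> SshProgram {
    if let Some(cmd) = git_ssh_command.or(core_ssh_command) {
        return SshProgram::ShellCommand(cmd.to_string());
    }
    SshProgram::Path(git_ssh.unwrap_or("ssh").to_string())
}

#[derive(Debug, PartialEq)]
enum SshVariant {
    Auto,
    Ssh,
    Plink,
    Putty,
    TortoisePlink,
    Simple,
}

/// `$GIT_SSH_VARIANT` takes precedence over `ssh.variant`; when neither is
/// set, auto-detection is the default. Unknown values are treated as "ssh".
fn resolve_ssh_variant(env_variant: Option<&str>, config_variant: Option<&str>) -> SshVariant {
    match env_variant.or(config_variant) {
        None | Some("auto") => SshVariant::Auto,
        Some("plink") => SshVariant::Plink,
        Some("putty") => SshVariant::Putty,
        Some("tortoiseplink") => SshVariant::TortoisePlink,
        Some("simple") => SshVariant::Simple,
        Some(_) => SshVariant::Ssh,
    }
}
```

Keeping the resolution free of I/O makes each precedence case easy to unit test without touching the real environment.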
Not sure how this got reverted or if it wasn't added during the spec
writing.
All API paths used by xet-core clients should use the latest API path
documented. This includes adding the version to the path.
This PR ensures that, in sequential output mode, we download each fetch
info term only once no matter how many separate ranges it fulfills.
The data is held in memory while any range still needs it and is dropped
once the last reference to it is released.
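One way to get "fetched once, freed after the last user" is a cache of weak references. The sketch below assumes that approach; `TermCache`, `TermKey` and the synchronous `fetch` closure are hypothetical and do not reflect the PR's actual types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical key for a fetch info term: (xorb hash, chunk start, chunk end).
type TermKey = (String, u32, u32);

#[derive(Default)]
struct TermCache {
    entries: Mutex<HashMap<TermKey, Weak<Vec<u8>>>>,
}

impl TermCache {
    /// Returns the shared data for `key`, calling `fetch` only when no live
    /// copy exists. The cache holds `Weak` references only, so the bytes are
    /// freed as soon as the last `Arc` handed out here is dropped.
    fn get_or_fetch<F>(&self, key: &TermKey, fetch: F) -> Arc<Vec<u8>>
    where
        F: FnOnce() -> Vec<u8>,
    {
        // For brevity the lock is held during the fetch, which serializes
        // downloads; a real implementation would likely avoid that.
        let mut entries = self.entries.lock().unwrap();
        if let Some(data) = entries.get(key).and_then(|weak| weak.upgrade()) {
            return data;
        }
        let data = Arc::new(fetch());
        entries.insert(key.clone(), Arc::downgrade(&data));
        data
    }
}
```

Each range that needs the term keeps its own `Arc`; nothing has to track which usage is the last one.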
CC @co42 I confirmed this reduces the 10GB downloaded for the 6GB file
(on my instance this wasn't faster but I was writing to EBS so that was
likely my bottleneck).
- Extend the process wrapping helpers to run any program; this prepares
the utility for the upcoming SSH authentication PR.
- Add unit tests for the helper functions.
- Rename the process wrapping module.
- Update references and error names accordingly.
This PR simply moves the EnvVarGuard and CwdGuard structs out of
file_paths.rs and into their own utils/src/guard.rs file, as they are
used in more places than just the parts dealing with file paths. It also
adds tests and comments. No functionality change.
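For readers unfamiliar with the pattern, a guard of this kind is a small RAII type that restores state on drop. The following is a sketch of the general idea, not the code in utils/src/guard.rs; the constructor name `set` and the fields are assumptions:

```rust
use std::env;
use std::ffi::OsString;

/// Sets an environment variable for the lifetime of the guard and restores
/// the previous state (old value, or absence) when the guard is dropped.
struct EnvVarGuard {
    key: String,
    previous: Option<OsString>,
}

impl EnvVarGuard {
    #[allow(unused_unsafe)]
    fn set(key: &str, value: &str) -> Self {
        let previous = env::var_os(key);
        // `set_var` is an unsafe fn in the 2024 edition because modifying the
        // environment is not thread-safe; the block is a no-op in older editions.
        unsafe { env::set_var(key, value) };
        Self { key: key.to_string(), previous }
    }
}

impl Drop for EnvVarGuard {
    #[allow(unused_unsafe)]
    fn drop(&mut self) {
        match &self.previous {
            Some(old) => unsafe { env::set_var(&self.key, old) },
            None => unsafe { env::remove_var(&self.key) },
        }
    }
}
```

In a test, binding the guard to a local (`let _guard = EnvVarGuard::set(...)`) keeps the override scoped to that block, even if the test panics.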
For testing, we use a mock CAS server instance running in a local
directory. This is a fully functional server, but currently used only
for testing. This PR has the regular config path detect local:// as a
prefix, allowing a directory to be passed with "local://<dir>" as the
endpoint to use for testing and simulation.
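The prefix detection could look roughly like the following. This is a sketch under the assumption that everything after `local://` is taken verbatim as the directory; `CasEndpoint` and `parse_endpoint` are hypothetical names, not the ones used in the config code:

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum CasEndpoint {
    /// A mock CAS server backed by a local directory.
    LocalDir(PathBuf),
    /// Any other endpoint, passed through unchanged.
    Remote(String),
}

fn parse_endpoint(endpoint: &str) -> CasEndpoint {
    match endpoint.strip_prefix("local://") {
        Some(dir) => CasEndpoint::LocalDir(PathBuf::from(dir)),
        None => CasEndpoint::Remote(endpoint.to_string()),
    }
}
```

With this shape, `local:///tmp/cas` resolves to the directory `/tmp/cas`, while an `https://` endpoint is left untouched.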
related to https://github.com/huggingface/xet-core/issues/547
When xet-core uploads a shard, the header field that specifies the
footer length is set to 200, whereas it should be 0 according to the
specification.
Note: this value is ignored by the server today, but ideally we would
set this right since it can be useful to know if there is a footer on
the shard when reading the shard as a whole in a non-streaming fashion.
Adds a User-Agent header when making requests to CAS.
* it is set to (project) / (version)
* version is picked from Cargo.toml
* project is the hf-xet crate or the git-xet crate (xtool is also
hard-coded)
The reason for this change is to add better observability on the server,
so we can segment requests by client and understand client versions in
the wild.
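A minimal sketch of building such a value, assuming the conventional `product/version` token form without spaces. Both function names are hypothetical; `option_env!` is used so the snippet also compiles outside Cargo:

```rust
/// Builds a User-Agent value such as "git-xet/0.1.0".
fn user_agent(project: &str, version: &str) -> String {
    format!("{project}/{version}")
}

/// The version Cargo records from Cargo.toml at build time, or "unknown"
/// when the code is compiled without Cargo.
fn crate_version() -> &'static str {
    option_env!("CARGO_PKG_VERSION").unwrap_or("unknown")
}
```

A client would then pass `user_agent("hf-xet", crate_version())` to whatever HTTP client builder it uses.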
This PR refactors how we pass the catch-all "OutputProvider" to the
download mechanism.
It separates the download system so that it supports "Sequential" and
"Seeking" operations:
- Seeking, e.g. opening the file multiple times and seeking to a location
-- this is the standard writing mechanism hf_xet uses today.
- Sequential, e.g. opening a file once and writing data in order -- this
is to be used in a set of upcoming PRs/features that use the
parallel-download/sequential-write mechanism to support writing to
stdout and to a channel buffer in memory.
To support an in-memory channel with backpressure, the Channel{Writer,
Stream, Reader} are introduced (re-introduced?) in utils. This could be
particularly useful in the mount functionality.
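To illustrate what "a channel with backpressure" means for a sequential writer, here is a sketch built on the standard library's bounded `sync_channel`. It is an assumption-laden stand-in: the real Channel{Writer, Stream, Reader} types in utils may be async and shaped differently, and `byte_channel` is a hypothetical helper.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// A `Write` implementation that forwards each written buffer over a
/// bounded channel.
struct ChannelWriter {
    tx: SyncSender<Vec<u8>>,
}

impl Write for ChannelWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // `send` blocks while the channel is full, so a slow reader
        // throttles the writer instead of letting memory grow unbounded.
        self.tx
            .send(buf.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "reader dropped"))?;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Creates a writer/receiver pair holding at most `capacity` pending buffers.
fn byte_channel(capacity: usize) -> (ChannelWriter, Receiver<Vec<u8>>) {
    let (tx, rx) = sync_channel(capacity);
    (ChannelWriter { tx }, rx)
}
```

The download side writes in order from one thread while the consumer drains the receiver from another; the channel capacity bounds how far the producer can run ahead.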
This PR adds a utility that rewrites a shard to include only the
relevant xorb information, dropping unreferenced file information.
In addition, to preserve the global dedup tracking information
associated with the files, this PR also adds a backwards-compatible flag
to the chunk metadata that marks a specific chunk as global dedup
eligible. This allows the global dedup information to be tracked
independently of the file metadata.
While investigating perf issues server-side, our usage of http2 with
internal systems led to a throughput bottleneck as all requests would
get multiplexed on a single connection. CAS (and the backing S3 storage)
are designed to be able to accept/produce large volumes of data that
should be spread across multiple connections (see [S3 perf
considerations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-scale)).
Although there are changes going into the backend services to ensure
http1.1 communication (i.e. to spread requests across separate
connections), it would be good to also have cas_client only use http1.1
when uploading and downloading to maximize network throughput.
Users need to run `git-xet install` after `brew install git-xet`. This
instruction is printed out at the end of the `brew install git-xet` step
but people may not pay attention to that.
Add python 3.13t and 3.14t to Release builds; they don't follow the standard ABI, i.e. "abi3", so they need to be built separately.
Update the build and the debug symbol generation and publishing workflow to allow different debug symbols for builds targeting different python versions.
Update the diagnosis scripts so they find the correct debug symbols.
---------
Co-authored-by: di <di@huggingface.co>
Fix build break due to git safety checks.
The recent logging PR added the automatic logging of the git repository
commit hash that was used to build the wheel. However, this requires
querying for this hash at build time, which requires running git. It
turns out that in the CI, the user checking out the repository and the
user building the wheel are different, which makes git upset. This PR
adds code to tell git everything will be ok.
---------
Co-authored-by: di <di@huggingface.co>
This PR switches the default logging to log events to a file in
'~/.cache/huggingface/xet/logs' (or 'xet/logs' under the specified cache
directory if it is not `~/.cache/huggingface/`).
In this directory, log files older than 2 weeks are cleaned up on
process start. If the total size of the files in the directory is larger
than 1 GB, log files are deleted by age until the directory size is
under 1 GB. Log files are named with a timestamp and PID; by default,
logs newer than 1 day or logs whose associated PID is still active are
never deleted. All of these are user-configurable constants.
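The cleanup policy can be expressed as a pure selection function. The sketch below assumes "by age" means oldest first and that the protection rule applies to both the age-based and the size-based pass; `LogFile` and `select_for_deletion` are hypothetical names, and the thresholds mirror the defaults described above:

```rust
use std::time::Duration;

/// Metadata the cleanup needs about one log file.
struct LogFile {
    name: String,
    age: Duration,
    size: u64,
    pid_active: bool,
}

const MAX_AGE: Duration = Duration::from_secs(14 * 24 * 3600); // 2 weeks
const MIN_DELETION_AGE: Duration = Duration::from_secs(24 * 3600); // 1 day

/// Returns the names of the files to delete, oldest first.
fn select_for_deletion(mut files: Vec<LogFile>, max_dir_size: u64) -> Vec<String> {
    // Visit the oldest files first.
    files.sort_by(|a, b| b.age.cmp(&a.age));
    let mut remaining: u64 = files.iter().map(|f| f.size).sum();
    let mut doomed = Vec::new();
    for file in &files {
        // Recent logs and logs of live processes are never deleted.
        if file.age < MIN_DELETION_AGE || file.pid_active {
            continue;
        }
        // Delete expired files, then keep deleting while over the size budget.
        if file.age > MAX_AGE || remaining > max_dir_size {
            remaining -= file.size;
            doomed.push(file.name.clone());
        }
    }
    doomed
}
```

Separating the decision from the actual file removal keeps the policy testable without creating files on disk.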
The current hub client does not pass the revision as an argument, which
causes the moon-landing call to append the `create_pr=1` query param to
the token API, which then returns a 403 error.
fix XET-741
Creates an OpenAPI specification for all CAS APIs following the first
version of the protocol specification.
Adds a Makefile to generate different language clients for the CAS APIs.
Fix python 3.14 build compat
1. `pyo3` dependency updated to `0.26`: this is required, otherwise it
can't be compiled for python 3.14
2. version update to `1.1.11-dev0`
```
# removed
Fix compat with Rustc 1.86.0
3. change rust conditions that were throwing `unstable` errors in `rustc 1.86.0 (05f9846f8 2025-03-31)` (fairly new version, not the latest)
```
@hoytak @seanses @assafvayner
This PR
- extends the git-xet release workflow to code-sign the Windows
executable "git-xet.exe" using the Microsoft Trusted Signing Service;
- builds a Windows installer for git-xet that places "git-xet.exe" in
the system, modifies the system PATH environment variable, and runs the
command "git-xet install" to configure git-xet. On uninstallation from
"Control Panel\Programs\Programs and Features", it first runs "git-xet
uninstall --all" so that git-xet is deregistered from git-lfs custom
transfer;
- signs the built Windows installer msi file.
Currently, the full message given to a function in error_printer must
always be created up front. This causes extra work when there are no
errors and the message contains additional data.
This PR introduces `_fn` versions of the existing functions that call a
given function on demand to obtain the message. Thus `.warn_error_fn(||
format!("Error processing {context}"))` would only allocate and create
the error string when an error needs to be logged.
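The idea can be sketched as an extension trait on `Result`. Only the method name `warn_error_fn` comes from the description above; the trait name, bounds and the use of `eprintln!` in place of the real logging call are assumptions:

```rust
use std::fmt::Debug;

/// Hypothetical extension trait; only the method name follows the PR.
trait ErrorPrinterSketch: Sized {
    fn warn_error_fn<F>(self, message_fn: F) -> Self
    where
        F: FnOnce() -> String;
}

impl<T, E: Debug> ErrorPrinterSketch for Result<T, E> {
    fn warn_error_fn<F>(self, message_fn: F) -> Self
    where
        F: FnOnce() -> String,
    {
        // The closure runs only on the error path, so the success path
        // pays nothing for formatting or allocation.
        if let Err(e) = &self {
            eprintln!("WARN: {}: {:?}", message_fn(), e);
        }
        self
    }
}
```

The result is passed through unchanged, so the call can be chained in the middle of an expression just like the eager variants.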
fix XET-681
XET protocol specification initial draft
- documentation of core procedures required for file uploads and
downloads
- format specifications for shards and xorbs