mirror of
https://github.com/huggingface/xet-core.git
synced 2026-06-04 13:30:29 +08:00
feat: smoke tests using hf CLI with bucket and large-file coverage (#710)
## Summary

- Rewrites smoke tests to drive everything through the `hf` CLI rather than the huggingface_hub Python API, covering the actual user-facing surface area of hf-xet
- Moves smoke tests and diagnostic scripts into a `scripts/` directory for cleaner repo layout
- Adds storage bucket test suite exercising the full bucket lifecycle
- Adds 50 MB and 100 MB files to repo upload/download tests

## Test matrix (14 tests, all passing)

**Repository tests** (`hf upload` / `hf download`)
- Upload single file, upload folder
- Download individual files + SHA-256 verify
- Download entire repo + SHA-256 verify
- Overwrite file and verify new content served
- Delete file and confirm absent

**Bucket tests** (`hf buckets`)
- `cp` upload / download + verify
- `sync` upload / download + verify
- Recursive list confirms expected paths
- Overwrite via `cp` + verify
- `sync --delete` removes extraneous remote files
- `rm` + confirm absent from listing

## Test plan

- [x] Run `HF_TOKEN=... ./scripts/smoke_tests/run.sh` and confirm all 14 tests pass
- [x] Run `./scripts/smoke_tests/run.sh --skip-buckets` for repo-only path
- [x] Run with `--hf-xet-version <version>` to confirm PyPI cache bypass works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
182
README.md
@@ -43,180 +43,34 @@ Please join us in making xet-core better. We value everyone's contributions. Cod
|
||||
|
||||
## Issues, Diagnostics & Debugging
|
||||
|
||||
If you encounter an issue when using `hf-xet`, please help us fix it by collecting diagnostic information and attaching it when creating a [new Issue](https://github.com/huggingface/xet-core/issues/new/choose). Download the [hf-xet-diag-linux.sh](hf-xet-diag-linux.sh), [hf-xet-diag-macos.sh](hf-xet-diag-macos.sh), or [hf-xet-diag-windows.sh](hf-xet-diag-windows.sh) script based on your operating system and then re-run the Python command that resulted in the issue. The diagnostic scripts will download and install debug symbols, set up logging, and take periodic stack traces throughout process execution in a diagnostics directory that is easy to analyze, package, and upload.
|
||||
If you encounter an issue with `hf-xet`, please collect diagnostic information
|
||||
and attach it when creating a [new Issue](https://github.com/huggingface/xet-core/issues/new/choose).
|
||||
|
||||
### Diagnostics - Linux (`hf-xet-diag-linux.sh`)
|
||||
The [`scripts/diag/`](scripts/diag/) directory contains platform-specific scripts
|
||||
that download debug symbols, configure logging, and capture periodic stack traces
|
||||
and core dumps:
|
||||
|
||||
* Uses `gdb` + `gcore` to periodically snapshot stacks and produce core dumps.
|
||||
* Supports optional ptrace preload helper for debugging.
|
||||
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.
|
||||
|
||||
**Requirements:**
|
||||
| OS | Script |
|
||||
|----|--------|
|
||||
| Linux | [`scripts/diag/hf-xet-diag-linux.sh`](scripts/diag/hf-xet-diag-linux.sh) |
|
||||
| macOS | [`scripts/diag/hf-xet-diag-macos.sh`](scripts/diag/hf-xet-diag-macos.sh) |
|
||||
| Windows (Git-Bash) | [`scripts/diag/hf-xet-diag-windows.sh`](scripts/diag/hf-xet-diag-windows.sh) |
|
||||
|
||||
```bash
|
||||
sudo apt-get install gdb build-essential
|
||||
# prefix your failing command with the script for your OS, e.g.:
|
||||
./scripts/diag/hf-xet-diag-macos.sh -- python my-script.py
|
||||
```
|
||||
|
||||
**Example usage:**
|
||||
See [**scripts/diag/README.md**](scripts/diag/README.md) for full usage, output layout, dump analysis instructions, and how to install debug symbols manually.
|
||||
|
||||
Quick debugging environment variables:
|
||||
|
||||
```bash
|
||||
./hf-xet-diag-linux.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
|
||||
RUST_BACKTRACE=full # full Rust backtraces on panic
|
||||
RUST_LOG=info # enable hf-xet logging
|
||||
HF_XET_LOG_FILE=/tmp/xet.log # write logs to a file (defaults to stdout)
|
||||
```
|
||||
|
||||
### Windows (Git-Bash) (`hf-xet-diag-windows.sh`)
|
||||
|
||||
* Runs in **Git-Bash**, keeping usage consistent with Linux.
|
||||
* Uses **Sysinternals ProcDump** for periodic mini dumps (`-mp`).
|
||||
* Auto-downloads `procdump.exe` if not found.
|
||||
* Downloads and installs the matching `hf_xet.pdb` debug symbol into the package directory.
|
||||
|
||||
**Requirements:**
|
||||
|
||||
* Git-Bash (from [Git for Windows](https://gitforwindows.org/))
|
||||
* Python installed
|
||||
* Internet access (first run downloads ProcDump and debug symbols)
|
||||
|
||||
**Example usage:**
|
||||
|
||||
```bash
|
||||
./hf-xet-diag-windows.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
|
||||
```
|
||||
|
||||
### Diagnostics - MacOS (`hf-xet-diag-macos.sh`)
|
||||
|
||||
* Uses `sample` + `lldb` to periodically snapshot stacks and produce core dumps.
|
||||
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.
|
||||
|
||||
**Requirements:**
|
||||
|
||||
```bash
|
||||
sudo xcode-select --install
|
||||
```
|
||||
|
||||
**Example usage:**
|
||||
|
||||
```bash
|
||||
./hf-xet-diag-macos.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Output Layout
|
||||
|
||||
The diagnostic scripts produce a diagnostics directory named:
|
||||
|
||||
```
|
||||
diag_<command>_<timestamp>/
|
||||
├── console.log # Combined stdout/stderr of the process
|
||||
├── env.log # System/environment info
|
||||
├── pid # Child PID file
|
||||
├── stacks/ # Periodic stack traces / dumps
|
||||
└── dumps/ # (Linux only) full gcore dumps
|
||||
```
|
||||
|
||||
This unified layout makes it easier to compare diagnostics across platforms.
|
||||
|
||||
---
|
||||
|
||||
### Analyzing Dumps
|
||||
|
||||
Use the [hf-xet-diag-analyze-latest.sh](hf-xet-diag-analyze-latest.sh) script to automatically find and open the most recent dump in the appropriate debugger for your platform.
|
||||
|
||||
**Usage:**
|
||||
|
||||
```bash
|
||||
./hf-xet-diag-analyze-latest.sh
|
||||
```
|
||||
|
||||
* Auto-detects your OS (Linux, macOS, or Windows)
|
||||
* Finds the most recent `diag_*` directory
|
||||
* Opens the latest dump in the platform-appropriate debugger:
|
||||
* **Linux:** `gdb` with core dumps from `dumps/`
|
||||
* **macOS:** `lldb` with `.core` files from `dumps/`
|
||||
* **Windows (Git-Bash):** `windbg` with `.dmp` files from `stacks/`
|
||||
|
||||
You can also specify a diagnostics directory:
|
||||
|
||||
```bash
|
||||
./hf-xet-diag-analyze-latest.sh diag_python_hfxet_test_20250127120000
|
||||
```
|
||||
|
||||
**Manual Analysis**
|
||||
|
||||
If you prefer to analyze dumps manually:
|
||||
|
||||
**Linux**
|
||||
* Stack traces: `stacks/*.txt` (plain text, captured periodically)
|
||||
* Core dumps: `dumps/core_*`
|
||||
* Analysis:
|
||||
```bash
|
||||
gdb python dumps/core_<timestamp>.<pid>
|
||||
(gdb) bt # backtrace of current thread
|
||||
(gdb) thread apply all bt # backtrace of all threads
|
||||
(gdb) info threads # list all threads
|
||||
```
|
||||
* Ensure debug symbols (`hf_xet-*.so.dbg`) are in the `hf_xet` package directory
|
||||
|
||||
**macOS**
|
||||
* Stack traces: `stacks/*.txt` (from `sample` command)
|
||||
* Core dumps: `dumps/dump_<pid>_<timestamp>.core`
|
||||
* Analysis:
|
||||
```bash
|
||||
lldb -c dumps/dump_<pid>_<timestamp>.core python3
|
||||
(lldb) bt # backtrace of current thread
|
||||
(lldb) thread backtrace all # backtrace of all threads
|
||||
(lldb) thread list # list all threads
|
||||
```
|
||||
* Ensure debug symbols (`hf_xet-*.dylib.dSYM`) are in the `hf_xet` package directory
|
||||
|
||||
**Windows**
|
||||
* Dumps: `stacks/dump_<timestamp>.dmp`
|
||||
* Install [WinDbg via Windows SDK](https://developer.microsoft.com/en-us/windows/downloads/windows-sdk/)
|
||||
* Analysis:
|
||||
```cmd
|
||||
windbg -z stacks\dump_<timestamp>.dmp
|
||||
```
|
||||
* Common WinDbg commands:
|
||||
```
|
||||
!analyze -v # automatic analysis
|
||||
~* kb # backtrace of all threads
|
||||
~ # list all threads
|
||||
lm # list loaded modules (verify hf_xet.pdb loaded)
|
||||
```
|
||||
* Ensure debug symbols (`hf_xet.pdb`) are in the `hf_xet` package directory
|
||||
|
||||
---
|
||||
|
||||
⚠️ **Tip:** Share the full `diag_<command>_<timestamp>/` directory when reporting issues — it contains logs, environment info, and dumps needed to reproduce and diagnose problems.
|
||||
|
||||
|
||||
### Debugging
|
||||
|
||||
To limit the size of our built binaries, we release Python wheels with binaries that are stripped of debugging symbols. If you encounter a panic while running hf-xet, you can use the debug symbols to help identify the part of the library that failed.
|
||||
|
||||
Here are the recommended steps:
|
||||
|
||||
1. Download and unzip our [debug symbols package](https://github.com/huggingface/xet-core/releases/download/latest/dbg-symbols.zip).
|
||||
2. Determine the location of the hf-xet package using `pip show hf-xet`. The `Location` field will show the location of all the site packages. The `hf_xet` package will be within that directory.
|
||||
3. Determine the symbols to copy based on the system you are running:
|
||||
* Windows: use `hf_xet.pdb`
|
||||
* Mac: use `libhf_xet-macosx-x86_64.dylib.dSYM` for Intel based Macs and `libhf_xet-macosx-aarch64.dylib.dSYM` for Apple Silicon.
|
||||
* Linux: the choice will depend on the architecture and wheel distribution used. To get this information, `cat` the `WHEEL` file name within the `hf_xet.dist-info` directory in your site packages. The wheel file will have the linux build and architecture in the file name. Eg: `cat /home/ubuntu/.venv/lib/python3.12/site-packages/hf_xet-*.dist-info/WHEEL`. You will use the file named `hf_xet-<manylinux | musllinux>-<x86_64 | arm64>.abi3.so.dbg` choosing the distribution and platform that matches your wheel. Eg: `hf_xet-manylinux-x86_64.abi3.so.dbg`.
|
||||
4. Copy the symbols to the site package path from step 2 above + `hf_xet`. Eg: `cp -r hf_xet-1.1.2-manylinux-x86_64.abi3.so.dbg /home/ubuntu/.venv/lib/python3.12/site-packages/hf_xet`
|
||||
5. Run your python binary with `RUST_BACKTRACE=full` and recreate your failure.
|
||||
|
||||
#### Debugging Environment Variables
|
||||
|
||||
To enable logging and see more debugging / diagnostics information, set the following:
|
||||
|
||||
```
|
||||
RUST_BACKTRACE=full
|
||||
RUST_LOG=info
|
||||
HF_XET_LOG_FILE=/tmp/xet.log
|
||||
```
|
||||
|
||||
Note: HF_XET_LOG_FILE expects a full writable path. If one isn't found it will use stdout console for logging.
|
||||
|
||||
## Local Development
|
||||
|
||||
### Repo Organization - Rust Crates
|
||||
|
||||
193
scripts/diag/README.md
Normal file
@@ -0,0 +1,193 @@
# hf-xet Diagnostic Scripts

Scripts for collecting diagnostics when `hf-xet` hangs, crashes, or behaves
unexpectedly. They download debug symbols, configure logging, and periodically
capture stack traces / core dumps into a self-contained directory that is easy
to zip and attach to a [GitHub issue](https://github.com/huggingface/xet-core/issues/new/choose).

## Quick start

Pick the script for your OS and prefix your failing command with it:

| OS | Script |
|----|--------|
| Linux | `scripts/diag/hf-xet-diag-linux.sh` |
| macOS | `scripts/diag/hf-xet-diag-macos.sh` |
| Windows (Git-Bash) | `scripts/diag/hf-xet-diag-windows.sh` |

```bash
# Linux
./scripts/diag/hf-xet-diag-linux.sh -- python my-script.py

# macOS
./scripts/diag/hf-xet-diag-macos.sh -- python my-script.py

# Windows (Git-Bash)
./scripts/diag/hf-xet-diag-windows.sh -- python my-script.py
```

## Per-platform details

### Linux (`hf-xet-diag-linux.sh`)

* Uses `gdb` + `gcore` to periodically snapshot stacks and produce core dumps.
* Supports optional ptrace preload helper for debugging.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.

**Requirements:**

```bash
sudo apt-get install gdb build-essential
```

**Example:**

```bash
./scripts/diag/hf-xet-diag-linux.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```

### macOS (`hf-xet-diag-macos.sh`)

* Uses `sample` + `lldb` to periodically snapshot stacks and produce core dumps.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.

**Requirements:**

```bash
sudo xcode-select --install
```

**Example:**

```bash
./scripts/diag/hf-xet-diag-macos.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```

### Windows / Git-Bash (`hf-xet-diag-windows.sh`)

* Runs in **Git-Bash**, keeping usage consistent with Linux/macOS.
* Uses **Sysinternals ProcDump** for periodic mini dumps (`-mp`).
* Auto-downloads `procdump.exe` if not found.
* Downloads and installs the matching `hf_xet.pdb` debug symbol into the package directory.

**Requirements:**

* Git-Bash (from [Git for Windows](https://gitforwindows.org/))
* Python installed
* Internet access (first run downloads ProcDump and debug symbols)

**Example:**

```bash
./scripts/diag/hf-xet-diag-windows.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```

## Output layout

Each run produces a directory named `diag_<command>_<timestamp>/`:

```
diag_<command>_<timestamp>/
├── console.log   # Combined stdout/stderr of the process
├── env.log       # System/environment info
├── pid           # Child PID file
├── stacks/       # Periodic stack traces / mini dumps
└── dumps/        # (Linux/macOS) full core dumps
```

> **Tip:** Zip and attach the entire `diag_<command>_<timestamp>/` directory
> when filing an issue — it contains everything needed to reproduce and diagnose
> the problem.

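Packaging the directory for an issue can be scripted. The sketch below is a hypothetical helper, not something the diagnostic scripts ship: the name `zip_diag_dir` is an assumption for illustration, and it relies only on the standard library's `shutil.make_archive`.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def zip_diag_dir(diag_dir: str | Path) -> Path:
    """Zip a diag_<command>_<timestamp>/ directory next to itself.

    Hypothetical helper for illustration; not part of scripts/diag/.
    """
    diag_dir = Path(diag_dir).resolve()
    if not diag_dir.is_dir():
        raise NotADirectoryError(str(diag_dir))
    # make_archive appends ".zip" to base_name and returns the archive path.
    archive = shutil.make_archive(
        base_name=str(diag_dir),         # archive lands beside the directory
        format="zip",
        root_dir=str(diag_dir.parent),
        base_dir=diag_dir.name,          # keep the top-level directory in the zip
    )
    return Path(archive)
```

Calling `zip_diag_dir("diag_python_hfxet_test_20250127120000")` would produce a `.zip` file beside that directory, with the directory name preserved as the top-level entry so logs, stacks, and dumps stay grouped when unpacked.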
## Analyzing dumps

Use `hf-xet-diag-analyze-latest.sh` to automatically open the most recent dump
in the appropriate debugger:

```bash
# Auto-detect latest diag_* directory
./scripts/diag/hf-xet-diag-analyze-latest.sh

# Or specify a directory explicitly
./scripts/diag/hf-xet-diag-analyze-latest.sh diag_python_hfxet_test_20250127120000
```

The script:
* Auto-detects your OS (Linux, macOS, or Windows)
* Finds the most recent `diag_*` directory
* Opens the latest dump in the platform-appropriate debugger:
  * **Linux:** `gdb` with core dumps from `dumps/`
  * **macOS:** `lldb` with `.core` files from `dumps/`
  * **Windows (Git-Bash):** `windbg` with `.dmp` files from `stacks/`

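To illustrate the "most recent `diag_*` directory" step, here is a minimal Python sketch. It assumes "most recent" means the latest modification time; the actual selection logic inside `hf-xet-diag-analyze-latest.sh` is not shown here and may differ (for example, it could sort by the timestamp in the directory name). The function name `latest_diag_dir` is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def latest_diag_dir(search_root: str | Path = ".") -> Path | None:
    """Return the most recently modified diag_* directory, or None if none exist.

    Assumption: recency is judged by mtime, not by the name's timestamp.
    """
    candidates = [p for p in Path(search_root).glob("diag_*") if p.is_dir()]
    # max() with default=None avoids a ValueError when there are no candidates.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

Plain files that happen to match `diag_*` are ignored, so a stray `diag_notes.txt` in the working directory would not be picked up.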
### Manual analysis

**Linux**

```bash
gdb python dumps/core_<timestamp>.<pid>
(gdb) bt                    # backtrace of current thread
(gdb) thread apply all bt   # backtrace of all threads
(gdb) info threads          # list all threads
```

Debug symbols: `hf_xet-*.so.dbg` must be in the `hf_xet` package directory.

**macOS**

```bash
lldb -c dumps/dump_<pid>_<timestamp>.core python3
(lldb) bt                     # backtrace of current thread
(lldb) thread backtrace all   # backtrace of all threads
(lldb) thread list            # list all threads
```

Debug symbols: `hf_xet-*.dylib.dSYM` must be in the `hf_xet` package directory.

**Windows**

```cmd
windbg -z stacks\dump_<timestamp>.dmp
```

Useful WinDbg commands:

```
!analyze -v   # automatic analysis
~* kb         # backtrace of all threads
~             # list all threads
lm            # list loaded modules (verify hf_xet.pdb loaded)
```

Debug symbols: `hf_xet.pdb` must be in the `hf_xet` package directory.

## Installing debug symbols manually

The diagnostic scripts install symbols automatically, but you can also do it
manually:

1. Download and unzip the [debug symbols package](https://github.com/huggingface/xet-core/releases/download/latest/dbg-symbols.zip).
2. Find the `hf_xet` package location: `pip show hf-xet` — look at the `Location` field.
3. Choose the right symbol file for your platform:
   * **Windows:** `hf_xet.pdb`
   * **macOS (Apple Silicon):** `libhf_xet-macosx-aarch64.dylib.dSYM`
   * **macOS (Intel):** `libhf_xet-macosx-x86_64.dylib.dSYM`
   * **Linux:** match your wheel distribution and arch — check with:
     ```bash
     cat /path/to/site-packages/hf_xet-*.dist-info/WHEEL
     ```
     then use `hf_xet-<manylinux|musllinux>-<x86_64|arm64>.abi3.so.dbg`.
4. Copy the symbol file into the `hf_xet` package directory:
   ```bash
   cp -r hf_xet-1.1.2-manylinux-x86_64.abi3.so.dbg \
     /path/to/site-packages/hf_xet/
   ```
5. Re-run with `RUST_BACKTRACE=full` to get a full backtrace.

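The Linux choice in step 3 amounts to reading the platform tag from the `WHEEL` file and mapping it onto the symbol file pattern. The sketch below shows one way to do that in Python. The function name `linux_symbol_name` is hypothetical, the mapping of an `aarch64` platform tag to the `arm64` file name is an assumption, and the returned name follows the pattern from step 3 (files in an actual symbols package may carry extra components such as a version).

```python
from __future__ import annotations


def linux_symbol_name(wheel_text: str) -> str | None:
    """Map the `Tag:` lines of a WHEEL file to a debug-symbol file name.

    Returns None when no Linux platform tag is recognised.
    """
    for line in wheel_text.splitlines():
        if not line.startswith("Tag:"):
            continue
        # A tag looks like "<python>-<abi>-<platform>", e.g. cp37-abi3-manylinux_2_17_x86_64
        platform_tag = line.split(":", 1)[1].strip().split("-")[-1]
        libc = next(
            (name for name in ("manylinux", "musllinux") if platform_tag.startswith(name)),
            None,
        )
        if libc is None:
            continue
        if platform_tag.endswith("x86_64"):
            arch = "x86_64"
        elif platform_tag.endswith(("aarch64", "arm64")):
            arch = "arm64"  # assumption: aarch64 wheels pair with the arm64 symbols
        else:
            continue
        return f"hf_xet-{libc}-{arch}.abi3.so.dbg"
    return None
```

The `WHEEL` text can be obtained with the `cat` command from step 3, or from Python via `importlib.metadata.distribution("hf_xet").read_text("WHEEL")`.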
## Useful environment variables

```bash
RUST_BACKTRACE=full            # full Rust backtraces on panic
RUST_LOG=info                  # enable hf-xet logging
HF_XET_LOG_FILE=/tmp/xet.log   # write logs to a file (defaults to stdout)
```
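When the failing command is launched from another Python program rather than a shell, the same variables can be passed through the child's environment. This is a minimal sketch; the helper name `debug_env` is hypothetical, and only the three variables listed above are taken from this document.

```python
from __future__ import annotations

import os


def debug_env(log_file: str | None = None, base: dict[str, str] | None = None) -> dict[str, str]:
    """Return a copy of the environment with the debugging variables set.

    Hypothetical helper; `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["RUST_BACKTRACE"] = "full"   # full Rust backtraces on panic
    env["RUST_LOG"] = "info"         # enable hf-xet logging
    if log_file:
        env["HF_XET_LOG_FILE"] = log_file  # omit to keep logging on stdout
    return env
```

The result can be handed to `subprocess.run([...], env=debug_env("/tmp/xet.log"))` so that only the child process sees the extra variables.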
63
scripts/smoke_tests/README.md
Normal file
@@ -0,0 +1,63 @@
# hf-xet Smoke Tests

End-to-end tests that exercise the full hf-xet upload/download path against the
real HuggingFace Hub. They use the `hf` CLI for all Hub operations and verify
content integrity with SHA-256 checksums.

## Prerequisites

- [`uv`](https://docs.astral.sh/uv/)
- [`hf` CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) — `uv tool install huggingface_hub`
- `HF_TOKEN` environment variable with write access

## Usage

```bash
# Test latest hf_xet from PyPI
./scripts/smoke_tests/run.sh

# Test a specific version (bypasses uv cache, fetches directly from PyPI)
./scripts/smoke_tests/run.sh --hf-xet-version 1.4.0

# Test a local wheel
HF_XET_WHEEL=./dist/hf_xet-1.4.0.whl ./scripts/smoke_tests/run.sh

# Skip storage bucket tests
./scripts/smoke_tests/run.sh --skip-buckets

# Keep the test repo/bucket after the run (useful for debugging)
./scripts/smoke_tests/run.sh --keep-repo
```

## What's tested

### Repository tests (`hf upload` / `hf download`)

Uploads test files of varying sizes (1 KB → 100 MB) to a temporary private
model repo, then downloads and verifies every file's SHA-256 hash.

| Test | Description |
|------|-------------|
| Upload single file | `hf upload` of a single file |
| Upload folder | `hf upload` of an entire directory tree |
| Download individual files | Per-file `hf download` + hash check |
| Download all files | Full-repo `hf download` + hash check |
| Overwrite and re-download | Confirms updated content is served after overwrite |
| Delete file | `hf repos delete-files` + confirms file is absent |

### Bucket tests (`hf buckets`)

Creates a temporary private bucket and exercises the full bucket lifecycle.

| Test | Description |
|------|-------------|
| cp upload | `hf buckets cp` single file upload |
| sync upload | `hf buckets sync` directory upload |
| list | Recursive listing confirms all expected paths |
| cp download | `hf buckets cp` download + hash check |
| sync download | `hf buckets sync` directory download + hash check |
| Overwrite | cp overwrite + re-download confirms new content |
| sync --delete | Extraneous remote files removed when absent locally |
| rm | `hf buckets rm` + confirms file absent from listing |

All temporary repos and buckets are deleted after the run unless `--keep-repo` is set.
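The `sync --delete` row checks a set-difference property: any remote path with no local counterpart should disappear. The sketch below expresses that expectation in Python. It is only an illustration of what the test asserts, not how `hf buckets sync` is implemented, and the function name `extraneous_remote_paths` is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def extraneous_remote_paths(local_root: str | Path, remote_paths: list[str]) -> list[str]:
    """Return remote paths that have no matching file under local_root.

    Paths are compared as POSIX-style strings relative to local_root.
    """
    root = Path(local_root)
    local = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    return sorted(set(remote_paths) - local)
```

After a `sync --delete` upload, a recursive listing of the bucket should contain none of the paths this function returns for the pre-sync listing.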
56
scripts/smoke_tests/run.sh
Executable file
@@ -0,0 +1,56 @@
#!/bin/bash
set -euo pipefail

# Smoke test runner for hf-xet upload/download via the hf CLI.
#
# Prerequisites:
#   - uv (https://docs.astral.sh/uv/)
#   - hf CLI (pip install huggingface_hub, or uv tool install huggingface_hub)
#   - HF_TOKEN env var with write access
#
# Usage:
#   ./scripts/smoke_tests/run.sh                          # latest hf_xet from PyPI
#   ./scripts/smoke_tests/run.sh --hf-xet-version 1.4.0   # specific version (bypasses uv cache)
#   ./scripts/smoke_tests/run.sh --skip-buckets           # skip storage bucket tests
#   ./scripts/smoke_tests/run.sh --keep-repo              # leave test repo/bucket after run
#   HF_XET_WHEEL=./dist/hf_xet-1.4.0.whl ./scripts/smoke_tests/run.sh   # local wheel

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

if [ -z "${HF_TOKEN:-}" ]; then
    echo "ERROR: HF_TOKEN environment variable is required" >&2
    echo "  export HF_TOKEN=hf_..." >&2
    exit 1
fi

if ! command -v uv &> /dev/null; then
    echo "ERROR: uv is required. Install: curl -LsSf https://astral.sh/uv/install.sh | sh" >&2
    exit 1
fi

if ! command -v hf &> /dev/null; then
    echo "ERROR: hf CLI is required. Install: uv tool install huggingface_hub" >&2
    exit 1
fi

# Parse --hf-xet-version from args (if present)
HF_XET_VERSION=""
for arg in "$@"; do
    if [[ "${prev_arg:-}" == "--hf-xet-version" ]]; then
        HF_XET_VERSION="$arg"
    fi
    prev_arg="$arg"
done

echo "Running hf-xet smoke tests..."
echo ""

if [ -n "${HF_XET_WHEEL:-}" ]; then
    echo "Using local wheel: ${HF_XET_WHEEL}"
    uv run --with "${HF_XET_WHEEL}" "${SCRIPT_DIR}/test_upload_download.py" "$@"
elif [ -n "${HF_XET_VERSION}" ]; then
    echo "Using hf_xet version: ${HF_XET_VERSION} (fetching from PyPI)"
    uv run --with "hf_xet==${HF_XET_VERSION}" --refresh-package hf_xet "${SCRIPT_DIR}/test_upload_download.py" "$@"
else
    uv run "${SCRIPT_DIR}/test_upload_download.py" "$@"
fi
410
scripts/smoke_tests/test_upload_download.py
Normal file
@@ -0,0 +1,410 @@
"""
Smoke test for hf-xet using the `hf` CLI for upload/download through both
HuggingFace model repositories and storage buckets.

Creates temporary resources, exercises upload/download paths, verifies content
integrity, then cleans up. Requires HF_TOKEN with write access.

Usage:
    uv run scripts/smoke_tests/test_upload_download.py
    uv run scripts/smoke_tests/test_upload_download.py --hf-xet-version 1.4.0
    uv run scripts/smoke_tests/test_upload_download.py --keep-repo
    uv run scripts/smoke_tests/test_upload_download.py --skip-buckets
"""

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "huggingface_hub>=1.0.0",
#     "hf_xet",
# ]
# ///

import argparse
import hashlib
import os
import secrets
import shutil
import subprocess
import sys
import tempfile
import time
from pathlib import Path


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def run(cmd: list[str], check: bool = True) -> str:
    """Run a CLI command, return stdout. Raises RuntimeError on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if check and result.returncode != 0:
        raise RuntimeError(
            f"Command failed: {' '.join(cmd)}\n"
            f"stdout: {result.stdout.strip()}\n"
            f"stderr: {result.stderr.strip()}"
        )
    return result.stdout.strip()


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sha256_file(path: str | Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def generate_file(path: str | Path, size_bytes: int) -> str:
    """Write random bytes to path; return sha256 hex."""
    data = secrets.token_bytes(size_bytes)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return sha256_bytes(data)


# ---------------------------------------------------------------------------
# Test runner
# ---------------------------------------------------------------------------

class Results:
    def __init__(self):
        self.passed = 0
        self.failed = 0
        self.errors: list[tuple[str, str]] = []

    def run(self, name: str, fn):
        print(f"\n{'='*60}")
        print(f"TEST: {name}")
        print(f"{'='*60}")
        try:
            fn()
            self.passed += 1
            print(f"PASSED: {name}")
        except Exception as e:
            self.failed += 1
            self.errors.append((name, str(e)))
            print(f"FAILED: {name}: {e}", file=sys.stderr)

    def summary(self):
        print(f"\n{'='*60}")
        print(f"RESULTS: {self.passed} passed, {self.failed} failed")
        print(f"{'='*60}")
        if self.errors:
            for name, err in self.errors:
                print(f"  FAIL: {name}: {err}")
            sys.exit(1)
        else:
            print("All smoke tests passed!")


|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Smoke test hf-xet via hf CLI")
|
||||
parser.add_argument("--hf-xet-version", help="Expected hf_xet version (display/warn only)")
|
||||
parser.add_argument("--keep-repo", action="store_true", help="Skip cleanup of test repo/bucket")
|
||||
parser.add_argument("--repo-prefix", default="smoke-test-xet", help="Prefix for temp resource names")
|
||||
parser.add_argument("--skip-buckets", action="store_true", help="Skip storage bucket tests")
|
||||
args = parser.parse_args()
|
||||
|
||||
# --- preflight checks ---
|
||||
token = os.environ.get("HF_TOKEN")
|
||||
if not token:
|
||||
print("ERROR: HF_TOKEN environment variable is required", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
if not shutil.which("hf"):
|
||||
print("ERROR: hf CLI not found. Install: pip install huggingface_hub", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# --- print environment ---
|
||||
import huggingface_hub
|
||||
print(f"huggingface_hub version: {huggingface_hub.__version__}")
|
||||
try:
|
||||
from importlib.metadata import version as pkg_version
|
||||
installed_xet = pkg_version("hf_xet")
|
||||
print(f"hf_xet version: {installed_xet}")
|
||||
if args.hf_xet_version and installed_xet != args.hf_xet_version:
|
||||
print(f"WARNING: hf_xet version mismatch: got {installed_xet}, expected {args.hf_xet_version}")
|
||||
except Exception:
|
||||
print("hf_xet version: unknown")
|
||||
print(f"hf CLI: {run(['hf', 'version'])}")
|
||||
|
||||
# --- resolve username ---
|
||||
from huggingface_hub import HfApi
|
||||
user = HfApi(token=token).whoami()["name"]
|
||||
|
||||
suffix = secrets.token_hex(4)
|
||||
repo_id = f"{user}/{args.repo_prefix}-{suffix}"
|
||||
bucket_id = f"{user}/{args.repo_prefix}-bucket-{suffix}"
|
||||
|
||||
print(f"\nTest repo: {repo_id}")
|
||||
if not args.skip_buckets:
|
||||
print(f"Test bucket: {bucket_id}")
|
||||
|
||||
results = Results()
|
||||
|
||||
# ===================================================================== #
|
||||
# Repository tests (hf upload / hf download)
|
||||
# ===================================================================== #
|
||||
|
||||
repo_test_files = {
|
||||
"small.bin": 1024, # 1 KB — below chunk size
|
||||
"medium.bin": 256 * 1024, # 256 KB — a few chunks
|
||||
"large.bin": 5 * 1024 * 1024, # 5 MB — multiple chunks
|
||||
"xlarge.bin": 50 * 1024 * 1024,# 50 MB — large multi-xorb
|
||||
"xxlarge.bin": 100 * 1024 * 1024,# 100 MB — stress test
|
||||
"subdir/nested.bin": 128 * 1024, # 128 KB — subdirectory
|
||||
}
|
||||
|
||||
repo_created = False
|
||||
try:
|
||||
print(f"\nCreating repo {repo_id}...")
|
||||
run(["hf", "repos", "create", repo_id, "--private"])
|
||||
repo_created = True
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
upload_dir = Path(tmpdir) / "upload"
|
||||
download_dir = Path(tmpdir) / "download"
|
||||
upload_dir.mkdir()
|
||||
download_dir.mkdir()
|
||||
|
||||
expected = {}
|
||||
for rel, size in repo_test_files.items():
|
||||
expected[rel] = generate_file(upload_dir / rel, size)
|
||||
print(f" Generated {rel} ({size:,} bytes)")
|
||||
|
||||
# -- 1. upload single file --
|
||||
def test_repo_upload_single():
|
||||
t = time.time()
|
||||
run(["hf", "upload", repo_id,
|
||||
str(upload_dir / "small.bin"), "small.bin", "--quiet"])
|
||||
print(f" Uploaded small.bin in {time.time()-t:.1f}s")
|
||||
results.run("Repo: upload single file", test_repo_upload_single)
|
||||
|
||||
# -- 2. upload folder --
|
||||
def test_repo_upload_folder():
|
||||
t = time.time()
|
||||
run(["hf", "upload", repo_id, str(upload_dir), ".", "--quiet"])
|
||||
print(f" Uploaded folder in {time.time()-t:.1f}s")
|
||||
results.run("Repo: upload folder", test_repo_upload_folder)
|
||||
|
||||
# -- 3. download individual files and verify --
|
||||
def test_repo_download_single():
|
||||
out = str(download_dir / "single")
|
||||
for rel in repo_test_files:
|
||||
t = time.time()
|
||||
run(["hf", "download", repo_id, rel, "--local-dir", out, "--quiet"])
|
||||
actual = sha256_file(Path(out) / rel)
|
||||
assert actual == expected[rel], (
|
||||
f"Hash mismatch for {rel}: "
|
||||
f"expected {expected[rel][:16]}..., got {actual[:16]}..."
|
||||
)
|
||||
print(f" Downloaded+verified {rel} in {time.time()-t:.1f}s")
|
||||
results.run("Repo: download and verify individual files", test_repo_download_single)
|
||||
|
||||
# -- 4. download entire repo and verify --
|
||||
def test_repo_download_all():
|
||||
out = str(download_dir / "all")
|
||||
t = time.time()
|
||||
run(["hf", "download", repo_id, "--local-dir", out, "--quiet"])
|
||||
print(f" Downloaded all files in {time.time()-t:.1f}s")
|
||||
for rel in repo_test_files:
|
||||
p = Path(out) / rel
|
||||
assert p.exists(), f"Missing file: {rel}"
|
||||
actual = sha256_file(p)
|
||||
assert actual == expected[rel], (
|
||||
f"Hash mismatch for {rel}: "
|
||||
f"expected {expected[rel][:16]}..., got {actual[:16]}..."
|
||||
)
|
||||
print(f" Verified {rel}")
|
||||
results.run("Repo: download all files and verify", test_repo_download_all)
|
||||
|
||||
            # -- 5. overwrite a file and verify new content --
            def test_repo_overwrite():
                new_hash = generate_file(upload_dir / "small.bin", 2048)
                run(["hf", "upload", repo_id,
                     str(upload_dir / "small.bin"), "small.bin", "--quiet"])
                out = str(download_dir / "overwrite")
                run(["hf", "download", repo_id, "small.bin",
                     "--local-dir", out, "--force-download", "--quiet"])
                actual = sha256_file(Path(out) / "small.bin")
                assert actual == new_hash, (
                    f"Overwrite mismatch: expected {new_hash[:16]}..., got {actual[:16]}..."
                )
                print(" Overwrite verified: new content downloaded correctly")
            results.run("Repo: upload overwrite and verify", test_repo_overwrite)

            # -- 6. delete files from repo --
            def test_repo_delete_files():
                run(["hf", "repos", "delete-files", repo_id, "small.bin"])
                # Re-download all; small.bin should be absent
                out = str(download_dir / "post-delete")
                run(["hf", "download", repo_id, "--local-dir", out, "--quiet"])
                assert not (Path(out) / "small.bin").exists(), \
                    "small.bin still present after deletion"
                print(" small.bin confirmed absent after delete")
            results.run("Repo: delete file from repo", test_repo_delete_files)

    finally:
        if repo_created and not args.keep_repo:
            print(f"\nCleaning up repo {repo_id}...")
            try:
                run(["hf", "repos", "delete", repo_id])
                print(" Deleted.")
            except Exception as e:
                print(f" Warning: failed to delete repo: {e}", file=sys.stderr)

    # ===================================================================== #
    # Storage bucket tests (hf buckets)
    # ===================================================================== #

    if args.skip_buckets:
        results.summary()
        return

    # Check that hf buckets is available
    bucket_check = run(["hf", "buckets", "--help"], check=False).lower()
    # Skip when the help text never mentions buckets (this also covers empty
    # output) or when the CLI reports an unknown command. The "no such command"
    # wording is an assumption about the CLI's error message.
    if "buckets" not in bucket_check or "no such command" in bucket_check:
        print("\nWARNING: hf buckets not available in this hf CLI version; skipping bucket tests")
        results.summary()
        return

    bucket_created = False
    try:
        print(f"\nCreating bucket {bucket_id}...")
        run(["hf", "buckets", "create", bucket_id, "--private"])
        bucket_created = True
        handle = f"hf://buckets/{bucket_id}"

        with tempfile.TemporaryDirectory() as tmpdir:
            upload_dir = Path(tmpdir) / "upload"
            download_dir = Path(tmpdir) / "download"
            (upload_dir / "subdir").mkdir(parents=True)
            download_dir.mkdir()

            # Files used in bucket tests
            single_hash = generate_file(upload_dir / "single.bin", 512 * 1024)
            subdir1_hash = generate_file(upload_dir / "subdir/file1.bin", 256 * 1024)
            subdir2_hash = generate_file(upload_dir / "subdir/file2.bin", 256 * 1024)
            print(" Generated single.bin, subdir/file1.bin, subdir/file2.bin")

            # -- 1. cp: upload single file --
            def test_bucket_cp_upload():
                t = time.time()
                run(["hf", "buckets", "cp",
                     str(upload_dir / "single.bin"), f"{handle}/single.bin"])
                print(f" Uploaded single.bin via cp in {time.time()-t:.1f}s")
            results.run("Bucket: cp upload single file", test_bucket_cp_upload)

            # -- 2. sync: upload directory --
            def test_bucket_sync_upload():
                t = time.time()
                run(["hf", "buckets", "sync",
                     str(upload_dir / "subdir"), f"{handle}/subdir"])
                print(f" Synced subdir/ up in {time.time()-t:.1f}s")
            results.run("Bucket: sync upload directory", test_bucket_sync_upload)

            # -- 3. list files (recursive quiet) --
            def test_bucket_list():
                out = run(["hf", "buckets", "list", bucket_id, "-R", "--quiet"])
                listed = set(out.splitlines())
                for path in ("single.bin", "subdir/file1.bin", "subdir/file2.bin"):
                    assert path in listed, f"Expected {path!r} in listing, got: {listed}"
                print(f" Listed {len(listed)} file(s): {sorted(listed)}")
            results.run("Bucket: list files (recursive)", test_bucket_list)

            # -- 4. cp: download single file and verify --
            def test_bucket_cp_download():
                out_path = download_dir / "single.bin"
                t = time.time()
                run(["hf", "buckets", "cp", f"{handle}/single.bin", str(out_path)])
                actual = sha256_file(out_path)
                assert actual == single_hash, (
                    f"Hash mismatch: expected {single_hash[:16]}..., got {actual[:16]}..."
                )
                print(f" Downloaded+verified single.bin in {time.time()-t:.1f}s")
            results.run("Bucket: cp download and verify", test_bucket_cp_download)

            # -- 5. sync: download directory and verify --
            def test_bucket_sync_download():
                out_dir = download_dir / "subdir"
                t = time.time()
                run(["hf", "buckets", "sync", f"{handle}/subdir", str(out_dir)])
                print(f" Synced subdir/ down in {time.time()-t:.1f}s")
                for fname, expected_hash in (
                    ("file1.bin", subdir1_hash),
                    ("file2.bin", subdir2_hash),
                ):
                    p = out_dir / fname
                    assert p.exists(), f"Missing: {p}"
                    actual = sha256_file(p)
                    assert actual == expected_hash, (
                        f"Hash mismatch for {fname}: "
                        f"expected {expected_hash[:16]}..., got {actual[:16]}..."
                    )
                    print(f" Verified subdir/{fname}")
            results.run("Bucket: sync download and verify", test_bucket_sync_download)

            # -- 6. overwrite via cp and verify new content --
            def test_bucket_overwrite():
                new_hash = generate_file(upload_dir / "single.bin", 1024 * 1024)
                run(["hf", "buckets", "cp",
                     str(upload_dir / "single.bin"), f"{handle}/single.bin"])
                out_path = download_dir / "single_overwrite.bin"
                run(["hf", "buckets", "cp", f"{handle}/single.bin", str(out_path)])
                actual = sha256_file(out_path)
                assert actual == new_hash, (
                    f"Overwrite mismatch: expected {new_hash[:16]}..., got {actual[:16]}..."
                )
                print(" Overwrite verified: new content downloaded correctly")
            results.run("Bucket: cp overwrite and verify", test_bucket_overwrite)

            # -- 7. sync --delete: remove files absent from source --
            def test_bucket_sync_delete():
                # Local subdir now only has file1.bin; sync --delete should remove file2.bin
                (upload_dir / "subdir" / "file2.bin").unlink()
                run(["hf", "buckets", "sync",
                     str(upload_dir / "subdir"), f"{handle}/subdir", "--delete"])
                out = run(["hf", "buckets", "list", bucket_id, "-R", "--quiet"])
                listed = set(out.splitlines())
                assert "subdir/file2.bin" not in listed, \
                    f"subdir/file2.bin still present after sync --delete: {listed}"
                assert "subdir/file1.bin" in listed, \
                    f"subdir/file1.bin missing after sync --delete: {listed}"
                print(f" sync --delete verified: remaining files: {sorted(listed)}")
            results.run("Bucket: sync --delete removes extraneous files", test_bucket_sync_delete)

            # -- 8. rm: delete a file and confirm it's gone --
            def test_bucket_rm():
                run(["hf", "buckets", "rm", f"{bucket_id}/single.bin", "--yes"])
                out = run(["hf", "buckets", "list", bucket_id, "-R", "--quiet"])
                listed = set(out.splitlines())
                assert "single.bin" not in listed, \
                    f"single.bin still present after rm: {listed}"
                print(f" rm verified: remaining files: {sorted(listed)}")
            results.run("Bucket: rm file", test_bucket_rm)

    finally:
        if bucket_created and not args.keep_repo:
            print(f"\nCleaning up bucket {bucket_id}...")
            try:
                run(["hf", "buckets", "delete", bucket_id, "--yes"])
                print(" Deleted.")
            except Exception as e:
                print(f" Warning: failed to delete bucket: {e}", file=sys.stderr)

    results.summary()


if __name__ == "__main__":
    main()