feat: smoke tests using hf CLI with bucket and large-file coverage (#710)

## Summary

- Rewrites smoke tests to drive everything through the `hf` CLI rather
than the huggingface_hub Python API, covering the actual user-facing
surface area of hf-xet
- Moves smoke tests and diagnostic scripts into a `scripts/` directory
for cleaner repo layout
- Adds storage bucket test suite exercising the full bucket lifecycle
- Adds 50 MB and 100 MB files to repo upload/download tests

## Test matrix (14 tests, all passing)

**Repository tests** (`hf upload` / `hf download`)
- Upload single file, upload folder
- Download individual files + SHA-256 verify
- Download entire repo + SHA-256 verify
- Overwrite file and verify new content served
- Delete file and confirm absent

**Bucket tests** (`hf buckets`)
- `cp` upload / download + verify
- `sync` upload / download + verify
- Recursive list confirms expected paths
- Overwrite via `cp` + verify
- `sync --delete` removes extraneous remote files
- `rm` + confirm absent from listing

## Test plan
- [x] Run `HF_TOKEN=... ./scripts/smoke_tests/run.sh` and confirm all 14
tests pass
- [x] Run `./scripts/smoke_tests/run.sh --skip-buckets` for repo-only
path
- [x] Run with `--hf-xet-version <version>` to confirm PyPI cache bypass
works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rajat Arya
2026-03-17 19:07:05 -07:00
committed by GitHub
parent 69c714c01d
commit c0f7980616
9 changed files with 740 additions and 164 deletions

182
README.md

@@ -43,180 +43,34 @@ Please join us in making xet-core better. We value everyone's contributions. Cod
## Issues, Diagnostics & Debugging
If you encounter an issue when using `hf-xet`, please help us fix the issue by collecting diagnostic information and attaching that when creating a [new Issue](https://github.com/huggingface/xet-core/issues/new/choose). Download the [hf-xet-diag-linux.sh](hf-xet-diag-linux.sh), [hf-xet-diag-macos.sh](hf-xet-diag-macos.sh), or [hf-xet-diag-windows.sh](hf-xet-diag-windows.sh) script based on your operating system and then re-run the Python command that resulted in the issue. The diagnostic scripts will download and install debug symbols, set up logging, and take periodic stack traces throughout process execution in a diagnostics directory that is easy to analyze, package, and upload.
If you encounter an issue with `hf-xet`, please collect diagnostic information
and attach it when creating a [new Issue](https://github.com/huggingface/xet-core/issues/new/choose).
### Diagnostics - Linux (`hf-xet-diag-linux.sh`)
The [`scripts/diag/`](scripts/diag/) directory contains platform-specific scripts
that download debug symbols, configure logging, and capture periodic stack traces
and core dumps:
* Uses `gdb` + `gcore` to periodically snapshot stacks and produce core dumps.
* Supports optional ptrace preload helper for debugging.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.
**Requirements:**
| OS | Script |
|----|--------|
| Linux | [`scripts/diag/hf-xet-diag-linux.sh`](scripts/diag/hf-xet-diag-linux.sh) |
| macOS | [`scripts/diag/hf-xet-diag-macos.sh`](scripts/diag/hf-xet-diag-macos.sh) |
| Windows (Git-Bash) | [`scripts/diag/hf-xet-diag-windows.sh`](scripts/diag/hf-xet-diag-windows.sh) |
```bash
sudo apt-get install gdb build-essential
# prefix your failing command with the script for your OS, e.g.:
./scripts/diag/hf-xet-diag-macos.sh -- python my-script.py
```
**Example usage:**
See [**scripts/diag/README.md**](scripts/diag/README.md) for full usage, output layout, dump analysis instructions, and how to install debug symbols manually.
Quick debugging environment variables:
```bash
./hf-xet-diag-linux.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
RUST_BACKTRACE=full # full Rust backtraces on panic
RUST_LOG=info # enable hf-xet logging
HF_XET_LOG_FILE=/tmp/xet.log # write logs to a file (defaults to stdout)
```
### Windows (Git-Bash) (`hf-xet-diag-windows.sh`)
* Runs in **Git-Bash**, keeping usage consistent with Linux.
* Uses **Sysinternals ProcDump** for periodic mini dumps (`-mp`).
* Auto-downloads `procdump.exe` if not found.
* Downloads and installs the matching `hf_xet.pdb` debug symbol into the package directory.
**Requirements:**
* Git-Bash (from [Git for Windows](https://gitforwindows.org/))
* Python installed
* Internet access (first run downloads ProcDump and debug symbols)
**Example usage:**
```bash
./hf-xet-diag-windows.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```
### Diagnostics - MacOS (`hf-xet-diag-macos.sh`)
* Uses `sample` + `lldb` to periodically snapshot stacks and produce core dumps.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.
**Requirements:**
```bash
sudo xcode-select --install
```
**Example usage:**
```bash
./hf-xet-diag-macos.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```
---
### Output Layout
The diagnostic scripts produce a diagnostics directory named:
```
diag_<command>_<timestamp>/
├── console.log # Combined stdout/stderr of the process
├── env.log # System/environment info
├── pid # Child PID file
├── stacks/ # Periodic stack traces / dumps
└── dumps/ # (Linux only) full gcore dumps
```
This unified layout makes it easier to compare diagnostics across platforms.
---
### Analyzing Dumps
Use the [hf-xet-diag-analyze-latest.sh](hf-xet-diag-analyze-latest.sh) script to automatically find and open the most recent dump in the appropriate debugger for your platform.
**Usage:**
```bash
./hf-xet-diag-analyze-latest.sh
```
* Auto-detects your OS (Linux, macOS, or Windows)
* Finds the most recent `diag_*` directory
* Opens the latest dump in the platform-appropriate debugger:
  * **Linux:** `gdb` with core dumps from `dumps/`
  * **macOS:** `lldb` with `.core` files from `dumps/`
  * **Windows (Git-Bash):** `windbg` with `.dmp` files from `stacks/`
You can also specify a diagnostics directory:
```bash
./hf-xet-diag-analyze-latest.sh diag_python_hfxet_test_20250127120000
```
**Manual Analysis**
If you prefer to analyze dumps manually:
**Linux**
* Stack traces: `stacks/*.txt` (plain text, captured periodically)
* Core dumps: `dumps/core_*`
* Analysis:
```bash
gdb python dumps/core_<timestamp>.<pid>
(gdb) bt # backtrace of current thread
(gdb) thread apply all bt # backtrace of all threads
(gdb) info threads # list all threads
```
* Ensure debug symbols (`hf_xet-*.so.dbg`) are in the `hf_xet` package directory
**macOS**
* Stack traces: `stacks/*.txt` (from `sample` command)
* Core dumps: `dumps/dump_<pid>_<timestamp>.core`
* Analysis:
```bash
lldb -c dumps/dump_<pid>_<timestamp>.core python3
(lldb) bt # backtrace of current thread
(lldb) thread backtrace all # backtrace of all threads
(lldb) thread list # list all threads
```
* Ensure debug symbols (`hf_xet-*.dylib.dSYM`) are in the `hf_xet` package directory
**Windows**
* Dumps: `stacks/dump_<timestamp>.dmp`
* Install [WinDbg via Windows SDK](https://developer.microsoft.com/en-us/windows/downloads/windows-sdk/)
* Analysis:
```cmd
windbg -z stacks\dump_<timestamp>.dmp
```
* Common WinDbg commands:
```
!analyze -v # automatic analysis
~* kb # backtrace of all threads
~ # list all threads
lm # list loaded modules (verify hf_xet.pdb loaded)
```
* Ensure debug symbols (`hf_xet.pdb`) are in the `hf_xet` package directory
---
⚠️ **Tip:** Share the full `diag_<command>_<timestamp>/` directory when reporting issues — it contains logs, environment info, and dumps needed to reproduce and diagnose problems.
### Debugging
To limit the size of our built binaries, we are releasing Python wheels with binaries that are stripped of debugging symbols. If you encounter a panic while running hf-xet, you can use the debug symbols to help identify the part of the library that failed.
Here are the recommended steps:
1. Download and unzip our [debug symbols package](https://github.com/huggingface/xet-core/releases/download/latest/dbg-symbols.zip).
2. Determine the location of the hf-xet package using `pip show hf-xet`. The `Location` field will show the location of all the site packages. The `hf_xet` package will be within that directory.
3. Determine the symbols to copy based on the system you are running:
* Windows: use `hf_xet.pdb`
* Mac: use `libhf_xet-macosx-x86_64.dylib.dSYM` for Intel based Macs and `libhf_xet-macosx-aarch64.dylib.dSYM` for Apple Silicon.
* Linux: the choice will depend on the architecture and wheel distribution used. To get this information, `cat` the `WHEEL` file name within the `hf_xet.dist-info` directory in your site packages. The wheel file will have the linux build and architecture in the file name. Eg: `cat /home/ubuntu/.venv/lib/python3.12/site-packages/hf_xet-*.dist-info/WHEEL`. You will use the file named `hf_xet-<manylinux | musllinux>-<x86_64 | arm64>.abi3.so.dbg` choosing the distribution and platform that matches your wheel. Eg: `hf_xet-manylinux-x86_64.abi3.so.dbg`.
4. Copy the symbols to the site package path from step 2 above + `hf_xet`. Eg: `cp -r hf_xet-1.1.2-manylinux-x86_64.abi3.so.dbg /home/ubuntu/.venv/lib/python3.12/site-packages/hf_xet`
5. Run your python binary with `RUST_BACKTRACE=full` and recreate your failure.
#### Debugging Environment Variables
To enable logging and see more debugging / diagnostics information, set the following:
```
RUST_BACKTRACE=full
RUST_LOG=info
HF_XET_LOG_FILE=/tmp/xet.log
```
Note: HF_XET_LOG_FILE expects a full writable path. If one isn't found it will use stdout console for logging.
## Local Development
### Repo Organization - Rust Crates

193
scripts/diag/README.md Normal file

@@ -0,0 +1,193 @@
# hf-xet Diagnostic Scripts
Scripts for collecting diagnostics when `hf-xet` hangs, crashes, or behaves
unexpectedly. They download debug symbols, configure logging, and periodically
capture stack traces / core dumps into a self-contained directory that is easy
to zip and attach to a [GitHub issue](https://github.com/huggingface/xet-core/issues/new/choose).
## Quick start
Pick the script for your OS and prefix your failing command with it:
| OS | Script |
|----|--------|
| Linux | `scripts/diag/hf-xet-diag-linux.sh` |
| macOS | `scripts/diag/hf-xet-diag-macos.sh` |
| Windows (Git-Bash) | `scripts/diag/hf-xet-diag-windows.sh` |
```bash
# Linux
./scripts/diag/hf-xet-diag-linux.sh -- python my-script.py
# macOS
./scripts/diag/hf-xet-diag-macos.sh -- python my-script.py
# Windows (Git-Bash)
./scripts/diag/hf-xet-diag-windows.sh -- python my-script.py
```
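As a rough illustration of the table above, the script choice can be derived from the host OS. The helper name `diag_command` and the `DIAG_SCRIPTS` mapping are hypothetical and not part of this repository; only the script paths and the `--` separator come from this README. The handling of `MINGW`/`MSYS`/`CYGWIN` names is an assumption about how some Python builds report the platform under Git-Bash.

```python
import platform

# Script paths as listed in the table above.
DIAG_SCRIPTS = {
    "Linux": "scripts/diag/hf-xet-diag-linux.sh",
    "Darwin": "scripts/diag/hf-xet-diag-macos.sh",
    "Windows": "scripts/diag/hf-xet-diag-windows.sh",
}


def diag_command(command, system=None):
    """Return `command` prefixed with the diagnostic script for `system`.

    Hypothetical helper: `system` defaults to platform.system().
    """
    system = system or platform.system()
    # Assumption: MSYS/Cygwin-style Pythons report e.g. "MINGW64_NT-10.0".
    if system.startswith(("MINGW", "MSYS", "CYGWIN")):
        system = "Windows"
    try:
        script = DIAG_SCRIPTS[system]
    except KeyError:
        raise ValueError(f"no diagnostic script for {system!r}") from None
    return [f"./{script}", "--", *command]


print(diag_command(["python", "my-script.py"], system="Linux"))
```

The returned list can be passed to `subprocess.run` from the repository root.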
## Per-platform details
### Linux (`hf-xet-diag-linux.sh`)
* Uses `gdb` + `gcore` to periodically snapshot stacks and produce core dumps.
* Supports optional ptrace preload helper for debugging.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.
**Requirements:**
```bash
sudo apt-get install gdb build-essential
```
**Example:**
```bash
./scripts/diag/hf-xet-diag-linux.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```
### macOS (`hf-xet-diag-macos.sh`)
* Uses `sample` + `lldb` to periodically snapshot stacks and produce core dumps.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.
**Requirements:**
```bash
sudo xcode-select --install
```
**Example:**
```bash
./scripts/diag/hf-xet-diag-macos.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```
### Windows / Git-Bash (`hf-xet-diag-windows.sh`)
* Runs in **Git-Bash**, keeping usage consistent with Linux/macOS.
* Uses **Sysinternals ProcDump** for periodic mini dumps (`-mp`).
* Auto-downloads `procdump.exe` if not found.
* Downloads and installs the matching `hf_xet.pdb` debug symbol into the package directory.
**Requirements:**
* Git-Bash (from [Git for Windows](https://gitforwindows.org/))
* Python installed
* Internet access (first run downloads ProcDump and debug symbols)
**Example:**
```bash
./scripts/diag/hf-xet-diag-windows.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```
## Output layout
Each run produces a directory named `diag_<command>_<timestamp>/`:
```
diag_<command>_<timestamp>/
├── console.log # Combined stdout/stderr of the process
├── env.log # System/environment info
├── pid # Child PID file
├── stacks/ # Periodic stack traces / mini dumps
└── dumps/ # (Linux/macOS) full core dumps
```
> **Tip:** Zip and attach the entire `diag_<command>_<timestamp>/` directory
> when filing an issue — it contains everything needed to reproduce and diagnose
> the problem.
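Packaging the directory for an issue could look like the sketch below. It assumes the `diag_*` directories sit in the current working directory; the helper names `latest_diag_dir` and `zip_diag_dir` are made up for this example and are not provided by the diagnostic scripts.

```python
import shutil
from pathlib import Path


def latest_diag_dir(root="."):
    """Return the most recently modified diag_* directory under root, or None."""
    candidates = [p for p in Path(root).glob("diag_*") if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def zip_diag_dir(diag_dir):
    """Create <diag_dir>.zip next to the directory and return its path."""
    diag_dir = Path(diag_dir)
    archive = shutil.make_archive(
        str(diag_dir),                  # archive path without the .zip suffix
        "zip",
        root_dir=str(diag_dir.parent),  # keep the diag_* folder name in the archive
        base_dir=diag_dir.name,
    )
    return Path(archive)


if __name__ == "__main__":
    newest = latest_diag_dir()
    if newest is None:
        print("no diag_* directory found")
    else:
        print(f"wrote {zip_diag_dir(newest)}")
```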
## Analyzing dumps
Use `hf-xet-diag-analyze-latest.sh` to automatically open the most recent dump
in the appropriate debugger:
```bash
# Auto-detect latest diag_* directory
./scripts/diag/hf-xet-diag-analyze-latest.sh
# Or specify a directory explicitly
./scripts/diag/hf-xet-diag-analyze-latest.sh diag_python_hfxet_test_20250127120000
```
The script:
* Auto-detects your OS (Linux, macOS, or Windows)
* Finds the most recent `diag_*` directory
* Opens the latest dump in the platform-appropriate debugger:
  * **Linux:** `gdb` with core dumps from `dumps/`
  * **macOS:** `lldb` with `.core` files from `dumps/`
  * **Windows (Git-Bash):** `windbg` with `.dmp` files from `stacks/`
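The selection logic described in this list can be sketched in a few lines. This is not the implementation of `hf-xet-diag-analyze-latest.sh`; the `DEBUGGERS` table and `latest_dump` function are hypothetical, and the glob patterns are inferred from the file names shown under "Manual analysis".

```python
from pathlib import Path

# (debugger, subdirectory, glob pattern) per platform, following the list above.
# The patterns are assumptions based on the dump names shown in this README.
DEBUGGERS = {
    "Linux": ("gdb", "dumps", "core_*"),
    "Darwin": ("lldb", "dumps", "*.core"),
    "Windows": ("windbg", "stacks", "*.dmp"),
}


def latest_dump(diag_dir, system):
    """Return (debugger, newest dump path or None) for a diag_* directory."""
    debugger, subdir, pattern = DEBUGGERS[system]
    directory = Path(diag_dir) / subdir
    if not directory.is_dir():
        return debugger, None
    dumps = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime)
    return debugger, (dumps[-1] if dumps else None)
```

A caller would pass `platform.system()` as `system` and then launch the returned debugger on the returned path.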
### Manual analysis
**Linux**
```bash
gdb python dumps/core_<timestamp>.<pid>
(gdb) bt # backtrace of current thread
(gdb) thread apply all bt # backtrace of all threads
(gdb) info threads # list all threads
```
Debug symbols: `hf_xet-*.so.dbg` must be in the `hf_xet` package directory.
**macOS**
```bash
lldb -c dumps/dump_<pid>_<timestamp>.core python3
(lldb) bt # backtrace of current thread
(lldb) thread backtrace all # backtrace of all threads
(lldb) thread list # list all threads
```
Debug symbols: `hf_xet-*.dylib.dSYM` must be in the `hf_xet` package directory.
**Windows**
```cmd
windbg -z stacks\dump_<timestamp>.dmp
```
Useful WinDbg commands:
```
!analyze -v # automatic analysis
~* kb # backtrace of all threads
~ # list all threads
lm # list loaded modules (verify hf_xet.pdb loaded)
```
Debug symbols: `hf_xet.pdb` must be in the `hf_xet` package directory.
## Installing debug symbols manually
The diagnostic scripts install symbols automatically, but you can also do it
manually:
1. Download and unzip the [debug symbols package](https://github.com/huggingface/xet-core/releases/download/latest/dbg-symbols.zip).
2. Find the `hf_xet` package location: `pip show hf-xet` — look at the `Location` field.
3. Choose the right symbol file for your platform:
   * **Windows:** `hf_xet.pdb`
   * **macOS (Apple Silicon):** `libhf_xet-macosx-aarch64.dylib.dSYM`
   * **macOS (Intel):** `libhf_xet-macosx-x86_64.dylib.dSYM`
   * **Linux:** match your wheel distribution and arch; check with:
     ```bash
     cat /path/to/site-packages/hf_xet-*.dist-info/WHEEL
     ```
     then use `hf_xet-<manylinux|musllinux>-<x86_64|arm64>.abi3.so.dbg`.
4. Copy the symbol file into the `hf_xet` package directory:
   ```bash
   cp -r hf_xet-1.1.2-manylinux-x86_64.abi3.so.dbg \
       /path/to/site-packages/hf_xet/
   ```
5. Re-run with `RUST_BACKTRACE=full` to get a full backtrace.
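For the Linux case in step 3, the `WHEEL` lookup can be approximated in Python. The sketch below is only a guess at the mapping: `linux_symbol_file` is a hypothetical helper, and it assumes that an `aarch64` platform tag corresponds to the `arm64` symbol file named above. Check the contents of the unzipped symbols package before relying on the result.

```python
def linux_symbol_file(wheel_text):
    """Guess the Linux debug-symbol file name from the text of a WHEEL file.

    Returns None when no manylinux/musllinux tag with a known arch is found.
    """
    for line in wheel_text.splitlines():
        if not line.startswith("Tag:"):
            continue
        # e.g. "Tag: cp37-abi3-manylinux_2_17_x86_64" -> "manylinux_2_17_x86_64"
        platform_tag = line.split(":", 1)[1].strip().split("-")[-1]
        for libc in ("manylinux", "musllinux"):
            if not platform_tag.startswith(libc):
                continue
            if platform_tag.endswith("x86_64"):
                arch = "x86_64"
            elif platform_tag.endswith(("aarch64", "arm64")):
                arch = "arm64"  # assumption: aarch64 wheels use the arm64 symbols
            else:
                continue
            return f"hf_xet-{libc}-{arch}.abi3.so.dbg"
    return None


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, distribution

    try:
        wheel_text = distribution("hf-xet").read_text("WHEEL") or ""
    except PackageNotFoundError:
        wheel_text = ""
    print(linux_symbol_file(wheel_text))
```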
## Useful environment variables
```bash
RUST_BACKTRACE=full # full Rust backtraces on panic
RUST_LOG=info # enable hf-xet logging
HF_XET_LOG_FILE=/tmp/xet.log # write logs to a file (defaults to stdout)
```
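When the failing command is launched from Python rather than a shell, the same variables can be passed through the child environment. This is a minimal sketch; `debug_env` is a hypothetical helper and the child command is a placeholder.

```python
import os
import subprocess
import sys


def debug_env(log_file=None, base=None):
    """Return a copy of the environment with the debugging variables set."""
    env = dict(os.environ if base is None else base)
    env["RUST_BACKTRACE"] = "full"  # full Rust backtraces on panic
    env["RUST_LOG"] = "info"        # enable hf-xet logging
    if log_file is not None:
        env["HF_XET_LOG_FILE"] = str(log_file)  # otherwise logs go to stdout
    return env


if __name__ == "__main__":
    # Placeholder child process; replace with the command that reproduces the issue.
    subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['RUST_LOG'])"],
        env=debug_env("/tmp/xet.log"),
        check=True,
    )
```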


@@ -0,0 +1,63 @@
# hf-xet Smoke Tests
End-to-end tests that exercise the full hf-xet upload/download path against the
real HuggingFace Hub. They use the `hf` CLI for all Hub operations and verify
content integrity with SHA-256 checksums.
## Prerequisites
- [`uv`](https://docs.astral.sh/uv/)
- [`hf` CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) — `uv tool install huggingface_hub`
- `HF_TOKEN` environment variable with write access
## Usage
```bash
# Test latest hf_xet from PyPI
./scripts/smoke_tests/run.sh
# Test a specific version (bypasses uv cache, fetches directly from PyPI)
./scripts/smoke_tests/run.sh --hf-xet-version 1.4.0
# Test a local wheel
HF_XET_WHEEL=./dist/hf_xet-1.4.0.whl ./scripts/smoke_tests/run.sh
# Skip storage bucket tests
./scripts/smoke_tests/run.sh --skip-buckets
# Keep the test repo/bucket after the run (useful for debugging)
./scripts/smoke_tests/run.sh --keep-repo
```
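The way `run.sh` chooses between these modes can be summarized as follows. The function `build_uv_command` is a hypothetical Python restatement of the shell logic, shown only to make the precedence explicit: a local wheel wins over a pinned version, and a pinned version adds `--refresh-package hf_xet` so the package is fetched again.

```python
def build_uv_command(script, args, wheel=None, version=None):
    """Mirror the branch order used by run.sh when invoking `uv run`."""
    cmd = ["uv", "run"]
    if wheel:
        # HF_XET_WHEEL is set: install the local wheel.
        cmd += ["--with", wheel]
    elif version:
        # --hf-xet-version was given: pin the version and refresh the package.
        cmd += ["--with", f"hf_xet=={version}", "--refresh-package", "hf_xet"]
    # Otherwise rely on the dependencies declared in the script itself.
    return cmd + [script, *args]


print(build_uv_command("test_upload_download.py", ["--skip-buckets"], version="1.4.0"))
```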
## What's tested
### Repository tests (`hf upload` / `hf download`)
Uploads test files of varying sizes (1 KB → 100 MB) to a temporary private
model repo, then downloads and verifies every file's SHA-256 hash.
| Test | Description |
|------|-------------|
| Upload single file | `hf upload` of a single file |
| Upload folder | `hf upload` of an entire directory tree |
| Download individual files | Per-file `hf download` + hash check |
| Download all files | Full-repo `hf download` + hash check |
| Overwrite and re-download | Confirms updated content is served after overwrite |
| Delete file | `hf repos delete-files` + confirms file is absent |
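The hash check behind the download tests boils down to comparing each downloaded file against the digest recorded at generation time. The sketch below shows that idea in isolation; `verify_tree` is a hypothetical helper that collects every problem instead of stopping at the first one, which differs from the assertions used in the test script.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Hash a file in chunks so large files are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_tree(expected, root):
    """Return {relative_path: reason} for each file that is missing or differs.

    `expected` maps relative paths to the SHA-256 hex digest recorded at upload.
    """
    problems = {}
    for rel, digest in expected.items():
        path = Path(root) / rel
        if not path.is_file():
            problems[rel] = "missing"
        elif sha256_file(path) != digest:
            problems[rel] = "hash mismatch"
    return problems
```

An empty result means every expected file was found with matching content.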
### Bucket tests (`hf buckets`)
Creates a temporary private bucket and exercises the full bucket lifecycle.
| Test | Description |
|------|-------------|
| cp upload | `hf buckets cp` single file upload |
| sync upload | `hf buckets sync` directory upload |
| list | Recursive listing confirms all expected paths |
| cp download | `hf buckets cp` download + hash check |
| sync download | `hf buckets sync` directory download + hash check |
| Overwrite | cp overwrite + re-download confirms new content |
| sync --delete | Extraneous remote files removed when absent locally |
| rm | `hf buckets rm` + confirms file absent from listing |
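The `sync --delete` row relies on a simple set difference: anything present remotely but absent locally is extraneous. The sketch below states that expectation; `extraneous_remote_files` is a hypothetical helper, and it says nothing about how `hf buckets sync` itself compares files.

```python
def extraneous_remote_files(local_paths, remote_paths):
    """Return the remote paths that a sync with --delete is expected to remove."""
    return sorted(set(remote_paths) - set(local_paths))


# The local directory has lost file2.bin; the remote copy still has both files.
print(extraneous_remote_files(["file1.bin"], ["file1.bin", "file2.bin"]))
```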
All temporary repos and buckets are deleted after the run unless `--keep-repo` is set.

56
scripts/smoke_tests/run.sh Executable file

@@ -0,0 +1,56 @@
#!/bin/bash
set -euo pipefail
# Smoke test runner for hf-xet upload/download via the hf CLI.
#
# Prerequisites:
# - uv (https://docs.astral.sh/uv/)
# - hf CLI (pip install huggingface_hub, or uv tool install huggingface_hub)
# - HF_TOKEN env var with write access
#
# Usage:
# ./scripts/smoke_tests/run.sh # latest hf_xet from PyPI
# ./scripts/smoke_tests/run.sh --hf-xet-version 1.4.0 # specific version (bypasses uv cache)
# ./scripts/smoke_tests/run.sh --skip-buckets # skip storage bucket tests
# ./scripts/smoke_tests/run.sh --keep-repo # leave test repo/bucket after run
# HF_XET_WHEEL=./dist/hf_xet-1.4.0.whl ./scripts/smoke_tests/run.sh # local wheel
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
if [ -z "${HF_TOKEN:-}" ]; then
    echo "ERROR: HF_TOKEN environment variable is required" >&2
    echo " export HF_TOKEN=hf_..." >&2
    exit 1
fi
if ! command -v uv &> /dev/null; then
    echo "ERROR: uv is required. Install: curl -LsSf https://astral.sh/uv/install.sh | sh" >&2
    exit 1
fi
if ! command -v hf &> /dev/null; then
    echo "ERROR: hf CLI is required. Install: uv tool install huggingface_hub" >&2
    exit 1
fi
# Parse --hf-xet-version from args (if present)
HF_XET_VERSION=""
for arg in "$@"; do
    if [[ "${prev_arg:-}" == "--hf-xet-version" ]]; then
        HF_XET_VERSION="$arg"
    fi
    prev_arg="$arg"
done
echo "Running hf-xet smoke tests..."
echo ""
if [ -n "${HF_XET_WHEEL:-}" ]; then
    echo "Using local wheel: ${HF_XET_WHEEL}"
    uv run --with "${HF_XET_WHEEL}" "${SCRIPT_DIR}/test_upload_download.py" "$@"
elif [ -n "${HF_XET_VERSION}" ]; then
    echo "Using hf_xet version: ${HF_XET_VERSION} (fetching from PyPI)"
    uv run --with "hf_xet==${HF_XET_VERSION}" --refresh-package hf_xet "${SCRIPT_DIR}/test_upload_download.py" "$@"
else
    uv run "${SCRIPT_DIR}/test_upload_download.py" "$@"
fi


@@ -0,0 +1,410 @@
"""
Smoke test for hf-xet using the `hf` CLI for upload/download through both
HuggingFace model repositories and storage buckets.
Creates temporary resources, exercises upload/download paths, verifies content
integrity, then cleans up. Requires HF_TOKEN with write access.
Usage:
uv run scripts/smoke_tests/test_upload_download.py
uv run scripts/smoke_tests/test_upload_download.py --hf-xet-version 1.4.0
uv run scripts/smoke_tests/test_upload_download.py --keep-repo
uv run scripts/smoke_tests/test_upload_download.py --skip-buckets
"""
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "huggingface_hub>=1.0.0",
# "hf_xet",
# ]
# ///
import argparse
import hashlib
import os
import secrets
import shutil
import subprocess
import sys
import tempfile
import time
from pathlib import Path
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def run(cmd: list[str], check: bool = True) -> str:
    """Run a CLI command, return stdout. Raises RuntimeError on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if check and result.returncode != 0:
        raise RuntimeError(
            f"Command failed: {' '.join(cmd)}\n"
            f"stdout: {result.stdout.strip()}\n"
            f"stderr: {result.stderr.strip()}"
        )
    return result.stdout.strip()


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sha256_file(path: str | Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def generate_file(path: str | Path, size_bytes: int) -> str:
    """Write random bytes to path; return sha256 hex."""
    data = secrets.token_bytes(size_bytes)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return sha256_bytes(data)
# ---------------------------------------------------------------------------
# Test runner
# ---------------------------------------------------------------------------
class Results:
    def __init__(self):
        self.passed = 0
        self.failed = 0
        self.errors: list[tuple[str, str]] = []

    def run(self, name: str, fn):
        print(f"\n{'='*60}")
        print(f"TEST: {name}")
        print(f"{'='*60}")
        try:
            fn()
            self.passed += 1
            print(f"PASSED: {name}")
        except Exception as e:
            self.failed += 1
            self.errors.append((name, str(e)))
            print(f"FAILED: {name}: {e}", file=sys.stderr)

    def summary(self):
        print(f"\n{'='*60}")
        print(f"RESULTS: {self.passed} passed, {self.failed} failed")
        print(f"{'='*60}")
        if self.errors:
            for name, err in self.errors:
                print(f" FAIL: {name}: {err}")
            sys.exit(1)
        else:
            print("All smoke tests passed!")
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
    parser = argparse.ArgumentParser(description="Smoke test hf-xet via hf CLI")
    parser.add_argument("--hf-xet-version", help="Expected hf_xet version (display/warn only)")
    parser.add_argument("--keep-repo", action="store_true", help="Skip cleanup of test repo/bucket")
    parser.add_argument("--repo-prefix", default="smoke-test-xet", help="Prefix for temp resource names")
    parser.add_argument("--skip-buckets", action="store_true", help="Skip storage bucket tests")
    args = parser.parse_args()

    # --- preflight checks ---
    token = os.environ.get("HF_TOKEN")
    if not token:
        print("ERROR: HF_TOKEN environment variable is required", file=sys.stderr)
        sys.exit(1)
    if not shutil.which("hf"):
        print("ERROR: hf CLI not found. Install: pip install huggingface_hub", file=sys.stderr)
        sys.exit(1)

    # --- print environment ---
    import huggingface_hub
    print(f"huggingface_hub version: {huggingface_hub.__version__}")
    try:
        from importlib.metadata import version as pkg_version
        installed_xet = pkg_version("hf_xet")
        print(f"hf_xet version: {installed_xet}")
        if args.hf_xet_version and installed_xet != args.hf_xet_version:
            print(f"WARNING: hf_xet version mismatch: got {installed_xet}, expected {args.hf_xet_version}")
    except Exception:
        print("hf_xet version: unknown")
    print(f"hf CLI: {run(['hf', 'version'])}")

    # --- resolve username ---
    from huggingface_hub import HfApi
    user = HfApi(token=token).whoami()["name"]
    suffix = secrets.token_hex(4)
    repo_id = f"{user}/{args.repo_prefix}-{suffix}"
    bucket_id = f"{user}/{args.repo_prefix}-bucket-{suffix}"
    print(f"\nTest repo: {repo_id}")
    if not args.skip_buckets:
        print(f"Test bucket: {bucket_id}")
    results = Results()
    # ===================================================================== #
    # Repository tests (hf upload / hf download)
    # ===================================================================== #
    repo_test_files = {
        "small.bin": 1024,                  # 1 KB, below chunk size
        "medium.bin": 256 * 1024,           # 256 KB, a few chunks
        "large.bin": 5 * 1024 * 1024,       # 5 MB, multiple chunks
        "xlarge.bin": 50 * 1024 * 1024,     # 50 MB, large multi-xorb
        "xxlarge.bin": 100 * 1024 * 1024,   # 100 MB, stress test
        "subdir/nested.bin": 128 * 1024,    # 128 KB, subdirectory
    }
    repo_created = False
    try:
        print(f"\nCreating repo {repo_id}...")
        run(["hf", "repos", "create", repo_id, "--private"])
        repo_created = True
        with tempfile.TemporaryDirectory() as tmpdir:
            upload_dir = Path(tmpdir) / "upload"
            download_dir = Path(tmpdir) / "download"
            upload_dir.mkdir()
            download_dir.mkdir()
            expected = {}
            for rel, size in repo_test_files.items():
                expected[rel] = generate_file(upload_dir / rel, size)
                print(f" Generated {rel} ({size:,} bytes)")

            # -- 1. upload single file --
            def test_repo_upload_single():
                t = time.time()
                run(["hf", "upload", repo_id,
                     str(upload_dir / "small.bin"), "small.bin", "--quiet"])
                print(f" Uploaded small.bin in {time.time()-t:.1f}s")
            results.run("Repo: upload single file", test_repo_upload_single)

            # -- 2. upload folder --
            def test_repo_upload_folder():
                t = time.time()
                run(["hf", "upload", repo_id, str(upload_dir), ".", "--quiet"])
                print(f" Uploaded folder in {time.time()-t:.1f}s")
            results.run("Repo: upload folder", test_repo_upload_folder)

            # -- 3. download individual files and verify --
            def test_repo_download_single():
                out = str(download_dir / "single")
                for rel in repo_test_files:
                    t = time.time()
                    run(["hf", "download", repo_id, rel, "--local-dir", out, "--quiet"])
                    actual = sha256_file(Path(out) / rel)
                    assert actual == expected[rel], (
                        f"Hash mismatch for {rel}: "
                        f"expected {expected[rel][:16]}..., got {actual[:16]}..."
                    )
                    print(f" Downloaded+verified {rel} in {time.time()-t:.1f}s")
            results.run("Repo: download and verify individual files", test_repo_download_single)

            # -- 4. download entire repo and verify --
            def test_repo_download_all():
                out = str(download_dir / "all")
                t = time.time()
                run(["hf", "download", repo_id, "--local-dir", out, "--quiet"])
                print(f" Downloaded all files in {time.time()-t:.1f}s")
                for rel in repo_test_files:
                    p = Path(out) / rel
                    assert p.exists(), f"Missing file: {rel}"
                    actual = sha256_file(p)
                    assert actual == expected[rel], (
                        f"Hash mismatch for {rel}: "
                        f"expected {expected[rel][:16]}..., got {actual[:16]}..."
                    )
                    print(f" Verified {rel}")
            results.run("Repo: download all files and verify", test_repo_download_all)

            # -- 5. overwrite a file and verify new content --
            def test_repo_overwrite():
                new_hash = generate_file(upload_dir / "small.bin", 2048)
                run(["hf", "upload", repo_id,
                     str(upload_dir / "small.bin"), "small.bin", "--quiet"])
                out = str(download_dir / "overwrite")
                run(["hf", "download", repo_id, "small.bin",
                     "--local-dir", out, "--force-download", "--quiet"])
                actual = sha256_file(Path(out) / "small.bin")
                assert actual == new_hash, (
                    f"Overwrite mismatch: expected {new_hash[:16]}..., got {actual[:16]}..."
                )
                print(" Overwrite verified: new content downloaded correctly")
            results.run("Repo: upload overwrite and verify", test_repo_overwrite)

            # -- 6. delete files from repo --
            def test_repo_delete_files():
                run(["hf", "repos", "delete-files", repo_id, "small.bin"])
                # Re-download all; small.bin should be absent
                out = str(download_dir / "post-delete")
                run(["hf", "download", repo_id, "--local-dir", out, "--quiet"])
                assert not (Path(out) / "small.bin").exists(), \
                    "small.bin still present after deletion"
                print(" small.bin confirmed absent after delete")
            results.run("Repo: delete file from repo", test_repo_delete_files)
    finally:
        if repo_created and not args.keep_repo:
            print(f"\nCleaning up repo {repo_id}...")
            try:
                run(["hf", "repos", "delete", repo_id])
                print(" Deleted.")
            except Exception as e:
                print(f" Warning: failed to delete repo: {e}", file=sys.stderr)
    # ===================================================================== #
    # Storage bucket tests (hf buckets)
    # ===================================================================== #
    if args.skip_buckets:
        results.summary()
        return

    # Check that hf buckets is available
    bucket_check = run(["hf", "buckets", "--help"], check=False)
    if "buckets" not in bucket_check.lower() and "error" in bucket_check.lower():
        print("\nWARNING: hf buckets not available in this hf CLI version — skipping bucket tests")
        results.summary()
        return

    bucket_created = False
    try:
        print(f"\nCreating bucket {bucket_id}...")
        run(["hf", "buckets", "create", bucket_id, "--private"])
        bucket_created = True
        handle = f"hf://buckets/{bucket_id}"
        with tempfile.TemporaryDirectory() as tmpdir:
            upload_dir = Path(tmpdir) / "upload"
            download_dir = Path(tmpdir) / "download"
            (upload_dir / "subdir").mkdir(parents=True)
            download_dir.mkdir()
            # Files used in bucket tests
            single_hash = generate_file(upload_dir / "single.bin", 512 * 1024)
            subdir1_hash = generate_file(upload_dir / "subdir/file1.bin", 256 * 1024)
            subdir2_hash = generate_file(upload_dir / "subdir/file2.bin", 256 * 1024)
            print(f" Generated single.bin, subdir/file1.bin, subdir/file2.bin")

            # -- 1. cp: upload single file --
            def test_bucket_cp_upload():
                t = time.time()
                run(["hf", "buckets", "cp",
                     str(upload_dir / "single.bin"), f"{handle}/single.bin"])
                print(f" Uploaded single.bin via cp in {time.time()-t:.1f}s")
            results.run("Bucket: cp upload single file", test_bucket_cp_upload)

            # -- 2. sync: upload directory --
            def test_bucket_sync_upload():
                t = time.time()
                run(["hf", "buckets", "sync",
                     str(upload_dir / "subdir"), f"{handle}/subdir"])
                print(f" Synced subdir/ up in {time.time()-t:.1f}s")
            results.run("Bucket: sync upload directory", test_bucket_sync_upload)

            # -- 3. list files (recursive quiet) --
            def test_bucket_list():
                out = run(["hf", "buckets", "list", bucket_id, "-R", "--quiet"])
                listed = set(out.splitlines())
                for path in ("single.bin", "subdir/file1.bin", "subdir/file2.bin"):
                    assert path in listed, f"Expected {path!r} in listing, got: {listed}"
                print(f" Listed {len(listed)} file(s): {sorted(listed)}")
            results.run("Bucket: list files (recursive)", test_bucket_list)

            # -- 4. cp: download single file and verify --
            def test_bucket_cp_download():
                out_path = download_dir / "single.bin"
                t = time.time()
                run(["hf", "buckets", "cp", f"{handle}/single.bin", str(out_path)])
                actual = sha256_file(out_path)
                assert actual == single_hash, (
                    f"Hash mismatch: expected {single_hash[:16]}..., got {actual[:16]}..."
                )
                print(f" Downloaded+verified single.bin in {time.time()-t:.1f}s")
            results.run("Bucket: cp download and verify", test_bucket_cp_download)

            # -- 5. sync: download directory and verify --
            def test_bucket_sync_download():
                out_dir = download_dir / "subdir"
                t = time.time()
                run(["hf", "buckets", "sync", f"{handle}/subdir", str(out_dir)])
                print(f" Synced subdir/ down in {time.time()-t:.1f}s")
                for fname, expected_hash in (
                    ("file1.bin", subdir1_hash),
                    ("file2.bin", subdir2_hash),
                ):
                    p = out_dir / fname
                    assert p.exists(), f"Missing: {p}"
                    actual = sha256_file(p)
                    assert actual == expected_hash, (
                        f"Hash mismatch for {fname}: "
                        f"expected {expected_hash[:16]}..., got {actual[:16]}..."
                    )
                    print(f" Verified subdir/{fname}")
            results.run("Bucket: sync download and verify", test_bucket_sync_download)

            # -- 6. overwrite via cp and verify new content --
            def test_bucket_overwrite():
                new_hash = generate_file(upload_dir / "single.bin", 1024 * 1024)
                run(["hf", "buckets", "cp",
                     str(upload_dir / "single.bin"), f"{handle}/single.bin"])
                out_path = download_dir / "single_overwrite.bin"
                run(["hf", "buckets", "cp", f"{handle}/single.bin", str(out_path)])
                actual = sha256_file(out_path)
                assert actual == new_hash, (
                    f"Overwrite mismatch: expected {new_hash[:16]}..., got {actual[:16]}..."
                )
                print(" Overwrite verified: new content downloaded correctly")
            results.run("Bucket: cp overwrite and verify", test_bucket_overwrite)

            # -- 7. sync --delete: remove files absent from source --
            def test_bucket_sync_delete():
                # Local subdir now only has file1.bin; sync --delete should remove file2.bin
                (upload_dir / "subdir" / "file2.bin").unlink()
                run(["hf", "buckets", "sync",
                     str(upload_dir / "subdir"), f"{handle}/subdir", "--delete"])
                out = run(["hf", "buckets", "list", bucket_id, "-R", "--quiet"])
                listed = set(out.splitlines())
                assert "subdir/file2.bin" not in listed, \
                    f"subdir/file2.bin still present after sync --delete: {listed}"
                assert "subdir/file1.bin" in listed, \
                    f"subdir/file1.bin missing after sync --delete: {listed}"
                print(f" sync --delete verified: remaining files: {sorted(listed)}")
            results.run("Bucket: sync --delete removes extraneous files", test_bucket_sync_delete)

            # -- 8. rm: delete a file and confirm it's gone --
            def test_bucket_rm():
                run(["hf", "buckets", "rm", f"{bucket_id}/single.bin", "--yes"])
                out = run(["hf", "buckets", "list", bucket_id, "-R", "--quiet"])
                listed = set(out.splitlines())
                assert "single.bin" not in listed, \
                    f"single.bin still present after rm: {listed}"
                print(f" rm verified: remaining files: {sorted(listed)}")
            results.run("Bucket: rm file", test_bucket_rm)
    finally:
        if bucket_created and not args.keep_repo:
            print(f"\nCleaning up bucket {bucket_id}...")
            try:
                run(["hf", "buckets", "delete", bucket_id, "--yes"])
                print(" Deleted.")
            except Exception as e:
                print(f" Warning: failed to delete bucket: {e}", file=sys.stderr)
    results.summary()


if __name__ == "__main__":
    main()