Updated diagnostics scripts to collect logs (#542)

- also updated README
- added analysis script to load latest dump collected

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Rajat Arya
2025-10-29 08:48:47 -07:00
committed by GitHub
parent 85b5ba5fa7
commit 03c190325f
5 changed files with 352 additions and 32 deletions

@@ -43,7 +43,7 @@ Please join us in making xet-core better. We value everyone's contributions. Cod
## Issues, Diagnostics & Debugging
If you encounter an issue when using `hf-xet` please help us fix the issue by collecting diagnostic information and attaching that when creating a [new Issue](https://github.com/huggingface/xet-core/issues/new/choose). Download the [hf-xet-diag-linux.sh](hf-xet-diag-linux.sh) or [hf-xet-diag-windows.sh](hf-xet-diag-windows.sh) script based on your operating system and then re-run the Python command that resulted in the issue. The diagnostic scripts will download and install debug symbols, set up logging, and take periodic stack traces throughout process execution in a diagnostics directory that is easy to analyze, package, and upload.
If you encounter an issue when using `hf-xet` please help us fix the issue by collecting diagnostic information and attaching that when creating a [new Issue](https://github.com/huggingface/xet-core/issues/new/choose). Download the [hf-xet-diag-linux.sh](hf-xet-diag-linux.sh), [hf-xet-diag-macos.sh](hf-xet-diag-macos.sh), or [hf-xet-diag-windows.sh](hf-xet-diag-windows.sh) script based on your operating system and then re-run the Python command that resulted in the issue. The diagnostic scripts will download and install debug symbols, set up logging, and take periodic stack traces throughout process execution in a diagnostics directory that is easy to analyze, package, and upload.
### Diagnostics - Linux (`hf-xet-diag-linux.sh`)
@@ -103,7 +103,7 @@ sudo xcode-select --install
### Output Layout
Both scripts produce a diagnostics directory named:
The diagnostic scripts produce a diagnostics directory named:
```
diag_<command>_<timestamp>/
@@ -120,53 +120,70 @@ This unified layout makes it easier to compare diagnostics across platforms.
### Analyzing Dumps
### Usage
Use the [hf-xet-diag-analyze-latest.sh](hf-xet-diag-analyze-latest.sh) script to automatically find and open the most recent dump in the appropriate debugger for your platform.
From your repo root:
**Usage:**
```bash
./analyze-latest.sh
./hf-xet-diag-analyze-latest.sh
```
* Finds the most recent `diag_*` directory.
* Opens the latest dump inside:
* Auto-detects your OS (Linux, macOS, or Windows)
* Finds the most recent `diag_*` directory
* Opens the latest dump in the platform-appropriate debugger:
* **Linux:** `gdb` with core dumps from `dumps/`
* **macOS:** `lldb` with `.core` files from `dumps/`
* **Windows (Git-Bash):** `windbg` with `.dmp` files from `stacks/`
* **Linux:** opens `dumps/core_*` in `gdb`.
* **Windows (Git-Bash):** opens `stacks/*.dmp` in **WinDbg** (`windbg` must be on PATH).
* You can also pass a base directory if your diagnostics are stored elsewhere:
You can also specify a diagnostics directory:
```bash
./analyze-latest.sh /path/to/diagnostics
```
```bash
./hf-xet-diag-analyze-latest.sh diag_python_hfxet_test_20250127120000
```
**Manual Analysis**
If you prefer to analyze dumps manually:
**Linux**
* Stack traces are saved under `stacks/` as plain text.
* Core dumps (`dumps/core_*`) can be analyzed with gdb:
* Stack traces: `stacks/*.txt` (plain text, captured periodically)
* Core dumps: `dumps/core_*`
* Analysis:
```bash
gdb python dumps/core_<pid>
(gdb) bt # backtrace
(gdb) thread apply all bt
gdb python dumps/core_<timestamp>.<pid>
(gdb) bt # backtrace of current thread
(gdb) thread apply all bt # backtrace of all threads
(gdb) info threads # list all threads
```
* Ensure the matching debug symbols (`hf_xet-*.dbg`) are in the `hf_xet` package directory.
* Ensure debug symbols (`hf_xet-*.so.dbg`) are in the `hf_xet` package directory
**macOS**
* Stack traces: `stacks/*.txt` (from `sample` command)
* Core dumps: `dumps/dump_<pid>_<timestamp>.core`
* Analysis:
```bash
lldb -c dumps/dump_<pid>_<timestamp>.core python3
(lldb) bt # backtrace of current thread
(lldb) thread backtrace all # backtrace of all threads
(lldb) thread list # list all threads
```
* Ensure debug symbols (`hf_xet-*.dylib.dSYM`) are in the `hf_xet` package directory
**Windows**
* Dumps are saved under `stacks/` as `.dmp` files.
* Open `.dmp` files in **WinDbg** (install via [Windows SDK](https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/)):
* Dumps: `stacks/dump_<timestamp>.dmp`
* Install [WinDbg via Windows SDK](https://developer.microsoft.com/en-us/windows/downloads/windows-sdk/)
* Analysis:
```cmd
windbg -z dump_20250101_120000.dmp
windbg -z stacks\dump_<timestamp>.dmp
```
* Common WinDbg commands:
```
!analyze -v # Automatic analysis
~* kb # Show stack traces for all threads
lm # List loaded modules (verify hf_xet.pdb loaded)
!analyze -v # automatic analysis
~* kb # backtrace of all threads
~ # list all threads
lm # list loaded modules (verify hf_xet.pdb loaded)
```
* Ensure `hf_xet.pdb` is installed in the `hf_xet` package directory so symbols load correctly.
* Ensure debug symbols (`hf_xet.pdb`) are in the `hf_xet` package directory
---
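For unattended triage, the interactive commands listed above can also be run in batch mode. The sketch below is not part of the repository: `build_backtrace_cmd` and the `DEBUG_CMD` array are hypothetical names, and it assumes the `gdb -batch -ex` and `lldb --batch -o` options are available in your debugger version.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in xet-core): fill DEBUG_CMD with a debugger
# invocation that prints backtraces for all threads and then exits.
build_backtrace_cmd() {
  local os="$1" python_exe="$2" dump="$3"
  case "$os" in
    linux)
      DEBUG_CMD=(gdb -batch -ex "thread apply all bt" "$python_exe" "$dump")
      ;;
    macos)
      DEBUG_CMD=(lldb --batch -o "thread backtrace all" -c "$dump" "$python_exe")
      ;;
    *)
      echo "unsupported OS: $os" >&2
      return 1
      ;;
  esac
}

# Usage: build the command, then run it and save the output, e.g.
#   build_backtrace_cmd linux "$(command -v python3)" path/to/core_file
#   "${DEBUG_CMD[@]}" > backtrace.txt
```

Storing the command in an array keeps arguments with spaces (such as `thread apply all bt`) intact when the command is executed.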

hf-xet-diag-analyze-latest.sh Executable file

@@ -0,0 +1,246 @@
#!/usr/bin/env bash
# hf-xet-diag-analyze-latest.sh — Cross-platform dump analyzer
# Finds the latest diagnostics directory and opens the most recent dump
# in the appropriate debugger for your platform (gdb, lldb, or WinDbg).
set -Eeuo pipefail
print_usage() {
cat <<'USAGE'
Usage: hf-xet-diag-analyze-latest.sh [diagnostics-directory]
Finds and analyzes the latest dump from a diagnostics collection.
Arguments:
diagnostics-directory Path to a specific diag_* directory
(default: latest diag_* in current directory)
Examples:
./hf-xet-diag-analyze-latest.sh
./hf-xet-diag-analyze-latest.sh diag_python_hfxet_test_20250127120000
This script will:
- Auto-detect your OS (Linux, macOS, or Windows)
- Find the most recent dump file
- Launch the appropriate debugger:
* Linux: gdb with core dumps from dumps/
* macOS: lldb with .core files from dumps/
* Windows: WinDbg with .dmp files from stacks/
USAGE
}
# --- option parsing ---
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
print_usage
exit 0
fi
DIAG_DIR="${1:-}"
# --- find diagnostics directory ---
if [[ -z "$DIAG_DIR" ]]; then
# Find the latest diag_* directory in current directory
# -r: do not run ls on empty input (GNU xargs would otherwise list ".")
DIAG_DIR=$(find . -maxdepth 1 -type d -name "diag_*" -print0 2>/dev/null | \
xargs -0 -r ls -dt 2>/dev/null | head -1 || true)
if [[ -z "$DIAG_DIR" ]]; then
echo "ERROR: No diag_* directories found in current directory."
echo "Please specify a diagnostics directory or run from a directory containing diag_* folders."
exit 1
fi
echo "Found latest diagnostics directory: $DIAG_DIR"
elif [[ ! -d "$DIAG_DIR" ]]; then
echo "ERROR: Directory not found: $DIAG_DIR"
exit 1
fi
# --- detect OS ---
OS_TYPE=""
case "${OSTYPE:-}" in
linux*) OS_TYPE="linux" ;;
darwin*) OS_TYPE="macos" ;;
msys*|mingw*|cygwin*) OS_TYPE="windows" ;;
*)
# Fallback: check uname
UNAME=$(uname -s 2>/dev/null || echo "")
case "$UNAME" in
Linux*) OS_TYPE="linux" ;;
Darwin*) OS_TYPE="macos" ;;
MINGW*|MSYS*|CYGWIN*) OS_TYPE="windows" ;;
*)
echo "ERROR: Unsupported OS: ${OSTYPE:-unknown} / ${UNAME:-unknown}"
exit 1
;;
esac
;;
esac
echo "Detected OS: $OS_TYPE"
# --- find latest dump file ---
DUMP_FILE=""
case "$OS_TYPE" in
linux)
# Linux: look for core dumps in dumps/ directory
if [[ -d "$DIAG_DIR/dumps" ]]; then
# -r: do not run ls on empty input (GNU xargs would otherwise list the current directory)
DUMP_FILE=$(find "$DIAG_DIR/dumps" -type f -name "core_*" -print0 2>/dev/null | \
xargs -0 -r ls -t 2>/dev/null | head -1 || true)
fi
if [[ -z "$DUMP_FILE" ]]; then
echo "ERROR: No core dumps found in $DIAG_DIR/dumps/"
echo "Core dumps should be named: core_<timestamp>.<pid>"
exit 1
fi
;;
macos)
# macOS: look for .core files in dumps/ directory
if [[ -d "$DIAG_DIR/dumps" ]]; then
# -r: accepted as a no-op by BSD xargs, needed for GNU xargs on empty input
DUMP_FILE=$(find "$DIAG_DIR/dumps" -type f -name "*.core" -print0 2>/dev/null | \
xargs -0 -r ls -t 2>/dev/null | head -1 || true)
fi
if [[ -z "$DUMP_FILE" ]]; then
echo "ERROR: No core dumps found in $DIAG_DIR/dumps/"
echo "Core dumps should be named: dump_<pid>_<timestamp>.core"
exit 1
fi
;;
windows)
# Windows: look for .dmp files in stacks/ directory
if [[ -d "$DIAG_DIR/stacks" ]]; then
# -r: do not run ls on empty input (GNU xargs would otherwise list the current directory)
DUMP_FILE=$(find "$DIAG_DIR/stacks" -type f -name "*.dmp" -print0 2>/dev/null | \
xargs -0 -r ls -t 2>/dev/null | head -1 || true)
fi
if [[ -z "$DUMP_FILE" ]]; then
echo "ERROR: No dump files found in $DIAG_DIR/stacks/"
echo "Dump files should be named: dump_<timestamp>.dmp"
exit 1
fi
;;
esac
echo "Found dump file: $DUMP_FILE"
# --- determine python executable ---
PYTHON_EXE=""
for py_candidate in python3 python; do
if command -v "$py_candidate" >/dev/null 2>&1; then
PYTHON_EXE=$(command -v "$py_candidate")
break
fi
done
if [[ -z "$PYTHON_EXE" ]]; then
echo "WARNING: Could not find python executable. Using 'python' as fallback."
PYTHON_EXE="python"
else
echo "Using python executable: $PYTHON_EXE"
fi
# --- launch debugger ---
case "$OS_TYPE" in
linux)
if ! command -v gdb >/dev/null 2>&1; then
echo "ERROR: gdb not found. Install with: sudo apt-get install gdb"
exit 1
fi
echo ""
echo "======================================"
echo "Opening dump in GDB..."
echo "======================================"
echo "Useful commands:"
echo " (gdb) bt # backtrace of current thread"
echo " (gdb) thread apply all bt # backtrace of all threads"
echo " (gdb) info threads # list all threads"
echo " (gdb) quit # exit gdb"
echo "======================================"
echo ""
exec gdb "$PYTHON_EXE" "$DUMP_FILE"
;;
macos)
if ! command -v lldb >/dev/null 2>&1; then
echo "ERROR: lldb not found. Install with: xcode-select --install"
exit 1
fi
echo ""
echo "======================================"
echo "Opening dump in LLDB..."
echo "======================================"
echo "Useful commands:"
echo " (lldb) bt # backtrace of current thread"
echo " (lldb) thread backtrace all # backtrace of all threads"
echo " (lldb) thread list # list all threads"
echo " (lldb) quit # exit lldb"
echo "======================================"
echo ""
exec lldb -c "$DUMP_FILE" "$PYTHON_EXE"
;;
windows)
# Check for various WinDbg installations
WINDBG_EXE=""
# Check if windbg is on PATH
if command -v windbg.exe >/dev/null 2>&1; then
WINDBG_EXE="windbg.exe"
elif command -v windbgx.exe >/dev/null 2>&1; then
WINDBG_EXE="windbgx.exe"
else
# Common installation paths
# Default to empty so unset variables do not abort the script under `set -u`
for dbg_path in \
"/c/Program Files (x86)/Windows Kits/10/Debuggers/x64/windbg.exe" \
"/c/Program Files (x86)/Windows Kits/10/Debuggers/x86/windbg.exe" \
"${PROGRAMFILES:-}/Windows Kits/10/Debuggers/x64/windbg.exe" \
"${PROGRAMFILES_X86:-}/Windows Kits/10/Debuggers/x64/windbg.exe"
do
if [[ -f "$dbg_path" ]]; then
WINDBG_EXE="$dbg_path"
break
fi
done
fi
if [[ -z "$WINDBG_EXE" ]]; then
echo "ERROR: WinDbg not found."
echo ""
echo "Please install WinDbg from:"
echo " https://developer.microsoft.com/en-us/windows/downloads/windows-sdk/"
echo ""
echo "Or add WinDbg to your PATH."
echo ""
echo "You can manually open the dump file:"
echo " windbg -z \"$DUMP_FILE\""
exit 1
fi
echo ""
echo "======================================"
echo "Opening dump in WinDbg..."
echo "======================================"
echo "Useful commands:"
echo " !analyze -v # automatic analysis"
echo " ~* kb # backtrace of all threads"
echo " ~ # list all threads"
echo " lm # list loaded modules"
echo " q # quit"
echo "======================================"
echo ""
# Convert to Windows path format
DUMP_FILE_WIN=$(cygpath -w "$DUMP_FILE" 2>/dev/null || echo "$DUMP_FILE")
exec "$WINDBG_EXE" -z "$DUMP_FILE_WIN"
;;
esac
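
The "newest match" lookups in this script can also be written without `find | xargs ls -t`. The following is a minimal sketch rather than the script's actual implementation: `latest_match` is a hypothetical helper that relies on the bash `-nt` (newer-than) file test and prints nothing when no path matches.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in xet-core): print the most recently modified
# path in directory $1 whose name matches glob $2. Returns non-zero and
# prints nothing if there is no match.
latest_match() {
  local dir="$1" pattern="$2" newest="" candidate
  for candidate in "$dir"/$pattern; do
    # An unmatched glob stays literal, so skip paths that do not exist.
    [[ -e "$candidate" ]] || continue
    if [[ -z "$newest" || "$candidate" -nt "$newest" ]]; then
      newest="$candidate"
    fi
  done
  [[ -n "$newest" ]] && printf '%s\n' "$newest"
}

# Usage, e.g.:
#   DIAG_DIR=$(latest_match . 'diag_*' || true)
#   DUMP_FILE=$(latest_match "$DIAG_DIR/dumps" 'core_*' || true)
```

Because the loop never invokes `ls`, an empty directory yields an empty result instead of a listing of the current directory.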

@@ -167,6 +167,7 @@ else
fi
# --- launch target ---
SCRIPT_START_TIME=$(date +%s)
echo "Launching target at $(date -Is) ..." | tee -a "$CONSOLE_LOG"
LAUNCH_ENV=()
@@ -275,5 +276,22 @@ while kill -0 "$TARGET_PID" 2>/dev/null; do
done
echo "Process $TARGET_PID has exited at $(date -Is)." | tee -a "$CONSOLE_LOG"
# --- collect xet log files from this execution ---
HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
XET_LOG_DIR="$HF_HOME/xet/logs"
if [[ -d "$XET_LOG_DIR" ]]; then
echo "Collecting xet logs from $XET_LOG_DIR ..." | tee -a "$CONSOLE_LOG"
mkdir -p "$OUTDIR/xet_logs"
# Find log files created during or after script start time using GNU find
find "$XET_LOG_DIR" -name "xet_*.log" -type f -newermt "@$SCRIPT_START_TIME" 2>/dev/null | while read -r logfile; do
cp "$logfile" "$OUTDIR/xet_logs/" 2>/dev/null && \
echo " Copied: $(basename "$logfile")" | tee -a "$CONSOLE_LOG"
done
else
echo "No xet log directory found at $XET_LOG_DIR" | tee -a "$CONSOLE_LOG"
fi
echo "Logs and stacks are in: $OUTDIR"
disown "$LOGGER_BG" 2>/dev/null || true

@@ -132,8 +132,12 @@ else
fi
# --- launch target ---
SCRIPT_START_TIME=$(date +%s)
REF_FILE="$OUTDIR/.ref_timestamp"
touch "$REF_FILE" # Reference file for finding logs created after this point
# Ensure REF_FILE is cleaned up on exit
trap 'rm -f "$REF_FILE"' EXIT
echo "Launching target at $(date "+%Y-%m-%dT%H:%M:%S%z") ..." | tee -a "$CONSOLE_LOG"
(
"${CMD[@]}" & echo $! > "$PID_FILE"
) 2>&1 | tee -a "$CONSOLE_LOG" &
@@ -213,6 +217,23 @@ while kill -0 "$TARGET_PID" 2>/dev/null; do
done
echo "Process $TARGET_PID has exited at $(date "+%Y-%m-%dT%H:%M:%S%z")." | tee -a "$CONSOLE_LOG"
# --- collect xet log files from this execution ---
HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
XET_LOG_DIR="$HF_HOME/xet/logs"
if [[ -d "$XET_LOG_DIR" ]]; then
echo "Collecting xet logs from $XET_LOG_DIR ..." | tee -a "$CONSOLE_LOG"
mkdir -p "$OUTDIR/xet_logs"
# Find log files created after script start using reference file
find "$XET_LOG_DIR" -name "xet_*.log" -type f -newer "$REF_FILE" 2>/dev/null | while read -r logfile; do
cp "$logfile" "$OUTDIR/xet_logs/" 2>/dev/null && \
echo " Copied: $(basename "$logfile")" | tee -a "$CONSOLE_LOG"
done
else
echo "No xet log directory found at $XET_LOG_DIR" | tee -a "$CONSOLE_LOG"
fi
echo "Logs and stacks are in: $OUTDIR"
disown "$LOGGER_BG" 2>/dev/null || true
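
The reference-file approach in this hunk works with both BSD and GNU `find`, since it needs only `-newer`. As an illustration under assumed names (`collect_new_logs` and its three arguments are hypothetical, not part of the diagnostic scripts), the same idea can be packaged as a function that also tolerates unusual file names by using NUL-delimited output:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in xet-core): copy xet_*.log files from $1 into $2
# if they were modified after the reference file $3.
collect_new_logs() {
  local src="$1" dest="$2" ref="$3" logfile
  mkdir -p "$dest"
  # -print0 with read -d '' keeps file names containing spaces or newlines intact.
  find "$src" -type f -name 'xet_*.log' -newer "$ref" -print0 2>/dev/null |
    while IFS= read -r -d '' logfile; do
      cp "$logfile" "$dest/" && echo "Copied: $(basename "$logfile")"
    done
}

# Usage, e.g.:
#   touch "$OUTDIR/.ref_timestamp"      # before launching the target
#   collect_new_logs "$HF_HOME/xet/logs" "$OUTDIR/xet_logs" "$OUTDIR/.ref_timestamp"
```

Note that `-newer` compares modification times, so a log file that existed before the run but was appended to during it is also collected.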

@@ -119,6 +119,7 @@ else
fi
# --- launch target ---
SCRIPT_START_TIME=$(date +%s)
(
"${CMD[@]}" & echo $! > "$PID_FILE"
) 2>&1 | tee "$CONSOLE_LOG" &
@@ -159,5 +160,22 @@ while kill -0 "$TARGET_PID" 2>/dev/null; do
done
echo "Process $TARGET_PID has exited at $(date -Is)." | tee -a "$CONSOLE_LOG"
# --- collect xet log files from this execution ---
HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
XET_LOG_DIR="$HF_HOME/xet/logs"
if [[ -d "$XET_LOG_DIR" ]]; then
echo "Collecting xet logs from $XET_LOG_DIR ..." | tee -a "$CONSOLE_LOG"
mkdir -p "$OUTDIR/xet_logs"
# Find log files created during or after script start time using GNU find
find "$XET_LOG_DIR" -name "xet_*.log" -type f -newermt "@$SCRIPT_START_TIME" 2>/dev/null | while read -r logfile; do
cp "$logfile" "$OUTDIR/xet_logs/" 2>/dev/null && \
echo " Copied: $(basename "$logfile")" | tee -a "$CONSOLE_LOG"
done
else
echo "No xet log directory found at $XET_LOG_DIR" | tee -a "$CONSOLE_LOG"
fi
echo "Logs and dumps are in: $OUTDIR"
disown "$LOGGER_BG" 2>/dev/null || true