docs: sync structure-inference SLURM options with AlphaPulldownSnakemake

Mirror the expanded "SLURM defaults for structure inference" section from AlphaPulldownSnakemake: add slurm_exclude_nodes and structure_inference_max_runtime to the config example, and document GPU node exclusion / runtime cap and the unified-memory options (structure_inference_unified_memory + structure_inference_xla_mem_fraction, now defaulting to "auto" = host RAM / GPU VRAM).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-04 14:14:24 +08:00 · 2026-05-21 10:27:15 +02:00
parent 716c061230
commit 4c3e83ce78
1 changed files with 71 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -223,6 +223,8 @@ slurm_qos: "normal"                         # optional QoS if your site uses it
 structure_inference_gpus_per_task: 1        # number of GPUs each inference job needs
 structure_inference_gpu_model: "3090"       # optional GPU model constraint (remove to allow any)
 structure_inference_tasks_per_gpu: 0        # <=0 keeps --ntasks-per-gpu unset in the plugin
+slurm_exclude_nodes: ""                     # optional comma-separated nodes to avoid (sbatch --exclude)
+structure_inference_max_runtime: 10080      # cap wall time (min) at the partition MaxTime
 ```

 `structure_inference_gpus_per_task` and `structure_inference_gpu_model` are read by the
@@ -234,6 +236,75 @@ fields keeps the job submission consistent across clusters.
 the default `0` prevents that flag, which avoids conflicting with the Tres-per-task request on many
 systems. Set it to a positive integer only if your site explicitly requires `--ntasks-per-gpu`.

+The remaining optional fields address two common cluster issues: inference jobs landing on GPUs the
+container can't use, and large complexes running out of GPU memory. The defaults are sensible; expand
+the sections below only if you hit one of these problems.
+
+<details>
+<summary>Avoiding unsuitable GPUs (<code>slurm_exclude_nodes</code>, <code>structure_inference_gpu_model</code>) and the runtime cap</summary>
+
+- **Restrict to one model** with `structure_inference_gpu_model` (e.g. `"A100"`) → the plugin emits
+  `--gpus=<model>:<count>`. Accepts a single model name; leave `""` for any.
+- **Exclude specific nodes** with `slurm_exclude_nodes` → passed verbatim to `sbatch --exclude`
+  (e.g. `"gpu50,gpu51"`). Use it for nodes whose GPU the container can't use, for example when the
+  GPU's CUDA compute capability is newer than the container's bundled `ptxas` supports (jobs fail with
+  `ptxas too old` / `UNIMPLEMENTED`). `--exclude` is allowed in `slurm_extra`, whereas
+  `--constraint`/`--gres`/`--gpus` are not, so it is the supported way to drop a few nodes while
+  keeping the rest of the partition.
+- **Cap wall time** with `structure_inference_max_runtime` (minutes). The requested wall time scales as
+  `1440 * attempt`, so without a cap a job that is retried often enough requests more than the
+  partition `MaxTime` and SLURM rejects it with `Requested time limit is invalid`. Set it to your
+  partition's `MaxTime` (`scontrol show partition <name>`); the default is 7 days (10080).
+
+</details>
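
Reviewer note: the runtime cap described in the last bullet above can be illustrated with a small shell sketch. This is not the workflow's actual code; `capped_runtime` is a hypothetical helper that assumes the request is simply `1440 * attempt` minutes clamped to `structure_inference_max_runtime`.

```sh
# Hypothetical helper (not part of the workflow): clamp the per-attempt
# wall time request to the configured maximum, both in minutes.
capped_runtime() {
    attempt=$1       # retry attempt, starting at 1
    max_runtime=$2   # e.g. structure_inference_max_runtime
    requested=$((1440 * attempt))
    if [ "$requested" -gt "$max_runtime" ]; then
        echo "$max_runtime"
    else
        echo "$requested"
    fi
}

# Attempt 3 stays under a 7-day cap; attempt 8 would ask for 11520 minutes
# and is clamped to 10080.
capped_runtime 3 10080
capped_runtime 8 10080
```

With a 7-day partition limit, the eighth attempt is the first one the cap changes, which is where an uncapped request would start to be rejected.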
+
+<details>
+<summary>Unified memory for large complexes (<code>structure_inference_unified_memory</code>)</summary>
+
+Large AlphaFold 3 inputs (or GPUs with less VRAM) can fail with `RESOURCE_EXHAUSTED` /
+`Allocator (GPU_0_bfc) ran out of memory`. Inference enables JAX/XLA **unified (managed) memory** by
+default, so the model spills from GPU VRAM into host RAM instead of running out of memory. Jobs run
+slower while spilling, but they complete. This is the
+[DeepMind-recommended setting](https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md)
+for large inputs. The settings are exported inside the prediction container as:
+
+```sh
+export TF_FORCE_UNIFIED_MEMORY=true
+export XLA_PYTHON_CLIENT_PREALLOCATE=false   # don't grab a huge VRAM chunk up front
+export XLA_CLIENT_MEM_FRACTION=$FRACTION      # how far past physical VRAM XLA may allocate
+export XLA_PYTHON_CLIENT_MEM_FRACTION=$FRACTION
+```
+
+`XLA_PYTHON_CLIENT_PREALLOCATE=false` is required: without it XLA reserves a large
+slice of VRAM immediately, which defeats the point of letting the allocator grow into
+host RAM on demand.
+
+```yaml
+structure_inference_unified_memory: true     # set false to fail fast on OOM instead
+structure_inference_xla_mem_fraction: auto   # "auto", or pin a number like 3.2
+```
+
+With the default `structure_inference_xla_mem_fraction: auto`, the fraction is computed
+**per job at run time** as `(allocated host RAM) / (physical GPU VRAM)`: the GPU VRAM is
+read with `nvidia-smi` once the job lands on a node, and the host RAM is the job's SLURM
+`--mem` allocation (which scales with retry attempts). This keeps the unified-memory
+ceiling within the SLURM allocation, so XLA cannot claim more host RAM than the job
+requested, which would otherwise get the job OOM-killed. The chosen fraction is
+logged as a `[unified-memory]` line at the top of the job log. Pin a number instead if
+you want a fixed multiplier regardless of GPU and RAM size (this mirrors the EMBL
+`run_AF_multimer.sh` convention).
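
Reviewer note: the `auto` computation described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code this commit ships: `xla_mem_fraction` is a hypothetical helper, both inputs are assumed to be in MiB, and the two-decimal rounding is an arbitrary choice.

```sh
# Hypothetical helper (not the workflow's actual code): ratio of allocated
# host RAM to physical GPU VRAM, both given in MiB.
xla_mem_fraction() {
    host_mib=$1
    vram_mib=$2
    awk -v h="$host_mib" -v v="$vram_mib" 'BEGIN {
        if (v <= 0) exit 1          # refuse a missing or zero VRAM value
        printf "%.2f\n", h / v      # rounding to two decimals is an assumption
    }'
}

# Example: 80 GiB of host RAM against a 24 GiB GPU.
xla_mem_fraction 81920 24576

# Inside a SLURM job on a GPU node, the inputs might be gathered like this
# (not run here; assumes the job was submitted with --mem so that
# SLURM_MEM_PER_NODE is set):
#   vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
#   xla_mem_fraction "$SLURM_MEM_PER_NODE" "$vram"
```

Keeping the arithmetic in a function that takes plain numbers makes it easy to check without a GPU, while the `nvidia-smi` query stays at the call site where the node is known.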
+
+> The fraction is computed in the job shell rather than via the SLURM executor: the
+> executor passes the submit environment through with `--export=ALL` but offers no
+> per-job env hook, and the value depends on which GPU the job lands on (only known at
+> run time). Computing it in the container shell also avoids having to pass submit-side
+> environment variables across the Apptainer container boundary.
+
+Because the spilled data lives in host RAM, make sure the job also requests enough of it
+(`structure_inference_ram_bytes`, in MB) to hold the overflow. Under `auto`, that RAM is
+exactly what the fraction is sized against.
+
+</details>
+
 ### Using Precomputed Features

 If you have precomputed protein features, specify the directory: