6 Commits

Author SHA1 Message Date
Leonardo Marino-Ramirez
529b756796 feat: add inference.empty_cache_per_design flag to reduce CUDA allocator fragmentation (#451)
## Problem

When running RFdiffusion with variable-length contigs (e.g.
`contigmap.contigs=[A1-469/0 1-50]`) over hundreds or thousands of
designs, per-worker VRAM grows steadily from ~7 GB to 10–13 GB per
process. This limits how many workers can run in parallel on a single
GPU before exhausting VRAM.

Root cause: PyTorch's CUDA caching allocator accumulates fragmented
memory blocks across designs. With variable-length contigs, each design
allocates differently sized tensors. Freed blocks stay in the
allocator's cache, but a cached block that is too small for a later
request cannot serve it, so the allocator reserves fresh memory and
VRAM usage grows steadily.
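
The growth pattern described above can be illustrated with a deliberately simplified toy model. This is not PyTorch's allocator (the real one rounds sizes, splits blocks, and keeps separate pools); `ToyCachingAllocator` and `peak_reserved` are hypothetical names used only for this sketch, and sizes are in arbitrary units.

```python
import bisect
from typing import List, Sequence


class ToyCachingAllocator:
    """Toy model: freed blocks stay cached and are reused only if large enough.

    Assumption for illustration: blocks are never split or merged.
    """

    def __init__(self) -> None:
        self.free_blocks: List[int] = []  # sorted sizes of cached, unused blocks
        self.live = 0                     # memory held by live "tensors"
        self.reserved = 0                 # live + cached memory taken from the device

    def alloc(self, size: int) -> int:
        # Best fit: the smallest cached block that can hold the request.
        i = bisect.bisect_left(self.free_blocks, size)
        if i < len(self.free_blocks):
            block = self.free_blocks.pop(i)
        else:
            block = size                  # cache miss: take new memory from the device
            self.reserved += block
        self.live += block
        return block

    def free(self, block: int) -> None:
        self.live -= block
        bisect.insort(self.free_blocks, block)  # cached, not returned to the device

    def empty_cache(self) -> None:
        # Return every cached-but-unused block; live blocks are untouched.
        self.reserved -= sum(self.free_blocks)
        self.free_blocks.clear()


def peak_reserved(design_sizes: Sequence[int], empty_cache_per_design: bool) -> int:
    """Run one allocation per design and report the peak reserved memory."""
    allocator = ToyCachingAllocator()
    peak = 0
    for size in design_sizes:
        block = allocator.alloc(size)
        peak = max(peak, allocator.reserved)
        allocator.free(block)
        if empty_cache_per_design:
            allocator.empty_cache()
    return peak
```

With strictly growing sizes `[10, 20, 30, 40]`, no cached block is ever large enough for the next request, so the toy model peaks at 100 units of reserved memory without clearing and at 40 units with clearing after each design.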

## Fix

Add an optional `inference.empty_cache_per_design` flag (default
`False`, opt-in) that calls `torch.cuda.empty_cache()` at the end of
each design iteration. This releases all unused cached CUDA memory
blocks back to the CUDA memory manager, keeping each worker near its
initial VRAM footprint for the full run.

### Changes

**`config/inference/base.yaml`**
```yaml
  write_trajectory: True
  empty_cache_per_design: False   # NEW
```

**`scripts/run_inference.py`** — after the trajectory/PDB write block,
before `log.info`:
```python
        if conf.inference.empty_cache_per_design and torch.cuda.is_available():
            torch.cuda.empty_cache()

        log.info(f"Finished design in {(time.time()-start_time)/60:.2f} minutes")
```
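
To show where the check sits relative to the per-design work, here is a minimal, self-contained sketch of the loop. `maybe_empty_cache`, `run_designs`, and `design_fn` are hypothetical stand-ins and not names from `scripts/run_inference.py`; the `torch` import is made optional only so the sketch also runs on machines without PyTorch.

```python
import time
from types import SimpleNamespace

try:
    import torch  # optional in this sketch only
except ImportError:
    torch = None


def maybe_empty_cache(conf) -> bool:
    """Release cached CUDA blocks if the flag is set and a GPU is present.

    Returns True only when torch.cuda.empty_cache() was actually called.
    """
    if not conf.inference.empty_cache_per_design:
        return False
    if torch is None or not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


def run_designs(conf, design_fn, num_designs: int) -> int:
    """Hypothetical stand-in for the design loop; returns how often the cache was cleared."""
    cleared = 0
    for i_des in range(num_designs):
        start_time = time.time()
        design_fn(i_des)  # sampling plus PDB/trajectory writes would happen here
        if maybe_empty_cache(conf):  # runs only after all outputs are written
            cleared += 1
        print(f"Finished design in {(time.time() - start_time) / 60:.2f} minutes")
    return cleared


# Default configuration: the flag is off, so the cache is never cleared.
default_conf = SimpleNamespace(inference=SimpleNamespace(empty_cache_per_design=False))
run_designs(default_conf, lambda i: None, num_designs=2)
```

Keeping the check in one helper makes the opt-in behavior easy to verify: with the flag off the helper returns `False` without touching CUDA at all.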

## Measured impact

Tested on NVIDIA RTX 5090 32 GB running a long PPI campaign with
variable-length contigs:

| Setting | Per-worker VRAM (steady-state) |
|---------|-------------------------------|
| Without fix | 8–13 GB (grows over run) |
| With `empty_cache_per_design=True` | ~5.2 GB (stable) |

This allowed raising the number of parallel workers from 3 to 5 on a 32
GB GPU.
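
As a rough sanity check on the worker counts, the arithmetic can be sketched as below. The PR does not state how the counts were chosen, so `max_workers` and its `headroom_gb` parameter are assumptions for illustration, and the estimate treats the workers as the only consumers of VRAM.

```python
import math


def max_workers(total_vram_gb: float, per_worker_gb: float, headroom_gb: float = 0.0) -> int:
    """Upper bound on the number of concurrent workers that fit in VRAM."""
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    usable = total_vram_gb - headroom_gb
    return max(0, math.floor(usable / per_worker_gb))


print(max_workers(32, 10))   # 3: footprint of about 10 GB per worker
print(max_workers(32, 5.2))  # 6: footprint of about 5.2 GB per worker
```

Under these assumptions a 10 GB footprint allows three workers, and a 5.2 GB footprint allows up to six, so running five leaves about 6 GB of the 32 GB unused.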

## Why opt-in

`torch.cuda.empty_cache()` adds a small per-design overhead (~1–2 ms)
and is only beneficial for long runs with variable-length contigs. For
short runs or fixed-length designs there is no fragmentation issue, so
the default remains `False` to preserve existing behavior.

## Testing

All 20 applicable tests in `tests/test_diffusion.py` pass with this
change. The one skipped test (`design_ppi_scaffolded`) was excluded
because it fails on a missing `ppi_scaffolds/` directory in the test
fixture, a pre-existing issue unrelated to this PR.

## Notes

- Placement is after both the PDB write (`writepdb`) and the optional
trajectory block — every consumer of `denoised_xyz_stack` /
`px0_xyz_stack` has already finished before the cache is cleared.
- This does not affect memory held by live tensors; it frees only
cached-but-unused blocks.
- Compatible with all existing RFdiffusion design modes (PPI, motif
scaffolding, unconditional).
2026-04-24 10:41:07 -06:00
Brahm Yachnin
63e270f715 For fixed chains, retain residue numbering
For chains that are completely fixed, retain the residue numbering from
the input rather than renumbering.  For chains that are partially or
fully designed by RFdiffusion, it isn't clear to me what the 'correct'
behaviour should be, so these chains will be re-numbered starting at
residue 1.
2025-05-20 10:38:19 -04:00
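
The numbering rule described in commit 63e270f715 can be sketched as follows. This is an illustrative reading of the commit message, not the repository's implementation: the `Residue` tuple layout and the `output_numbering` function are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (chain ID, residue number in the input PDB or None for a designed position, is_fixed)
Residue = Tuple[str, Optional[int], bool]


def output_numbering(residues: List[Residue]) -> List[Tuple[str, Optional[int]]]:
    """Keep input numbering for completely fixed chains; renumber all others from 1."""
    # A chain counts as completely fixed only if every one of its residues is fixed.
    fully_fixed: Dict[str, bool] = {}
    for chain, _, is_fixed in residues:
        fully_fixed[chain] = fully_fixed.get(chain, True) and is_fixed

    counters: Dict[str, int] = {}
    numbered: List[Tuple[str, Optional[int]]] = []
    for chain, input_num, _ in residues:
        if fully_fixed[chain]:
            # Fixed residues come from the input, so input_num is assumed to be set.
            numbered.append((chain, input_num))
        else:
            counters[chain] = counters.get(chain, 0) + 1
            numbered.append((chain, counters[chain]))
    return numbered


example = [
    ("A", 101, True), ("A", 102, True),              # chain A: completely fixed
    ("B", 7, True), ("B", None, False), ("B", 9, True),  # chain B: partially designed
]
print(output_numbering(example))
```

In this example chain A keeps residues 101 and 102, while chain B, which contains one designed position, is renumbered 1 to 3.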
tmsincomb
e783762568 Changed --parents to -p so it works on every OS 2023-05-30 03:44:43 -07:00
Brian Loyal
487c72068a Add script to download model files 2023-04-21 13:10:10 -07:00
Nathaniel Bennett
92b83decf3 Add verbose checking for GPU 2023-04-14 13:05:24 -07:00
Sam DeLuca
94fb2d8d1c restructuring library as a python module 2023-04-03 14:33:05 -07:00