RFdiffusion

mirror of https://github.com/RosettaCommons/RFdiffusion.git synced 2026-06-04 18:44:21 +08:00

Author	SHA1	Message	Date
Leonardo Marino-Ramirez	529b756796	feat: add inference.empty_cache_per_design flag to reduce CUDA allocator fragmentation (#451)	2026-04-24 10:41:07 -06:00

## Problem

When running RFdiffusion with variable-length contigs (e.g. `contigmap.contigs=[A1-469/0 1-50]`) over hundreds or thousands of designs, per-worker VRAM grows steadily from ~7 GB to 10–13 GB per process. This limits how many workers can run in parallel on a single GPU before exhausting VRAM.

Root cause: PyTorch's CUDA caching allocator accumulates fragmented memory blocks across designs. With variable-length contigs each design allocates differently-sized tensors; freed blocks are cached but cannot be reused for different-sized allocations, causing steady VRAM growth.

## Fix

Add an optional `inference.empty_cache_per_design` flag (default `False`, opt-in) that calls `torch.cuda.empty_cache()` at the end of each design iteration. This releases all unused cached CUDA memory blocks back to the CUDA memory manager, keeping each worker near its initial VRAM footprint for the full run.

### Changes

`config/inference/base.yaml`

```yaml
write_trajectory: True
empty_cache_per_design: False  # NEW
```

`scripts/run_inference.py`, after the trajectory/PDB write block and before `log.info`:

```python
if conf.inference.empty_cache_per_design and torch.cuda.is_available():
    torch.cuda.empty_cache()
log.info(f"Finished design in {(time.time()-start_time)/60:.2f} minutes")
```

## Measured impact

Tested on NVIDIA RTX 5090 32 GB running a long PPI campaign with variable-length contigs:

| Setting | Per-worker VRAM (steady-state) |
|---------|-------------------------------|
| Without fix | 8–13 GB (grows over run) |
| With `empty_cache_per_design=True` | ~5.2 GB (stable) |

This allowed raising the number of parallel workers from 3 to 5 on a 32 GB GPU.

## Why opt-in

`torch.cuda.empty_cache()` adds a small per-design overhead (~1–2 ms) and is only beneficial for long runs with variable-length contigs. For short runs or fixed-length designs there is no fragmentation issue, so the default remains `False` to preserve existing behavior.

## Testing

All 20 applicable tests in `tests/test_diffusion.py` pass with this change. The one skipped test (`design_ppi_scaffolded`) fails due to a missing `ppi_scaffolds/` directory in the test fixture, a pre-existing issue unrelated to this PR.

## Notes

- Placement is after both the PDB write (`writepdb`) and the optional trajectory block: every consumer of `denoised_xyz_stack` / `px0_xyz_stack` has already finished before the cache is cleared.
- This does not affect memory held by live tensors; it only frees cached-but-unused blocks.
- Compatible with all existing RFdiffusion design modes (PPI, motif scaffolding, unconditional).
Brahm Yachnin	63e270f715	For fixed chains, retain residue numbering. For chains that are completely fixed, retain the residue numbering from the input rather than renumbering. For chains that are partially or fully designed by RFdiffusion, it isn't clear to me what the 'correct' behaviour should be, so these chains will be re-numbered starting at residue 1.	2025-05-20 10:38:19 -04:00
tmsincomb	e783762568	changed --parents to -p to be used by each OS	2023-05-30 03:44:43 -07:00
Brian Loyal	487c72068a	Add script to download model files	2023-04-21 13:10:10 -07:00
Nathaniel Bennett	92b83decf3	Add verbose checking for GPU	2023-04-14 13:05:24 -07:00
Sam DeLuca	94fb2d8d1c	restructuring library as a python module	2023-04-03 14:33:05 -07:00
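
The root cause given in commit 529b756796 (freed blocks are cached but not reused for differently-sized requests) can be illustrated with a toy model. The sketch below is an assumption-laden simplification and not PyTorch's allocator: `ToyCachingAllocator` and `run_designs` are hypothetical names, blocks are matched by exact size only, and each "design" makes a single allocation. The real caching allocator can split and reuse larger blocks, so actual growth is less extreme than this model suggests.

```python
from collections import defaultdict


class ToyCachingAllocator:
    """Toy model (hypothetical): cached blocks are reused only for an identical size."""

    def __init__(self):
        self.reserved = 0               # units obtained from the "device"
        self.in_use = 0                 # units held by live "tensors"
        self._free = defaultdict(int)   # block size -> count of cached free blocks

    def alloc(self, size):
        if self._free[size] > 0:
            self._free[size] -= 1       # reuse a cached block of the same size
        else:
            self.reserved += size       # no match: reserve fresh memory
        self.in_use += size
        return size

    def free(self, size):
        self.in_use -= size
        self._free[size] += 1           # block stays cached, still counted as reserved

    def empty_cache(self):
        # Return every cached-but-unused block; live allocations are untouched.
        cached = sum(size * count for size, count in self._free.items())
        self.reserved -= cached
        self._free.clear()


def run_designs(lengths, empty_cache_per_design):
    """Simulate one allocation per design and return the peak reserved size."""
    allocator = ToyCachingAllocator()
    peak_reserved = 0
    for length in lengths:
        block = allocator.alloc(length)
        peak_reserved = max(peak_reserved, allocator.reserved)
        allocator.free(block)
        if empty_cache_per_design:
            allocator.empty_cache()
    return peak_reserved


variable_lengths = [100, 120, 140, 160, 180]
print(run_designs(variable_lengths, empty_cache_per_design=False))  # 700
print(run_designs(variable_lengths, empty_cache_per_design=True))   # 180
print(run_designs([100] * 5, empty_cache_per_design=False))         # 100
```

In this model, variable-length designs without cache clearing reserve the sum of all block sizes, while fixed-length designs reuse one cached block, which mirrors the commit's reasoning for keeping the flag opt-in.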

6 Commits
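
The numbering rule described in commit 63e270f715 (fully fixed chains keep their input residue numbers; partially or fully designed chains restart at residue 1) might be sketched as follows. This is an illustration only: `renumber_chains` and its dictionary inputs are hypothetical and do not reflect RFdiffusion's actual data structures.

```python
def renumber_chains(input_numbering, designed_mask):
    """Apply the per-chain rule described in the commit message.

    input_numbering: dict of chain ID -> list of residue numbers from the input
    designed_mask:   dict of chain ID -> list of bools, True where a residue is designed
    (Both layouts are assumptions made for this sketch.)
    """
    output_numbering = {}
    for chain_id, numbers in input_numbering.items():
        if not any(designed_mask[chain_id]):
            # Completely fixed chain: retain the input numbering, gaps included.
            output_numbering[chain_id] = list(numbers)
        else:
            # Partially or fully designed chain: renumber starting at residue 1.
            output_numbering[chain_id] = list(range(1, len(numbers) + 1))
    return output_numbering


example_numbering = {"A": [5, 6, 7, 10], "B": [20, 21, 22]}
example_mask = {"A": [False, False, False, False], "B": [False, True, False]}
print(renumber_chains(example_numbering, example_mask))
# {'A': [5, 6, 7, 10], 'B': [1, 2, 3]}
```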