Addressing code review comments: test skip, docs

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
2026-06-04 12:44:26 +08:00 · 2025-11-14 16:05:10 -08:00
parent 007d854759
commit 4292bd6c95
3 changed files with 30 additions and 2 deletions
--- a/docs/source/Inference.md
+++ b/docs/source/Inference.md
@@ -143,14 +143,38 @@ Some commonly used command line flags are here. A full list of flags can be view

 ### Advanced Options for Increasing Efficiency

-#### Speeding up inference 
+#### Turning on TF32 (TensorFloat-32) precision on compatible hardware
+
+When running on latest NVIDIA GPUs, starting from Ampere, you can enable TF32 precision to get about 1.3x performance boost. 
+TF32 uses 1 sign bit, 8 exponent bits (like FP32), and 10 mantissa (significand) bits (like FP16), packed into a 32-bit word.
+It was found generally safe to use OF2 with TF32 instead of full FP32. To enable it globally in Torch: 
+
+```
+torch.backends.cuda.matmul.allow_tf32 = True       # Enable TF32 for matrix multiplications
+torch.backends.cudnn.allow_tf32 = True             # Enable TF32 for convolutions
+``` 
+Make sure NVIDIA_TF32_OVERRIDE environment variable is either not defined or set to 1.
+
+#### Applying lower BF16 precision to EvoformerStack and ExtraMSAStack
+
+BF16 occupies 16 bits: 1 sign bit, 8 exponent bits (same as FP32), and 7 mantissa (fraction) bits. Its dynamic range is equivalent to FP32, but BF16 can only represent numbers with about three decimal digits of precision.
+It was found generally safe to apply BF16 precision cast to EvoformerStack and ExtraMSAStack. This allows to achieve ~1.5x speedup compared to TF32 inferenceof the whole model. 
+To apply BF16, use '--precision=bf16' argument. '--precision=fp16' is also supported, but not recommended due to numerical instability. 
+
+#### Speeding up inference with custom attention and multiplicative update kernels

 The **DeepSpeed DS4Sci_EvoformerAttention kernel** is a memory-efficient attention kernel developed as part of a collaboration between OpenFold and the DeepSpeed4Science initiative. 

 If your system supports deepseed, using deepspeed generally leads an inference speedup of 2 - 3x without significant additional memory use. You may specify this option by selecting the `--use_deepspeed_inference` argument. 

+OF2 supports the CUEquivariance [triangle_multiplicative_update](https://docs.nvidia.com/cuda/cuequivariance/api/generated/cuequivariance_torch.triangle_multiplicative_update.html) and [triangle_attention](https://docs.nvidia.com/cuda/cuequivariance/api/generated/cuequivariance_torch.triangle_attention.html) kernels which can speed up inference/training of the model 1.2 to 1.5 on top of DeepSpeed and even more for sequences with > 1000 residues. To enable, pass '--use_cuequivariance_attention' and  '--use_cuequivariance_multiplicative_update' arguments to run_pretrained_openfold.py.
+CUEquivariance does fall back to DeepSpeed on shapes it does not efficiently support, so enable both for best effect. 
+
 If DeepSpeed is unavailable for your system, you may also try using [FlashAttention](https://github.com/HazyResearch/flash-attention) by adding `globals.use_flash = True` to the `--experiment_config_json`. Note that FlashAttention appears to work best for sequences with < 1000 residues.

+####  Speeding up inference with TensorRT
+Alternatively (or together with CUEquivariance), you can try applying [TensorRT](https://developer.nvidia.com/tensorrt) to key modules. OF2 comes with built-in TensorRT lazy compilation support for EvoformerStack. To enable, pass '--trt_mode-run', '--trt_engine_dir', '--trt_max_sequence_len', '--trt_num_profiles' and '--trt_optimization_level' arguments to run_pretrained_openfold.py. 
+
 #### Large-scale batch inference 
 For large-scale batch inference, we offer an optional tracing mode, which massively improves runtimes at the cost of a lengthy model compilation process. To enable it, add `--trace_model` to the inference command.

--- a/tests/compare_utils.py
+++ b/tests/compare_utils.py
@@ -27,6 +27,9 @@ def skip_unless_ds4s_installed():
        "deepspeed.ops.deepspeed4science") is not None
    return unittest.skipUnless(ds4s_is_installed, "Requires DeepSpeed with version ≥ 0.10.4")

+def skip_unless_cueq_installed():
+    cueq_is_installed = importlib.util.find_spec("cuequivariance_torch") is not None
+    return unittest.skipUnless(cueq_is_installed, "Requires cuEquivariance")

 def skip_unless_flash_attn_installed():
    fa_is_installed = importlib.util.find_spec("flash_attn") is not None
--- a/tests/test_cuequivariance.py
+++ b/tests/test_cuequivariance.py
@@ -36,7 +36,7 @@ import tests.compare_utils as compare_utils
 from tests.data_utils import random_template_feats, random_attention_inputs


-
+@compare_utils.skip_unless_cueq_installed()
 class TestCuEquivarianceKernel(unittest.TestCase):

    def test_compare_template_stack(self):
@@ -133,6 +133,7 @@ class TestCuEquivarianceKernel(unittest.TestCase):
        # https://github.com/aqlaboratory/openfold/issues/532
        with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.float32):
                model = compare_utils.get_global_pretrained_openfold()
+                model.globals.use_deepspeed_evo_attention = False
                model.globals.use_cuequivariance_attention = False
                model.globals.use_cuequivariance_multiplicative_update = False
                out_repro = model(batch)