mirror of
https://github.com/aqlaboratory/openfold.git
synced 2026-06-04 12:44:26 +08:00
Addressing code review comments: test skip, docs
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@@ -143,14 +143,38 @@ Some commonly used command line flags are here. A full list of flags can be view
### Advanced Options for Increasing Efficiency

#### Speeding up inference

#### Turning on TF32 (TensorFloat-32) precision on compatible hardware

On NVIDIA GPUs from the Ampere generation onward, you can enable TF32 precision for a performance boost of about 1.3x.
TF32 uses 1 sign bit, 8 exponent bits (like FP32), and 10 mantissa (significand) bits (like FP16), stored in a 32-bit word.
OF2 has been found generally safe to run with TF32 instead of full FP32. To enable it globally in PyTorch:

```
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # Enable TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True  # Enable TF32 for convolutions
```
Make sure the `NVIDIA_TF32_OVERRIDE` environment variable is either unset or set to 1.

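The two conditions above (a compatible GPU and a permissive `NVIDIA_TF32_OVERRIDE`) can be checked before flipping the switches. The following is a minimal sketch, not part of OF2: the helper names `tf32_allowed_by_env` and `enable_tf32_if_possible` are hypothetical, and the check assumes that Ampere corresponds to CUDA compute capability 8.0 or higher.

```python
import os


def tf32_allowed_by_env(environ=None):
    """Return True if NVIDIA_TF32_OVERRIDE is unset or set to "1"."""
    env = os.environ if environ is None else environ
    value = env.get("NVIDIA_TF32_OVERRIDE")
    return value is None or value.strip() == "1"


def enable_tf32_if_possible(environ=None):
    """Hypothetical helper: enable TF32 only when the environment and GPU allow it.

    Returns True if the TF32 flags were set, False otherwise.
    """
    if not tf32_allowed_by_env(environ):
        return False
    try:
        import torch  # imported lazily so the env check works without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    if major < 8:  # assumption: Ampere and newer report compute capability >= 8.0
        return False
    torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matrix multiplications
    torch.backends.cudnn.allow_tf32 = True  # TF32 for convolutions
    return True
```

Calling `enable_tf32_if_possible()` once at startup, before the model runs, would be one way to use it. The function returns `False` rather than raising, so the caller can log that the run is staying in full FP32.
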
#### Applying lower BF16 precision to EvoformerStack and ExtraMSAStack

BF16 occupies 16 bits: 1 sign bit, 8 exponent bits (same as FP32), and 7 mantissa (fraction) bits. Its dynamic range matches that of FP32, but BF16 can represent only about two to three significant decimal digits.
Casting EvoformerStack and ExtraMSAStack to BF16 has been found generally safe. Doing so yields a ~1.5x speedup compared to TF32 inference of the whole model.
To apply BF16, use the `--precision=bf16` argument. `--precision=fp16` is also supported, but is not recommended due to numerical instability.

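To make the bit layouts described above concrete, the sketch below emulates the resolution of BF16 (7 mantissa bits) and TF32 (10 mantissa bits) by zeroing the low mantissa bits of an FP32 value. The function names are hypothetical, and the code truncates toward zero, whereas real hardware typically rounds to nearest, so it illustrates the spacing of representable values rather than exact hardware results.

```python
import struct


def truncate_fp32_mantissa(x, kept_bits):
    """Keep the top `kept_bits` of FP32's 23 mantissa bits and zero the rest.

    This rounds toward zero; it is an illustration of format resolution only.
    """
    if not 0 <= kept_bits <= 23:
        raise ValueError("kept_bits must be between 0 and 23")
    # Reinterpret the FP32 value as a 32-bit unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Mask that preserves sign, exponent, and the leading mantissa bits.
    mask = (0xFFFFFFFF << (23 - kept_bits)) & 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y


def as_bf16(x):
    """Emulate BF16 resolution: 8 exponent bits, 7 mantissa bits."""
    return truncate_fp32_mantissa(x, 7)


def as_tf32(x):
    """Emulate TF32 resolution: 8 exponent bits, 10 mantissa bits."""
    return truncate_fp32_mantissa(x, 10)
```

Near 1.0, the smallest step BF16 can resolve is 2^-7 and the smallest step TF32 can resolve is 2^-10, so `as_bf16(1.0 + 2**-8)` collapses to `1.0` while `as_tf32(1.0 + 2**-8)` is preserved. Both keep the full FP32 exponent, which is why large magnitudes such as `1e38` remain finite.
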
#### Speeding up inference with custom attention and multiplicative update kernels

The **DeepSpeed DS4Sci_EvoformerAttention kernel** is a memory-efficient attention kernel developed as part of a collaboration between OpenFold and the DeepSpeed4Science initiative.

If your system supports DeepSpeed, using it generally leads to an inference speedup of 2-3x without significant additional memory use. You may enable this option with the `--use_deepspeed_inference` argument.

OF2 supports the cuEquivariance [triangle_multiplicative_update](https://docs.nvidia.com/cuda/cuequivariance/api/generated/cuequivariance_torch.triangle_multiplicative_update.html) and [triangle_attention](https://docs.nvidia.com/cuda/cuequivariance/api/generated/cuequivariance_torch.triangle_attention.html) kernels, which can speed up inference and training by 1.2x to 1.5x on top of DeepSpeed, and by even more for sequences with > 1000 residues. To enable them, pass the `--use_cuequivariance_attention` and `--use_cuequivariance_multiplicative_update` arguments to `run_pretrained_openfold.py`.
cuEquivariance falls back to DeepSpeed on shapes it does not support efficiently, so enable both for best effect.

If DeepSpeed is unavailable on your system, you may also try [FlashAttention](https://github.com/HazyResearch/flash-attention) by adding `globals.use_flash = True` to the `--experiment_config_json`. Note that FlashAttention appears to work best for sequences with < 1000 residues.

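Since each kernel depends on an optional package, one way to pick the flags is to probe for the packages first. The sketch below is an illustration rather than OF2 code: `module_available` and `kernel_flags` are hypothetical helpers, the module names are the ones probed by the test utilities in this commit, and the flag names are copied from the text above.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package of a dotted name is missing.
        return False


def kernel_flags(is_available=module_available):
    """Hypothetical helper: build the kernel-related flags for this environment."""
    flags = []
    if is_available("deepspeed.ops.deepspeed4science"):
        flags.append("--use_deepspeed_inference")
    if is_available("cuequivariance_torch"):
        # Enable both cuEquivariance kernels; they fall back to DeepSpeed
        # on unsupported shapes when DeepSpeed is enabled as well.
        flags.append("--use_cuequivariance_attention")
        flags.append("--use_cuequivariance_multiplicative_update")
    return flags
```

The resulting list could be appended to the argument list of a `run_pretrained_openfold.py` invocation. FlashAttention is left out here because the text enables it through the experiment config rather than a command line flag.
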
#### Speeding up inference with TensorRT
Alternatively (or together with cuEquivariance), you can try applying [TensorRT](https://developer.nvidia.com/tensorrt) to key modules. OF2 comes with built-in lazy TensorRT compilation support for EvoformerStack. To enable it, pass the `--trt_mode-run`, `--trt_engine_dir`, `--trt_max_sequence_len`, `--trt_num_profiles` and `--trt_optimization_level` arguments to `run_pretrained_openfold.py`.

#### Large-scale batch inference
For large-scale batch inference, we offer an optional tracing mode, which massively improves runtimes at the cost of a lengthy model compilation process. To enable it, add `--trace_model` to the inference command.

@@ -27,6 +27,9 @@ def skip_unless_ds4s_installed():
        "deepspeed.ops.deepspeed4science") is not None
    return unittest.skipUnless(ds4s_is_installed, "Requires DeepSpeed with version ≥ 0.10.4")

def skip_unless_cueq_installed():
    cueq_is_installed = importlib.util.find_spec("cuequivariance_torch") is not None
    return unittest.skipUnless(cueq_is_installed, "Requires cuEquivariance")

def skip_unless_flash_attn_installed():
    fa_is_installed = importlib.util.find_spec("flash_attn") is not None

@@ -36,7 +36,7 @@ import tests.compare_utils as compare_utils
from tests.data_utils import random_template_feats, random_attention_inputs


@compare_utils.skip_unless_cueq_installed()
class TestCuEquivarianceKernel(unittest.TestCase):

    def test_compare_template_stack(self):
@@ -133,6 +133,7 @@ class TestCuEquivarianceKernel(unittest.TestCase):
        # https://github.com/aqlaboratory/openfold/issues/532
        with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.float32):
            model = compare_utils.get_global_pretrained_openfold()
            model.globals.use_deepspeed_evo_attention = False
            model.globals.use_cuequivariance_attention = False
            model.globals.use_cuequivariance_multiplicative_update = False
            out_repro = model(batch)

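The `skip_unless_cueq_installed` helper added above follows a pattern that can be tried in isolation. The sketch below is a standalone illustration, not code from this commit: `skip_unless_module_installed`, `TestOptionalKernel`, `run_and_count_skips`, and the package name `hypothetical_missing_kernel_pkg` are all made up for the example. It shows that a class decorated this way is reported as skipped rather than failed when the package is absent.

```python
import importlib.util
import unittest


def skip_unless_module_installed(module_name, reason):
    """Hypothetical generalisation of the skip_unless_*_installed helpers."""
    try:
        installed = importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        installed = False
    return unittest.skipUnless(installed, reason)


@skip_unless_module_installed(
    "hypothetical_missing_kernel_pkg", "Requires a package that is not installed"
)
class TestOptionalKernel(unittest.TestCase):
    def test_never_runs(self):
        # Would fail if executed; the class-level skip prevents that.
        self.fail("skipped before this executes")


def run_and_count_skips(test_case):
    """Run a TestCase class and return (number skipped, number failed)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.skipped), len(result.failures)
```

Applying the decorator at class level, as the commit does for `TestCuEquivarianceKernel`, skips every test method in the class with a single check.
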