sync from internal

Zeming Lin
2026-05-29 20:06:20 +00:00
parent af8ef5cead
commit d72119a3f6
16 changed files with 120 additions and 72 deletions

@@ -9,7 +9,7 @@
We are releasing a world model for protein biology: a scientific engine for prediction, design, and discovery. Built on the latest generation of Evolutionary Scale Modeling (ESM), this system learns from the protein sequences produced by evolution and uses that knowledge to represent, map, predict, and design proteins across scales — from atomic interactions to evolutionary relationships spanning billions of years. The system includes three artifacts: ESMC, ESMFold2, and ESM Atlas.
**[ESMC](https://biohub.ai/esm/protein)** is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC defines a new scaling frontier relative to ESM2, achieving stronger performance in emergent long-range structural understanding as model scale increases
**[ESMC](https://biohub.ai/esm/protein)** is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC defines a new scaling frontier relative to ESM2, achieving stronger performance in emergent long-range structural understanding as model scale increases.
<div align="center">
@@ -21,7 +21,7 @@ We are releasing a world model for protein biology: a scientific engine for pred
**[ESMFold2](https://huggingface.co/Biohub/ESMFold2)**, built on the ESMC 6B model, is a state-of-the-art structure prediction model that has been validated for the design of protein-protein interactions. ESMFold2 surpasses other models in DockQ pass-rate on Foldbench protein-protein and antibody-antigen complexes, and can be used in single-sequence mode for an order of magnitude speedup in folding.
<div align="center">
<img src="_assets/esmfold2_folding.png" width="40%"/>
<img src="_assets/esmfold2_folding.png" width="60%"/>
</div>
@@ -50,7 +50,7 @@ For information on using ESM3, see the [ESM3 README](https://github.com/Biohub/e
[ESMC](https://biohub.ai/esm/protein) is a state-of-the-art protein language model that has learned representations of protein biology from training on billions of protein sequences.
Codebase, model weights, and model variants for ESMC are available through [Hugging Face](https://huggingface.co/collections/Biohub/esmc-model-family).
Codebase, model weights, and model variants for ESMC are available through [Hugging Face](https://huggingface.co/collections/biohub/esmc-model-family).
There are two primary ways of running the ESM models: through the [**Biohub Platform**](https://biohub.ai/) or locally with Hugging Face. The Biohub Platform enables users to easily run inference with ESM models with minimal setup. Users interested in customizing or fine-tuning ESM models can use the models from Hugging Face.
@@ -60,7 +60,7 @@ There are two primary ways of running the ESM models: through the [**Biohub Plat
Install `esm` from GitHub (a PyPI release is coming soon):
```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```
The following code demonstrates how to run ESMC locally
@@ -103,7 +103,7 @@ Note that our API migrated from forge.evolutionaryscale.ai to [biohub.ai](https:
To get started with ESM, install the python library using `pip`:
```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```
Then import the necessary libraries and instantiate your desired model.
@@ -180,7 +180,7 @@ For tutorials on how to use ESMC SAEs, see our [tutorials](https://github.com/Bi
The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for enhanced accuracy on challenging targets. ESMFold2 achieves state-of-the-art performance matching or exceeding AlphaFold3 across diverse evaluation datasets, while offering improved computational efficiency through optimized diffusion sampling and architectural innovations.
Codebase, model weights, and model variants for ESMFold2 are available through [Hugging Face](https://huggingface.co/collections/biohub/esmfold2-model-family)
Codebase, model weights, and model variants for ESMFold2 are available through [Hugging Face](https://huggingface.co/Biohub/ESMFold2)
### Running ESMFold2 Locally
@@ -237,7 +237,7 @@ with open("1mht_pred.cif", "w") as f:
Install the `esm` Python package
```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```
Import the necessary libraries.

@@ -34,7 +34,7 @@ The code for ESM3 is available from Github and weights for esm3-sm-open-v1 is av
First install the python library using `pip`:
```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```
Then import the necessary libraries and instantiate your model. Use your token from the [Biohub platform](https://biohub.ai")
@@ -54,7 +54,7 @@ The following code demonstrates how to run ESM3 locally and generate a simple se
First install the python library using `pip`:
```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```
Then import the necessary libraries for your model.

@@ -32,7 +32,7 @@
"outputs": [],
"source": [
"%set_env TOKENIZERS_PARALLELISM=false\n",
"!pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"!pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"import numpy as np\n",
"import torch\n",
"\n",

File diff suppressed because one or more lines are too long

@@ -31,7 +31,7 @@
"source": [
"%set_env TOKENIZERS_PARALLELISM=false\n",
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3Dmol\n",
"\n",
"import numpy as np\n",

@@ -38,7 +38,7 @@
"outputs": [],
"source": [
"# # If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3dmol"
]
},

@@ -73,7 +73,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d"
"# !pip install esm@git+https://github.com/Biohub/esm.git@main"
]
},
{

@@ -77,7 +77,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3dmol"
]
},

@@ -45,26 +45,24 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "cell-3",
"metadata": {
"id": "cell-3"
},
"outputs": [],
"source": [
"# Install dependencies\n",
"# If you are working in colab, uncomment this line to install dependencies\n",
"# If you are working in colab, uncomment these lines to install dependencies\n",
"\n",
"#!pip install -q py3Dmol matplotlib requests numpy"
"!pip install -q \"esm @ git+https://github.com/Biohub/esm.git@main\"\n",
"!pip install -q py3Dmol matplotlib requests numpy "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-4",
"metadata": {
"id": "cell-4"
},
"execution_count": 3,
"id": "4eb60dfb",
"metadata": {},
"outputs": [],
"source": [
"from functools import lru_cache\n",
@@ -237,7 +235,7 @@
"source": [
"Now we'll extract SAE features using the Biohub API. We use:\n",
"- **Model**: `esmc-6b-2024-12` (6 billion parameter ESMC model)\n",
"- **SAE**: `esmc-6b-2024-12-sae-sweep-layer60-k64-codebook16384` (k=64 means top 64 features per position)\n",
"- **SAE**: `esmc-6b-2024-12-sae-layer60-k64-codebook16384` (k=64 means top 64 features per position)\n",
"- **Normalization**: `normalize_features=True` applies TF-IDF normalization to down-weight common features"
]
},
@@ -255,7 +253,7 @@
"protein_tensor = model.encode(protein)\n",
"\n",
"# Get SAE features\n",
"sae_model_name = \"esmc-6b-2024-12-sae-sweep-layer60-k64-codebook16384\"\n",
"sae_model_name = \"esmc-6b-2024-12-sae-layer60-k64-codebook16384\"\n",
"output = model.logits(\n",
" protein_tensor,\n",
" config=LogitsConfig(\n",

@@ -57,7 +57,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3dmol"
]
},

@@ -52,7 +52,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# ! pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# ! pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# ! pip install py3Dmol\n",
"# ! pip install matplotlib\n",
"# ! pip install dna-features-viewer"

@@ -41,7 +41,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3Dmol\n",
"\n",
"from IPython.display import clear_output\n",

@@ -100,6 +100,8 @@ class ChainInfo:
sym_id: int
mol_type: int
tokens: list[TokenInfo] = field(default_factory=list)
# (atom_name1, atom_name2) bonds for SMILES ligands, which have no CCD entry.
ligand_bonds: list[tuple[str, str]] = field(default_factory=list)
# =============================================================================
@@ -566,8 +568,11 @@ def tokenize_ligand_smiles(
atom_offset: int,
space_uid_offset: int,
seed: int | None = None,
) -> tuple[list[TokenInfo], list[AtomInfo]]:
"""Tokenize a ligand from SMILES (1 token per heavy atom)."""
) -> tuple[list[TokenInfo], list[AtomInfo], list[tuple[str, str]]]:
"""Tokenize a ligand from SMILES (1 token per heavy atom).
Returns tokens, atoms, and heavy-atom bonds as (name1, name2) pairs.
"""
from rdkit import Chem
from rdkit.Chem import AllChem
@@ -651,7 +656,13 @@ def tokenize_ligand_smiles(
token_idx += 1
atom_idx += 1
return tokens, atoms_list
bonds: list[tuple[str, str]] = []
for bond in mol_no_h.GetBonds():
n1 = bond.GetBeginAtom().GetProp("name")
n2 = bond.GetEndAtom().GetProp("name")
bonds.append((n1, n2))
return tokens, atoms_list, bonds
# =============================================================================
@@ -753,6 +764,7 @@ def build_chains_from_input(
elif isinstance(item, LigandInput):
has_cov = chain_id_str in covalent_chain_ids
ligand_bonds: list[tuple[str, str]] = []
if item.ccd is not None:
if item.smiles is not None:
warnings.warn("Both ccd and smiles provided, using ccd")
@@ -767,7 +779,7 @@ def build_chains_from_input(
has_covalent_bond=has_cov,
)
elif item.smiles is not None:
new_tokens, new_atoms = tokenize_ligand_smiles(
new_tokens, new_atoms, ligand_bonds = tokenize_ligand_smiles(
smiles=item.smiles,
entity_id=entity_id,
asym_id=asym_id,
@@ -789,6 +801,7 @@ def build_chains_from_input(
sym_id=sym_id,
mol_type=new_tokens[0].mol_type if new_tokens else MOL_TYPE_PROTEIN,
tokens=new_tokens,
ligand_bonds=ligand_bonds if isinstance(item, LigandInput) else [],
)
chains.append(chain)
all_tokens.extend(new_tokens)
@@ -990,16 +1003,24 @@ def compute_token_bonds(
(atom.name, atom.token_index)
)
# SMILES ligand bonds keyed by (asym_id, residue_index 0).
explicit_bonds: dict[tuple[int, int], list[tuple[str, str]]] = {
(c.asym_id, 0): c.ligand_bonds for c in chains if c.ligand_bonds
}
# Add intra-residue bonds from CCD
for (asym_id_val, res_idx), atom_list in residue_tokens.items():
if not atom_list:
continue
res_name = tokens[atom_list[0][1]].residue_name
ccd_bonds = get_ligand_ccd_bonds(res_name)
atom_to_tok = {name: ti for name, ti in atom_list}
if ccd_bonds:
for a1, a2 in ccd_bonds:
bonds = explicit_bonds.get((asym_id_val, res_idx))
if bonds is None:
bonds = get_ligand_ccd_bonds(res_name)
if bonds:
for a1, a2 in bonds:
if a1 in atom_to_tok and a2 in atom_to_tok:
add_bond(atom_to_tok[a1], atom_to_tok[a2])
else:

@@ -0,0 +1,31 @@
"""Tests for ESMFold2 input preparation (prepare_input)."""
import pytest
from rdkit import Chem
from esm.models.esmfold2.prepare_input import (
build_chains_from_input,
compute_token_bonds,
)
from esm.models.esmfold2.types import LigandInput, StructurePredictionInput
@pytest.mark.parametrize(
"smiles",
[
"c1ccccc1", # benzene: 6 atoms, 6 bonds
# The drug-like ligand from the SMILES-vs-CCD issue.
"COC1=CC=C(N2C3=C(C(C(N)=O)=N2)CCN(C4=CC=C(N5CCCCC5=O)C=C4)C3=O)C=C1",
],
)
def test_smiles_ligand_bonds_match_molecular_graph(smiles: str):
"""SMILES ligand bonds must match the molecular graph, not a clique (#313)."""
spi = StructurePredictionInput(sequences=[LigandInput(id="B", smiles=smiles)])
chains, tokens, atoms = build_chains_from_input(spi, seed=0)
token_bonds = compute_token_bonds(tokens, atoms, spi, chains)
mol = Chem.MolFromSmiles(smiles)
assert len(tokens) == mol.GetNumAtoms()
n_edges = int(token_bonds.sum().item()) // 2 # symmetric matrix
assert n_edges == mol.GetNumBonds()
assert n_edges < len(tokens) * (len(tokens) - 1) // 2 # not a clique

@@ -146,7 +146,7 @@ environments:
- conda: https://conda.anaconda.org/conda-forge/noarch/wheel-0.45.1-pyhd8ed1ab_1.conda
- conda: https://conda.anaconda.org/conda-forge/noarch/zipp-3.21.0-pyhd8ed1ab_1.conda
- pypi: .
- pypi: git+https://github.com/Biohub/transformers.git?rev=3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: git+https://github.com/Biohub/transformers.git?rev=main#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: https://files.pythonhosted.org/packages/00/78/9cbcc1c073b9d4918e925af1a059762265dc65004e020511b2a06fbfd020/pydssp-0.9.1-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/00/c0/8f5d070730d7836adc9c9b6408dec68c6ced86b304a9b26a14df072a6e8c/traitlets-5.14.3-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/02/e9/367e81e114deb92a6e0d5740f0bff4548af710be318af65265b9aad72237/botocore-1.40.9-py3-none-any.whl
@@ -346,7 +346,7 @@ environments:
- conda: https://conda.anaconda.org/conda-forge/osx-arm64/zstandard-0.23.0-py312hea69d52_2.conda
- conda: https://conda.anaconda.org/conda-forge/osx-arm64/zstd-1.5.7-h6491c7d_2.conda
- pypi: .
- pypi: git+https://github.com/Biohub/transformers.git?rev=3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: git+https://github.com/Biohub/transformers.git?rev=main#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: https://files.pythonhosted.org/packages/00/78/9cbcc1c073b9d4918e925af1a059762265dc65004e020511b2a06fbfd020/pydssp-0.9.1-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/00/c0/8f5d070730d7836adc9c9b6408dec68c6ced86b304a9b26a14df072a6e8c/traitlets-5.14.3-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/02/e9/367e81e114deb92a6e0d5740f0bff4548af710be318af65265b9aad72237/botocore-1.40.9-py3-none-any.whl
@@ -636,7 +636,7 @@ environments:
- conda: https://conda.anaconda.org/conda-forge/noarch/wheel-0.45.1-pyhd8ed1ab_1.conda
- conda: https://conda.anaconda.org/conda-forge/noarch/zipp-3.21.0-pyhd8ed1ab_1.conda
- pypi: .
- pypi: git+https://github.com/Biohub/transformers.git?rev=3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: git+https://github.com/Biohub/transformers.git?rev=main#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: https://files.pythonhosted.org/packages/00/78/9cbcc1c073b9d4918e925af1a059762265dc65004e020511b2a06fbfd020/pydssp-0.9.1-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/00/c0/8f5d070730d7836adc9c9b6408dec68c6ced86b304a9b26a14df072a6e8c/traitlets-5.14.3-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/02/e9/367e81e114deb92a6e0d5740f0bff4548af710be318af65265b9aad72237/botocore-1.40.9-py3-none-any.whl
@@ -863,7 +863,7 @@ environments:
- conda: https://conda.anaconda.org/conda-forge/osx-arm64/zstandard-0.23.0-py312hea69d52_2.conda
- conda: https://conda.anaconda.org/conda-forge/osx-arm64/zstd-1.5.7-h6491c7d_2.conda
- pypi: .
- pypi: git+https://github.com/Biohub/transformers.git?rev=3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: git+https://github.com/Biohub/transformers.git?rev=main#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: https://files.pythonhosted.org/packages/00/78/9cbcc1c073b9d4918e925af1a059762265dc65004e020511b2a06fbfd020/pydssp-0.9.1-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/00/c0/8f5d070730d7836adc9c9b6408dec68c6ced86b304a9b26a14df072a6e8c/traitlets-5.14.3-py3-none-any.whl
- pypi: https://files.pythonhosted.org/packages/02/e9/367e81e114deb92a6e0d5740f0bff4548af710be318af65265b9aad72237/botocore-1.40.9-py3-none-any.whl
@@ -5297,7 +5297,7 @@ packages:
name: esm
requires_dist:
- torch>=2.2.0
- transformers @ git+https://github.com/Biohub/transformers.git@3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- transformers @ git+https://github.com/Biohub/transformers.git@main
- ipython
- einops
- biotite>=1.0.0
@@ -5320,7 +5320,7 @@ packages:
- dna-features-viewer
- accelerate
requires_python: '>=3.12,<3.13'
- pypi: git+https://github.com/Biohub/transformers.git?rev=3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
- pypi: git+https://github.com/Biohub/transformers.git?rev=main#3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf
name: transformers
version: 4.57.6
requires_dist:

@@ -22,7 +22,7 @@ classifiers = [
dependencies = [
"torch>=2.2.0",
"transformers @ git+https://github.com/Biohub/transformers.git@3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf",
"transformers @ git+https://github.com/Biohub/transformers.git@main",
"ipython",
"einops",
"biotite>=1.0.0",