mirror of
https://github.com/evolutionaryscale/esm.git
synced 2026-06-04 09:04:23 +08:00
835 lines
28 KiB
Plaintext
835 lines
28 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# [Tutorial 1](https://github.com/evolutionaryscale/esm/tree/main/cookbook/tutorials): Input tracks of `ESMProtein`\n",
|
|
"\n",
|
|
"ESM3 is a frontier generative model for biology, able to jointly reason across three fundamental biological properties of proteins: sequence, structure, and function. These three data modalities are represented as tracks of discrete tokens at the input and output of ESM3. You can present the model with a combination of partial inputs across the tracks, and ESM3 will provide output predictions for all the tracks."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this notebook, we will familiarize ourselves with the `ESMProtein` class, which holds multiple properties of a protein representing sequence, structure, and function. The ESM3 models use these properties from the input (prompts) and generate them as part of the output."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"An `ESMProtein` has 5 attributes that represent input (promptable) tracks:\n",
|
|
"\n",
|
|
"* `sequence`: amino acid sequence\n",
|
|
"* `coordinates`: 3D coordinates of atoms in each amino acid of the protein\n",
|
|
"* `secondary_structure`: [8-class secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure#DSSP_classification) (SS8)\n",
|
|
"* `sasa`: [solvent-accessible surface area](https://en.wikipedia.org/wiki/Accessible_surface_area) (SASA)\n",
|
|
"* `function_annotations`: function annotations derived from [InterPro](https://www.ebi.ac.uk/interpro/)\n",
|
|
"\n",
|
|
"You can prompt an ESM3 model by setting any subset of these tracks to be partially unmasked when calling the model with an `ESMProtein` instance."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"One way to create an `ESMProtein` object is from a pdb id and chain id from [RCSB](https://www.rcsb.org). Below, we first create a `ProteinChain` with the pdb id and chain id and then create an `ESMProtein` from it. This will populate the `sequence` and `coordinates` properties."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Imports"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Install esm and other dependencies\n",
|
|
"! pip install esm\n",
|
|
"! pip install py3Dmol\n",
|
|
"! pip install matplotlib\n",
|
|
"! pip install dna-features-viewer"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Create an ESMProtein object"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from biotite.database import rcsb\n",
|
|
"\n",
|
|
"from esm.sdk.api import ESMProtein\n",
|
|
"from esm.utils.structure.protein_chain import ProteinChain\n",
|
|
"from esm.utils.types import FunctionAnnotation\n",
|
|
"\n",
|
|
"pdb_id = \"1cm4\"\n",
|
|
"chain_id = \"A\"\n",
|
|
"\n",
|
|
"# Create a protein using a pdb format file from RCSB\n",
|
|
"# Note: instead of the next two lines, we could use\n",
|
|
"# protein_chain = ProteinChain.from_rcsb(pdb_id, chain_id)\n",
|
|
"# but in future implementations, this function may use the mmcif file\n",
|
|
"# which would throw off some indices later on in this notebook\n",
|
|
"str_io = rcsb.fetch(pdb_id, \"pdb\")\n",
|
|
"protein_chain = ProteinChain.from_pdb(str_io, chain_id=chain_id, id=pdb_id)\n",
|
|
"protein = ESMProtein.from_protein_chain(protein_chain)\n",
|
|
"\n",
|
|
"## We can also load from a local pdb file by passing its path\n",
|
|
"# protein_chain = ProteinChain.from_pdb('xxxx.pdb', chain_id=chain_id, id=pdb_id)\n",
|
|
"# The chain_id and id arguments are optional and will be inferred if None"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## `sequence`\n",
|
|
"The `sequence` track contains a sequence of 1-letter representation of the amino acids in the protein:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(protein.sequence)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## `coordinates`\n",
|
|
"\n",
|
|
"\n",
|
|
"`coordinates` contains the 3D coordinates of atoms in the protein. It contains a tensor of shape `(n_residues, 37, 3)`, where \n",
|
|
"\n",
|
|
"* `n_residues` is the number of amino acids in the protein.\n",
|
|
"* `37` is the maximum possible number of atoms in an amino acid, represented in the atom37 representation. If certain atoms are not present in the structure, they will show up as `nan`.\n",
|
|
"* `3` is for 3D (x,y,z) coordinates. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(protein.coordinates.shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(protein.coordinates)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We define two functions below that visualize the `coordinates` attribute: we define two functions below (as before, there is no need to go through them)\n",
|
|
"* `visualize_3D_coordinates()` visualizes directly from the coordinates tensor by creating a pdb file with all alanines\n",
|
|
"* `visualize_3D_protein()` visualizes from the `ESMProtein` instance, which has the correct amino acids"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Functions for visualizing 3D structure\n",
|
|
"\n",
|
|
"import py3Dmol\n",
|
|
"\n",
|
|
"\n",
|
|
"def visualize_pdb(pdb_string):\n",
|
|
" view = py3Dmol.view(width=400, height=400)\n",
|
|
" view.addModel(pdb_string, \"pdb\")\n",
|
|
" view.setStyle({\"cartoon\": {\"color\": \"spectrum\"}})\n",
|
|
" view.zoomTo()\n",
|
|
" view.render()\n",
|
|
" view.center()\n",
|
|
" return view\n",
|
|
"\n",
|
|
"\n",
|
|
"def visualize_3D_coordinates(coordinates):\n",
|
|
" \"\"\"\n",
|
|
" This uses all Alanines\n",
|
|
" \"\"\"\n",
|
|
" protein_with_same_coords = ESMProtein(coordinates=coordinates)\n",
|
|
" # pdb with all alanines\n",
|
|
" pdb_string = protein_with_same_coords.to_pdb_string()\n",
|
|
" return visualize_pdb(pdb_string)\n",
|
|
"\n",
|
|
"\n",
|
|
"def visualize_3D_protein(protein):\n",
|
|
" pdb_string = protein.to_pdb_string()\n",
|
|
" return visualize_pdb(pdb_string)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# visualize from just the coordinates\n",
|
|
"visualize_3D_coordinates(protein.coordinates)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# visualize from sequence and coordinates\n",
|
|
"visualize_3D_protein(protein)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### `secondary_structure`\n",
|
|
"\n",
|
|
"The `secondary_structure` property contains a representation of the secondary structure. At a high level of categorization, we can classify each amino acid as belonging into three classes: alpha helices, beta sheets, and coil, which we could see in the previous 3D visualization.\n",
|
|
"\n",
|
|
"`ESMProtein` uses a [8-class secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure#DSSP_classification) that can be computed with [dssp](https://swift.cmbi.umcn.nl/gv/dssp/) given 3D atom coordinates. Since installing dssp is a separate process from installing the `esm` package, in this notebook, we show how to compute the coarser 3-class classification using biotite's [annotate_sse](https://www.biotite-python.org/apidoc/biotite.structure.annotate_sse.html). We can set the `secondary_structure` property with this 3-class classification."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from biotite.structure import annotate_sse\n",
|
|
"\n",
|
|
"\n",
|
|
"def get_approximate_ss(protein_chain: ProteinChain):\n",
|
|
" # get biotite's ss3 representation\n",
|
|
" ss3_arr = annotate_sse(protein_chain.atom_array)\n",
|
|
" biotite_ss3_str = \"\".join(ss3_arr)\n",
|
|
"\n",
|
|
" # translate into ESM3's representation\n",
|
|
" translation_table = str.maketrans(\n",
|
|
" {\n",
|
|
" \"a\": \"H\", # alpha helix\n",
|
|
" \"b\": \"E\", # beta sheet\n",
|
|
" \"c\": \"C\", # coil\n",
|
|
" }\n",
|
|
" )\n",
|
|
" esm_ss3 = biotite_ss3_str.translate(translation_table)\n",
|
|
" return esm_ss3"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"protein.secondary_structure = get_approximate_ss(protein_chain)\n",
|
|
"print(protein.secondary_structure)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The next cell defines a function that visualizes the secondary structure and there is no need to read them!"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Slightly modified version of secondary structure plotting code from\n",
|
|
"# https://www.biotite-python.org/examples/gallery/structure/transketolase_sse.html\n",
|
|
"# Code source: Patrick Kunzmann\n",
|
|
"# License: BSD 3 clause\n",
|
|
"\n",
|
|
"import biotite\n",
|
|
"import biotite.sequence as seq\n",
|
|
"import biotite.sequence.graphics as graphics\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import numpy as np\n",
|
|
"from matplotlib.patches import Rectangle\n",
|
|
"\n",
|
|
"\n",
|
|
"# Create 'FeaturePlotter' subclasses\n",
|
|
"# for drawing the secondary structure features\n",
|
|
"class HelixPlotter(graphics.FeaturePlotter):\n",
|
|
" def __init__(self):\n",
|
|
" pass\n",
|
|
"\n",
|
|
" # Check whether this class is applicable for drawing a feature\n",
|
|
" def matches(self, feature):\n",
|
|
" if feature.key == \"SecStr\":\n",
|
|
" if \"sec_str_type\" in feature.qual:\n",
|
|
" if feature.qual[\"sec_str_type\"] == \"helix\":\n",
|
|
" return True\n",
|
|
" return False\n",
|
|
"\n",
|
|
" # The drawing function itself\n",
|
|
" def draw(self, axes, feature, bbox, loc, style_param):\n",
|
|
" # Approx. 1 turn per 3.6 residues to resemble natural helix\n",
|
|
" n_turns = np.ceil((loc.last - loc.first + 1) / 3.6)\n",
|
|
" x_val = np.linspace(0, n_turns * 2 * np.pi, 100)\n",
|
|
" # Curve ranges from 0.3 to 0.7\n",
|
|
" y_val = (-0.4 * np.sin(x_val) + 1) / 2\n",
|
|
"\n",
|
|
" # Transform values for correct location in feature map\n",
|
|
" x_val *= bbox.width / (n_turns * 2 * np.pi)\n",
|
|
" x_val += bbox.x0\n",
|
|
" y_val *= bbox.height\n",
|
|
" y_val += bbox.y0\n",
|
|
"\n",
|
|
" # Draw white background to overlay the guiding line\n",
|
|
" background = Rectangle(\n",
|
|
" bbox.p0, bbox.width, bbox.height, color=\"white\", linewidth=0\n",
|
|
" )\n",
|
|
" axes.add_patch(background)\n",
|
|
" axes.plot(x_val, y_val, linewidth=2, color=biotite.colors[\"dimgreen\"])\n",
|
|
"\n",
|
|
"\n",
|
|
"class SheetPlotter(graphics.FeaturePlotter):\n",
|
|
" def __init__(self, head_width=0.8, tail_width=0.5):\n",
|
|
" self._head_width = head_width\n",
|
|
" self._tail_width = tail_width\n",
|
|
"\n",
|
|
" def matches(self, feature):\n",
|
|
" if feature.key == \"SecStr\":\n",
|
|
" if \"sec_str_type\" in feature.qual:\n",
|
|
" if feature.qual[\"sec_str_type\"] == \"sheet\":\n",
|
|
" return True\n",
|
|
" return False\n",
|
|
"\n",
|
|
" def draw(self, axes, feature, bbox, loc, style_param):\n",
|
|
" x = bbox.x0\n",
|
|
" y = bbox.y0 + bbox.height / 2\n",
|
|
" dx = bbox.width\n",
|
|
" dy = 0\n",
|
|
"\n",
|
|
" if loc.defect & seq.Location.Defect.MISS_RIGHT:\n",
|
|
" # If the feature extends into the previous or next line\n",
|
|
" # do not draw an arrow head\n",
|
|
" draw_head = False\n",
|
|
" else:\n",
|
|
" draw_head = True\n",
|
|
"\n",
|
|
" axes.add_patch(\n",
|
|
" biotite.AdaptiveFancyArrow(\n",
|
|
" x,\n",
|
|
" y,\n",
|
|
" dx,\n",
|
|
" dy,\n",
|
|
" self._tail_width * bbox.height,\n",
|
|
" self._head_width * bbox.height,\n",
|
|
" # Create head with 90 degrees tip\n",
|
|
" # -> head width/length ratio = 1/2\n",
|
|
" head_ratio=0.5,\n",
|
|
" draw_head=draw_head,\n",
|
|
" color=biotite.colors[\"orange\"],\n",
|
|
" linewidth=0,\n",
|
|
" )\n",
|
|
" )\n",
|
|
"\n",
|
|
"\n",
|
|
"# Converter for the DSSP secondary structure elements\n",
|
|
"# to the classical ones\n",
|
|
"dssp_to_abc = {\n",
|
|
" \"I\": \"c\",\n",
|
|
" \"S\": \"c\",\n",
|
|
" \"H\": \"a\",\n",
|
|
" \"E\": \"b\",\n",
|
|
" \"G\": \"c\",\n",
|
|
" \"B\": \"b\",\n",
|
|
" \"T\": \"c\",\n",
|
|
" \"C\": \"c\",\n",
|
|
"}\n",
|
|
"\n",
|
|
"\n",
|
|
"def visualize_secondary_structure(sse, first_id):\n",
|
|
" \"\"\"\n",
|
|
" Helper function to convert secondary structure array to annotation\n",
|
|
" and visualize it.\n",
|
|
" \"\"\"\n",
|
|
"\n",
|
|
" def _add_sec_str(annotation, first, last, str_type):\n",
|
|
" if str_type == \"a\":\n",
|
|
" str_type = \"helix\"\n",
|
|
" elif str_type == \"b\":\n",
|
|
" str_type = \"sheet\"\n",
|
|
" else:\n",
|
|
" # coil\n",
|
|
" return\n",
|
|
" feature = seq.Feature(\n",
|
|
" \"SecStr\", [seq.Location(first, last)], {\"sec_str_type\": str_type}\n",
|
|
" )\n",
|
|
" annotation.add_feature(feature)\n",
|
|
"\n",
|
|
" # Find the intervals for each secondary structure element\n",
|
|
" # and add to annotation\n",
|
|
" annotation = seq.Annotation()\n",
|
|
" curr_sse = None\n",
|
|
" curr_start = None\n",
|
|
" for i in range(len(sse)):\n",
|
|
" if curr_start is None:\n",
|
|
" curr_start = i\n",
|
|
" curr_sse = sse[i]\n",
|
|
" else:\n",
|
|
" if sse[i] != sse[i - 1]:\n",
|
|
" _add_sec_str(\n",
|
|
" annotation, curr_start + first_id, i - 1 + first_id, curr_sse\n",
|
|
" )\n",
|
|
" curr_start = i\n",
|
|
" curr_sse = sse[i]\n",
|
|
" # Add last secondary structure element to annotation\n",
|
|
" _add_sec_str(annotation, curr_start + first_id, i + first_id, curr_sse)\n",
|
|
"\n",
|
|
" fig = plt.figure(figsize=(30.0, 3.0))\n",
|
|
" ax = fig.add_subplot(111)\n",
|
|
" graphics.plot_feature_map(\n",
|
|
" ax,\n",
|
|
" annotation,\n",
|
|
" symbols_per_line=150,\n",
|
|
" loc_range=(first_id, first_id + len(sse)),\n",
|
|
" feature_plotters=[HelixPlotter(), SheetPlotter()],\n",
|
|
" )\n",
|
|
" fig.tight_layout()\n",
|
|
" return fig, ax\n",
|
|
"\n",
|
|
"\n",
|
|
"def plot_ss8(ss8_string):\n",
|
|
" ss3 = np.array([dssp_to_abc[e] for e in ss8_string], dtype=\"U1\")\n",
|
|
" _, ax = visualize_secondary_structure(ss3, 1)\n",
|
|
" ax.set_xticks([])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Using these functions, we can visualize the secondary structure that we obtained. The alpha helices are represented in green, the beta sheets are represented in orange, and coils are represented by gray lines. \n",
|
|
"\n",
|
|
"Note: because the secondary structure assignment algorithm is not the same one as the one used by 3D visualization, this differs a bit from the cartoon representations in the 3D assignment."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"plot_ss8(protein.secondary_structure)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## `function_annotations`\n",
|
|
"\n",
|
|
"An `ESMProtein` also contains function annotations derived from [InterPro](https://www.ebi.ac.uk/interpro/). Annotations directly from InterPro contain information about the following [entry types](https://interpro-documentation.readthedocs.io/en/latest/faq.html#what-are-entry-types):\n",
|
|
"* Family\n",
|
|
"* Domain\n",
|
|
"* Homologous superfamily\n",
|
|
"* Repeat\n",
|
|
"* Site (conserved site, active site, binding site, post-translational modification site)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"interpro_function_annotations = [\n",
|
|
" FunctionAnnotation(label=\"IPR050145\", start=1, end=142), # 1 indexed, inclusive;\n",
|
|
" FunctionAnnotation(label=\"IPR002048\", start=4, end=75),\n",
|
|
" FunctionAnnotation(label=\"IPR002048\", start=77, end=144),\n",
|
|
" FunctionAnnotation(label=\"IPR011992\", start=1, end=143),\n",
|
|
" FunctionAnnotation(label=\"IPR018247\", start=17, end=29),\n",
|
|
" FunctionAnnotation(label=\"IPR018247\", start=53, end=65),\n",
|
|
" FunctionAnnotation(label=\"IPR018247\", start=90, end=102),\n",
|
|
" FunctionAnnotation(label=\"IPR018247\", start=126, end=138),\n",
|
|
"]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can visualize these InterPro annotations with the following function:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Functions for visualizing InterPro function annotations\n",
|
|
"\n",
|
|
"from dna_features_viewer import GraphicFeature, GraphicRecord\n",
|
|
"from matplotlib import colormaps\n",
|
|
"\n",
|
|
"from esm.utils.function.interpro import InterPro, InterProEntryType\n",
|
|
"\n",
|
|
"\n",
|
|
"def visualize_function_annotations(\n",
|
|
" annotations: list[FunctionAnnotation],\n",
|
|
" sequence_length: int,\n",
|
|
" ax: plt.Axes,\n",
|
|
" interpro_=InterPro(),\n",
|
|
"):\n",
|
|
" cmap = colormaps[\"tab10\"]\n",
|
|
" colors = [cmap(i) for i in range(len(InterProEntryType))]\n",
|
|
" type_colors = dict(zip(InterProEntryType, colors))\n",
|
|
"\n",
|
|
" features = []\n",
|
|
" for annotation in annotations:\n",
|
|
" if annotation.label in interpro_.entries:\n",
|
|
" entry = interpro_.entries[annotation.label]\n",
|
|
" label = entry.name\n",
|
|
" entry_type = entry.type\n",
|
|
" else:\n",
|
|
" label = annotation.label\n",
|
|
" entry_type = InterProEntryType.UNKNOWN\n",
|
|
"\n",
|
|
" feature = GraphicFeature(\n",
|
|
" start=annotation.start - 1, # one index -> zero index\n",
|
|
" end=annotation.end,\n",
|
|
" label=label,\n",
|
|
" color=type_colors[entry_type],\n",
|
|
" strand=None,\n",
|
|
" )\n",
|
|
" features.append(feature)\n",
|
|
"\n",
|
|
" record = GraphicRecord(\n",
|
|
" sequence=None, sequence_length=sequence_length, features=features\n",
|
|
" )\n",
|
|
"\n",
|
|
" record.plot(figure_width=12, plot_sequence=False, ax=ax)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We plot the InterPro annotations below, with colors indicating the entry type of the InterPro annotation "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fig, ax = plt.subplots(figsize=(20.0, 4.0))\n",
|
|
"visualize_function_annotations(interpro_function_annotations, len(protein), ax)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"When using our `ESM3` model, we recommend you use keyword annotations, which are keywords in the description of the InterPro entry and associated Gene Ontology terms from [InterPro2GO](https://www.ebi.ac.uk/GOA/InterPro2GO). For instance, for the InterPro entry [IPR011992](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR011992/), the keywords are \"domain pair\", \"hand domain\", \"ef hand\", \"pair\", and \"ef\". For more details regarding how the keywords were computed, please refer to our [preprint](https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1.full.pdf).\n",
|
|
"\n",
|
|
"Practically, we can derive keyword annotations from the InterPro annotations with the function below. Each InterPro annotation corresponds to multiple keyword annotation covering the same range."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from esm.tokenization import InterProQuantizedTokenizer\n",
|
|
"\n",
|
|
"\n",
|
|
"def get_keywords_from_interpro(\n",
|
|
" interpro_annotations,\n",
|
|
" interpro2keywords=InterProQuantizedTokenizer().interpro2keywords,\n",
|
|
"):\n",
|
|
" keyword_annotations_list = []\n",
|
|
" for interpro_annotation in interpro_annotations:\n",
|
|
" keywords = interpro2keywords.get(interpro_annotation.label, [])\n",
|
|
" keyword_annotations_list.extend(\n",
|
|
" [\n",
|
|
" FunctionAnnotation(\n",
|
|
" label=keyword,\n",
|
|
" start=interpro_annotation.start,\n",
|
|
" end=interpro_annotation.end,\n",
|
|
" )\n",
|
|
" for keyword in keywords\n",
|
|
" ]\n",
|
|
" )\n",
|
|
" return keyword_annotations_list"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"protein.function_annotations = get_keywords_from_interpro(interpro_function_annotations)\n",
|
|
"protein.function_annotations"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can also visualize the keyword annotations, which all have the same color, indicating it is not a known InterPro entry type."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fig, ax = plt.subplots(figsize=(20.0, 8.0))\n",
|
|
"visualize_function_annotations(protein.function_annotations, len(protein), ax)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## `sasa`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The final input track of `ESMProtein` is the solvent-accessible surface area, or [SASA](https://en.wikipedia.org/wiki/Accessible_surface_area). For each amino acid, this track indicates how much of it is accessible to the solvent. We can compute this by `ProteinChain`'s `sasa` function, which uses biotite's [`sasa`](https://www.biotite-python.org/apidoc/biotite.structure.sasa.html) function under the hood."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 26,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"protein.sasa = protein_chain.sasa()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"One way to visualize this track is to represent its values as it varies along the amino acid sequence."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"plt.plot(protein.sasa)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can also map these SASA values onto the 3D visualization of the structure, leveraging the fact that we have this protein's 3D coordinates.\n",
|
|
"\n",
|
|
"First we define which colors map to which values:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"cmap = colormaps[\"cividis\"]\n",
|
|
"clip_sasa_lower = 10\n",
|
|
"clip_sasa_upper = 90\n",
|
|
"\n",
|
|
"\n",
|
|
"def plot_heatmap_legend(cmap, clip_sasa_lower, clip_sasa_upper):\n",
|
|
" gradient = np.linspace(0, 1, 256)\n",
|
|
" gradient = np.vstack((gradient, gradient))\n",
|
|
" _, ax = plt.subplots(figsize=(5, 0.3), dpi=350)\n",
|
|
" ax.imshow(gradient, aspect=\"auto\", cmap=cmap)\n",
|
|
" ax.text(\n",
|
|
" 0.1,\n",
|
|
" -0.3,\n",
|
|
" f\"{clip_sasa_lower} or lower\",\n",
|
|
" va=\"center\",\n",
|
|
" ha=\"right\",\n",
|
|
" fontsize=7,\n",
|
|
" transform=ax.transAxes,\n",
|
|
" )\n",
|
|
" ax.text(\n",
|
|
" 0.5,\n",
|
|
" -0.3,\n",
|
|
" f\"{(clip_sasa_lower + clip_sasa_upper) // 2}\",\n",
|
|
" va=\"center\",\n",
|
|
" ha=\"right\",\n",
|
|
" fontsize=7,\n",
|
|
" transform=ax.transAxes,\n",
|
|
" )\n",
|
|
" ax.text(\n",
|
|
" 0.9,\n",
|
|
" -0.3,\n",
|
|
" f\"{clip_sasa_upper} or higher\",\n",
|
|
" va=\"center\",\n",
|
|
" ha=\"left\",\n",
|
|
" fontsize=7,\n",
|
|
" transform=ax.transAxes,\n",
|
|
" )\n",
|
|
" ax.set_xticklabels([])\n",
|
|
" ax.set_yticklabels([])\n",
|
|
" ax.set_xticks([])\n",
|
|
" ax.set_yticks([])\n",
|
|
" plt.show()\n",
|
|
"\n",
|
|
"\n",
|
|
"plot_heatmap_legend(cmap, clip_sasa_lower, clip_sasa_upper)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 33,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Functions for visualizing SASA as colors on the 3D structure\n",
|
|
"\n",
|
|
"\n",
|
|
"def get_color_strings(sasa, clip_sasa_lower, clip_sasa_upper, cmap):\n",
|
|
" transformed_sasa = np.clip(sasa, clip_sasa_lower, clip_sasa_upper)\n",
|
|
" transformed_sasa = (transformed_sasa - clip_sasa_lower) / (\n",
|
|
" clip_sasa_upper - clip_sasa_lower\n",
|
|
" )\n",
|
|
" rgbas = (cmap(transformed_sasa) * 255).astype(int)\n",
|
|
"\n",
|
|
" return [f\"rgb({rgba[0]},{rgba[1]},{rgba[2]})\" for rgba in rgbas]\n",
|
|
"\n",
|
|
"\n",
|
|
"def visualize_sasa_3D_protein(\n",
|
|
" protein, clip_sasa_lower=clip_sasa_lower, clip_sasa_upper=clip_sasa_upper, cmap=cmap\n",
|
|
"):\n",
|
|
" pdb_string = protein.to_pdb_string()\n",
|
|
" plot_heatmap_legend(cmap, clip_sasa_lower, clip_sasa_upper)\n",
|
|
" view = py3Dmol.view(width=400, height=400)\n",
|
|
" view.addModel(pdb_string, \"pdb\")\n",
|
|
"\n",
|
|
" for res_pos, res_color in enumerate(\n",
|
|
" get_color_strings(protein.sasa, clip_sasa_lower, clip_sasa_upper, cmap)\n",
|
|
" ):\n",
|
|
" view.setStyle(\n",
|
|
" {\"chain\": \"A\", \"resi\": res_pos + 1}, {\"cartoon\": {\"color\": res_color}}\n",
|
|
" )\n",
|
|
" view.zoomTo()\n",
|
|
" view.render()\n",
|
|
" view.center()\n",
|
|
"\n",
|
|
" return view"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We visualize SASA on the 3D structure below. Note that the amino acids that are on the inside have lower SASA values, and the amino acids at the surface have higher SASA values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"visualize_sasa_3D_protein(protein)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We have now covered all the tracks of `ESMProtein`. \n",
|
|
"\n",
|
|
"We can initialize an `ESMProtein` by providing any of these tracks. For instance, to initialize a protein with the same coordinates as our `protein`, we would do:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 35,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"same_structure_protein = ESMProtein(coordinates=protein.coordinates)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"and similarly for any other track."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We hope this helps you get started with our ESM3 models!"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.0"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|