ProteinMPNN and LigandMPNN
Warning
Benchmarking: Please use the old repositories of ProteinMPNN and LigandMPNN for model benchmarking/comparison until the API and public weights stabilize. We are in the process of validating that the re-implementation (both the retrained version and the old weight loading option) is as performant as the original models.
Important
Issues: Please provide feedback on any issues you encounter with the ProteinMPNN/LigandMPNN re-implementation. We are particularly interested in discrepancies between the original models and this re-implementation, issues with performance when loading the original weights from the old repositories, problems with inference hyperparameters/conditioning, and input/output bugs.
Important
API Instability: We are currently finalizing some cleanup work on the inference API and training code. Please expect the API (including input formats and outputs) to stabilize in the upcoming weeks. Thank you for your patience!
Important
Training Code and New Weights: We are working to release the dataframes used for retraining the ProteinMPNN and LigandMPNN re-implementations. We are also finalizing the retraining runs and will release weights retrained within this repository shortly.
ProteinMPNN enables protein sequence design given a fixed protein backbone structure. LigandMPNN extends this functionality to fixed-backbone sequence design of proteins in the context of ligands (e.g., small molecules, ions, DNA/RNA). This module is a re-implementation of the original ProteinMPNN and LigandMPNN models within the modelforge/atomworks framework.
For more information on the original models, please see:
- Robust deep learning–based protein sequence design using ProteinMPNN | ProteinMPNN Original Github
- Atomic context-conditioned protein sequence design using LigandMPNN | LigandMPNN Original Github
This guide provides instructions on preparing inputs and running inference for ProteinMPNN/LigandMPNN, as well as training these models.
Inference
Important
When running inference with weights from the original ProteinMPNN/LigandMPNN repositories, be sure to set is_legacy_weights to True.
Notes on Programmatic (Scripted) Inference
- Currently, 'mpnn_bias' and 'mpnn_pair_bias' annotations cannot be saved to CIF files due to shape limitations. As a result, these annotations must be recreated (either directly with annotation on the atom array or via the input config dictionary) when reloading designed structures from CIF files.
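One way to avoid rebuilding bias values by hand after a reload is to keep them in a small sidecar file next to the designed CIF. The following is a minimal sketch of that idea using only the Python standard library. The helper names, the "chain:res_id" key format, the JSON layout, and the truncated three-letter alphabet are all assumptions made for illustration; none of them are part of this module's API. The final step of re-attaching the rows (as an annotation on the atom array or through the input config dictionary) is intentionally not shown, since it depends on the inference API.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical column order for each bias row. A real table would cover the
# full token vocabulary; three entries keep the example short.
AMINO_ACIDS = ["ALA", "CYS", "ASP"]


def save_bias_sidecar(bias_by_residue, path):
    """Write a {"chain:res_id": [bias per amino acid]} table as JSON."""
    Path(path).write_text(json.dumps(bias_by_residue, indent=2))


def load_bias_sidecar(path, residue_keys):
    """Reload the table and return one row per key, in the order given.

    Residues that are missing from the sidecar get an all-zero row,
    i.e. no bias toward or away from any amino acid.
    """
    stored = json.loads(Path(path).read_text())
    neutral = [0.0] * len(AMINO_ACIDS)
    return [stored.get(key, list(neutral)) for key in residue_keys]


# Round trip: save the biases alongside a design, then reload them for the
# residues found in the re-parsed structure (here, hard-coded keys).
with tempfile.TemporaryDirectory() as tmp_dir:
    sidecar = Path(tmp_dir) / "design_0.mpnn_bias.json"
    save_bias_sidecar({"A:10": [0.0, -5.0, 1.5], "A:11": [2.0, 0.0, 0.0]}, sidecar)
    rows = load_bias_sidecar(sidecar, ["A:10", "A:11", "A:12"])
```

Here rows holds one list per requested residue, with "A:12" falling back to the neutral row because it was never saved. The same pattern would extend to pair biases by keying on a pair of residues, though the shape expected for 'mpnn_pair_bias' is not specified here.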