Protein diffusion

Code style: black

We present a diffusion model for generating novel protein backbone structures.

Installation

This software is written in Python, notably using PyTorch, PyTorch Lightning, and the HuggingFace transformers library. The required conda environment is defined in the environment.yml file. To set it up, make sure you have conda (or mamba) installed and run:

conda env create -f environment.yml

Note that you do not need to have this set up if you are only submitting jobs to the cluster.

Training models

To train a model on the CATH dataset, use the script at bin/train.py in combination with one of the json config files under config_jsons (or write your own). An example usage of this is as follows:

python bin/train.py config_jsons/full_run_canonical_angles_only_zero_centered_1000_timesteps_reduced_len.json
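If you would rather derive a config from an existing one than write it from scratch, a small helper along these lines may be enough. This is only a sketch: the function name derive_config is hypothetical, it assumes the config is a flat JSON object, and any keys you override must be checked against the actual files under config_jsons.

```python
import json
from pathlib import Path


def derive_config(base_path, overrides, out_path):
    """Copy a JSON config, apply top-level overrides, and write the result.

    Assumption: the config is a flat JSON object. The helper does not know
    which keys bin/train.py accepts; it only checks against the base file.
    """
    config = json.loads(Path(base_path).read_text())
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        # Refuse keys that the base config does not define; this catches typos.
        raise KeyError(f"keys not present in base config: {unknown}")
    config.update(overrides)
    Path(out_path).write_text(json.dumps(config, indent=4) + "\n")
    return config
```

The written file can then be passed to bin/train.py in place of the example config above.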

The output of the model will be in the results folder with the following major files present:

results/
    - config.json           # Contains the config file for the huggingface BERT model itself
    - logs/                 # Contains the logs from training
    - models/               # Contains model checkpoints. By default we store the best 5 models by validation loss and the best 5 by training loss
    - training_args.json    # Full set of arguments, can be used to reproduce run
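To inspect a finished run programmatically, something like the following may help. It is a minimal sketch that assumes only the layout listed above (training_args.json and a models/ directory directly under the results folder); the function name summarize_results is hypothetical, and no particular checkpoint file naming is assumed.

```python
import json
from pathlib import Path


def summarize_results(results_dir):
    """Return the training arguments and the checkpoint files of a run.

    Assumption: training_args.json and models/ sit directly under
    results_dir. Every file found under models/ is listed as-is.
    """
    results = Path(results_dir)
    training_args = json.loads((results / "training_args.json").read_text())
    checkpoints = sorted(
        p.relative_to(results).as_posix()
        for p in (results / "models").rglob("*")
        if p.is_file()
    )
    return training_args, checkpoints
```

Calling summarize_results("results") returns the parsed arguments together with the relative paths of all files under models/.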

Downloading data

We require some data files that are not packaged on Git due to their large size. These must be downloaded locally even if you are running on Singularity, because they are uploaded from your local copy. To download them, do the following:

# Download the CATH dataset
cd data  # Ensure that you are in the data subdirectory within the codebase
./download_cath.sh

Sampling protein backbones

To sample protein backbones, use the script bin/sample.py. An example command is as follows:

python ~/protdiff/bin/sample.py ../projects/models/full_angles/results/ --num 512 --device cuda:3

This will run the model contained in the results folder and generate 512 sequences of varying lengths. If --device is not specified, the script defaults to the first GPU, cuda:0; use --device cpu to run on CPU.
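When sampling from several trained models in a row, it can be convenient to assemble the command in Python. The sketch below uses only the flags shown in the example command (--num and --device); the helper name build_sample_command and the default script path are assumptions for illustration.

```python
import sys


def build_sample_command(results_dir, num, device=None, script="bin/sample.py"):
    """Assemble the argument list for the sampling script.

    Leaving device as None omits --device, so the script's own default applies.
    """
    if num <= 0:
        raise ValueError("num must be a positive integer")
    cmd = [sys.executable, script, str(results_dir), "--num", str(num)]
    if device is not None:
        cmd += ["--device", device]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) so that a failed sampling run raises an error instead of passing silently.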

Tests

Tests are implemented through a mixture of doctests and unittests. To run unittests, run:

python -m unittest -v
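One common way to have the unittest runner also execute doctests is the standard library's load_tests hook. The sketch below illustrates that pattern and is not necessarily how this repository wires up its doctests: the function wrap_angle is a hypothetical example, and the module would need a file name that unittest discovery picks up (for instance one starting with test).

```python
import doctest
import math
import sys


def wrap_angle(theta):
    """Wrap an angle in radians into the interval [-pi, pi).

    >>> wrap_angle(0.0)
    0.0
    >>> math.isclose(wrap_angle(2 * math.pi + 0.5), 0.5)
    True
    >>> -math.pi <= wrap_angle(10.0) < math.pi
    True
    """
    return (theta + math.pi) % (2 * math.pi) - math.pi


def load_tests(loader, tests, ignore):
    """unittest's load_tests hook: add this module's doctests to the suite."""
    tests.addTests(doctest.DocTestSuite(sys.modules[__name__]))
    return tests
```

With this hook in place, the docstring examples run alongside ordinary TestCase classes whenever the module is loaded by unittest.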

Singularity/amulet

To run on Singularity/amulet, first install amulet following the instructions at https://amulet-docs.azurewebsites.net/main/setup.html. This should leave you with a conda environment named amlt8. Note that this environment should be separate from the environment required to actually run the model. To run training on Singularity, run:

conda activate amlt8  # Activate the conda env.
amlt run -y scripts/amlt.yaml -o results

Within this amlt.yaml file, the python command contains a pointer to a config json file. Edit the path indicated here to use a different configuration for training.

Note regarding the structure of the amlt.yaml file: installing packages via conda is very slow on the Singularity cluster, so we recreate the same set of packages via pip instead of relying on conda.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
