diff --git a/README.md b/README.md index b82bb96..332f0f5 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,6 @@ ![header ](imgs/of_banner.png) _Figure: Comparison of OpenFold and AlphaFold2 predictions to the experimental structure of PDB 7KDX, chain B._ - # OpenFold A faithful but trainable PyTorch reproduction of DeepMind's @@ -10,6 +9,8 @@ A faithful but trainable PyTorch reproduction of DeepMind's # Documentation See our new home for docs at [openfold.readthedocs.io](https://openfold.readthedocs.io/en/latest/), with instructions for installation and model inference/training. +Much of the content from this page may be found [here.](https://github.com/aqlaboratory/openfold/blob/main/docs/source/original_readme.md) + ## Copyright Notice While AlphaFold's and, by extension, OpenFold's source code is licensed under diff --git a/docs/source/Aux_seq_files.md b/docs/source/Aux_seq_files.md index 0e03f2a..820872f 100644 --- a/docs/source/Aux_seq_files.md +++ b/docs/source/Aux_seq_files.md @@ -14,7 +14,7 @@ For example, consider two protein as a case study ``` - OpenProteinSet └── mmcifs - └── 3lrm.cif + ├── 3lrm.cif └── 6kwc.cif ... ``` @@ -64,13 +64,13 @@ All together, the file directory would look like: └── pdb ├── mmcif_cache.json └── mmcifs - └── 3lrm.cif + ├── 3lrm.cif └── 6kwc.cif └── alignment_db - └── alignment_db_0.db - └── alignment_db_1.db + ├── alignment_db_0.db + ├── alignment_db_1.db ... - └── alignment_db_9.db + ├── alignment_db_9.db └── alignment_db.index ``` diff --git a/docs/source/OpenFold_Training_Setup.md b/docs/source/OpenFold_Training_Setup.md index 2890bdf..59da9a4 100644 --- a/docs/source/OpenFold_Training_Setup.md +++ b/docs/source/OpenFold_Training_Setup.md @@ -4,18 +4,20 @@ The multiple sequence alignments of OpenProteinSet and mmCIF structure files req ### Pre-Requisites: - OpenFold conda environment. See [OpenFold Installation](Installation.md) for instructions on how to build this environment. 
+- In particular, the [AWS CLI](https://aws.amazon.com/cli/) is used to download data from RODA. - For this guide, we assume that the OpenFold codebase is located at `$OF_DIR`. ## 1. Downloading alignments and structure files To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands: ```bash -mkdir -p alignment_data/alignment_dir_roda --recursive --no-sign-request +mkdir -p alignment_data/alignment_dir_roda aws s3 cp s3://openfold/pdb/ alignment_data/alignment_dir_roda/ --recursive --no-sign-request mkdir pdb_data aws s3 cp s3://openfold/pdb_mmcif.zip pdb_data/ --no-sign-request -aws s3 cp s3://openfold/duplicate_pdb_chains.txt pdb_data/ --no-sign-request +aws s3 cp s3://openfold/duplicate_pdb_chains.txt . --no-sign-request +unzip pdb_data/pdb_mmcif.zip -d pdb_data ``` The nested alignment directory structure is not yet exactly what OpenFold expects, so you can run the `flatten_roda.sh` script to convert them to the correct format: @@ -102,7 +104,12 @@ python $OF_DIR/scripts/fasta_to_clusterfile.py \ ## 5. Generating cluster-files As a last step, OpenFold requires ["cache" files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata information for each chain that are used for choosing templates and samples during training. 
-The mmCIF-cache is used for filtering templates and can be generated with the following script: +The data caches for OpenProteinSet can be downloaded from RODA with the following: + +```bash +aws s3 cp s3://openfold/data_caches/ pdb_data/ --recursive --no-sign-request +``` +If you wish to create data caches for your own datasets, the steps to generate the cache are as follows: ```bash mkdir pdb_data/data_caches diff --git a/docs/source/Training_OpenFold.md b/docs/source/Training_OpenFold.md index a14c114..066672c 100644 --- a/docs/source/Training_OpenFold.md +++ b/docs/source/Training_OpenFold.md @@ -1,62 +1,65 @@ # Training OpenFold ## Background -This guide covers how to train an OpenFold model. These instructions focus on training a model for predicting monomers, but additional instructions are provided for training a monomer / multimer model. +This guide covers how to train an OpenFold model for monomers. Some additional instructions are provided at the end for fine-tuning your model. ### Pre-requisites: This guide requires the following: - [Installation of OpenFold and dependencies](Installation.md) (Including jackhmmer and hhblits depedencies) - A preprocessed dataset: - - For this guide, we will use the original OpenFold dataset which is available on RODA (TODO: add link to processed dataset). - - If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format. + - For this guide, we will use the original OpenFold dataset which is available on RODA, processed with [these instructions](OpenFold_Training_Setup.md). - GPUs configured with CUDA. Training OpenFold with CPUs only is not supported. -Expected data directory structure: -``` -- OpenProteinSet - └── alignments - └── 2x7l_M - └── mgnify_hits.a3m - └── bfd_uniclust_hits.a3m - └── uniref90_hits.a3m - └── pdb70_hits.hhr - ... - └── mmcifs - └── 3u8d.cif - └── 3lrm.cif - ... 
- └── mmcif_cache.json - └── chain_data_cache.json -``` - -The `mmcif_cache.json` and the `chain_data_cache.json` provide metadata for the mmcif and the protein chains in the dataset. - ## Training a new OpenFold model #### Basic command -The basic command to train a new OpenFold model is + +For a dataset that has the default alignment file structure, e.g. + ``` -python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignments/ template_mmcif_dir/ $OUTPUT_DIR \ +-$DATA_DIR + └── pdb_data + ├── mmcifs + ├── 3lrm.cif + └── 6kwc.cif + ... + ├── obsolete.dat + ├── duplicate_pdb_chains.txt + └── data_caches + ├── chain_data_cache.json + └── mmcif_cache.json + └── alignment_data + └── alignments + ├── 3lrm_A/ + ├── 3lrm_B/ + └── 6kwc_A/ + ... +``` + +The basic command to train a new OpenFold model is: + +``` +python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignments $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \ --max_template_date 2021-10-10 \ - --train_chain_data_cache_path chain_data_cache.json \ - --template_release_dates_cache_path mmcif_cache.json \ + --train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \ + --template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \ --config_preset initial_training \ --seed 42 \ - --obsolete_pdbs_file_path obsolete.dat \ + --obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \ --num_nodes 1 \ --gpus 4 \ - --num_workers 4 \ + --num_workers 4 ``` The required arguments are: - `mmcif_dir` : Mmcif files for the training set. -- `alignment_dir`: Alignments for the sequences in `mmcif_dir`, see expected directory structure +- `alignments_dir`: Alignments for the sequences in `mmcif_dir`, see expected directory structure - `template_mmcif_dir`: Template mmcif files with structures, which can be the same directory as mmcif_dir. 
The `max_template_date` and `template_release_dates_cache_path` will specify which templates will be allowed based on a date cutoff -- `$OUTPUT_DIR` : Where model checkpoint files and other outputs will be saved. +- `output_dir` : Where model checkpoint files and other outputs will be saved. Commonly used flags include: -- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in `openfold/config.py` +- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold) - `num_nodes` and `gpus`: Specifies number of nodes and GPUs available to train OpenFold. - `seed` - Specifies random seed - `num_workers`: Number of CPU workers to assign for creating dataset examples @@ -67,16 +70,40 @@ Commonly used flags include: Note that `--seed` must be specified to correctly configure training examples on multi-GPU training runs ``` +#### Train with OpenFold Dataset Configuration +If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, resulting in a data directory such as: +``` +- $DATA_DIR + ├── duplicate_pdb_chains.txt + ├── pdb_data + └── mmcifs + ├── 3lrm.cif + └── 6kwc.cif + └── alignment_data + └── alignment_db + ├── alignment_db_0.db + ├── alignment_db_1.db + ... 
+ ├── alignment_db_9.db + └── alignment_db.index +``` -#### Train OpenFold with Different Dataset Configurations - -If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, the training command will instead look like this: - - - - +The training command will use the `alignment_index_path` argument to specify `db.index` files, e.g.: +``` +python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignment_db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \ + --max_template_date 2021-10-10 \ + --train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \ + --template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \ + --alignment_index_path $DATA_DIR/alignment_data/alignment_db/alignment_db.index \ + --config_preset initial_training \ + --seed 42 \ + --obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \ + --num_nodes 1 \ + --gpus 4 \ + --num_workers 4 +``` #### Additional command line flag options: @@ -104,40 +131,29 @@ Here we provide brief descriptions for customizing your training run of OpenFold - **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` to restart training from an existing checkpoint. ## Advanced Training Configurations - -### Training OpenFold Multimer - -At this time, we do not have a multimer training set available. To prepare your own multimer training set, please see the instructions at [Data Processing - multimer] - -The basic command for training a multimer model is then: - -``` -multimer training command here -``` - -The key differences are: -- Dataset configuration / preparation +::: ### Fine tuning from existing model weights -If you have existing model weights, you can fine tune the model using the following command: +If you have existing model weights, you can fine tune the model by specifying a checkpoint path with `--resume_from_ckpt` and `--resume_model_weights_only` arguments, e.g. 
``` -python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ $OUTPUT_DIR \ +python3 train_openfold.py $DATA_DIR/mmcifs $DATA_DIR/alignment.db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \ --max_template_date 2021-10-10 \ --train_chain_data_cache_path chain_data_cache.json \ --template_release_dates_cache_path mmcif_cache.json \ --config_preset finetuning \ + --alignment_index_path $DATA_DIR/pdb/alignment_db.index \ --seed 4242022 \ --obsolete_pdbs_file_path obsolete.dat \ --num_nodes 1 \ --gpus 4 \ --num_workers 4 \ - --resume_from_ckpt $CHECKPOINT_PATH + --resume_from_ckpt $CHECKPOINT_PATH \ --resume_model_weights_only ``` -If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint file or parameter. See [[Converting OpenFold v1 Weights]] +If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint file or parameter. See [Converting OpenFold v1 Weights](convert_of_v1_weights.md) for more details. ### Using MPI @@ -145,3 +161,10 @@ If MPI is configured on your system, and you would like to use MPI to train Open 1. Add the `mpi4py` package, which are available through pip and conda. Please see [mpi4py documentation](https://pypi.org/project/mpi4py/) for more instructions on installation. 2. Add the `--mpi_plugin` flag to your training command. + + +### Training Multimer models + +```{note} +Coming soon. +``` \ No newline at end of file diff --git a/docs/source/convert_of_v1_weights.md b/docs/source/convert_of_v1_weights.md index 6a20329..f50c20e 100644 --- a/docs/source/convert_of_v1_weights.md +++ b/docs/source/convert_of_v1_weights.md @@ -25,8 +25,7 @@ $ python3 $OPENFOLD_DIR/train_openfold.py test_data_epoch/mmcifs test_data_epoch ### How do I convert my checkpoints? -Use the `convert_v1_to_v2_weights.py` script in the `scripts` directory of the OpenFold repo: -e.g. +Use [`scripts/convert_v1_to_v2_weights.py`](https://github.com/aqlaboratory/openfold/blob/main/scripts/convert_v1_to_v2_weights.py) e.g. 
`python scripts/convert_v1_to_v2_weights.py checkpoints/6-209.ckpt checkpoints/6-209.ckpt.converted` diff --git a/docs/source/index.md b/docs/source/index.md index 367ba6a..ae58587 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -8,14 +8,15 @@ Welcome to the Documentation for OpenFold, the fully open source, trainable, PyTorch-based reproduction of DeepMind's [AlphaFold 2](https://github.com/deepmind/alphafold). - Here, you will find guides and documentation for: - [Getting started with OpenFold](installation.md)! - Learn how to [run inference with OpenFold](Inference.md) - [Train your own OpenFold models](Training_OpenFold.md) - Find guidance for setup and running OpenFold in the [FAQ](FAQ.md). -Some portions of the documentation are still under migration from the original README, which can be found [here](original_readme.md) +We also have a [Colab notebook](https://colab.research.google.com/github/aqlaboratory/openfold/blob/main/notebooks/OpenFold.ipynb) that can be used for single structure / multimer prediction. + +Some portions of the documentation are still under migration from the original README, which can be found [here](original_readme.md). # Features