mirror of
https://github.com/aqlaboratory/openfold.git
synced 2026-06-04 12:44:26 +08:00
Update training OpenFold docs with correct paths.
@@ -14,7 +14,7 @@ For example, consider two protein as a case study
```
- OpenProteinSet
└── mmcifs
└── 3lrm.cif
├── 3lrm.cif
└── 6kwc.cif
...
```
@@ -64,13 +64,13 @@ All together, the file directory would look like:
└── pdb
├── mmcif_cache.json
└── mmcifs
└── 3lrm.cif
├── 3lrm.cif
└── 6kwc.cif
└── alignment_db
└── alignment_db_0.db
└── alignment_db_1.db
├── alignment_db_0.db
├── alignment_db_1.db
...
└── alignment_db_9.db
├── alignment_db_9.db
└── alignment_db.index
```
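
The `alignment_db` layout shown in this hunk can be sanity-checked before training. The following is a minimal sketch, not part of OpenFold: the helper name `check_alignment_db` is hypothetical, and it assumes only the `alignment_db_<n>.db` and `alignment_db.index` file names that appear in the tree above.

```python
import tempfile
from pathlib import Path


def check_alignment_db(db_dir):
    """Return (shard_names, has_index) for an alignment_db directory."""
    db_dir = Path(db_dir)
    # Assumption: shards follow the alignment_db_<n>.db naming shown in the docs.
    shards = sorted(p.name for p in db_dir.glob("alignment_db_*.db"))
    has_index = (db_dir / "alignment_db.index").is_file()
    return shards, has_index


if __name__ == "__main__":
    # Build a throwaway directory that mimics the documented layout, then check it.
    with tempfile.TemporaryDirectory() as tmp:
        db_dir = Path(tmp) / "alignment_db"
        db_dir.mkdir()
        for i in range(3):
            (db_dir / f"alignment_db_{i}.db").touch()
        (db_dir / "alignment_db.index").touch()
        shards, has_index = check_alignment_db(db_dir)
        print(f"{len(shards)} shard(s) found, index present: {has_index}")
```

Pointing the helper at a real `alignment_db` directory instead of the temporary one would report whether the index file sits next to the shards.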
@@ -4,18 +4,20 @@ The multiple sequence alignments of OpenProteinSet and mmCIF structure files req

### Pre-Requisites:
- OpenFold conda environment. See [OpenFold Installation](Installation.md) for instructions on how to build this environment.
- In particular, the [AWS CLI](https://aws.amazon.com/cli/) is used to download data from RODA.
- For this guide, we assume that the OpenFold codebase is located at `$OF_DIR`.

## 1. Downloading alignments and structure files
To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands:

```bash
mkdir -p alignment_data/alignment_dir_roda --recursive --no-sign-request
mkdir -p alignment_data/alignment_dir_roda
aws s3 cp s3://openfold/pdb/ alignment_data/alignment_dir_roda/ --recursive --no-sign-request

mkdir pdb_data
aws s3 cp s3://openfold/pdb_mmcif.zip pdb_data/ --no-sign-request
aws s3 cp s3://openfold/duplicate_pdb_chains.txt pdb_data/ --no-sign-request
aws s3 cp s3://openfold/duplicate_pdb_chains.txt . --no-sign-request
unzip pdb_mmcif.zip -d pdb_data
```

The nested alignment directory structure is not yet exactly what OpenFold expects, so you can run the `flatten_roda.sh` script to convert them to the correct format:
@@ -102,7 +104,12 @@ python $OF_DIR/scripts/fasta_to_clusterfile.py \
## 5. Generating cluster-files
As a last step, OpenFold requires ["cache" files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata information for each chain that are used for choosing templates and samples during training.

The mmCIF-cache is used for filtering templates and can be generated with the following script:
The data caches for OpenProteinSet can be downloaded from RODA with the following:

```bash
aws s3 cp s3://openfold/data_caches/ pdb_data/ --recursive --no-sign-request
```
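
Once the cache files are downloaded, a quick look at their size can confirm they are intact. The sketch below is illustrative only: `summarize_cache` is a hypothetical helper, and it assumes each cache is a JSON object with one top-level key per entry. The demo file and its empty entries are placeholders, not real OpenFold metadata.

```python
import json
import tempfile
from pathlib import Path


def summarize_cache(path):
    """Load a cache file and return (entry_count, first_few_keys)."""
    with open(path) as handle:
        cache = json.load(handle)
    # Assumption: the top level is a JSON object keyed by chain or PDB ID.
    if not isinstance(cache, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return len(cache), list(cache)[:3]


if __name__ == "__main__":
    # Placeholder cache with made-up empty entries, just to exercise the helper.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "chain_data_cache.json"
        demo.write_text(json.dumps({"3lrm_A": {}, "3lrm_B": {}, "6kwc_A": {}}))
        count, sample = summarize_cache(demo)
        print(f"{count} entries, e.g. {sample}")
```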

If you wish to create data caches for your own datasets, the steps to generate the cache are as follows:

```bash
mkdir pdb_data/data_caches

@@ -1,62 +1,65 @@
# Training OpenFold
## Background

This guide covers how to train an OpenFold model. These instructions focus on training a model for predicting monomers, but additional instructions are provided for training a monomer / multimer model.
This guide covers how to train an OpenFold model for monomers. Some additional instructions are provided at the end for fine-tuning your model.

### Pre-requisites:

This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md) (including jackhmmer and hhblits dependencies)
- A preprocessed dataset:
- For this guide, we will use the original OpenFold dataset which is available on RODA (TODO: add link to processed dataset).
- If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format.
- For this guide, we will use the original OpenFold dataset which is available on RODA, processed with [these instructions](OpenFold_Training_Setup.md).
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported.

Expected data directory structure:
```
- OpenProteinSet
└── alignments
└── 2x7l_M
└── mgnify_hits.a3m
└── bfd_uniclust_hits.a3m
└── uniref90_hits.a3m
└── pdb70_hits.hhr
...
└── mmcifs
└── 3u8d.cif
└── 3lrm.cif
...
└── mmcif_cache.json
└── chain_data_cache.json
```

The `mmcif_cache.json` and the `chain_data_cache.json` provide metadata for the mmcif and the protein chains in the dataset.

## Training a new OpenFold model

#### Basic command
The basic command to train a new OpenFold model is

For a dataset that has the default alignment file structure, e.g.

```
python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignments/ template_mmcif_dir/ $OUTPUT_DIR \
-$DATA_DIR
└── pdb_data
├── mmcifs
├── 3lrm.cif
└── 6kwc.cif
...
├── obsolete.dat
├── duplicate_pdb_chains.txt
└── data_caches
├── duplicate_pdb_chains.txt
└── data_caches
└── alignment_data
└── alignments
├── 3lrm_A/
├── 3lrm_B/
└── 6kwc_A/
...
```

The basic command to train a new OpenFold model is:

```
python3 train_openfold.py $DATA_DIR/pdb/mmcifs $DATA_DIR/alignment_data/alignments $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
--max_template_date 2021-10-10 \
--train_chain_data_cache_path chain_data_cache.json \
--template_release_dates_cache_path mmcif_cache.json \
--train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
--template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
--config_preset initial_training \
--seed 42 \
--obsolete_pdbs_file_path obsolete.dat \
--obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \
--num_nodes 1 \
--gpus 4 \
--num_workers 4 \
--num_workers 4
```
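
When the same command is launched repeatedly with different data roots, it can help to assemble the arguments programmatically. The sketch below is a hypothetical helper rather than part of OpenFold. It uses only the flags shown in the command above and assumes the `pdb_data/` and `alignment_data/` layout from the directory tree; the example paths passed to it are placeholders.

```python
import shlex
from pathlib import PurePosixPath


def build_train_command(data_dir, template_mmcif_dir, output_dir, gpus=4, seed=42):
    """Assemble the train_openfold.py argument list for the default layout."""
    data_dir = PurePosixPath(data_dir)
    pdb_data = data_dir / "pdb_data"      # assumed location of mmCIFs and caches
    caches = pdb_data / "data_caches"
    return [
        "python3", "train_openfold.py",
        str(pdb_data / "mmcifs"),
        str(data_dir / "alignment_data" / "alignments"),
        str(template_mmcif_dir),
        str(output_dir),
        "--max_template_date", "2021-10-10",
        "--train_chain_data_cache_path", str(caches / "chain_data_cache.json"),
        "--template_release_dates_cache_path", str(caches / "mmcif_cache.json"),
        "--config_preset", "initial_training",
        "--seed", str(seed),  # the docs note --seed is needed for multi-GPU runs
        "--obsolete_pdbs_file_path", str(pdb_data / "obsolete.dat"),
        "--num_nodes", "1",
        "--gpus", str(gpus),
        "--num_workers", "4",
    ]


if __name__ == "__main__":
    # Placeholder paths; print the command instead of running it.
    cmd = build_train_command("/data/openfold", "/data/openfold/pdb_data/mmcifs", "/tmp/out")
    print(shlex.join(cmd))
```

Printing the command first makes it easy to review the resolved paths before passing the list to `subprocess.run`.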

The required arguments are:
- `mmcif_dir` : Mmcif files for the training set.
- `alignment_dir`: Alignments for the sequences in `mmcif_dir`, see expected directory structure
- `alignments_dir`: Alignments for the sequences in `mmcif_dir`, see expected directory structure
- `template_mmcif_dir`: Template mmcif files with structures, which can be the same directory as mmcif_dir. The `max_template_date` and `template_release_dates_cache_path` will specify which templates will be allowed based on a date cutoff
- `$OUTPUT_DIR` : Where model checkpoint files and other outputs will be saved.
- `output_dir` : Where model checkpoint files and other outputs will be saved.

Commonly used flags include:
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in `openfold/config.py`
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold)
- `num_nodes` and `gpus`: Specifies number of nodes and GPUs available to train OpenFold.
- `seed` - Specifies random seed
- `num_workers`: Number of CPU workers to assign for creating dataset examples
@@ -67,16 +70,40 @@ Commonly used flags include:
Note that `--seed` must be specified to correctly configure training examples on multi-GPU training runs
```

#### Train with OpenFold Dataset Configuration

If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, resulting in a data directory such as:
```
- $DATA_DIR
├── duplicate_pdb_chains.txt
├── pdb_data
└── mmcifs
├── 3lrm.cif
└── 6kwc.cif
└── alignment_data
└── alignment_db
├── alignment_db_0.db
├── alignment_db_1.db
...
├── alignment_db_9.db
└── alignment_db.index
```

#### Train OpenFold with Different Dataset Configurations

If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, the training command will instead look like this:

The training command will use the `alignment_index_path` argument to specify `db.index` files, e.g.:

```
python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignment_db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
--max_template_date 2021-10-10 \
--train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
--template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
--alignment_index_path $DATA_DIR/pdb/alignment_db.index \
--config_preset initial_training \
--seed 42 \
--obsolete_pdbs_file_path $DATA_DIR/pdb/obsolete.dat \
--num_nodes 1 \
--gpus 4 \
--num_workers 4
```

#### Additional command line flag options:

@@ -104,40 +131,29 @@ Here we provide brief descriptions for customizing your training run of OpenFold
- **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` to restart training from an existing checkpoint.

## Advanced Training Configurations

### Training OpenFold Multimer

At this time, we do not have a multimer training set available. To prepare your own multimer training set, please see the instructions at [Data Processing - multimer]

The basic command for training a multimer model is then:

```
multimer training command here
```

The key differences are:
- Dataset configuration / preparation
:::

### Fine tuning from existing model weights

If you have existing model weights, you can fine tune the model using the following command:
If you have existing model weights, you can fine tune the model by specifying a checkpoint path with `--resume_from_ckpt` and `--resume_model_weights_only` arguments, e.g.

```
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ $OUTPUT_DIR \
python3 train_openfold.py $DATA_DIR/mmcifs $DATA_DIR/alignment.db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
--max_template_date 2021-10-10 \
--train_chain_data_cache_path chain_data_cache.json \
--template_release_dates_cache_path mmcif_cache.json \
--config_preset finetuning \
--alignment_index_path $DATA_DIR/pdb/alignment_db.index \
--seed 4242022 \
--obsolete_pdbs_file_path obsolete.dat \
--num_nodes 1 \
--gpus 4 \
--num_workers 4 \
--resume_from_ckpt $CHECKPOINT_PATH
--resume_from_ckpt $CHECKPOINT_PATH \
--resume_model_weights_only
```

If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint file or parameter. See [[Converting OpenFold v1 Weights]]
If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint file or parameter. See [Converting OpenFold v1 Weights](convert_of_v1_weights.md) for more details.

### Using MPI

@@ -145,3 +161,10 @@ If MPI is configured on your system, and you would like to use MPI to train Open

1. Add the `mpi4py` package, which is available through pip and conda. Please see [mpi4py documentation](https://pypi.org/project/mpi4py/) for more instructions on installation.
2. Add the `--mpi_plugin` flag to your training command.

### Training Multimer models

```{note}
Coming soon.
```
@@ -25,8 +25,7 @@ $ python3 $OPENFOLD_DIR/train_openfold.py test_data_epoch/mmcifs test_data_epoch

### How do I convert my checkpoints?

Use the `convert_v1_to_v2_weights.py` script in the `scripts` directory of the OpenFold repo:
e.g.
Use [`scripts/convert_v1_to_v2_weights.py`](https://github.com/aqlaboratory/openfold/blob/main/scripts/convert_v1_to_v2_weights.py) e.g.

`python scripts/convert_v1_to_v2_weights.py checkpoints/6-209.ckpt checkpoints/6-209.ckpt.converted`
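
If several checkpoints need converting, the same invocation can be generated in a loop. This is a minimal sketch under stated assumptions: `conversion_commands` is a hypothetical helper, the script is assumed to take the input and output paths as its two positional arguments (as in the example above), and the `.converted` suffix simply mirrors that example.

```python
import shlex
from pathlib import PurePosixPath

# Script path as given in the FAQ entry, relative to the OpenFold repo root.
SCRIPT = "scripts/convert_v1_to_v2_weights.py"


def conversion_commands(checkpoints):
    """Yield one argv list per checkpoint, writing to '<name>.converted'."""
    for ckpt in checkpoints:
        ckpt = PurePosixPath(ckpt)
        out = ckpt.with_name(ckpt.name + ".converted")
        yield ["python", SCRIPT, str(ckpt), str(out)]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for argv in conversion_commands(["checkpoints/6-209.ckpt"]):
        print(shlex.join(argv))
```

Each yielded list can be handed to `subprocess.run(argv, check=True)` from the repository root once the printed commands look right.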
@@ -8,14 +8,15 @@
Welcome to the Documentation for OpenFold, the fully open source, trainable, PyTorch-based reproduction of DeepMind's
[AlphaFold 2](https://github.com/deepmind/alphafold).

Here, you will find guides and documentation for:
- [Getting started with OpenFold](installation.md)!
- Learn how to [run inference with OpenFold](Inference.md)
- [Train your own OpenFold models](Training_OpenFold.md)
- Find guidance for setup and running OpenFold in the [FAQ](FAQ.md).

Some portions of the documentation are still under migration from the original README, which can be found [here](original_readme.md)
We also have a [Colab notebook](https://colab.research.google.com/github/aqlaboratory/openfold/blob/main/notebooks/OpenFold.ipynb) that can be used for single structure / multimer prediction.

Some portions of the documentation are still under migration from the original README, which can be found [here](original_readme.md).

# Features
