put manuals to a separate folder
32
manuals/Developing.md
Normal file
@@ -0,0 +1,32 @@
|
||||
1. Clone the GitHub repo
|
||||
```
git clone --recurse-submodules git@github.com:KosinskiLab/AlphaPulldown.git
cd AlphaPulldown
git submodule init
git submodule update
```
|
||||
1. Create the Conda environment and install the latest version of AlphaPulldown as described in https://github.com/KosinskiLab/AlphaPulldown/tree/DevelopReadme#for-users-pip-installation
|
||||
1. Add AlphaPulldown package and its submodules to the Conda environment
|
||||
```
cd AlphaPulldown
pip install -e .
pip install -e alphapulldown/ColabFold --no-deps
pip install -e alphafold --no-deps
```
|
||||
You only need to do this once.
|
||||
1. When you want to develop, activate the environment and modify the files. Because the packages are installed in editable mode (```pip install -e```), the changes should be picked up automatically.
|
||||
1. Test your package during development using tests in ```test/```, e.g.:
|
||||
```
pip install pytest
pytest -s test/
pytest -s test/test_predictions_slurm.py
pytest -s test/test_features_with_templates.py::TestCreateIndividualFeaturesWithTemplates::test_1a_run_features_generation
```
|
||||
1. Before pushing to the remote or submitting a pull request, run:
|
||||
```
pip install .
pytest -s test/
```
|
||||
to install the package and test it. The pytest tests for predictions only work if SLURM is available. Check the log files created in your current directory.
|
||||
|
||||
|
||||
BIN
manuals/all_vs_all_demo.png
Normal file
|
After Width: | Height: | Size: 35 KiB |
BIN
manuals/apms_demo_2.png
Normal file
|
After Width: | Height: | Size: 245 KiB |
BIN
manuals/custom_demo_2.png
Normal file
|
After Width: | Height: | Size: 24 KiB |
BIN
manuals/custom_mode_demo.png
Normal file
|
After Width: | Height: | Size: 43 KiB |
319
manuals/example_1.md
Normal file
@@ -0,0 +1,319 @@
|
||||
# Example1
|
||||
# Aim: Find proteins involved in the human translation pathway that might interact with eIF4G2
|
||||
|
||||
## 1st step: compute multiple sequence alignment (MSA) and template features (run on CPUs)
|
||||
|
||||
For the purpose of this manual, the expected file is already provided here: [```./example_data/example_1_sequences.fasta```](./example_data/example_1_sequences.fasta). If you want to run a smaller test, you can use [```./example_data/example_1_sequences_shorter.fasta```](./example_data/example_1_sequences_shorter.fasta) instead.
|
||||
|
||||
|
||||
:memo: *The example file was generated by downloading all 294 proteins that belong to the human translation pathway from [Reactome](https://reactome.org/PathwayBrowser/#/R-HSA-72766&DTAB=MT). The eIF4G2 sequence was downloaded from UniProt ([P78344](https://www.uniprot.org/uniprot/P78344)).*
|
||||
|
||||
### Run using default AlphaFold databases (slower):
|
||||
|
||||
```bash
|
||||
source activate AlphaPulldown
|
||||
create_individual_features.py \
|
||||
--fasta_paths=baits.fasta,example_1_sequences.fasta \
|
||||
--data_dir=<path to alphafold databases> \
|
||||
--save_msa_files=False \
|
||||
--output_dir=<dir to save the output objects> \
|
||||
--use_precomputed_msas=False \
|
||||
--max_template_date=<any date you want, format like: 2050-01-01> \
|
||||
--skip_existing=False \
|
||||
--seq_index=<any number you want or skip the flag to run all one after another>
|
||||
```
|
||||
|
||||
### Run using MMseqs2 and ColabFold databases (faster):
|
||||
|
||||
MMseqs2 and ColabFold allow for much quicker calculation of MSAs than the default AlphaFold method above. To use MMseqs2 in AlphaPulldown, please refer to [this manual](./mmseqs2_manual.md).
|
||||
|
||||
:memo: Please be aware that the MMseqs2/ColabFold and AlphaFold/HHBlits methods give different MSAs. Therefore, the resulting models may also differ. However, the models from these two pipelines usually have comparable accuracy.
|
||||
|
||||
### Expected output
|
||||
```create_individual_features.py``` will compute the necessary features for each protein in [```./example_data/example_1_sequences.fasta```](./example_data/example_1_sequences.fasta) and store them in the ```output_dir```. Please be aware that everything after ```>``` in a FASTA header is taken as the description of the protein, and that any special symbol in it, such as ```| : ; #```, is replaced with ```_```.
|
||||
|
||||
The pickle files are named after the descriptions of the sequences in the FASTA files (e.g. ">protein_A" in the FASTA file will yield "protein_A.pkl").
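To check in advance which pickle names a FASTA file would produce, the renaming rule described above can be sketched in a few lines of shell. This is only an illustration: the helper name `preview_pickle_names` is hypothetical, and it assumes that exactly the four symbols listed above are replaced, whereas the real script may sanitise additional characters.

```bash
#!/bin/bash
# Sketch only: preview the pickle names that FASTA headers would map to.
# ASSUMPTION: only the characters | : ; # are replaced with "_";
# create_individual_features.py may replace further symbols.
preview_pickle_names() {
    # Keep header lines (-h suppresses file names), drop the leading ">",
    # replace the listed symbols and append the .pkl extension.
    grep -h '^>' "$@" \
        | sed -e 's/^>//' -e 's/[|:;#]/_/g' -e 's/$/.pkl/'
}
```

For example, `preview_pickle_names baits.fasta example_1_sequences.fasta` prints one expected file name per sequence, which makes it easier to spot headers that would collide after the replacement.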
|
||||
|
||||
### Running on a computer cluster in parallel
|
||||
|
||||
On a compute cluster, you may want to run all jobs in parallel as a [job array](https://slurm.schedmd.com/job_array.html). For example, on the SLURM queuing system at EMBL we could use the following ```create_individual_features_SLURM.sh``` script:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
#A typical run takes a couple of hours but may be much longer
|
||||
#SBATCH --job-name=array
|
||||
#SBATCH --time=10:00:00
|
||||
|
||||
#log files:
|
||||
#SBATCH -e logs/create_individual_features_%A_%a_err.txt
|
||||
#SBATCH -o logs/create_individual_features_%A_%a_out.txt
|
||||
|
||||
#qos sets priority
|
||||
#SBATCH --qos=low
|
||||
|
||||
#Limit the run to a single node
|
||||
#SBATCH -N 1
|
||||
|
||||
#Adjust this depending on the node
|
||||
#SBATCH --ntasks=8
|
||||
#SBATCH --mem=64000
|
||||
|
||||
module load HMMER/3.3.2-gompic-2020b
|
||||
module load HH-suite/3.3.0-gompic-2020b
|
||||
module load Anaconda3
|
||||
source activate AlphaPulldown
|
||||
|
||||
create_individual_features.py \
|
||||
--fasta_paths=baits.fasta,example_1_sequences_shorter.fasta \
|
||||
--data_dir=/scratch/AlphaFold_DBs/2.2.2/ \
|
||||
--save_msa_files=False \
|
||||
--output_dir=/scratch/user/output/features \
|
||||
--use_precomputed_msas=False \
|
||||
--max_template_date=2050-01-01 \
|
||||
--skip_existing=True \
|
||||
--seq_index=$SLURM_ARRAY_TASK_ID
|
||||
```
|
||||
and then run using:
|
||||
|
||||
```
|
||||
mkdir -p logs
|
||||
#Count the number of jobs corresponding to the number of sequences:
|
||||
baits=`grep ">" baits.fasta | wc -l`
|
||||
candidates=`grep ">" example_1_sequences_shorter.fasta | wc -l`
|
||||
count=$(( $baits + $candidates ))
|
||||
#Run the job array, 100 jobs at a time:
|
||||
sbatch --array=1-$count%100 create_individual_features_SLURM.sh
|
||||
```
|
||||
|
||||
------------------------
|
||||
|
||||
## Explanation about the parameters
|
||||
#### **```save_msa_files```**
|
||||
The default is **False**, which saves storage space, but it can be set to **True**. If it is set to ```True```, the programme will create an individual folder for each protein. The output directory will look like:
|
||||
```
output_dir
     |- protein_A.pkl
     |- protein_A
           |- uniref90_hits.sto
           |- pdb_hits.sto
           |- etc.
     |- protein_B.pkl
     |- protein_B
           |- uniref90_hits.sto
           |- pdb_hits.sto
           |- etc.
```
|
||||
|
||||
|
||||
If ```save_msa_files=False``` then the ```output_dir``` will look like:
|
||||
```
|
||||
output_dir
|
||||
|- protein_A.pkl
|
||||
|- protein_B.pkl
|
||||
```
|
||||
|
||||
--------------------
|
||||
|
||||
|
||||
#### **```use_precomputed_msas```**
|
||||
The default value is ```False```. However, if you already have MSA files for your proteins, set the parameter to ```True``` and arrange your MSA files in the layout below:
|
||||
```
example_directory
     |- protein_A
           |- uniref90_hits.sto
           |- pdb_hits.sto
           |- ***.a3m
           |- etc
     |- protein_B
           |- ***.sto
           |- etc
```
|
||||
Then, on the command line, set ```output_dir=/path/to/example_directory```.
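Before starting a run with precomputed MSAs, it can help to confirm that every sequence has a matching subdirectory. The following is a minimal sketch under the assumption that each directory is named exactly like the FASTA description after ```>``` (with any special symbols already replaced); the function name `check_precomputed_msa_dirs` is hypothetical and not part of AlphaPulldown.

```bash
#!/bin/bash
# Sketch only: report FASTA entries that have no MSA subdirectory.
# ASSUMPTION: directory names equal the FASTA descriptions after ">".
check_precomputed_msa_dirs() {
    local fasta=$1 msa_root=$2 name missing=0
    while IFS= read -r name; do
        if [ ! -d "$msa_root/$name" ]; then
            echo "missing: $msa_root/$name"
            missing=1
        fi
    done < <(grep '^>' "$fasta" | sed 's/^>//')
    # Non-zero exit status if at least one directory is missing.
    return "$missing"
}
```

Calling `check_precomputed_msa_dirs example_1_sequences.fasta /path/to/example_directory` prints nothing and returns 0 when the layout is complete.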
|
||||
|
||||
#### **```skip_existing```**
|
||||
The default is ```False```. If you have already run the 1st step for some proteins and now add new proteins to the list, you can set ```skip_existing``` to ```True``` on the command line to avoid recomputing features for the previously calculated proteins.
|
||||
|
||||
#### **```seq_index```**
|
||||
The default is `None`, in which case the programme computes features for the sequences in the given files one by one. If you wish to run an array of jobs in parallel, set ```seq_index``` to a number and the programme will only run the job with that index. For example, the programme only calculates features for the 1st protein in your FASTA file if ```seq_index``` is set to 1. See also the SLURM sbatch script above for an example of how to use it for parallel execution.
|
||||
|
||||
:exclamation: ```seq_index``` starts from 1.
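To see which sequence a given ```seq_index``` would refer to, the headers can be numbered from 1 on the command line. This sketch assumes that the indices follow the order of the files passed to ```--fasta_paths``` and then the order of the sequences within each file, which is consistent with the job count used in the sbatch example above but is not confirmed here; `list_seq_indices` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch only: number all FASTA headers starting from 1.
# ASSUMPTION: seq_index counts sequences across the files in the order
# in which the files are given.
list_seq_indices() {
    grep -h '^>' "$@" | awk '{ printf "%d\t%s\n", NR, substr($0, 2) }'
}
```

For example, `list_seq_indices baits.fasta example_1_sequences_shorter.fasta` prints a two-column table of index and description.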
|
||||
|
||||
---------------------
|
||||
|
||||
## 2nd step: Predict structures (run on GPU)
|
||||
|
||||
#### **Run in pulldown mode**
|
||||
Inspired by pull-down assays, one can specify one or more proteins as "bait" and another list of proteins as "candidates". Then the programme will use AlphafoldMultimerV2 to predict interactions between baits (as in [example_data/baits.txt](./example_data/baits.txt)) and candidates (as in [example_data/candidates.txt](./example_data/candidates.txt)).
|
||||
|
||||
**Note** If you want to save time and run fewer jobs, you can use [example_data/candidates_shorter.txt](./example_data/candidates_shorter.txt) instead of [example_data/candidates.txt](./example_data/candidates.txt)
|
||||
|
||||
In this example, we selected pulldown mode and used eIF4G2 (UniProt: [P78344](https://www.uniprot.org/uniprot/P78344)) as the bait and the other 294 proteins as candidates. Thus, in total, there will be 1 * 294 = 294 predictions.
|
||||
|
||||

|
||||
|
||||
|
||||
The command for running in pulldown mode is then:
|
||||
```
|
||||
run_multimer_jobs.py --mode=pulldown \
|
||||
--num_cycle=3 \
|
||||
--num_predictions_per_model=1 \
|
||||
--output_path=<output directory> \
|
||||
--data_dir=<path to alphafold databases> \
|
||||
--protein_lists=baits.txt,candidates.txt \
|
||||
--monomer_objects_dir=/path/to/monomer_objects_directory \
|
||||
--job_index=<any number you want>
|
||||
```
|
||||
|
||||
:memo: To reproduce the results for Lassa virus Z protein vs L protein fragments described in our paper, simply use [baits_Z_protein.txt](./example_data/baits_Z_protein.txt) and [L_protein_fragments.txt](./example_data/L_protein_fragments.txt) as the ```--protein_lists``` inputs. This example also shows how to run the interaction screen for fragments of proteins while keeping the original full-length residue numbering in the output!
|
||||
|
||||
✨ **New Features** Now AlphaPulldown supports integrative structural modelling if the user has experimental cross-link data. Please refer to [this manual](run_with_AlphaLink2.md) if you'd like to model your protein complexes with cross-link MS data as extra input.
|
||||
|
||||
## Explanation about the parameters
|
||||
|
||||
#### **```monomer_objects_dir```**
|
||||
It should be the same directory as the ```output_dir``` specified in **Step 1**. It can be a single directory or several directories if you stored pre-calculated objects in different locations. If you pass multiple directories, separate them with a `,`, e.g. ``` --monomer_objects_dir=<dir_1>,<dir_2>```
|
||||
|
||||
#### **```job_index```**
|
||||
The default is `None`, in which case the programme runs the predictions in the given files one by one. If you wish to run an array of jobs in parallel, set ```job_index``` to a number and the programme will only run the job with that index.
|
||||
|
||||
:exclamation: ```job_index``` starts from 1
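To get a feeling for how many jobs a pulldown screen contains, the bait and candidate lists can be combined on the command line. The sketch below only enumerates the pairs and numbers them from 1; the order in which ```run_multimer_jobs.py``` assigns ```job_index``` values to pairs is not specified in this manual, so the numbering is illustrative, and `list_pulldown_pairs` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch only: enumerate every bait/candidate combination.
# ASSUMPTION: the bait-major ordering is illustrative and may differ from
# the ordering used internally by run_multimer_jobs.py.
list_pulldown_pairs() {
    local baits_file=$1 candidates_file=$2
    local n=0 bait candidate
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while IFS= read -r bait || [ -n "$bait" ]; do
        [ -z "$bait" ] && continue          # skip empty lines
        while IFS= read -r candidate || [ -n "$candidate" ]; do
            [ -z "$candidate" ] && continue
            n=$((n + 1))
            printf '%d\t%s\t%s\n' "$n" "$bait" "$candidate"
        done < "$candidates_file"
    done < "$baits_file"
}
```

With one bait and 294 candidates, `list_pulldown_pairs baits.txt candidates.txt | wc -l` should report 294, matching the count given above.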
|
||||
|
||||
### Running on a computer cluster in parallel
|
||||
|
||||
On a compute cluster, you may want to run all jobs in parallel as a [job array](https://slurm.schedmd.com/job_array.html). For example, on the SLURM queuing system at EMBL we could use the following ```run_multimer_jobs_SLURM.sh``` sbatch script:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
#A typical run takes a couple of hours but may be much longer
|
||||
#SBATCH --job-name=array
|
||||
#SBATCH --time=2-00:00:00
|
||||
|
||||
#log files:
|
||||
#SBATCH -e logs/run_multimer_jobs_%A_%a_err.txt
|
||||
#SBATCH -o logs/run_multimer_jobs_%A_%a_out.txt
|
||||
|
||||
#qos sets priority
|
||||
#SBATCH --qos=low
|
||||
|
||||
#SBATCH -p gpu
|
||||
#lower end GPUs might be sufficient for pairwise screens:
|
||||
#SBATCH -C "gpu=2080Ti|gpu=3090"
|
||||
|
||||
#Reserve the entire GPU so no-one else slows you down
|
||||
#SBATCH --gres=gpu:1
|
||||
|
||||
#Limit the run to a single node
|
||||
#SBATCH -N 1
|
||||
|
||||
#Adjust this depending on the node
|
||||
#SBATCH --ntasks=8
|
||||
#SBATCH --mem=64000
|
||||
|
||||
module load Anaconda3
|
||||
module load CUDA/11.3.1
|
||||
module load cuDNN/8.2.1.32-CUDA-11.3.1
|
||||
source activate AlphaPulldown
|
||||
|
||||
MAXRAM=$(echo `ulimit -m` '/ 1024.0'|bc)
|
||||
GPUMEM=`nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits|tail -1`
|
||||
export XLA_PYTHON_CLIENT_MEM_FRACTION=`echo "scale=3;$MAXRAM / $GPUMEM"|bc`
|
||||
export TF_FORCE_UNIFIED_MEMORY='1'
|
||||
|
||||
run_multimer_jobs.py --mode=pulldown \
|
||||
--num_cycle=3 \
|
||||
--num_predictions_per_model=1 \
|
||||
--output_path=/scratch/user/output/models \
|
||||
--data_dir=/scratch/AlphaFold_DBs/2.2.2/ \
|
||||
--protein_lists=baits.txt,candidates_shorter.txt \
|
||||
--monomer_objects_dir=/scratch/user/output/features \
|
||||
--job_index=$SLURM_ARRAY_TASK_ID
|
||||
```
|
||||
and then run using:
|
||||
|
||||
```
|
||||
mkdir -p logs
|
||||
#Count the number of jobs corresponding to the number of sequences:
|
||||
baits=`grep -c "" baits.txt` #count lines even if the last one has no end of line
|
||||
candidates=`grep -c "" candidates_shorter.txt` #count lines even if the last one has no end of line
|
||||
count=$(( $baits * $candidates ))
|
||||
sbatch --array=1-$count example_data/run_multimer_jobs_SLURM.sh
|
||||
```
|
||||
|
||||
--------------------
|
||||
|
||||
|
||||
|
||||
## 3rd step: Evaluation and visualisation
|
||||
|
||||
**Feature 1**
|
||||
|
||||
When a batch of jobs is finished, AlphaPulldown can create a [Jupyter](https://jupyter.org/) notebook that presents a neat overview of the models, as seen in the example screenshot 
|
||||
|
||||
On the left side, there is a list of bookmarks for all the jobs. After clicking a bookmark and executing the corresponding cells, the notebook will show: 1) PAE plots, 2) the predicted models coloured by pLDDT scores, and 3) the predicted models coloured by chain.
|
||||
|
||||
In order to create the notebook, within the same conda environment, run:
|
||||
```bash
|
||||
source activate AlphaPulldown
|
||||
cd <models_output_dir>
|
||||
create_notebook.py --cutoff=5.0 --output_dir=<models_output_dir>
|
||||
```
|
||||
:warning: The command must be run within the ```<models_output_dir>```!
|
||||
|
||||
This command will yield an ```output.ipynb```, which you can open with JupyterLab. JupyterLab is already installed when AlphaPulldown is installed with pip. Thus, to view the notebook:
|
||||
|
||||
```bash
|
||||
source activate AlphaPulldown
|
||||
cd <models_output_dir>
|
||||
jupyter-lab output.ipynb
|
||||
```
|
||||
:memo: *If you run AlphaPulldown on a remote computer cluster, you will need a graphical connection to open the notebook in a browser, mount the remote directory to your local computer as a network directory, or copy the entire ```<models_output_dir>``` to the local computer.*
|
||||
|
||||
**About the parameters**
|
||||
|
||||
```cutoff``` is the threshold applied to the PAE values between chains. In the case of multimers, the analysis programme will create the notebook only from models with inter-chain PAE values smaller than the cutoff.
|
||||
|
||||
**Feature 2**
|
||||
|
||||
We also provide a Singularity image called ```alpha-analysis.sif``` that generates a CSV table with structural properties and scores.

Firstly, download the Singularity image from [here](https://www.embl-hamburg.de/AlphaPulldown/downloads/alpha-analysis.sif). Chrome users may not be able to download it by clicking the link. If so, please right-click and select "Save link as".
|
||||
|
||||
|
||||
Then execute the singularity image (i.e. the sif file) by:
|
||||
```
|
||||
singularity exec \
|
||||
--no-home \
|
||||
--bind /path/to/your/output/dir:/mnt \
|
||||
<path to your downloaded image>/alpha-analysis.sif \
|
||||
run_get_good_pae.sh \
|
||||
--output_dir=/mnt \
|
||||
--cutoff=10
|
||||
```
|
||||
|
||||
**About the outputs**
|
||||
By default, a CSV file named ```predictions_with_good_interpae.csv``` is created in the directory ```/path/to/your/output/dir``` that you gave in the command above. ```predictions_with_good_interpae.csv``` reports: 1. iptm and iptm+ptm scores provided by AlphaFold, 2. the mpDockQ score developed by [Bryant _et al._, 2022](https://gitlab.com/patrickbryant1/molpc), 3. the PI_score developed by [Malhotra _et al._, 2021](https://gitlab.com/sm2185/ppi_scoring/-/wikis/home). Detailed explanations of these scores can be found in our paper, and an example screenshot of the table is below. 
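Once the CSV table exists, it can be filtered on the command line, for example to keep only rows above a score threshold. The sketch below looks up a column by its header name; the function name `filter_csv_by_column` is hypothetical, the exact column names in ```predictions_with_good_interpae.csv``` should be checked in the file header first, and the sketch assumes a plain comma-separated file without quoted fields that contain commas.

```bash
#!/bin/bash
# Sketch only: keep the header plus all rows whose value in the named
# column is greater than or equal to the threshold.
# ASSUMPTIONS: simple comma-separated values; the column name passed in
# (e.g. "iptm") must match the header of your CSV file exactly.
filter_csv_by_column() {
    local csv_file=$1 column=$2 threshold=$3
    awk -F',' -v col="$column" -v thr="$threshold" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($idx + 0) >= (thr + 0) { print }
    ' "$csv_file"
}
```

A call such as `filter_csv_by_column predictions_with_good_interpae.csv iptm 0.5` would then print the header and the matching rows, provided a column with that name exists.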
|
||||
|
||||
------------------------------------------------------------
|
||||
## Appendix: Instructions on running in all_vs_all mode
|
||||
As the name suggests, all_vs_all mode predicts all possible pairwise combinations within a single input file. The input can be either full-length proteins or regions of a protein, as illustrated in the [example_all_vs_all_list.txt](./example_data/example_all_vs_all_list.txt) and the figure below:
|
||||

|
||||
|
||||
The corresponding command is:
|
||||
```bash
|
||||
run_multimer_jobs.py \
|
||||
--mode=all_vs_all \
|
||||
--num_cycle=3 \
|
||||
--num_predictions_per_model=1 \
|
||||
--output_path=<path to output directory> \
|
||||
--data_dir=<path to AlphaFold data directory> \
|
||||
--protein_lists=example_all_vs_all_list.txt \
|
||||
--monomer_objects_dir=/path/to/monomer_objects_directory \
|
||||
--job_index=<any number you want>
|
||||
```
|
||||
283
manuals/example_2.md
Normal file
@@ -0,0 +1,283 @@
|
||||
# AlphaPulldown manual:
|
||||
# Example2
|
||||
# Aims: Model interactions between Lassa virus L protein and Z matrix protein; determine the oligomeric state of _E. coli_ single-stranded DNA-binding protein (SSB)
|
||||
## 1st step: compute multiple sequence alignment (MSA) and template features (run on CPUs)
|
||||
|
||||
Firstly, download the sequences of the L (UniProt: [O09705](https://www.uniprot.org/uniprotkb/O09705/entry)) and Z (UniProt: [O73557](https://www.uniprot.org/uniprotkb/O73557/entry)) proteins. The result is [```example_data/example_2_sequences.fasta```](./example_data/example_2_sequences.fasta)
|
||||
|
||||
Now run:
|
||||
```bash
|
||||
create_individual_features.py \
|
||||
--fasta_paths=example_2_sequences.fasta \
|
||||
--data_dir=<path to alphafold databases> \
|
||||
--save_msa_files=False \
|
||||
--output_dir=<dir to save the output objects> \
|
||||
--use_precomputed_msas=False \
|
||||
--max_template_date=<any date you want> \
|
||||
--skip_existing=False --seq_index=<any number you want>
|
||||
```
|
||||
|
||||
```create_individual_features.py``` will compute the necessary features for O73557 and O09705 and then store them as individual pickle files in the ```output_dir```. Please be aware that everything after ```>``` in a FASTA header is taken as the description of the protein, and that any special symbol in it, such as ```| : ; #```, is replaced with ```_```.
|
||||
The pickle files are named after the descriptions of the sequences in the FASTA files (e.g. ">protein_A" in the FASTA file will yield "protein_A.pkl").
|
||||
|
||||
------------------------
|
||||
|
||||
## 1.1 Explanation about the parameters
|
||||
|
||||
See [Example 1](https://github.com/KosinskiLab/AlphaPulldown/blob/main/example_1.md#11-explanation-about-the-parameters)
|
||||
|
||||
## 2nd step: Predict structures (run on GPU)
|
||||
|
||||
#### **Task 1**
|
||||
We want to predict the structure of the full-length L protein together with the Z protein. However, as the L protein is very long, many users will not have a GPU card with sufficient memory. Moreover, when we attempted to model the full L-Z complex, the resulting model did not match the known cryo-EM structure. In [Example 1](https://github.com/KosinskiLab/AlphaPulldown/blob/main/example_1.md), we showed how to use AlphaPulldown to find the interaction site by screening fragments using the ```pulldown``` mode. Here, to demonstrate the ```custom``` mode, we will assume that we know the interaction site and model the fragment using this mode, as demonstrated in the figure below: 
|
||||
|
||||
|
||||
Different proteins are separated by ```;```. If only a particular region of a protein is wanted, add ```,``` after that protein, followed by the region. A region is given in the format ```number1-number2```. An example input file is: [```example_data/custom_mode.txt```](./example_data/custom_mode.txt)
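The line format described above can be illustrated with a small parser. This is a sketch with hypothetical protein names and a hypothetical helper `parse_custom_line`; it only covers the syntax stated in this manual (```;``` between proteins, an optional ```,number1-number2``` region) and is not the parser used by AlphaPulldown.

```bash
#!/bin/bash
# Sketch only: split one custom-mode line into proteins and regions.
# ASSUMPTION: at most the syntax described above is handled; anything
# after the first "," of an entry is reported verbatim as the region.
parse_custom_line() {
    local line=$1 entry name region
    local -a entries
    IFS=';' read -r -a entries <<< "$line"
    for entry in "${entries[@]}"; do
        name=${entry%%,*}                   # text before the first ","
        if [ "$entry" = "$name" ]; then
            region="full-length"            # no region given
        else
            region=${entry#*,}              # text after the first ","
        fi
        printf '%s\t%s\n' "$name" "$region"
    done
}
```

For instance, `parse_custom_line "protein_A;protein_B,1-100"` lists protein_A as full-length and protein_B with the region 1-100.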
|
||||
|
||||
The command for running in custom mode is then:
|
||||
|
||||
```
|
||||
run_multimer_jobs.py \
|
||||
--mode=custom \
|
||||
--num_cycle=3 \
|
||||
--num_predictions_per_model=1 \
|
||||
--output_path=<path to output directory> \
|
||||
--data_dir=<path to AlphaFold data directory> \
|
||||
--protein_lists=custom_mode.txt \
|
||||
--monomer_objects_dir=/path/to/monomer_objects_directory \
|
||||
--job_index=<any number you want>
|
||||
```
|
||||
|
||||
### (**Optional**) Running with MMseqs2 against ColabFold databases instead

Some users may be more familiar with MMseqs2 and the ColabFold databases. Sometimes, using the remote MMseqs2 server can finish the MSA calculation even faster than the above method. If you are interested in running MMseqs2, please refer to [this manual](./mmseqs2_manual.md).

:memo: Please be aware that MMseqs2 with the ColabFold databases gives different MSAs from AlphaFold's databases with HHBlits. Therefore, the predicted models are not always the same when using these two different ways of generating MSAs.
|
||||
|
||||
### Running on a computer cluster in parallel
|
||||
|
||||
On a compute cluster, you may want to run all jobs in parallel as a [job array](https://slurm.schedmd.com/job_array.html). For example, on the SLURM queuing system at EMBL we could use the following ```run_multimer_jobs_SLURM.sh``` sbatch script:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
#A typical run takes a couple of hours but may be much longer
|
||||
#SBATCH --job-name=array
|
||||
#SBATCH --time=2-00:00:00
|
||||
|
||||
#log files:
|
||||
#SBATCH -e logs/run_multimer_jobs_%A_%a_err.txt
|
||||
#SBATCH -o logs/run_multimer_jobs_%A_%a_out.txt
|
||||
|
||||
#qos sets priority
|
||||
#SBATCH --qos=low
|
||||
|
||||
#SBATCH -p gpu-el8
|
||||
#You might want to use a higher-end card in case the higher oligomeric states get big:
|
||||
#SBATCH -C "gpu=A40|gpu=A100"
|
||||
|
||||
#Reserve the entire GPU so no-one else slows you down
|
||||
#SBATCH --gres=gpu:1
|
||||
|
||||
#Limit the run to a single node
|
||||
#SBATCH -N 1
|
||||
|
||||
#Adjust this depending on the node
|
||||
#SBATCH --ntasks=8
|
||||
#SBATCH --mem=128000
|
||||
|
||||
module load Anaconda3
|
||||
module load CUDA/11.3.1
|
||||
module load cuDNN/8.2.1.32-CUDA-11.3.1
|
||||
source activate AlphaPulldown
|
||||
|
||||
MAXRAM=$(echo `ulimit -m` '/ 1024.0'|bc)
|
||||
GPUMEM=`nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits|tail -1`
|
||||
export XLA_PYTHON_CLIENT_MEM_FRACTION=`echo "scale=3;$MAXRAM / $GPUMEM"|bc`
|
||||
export TF_FORCE_UNIFIED_MEMORY='1'
|
||||
|
||||
run_multimer_jobs.py \
|
||||
--mode=custom \
|
||||
--num_cycle=3 \
|
||||
--num_predictions_per_model=1 \
|
||||
--output_path=<path to output directory> \
|
||||
--data_dir=<path to AlphaFold data directory> \
|
||||
--protein_lists=custom_mode.txt \
|
||||
--monomer_objects_dir=/path/to/monomer_objects_directory \
|
||||
--job_index=$SLURM_ARRAY_TASK_ID
|
||||
```
|
||||
and then run using:
|
||||
|
||||
```
|
||||
mkdir -p logs
|
||||
count=`grep -c "" custom_mode.txt` #count lines even if the last one has no end of line
|
||||
sbatch --array=1-$count run_multimer_jobs_SLURM.sh
|
||||
```
|
||||
|
||||
#### **Task 2**
|
||||
This task is to determine the oligomeric state of the SSB protein [(Uniprot:P0AGE0)](https://www.uniprot.org/uniprotkb/P0AGE0/entry#function) by modelling its monomeric, homodimeric, homotrimeric, and homotetrameric structures. Thus, homo-oligomer mode is needed. An oligomer state file tells the programme the number of units. An example is: [```example_data/example_oligomer_state_file.txt```](./example_data/example_oligomer_state_file.txt)
|
||||
|
||||
In the file, each protein and its oligomeric state should be separated by ```,```, e.g. ```protein_A,3``` means a homotrimer of protein_A.
|
||||

|
||||
|
||||
Besides homo-oligomers, this mode can also be used to predict monomeric structures by adding ```1``` or nothing after the protein.
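The oligomer state file format described above can be checked with a short script before submitting jobs. The sketch below prints the number of copies requested for each protein and treats a missing number as 1, as stated above; `describe_oligomer_states` is a hypothetical helper name and the protein names in the usage example are placeholders.

```bash
#!/bin/bash
# Sketch only: list each protein with the number of copies requested.
# A line without ",<number>" is treated as a monomer (1 copy).
describe_oligomer_states() {
    local name count
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while IFS=',' read -r name count || [ -n "$name" ]; do
        [ -z "$name" ] && continue          # skip empty lines
        printf '%s\t%s\n' "$name" "${count:-1}"
    done < "$1"
}
```

Running `describe_oligomer_states example_oligomer_state_file.txt` gives a quick two-column overview of the states that will be modelled.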
|
||||
The command for homo-oligomer mode is:
|
||||
|
||||
```
|
||||
run_multimer_jobs.py \
|
||||
--mode=homo-oligomer \
|
||||
--output_path=<path to output directory> \
|
||||
--num_cycle=3 \
|
||||
--oligomer_state_file=example_oligomer_state_file.txt \
|
||||
--monomer_objects_dir=<directory that stores monomer pickle files> \
|
||||
--data_dir=/path-to-Alphafold-data-dir \
|
||||
--job_index=<any number you want>
|
||||
```
|
||||
|
||||
Having screened the oligomeric states of SSB protein, we found our tetramer model agrees with the experimental structure (PDB:4MZ9).
|
||||
|
||||
----------------------------------
|
||||
|
||||
## Explanation about the parameters
|
||||
|
||||
#### **```monomer_objects_dir```**
|
||||
It should be the same directory as the ```output_dir``` specified in **Step 1**. It can be a single directory or several directories if you stored pre-calculated objects in different locations. If you pass multiple directories, separate them with a `,`, e.g. ``` --monomer_objects_dir=<dir_1>,<dir_2>```
|
||||
|
||||
#### **```job_index```**
|
||||
The default is `None`, in which case the programme runs the predictions in the given files one by one. If you wish to run an array of jobs in parallel, set ```job_index``` to a number and the programme will only run the job with that index.
|
||||
|
||||
:exclamation: ```job_index``` starts from 1
|
||||
|
||||
|
||||
✨ **New Features** Now AlphaPulldown supports integrative structural modelling if the user has experimental cross-link data. Please refer to [this manual](run_with_AlphaLink2.md) if you'd like to model your protein complexes with cross-link MS data as extra input.
|
||||
|
||||
### Running on a computer cluster in parallel
|
||||
|
||||
On a compute cluster, you may want to run all jobs in parallel as a [job array](https://slurm.schedmd.com/job_array.html). For example, on the SLURM queuing system at EMBL we could use the following ```run_multimer_jobs_SLURM.sh``` sbatch script:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
|
||||
#A typical run takes a couple of hours but may be much longer
|
||||
#SBATCH --job-name=array
|
||||
#SBATCH --time=2-00:00:00
|
||||
|
||||
#log files:
|
||||
#SBATCH -e logs/run_multimer_jobs_%A_%a_err.txt
|
||||
#SBATCH -o logs/run_multimer_jobs_%A_%a_out.txt
|
||||
|
||||
#qos sets priority
|
||||
#SBATCH --qos=low
|
||||
|
||||
#SBATCH -p gpu-el8
|
||||
#You might want to use a higher-end card in case the higher oligomeric states get big:
|
||||
#SBATCH -C "gpu=A40|gpu=A100"
|
||||
|
||||
#Reserve the entire GPU so no-one else slows you down
|
||||
#SBATCH --gres=gpu:1
|
||||
|
||||
#Limit the run to a single node
|
||||
#SBATCH -N 1
|
||||
|
||||
#Adjust this depending on the node
|
||||
#SBATCH --ntasks=8
|
||||
#SBATCH --mem=128000
|
||||
|
||||
module load Anaconda3
|
||||
module load CUDA/11.3.1
|
||||
module load cuDNN/8.2.1.32-CUDA-11.3.1
|
||||
source activate AlphaPulldown
|
||||
|
||||
MAXRAM=$(echo `ulimit -m` '/ 1024.0'|bc)
|
||||
GPUMEM=`nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits|tail -1`
|
||||
export XLA_PYTHON_CLIENT_MEM_FRACTION=`echo "scale=3;$MAXRAM / $GPUMEM"|bc`
|
||||
export TF_FORCE_UNIFIED_MEMORY='1'
|
||||
|
||||
run_multimer_jobs.py \
|
||||
--mode=homo-oligomer \
|
||||
--output_path=<path to output directory> \
|
||||
--num_cycle=3 \
|
||||
--oligomer_state_file=example_oligomer_state_file.txt \
|
||||
--monomer_objects_dir=<directory that stores monomer pickle files> \
|
||||
--data_dir=/path-to-Alphafold-data-dir \
|
||||
--job_index=$SLURM_ARRAY_TASK_ID
|
||||
```
|
||||
and then run using:
|
||||
|
||||
```
|
||||
mkdir -p logs
|
||||
count=`grep -c "" example_oligomer_state_file.txt` #count lines even if the last one has no end of line
|
||||
sbatch --array=1-$count run_multimer_jobs_SLURM.sh
|
||||
```
|
||||
|
||||
--------------------
|
||||
|
||||
|
||||
|
||||
## 3rd step: Evaluation and visualisation
|
||||
|
||||
**Feature 1**
|
||||
|
||||
When a batch of jobs is finished, AlphaPulldown can create a [Jupyter](https://jupyter.org/) notebook that presents a neat overview of the models, as seen in the example screenshot 
|
||||
|
||||
On the left side, there is a list of bookmarks for all the jobs. After clicking a bookmark and executing the corresponding cells, the notebook will show: 1) PAE plots, 2) the predicted models coloured by pLDDT scores, and 3) the predicted models coloured by chain.
|
||||
|
||||
In order to create the notebook, within the same conda environment, run:
|
||||
```bash
|
||||
source activate AlphaPulldown
|
||||
cd <models_output_dir>
|
||||
create_notebook.py --cutoff=5.0 --output_dir=<models_output_dir>
|
||||
```
|
||||
:warning: The command must be run within the ```<models_output_dir>```!
|
||||
|
||||
This command will yield an ```output.ipynb```, which you can open with JupyterLab. JupyterLab is already installed when AlphaPulldown is installed with pip. Thus, to view the notebook:
|
||||
|
||||
```bash
|
||||
source activate AlphaPulldown
|
||||
cd <models_output_dir>
|
||||
jupyter-lab output.ipynb
|
||||
```
|
||||
:memo: *If you run AlphaPulldown on a remote computer cluster, you will need a graphical connection to open the notebook in a browser, mount the remote directory to your local computer as a network directory, or copy the entire ```<models_output_dir>``` to the local computer.*
|
||||
|
||||
**About the parameters**
|
||||
|
||||
```cutoff``` is the threshold applied to the PAE values between chains. In the case of multimers, the analysis programme will create the notebook only from models with inter-chain PAE values smaller than the cutoff.
|
||||
|
||||
**Feature 2**
|
||||
|
||||
We also provide a Singularity image called ```alpha-analysis.sif``` that generates a CSV table with structural properties and scores.

Firstly, download the Singularity image from [here](https://www.embl-hamburg.de/AlphaPulldown/downloads/alpha-analysis.sif). Chrome users may not be able to download it by clicking the link. If so, please right-click and select "Save link as".
|
||||
|
||||
|
||||
Then execute the singularity image (i.e. the sif file) by:
|
||||
```
|
||||
singularity exec \
|
||||
--no-home \
|
||||
--bind /path/to/your/output/dir:/mnt \
|
||||
<path to your downloaded image>/alpha-analysis.sif \
|
||||
run_get_good_pae.sh \
|
||||
--output_dir=/mnt \
|
||||
--cutoff=10
|
||||
```
|
||||
|
||||
**About the outputs**
|
||||
By default, a CSV file named ```predictions_with_good_interpae.csv``` is created in the directory ```/path/to/your/output/dir``` that you gave in the command above. ```predictions_with_good_interpae.csv``` reports: 1. iptm and iptm+ptm scores provided by AlphaFold, 2. the mpDockQ score developed by [Bryant _et al._, 2022](https://gitlab.com/patrickbryant1/molpc), 3. the PI_score developed by [Malhotra _et al._, 2021](https://gitlab.com/sm2185/ppi_scoring/-/wikis/home). Detailed explanations of these scores can be found in our paper, and an example screenshot of the table is below. 
|
||||
|
||||
------------------------------------------------------------
|
||||
## Appendix: Instructions on running in all_vs_all mode
As the name suggests, all_vs_all mode predicts all possible combinations of the entries in a single input file. The input can be either full-length proteins or regions of a protein, as illustrated in [example_all_vs_all_list.txt](./example_data/example_all_vs_all_list.txt) and the figure below:
![plot](./all_vs_all_demo.png)

The corresponding command is:
```bash
run_multimer_jobs.py \
  --mode=all_vs_all \
  --num_cycle=3 \
  --num_predictions_per_model=1 \
  --output_path=<path to output directory> \
  --data_dir=/path-to-Alphafold-data-dir \
  --protein_lists=example_all_vs_all_list.txt \
  --monomer_objects_dir=/path/to/monomer_objects_directory \
  --job_index=<any number you want>
```

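
To get a feeling for how many jobs all_vs_all mode produces, here is a minimal sketch. It assumes that the mode enumerates every unordered pair of entries exactly once and does not pair an entry with itself; this is an assumption, not a statement about the actual implementation. The entry names are hypothetical.

```python
from itertools import combinations


def all_vs_all_pairs(entries):
    """Return every unordered pair of entries (assumed: no self-pairs)."""
    return list(combinations(entries, 2))


# Hypothetical entries; a real list may also contain regions of a protein.
entries = ["protein_A", "protein_B", "protein_C"]
for index, pair in enumerate(all_vs_all_pairs(entries), start=1):
    print(index, pair)
```

Under this assumption, N entries give N*(N-1)/2 pairs, so the number of jobs grows quadratically with the length of the input file.
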
173
manuals/example_3.md
Normal file
@@ -0,0 +1,173 @@
# AlphaPulldown manual:
# Example3
# Aims: Model activation of phosphoinositide 3-kinase by the influenza A virus NS1 protein (PDB: 3L4Q)
## 1st step: compute multiple sequence alignment (MSA) and template features using provided PDB templates (run on CPU)

This complex cannot be modeled with vanilla AlphaFold Multimer, since it is a host-pathogen interaction.
Firstly, download the sequences of the NS1 (UniProt: [P03496](https://www.uniprot.org/uniprotkb/P03496/entry)) and P85B (UniProt: [P23726](https://www.uniprot.org/uniprotkb/P23726/entry)) proteins.
Then download the multimeric template in either PDB or mmCIF format (PDB: [3L4Q](https://www.rcsb.org/structure/3L4Q)).
Create directories named "fastas" and "templates" and put the sequences and pdb/cif files in the corresponding directories.
Finally, create a text file with the description for generating features (description.csv).

**Please note**, the first column must be an exact copy of the protein description from your fasta files. For convenience, consider shortening the descriptions in the fasta files with your favorite text editor. These names will be used to generate pickle files with monomeric features!
The description.csv for the NS1-P85B complex should look like:
```
>sp|P03496|NS1_I34A1, 3L4Q.cif, A
>sp|P23726|P85B_BOVIN, 3L4Q.cif, C
```
In this example we refer to the NS1 protein as chain A and to the P85B protein as chain C in multimeric template 3L4Q.cif.

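
The three-column layout of description.csv can be checked before launching a long job. The sketch below is only an illustration of the format shown above (protein description, template file, chain); the function name `parse_description` is hypothetical and this is not the parser used by AlphaPulldown.

```python
import csv
import io


def parse_description(text):
    """Parse description.csv content into (description, template, chain) tuples."""
    rows = []
    # skipinitialspace drops the blank that follows each comma.
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # ignore empty lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {fields}")
        description, template, chain = (field.strip() for field in fields)
        rows.append((description, template, chain))
    return rows


example = """>sp|P03496|NS1_I34A1, 3L4Q.cif, A
>sp|P23726|P85B_BOVIN, 3L4Q.cif, C
"""
for row in parse_description(example):
    print(row)
```
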
**Please note** that your template will be renamed to a PDB code taken from ```_entry_id```. If you use a ```*.pdb``` file instead of ```*.cif```, AlphaPulldown will first try to parse the PDB code from the file. It will then check whether the filename is 4 letters long. If it is not, it will generate a random 4-letter code and use it as the PDB code.

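
The fallback order described above can be sketched as follows. This is a hypothetical illustration, not AlphaPulldown's code: the function name `choose_template_code` is made up, and it assumes that the filename check applies to the name without its extension and that the random code uses uppercase letters.

```python
import random
import string
from pathlib import Path


def choose_template_code(code_from_file, filename, rng=random):
    """Pick a 4-character template code following the order described above."""
    # 1. Prefer the PDB code parsed from the file itself, if there is one.
    if code_from_file:
        return code_from_file
    # 2. Otherwise use the filename (without extension) if it has 4 characters.
    stem = Path(filename).stem
    if len(stem) == 4:
        return stem
    # 3. Otherwise fall back to a random 4-letter code (assumed uppercase).
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(4))


print(choose_template_code("3L4Q", "my_template.pdb"))
print(choose_template_code(None, "1abc.pdb"))
print(choose_template_code(None, "my_template.pdb"))
```

Because the last fallback is random, giving template files a 4-character name keeps the code reproducible between runs.
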
Now run:
```bash
create_individual_features_with_templates.py \
  --description_file=description.csv \
  --fasta_paths=fastas/P03496.fasta,fastas/P23726.fasta \
  --path_to_mmt=templates/ \
  --data_dir=/scratch/AlphaFold_DBs/2.3.2/ \
  --save_msa_files=True \
  --output_dir=features \
  --use_precomputed_msas=True \
  --max_template_date=2050-01-01 \
  --skip_existing=True
```
It is also possible to combine all your fasta files into a single fasta file.
```create_individual_features_with_templates.py``` computes the features in the same way as ```create_individual_features.py```, but uses the provided templates instead of the PDB database.

------------------------

## 1.1 Explanation about the parameters

See [Example 1](https://github.com/KosinskiLab/AlphaPulldown/blob/main/example_1.md#11-explanation-about-the-parameters)

## 2nd step: Predict structures (run on GPU)

#### **Task 1**
To predict the structure, we can use the usual ```run_multimer_jobs.py``` in custom mode (see [Example 2](https://github.com/KosinskiLab/AlphaPulldown/blob/main/example_2.md#2nd-step-predict-structures-run-on-gpu)) with the extra flag ```--multimeric_mode=True```, which deactivates the per-chain multimeric binary mask.
The user can also specify the depth of the MSA used for modelling, to increase the influence of the template on the predicted model. This is done with the flag ```--msa_depth```. Please note that only the first 2 AlphaFold models are guided by the templates. To select the models to run, use the flag ```--model_names=model_1_multimer_v3,model_2_multimer_v3``` (for models 1 and 2).
If you do not know the exact MSA depth, another flag, ```--gradient_msa_depth=True```, lets you explore a range of MSA depths. This flag generates a set of logarithmically distributed points (denser at the lower end), with the number of points equal to the number of predictions. The MSA depth (```num_msa```) starts at 16 and ends at the maximum value taken from the model config file. The ```extra_num_msa``` is always calculated as ```4*num_msa```.

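
The depth schedule described above can be illustrated with a short sketch. It is not the AlphaPulldown implementation: the function name `gradient_msa_depths` is hypothetical, the maximum of 512 is only an example value (the real maximum comes from the model config file), and the rounding and the single-prediction case are assumptions.

```python
import math


def gradient_msa_depths(num_predictions, max_num_msa, min_num_msa=16):
    """Return (num_msa, extra_num_msa) pairs spaced evenly on a log scale."""
    if num_predictions < 1:
        raise ValueError("num_predictions must be at least 1")
    if num_predictions == 1:
        # Assumption of this sketch: a single prediction uses the start value.
        depths = [min_num_msa]
    else:
        log_low = math.log(min_num_msa)
        log_high = math.log(max_num_msa)
        step = (log_high - log_low) / (num_predictions - 1)
        # Equal steps in log space give points that are denser at the lower end.
        depths = [round(math.exp(log_low + i * step)) for i in range(num_predictions)]
    # extra_num_msa is always four times num_msa.
    return [(depth, 4 * depth) for depth in depths]


for num_msa, extra_num_msa in gradient_msa_depths(5, max_num_msa=512):
    print(num_msa, extra_num_msa)
```

With more predictions per model the schedule samples the range between 16 and the maximum more finely, at the cost of longer run times.
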
The command line interface for using custom mode will then become:

```
run_multimer_jobs.py \
  --mode=custom \
  --num_cycle=3 \
  --num_predictions_per_model=<any number you want> \
  --output_path=<path to output directory> \
  --data_dir=<path to AlphaFold data directory> \
  --protein_lists=custom_mode.txt \
  --monomer_objects_dir=<path to features generated by create_individual_features_with_templates.py> \
  --multimeric_mode=True \
  --msa_depth=<any number you want> \
  --gradient_msa_depth=<True or False, overrides msa_depth if provided> \
  --model_names=<comma-separated names of the models> \
  --job_index=<line number in custom_mode.txt; omit for sequential execution>
```

### Running on a computer cluster in parallel

On a compute cluster, you may want to run all jobs in parallel as a [job array](https://slurm.schedmd.com/job_array.html). For example, on the SLURM queuing system at EMBL we could use the following ```create_feature_jobs_SLURM.sh``` sbatch script:
```bash
#!/bin/bash

#A typical run takes a couple of hours but may be much longer
#SBATCH --job-name=array
#SBATCH --time=5:00:00

#log files:
#SBATCH -e logs/create_individual_features_%A_%a_err.txt
#SBATCH -o logs/create_individual_features_%A_%a_out.txt

#qos sets priority
#SBATCH --qos=normal

#SBATCH -p htc-el8
#Limit the run to a single node
#SBATCH -N 1

#Adjust this depending on the node
#SBATCH --ntasks=8
#SBATCH --mem=32000

module load HMMER/3.3.2-gompic-2020b
module load HH-suite/3.3.0-gompic-2020b
module load Anaconda3
source activate AlphaPulldown

create_individual_features_with_templates.py \
  --description_file=description.csv \
  --fasta_paths=fastas/P03496.fasta,fastas/P23726.fasta \
  --path_to_mmt=templates/ \
  --data_dir=/scratch/AlphaFold_DBs/2.3.2/ \
  --save_msa_files=True \
  --output_dir=features \
  --use_precomputed_msas=True \
  --max_template_date=2050-01-01 \
  --skip_existing=True \
  --job_index=$SLURM_ARRAY_TASK_ID
```

and the following ```run_multimer_jobs_SLURM.sh``` sbatch script:

```bash
#!/bin/bash

#A typical run takes a couple of hours but may be much longer
#SBATCH --job-name=array
#SBATCH --time=2-00:00:00

#log files:
#SBATCH -e logs/run_multimer_jobs_%A_%a_err.txt
#SBATCH -o logs/run_multimer_jobs_%A_%a_out.txt

#qos sets priority
#SBATCH --qos=normal

#SBATCH -p gpu-el8

#Reserve the entire GPU so no-one else slows you down
#SBATCH --gres=gpu:1

#Limit the run to a single node
#SBATCH -N 1

#Adjust this depending on the node
#SBATCH --ntasks=8
#SBATCH --mem=64000

module load Anaconda3
module load CUDA/11.3.1
module load cuDNN/8.2.1.32-CUDA-11.3.1
source activate AlphaPulldown

#Allow unified memory so that XLA can use host RAM in addition to GPU memory
MAXRAM=$(echo `ulimit -m` '/ 1024.0'|bc)
GPUMEM=`nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits|tail -1`
export XLA_PYTHON_CLIENT_MEM_FRACTION=`echo "scale=3;$MAXRAM / $GPUMEM"|bc`
export TF_FORCE_UNIFIED_MEMORY='1'

run_multimer_jobs.py \
  --mode=custom \
  --num_cycle=3 \
  --num_predictions_per_model=5 \
  --output_path=<path to output directory> \
  --data_dir=<path to AlphaFold data directory> \
  --protein_lists=custom_mode.txt \
  --monomer_objects_dir=/path/to/monomer_objects_directory \
  --multimeric_mode=True \
  --msa_depth=128 \
  --model_names=model_1_multimer_v3,model_2_multimer_v3 \
  --gradient_msa_depth=False \
  --job_index=$SLURM_ARRAY_TASK_ID
```
and then run using:

```
mkdir -p logs
count=`grep -c "" description.csv` #count lines even if the last one has no end of line
sbatch --array=1-$count create_feature_jobs_SLURM.sh
count=`grep -c "" custom_mode.txt` #likewise for predictions
sbatch --array=1-$count run_multimer_jobs_SLURM.sh
```

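
The size of each job array is simply the number of lines in the corresponding input file, so that every array task handles one line. As a cross-check, the counting done by ```grep -c ""``` can be reproduced with a small sketch; the function names `count_jobs` and `array_spec` are hypothetical helpers for illustration only.

```python
from pathlib import Path


def count_jobs(path):
    """Count lines like `grep -c ""`: a last line without a newline still counts."""
    text = Path(path).read_text()
    lines = text.count("\n")
    if text and not text.endswith("\n"):
        lines += 1
    return lines


def array_spec(path):
    """Build the value for `sbatch --array=` covering one task per line."""
    return f"1-{count_jobs(path)}"
```

For example, `array_spec("custom_mode.txt")` returns `"1-2"` for a file with two lines. Note that empty lines are counted as well, so a stray blank line at the end of the file would create an extra array task.
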
After a successful run, you can evaluate and visualise the results in the usual manner (see e.g. [Example 2](https://github.com/KosinskiLab/AlphaPulldown/blob/main/example_2.md#3rd-step-evalutaion-and-visualisation)).
BIN
manuals/example_notebook_screenshot.png
Normal file
|
After Width: | Height: | Size: 600 KiB |
BIN
manuals/example_table_screenshot.png
Normal file
|
After Width: | Height: | Size: 334 KiB |
BIN
manuals/homooligomer_demo.png
Normal file
|
After Width: | Height: | Size: 32 KiB |
92
manuals/mmseqs2_manual.md
Normal file
@@ -0,0 +1,92 @@
# Use mmseqs2 to calculate MSAs

# option 1: run mmseqs2 remotely

For the purpose of this manual, the input sequence file is already provided here: [```./example_data/example_1_sequences.fasta```](./example_data/example_1_sequences.fasta). If you want to run a smaller test, you can use [```./example_data/example_1_sequences_shorter.fasta```](./example_data/example_1_sequences_shorter.fasta) instead.

:memo: *The example file was generated by downloading all 294 proteins that belong to the human translation pathway from [Reactome](https://reactome.org/PathwayBrowser/#/R-HSA-72766&DTAB=MT). The eIF4G2 sequence was downloaded from UniProt ([P78344](https://www.uniprot.org/uniprot/P78344)).*

Now run:
```bash
source activate AlphaPulldown
create_individual_features.py \
  --fasta_paths=example_1_sequences.fasta \
  --data_dir=<path to alphafold databases> \
  --output_dir=<dir to save the output objects> \
  --skip_existing=False \
  --use_mmseqs2=True \
  --max_template_date=<any date you want, format like: 2050-01-01> \
  --seq_index=<any number you want or skip the flag to run all one after another>
```

and your output_dir will look like:
```bash
output_dir
|-protein_A.a3m
|-protein_A_env/
|-protein_A.pkl
|-protein_B.a3m
|-protein_B_env/
|-protein_B.pkl
...
```

```create_individual_features.py``` will compute the necessary features for each protein in [```./example_data/example_1_sequences.fasta```](./example_data/example_1_sequences.fasta) and store them in the ```output_dir```. **Please be aware** that everything after ```>``` will be taken as the description of the protein, and that any special symbol, such as ```| : ; #```, after ```>``` will be replaced with ```_```.

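
To predict the file names that will appear in ```output_dir```, the renaming rule can be approximated with a small sketch. It only replaces the four symbols named above; the actual set of replaced symbols may be larger, and the function name `sanitize_description` is hypothetical.

```python
def sanitize_description(header, special_symbols=("|", ":", ";", "#")):
    """Turn a FASTA header line into the name used for output files.

    Assumption: only the symbols listed in the manual are replaced.
    """
    # Everything after ">" is taken as the description.
    description = header[1:] if header.startswith(">") else header
    description = description.strip()
    for symbol in special_symbols:
        description = description.replace(symbol, "_")
    return description


print(sanitize_description(">sp|P78344|IF4G2_HUMAN"))
```
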
# option 2: run mmseqs2 locally.

AlphaPulldown does **NOT** provide an interface or code to run mmseqs2 locally, nor will it install mmseqs or any other required programme. The user has to install mmseqs, the ColabFold databases, colabfold_search and other required dependencies, and run the MSA alignments first. An example guide can be found on the [ColabFold GitHub](https://github.com/sokrypton/ColabFold).

Suppose you have successfully run mmseqs locally using the ```colabfold_search``` programme: it will generate an a3m file for each protein of interest. Thus, your output_dir should look like this:

```
output_dir
|-0.a3m
|-1.a3m
|-2.a3m
|-3.a3m
...
```
The a3m files from ```colabfold_search``` are named inconveniently. Thus, we have provided a ```rename_colab_search_a3m.py``` script that renames all these files. Simply run:
```bash
# within the same conda env where you have installed AlphaPulldown
cd output_dir
rename_colab_search_a3m.py
```
Then your ```output_dir``` will become:

```
output_dir
|-protein_A.a3m
|-protein_B.a3m
|-protein_C.a3m
|-protein_D.a3m
...
```
where ```protein_A```, ```protein_B```, ... correspond to the names in your input fasta file (">protein_A" will give you "protein_A.a3m", ">protein_B" will give you "protein_B.a3m", etc.).
After this, go back to your project directory with the original FASTA file and point to this directory in the command:
```bash
source activate AlphaPulldown
create_individual_features.py \
  --fasta_paths=example_1_sequences.fasta \
  --data_dir=<path to alphafold databases> \
  --output_dir=output_dir \
  --skip_existing=False \
  --use_mmseqs2=True \
  --seq_index=<any number you want or skip the flag to run all one after another>
```
and AlphaPulldown will automatically search for each protein's corresponding a3m file. In the end, your output_dir will look like:
```
output_dir
|-protein_A.a3m
|-protein_A.pkl
|-protein_B.a3m
|-protein_B.pkl
|-protein_C.a3m
|-protein_C.pkl
...
```
BIN
manuals/pulldown_mode_demo_1.png
Normal file
|
After Width: | Height: | Size: 40 KiB |
65
manuals/run_with_AlphaLink2.md
Normal file
@@ -0,0 +1,65 @@
# Instructions for running structural predictions with cross-link data via [AlphaLink2](https://github.com/Rappsilber-Laboratory/AlphaLink2/tree/main)
## Introduction
As [Stahl et al., 2023](https://www.nature.com/articles/s41587-023-01704-z) showed, integrating cross-link data with AlphaFold can improve the modelling quality in some challenging cases. AlphaPulldown therefore integrates the [AlphaLink2](https://github.com/Rappsilber-Laboratory/AlphaLink2/tree/main) pipeline and allows the user to combine cross-link data with AlphaFold Multimer inference, without the need to calculate MSAs from scratch again.

In addition, this integration retains all the other benefits of AlphaPulldown, such as the interface for fragmenting proteins into regions and the automatic generation of PAE plots after the predictions.

## 1st step: configure the Conda environment
After activating the conda environment in which you normally run AlphaPulldown, you first need to compile [UniCore](https://github.com/dptech-corp/Uni-Core).

```bash
git clone https://github.com/dptech-corp/Uni-Core.git
cd Uni-Core
python setup.py install --disable-cuda-ext

# test whether unicore is successfully installed
python -c "import unicore"
```
You may see the following warning but it's fine:
```
fused_multi_tensor is not installed corrected
fused_rounding is not installed corrected
fused_layer_norm is not installed corrected
fused_softmax is not installed corrected
```
Next, make sure the installed PyTorch build matches your CUDA version, for example [PyTorch 1.13.0+cu117](https://pytorch.org/get-started/previous-versions/) with CUDA/11.7.0.
## 2nd step: download AlphaLink2 checkpoint
Now please download the PyTorch checkpoints from [Zenodo](https://zenodo.org/records/8007238) and unzip them. You should obtain a file named ```AlphaLink-Multimer_SDA_v3.pt```.

## 3rd step: prepare cross-link input data
As instructed by [AlphaLink2](https://github.com/Rappsilber-Laboratory/AlphaLink2/tree/main), cross-linked residues are described with nested dictionaries. For example, inter-protein crosslinks A->B between residues 1,50 and 30,80 with an FDR of 20% should look like:
```
{'protein_A': {'protein_B': [(1, 50, 0.2), (30, 80, 0.2)]}}
```
and intra-protein crosslinks follow the same format:
```
{'protein_A': {'protein_A': [(5, 20, 0.2)]}}
```
The keys in these dictionaries should be the same as the names of the pickle files created by [the first stage of AlphaPulldown](https://github.com/KosinskiLab/AlphaPulldown/blob/main/example_1.md), e.g. you should have ```protein_A.pkl``` and ```protein_B.pkl``` already calculated.

Dictionaries like these should be stored in **```.pkl.gz```** files and provided to AlphaPulldown in the next step. You can use the script from [AlphaLink2](https://github.com/Rappsilber-Laboratory/AlphaLink2/tree/main) to prepare these pickle files.
### **NB** The dictionaries are 0-indexed, i.e., residues start from 0.

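
If you prefer to build the file yourself rather than use the AlphaLink2 script, the following is a minimal sketch using only the Python standard library. It assumes that your own cross-link table uses 1-based residue numbers (hence the subtraction of 1) and that a gzip-compressed pickle of the nested dictionary is what is expected; the helper names `build_crosslinks` and `save_crosslinks` are hypothetical.

```python
import gzip
import pickle


def build_crosslinks(links):
    """Build the nested dictionary from (protein_1, res_1, protein_2, res_2, fdr) tuples.

    Assumption: res_1 and res_2 are 1-based in the input and are shifted
    to the 0-based numbering required in the dictionary.
    """
    crosslinks = {}
    for protein_1, res_1, protein_2, res_2, fdr in links:
        pairs = crosslinks.setdefault(protein_1, {}).setdefault(protein_2, [])
        pairs.append((res_1 - 1, res_2 - 1, fdr))
    return crosslinks


def save_crosslinks(crosslinks, path):
    """Write the dictionary as a gzip-compressed pickle (.pkl.gz)."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(crosslinks, handle)


# Hypothetical input: two inter-protein links and one intra-protein link.
links = [
    ("protein_A", 2, "protein_B", 51, 0.2),
    ("protein_A", 31, "protein_B", 81, 0.2),
    ("protein_A", 6, "protein_A", 21, 0.2),
]
save_crosslinks(build_crosslinks(links), "crosslinks.pkl.gz")
```

If your residue numbers are already 0-based, drop the subtraction. Whichever route you take, the protein names must match the names of your ```.pkl``` feature files.
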
## 4th step: run AlphaLink2 prediction via AlphaPulldown
Within the same conda environment, run in e.g. ```custom``` mode:
```bash
run_multimer_jobs.py \
  --mode=custom \
  --num_predictions_per_model=1 \
  --output_path=/scratch/scratch/user/output/models \
  --data_dir=/g/alphafold/AlphaFold_DBs/2.3.0/ \
  --protein_lists=custom.txt \
  --monomer_objects_dir=/scratch/user/output/features \
  --job_index=$SLURM_ARRAY_TASK_ID \
  --alphalink_weight=/scratch/user/alphalink_weights/AlphaLink-Multimer_SDA_v3.pt \
  --use_alphalink=True \
  --crosslinks=/path/to/crosslinks.pkl.gz
```
The other modes provided by AlphaPulldown also work in the same way.