mirror of
https://github.com/rdk/p2rank.git
synced 2026-06-04 12:44:24 +08:00
fix typos in documantation
25
README.md
@@ -20,7 +20,8 @@ P2Rank is a stand-alone command line program that predicts ligand-binding pocket
|
||||
* Java 8 to 15
|
||||
* PyMOL 1.7 (or newer) for viewing visualizations (optional)
|
||||
|
||||
Program is tested on Linux, macOS and Windows. On Windows, using `bash` console is recommended to execute the program instead of `cmd` or `PowerShell`.
|
||||
P2Rank is tested on Linux, macOS, and Windows.
|
||||
On Windows, it is recommended to use the `bash` console to execute the program instead of `cmd` or `PowerShell`.
|
||||
|
||||
### Setup
|
||||
|
||||
@@ -44,7 +45,7 @@ P2Rank makes predictions by scoring and clustering points on the protein's solve
|
||||
Ligandability score of individual points is determined by a machine learning based model trained on the dataset of known protein-ligand complexes.
|
||||
For more details see the slides and publications.
|
||||
|
||||
Presentation slides introducing original version of the algotithm: [Slides (pdf)](http://bit.ly/p2rank_slides)
|
||||
Presentation slides introducing the original version of the algorithm: [Slides (pdf)](http://bit.ly/p2rank_slides)
|
||||
|
||||
### Publications
|
||||
|
||||
@@ -54,15 +55,15 @@ If you use P2Rank, please cite relevant papers:
|
||||
Krivak R, Hoksza D. *P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.* Journal of Cheminformatics. 2018 Aug.
|
||||
* [Web-server article](https://doi.org/10.1093/nar/gkz424) in NAR about the web interface accessible at [prankweb.cz](https://prankweb.cz)
|
||||
Jendele L, Krivak R, Skoda P, Novotny M, Hoksza D. *PrankWeb: a web server for ligand binding site prediction and visualization.* Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W345-W349
|
||||
* [Conference paper](https://doi.org/10.1007/978-3-319-21233-3_4) inroducing P2Rank prediction algorithm
|
||||
Krivak R, Hoksza D. *P2RANK: Knowledge-Based Ligand Binding Site Prediction Using Aggregated Local Features.* InInternational Conference on Algorithms for Computational Biology 2015 Aug 4 (pp. 41-52). Springer
|
||||
* [Conference paper](https://doi.org/10.1007/978-3-319-21233-3_4) introducing P2Rank prediction algorithm
|
||||
Krivak R, Hoksza D. *P2RANK: Knowledge-Based Ligand Binding Site Prediction Using Aggregated Local Features.* International Conference on Algorithms for Computational Biology 2015 Aug 4 (pp. 41-52). Springer
|
||||
* [Research article](https://doi.org/10.1186/s13321-015-0059-5) in JChem about PRANK rescoring algorithm
|
||||
Krivak R, Hoksza D. *Improving protein-ligand binding site prediction accuracy by classification of inner pocket points using local features.* Journal of Cheminformatics. 2015 Dec.
|
||||
|
||||
### Build from sources
|
||||
|
||||
This project uses [Gradle](https://gradle.org/) build system via included Gradle wrapper.
|
||||
On Windows use `bash` to execute build comands (`bash` is installed as a part of [Git for Windows](https://git-scm.com/download/win)).
|
||||
On Windows use `bash` to execute build commands (`bash` is installed as a part of [Git for Windows](https://git-scm.com/download/win)).
|
||||
|
||||
```bash
|
||||
git clone https://github.com/rdk/p2rank.git && cd p2rank
|
||||
@@ -91,7 +92,7 @@ prank help
|
||||
### Predict ligand binding sites (P2Rank algorithm)
|
||||
|
||||
~~~bash
|
||||
prank predict test.ds # run on whole dataset (containing list of pdb files)
|
||||
prank predict test.ds # run on the whole dataset (containing list of pdb files)
|
||||
|
||||
prank predict -f test_data/1fbl.pdb # run on single pdb file
|
||||
prank predict -f test_data/1fbl.pdb.gz # run on single gzipped pdb file
|
||||
@@ -121,7 +122,7 @@ prank eval-predict test.ds
|
||||
|
||||
If coordinates of SAS points that belong to predicted pockets are needed, they can be found
|
||||
in `visualizations/data/<pdb_file_name>_points.pdb`. There "Residue sequence number" (23-26) of HETATM record
|
||||
corresponds to the rank of corresponding pocket (points with value 0 do not belong to any pocket).
|
||||
corresponds to the rank of the corresponding pocket (points with value 0 do not belong to any pocket).
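The column layout described above can be read with a few lines of Python. This is only a minimal sketch, assuming the points file follows the standard fixed-width PDB layout (residue sequence number in columns 23-26, coordinates in columns 31-54); the function name `read_pocket_points` is hypothetical and not part of P2Rank.

```python
from collections import defaultdict


def read_pocket_points(pdb_lines):
    """Group SAS point coordinates by pocket rank (rank 0 = no pocket)."""
    points = defaultdict(list)
    for line in pdb_lines:
        if not line.startswith("HETATM"):
            continue  # only HETATM records carry SAS points
        rank = int(line[22:26])  # columns 23-26: residue sequence number
        xyz = (
            float(line[30:38]),  # columns 31-38: x
            float(line[38:46]),  # columns 39-46: y
            float(line[46:54]),  # columns 47-54: z
        )
        points[rank].append(xyz)
    return dict(points)
```

Passing an open file handle for `visualizations/data/<pdb_file_name>_points.pdb` to this function would return a dictionary keyed by pocket rank, with key `0` holding the points that belong to no pocket.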
|
||||
|
||||
### Configuration
|
||||
|
||||
@@ -136,7 +137,7 @@ prank predict -c example.groovy test.ds
|
||||
It is also possible to override the default params on the command line using their full name.
|
||||
|
||||
~~~bash
|
||||
prank predict -seed 151 -threads 8 test.ds # set random seed and number of threads, override defeults
|
||||
prank predict -seed 151 -threads 8 test.ds # set random seed and number of threads, override defaults
|
||||
prank predict -c example.groovy -seed 151 -threads 8 test.ds # override defaults as well as values from example.groovy
|
||||
~~~
|
||||
|
||||
@@ -160,25 +161,25 @@ prank eval-rescore fpocket.ds # evaluate rescoring model
|
||||
|
||||
## Comparison with Fpocket
|
||||
|
||||
[Fpocket](https://github.com/Discngine/fpocket) is widely used open source ligand binding site prediction program.
|
||||
[Fpocket](https://github.com/Discngine/fpocket) is a widely used open source ligand binding site prediction program.
|
||||
It is fast, easy to use and well documented. As such, it was a great inspiration for this project.
|
||||
Fpocket is written in C, and it is based on a different geometric algorithm.
|
||||
|
||||
Some practical differences:
|
||||
|
||||
* **Fpocket**
|
||||
- has much smaller memory footprint
|
||||
- has a much smaller memory footprint
|
||||
- runs faster when executed on a single protein
|
||||
    - produces a high number of less relevant pockets (and since the default scoring function isn't very effective, the most relevant pockets often don't get to the top)
|
||||
- contains MDpocket algorithm for pocket predictions from molecular trajectories
|
||||
- still better documented
|
||||
* **P2Rank**
|
||||
- achieves significantly higher identification success rates when considering top-ranked pockets
|
||||
- produces smaller number of more relevant pockets
|
||||
- produces a smaller number of more relevant pockets
|
||||
- speed:
|
||||
+ slower when running on a single protein (due to JVM startup cost)
|
||||
+ approximately as fast on average running on a big dataset on a single core
|
||||
+ due to parallel implementation potentially much faster on multi core machines
|
||||
+ due to parallel implementation potentially much faster on multi-core machines
|
||||
- higher memory footprint (~1G but doesn't grow much with more parallel threads)
|
||||
|
||||
Both Fpocket and P2Rank have many configurable parameters that influence behaviour of the algorithm and can be tweaked to achieve better results for particular requirements.
|
||||
|
||||
@@ -5,7 +5,7 @@ This directory contains P2Rank config files.
|
||||
|
||||
Initially, P2Rank loads configuration from `default.groovy` (and from `default-rescore.groovy` in case you run `prank rescore ...`).
|
||||
|
||||
Parameters can be then overriden in a custom config file (`-c <config.file>`) or directly on the command line.
|
||||
Parameters can be then overridden in a custom config file (`-c <config.file>`) or directly on the command line.
|
||||
|
||||
## Details
|
||||
|
||||
@@ -22,5 +22,5 @@ Parameter application priority (last wins):
|
||||
4. parameters in custom config file `-c <config.file>`
|
||||
5. parameters on the command line
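The "last wins" rule can be pictured as a layered dictionary merge. The sketch below is illustrative only: the function `resolve_params` and the concrete values are hypothetical, and it does not reflect how P2Rank actually loads its Groovy config files.

```python
def resolve_params(*layers):
    """Merge parameter layers in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical values, ordered from lowest to highest priority.
defaults = {"seed": 42, "threads": 4}        # e.g. loaded from default.groovy
custom_config = {"threads": 2}               # e.g. from -c <config.file>
command_line = {"seed": 151, "threads": 8}   # e.g. -seed 151 -threads 8

effective = resolve_params(defaults, custom_config, command_line)
```

Because the command line is applied last, its values replace anything set in the earlier layers.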
|
||||
|
||||
To see comprehensive list of all possible params see Params.groovy in the source code:
|
||||
To see a comprehensive list of all possible params see Params.groovy in the source code:
|
||||
https://github.com/rdk/p2rank/blob/master/src/main/groovy/cz/siret/prank/program/params/Params.groovy
|
||||
|
||||
@@ -8,24 +8,24 @@ Dataset file specifies a list of files (typically proteins) to be processed by t
|
||||
2W83.pdb
|
||||
1fbl.pdb
|
||||
~~~
|
||||
Basic single-culumn dataset that specifies list of proteins.
|
||||
A basic single-column dataset that specifies a list of proteins.
|
||||
|
||||
|
||||
## Multi-column dataset format
|
||||
|
||||
Optionally, dataset files can have a multi-column format that allows to specify complementary data. This is relevant only if you are interested in training and evaluating new models.
|
||||
Optionally, dataset files can have a multi-column format that allows specifying complementary data. This is relevant only if you are interested in training and evaluating new models.
|
||||
|
||||
Multi-column datasets need to declare a header (see `test_data/fpocket-pairs.ds` for an example of a dataset for evaluation of Fpocket predictions).
|
||||
|
||||
**Valid coulmn names**:
|
||||
* `"protein"` column is mandatory if program is used for pocket prediction
|
||||
* `"prediction"` column is mandatory if program is used for pocket rescoring
|
||||
* `"chains"` allows to explicitly specify whhich protein chains from the structure should be considered. Structures will be reduced to specified chains when loaded. Value `*` means all chains.
|
||||
**Valid column names**:
|
||||
* `"protein"` column is mandatory if the program is used for pocket prediction
|
||||
* `"prediction"` column is mandatory if the program is used for pocket rescoring
|
||||
* `"chains"` explicitly specifies which protein chains from the structure should be considered. Structures will be reduced to the specified chains when loaded. Value `*` means all chains.
|
||||
* `"ligands"` explicitly specifies which ligands should be considered (see test-ligand-codes.ds)
|
||||
* `"ligand_codes"` same as "ligands" (for backward compatibility)
|
||||
* `"conservation"` contains a link to sequence conservation data
|
||||
|
||||
If the header is not specified, default header is `HEADER: protein` i.e. dataset contains just a list of protein files.
|
||||
If the header is not specified, the default implicit header is `HEADER: protein` i.e. dataset contains just a list of protein files.
|
||||
|
||||
Additionally, it is possible to specify global dataset parameters.
|
||||
|
||||
@@ -45,12 +45,12 @@ HEADER: <column_names>
|
||||
|
||||
### Examples
|
||||
|
||||
Folllowing examples are valid multi-column dataset files. See other examples in `test_data` folter.
|
||||
The following examples are valid multi-column dataset files. See other examples in the `test_data` folder.
|
||||
Ligands can be specified by a group name (e.g. `PHI`) in which case all ligands with this name will be considered relevant.
|
||||
To specify particular molecules you can optionally use `atom_id` and `group_id` specifiers.
|
||||
No whitespace in the column value is allowed.
|
||||
|
||||
##### Example 2: Dataset with explicitely specified relevant ligands
|
||||
##### Example 2: Dataset with explicitly specified relevant ligands
|
||||
~~~sh
|
||||
HEADER: protein ligands
|
||||
|
||||
@@ -60,17 +60,17 @@ liganated/1nlua.pdb PHI[atom_id:1234]
|
||||
liganated/1t7qa.pdb COA[group_id:C_234A]
|
||||
liganated/2ck3b.pdb ANP
|
||||
~~~
|
||||
Dataset with explicitly specified ligands. Useful only for training and evaluation datasets.
|
||||
A dataset with explicitly specified ligands. Useful only for training and evaluation datasets.
|
||||
|
||||
##### Example 3: Dataset of protein/prediction pairs
|
||||
~~~sh
|
||||
PARAM.PREDICTION_METHOD=fpocket # specifies method that was used to create predictions
|
||||
PARAM.LIGANDS_SEPARATED_BY_TER=true # specifies that ligands are separated by TER record (relevant only for lagacy CHEN11 dataset)
|
||||
PARAM.PREDICTION_METHOD=fpocket # specifies the method that was used to create predictions
|
||||
PARAM.LIGANDS_SEPARATED_BY_TER=true # specifies that ligands are separated by TER record (relevant only for legacy CHEN11 dataset)
|
||||
|
||||
HEADER: protein prediction
|
||||
|
||||
liganated/1a82a.pdb predictions/fpocket/1a82a_out/1a82a_out.pdb
|
||||
liganated/1aaxa.pdb predictions/fpocket/1aaxa_out/1aaxa_out.pdb
|
||||
~~~
|
||||
Dataset that allows to define pairs of liganated protein and binding site pedictions for this protein made by some prediction method, in this case Fpocket.
|
||||
A dataset that defines pairs of a liganated protein and the binding site predictions made for that protein by some prediction method, in this case Fpocket.
|
||||
It is used for rescoring and evaluating predictions of other methods (using `prank rescore <dataset-with-pairs.ds>`).
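To make the dataset file structure concrete, here is a rough Python sketch of a reader for it. It assumes that `#` starts a comment, that `PARAM.` lines have the form `PARAM.NAME=value`, and that columns are separated by whitespace, as the examples above suggest; `parse_dataset` is a hypothetical helper, not P2Rank's own parser.

```python
def parse_dataset(lines):
    """Parse dataset lines into (params, header, rows)."""
    params = {}
    header = ["protein"]  # implicit header when none is declared
    rows = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("PARAM."):
            key, _, value = line[len("PARAM."):].partition("=")
            params[key.strip()] = value.strip()
        elif line.startswith("HEADER:"):
            header = line[len("HEADER:"):].split()
        else:
            # column values contain no whitespace, so split() is sufficient
            rows.append(dict(zip(header, line.split())))
    return params, header, rows
```

Each row comes back as a dictionary keyed by column name, so a protein/prediction pair can be read as `row["protein"]` and `row["prediction"]`.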
|
||||
@@ -1,9 +1,9 @@
|
||||
|
||||
Directory with pre-trained models.
|
||||
|
||||
Prank looks here for model specified by (`-model`/`-m`) parameter.
|
||||
Prank looks here for the model specified by (`-model`/`-m`) parameter.
|
||||
|
||||
Model should be always used only in combination with the parameters or config file that was used to train it.
|
||||
The model should be always used only in combination with the parameters or config file that was used to train it.
|
||||
I.e.: the feature extraction has to be executed with the same parameters.
|
||||
|
||||
## List of models
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
PARAM.PREDICTION_METHOD=concavity
|
||||
|
||||
# specifies that ligands are separated by TER record (relevant only for lagacy CHEN11 and derived datasets)
|
||||
# specifies that ligands are separated by TER record (relevant only for legacy CHEN11 and derived datasets)
|
||||
PARAM.LIGANDS_SEPARATED_BY_TER=true
|
||||
|
||||
HEADER: prediction protein
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# Dataset that explicitely specifies which ligands should be considered in training and evaluation phase.
|
||||
# Dataset that explicitly specifies which ligands should be considered in the training and evaluation phase.
|
||||
# Ligands are declared as a list of PDB HET group names with optional specifiers.
|
||||
|
||||
HEADER: protein ligands
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
# Dataset that explicitely specifies which ligands should be considered in training and evaluation phase.
|
||||
# Ligands are declared as list of PDB HET group names separated with a coma without spaces.
|
||||
# Dataset that explicitly specifies which ligands should be considered in the training and evaluation phase.
|
||||
# Ligands are declared as a list of PDB HET group names separated with a comma without spaces.
|
||||
|
||||
HEADER: protein ligands
|
||||
|
||||
|
||||
@@ -9,10 +9,10 @@ P2Rank version: 2.3-dev.1
|
||||
## Introduction
|
||||
|
||||
P2Rank is based on predicting scores of SAS points that are described by feature vectors.
|
||||
Feature vector is basically an array of real numbers (`double[]`) with a header (i.e. each element has a unique name).
|
||||
A feature vector is basically an array of real numbers (`double[]`) with a header (i.e. each element has a unique name).
|
||||
|
||||
P2Rank comes with a set of impelmented feature calculators.
|
||||
Each calculator has a name and calculates an array of certain length (e.g. for `volsite` n=5, `bfactor` n=1).
|
||||
P2Rank comes with a set of implemented feature calculators.
|
||||
Each calculator has a name and calculates an array of a certain length (e.g. for `volsite` n=5, `bfactor` n=1).
|
||||
|
||||
We will use the term *feature* for feature calculator (e.g. `chem`) and *sub-feature* for an individual element - single scalar number (e.g. `chem.atoms`).
|
||||
|
||||
@@ -38,7 +38,7 @@ Note that the syntax for list-of-strings parameter value is different on the com
|
||||
|
||||
#### Check enabled features
|
||||
|
||||
To check which features are enabled for particular configuraion run `print features` command:
|
||||
To check which features are enabled for a particular configuration run `print features` command:
|
||||
```bash
|
||||
./prank print features
|
||||
```
|
||||
@@ -109,28 +109,28 @@ Effective feature vector header (i.e. enabled sub-features):
|
||||
|
||||
If you want to add new features that are not implemented in P2Rank you have 3 options:
|
||||
* Implement a new feature calculator in Java or Groovy
|
||||
* this is not too difficult and has an advantage that feature will be calculated automatically for new datsets
|
||||
* this is not too difficult and has an advantage that feature will be calculated automatically for new datasets
|
||||
* For introduction see [new feature tutorial](new-feature-evaluation-tutorial.md)
|
||||
* Provide custom atom type and residue type tables for `atom_table` and `residue_table` features
|
||||
* allow to define values for residue types and atom types
|
||||
* allow defining values for residue types and atom types
|
||||
* residue types are: (ALA,ARG,ASN,...)
|
||||
* atom types are: (ALA.C,ALA.CA,ALA.CB,...)
|
||||
* useful only if the values are the same for all proteins in the dataset (for example: hydrophobicity index of amino acids).
|
||||
* see example tables: `aa-propensities.csv` and `atomic-properties.csv`
|
||||
* NOTE: providing custom tables is not implemented yet (planned for 2.3-dev.2)
|
||||
* Use `csv` feature
|
||||
* allows to define values for evary protein residue and/or every protein atom (for each protein separately) via external csv files
|
||||
* allows defining values for every protein residue and/or every protein atom (for each protein separately) via external csv files
|
||||
* disadvantage: csv files must be manually calculated for each dataset
|
||||
* Configuration:
|
||||
        * looks for csv files named `{protein_file_name}.csv` in directories defined in `-feat_csv_directories` parameter
|
||||
* enabled value columns from csv files must be declared in `-feat_csv_columns`
|
||||
* `-feat_csv_ignore_missing` allows to ignore missing csv files, columns and rows
|
||||
* `-feat_csv_ignore_missing` allows ignoring missing csv files, columns and rows
|
||||
* TODO: add more detailed documentation for csv feature
|
||||
|
||||
## Filtering features
|
||||
|
||||
You can selectively enable/disable certain features and sub-features with `-feature_filters` parameter.
|
||||
Filters are applied only to features that are first enabled by `-features` prameter.
|
||||
Filters are applied only to the features that are first enabled by `-features` parameter.
|
||||
If the value of `-feature_filters` is empty, all sub-features are used (i.e. no filtering is applied).
|
||||
|
||||
Examples of individual filters:
|
||||
@@ -145,7 +145,7 @@ Filters are applied sequentially.
|
||||
|
||||
If the first filter starts with `-`, everything is implicitly enabled. Otherwise, everything is implicitly disabled.
|
||||
For example:
|
||||
* `-feature_filters '(-chem.atoms)'` - include everything excape `chem.atoms`
|
||||
* `-feature_filters '(-chem.atoms)'` - include everything except `chem.atoms`
|
||||
* `-feature_filters '(chem.atoms)'` - include only `chem.atoms`
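The sequential filter semantics described above can be sketched in Python as follows. This is an approximation under stated assumptions: wildcard patterns such as `chem.*` are matched with shell-style globbing, and the function `apply_filters` and the sub-feature names used with it are hypothetical; P2Rank's actual implementation may differ.

```python
from fnmatch import fnmatchcase


def apply_filters(sub_features, filters):
    """Return the sub-features left enabled after applying filters in order."""
    if not filters:
        return list(sub_features)  # empty filter list: no filtering
    # If the first filter is an exclusion, start with everything enabled.
    start_enabled = filters[0].startswith("-")
    enabled = {name: start_enabled for name in sub_features}
    for flt in filters:
        exclude = flt.startswith("-")
        pattern = flt[1:] if exclude else flt
        for name in sub_features:
            if fnmatchcase(name, pattern):
                enabled[name] = not exclude
    return [name for name in sub_features if enabled[name]]
```

With this sketch, a later filter can re-enable something an earlier one disabled, which is what applying filters sequentially implies.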
|
||||
|
||||
|
||||
@@ -201,5 +201,5 @@ Example:
|
||||
```
|
||||
./prank ploop -t train.ds -e eval.ds -loop 10 -feature_filters '((-chem.*),(-chem.atoms,-chem.ploar),(protrusion.*,bfactor.*))'
|
||||
```
|
||||
This command will run train-eval experiments for 3 dfferent feature setups by applying different list of feature filters.
|
||||
For each feature setup it will run 10 train-eval cycles (using different random seed) and calculate average results.
|
||||
This command will run train-eval experiments for 3 different feature setups by applying a different list of feature filters.
|
||||
For each feature setup, it will run 10 train-eval cycles (using different random seed) and calculate average results.
|
||||
|
||||
@@ -41,7 +41,7 @@ Analyze a dataset with an explicitly specified residue labeling.
|
||||
|
||||
|
||||
~~~
|
||||
# predict using model trained with conservation
|
||||
# predict using the model trained with conservation
|
||||
|
||||
./prank.sh eval-predict ../p2rank-datasets/coach420.ds -l conserv -out_subdir CONS \
|
||||
-c distro/config/conservation \
|
||||
|
||||
@@ -4,8 +4,8 @@ P2Rank has routines for optimizing arbitrary parameters with Grid and Bayesian o
|
||||
|
||||
Here by hyper-parameters we mean actual hyper-parameters of the machine learning models (e.g. number of trees in RF) but also any arbitrary parameter of the whole algorithm.
|
||||
|
||||
Comprehensive list of all parameters with descriptions is in `Params.groovy`.
|
||||
|
||||
To see the complete commented list of all (including undocumented)
|
||||
parameters, see [Params.groovy](https://github.com/rdk/p2rank/blob/develop/src/main/groovy/cz/siret/prank/program/params/Params.groovy) in the source code.
|
||||
|
||||
**Grid optimization**:
|
||||
* generates plots for all stats
|
||||
@@ -20,9 +20,9 @@ Comprehensive list of all parameters with descriptions is in `Params.groovy`.
|
||||
## Grid optimization (ploop command)
|
||||
|
||||
P2Rank allows you to iterate experiments (train/eval and crossvalidation) through lists of different parameter values on the command line.
|
||||
For that you need to use `prank ploop` command and list or range expression instead of param value for one or more params.
|
||||
For that, you need to use the `prank ploop` command and list or range expression instead of param value for one or more params.
|
||||
|
||||
Supported parameter types: numerical, boolean, string and 'list of strings' (e.g feature set).
|
||||
Supported parameter types: numerical, boolean, string, and 'list of strings' (e.g. value of param `-features` has type 'list of strings').
|
||||
|
||||
#### Defining grid
|
||||
**List expression**: `(val1,val2,...)`
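As an illustration of what a grid over list expressions amounts to, the following Python sketch expands flat list expressions into every combination of values. It assumes the grid is the Cartesian product of the listed values and handles only flat lists (not nested 'list of strings' values); `parse_list_expression` and `expand_grid` are hypothetical names, not P2Rank functions.

```python
from itertools import product


def parse_list_expression(expr):
    """Split a flat list expression such as '(1,2,3)' into its values."""
    if not (expr.startswith("(") and expr.endswith(")")):
        return [expr]  # a plain value contributes a single grid point
    return expr[1:-1].split(",")


def expand_grid(param_exprs):
    """Return one dict of parameter values per grid point."""
    names = list(param_exprs)
    value_lists = [parse_list_expression(param_exprs[n]) for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]
```

For example, two parameters with two and three listed values respectively would produce six experiment configurations.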
|
||||
@@ -50,7 +50,7 @@ Related parameters:
|
||||
### R plots
|
||||
|
||||
In case you optimize exactly 1 or 2 parameters, P2Rank will try to produce plots of various statistics using R language.
|
||||
For that you need to have `Rscript` on the Path. Some libraries in R need to be installed.
|
||||
For that, you need to have `Rscript` on the PATH. Some libraries in R need to be installed first.
|
||||
~~~sh
|
||||
sudo apt install r-base
|
||||
sudo R -e "install.packages('ggplot2', dependencies=TRUE, repos='http://cran.us.r-project.org')"
|
||||
@@ -85,9 +85,9 @@ Feature set comparisons:
|
||||
|
||||
## Bayesian optimization (hopt command)
|
||||
|
||||
Hopt command (`p2rank hopt`) implements Bayesian optimization using program Speramint.
|
||||
(Other optimization tools might be employed in similar fashion with little additional work.
|
||||
See how integration wihth Spearmint is implemented in HSpearmintOptimizer.groovy).
|
||||
Hopt command (`prank hopt`) implements Bayesian optimization using the program *Spearmint*.
|
||||
(Other optimization tools might be employed in a similar fashion with little additional work.
|
||||
See how integration with *Spearmint* is implemented in HSpearmintOptimizer.groovy).
|
||||
|
||||
Supported parameter types: numerical, boolean.
|
||||
|
||||
|
||||
@@ -5,13 +5,13 @@ Read this if you want to implement a new feature and evaluate if it contributes
|
||||
## Implementation
|
||||
|
||||
New features can be added by implementing `FeatureCalculator` interface and registering the implementation in `FeatureRegistry`.
|
||||
You can implement the feature by extending one of convenience abstract classes `AtomFeatureCalculator` or `SasFeatureCalculator`.
|
||||
You can implement the feature by extending one of the convenience abstract classes `AtomFeatureCalculator` or `SasFeatureCalculator`.
|
||||
|
||||
You need to decide if the new feature will be associated with protein surface (i.e. solvent exposed) atoms or with SAS (Solvent Accessible Surface) points.
|
||||
P2Rank works by classifying SAS point feature vectors.
|
||||
If you associate the feature with atoms its value will be projected to SAS point feature vectors by P2Rank from neighbouring atoms.
|
||||
|
||||
Some features are more easily defined for atoms than SAS points and other way around. See `BfactorFeature` and `ProtrusionFeature` for comparison.
|
||||
Some features are more naturally defined for atoms than for SAS points, and the other way around. See `BfactorFeature` and `ProtrusionFeature` for comparison.
|
||||
|
||||
|
||||
## Evaluation
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
|
||||
# P2Rank model training and optimization tutorial
|
||||
|
||||
This file provides introduction for people who want to train and evaluate their own models or optimize different parameters of the algorithm.
|
||||
This file provides an introduction for people who want to train and evaluate their own models or optimize different parameters of the algorithm.
|
||||
|
||||
## Kick-start examples
|
||||
|
||||
@@ -64,43 +64,43 @@ Related parameters:
|
||||
|
||||
Memory consumption can be drastically influenced by some parameters.
|
||||
|
||||
Random Forest implementations train trees in parallell using number of threads defined in`-rf_threads` variable.
|
||||
Ideally, this would be set to number of CPU cores in the machine.
|
||||
However, required memory during training grows linearly with number trees trained in paralell (`-rf_threads`),
|
||||
Random Forest implementations train trees in parallel using the number of threads defined in the `-rf_threads` parameter.
|
||||
Ideally, this would be set to the number of physical CPU cores in the machine.
|
||||
However, required memory during training grows linearly with the number of trees trained in parallel (`-rf_threads`),
|
||||
so you may need to lower the number of threads.
|
||||
|
||||
Parameters that influence memory/time trade-off:
|
||||
* `-cache_datasets` determines whether datasets of proteins are kept in memory between runs. Related parameters:
|
||||
- `-clear_prim_caches` clear primary caches (protein structures) between runs (when iterating params or seed)
|
||||
- `-clear_sec_caches` clear secondary caches (protein surfaces etc.) between runs (when iterating params or seed)
|
||||
* `-rf_threads` number of trees trained in parallell
|
||||
* `-rf_threads` number of trees trained in parallel
|
||||
* `-rf_trees`, `-rf_depth` influence the size of the model in memory
|
||||
* `-crossval_threads` when running crossvalidation it determines how many models are trained at the same time. Set to `1` if you don't have enough memory.
|
||||
|
||||
* `-cache_datasets <bool>`: keep datasets (structures and SAS points) in memory between crossval/traineval iterations.
|
||||
For single pass training (`-loop 1`) it does not make sense to keep it on.
|
||||
Turn off when evaluating model on huge datasets that won't fit to memory (e.g. whole PDB).
|
||||
For single-pass training (`-loop 1`) it does not make sense to keep it on.
|
||||
    Turn off when evaluating the model on huge datasets that won't fit into memory (e.g. whole PDB).
|
||||
When switched off it will leave more memory for RF at the cost of needing to parse all structure files (PDBs) again.
|
||||
|
||||
Additional notes:
|
||||
* Subsampling and supersampling influence the size of training vercor dataset and required memory (see _Dealing with class imbalances_).
|
||||
* Subsampling and supersampling influence the size of training vector dataset and required memory (see _Dealing with class imbalances_).
|
||||
* Memory also grows linearly with "bag size" (`-rf_bagsize`) but this would generally be in range (50%-100%).
|
||||
* Keep in mind how JVM deals with compressed OOPs. Basically it doesn't make sense to have heap size between 32G and ~48G.
|
||||
|
||||
|
||||
### Historical note on the dataset format
|
||||
(This secton should be moved no historical notes as soon as there will be new default P2Rank model.)
|
||||
(This section should be moved to historical notes as soon as there will be a new default P2Rank model.)
|
||||
|
||||
Parameter `-sample_negatives_from_decoys` determines how points are sampled from the proteins in a training dataset.
|
||||
If `sample_negatives_from_decoys = false` all of the points from the protein surface are used.
|
||||
If `sample_negatives_from_decoys = true` only points from decoy pockets (not true ligand binding sites found by other method like Fpocket) are used.
|
||||
For that **you need to supply a training dataset that contains pocket predictions by other method** (i.e. for predictions of Fpocket use `joined-fpocket.ds` instead of `joined.ds`).
|
||||
If `sample_negatives_from_decoys = true` only points from decoy pockets (false-positive ligand binding sites found by other methods like Fpocket) are used.
|
||||
For that **you need to supply a training dataset that contains pocket predictions by another method** (i.e. for predictions of Fpocket use `joined-fpocket.ds` instead of `joined.ds`).
|
||||
|
||||
`sample_negatives_from_decoys = true` in combination with Fpocket predictions was historically giving slightly better results.
|
||||
It focuses the classifier to learn to distinguish between true and decoy pockets which is, in theory, a harder task than to distinguish between ligandable vs. unligandable protein surface.
|
||||
It also changes the ratio of sampled positives/negatives in favour of positives.
|
||||
|
||||
I recent versions it might be possible to achieve better results by training from whole protein surface in combination with class balancing techniques (see the next section).
|
||||
In recent versions, it might be possible to achieve better results by training from the whole protein surface in combination with class balancing techniques (see the next section).
|
||||
Note that default values of other parameters (related to feature extraction and classification results aggregation) were optimized for the case where `sample_negatives_from_decoys = true`.
|
||||
|
||||
Here are the most relevant ones (for descriptions see `Params.groovy`):
|
||||
@@ -110,7 +110,7 @@ Here are the most relevant ones (for descriptions see `Params.groovy`):
|
||||
* `-pred_point_threshold`
|
||||
* `-pred_min_cluster_size`
|
||||
|
||||
Their values may need to be optimized again for case of `sample_negatives_from_decoys = false`.
|
||||
Their values may need to be optimized again for the case when `sample_negatives_from_decoys = false`.
|
||||
|
||||
### Dealing with class imbalance
|
||||
|
||||
@@ -134,7 +134,7 @@ Ways to deal with class imbalances:
|
||||
|
||||
|
||||
## Crossvalidation
|
||||
To run crossvalidation on a single dataset use `prank crossval` command.
|
||||
To run crossvalidation on a single dataset use the `prank crossval` command.
|
||||
|
||||
Example:
|
||||
~~~sh
|
||||
@@ -149,7 +149,7 @@ Related parameters:
|
||||
|
||||
## Output directory location
|
||||
|
||||
Location of output directory for any given run is influenced by several parameters. You can organize results of your experiments with their help.
|
||||
The location of the output directory for any given run is influenced by several parameters. You can organize the results of your experiments with their help.
|
||||
|
||||
* `-output_base_dir <dir>`: top-level default output directory
|
||||
* `-out_subdir <dir>`: subdirectory of output_base_dir (optional)
|
||||
|
||||