# (Hyper-)parameter optimization

P2Rank has routines for optimizing arbitrary parameters with grid and Bayesian optimization.

Here, hyper-parameters means both the actual hyper-parameters of the machine learning models (e.g. the number of trees in RF) and any other parameters of the algorithm as a whole.

For the complete commented list of all parameters (including undocumented ones),
see [Params.groovy](https://github.com/rdk/p2rank/blob/develop/src/main/groovy/cz/siret/prank/program/params/Params.groovy) in the source code.

**Grid optimization**:
* generates plots for all stats
* gives a better overview of objective landscapes (and of the relationships between different objectives/metrics, by comparing plots)
* sensible for optimizing up to 2 parameters simultaneously

**Bayesian optimization**:
* more efficient
* makes it feasible to optimize multiple (6+) parameters simultaneously
* see https://doi.org/10.1109/BIBM.2017.8218024

## Grid optimization (ploop command)

P2Rank allows you to iterate experiments (train/eval and crossvalidation) through lists of different parameter values on the command line.
To do that, use the `prank ploop` command and give a list or range expression instead of a single value for one or more params.

Supported parameter types: numerical, boolean, string, and 'list of strings' (e.g. the value of the `-features` param has type 'list of strings').

#### Defining grid
**List expression**: `(val1,val2,...)`

Examples:
* list of numbers: `'(1,2,3,4)'`
* list of strings: `'(RandomForest,FasterForest,FasterForest2)'`
* list of lists: `'((protrusion,bfactor,volsite),(protrusion,bfactor),(protrusion),())'`

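The list expressions above are wrapped in single quotes because parentheses are shell metacharacters. The sketch below is not part of P2Rank: `show_args` is a hypothetical helper that only prints each argument the shell would pass on, one per line, so you can check that an expression arrives as a single unmodified argument.

```sh
# show_args is a hypothetical helper (not part of P2Rank): it prints every
# argument it receives on its own line, wrapped in <...>.
show_args() {
    printf '<%s>\n' "$@"
}

# Quoted expressions reach the program as one literal argument each.
show_args -classifier '(RandomForest,FasterForest)' \
          -features '((protrusion,bfactor),(protrusion),())'
```

Without the quotes, bash rejects the unquoted `(` as a syntax error before `prank.sh` is even started.
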
**Range expression**: `[min:max:step]`, for example `[-1:1.5:0.5]`.
Valid only for numerical parameters.

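To get a feeling for how many values a range expression covers, the sketch below lists an arithmetic progression from `min` to `max`. It is an illustration only: `range_values` is a hypothetical helper, and it assumes that `max` itself is included when it falls on the grid, which this document does not state explicitly.

```sh
# range_values MIN MAX STEP: print MIN, MIN+STEP, ... up to MAX.
# Hypothetical helper, not part of P2Rank. Assumes STEP > 0 and that MAX
# is included when it lies exactly on the grid.
range_values() {
    LC_ALL=C awk -v lo="$1" -v hi="$2" -v st="$3" 'BEGIN {
        n = int((hi - lo) / st + 1e-9)    # number of whole steps that fit
        for (i = 0; i <= n; i++) print lo + i * st
    }'
}

range_values -1 1.5 0.5    # values for the example [-1:1.5:0.5]
```

Under that assumption the example range `[-1:1.5:0.5]` covers six values: -1, -0.5, 0, 0.5, 1 and 1.5.
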
Examples:
~~~sh
./prank.sh ploop -t <training_dataset> -e <evaluation_dataset> -<param1> '[min:max:step]' -<param2> '(val1,val2,val3,val4)'
./prank.sh ploop -t <dataset> -<param1> '[min:max:step]' -<param2> '(val1,val2,val3,val4)'   # runs crossvalidation
~~~

Random seed iteration (`-loop` and `-seed` params) works here as well.

Related parameters:
* `-clear_prim_caches <bool>`: clear primary caches (protein structures) when iterating params
* `-clear_sec_caches <bool>`: clear secondary caches (protein surfaces etc.) when iterating params

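Combining a grid with seed iteration multiplies the work. A rough budget estimate can be sketched as below, assuming (this is not confirmed by the document) that every combination of grid values is evaluated and that each combination is repeated `-loop` times. `grid_runs` is a hypothetical helper, and crossvalidation folds are not counted.

```sh
# grid_runs LOOP SIZE1 [SIZE2 ...]: multiply the number of seed passes by
# the number of values of each iterated parameter.
# Hypothetical helper, not part of P2Rank.
grid_runs() {
    total="$1"
    shift
    for size in "$@"; do
        total=$((total * size))
    done
    echo "$total"
}

# e.g. -loop 10 with a 6-value range and a 4-value list
grid_runs 10 6 4
```

For this example the estimate is 240 runs, which is why the grid approach is described as sensible for at most two parameters at a time.
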
### R plots

If you optimize exactly 1 or 2 parameters, P2Rank will try to produce plots of various statistics using the R language.
For that, you need to have `Rscript` on the PATH, and some R libraries need to be installed first.
~~~sh
sudo apt install r-base
sudo R -e "install.packages(c('ggplot2','gplots','RColorBrewer'), dependencies=TRUE, repos='http://cran.us.r-project.org')"
sudo R -e "update.packages(repos='http://cran.us.r-project.org', ask = FALSE)"   # possible fix for dependency conflicts
~~~

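Before starting a long `ploop` run it can be worth checking that the R prerequisites named above are in place. The following is a minimal sketch, not something shipped with P2Rank: `check_r_setup` is a hypothetical helper that looks for `Rscript` on the PATH and then asks R whether the three packages can be loaded.

```sh
# check_r_setup is a hypothetical helper (not part of P2Rank).
# It returns non-zero if Rscript is missing or a plotting package cannot be loaded.
check_r_setup() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        return 1
    fi
    Rscript -e 'pkgs <- c("ggplot2", "gplots", "RColorBrewer")' \
            -e 'ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)' \
            -e 'if (!all(ok)) { message("missing R packages: ", paste(pkgs[!ok], collapse = ", ")); quit(status = 1) }'
}

if check_r_setup; then
    echo "R plotting setup looks OK"
fi
```
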
Script to re-generate all plots:
~~~sh
cd plots
find rcode -type f | xargs -P 16 -I '{}' Rscript '{}'   # run every file under rcode, up to 16 in parallel
~~~

#### Real examples

Quick test run:
~~~sh
# -c        override default config with the config/train-new-default.groovy config file
# -t        crossvalidate on the chen11 dataset
# -loop ... make it quick (1 pass, small model)
./prank.sh ploop \
    -c config/train-new-default \
    -t chen11-fpocket.ds \
    -loop 1 -rf_trees 5 -rf_depth 5 \
    -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'
~~~

(Then check `run.log` in the results directory for errors, and check that the R plots are generated correctly.)

Feature set comparisons:
~~~sh
# crossvalidate on the chen11 dataset
./prank.sh ploop -c config/train-new-default \
    -t chen11-fpocket.ds \
    -loop 10 -rf_trees 100 -rf_depth 10 \
    -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'

# train on chen11 (-t) and evaluate on a different dataset (-e)
./prank.sh ploop -c config/train-new-default \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -loop 10 -rf_trees 100 -rf_depth 10 \
    -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'
~~~

## Bayesian optimization (hopt command)

```sh
./prank.sh hopt -t <dataset> -<param1> '(<min>,<max>)'                 # crossvalidation
./prank.sh hopt -t <dataset> -e <dataset> -<param1> '(<min>,<max>)'    # train on -t, evaluate on -e
```

The hopt command (`prank hopt`) implements Bayesian optimization using one of the integrated optimizers.

Integrated optimizers (values of the `-hopt_optimizer` parameter):
* `pygpgo` : __pyGPGO__ (https://github.com/josejimenezluna/pyGPGO)
* `spearmint` : __Spearmint__ (https://github.com/HIPS/Spearmint.git)

(Other optimization tools could be integrated with little work. See how the integration with *pyGPGO* is implemented in `HPyGpgoOptimizer.groovy`.)

By default, the optimization goal is to maximize the value of the metric given in the `-hopt_objective` parameter (e.g. `-hopt_objective DCA_4_0`).
For minimization, prefix the metric name with a minus sign: `-hopt_objective "'-point_LOG_LOSS'"`.

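The nested quoting in the minimization example matters: the shell removes only the outer double quotes, so the value is passed on with the inner single quotes still attached. Presumably (an assumption, not stated in this document) this keeps a value that starts with `-` from being read as a parameter name. The sketch below only prints what the shell passes on; `show_value` is a hypothetical helper, not a P2Rank command.

```sh
# show_value is a hypothetical helper (not part of P2Rank): it prints its
# first argument exactly as the shell delivered it.
show_value() {
    printf '%s\n' "$1"
}

show_value "'-point_LOG_LOSS'"    # the inner single quotes are kept
show_value '-point_LOG_LOSS'      # plain quoting: no quotes reach the program
```
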
Supported parameter types: `double`, `int`, `boolean`.

## Optimization with pyGPGO

### Install pyGPGO

Requirements: Python >3.5.

```sh
pip install pyGPGO
```

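Since the optimizer is started through the interpreter named in `-hopt_python_command`, it helps to confirm that this exact interpreter can import the package. A minimal sketch follows; `check_pygpgo` is a hypothetical helper, not part of P2Rank, and it assumes the package is importable under the module name `pyGPGO`.

```sh
# check_pygpgo is a hypothetical helper (not part of P2Rank).
# Usage: check_pygpgo [PYTHON_COMMAND]   (defaults to python3)
check_pygpgo() {
    py="${1:-python3}"
    if ! command -v "$py" >/dev/null 2>&1; then
        echo "interpreter '$py' not found on PATH" >&2
        return 1
    fi
    "$py" -c 'import pyGPGO' 2>/dev/null || {
        echo "pyGPGO is not importable from '$py'" >&2
        return 1
    }
    echo "pyGPGO is importable from '$py'"
}

if check_pygpgo python3; then
    echo "use: -hopt_python_command 'python3'"
fi
```
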
### Run optimization

Examples:
```sh
./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -hopt_optimizer 'pygpgo' \
    -hopt_python_command 'python' \
    -hopt_objective 'DCA_4_0' \
    -classifier 'FasterForest' \
    -loop 3 \
    -ploop_delete_runs 0 \
    -rf_trees '(10,200)' \
    -rf_depth '(2,14)' \
    -rf_features '(2,30)'

# Optimizing parameters that are not involved in training a new classifier,
# but rather in aggregating results into pockets.
# In that case it is enough to train a single RF model at the beginning (-hopt_train_only_once 1).
# Note: this is not ideal, because the result may overfit to that one particular RF model.
./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -hopt_optimizer 'pygpgo' \
    -hopt_python_command 'python3' \
    -hopt_objective 'DCA_4_0' \
    -classifier 'FasterForest' \
    -loop 1 \
    -hopt_train_only_once 1 \
    -pred_point_threshold '(0.2,0.6)' \
    -point_score_pow '(1,5)'
```

## Optimization with Spearmint

### Install Spearmint (on Ubuntu)

Requirements: Python 2.7 and MongoDB.

```sh
sudo apt install -y mongodb python python-pip
sudo pip install --upgrade pip
sudo pip install numpy scipy pymongo weave
# git clone https://github.com/HIPS/Spearmint.git   # Spearmint home repo
git clone https://github.com/rdk/Spearmint.git      # fork fixing scipy.weave problem (weave-fix branch)
sudo pip install -e Spearmint
```

### Run optimization

Example:
```sh
pkill python; sudo pkill mongo   # prepare a clean slate (careful, your other python programs might die too)

./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -hopt_optimizer 'spearmint' \
    -hopt_python_command 'python' \
    -hopt_spearmint_dir '../Spearmint/spearmint' \
    -hopt_objective 'DCA_4_0' \
    -classifier 'FasterForest' \
    -loop 3 \
    -ploop_delete_runs 0 \
    -rf_trees '(10,200)' \
    -rf_depth '(2,14)' \
    -rf_features '(2,30)'
```