p2rank/documentation/hyperparameter-optimization-tutorial.md
rdk 126a0653f0 move tutorials to documentation/, update rescoring tutorial and README
Move misc/tutorials/ to documentation/ and add index readme.
Update rescoring.md: add quick-start examples, paper links for all
methods, add Pocketeer to supported methods list.
Fix stale links in README.md (tutorials path, local-env.sh typo).
2026-02-11 10:52:20 +01:00


# (Hyper-)parameter optimization
P2Rank has routines for optimizing arbitrary parameters with Grid and Bayesian optimization.
Here, by hyper-parameters we mean actual hyper-parameters of the machine learning models (e.g. number of trees in RF), and also any arbitrary parameters of the algorithm as a whole.
For the complete commented list of all parameters (including undocumented ones), see [Params.groovy](https://github.com/rdk/p2rank/blob/develop/src/main/groovy/cz/siret/prank/program/params/Params.groovy) in the source code.
**Grid optimization**:
* generates plots for all stats
* gives better overview of objective landscapes (and relationships of different objectives/metrics by comparing plots)
* sensible for optimizing up to 2 parameters simultaneously
**Bayesian optimization**:
* more efficient
* makes it feasible to optimize multiple (6+) parameters simultaneously
* see https://doi.org/10.1109/BIBM.2017.8218024
## Grid optimization (ploop command)
P2Rank allows you to iterate experiments (train/eval and crossvalidation) through lists of different parameter values on the command line.
To do that, use the `prank ploop` command and pass a list or range expression instead of a single value for one or more parameters.
Supported parameter types: numerical, boolean, string, and 'list of strings' (e.g. value of param `-features` has type 'list of strings').
### Defining grid
**List expression**: `(val1,val2,...)`
Examples:
* list of numbers: `'(1,2,3,4)'`
* list of strings: `'(RandomForest,FasterForest,FasterForest2)'`
* list of lists: `'((protrusion,bfactor,volsite),(protrusion,bfactor),(protrusion),())'`
**Range expression**: `[min:max:step]` example: `[-1:1.5:0.5]`
Valid only for numerical parameters.
Examples:
~~~sh
./prank.sh ploop -t <training_dataset> -e <evaluation_dataset> -<param1> '[min:max:step]' -<param2> '(val1,val2,val3,val4)'
./prank.sh ploop -t <dataset> -<param1> '[min:max:step]' -<param2> '(val1,val2,val3,val4)' # runs crossvalidation
~~~
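To get a feel for how many runs a grid will produce, it can help to enumerate a range expression by hand before starting `ploop`. The sketch below is not part of P2Rank: `expand_range` is a hypothetical helper, and it assumes that `max` is inclusive, which this tutorial does not state explicitly.

~~~sh
#!/bin/sh
# Sketch only: list the values a range expression [min:max:step] would cover.
# ASSUMPTION: max is inclusive; P2Rank's own parser may behave differently.
expand_range() {
    awk -v min="$1" -v max="$2" -v step="$3" 'BEGIN {
        if (step <= 0) exit 1    # guard against an endless loop
        # small tolerance so floating-point drift does not drop the last value
        for (x = min + 0; x <= max + step / 1e6; x += step) print x
    }'
}

expand_range -1 1.5 0.5    # the example range [-1:1.5:0.5]
~~~

Under that assumption the example range covers six values. If a second parameter is given a four-element list and every combination is evaluated, that would mean 6 × 4 = 24 parameter combinations, each repeated for every seed when `-loop` is used.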
Random seed iteration (`-loop` and `-seed` params) works here as well.
Related parameters:
* `-clear_prim_caches <bool>`: clear primary caches (protein structures) when iterating params
* `-clear_sec_caches <bool>`: clear secondary caches (protein surfaces etc.) when iterating params
### R plots
If you optimize exactly 1 or 2 parameters, P2Rank will try to produce plots of various statistics using the R language.
For that, `Rscript` must be on the PATH, and some R libraries need to be installed first.
~~~sh
sudo apt install r-base
sudo R -e "install.packages(c('ggplot2','gplots','RColorBrewer'), dependencies=TRUE, repos='http://cran.us.r-project.org')"
sudo R -e "update.packages(repos='http://cran.us.r-project.org', ask = FALSE)" # possible fix for dependency conflicts
~~~
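Before a long `ploop` run it may be worth confirming that the R setup is usable. The following is a minimal sketch, not something P2Rank ships: `check_r_plot_setup` is a hypothetical helper that only checks for `Rscript` on the PATH and for the three packages installed above.

~~~sh
#!/bin/sh
# Hypothetical helper (not part of P2Rank): succeeds only if Rscript is on the
# PATH and the packages installed above can be loaded.
check_r_plot_setup() {
    command -v Rscript >/dev/null 2>&1 || {
        echo "Rscript not found on PATH" >&2
        return 1
    }
    Rscript -e '
        pkgs <- c("ggplot2", "gplots", "RColorBrewer")
        missing <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
        if (length(missing) > 0) {
            message("missing R packages: ", paste(missing, collapse = ", "))
            quit(status = 1)
        }'
}

if check_r_plot_setup; then
    echo "R plotting setup looks OK"
else
    echo "R plotting setup is incomplete"
fi
~~~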
Script to re-generate all plots:
~~~sh
cd plots
find rcode -type f | xargs -P 16 -I '{}' Rscript '{}'   # run each generated script (skipping directories), up to 16 in parallel
~~~
### Real examples
Quick test run:
~~~sh
# -c:    override default config with config/train-new-default.groovy config file
# -t:    crossvalidate on chen11 dataset
# -loop, -rf_trees, -rf_depth: make it quick (1 pass, small model)
./prank.sh ploop \
-c config/train-new-default \
-t chen11-fpocket.ds \
-loop 1 -rf_trees 5 -rf_depth 5 \
-features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'
~~~
(Then check `run.log` in the results directory for errors, and verify that the R plots were generated correctly.)
Feature set comparisons:
~~~sh
# crossvalidate on chen11 dataset
./prank.sh ploop -c config/train-new-default \
-t chen11-fpocket.ds \
-loop 10 -rf_trees 100 -rf_depth 10 \
-features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'

# train on chen11 (-t) and evaluate on a different dataset (-e)
./prank.sh ploop -c config/train-new-default \
-t chen11-fpocket.ds \
-e joined.ds \
-loop 10 -rf_trees 100 -rf_depth 10 \
-features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'
~~~
## Bayesian optimization (hopt command)
```sh
./prank.sh hopt -t <dataset> -<param1> '(<min>,<max>)' # crossvalidation
./prank.sh hopt -t <dataset> -e <dataset> -<param1> '(<min>,<max>)'
```
The `hopt` command (`prank hopt`) implements Bayesian optimization using one of the integrated optimizers.
Integrated optimizers (values of `-hopt_optimizer` parameter):
* `pygpgo` : __pyGPGO__ (https://github.com/josejimenezluna/pyGPGO)
* `spearmint` : __Spearmint__ (https://github.com/HIPS/Spearmint.git)
(Other optimization tools might be integrated with little work. See how integration with *pyGPGO* is implemented in `HPyGpgoOptimizer.groovy`).
By default, the optimization goal is to maximize the value of the metric given in the `-hopt_objective` parameter (e.g. `-hopt_objective DCA_4_0`).
For minimization, prefix the metric name with a minus sign: `-hopt_objective "'-point_LOG_LOSS'"`.
Supported parameter types: `double`, `int`, `boolean`.
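The nested quoting in the minimization example is easy to get wrong, so here is a small shell-only illustration of what actually reaches the program. `show_args` is a hypothetical helper that just prints its arguments. The reason P2Rank wants the inner quotes is an assumption on our part: presumably they keep the leading minus from being read as the start of another parameter name.

```sh
#!/bin/sh
# Hypothetical helper: print each argument exactly as the invoked program
# would receive it, one per line, wrapped in brackets.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# The shell consumes the outer double quotes; the inner single quotes
# are passed through as part of the value.
show_args -hopt_objective "'-point_LOG_LOSS'"
```

The second argument arrives with its single quotes intact, whereas writing `'-point_LOG_LOSS'` with only one level of quoting would pass the bare string `-point_LOG_LOSS`.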
## Optimization with pyGPGO
### Install pyGPGO
Requirements: Python >3.5.
```sh
pip install pyGPGO
```
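A quick way to confirm the installation is to try importing the package with the same interpreter you intend to pass via `-hopt_python_command`. This is a sketch under the assumption that P2Rank runs pyGPGO through that interpreter; `check_pygpgo` is a hypothetical helper, not a P2Rank command.

```sh
#!/bin/sh
# Hypothetical helper: succeeds if the given Python interpreter
# (default: python3) can import the pyGPGO package.
check_pygpgo() {
    "${1:-python3}" -c 'import pyGPGO' >/dev/null 2>&1
}

if check_pygpgo python3; then
    echo "pyGPGO is importable with python3"
else
    echo "pyGPGO is NOT importable with python3"
fi
```

If `pip` and the chosen interpreter belong to different Python installations, the import fails even though `pip install pyGPGO` succeeded; installing with `python3 -m pip install pyGPGO` avoids that mismatch.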
### Run optimization
Examples:
```sh
./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \
-t chen11-fpocket.ds \
-e joined.ds \
-hopt_optimizer 'pygpgo' \
-hopt_python_command 'python' \
-hopt_objective 'DCA_4_0' \
-classifier 'FasterForest' \
-loop 3 \
-ploop_delete_runs 0 \
-rf_trees '(10,200)' \
-rf_depth '(2,14)' \
-rf_features '(2,30)'

# Optimizing parameters that are not involved in training a new classifier,
# but rather in aggregating results into pockets.
# In this case a single RF model can be trained once at the beginning (-hopt_train_only_once 1).
# Note: this is not ideal, because the result may overfit to that one particular RF model.
./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \
-t chen11-fpocket.ds \
-e joined.ds \
-hopt_optimizer 'pygpgo' \
-hopt_python_command 'python3' \
-hopt_objective 'DCA_4_0' \
-classifier 'FasterForest' \
-loop 1 \
-hopt_train_only_once 1 \
-pred_point_threshold '(0.2,0.6)' \
-point_score_pow '(1,5)'
```
## Optimization with Spearmint
### Install Spearmint (on Ubuntu)
Requirements: Python 2.7 and MongoDB.
```sh
sudo apt install -y mongodb python python-pip
sudo pip install --upgrade pip
sudo pip install numpy scipy pymongo weave
# git clone https://github.com/HIPS/Spearmint.git # Spearmint home repo
git clone https://github.com/rdk/Spearmint.git # fork fixing scipy.weave problem (weave-fix branch)
sudo pip install -e Spearmint
```
### Run optimization
Example:
```sh
pkill python; sudo pkill mongo; # prepare clean slate (careful, your other python programs might die too)
./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \
-t chen11-fpocket.ds \
-e joined.ds \
-hopt_optimizer 'spearmint' \
-hopt_python_command 'python' \
-hopt_spearmint_dir '../Spearmint/spearmint' \
-hopt_objective 'DCA_4_0' \
-classifier 'FasterForest' \
-loop 3 \
-ploop_delete_runs 0 \
-rf_trees '(10,200)' \
-rf_depth '(2,14)' \
-rf_features '(2,30)'
```