# (Hyper-)parameter optimization P2Rank has routines for optimizing arbitrary parameters with Grid and Bayesian optimization. Here, by hyper-parameters we mean actual hyper-parameters of the machine learning models (e.g. number of trees in RF), and also any arbitrary parameters of the algorithm as a whole. To see the complete commented list of all (including undocumented) parameters see [Params.groovy](https://github.com/rdk/p2rank/blob/develop/src/main/groovy/cz/siret/prank/program/params/Params.groovy) in the source code. **Grid optimization**: * generates plots for all stats * gives better overview of objective landscapes (and relationships of different objectives/metrics by comparing plots) * sensible for optimizing up to 2 parameters simultaneously **Bayesian optimization**: * more efficient * allows to feasibly optimize multiple (6+) parameters simultaneously * see https://doi.org/10.1109/BIBM.2017.8218024 ## Grid optimization (ploop command) P2Rank allows you to iterate experiments (train/eval and crossvalidation) through lists of different parameter values on the command line. For that, you need to use the `prank ploop` command and list or range expression instead of param value for one or more params. Supported parameter types: numerical, boolean, string, and 'list of strings' (e.g. value of param `-features` has type 'list of strings'). #### Defining grid **List expression**: `(val1,val2,...)` Examples: * list of numbers: `'(1,2,3,4)'` * list of strings: `'(RandomForest,FasterForest,FasterForest2)'` * list of lists: `'((protrusion,bfactor,volsite),(protrusion,bfactor),(protrusion),())'` **Range expression**: `[min:max:step]` example: `[-1:1.5:0.5]` Valid only for numerical parameters. Examples: ~~~sh ./prank.sh ploop -t -e - '[min:max:step]' - '(val1,val2,val3,val4)' ./prank.sh ploop -t - '[min:max:step]' - '(val1,val2,val3,val4)' # runs crossvalidation ~~~ Random seed iteration (`-loop` and `-seed` params) works here as well. Related parameters: * `-clear_prim_caches `: clear primary caches (protein structures) when iterating params * `-clear_sec_caches `: clear secondary caches (protein surfaces etc.) when iterating params ### R plots In case you optimize exactly 1 or 2 parameters, P2Rank will try to produce plots of various statistics using R language. For that, you need to have `Rscript` on the PATH. Some libraries in R need to be installed first. ~~~sh sudo apt install r-base sudo R -e "install.packages(c('ggplot2','gplots','RColorBrewer'), dependencies=TRUE, repos='http://cran.us.r-project.org')" sudo R -e "update.packages(repos='http://cran.us.r-project.org', ask = FALSE)" # possible fix for dependency conflicts ~~~ Script to re-generate all plots ~~~sh cd plots find rcode | xargs -P 16 -I '{}' Rscript '{}' ~~~ #### Real examples Quick test run: ~~~sh ./prank.sh ploop -c config/train-new-default \ # override default config with config/train-new-default.groovy config file -t chen11-fpocket.ds \ # crossvalidate on chen11 datasest -loop 1 -rf_trees 5 -rf_depth 5 \ # make it quick (1 pass, small model) -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'` ~~~ (Then check `run.log` in n results directory for errors. Check if R plots are generated correctly.) Feature set comparisons: ~~~sh ./prank.sh ploop -c config/train-new-default \ -t chen11-fpocket.ds \ # crossvalidate on chen11 dataset -loop 10 -rf_trees 100 -rf_depth 10 \ -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'` ./prank.sh ploop -c config/train-new-default \ -t chen11-fpocket.ds \ # train on chen11 -e joined.ds \ # and evaluate on a different dataset -loop 10 -rf_trees 100 -rf_depth 10 \ -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'` ~~~ ## Bayesian optimization (hopt command) ```sh ./prank.sh hopt -t - '(,)' # crossvalidation ./prank.sh hopt -t -e - '(,)' ``` Hopt command (`prank hopt`) implements Bayesian optimization using one of the integrated optimizers. Integrated optimizers (values of `-hopt_optimizer` parameter): * `pygpgo` : __pyGPGO__ (https://github.com/josejimenezluna/pyGPGO) * `spearmint` : __Speramint__ (https://github.com/HIPS/Spearmint.git) (Other optimization tools might be integrated with little work. See how integration with *pyGPGO* is implemented in `HPyGpgoOptimizer.groovy`). By default, optimization goal is to maximize value of a metric in `-hopt_objective` parameter (e.g. `-hopt_objective DCA_4_0)`). For minimization, prefix metric name with minus sign: `-hopt_objective "'-point_LOG_LOSS'"`. Supported parameter types: `double`, `int`, `boolean`. ## Optimization with pyGPGO ### Install pyGPGO Requirements: Python >3.5. ```sh pip install pyGPGO ``` ## Run optimization Examples: ```sh ./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \ -t chen11-fpocket.ds \ -e joined.ds \ -hopt_optimizer 'pygpgo' \ -hopt_python_command 'python' \ -hopt_objective 'DCA_4_0' \ -classifier 'FasterForest' \ -loop 3 \ -ploop_delete_runs 0 \ -rf_trees '(10,200)' \ -rf_depth '(2,14)' \ -rf_features '(2,30)' # Optimizing parameters that are not involved in training new classifier, # but rather in aggregating results into pockets. # We can allow to train only one RF model in the beginning (-hopt_train_only_once 1). # Note: this is not really ideal because of overfitting to a one particular RF model. ./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \ -t chen11-fpocket.ds \ -e joined.ds \ -hopt_optimizer 'pygpgo' \ -hopt_python_command 'python3' \ -hopt_objective 'DCA_4_0' \ -classifier 'FasterForest' \ -loop 1 \ -hopt_train_only_once 1 \ -pred_point_threshold '(0.2,0.6)' \ -point_score_pow '(1,5)' ``` ## Optimization with Spearmint ### Install Spearmint (on ubuntu) Requirements: Python 2.7 and MongoDB. ```sh sudo apt install -y mongodb python python-pip sudo pip install --upgrade pip sudo pip install numpy scipy pymongo weave # git clone https://github.com/HIPS/Spearmint.git # Spearmint home repo git clone https://github.com/rdk/Spearmint.git # fork fixing scipy.weave problem (weave-fix branch) sudo pip install -e Spearmint ``` ## Run optimization Example: ```sh pkill python; sudo pkill mongo; # prepare clean slate (careful, your other python programs might die too) ./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES \ -t chen11-fpocket.ds \ -e joined.ds \ -hopt_optimizer 'spearmint' \ -hopt_python_command 'python' \ -hopt_spearmint_dir '../Spearmint/spearmint' \ -hopt_objective 'DCA_4_0' \ -classifier 'FasterForest' \ -loop 3 \ -ploop_delete_runs 0 \ -rf_trees '(10,200)' \ -rf_depth '(2,14)' \ -rf_features '(2,30)' ```