mirror of https://github.com/rdk/p2rank.git synced 2026-06-04 20:54:23 +08:00

rdk aae633f557 minor documentation updates

2024-10-19 16:19:57 +02:00

(Hyper-)parameter optimization

P2Rank has routines for optimizing arbitrary parameters with Grid and Bayesian optimization.

Here, by hyper-parameters we mean actual hyper-parameters of the machine learning models (e.g. number of trees in RF), and also any arbitrary parameters of the algorithm as a whole.

For the complete commented list of all parameters (including undocumented ones), see Params.groovy in the source code.

Grid optimization:

generates plots for all stats
gives better overview of objective landscapes (and relationships of different objectives/metrics by comparing plots)
sensible for optimizing up to 2 parameters simultaneously

Bayesian optimization:

more efficient
makes it feasible to optimize multiple (6+) parameters simultaneously
see https://doi.org/10.1109/BIBM.2017.8218024

Grid optimization (ploop command)

P2Rank allows you to iterate experiments (train/eval and crossvalidation) through lists of different parameter values on the command line. To do that, use the prank ploop command and pass a list or range expression in place of a single value for one or more parameters.

Supported parameter types: numerical, boolean, string, and 'list of strings' (e.g. value of param -features has type 'list of strings').

Defining grid

List expression: (val1,val2,...)

Examples:

list of numbers: '(1,2,3,4)'
list of strings: '(RandomForest,FasterForest,FasterForest2)'
list of lists: '((protrusion,bfactor,volsite),(protrusion,bfactor),(protrusion),())'
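To make the nesting rules of list expressions concrete, here is a minimal Python sketch of a parser for them. It is only an illustration and not P2Rank's actual parser (P2Rank is written in Groovy); the function name `parse_list_expr` is hypothetical, and values are kept as plain strings.

```python
def parse_list_expr(text):
    """Parse a list expression such as '(a,b,(c,d),())' into nested lists of strings."""
    pos = 0

    def parse_list():
        nonlocal pos
        if text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1                                  # consume '('
        items, token = [], ""
        while text[pos] != ")":
            ch = text[pos]
            if ch == "(":
                items.append(parse_list())        # nested list, e.g. a feature set
            elif ch == ",":
                if token:                         # finish the current plain value
                    items.append(token)
                    token = ""
                pos += 1
            else:
                token += ch
                pos += 1
        if token:
            items.append(token)
        pos += 1                                  # consume ')'
        return items

    result = parse_list()
    if pos != len(text):
        raise ValueError("trailing characters after closing parenthesis")
    return result


print(parse_list_expr("(RandomForest,FasterForest,FasterForest2)"))
print(parse_list_expr("((protrusion,bfactor),(protrusion),())"))
```

Note that the empty list `()` is a valid element, which is how the example above expresses an empty feature set.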

Range expression: [min:max:step], e.g. [-1:1.5:0.5]. Valid only for numerical parameters.
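As an illustration of what a range expression stands for, the following Python sketch expands `[min:max:step]` into an explicit list of values. It assumes that `max` is inclusive, which this document does not state, so check the P2Rank source if the endpoint matters. The function name `expand_range` is hypothetical.

```python
from decimal import Decimal


def expand_range(expr):
    """Expand '[min:max:step]' into a list of floats (ASSUMPTION: max is inclusive)."""
    body = expr.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"not a range expression: {expr!r}")
    parts = body[1:-1].split(":")
    if len(parts) != 3:
        raise ValueError(f"expected [min:max:step], got: {expr!r}")
    # Decimal avoids accumulating binary floating-point error across steps.
    lo, hi, step = (Decimal(p) for p in parts)
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    current = lo
    while current <= hi:
        values.append(float(current))
        current += step
    return values


print(expand_range("[-1:1.5:0.5]"))
```

Under the inclusive-endpoint assumption, `[-1:1.5:0.5]` corresponds to six values from -1 to 1.5.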

Examples:

./prank.sh ploop -t <training_dataset> -e <evaluation_dataset> -<param1> '[min:max:step]' -<param2> '(val1,val2,val3,val4)'
./prank.sh ploop -t <dataset>                                  -<param1> '[min:max:step]' -<param2> '(val1,val2,val3,val4)'   # runs crossvalidation
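When more than one parameter is given a list or range, the number of experiments grows multiplicatively. The Python sketch below enumerates grid points under the assumption that ploop evaluates every combination of the listed values (a full Cartesian grid); the helper `grid_points` and the concrete values are hypothetical, while `rf_trees` and `rf_depth` are parameter names used elsewhere in this document.

```python
from itertools import product


def grid_points(param_values):
    """Return one dict per grid point (ASSUMPTION: full Cartesian product of all values)."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical grid: 3 values for rf_trees times 2 values for rf_depth.
grid = grid_points({"rf_trees": [50, 100, 200], "rf_depth": [5, 10]})
print(len(grid))
for point in grid:
    print(point)
```

Each grid point is a full train/eval or crossvalidation run (repeated according to -loop), so it is worth estimating the grid size before starting.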

Random seed iteration (-loop and -seed params) works here as well.

Related parameters:

-clear_prim_caches <bool>: clear primary caches (protein structures) when iterating params
-clear_sec_caches <bool>: clear secondary caches (protein surfaces etc.) when iterating params

R plots

If you optimize exactly 1 or 2 parameters, P2Rank will try to produce plots of various statistics using the R language. For that, Rscript needs to be on the PATH, and some R libraries need to be installed first.

sudo apt install r-base
sudo R -e "install.packages(c('ggplot2','gplots','RColorBrewer'), dependencies=TRUE, repos='http://cran.us.r-project.org')"
sudo R -e "update.packages(repos='http://cran.us.r-project.org', ask = FALSE)"  # possible fix for dependency conflicts
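Before starting a long ploop run, it can be useful to confirm that Rscript is actually reachable. The following is a small Python sketch of such a check; it is not part of P2Rank, and the function name `rscript_available` is hypothetical.

```python
import shutil


def rscript_available(executable="Rscript"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if not rscript_available():
    print("Rscript not found on PATH: R plots will probably not be generated")
```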

Script to re-generate all plots

cd plots
find rcode -type f | xargs -P 16 -I '{}' Rscript '{}'   # -type f skips the rcode directory itself

Real examples

Quick test run:

# -c: override default config with the config/train-new-default.groovy config file
# -t: crossvalidate on the chen11 dataset
# -loop 1 -rf_trees 5 -rf_depth 5: make it quick (1 pass, small model)
./prank.sh ploop \
    -c config/train-new-default \
    -t chen11-fpocket.ds \
    -loop 1 -rf_trees 5 -rf_depth 5 \
    -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'

(Then check run.log in the results directory for errors. Check that the R plots are generated correctly.)

Feature set comparisons:

# crossvalidate on the chen11 dataset
./prank.sh ploop -c config/train-new-default \
    -t chen11-fpocket.ds \
    -loop 10 -rf_trees 100 -rf_depth 10 \
    -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'

# train on chen11 and evaluate on a different dataset
./prank.sh ploop -c config/train-new-default \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -loop 10 -rf_trees 100 -rf_depth 10 \
    -features '((protrusion,bfactor),(protrusion,bfactor,new_feature))'

Bayesian optimization (hopt command)

./prank.sh hopt -t <dataset>               -<param1> '(<min>,<max>)'     # crossvalidation
./prank.sh hopt -t <dataset> -e <dataset>  -<param1> '(<min>,<max>)'

The hopt command (prank hopt) implements Bayesian optimization using one of the integrated optimizers.

Integrated optimizers (values of -hopt_optimizer parameter):

pygpgo : pyGPGO (https://github.com/josejimenezluna/pyGPGO)
spearmint : Spearmint (https://github.com/HIPS/Spearmint.git)

(Other optimization tools could be integrated with little work. See how the integration with pyGPGO is implemented in HPyGpgoOptimizer.groovy.)

By default, the optimization goal is to maximize the value of the metric given in the -hopt_objective parameter (e.g. -hopt_objective DCA_4_0). For minimization, prefix the metric name with a minus sign: -hopt_objective "'-point_LOG_LOSS'".
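The minus-sign convention can be read as "maximize the negated metric". The Python sketch below illustrates that reading. It is an assumption about the semantics, not P2Rank's implementation, and the helper names `parse_objective` and `value_to_maximize` are hypothetical.

```python
def parse_objective(spec):
    """Split an objective spec into (metric_name, sign); a leading '-' means minimize."""
    name = spec.strip().strip("'\"")      # drop surrounding quotes left over from shell quoting
    if name.startswith("-"):
        return name[1:], -1.0
    return name, 1.0


def value_to_maximize(spec, metrics):
    """Return the number an optimizer that always maximizes would see for this objective."""
    name, sign = parse_objective(spec)
    return sign * metrics[name]


# Hypothetical metric values from a single evaluation run.
metrics = {"DCA_4_0": 0.72, "point_LOG_LOSS": 0.25}
print(value_to_maximize("DCA_4_0", metrics))
print(value_to_maximize("'-point_LOG_LOSS'", metrics))
```

With this reading, a lower log loss gives a higher (less negative) objective value, so the same maximizing optimizer handles both cases.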

Supported parameter types: double, int, boolean.

Optimization with pyGPGO

Install pyGPGO

Requirements: Python >3.5.

pip install pyGPGO

Run optimization

Examples:

./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES  \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -hopt_optimizer 'pygpgo' \
    -hopt_python_command 'python' \
    -hopt_objective 'DCA_4_0' \
    -classifier 'FasterForest' \
    -loop 3 \
    -ploop_delete_runs 0 \
    -rf_trees '(10,200)' \
    -rf_depth '(2,14)' \
    -rf_features '(2,30)'  
    
# Optimizing parameters that are not involved in training a new classifier,
# but rather in aggregating results into pockets.
# In this case a single RF model can be trained once at the beginning (-hopt_train_only_once 1).
# Note: this is not ideal, because the parameters may overfit to that one particular RF model.
./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES  \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -hopt_optimizer 'pygpgo' \
    -hopt_python_command 'python3' \
    -hopt_objective 'DCA_4_0' \
    -classifier 'FasterForest' \
    -loop 1 \
    -hopt_train_only_once 1 \
    -pred_point_threshold '(0.2,0.6)' \
    -point_score_pow '(1,5)'

Optimization with Spearmint

Install Spearmint (on Ubuntu)

Requirements: Python 2.7 and MongoDB.

sudo apt install -y mongodb python python-pip
sudo pip install --upgrade pip
sudo pip install numpy scipy pymongo weave
# git clone https://github.com/HIPS/Spearmint.git  # Spearmint home repo
git clone https://github.com/rdk/Spearmint.git     # fork fixing scipy.weave problem (weave-fix branch)
sudo pip install -e Spearmint

Run optimization

Example:

pkill python; sudo pkill mongo;   # prepare clean slate (careful, your other python programs might die too)

./prank.sh hopt -c config/train-new-default -out_subdir HOPT -label TREES  \
    -t chen11-fpocket.ds \
    -e joined.ds \
    -hopt_optimizer 'spearmint' \
    -hopt_python_command 'python' \
    -hopt_spearmint_dir '../Spearmint/spearmint' \
    -hopt_objective 'DCA_4_0' \
    -classifier 'FasterForest' \
    -loop 3 \
    -ploop_delete_runs 0 \
    -rf_trees '(10,200)' \
    -rf_depth '(2,14)' \
    -rf_features '(2,30)'
