P2RANK model training and optimization tutorial
This file provides an introduction for people who want to train and evaluate their own models or optimize parameters of the algorithm.
Kick-start examples
prank traineval -t <training_dataset> -e <evaluation_dataset> # train and evaluate model (executes n runs with different random seeds, see -loop and -seed params)
prank crossval <dataset> # run crossvalidation on a single dataset (see -folds param)
prank ploop -t <training_dataset> -e <evaluation_dataset> -paramA '[min:max:step]' # iterate through param values
prank ploop -t <training_dataset> -paramB '(1,2,3,4)' # iterate through param values (crossvalidation)
Parameters
P2RANK uses a global static parameters object. In code it can be accessed with Params.getInst() or through the Parametrized trait. For the full list of parameters see Params.groovy.
Parameters can be set in 2 ways:
- on the command line: -<param_name> <value>
- in a Groovy config file specified with -c <config.file> (see working.groovy for an example: prank -c working.groovy).
Parameters on the command line override those in the config file, which override defaults.
Parameter application priority (last wins):
- default values in Params.groovy
- defaults in config/default.groovy
- (optionally) defaults in config/default-rescore.groovy, only if you run prank rescore ...
- -c <config.file>
- command line
Note: some parameters (-c, -o, -l/-label) are strictly command line attributes and are not defined in Params.groovy.
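The "last wins" priority can be pictured as a series of dictionary updates. The following is only an illustrative Python sketch, not P2RANK code: the function name resolve_params is hypothetical, and the parameter values are placeholders rather than real defaults.

```python
def resolve_params(*layers):
    """Merge parameter layers in order; later layers override earlier ones."""
    resolved = {}
    for layer in layers:
        resolved.update(layer)
    return resolved


# Hypothetical values, for illustration only (not P2RANK's real defaults).
params_groovy = {"seed": 42, "loop": 1, "folds": 5}  # defaults in Params.groovy
default_groovy = {"loop": 3}                          # config/default.groovy
config_file = {"loop": 10, "folds": 10}               # -c <config.file>
command_line = {"folds": 4}                           # command line

effective = resolve_params(params_groovy, default_groovy, config_file, command_line)
print(effective)
```

Each value ends up coming from the last layer that defines it, so a parameter given on the command line always takes precedence over the config file.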
Training and evaluation
To train a model on one dataset and evaluate its performance on another, use the prank traineval command.
Example:
prank traineval -loop 10 -seed 42 -t <training_dataset> -e <evaluation_dataset>
Runs 10 training/evaluation cycles with different values of a random seed starting at 42. Results of any single train/eval run and averaged results will be written to the output directory.
Related parameters:
- use -delete_models 0 to keep model files after evaluation.
- -cache_datasets <bool>: keep datasets (structures and Connolly points) in memory between crossval/traineval iterations. Turn off for huge datasets that won't fit into memory.
- -feature_importances <bool>: calculate feature importances (works only if classifier = "FastRandomForest")
- -fail_fast <bool>: stop processing the dataset on the first unrecoverable error with a dataset item
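To make the -loop / -seed behaviour concrete, here is a small Python sketch of how results over several seeds might be aggregated. It is an illustration under assumptions, not P2RANK's implementation: the seed is assumed to grow by 1 per run, and the metric values are made up.

```python
from statistics import mean, stdev


def seeds_for_run(start_seed, loop):
    """Seeds for `loop` runs, assuming the seed grows by 1 per run."""
    return [start_seed + i for i in range(loop)]


def summarize(scores):
    """Mean and sample standard deviation of one metric across runs."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


seeds = seeds_for_run(42, 10)
# Made-up metric values, one per seed, only to show the aggregation step.
scores = [0.70 + 0.01 * (i % 3) for i in range(len(seeds))]
avg, spread = summarize(scores)
print(seeds[0], seeds[-1], round(avg, 3), round(spread, 3))
```

Looking at the spread next to the average helps to judge whether a difference between two configurations is larger than the seed-to-seed noise.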
Note on the dataset format (important!)
Parameter -sample_negatives_from_decoys determines how points are sampled from the proteins in a training dataset.
If sample_negatives_from_decoys = false, all of the points from the protein surface are used.
If sample_negatives_from_decoys = true, only points from decoy pockets are used. Decoy pockets are pockets found by another method (such as Fpocket) that are not true ligand binding sites.
For that you need to supply a training dataset that contains pocket predictions made by that other method (e.g. for Fpocket predictions use joined-fpocket.ds instead of joined.ds).
sample_negatives_from_decoys = true in combination with Fpocket predictions has historically given slightly better results.
It focuses the classifier on learning to distinguish between true and decoy pockets, which is in theory a harder task than distinguishing ligandable from unligandable protein surface.
It also changes the ratio of sampled positives/negatives in favour of positives.
In recent versions it might be possible to achieve better results by training on the whole protein surface in combination with class balancing techniques (see the next section).
Note that default values of other parameters (related to feature extraction and classification results aggregation) were optimized for the case where sample_negatives_from_decoys = true.
Here are the most relevant ones (for descriptions see Params.groovy):
- -protrusion_radius and -neighbourhood_radius
- -average_feat_vectors
- -weight_function
- -pred_point_threshold
- -pred_min_cluster_size
Their values may need to be optimized again for the case of sample_negatives_from_decoys = false.
Dealing with class imbalances
When using the whole protein surface for training (sample_negatives_from_decoys = false) you may need to deal with class imbalances to achieve good results.
Typically the ratio of positives to negatives will be around 1:30, depending on the chosen cutoffs and margins.
Ways to deal with class imbalances:
- cutoffs and margins (in relation to distance D = <dist. to closest ligand atom>)
  - -positive_point_ligand_distance: points with D < positive_point_ligand_distance are considered positives
  - -neutral_points_margin: if > 0, points between (positive_point_ligand_distance, neutral_points_margin) are ignored
  - -train_lig_cutoff: if > 0, points with train_lig_cutoff < D are ignored
- subsampling and supersampling
  - -subsample
  - -supersample
  - use in combination with -target_class_ratio
- class weight balancing
  - use -balance_class_weights 1 in combination with -target_class_weight_ratio
  - works only with weight sensitive classifiers (RandomForest, FastRandomForest)
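The effect of the cutoffs and margins on class counts can be sketched in a few lines of Python. This is an illustration, not P2RANK code: the function names are hypothetical, the distances are made up, and the neutral band is assumed to span from positive_point_ligand_distance to positive_point_ligand_distance plus the margin, which should be checked against Params.groovy.

```python
def label_point(dist, positive_dist, neutral_margin=0.0, train_lig_cutoff=0.0):
    """Return 'positive', 'negative' or 'ignored' for one SAS point.

    dist is D, the distance to the closest ligand atom. Assumption: the
    neutral band spans (positive_dist, positive_dist + neutral_margin).
    """
    if dist < positive_dist:
        return "positive"
    if neutral_margin > 0 and dist < positive_dist + neutral_margin:
        return "ignored"
    if train_lig_cutoff > 0 and dist > train_lig_cutoff:
        return "ignored"
    return "negative"


def class_counts(distances, **cutoffs):
    """Count how many points fall into each class for the given cutoffs."""
    counts = {"positive": 0, "negative": 0, "ignored": 0}
    for d in distances:
        counts[label_point(d, **cutoffs)] += 1
    return counts


# Made-up distances (in Angstroms) for a handful of points.
distances = [1.0, 2.0, 3.0, 4.5, 6.0, 8.0, 12.0, 20.0]
counts = class_counts(
    distances, positive_dist=2.5, neutral_margin=2.5, train_lig_cutoff=15.0
)
print(counts)
```

Widening the neutral band or lowering the far cutoff removes negatives from the training set, which shifts the positive/negative ratio before any subsampling or class weighting is applied.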
Crossvalidation
To run crossvalidation on a single dataset use the prank crossval command.
Example:
prank crossval -loop 10 -seed 42 -folds 5 <dataset>
Runs 10 independent 5-fold crossvalidation runs with different values of a random seed starting at 42. Averaged results will be written to the output directory.
Related parameters:
- -crossval_threads <int>: number of folds to work on simultaneously
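For readers unfamiliar with seeded k-fold crossvalidation, the idea behind -folds combined with -loop and -seed can be sketched as follows. This is a generic Python illustration, not how P2RANK splits datasets internally: the function names and the dataset size of 23 items are hypothetical, and the seed is assumed to grow by 1 per run.

```python
import random


def kfold_indices(n_items, folds, seed):
    """Shuffle item indices with `seed` and split them into `folds` parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::folds] for i in range(folds)]


def crossval_splits(n_items, folds, seed):
    """Yield (train, test) index lists, one pair per fold."""
    parts = kfold_indices(n_items, folds, seed)
    for i, test in enumerate(parts):
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, test


# 10 independent 5-fold runs, assuming the seed grows by 1 per run.
for seed in range(42, 52):
    splits = list(crossval_splits(n_items=23, folds=5, seed=seed))
    print(seed, [len(test) for _, test in splits])
```

Every dataset item is used for testing exactly once per run, and a different seed gives a different assignment of items to folds.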
Grid optimization
P2RANK allows you to iterate experiments (train/eval and crossvalidation) through lists of different parameter values on the command line.
For that you need to use the prank ploop command and give a list or range expression instead of a single value for one or more params. Only numerical and boolean parameters are supported.
List expression: (val1,val2,...) example: '(1,2,3,4)'
Range expression: [min:max:step] example: '[-1:1.5:0.5]'
Examples:
prank ploop -t <training_dataset> -e <evaluation_dataset> -paramA '[min:max:step]' -paramB '(val1,val2,val3,val4)'
prank ploop -t <dataset> -paramA '[min:max:step]' -paramB '(val1,val2,val3,val4)' # runs crossvalidation
Random seed iteration (-loop and -seed params) works here as well.
Related parameters:
- -clear_prim_caches <bool>: clear primary caches (protein structures) when iterating params
- -clear_sec_caches <bool>: clear secondary caches (protein surfaces etc.) when iterating params
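To estimate how many experiments a ploop invocation will run, the list and range expressions can be expanded by hand. The Python sketch below is an illustration only and does not reproduce P2RANK's parser: it assumes that max is inclusive in a range, that all combinations of the iterated parameters are run, and it handles numeric values only (boolean lists are left out). The names expand and grid are hypothetical.

```python
from itertools import product


def expand(expr):
    """Expand '(a,b,c)' or '[min:max:step]' into a list of floats."""
    body = expr[1:-1]
    if expr.startswith("(") and expr.endswith(")"):
        return [float(v) for v in body.split(",")]
    if expr.startswith("[") and expr.endswith("]"):
        lo, hi, step = (float(v) for v in body.split(":"))
        if step <= 0:
            raise ValueError("step must be positive")
        count = int((hi - lo) / step + 1e-9) + 1  # assumes max is inclusive
        return [round(lo + i * step, 10) for i in range(count)]
    raise ValueError(f"not a list or range expression: {expr!r}")


def grid(param_exprs):
    """All parameter combinations for a dict of name -> expression."""
    names = list(param_exprs)
    value_lists = [expand(param_exprs[n]) for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


print(expand("[-1:1.5:0.5]"))
print(len(grid({"paramA": "[-1:1.5:0.5]", "paramB": "(1,2,3,4)"})))
```

Multiplying the number of combinations by the -loop value (and by -folds for crossvalidation) gives a rough idea of the total number of models that will be trained.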
R plots
If you iterate through exactly 1 or 2 parameters, P2RANK will try to produce plots of various statistics using the R language. For that you need to have Rscript on the PATH. Some R libraries need to be installed.
Output directory location
The location of the output directory for any given run is influenced by several parameters. You can use them to organize the results of your experiments.
- -output_base_dir <dir>: top level default output directory
- -out_subdir <dir>: subdirectory of output_base_dir (optional)
- -out_prefix_date <bool>: prefix generated experiment output directory name with a timestamp
- -l <str> or -label <str>: define suffix to generated experiment output directory name
- -o <dir>: overrides previous params and places output in specified directory
Case study: Implementing and evaluating new feature
If you are reading this tutorial, there is a good chance you want to implement a new feature and evaluate whether it contributes to prediction success rates.
Implementation
New features can be added by implementing the FeatureCalculator interface and registering the implementation in FeatureRegistry.
You can implement the feature by extending one of the convenience abstract classes AtomFeatureCalculator or SasFeatureCalculator.
You need to decide whether the new feature will be associated with protein surface (i.e. solvent exposed) atoms or with SAS (Solvent Accessible Surface) points. P2RANK works by classifying SAS point feature vectors. If you associate the feature with atoms, its value will be projected onto SAS point feature vectors by P2RANK from neighbouring atoms.
Some features are more easily defined for atoms than for SAS points, and the other way around. See BfactorFeature and ProtrusionFeature for comparison.
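The projection of atom-level values onto SAS points can be pictured with a toy example. The Python sketch below is not P2RANK's implementation (which is written in Groovy and controlled by parameters such as -neighbourhood_radius, -average_feat_vectors and -weight_function). It assumes a plain unweighted mean over atoms within a fixed radius, and the function name, coordinates and values are made up.

```python
import math


def project_to_sas_points(sas_points, atoms, atom_values, radius):
    """For each SAS point, average the feature value of atoms within `radius`.

    Assumption: a plain unweighted mean; points with no atom in range get 0.0.
    """
    projected = []
    for point in sas_points:
        nearby = [
            value
            for atom, value in zip(atoms, atom_values)
            if math.dist(point, atom) <= radius
        ]
        projected.append(sum(nearby) / len(nearby) if nearby else 0.0)
    return projected


# Toy coordinates and per-atom values, made up for the example.
atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
atom_values = [1.0, 3.0, 100.0]
sas_points = [(1.5, 0.0, 0.0), (9.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
projected = project_to_sas_points(sas_points, atoms, atom_values, radius=6.5)
print(projected)
```

A feature defined on SAS points skips this step, because its value is computed directly at the point that is later classified.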
Evaluation
- Prepare the environment
  - copy misc/local-env-params.sh to the root directory of the project and edit it according to your machine (the file is then included by prank.sh)
  - you will need a lot of memory: at least enough to store the whole training dataset of feature vectors and a trained model, and then some
  - memory consumption can be drastically influenced by some parameters...
  - parameters that influence the memory/time tradeoff:
    - -cache_datasets determines whether datasets of proteins are kept in memory between runs. See also
      - -clear_prim_caches clear primary caches (protein structures) between runs (when iterating params or seed)
      - -clear_sec_caches clear secondary caches (protein surfaces etc.) between runs (when iterating params or seed)
    - -crossval_threads when running crossvalidation it determines how many models are trained at the same time. Set to 1 if you don't have enough memory.
    - -rf_trees, -rf_depth influence the size of the model in memory
- Check the working.groovy config file. It contains a configuration ideal for training new models, but you might need to make changes or override some params on the command line.
- Train with the new feature
  - train with the new feature by adding its name to the list of -extra_features, e.g.:
    - in the groovy config file: extra_features = ["protrusion","bfactor","new_feature"]
    - on the command line: -extra_features '(protrusion.bfactor.new_feature)' (dot is used as separator)
  - you can even compare different feature sets by running prank ploop ...., e.g.: -extra_features '((protrusion),(new_feature),(protrusion.new_feature))'
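The structure of the -extra_features expression, with dots separating features and parentheses grouping feature sets, can be made explicit with a small parser. This Python sketch is only an illustration of the notation shown above, not P2RANK's own parsing code, and the function name parse_feature_sets is hypothetical.

```python
def parse_feature_sets(expr):
    """Parse '((a.b),(a.b.c))' into [['a', 'b'], ['a', 'b', 'c']].

    A single set such as '(a.b.c)' yields a one-element list.
    """
    expr = expr.strip()
    if not (expr.startswith("(") and expr.endswith(")")):
        raise ValueError(f"expected a parenthesised expression: {expr!r}")
    inner = expr[1:-1]
    if inner.startswith("("):
        # List of feature sets: strip the outer parentheses of each group.
        groups = [g.strip("()") for g in inner.split("),(")]
    else:
        groups = [inner]
    return [group.split(".") for group in groups]


single = parse_feature_sets("(protrusion.bfactor.new_feature)")
compared = parse_feature_sets("((protrusion),(new_feature),(protrusion.new_feature))")
print(single)
print(compared)
```

Each inner list corresponds to one feature set, so an expression with three groups describes three configurations to compare.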
Real examples
Quick test run:
# -c working: override default config with working.groovy config file
# -t chen11-fpocket.ds: crossvalidate on the chen11 dataset
# -loop 1 -rf_trees 5 -rf_depth 5: make it quick (1 pass, small model)
./prank.sh ploop \
-c working \
-t chen11-fpocket.ds \
-loop 1 -rf_trees 5 -rf_depth 5 \
-extra_features '((protrusion.bfactor),(protrusion.bfactor.new_feature))'
(Then check run.log in the results directory for errors. Check that the R plots are generated correctly.)
Real comparison experiments:
./prank.sh ploop -c working \
-t chen11-fpocket.ds \
-loop 10 -rf_trees 100 -rf_depth 10 \
-extra_features '((protrusion.bfactor),(protrusion.bfactor.new_feature))'
# -t chen11-fpocket.ds: train on chen11
# -e joined.ds: and evaluate on a different dataset
./prank.sh ploop -c working \
-t chen11-fpocket.ds \
-e joined.ds \
-loop 10 -rf_trees 100 -rf_depth 10 \
-extra_features '((protrusion.bfactor),(protrusion.bfactor.new_feature))'