Files
p2rank/documentation/training-tutorial.md
rdk 126a0653f0 move tutorials to documentation/, update rescoring tutorial and README
Move misc/tutorials/ to documentation/ and add index readme.
Update rescoring.md: add quick-start examples, paper links for all
methods, add Pocketeer to supported methods list.
Fix stale links in README.md (tutorials path, local-env.sh typo).
2026-02-11 10:52:20 +01:00

13 KiB

P2Rank model training and optimization tutorial

This file provides an introduction for people who want to train and evaluate their own models or optimize different parameters of the algorithm.

Kick-start examples

./prank.sh traineval -t <training_dataset> -e <evaluation_dataset>  # train and evaluate model (execute n run with different random seed, see -loop and -seed params)
./prank.sh crossval <dataset>                                       # run crossvalidation on a single dataset (see -folds param)

./prank.sh ploop -t <training_dataset> -e <evaluation_dataset> -paramA '[min:max:step]'  # iterate through param values
./prank.sh ploop -t <training_dataset>                         -paramB '(1,2,3,4)'       # iterate through param values (crossvalidation)

Parameters

P2Rank uses global static parameters object. In the code, it can be accessed with Params.getInst() or through Parametrized trait. For full list of parameters see Params.groovy.

Parameters can be set in 2 ways:

  1. on the command line -<param_name> <value>
  2. in config groovy file specified with -c <config.file> (see working.groovy for an example... prank -c working.groovy).

Parameters on the command line override those in the config file, which override defaults.

Parameter application priority (last wins):

  1. default values in Params.groovy
  2. defaults in config/default.groovy
  3. (optionally) defaults in config/default_rescore.groovy only if you run prank rescore ...
  4. -c <config.file>
  5. command line

Note: some parameters (-c,-o, -l/-label) are strictly command line attributes and are not defined in Params.groovy.

Preparing the environment

Training and optimization runs can be run from the project directory (repo root) using ./prank.sh script.

  • clone P2Rank repo (https://github.com/rdk/p2rank)
  • clone dataset repo (https://github.com/rdk/p2rank-datasets) or prepare your datasets
  • copy misc/local-env.sh to root directory of the project and edit it according to your machine (the file is then included by prank.sh)
    • cd p2rank; cp misc/local-env.sh .
  • set available memory with -Xmx32G parameter. You will need a lot of memory: at least to store the whole training dataset of feature vectors and a trained model and then some (see Required memory and memory/time trade-offs)

Note: for continuous work/experimentation it is better to clone the git repo, have a local java config in local-env.sh, and use prank.sh for running experiments (as described here). The reason is that this way it will be easy to download updates (git pull) ot switch to a different P2Rank version (git checkout) while config will stay put in local-env.sh. If you decide to use downloaded .tar.gz distribution or .zip source package this will not be as easy, and you will need to manually update the config each time you download an update.

Training and evaluation

To train a model on one dataset and evaluate its performance on the other use prank traineval command.

Example:

./prank.sh traineval -loop 10 -seed 42 -t <training_dataset> -e <evaluation_dataset>`

Runs 10 training/evaluation cycles with different values of a random seed starting at 42. Results of any single train/eval run and averaged results will be written to the output directory.

Related parameters:

  • use -delete_models 0 to keep model files after evaluation.
  • use -delete_vactors 0 to export feature vector files
  • -feature_importances <bool>: calculate feature importances (works only if -classifier supports it, examples: RandomForest, FastRandomForest, FasterForest)
  • -fail_fast <bool>: stop processing the dataset on the first unrecoverable error with a dataset item

Required memory and memory/time trade-offs

Memory consumption can be drastically influenced by some parameters.

Random Forest implementations train trees in parallel using the number of threads defined in-rf_threads variable. Ideally, this would be set to the number of physical CPU cores in the machine. However, required memory during training grows linearly with the number trees trained in parallel (-rf_threads), so you may need to lower the number of threads.

Parameters that influence memory/time trade-off:

  • -cache_datasets determines whether datasets of proteins are kept in memory between runs**. Related parameters:

    • -clear_prim_caches clear primary caches (protein structures) between runs (when iterating params or seed)
    • -clear_sec_caches clear secondary caches (protein surfaces etc.) between runs (when iterating params or seed)
  • -rf_threads number of trees trained in parallel

  • -rf_trees, -fr_depth influence the size of the model in memory

  • -rf_bagsize influences memory needed for training and training time (default is 100% but good results can be achieved with 55 or less)

  • -crossval_threads when running crossvalidation it determines how many models are trained at the same time. Set to 1 if you don't have enough memory.

  • -cache_datasets <bool>: keep datasets (structures and SAS points) in memory between crossval/traineval iterations. For single-pass training (-loop 1) it does not make sense to keep it on. Turn off when evaluating the model on huge datasets that won't fit to memory (e.g. the whole PDB). When switched off, it will leave more memory for RF at the cost of needing to parse all pdb files again.

Additional notes:

  • Subsampling and supersampling influence the size of training vector dataset and required memory (see Dealing with class imbalances).
  • Memory also grows linearly with "bag size" (-rf_bagsize) but this would generally be in range (50%-100%).
  • Keep in mind how JVM deals with compressed OOPs. Basically it doesn't make sense to have heap size between 32G and ~48G.

Historical note on the dataset format

(This section should be moved to historical notes as soon as there will be a new default P2Rank model.)

Parameter -sample_negatives_from_decoys determines how points are sampled from the proteins in a training dataset. If sample_negatives_from_decoys = false all the points from the protein surface are used. If sample_negatives_from_decoys = true only points from decoy pockets (false-positives ligand binding sites found by other methods like Fpocket) are used. For that you need to supply a training dataset that contains pocket predictions by another method (i.e. for predictions of Fpocket use joined-fpocket.ds instead of joined.ds).

sample_negatives_from_decoys = true in combination with Fpocket predictions was historically giving slightly better results. It focuses the classifier to learn to distinguish between true and decoy pockets which is, in theory, a harder task than to distinguish between ligandable vs. unligandable protein surface. It also changes the ratio of sampled positives/negatives in favour of positives.

In recent P2Rank versions it might be possible to achieve better results by training from the whole protein surface in combination with class balancing techniques (see the next section). Note that default values of other parameters (related to feature extraction and classification results aggregation) were optimized for the case where sample_negatives_from_decoys = true.

Here are the most relevant ones (for descriptions see Params.groovy):

  • -protrusion_radius and -neighbourhood_radius
  • -average_feat_vectors
  • -weight_function
  • -pred_point_threshold
  • -pred_min_cluster_size

Their values may need to be optimized again for the case when sample_negatives_from_decoys = false.

Dealing with class imbalance

When using full protein surface for training (sample_negatives_from_decoys = false) you may need to deal with class imbalances to achieve good results. Typically, the ratio of positives vs. negatives will be around (1:30) depending on chosen cutoffs and margins.

Ways to deal with class imbalances:

  • cutoffs and margins (in relation to distance D = <dist. to closest ligand atom>)
    • -positive_point_ligand_distance points with D < positive_point_ligand_distance are considered positives
    • -neutral_points_margin if > 0 points between (positive_point_ligand_distance, neutral_point_margin) are ignored
    • -train_lig_cutoff if > 0 points with train_lig_cutoff < D are ignored
  • subsampling and supersampling
    • -subsample
    • -supersample
    • use in combination with -target_class_ratio
    • can substantially influence the size of the training vector dataset and consequentially the required memory
  • using different density of points on Solvent Accessible Surface for positives and negatives. It is also possible to use different density for training and evaluation.
    • -tessellation, -train_tessellation, -train_tessellation_negatives
    • by default tessellation = train_tessellation = train_tessellation_negatives = 2
    • higher tesselation equals higher density = more points
  • class weight balancing
    • use -balance_class_weights 1 in combination with -target_class_weight_ratio
    • works only with weight sensitive classifiers (RandomForest, FastRandomForest, FasterForest, FasterForest2)

Note: ratio of positives and negatives in the training dataset (as well as -target_class_weight_ratio if used) can strongly influence produced results. Change in ratios will most likely need to be compensated by optimizing -pred_point_threshold parameter (possibly also -point_score_pow). See hyperparameter optimization tutorial and particular example of optimization run.

Crossvalidation

To run crossvalidation on a single dataset use the prank crossval command.

Example:

./prank crossval -loop 10 -seed 42 -folds 5 <dataset>    

Runs 10 independent 5-fold crossvalidation runs with different values of a random seed starting at 42. Averaged results will be written to the output directory.

Related parameters:

  • -crossval_threads <int>: number of folds to work on simultaneously

Output directory location

The location of the output directory for any given run is influenced by several parameters. You can organize the results of your experiments with their help.

  • -output_base_dir <dir>: top-level default output directory
  • -out_subdir <dir>: subdirectory of output_base_dir (optional)
  • -out_prefix_date <bool>: add timestamp prefix to output directory name
  • -l <str> or -label <str>: append a suffix to generated experiment output directory name
  • -o <dir>: overrides previous params and places output in specified directory

Classifiers / Machine learning algorithms

P2Rank can use different ML algorithms by changing value of -classifter parameter (e.g. -classifter FasterForest).

Random Forests implementations:

  • RandomForest: Original implementation from Weka. Slow and memory consuming but can have marginally better predictive performance. Uses entropy.
  • FastRandomForest: New faster implementation by Dan Supek. Uses entropy.
  • FasterForest: Streamlined implementation of FastRandomForest. It s faster and uses less memory, should have the same predictive performance. Uses entropy.
  • FasterForest2: Even faster version. Can have slightly lower predictive performance. Uses GINI.

Notes:

  • FastRandomForest and FasterForest use basically the same algorithm, FasterForest is just more optimized.
  • the differences in predictive performance are low, but the difference in consumed time and memory are high (0.5-4x).
  • (for developers) to integrate new algorithms start in ClassifierFactory.groovy.

Comparing training time

For illustration here is a comparison of training times on one particular dataset. Value in cells is training time in minutes.

classifier / rf_trees 100 200 400
RandomForest 14.6 28.4 56.8
FastRandomForest 9.3 18.2 34.4
FasterForest 5.7 10.8 23.1
FasterForest2 4.4 8.7 16.0

(Produced with dataset of 6.8M data points on 12 core machine.)

Feature importances

Some classifiers can calculate feature importances during training using -feature_importances 1 parameter. Output will be saved to feature_importances_sorted.csv or feature_importances.txt (in case of RandomForest). Training time will be impacted.

Supported by classifiers:

  • RandomForest: using mean impurity decrease method. Uses entropy.
  • FastRandomForest: increase in out-of-bag error (as % misclassified instances) after feature permuted. Uses entropy.
  • FasterForest: same as FastRandomForest
  • FasterForest2: uses new experimental method

From the experience it seems that most useful are the methods used by FastRandomForest and FasterForest.

When running more train/eval iterations (using -loop 10 param), average importances will be calculated. Using more trees leads to more stable results (-rf_trees 200).