From 784f2357950ab4a427d14aba916e8f0a6a9fe577 Mon Sep 17 00:00:00 2001 From: rdk Date: Tue, 17 Nov 2020 18:24:18 +0100 Subject: [PATCH] fix typos in documantation --- README.md | 25 ++++++++-------- distro/config/readme.md | 4 +-- distro/doc/dataset-file-format.md | 26 ++++++++-------- distro/models/readme.md | 4 +-- distro/test_data/concavity.ds | 2 +- distro/test_data/specified-ligands-2.ds | 2 +- distro/test_data/specified-ligands.ds | 4 +-- misc/tutorials/feature-setup.md | 24 +++++++-------- misc/tutorials/hidden-commands.md | 2 +- .../hyperparameter-optimization-tutorial.md | 16 +++++----- .../new-feature-evaluation-tutorial.md | 4 +-- misc/tutorials/training-tutorial.md | 30 +++++++++---------- 12 files changed, 72 insertions(+), 71 deletions(-) diff --git a/README.md b/README.md index 01e0fd17..0593f9ff 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,8 @@ P2Rank is a stand-alone command line program that predicts ligand-binding pocket * Java 8 to 15 * PyMOL 1.7 (or newer) for viewing visualizations (optional) -Program is tested on Linux, macOS and Windows. On Windows, using `bash` console is recommended to execute the program instead of `cmd` or `PowerShell`. +P2Rank is tested on Linux, macOS, and Windows. +On Windows, it is recommended to use the `bash` console to execute the program instead of `cmd` or `PowerShell`. ### Setup @@ -44,7 +45,7 @@ P2Rank makes predictions by scoring and clustering points on the protein's solve Ligandability score of individual points is determined by a machine learning based model trained on the dataset of known protein-ligand complexes. For more details see the slides and publications. -Presentation slides introducing original version of the algotithm: [Slides (pdf)](http://bit.ly/p2rank_slides) +Presentation slides introducing the original version of the algorithm: [Slides (pdf)](http://bit.ly/p2rank_slides) ### Publications @@ -54,15 +55,15 @@ If you use P2Rank, please cite relevant papers: Krivak R, Hoksza D. 
*P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.* Journal of Cheminformatics. 2018 Aug. * [Web-server article](https://doi.org/10.1093/nar/gkz424) in NAR about the web interface accessible at [prankweb.cz](https://prankweb.cz) Jendele L, Krivak R, Skoda P, Novotny M, Hoksza D. *PrankWeb: a web server for ligand binding site prediction and visualization.* Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W345-W349 -* [Conference paper](https://doi.org/10.1007/978-3-319-21233-3_4) inroducing P2Rank prediction algorithm - Krivak R, Hoksza D. *P2RANK: Knowledge-Based Ligand Binding Site Prediction Using Aggregated Local Features.* InInternational Conference on Algorithms for Computational Biology 2015 Aug 4 (pp. 41-52). Springer +* [Conference paper](https://doi.org/10.1007/978-3-319-21233-3_4) introducing P2Rank prediction algorithm + Krivak R, Hoksza D. *P2RANK: Knowledge-Based Ligand Binding Site Prediction Using Aggregated Local Features.* International Conference on Algorithms for Computational Biology 2015 Aug 4 (pp. 41-52). Springer * [Research article](https://doi.org/10.1186/s13321-015-0059-5) in JChem about PRANK rescoring algorithm Krivak R, Hoksza D. *Improving protein-ligand binding site prediction accuracy by classification of inner pocket points using local features.* Journal of Cheminformatics. 2015 Dec. ### Build from sources This project uses [Gradle](https://gradle.org/) build system via included Gradle wrapper. -On Windows use `bash` to execute build comands (`bash` is installed as a part of [Git for Windows](https://git-scm.com/download/win)). +On Windows use `bash` to execute build commands (`bash` is installed as a part of [Git for Windows](https://git-scm.com/download/win)). 
```bash git clone https://github.com/rdk/p2rank.git && cd p2rank @@ -91,7 +92,7 @@ prank help ### Predict ligand binding sites (P2Rank algorithm) ~~~bash -prank predict test.ds # run on whole dataset (containing list of pdb files) +prank predict test.ds # run on the whole dataset (containing list of pdb files) prank predict -f test_data/1fbl.pdb # run on single pdb file prank predict -f test_data/1fbl.pdb.gz # run on single gzipped pdb file @@ -121,7 +122,7 @@ prank eval-predict test.ds If coordinates of SAS points that belong to predicted pockets are needed, they can be found in `visualizations/data/_points.pdb`. There "Residue sequence number" (23-26) of HETATM record - corresponds to the rank of corresponding pocket (points with value 0 do not belong to any pocket). + corresponds to the rank of the corresponding pocket (points with value 0 do not belong to any pocket). ### Configuration @@ -136,7 +137,7 @@ prank predict -c example.groovy test.ds It is also possible to override the default params on the command line using their full name. ~~~bash -prank predict -seed 151 -threads 8 test.ds # set random seed and number of threads, override defeults +prank predict -seed 151 -threads 8 test.ds # set random seed and number of threads, override defaults prank predict -c example.groovy -seed 151 -threads 8 test.ds # override defaults as well as values from example.groovy ~~~ @@ -160,25 +161,25 @@ prank eval-rescore fpocket.ds # evaluate rescoring model ## Comparison with Fpocket -[Fpocket](https://github.com/Discngine/fpocket) is widely used open source ligand binding site prediction program. +[Fpocket](https://github.com/Discngine/fpocket) is a widely used open source ligand binding site prediction program. It is fast, easy to use and well documented. As such, it was a great inspiration for this project. Fpocket is written in C, and it is based on a different geometric algorithm. 
Some practical differences: * **Fpocket** - - has much smaller memory footprint + - has a much smaller memory footprint - runs faster when executed on a single protein - produces a high number of less relevant pockets (and since the default scoring function isn't very effective the most relevant pockets often doesn't get to the top) - contains MDpocket algorithm for pocket predictions from molecular trajectories - still better documented * **P2Rank** - achieves significantly higher identification success rates when considering top-ranked pockets - - produces smaller number of more relevant pockets + - produces a smaller number of more relevant pockets - speed: + slower when running on a single protein (due to JVM startup cost) + approximately as fast on average running on a big dataset on a single core - + due to parallel implementation potentially much faster on multi core machines + + due to parallel implementation potentially much faster on multi-core machines - higher memory footprint (~1G but doesn't grow much with more parallel threads) Both Fpocket and P2Rank have many configurable parameters that influence behaviour of the algorithm and can be tweaked to achieve better results for particular requirements. diff --git a/distro/config/readme.md b/distro/config/readme.md index a19e14aa..b666a185 100644 --- a/distro/config/readme.md +++ b/distro/config/readme.md @@ -5,7 +5,7 @@ This directory contains P2Rank config files. Initially, P2Rank loads configuration from `default.groovy` (and from `default-rescore.groovy` in case you run `prank rescore ...`). -Parameters can be then overriden in a custom config file (`-c `) or directly on the command line. +Parameters can be then overridden in a custom config file (`-c `) or directly on the command line. ## Details @@ -22,5 +22,5 @@ Parameter application priority (last wins): 4. parameters in custom config file `-c ` 5. 
parameters on the command line -To see comprehensive list of all possible params see Params.groovy in the source code: +To see a comprehensive list of all possible params see Params.groovy in the source code: https://github.com/rdk/p2rank/blob/master/src/main/groovy/cz/siret/prank/program/params/Params.groovy diff --git a/distro/doc/dataset-file-format.md b/distro/doc/dataset-file-format.md index afccc6d9..7d2d18c7 100644 --- a/distro/doc/dataset-file-format.md +++ b/distro/doc/dataset-file-format.md @@ -8,24 +8,24 @@ Dataset file specifies a list of files (typically proteins) to be processed by t 2W83.pdb 1fbl.pdb ~~~ -Basic single-culumn dataset that specifies list of proteins. +A basic single-column dataset that specifies a list of proteins. ## Multi-column dataset format -Optionally, dataset files can have a multi-column format that allows to specify complementary data. This is relevant only if you are interested in training and evaluating new models. +Optionally, dataset files can have a multi-column format that allows specifying complementary data. This is relevant only if you are interested in training and evaluating new models. Multi-column datasets need to declare a header (see `test_data/fpocket-pairs.ds` for example of a dataset for evaluation of Fpocket predictions). -**Valid coulmn names**: -* `"protein"` column is mandatory if program is used for pocket prediction -* `"prediction"` column is mandatory if program is used for pocket rescoring -* `"chains"` allows to explicitly specify whhich protein chains from the structure should be considered. Structures will be reduced to specified chains when loaded. Value `*` means all chains. +**Valid column names**: +* `"protein"` column is mandatory if the program is used for pocket prediction +* `"prediction"` column is mandatory if the program is used for pocket rescoring +* `"chains"` allows to explicitly specify which protein chains from the structure should be considered. 
Structures will be reduced to specified chains when loaded. Value `*` means all chains. * `"ligands"` allows to explicitly specify which ligands should be considered (see test-ligand-codes.ds) * `"ligand_codes"` same as "ligands" (for backward compatibility) * `"conservation"` contains link to sequence conservation data -If the header is not specified, default header is `HEADER: protein` i.e. dataset contains just a list of protein files. +If the header is not specified, the default implicit header is `HEADER: protein` i.e. dataset contains just a list of protein files. Additionally, it is possible to specify global dataset parameters. @@ -45,12 +45,12 @@ HEADER: ### Examples -Folllowing examples are valid multi-column dataset files. See other examples in `test_data` folter. +Following examples are valid multi-column dataset files. See other examples in `test_data` folder. Ligands can be specified by a group name (e.g. `PHI`) in which case all ligands with this name will be considered relevant. To specify particular molecules you can optionally use `atom_id` and `group_id` specifiers. No whitespace in the column value is allowed. -##### Example 2: Dataset with explicitely specified relevant ligands +##### Example 2: Dataset with explicitly specified relevant ligands ~~~sh HEADER: protein ligands @@ -60,17 +60,17 @@ liganated/1nlua.pdb PHI[atom_id:1234] liganated/1t7qa.pdb COA[group_id:C_234A] liganated/2ck3b.pdb ANP ~~~ -Dataset with explicitly specified ligands. Useful only for training and evaluation datasets. +A dataset with explicitly specified ligands. Useful only for training and evaluation datasets. 
##### Example 3: Dataset of protein/prediction pairs ~~~sh -PARAM.PREDICTION_METHOD=fpocket # specifies method that was used to create predictions -PARAM.LIGANDS_SEPARATED_BY_TER=true # specifies that ligands are separated by TER record (relevant only for lagacy CHEN11 dataset) +PARAM.PREDICTION_METHOD=fpocket # specifies the method that was used to create predictions +PARAM.LIGANDS_SEPARATED_BY_TER=true # specifies that ligands are separated by TER record (relevant only for legacy CHEN11 dataset) HEADER: protein prediction liganated/1a82a.pdb predictions/fpocket/1a82a_out/1a82a_out.pdb liganated/1aaxa.pdb predictions/fpocket/1aaxa_out/1aaxa_out.pdb ~~~ -Dataset that allows to define pairs of liganated protein and binding site pedictions for this protein made by some prediction method, in this case Fpocket. +A dataset that defines pairs of liganated protein and binding site predictions for this protein made by some prediction method, in this case, Fpocket. It is used for rescoring and evaluating predictions of other methods (using `prank rescore `). \ No newline at end of file diff --git a/distro/models/readme.md b/distro/models/readme.md index 64f44e92..28185065 100644 --- a/distro/models/readme.md +++ b/distro/models/readme.md @@ -1,9 +1,9 @@ Directory with pre-trained models. -Prank looks here for model specified by (`-model`/`-m`) parameter. +Prank looks here for the model specified by (`-model`/`-m`) parameter. -Model should be always used only in combination with the parameters or config file that was used to train it. +The model should be always used only in combination with the parameters or config file that was used to train it. I.e.: the feature extraction has to be executed with the same parameters. 
## List of models diff --git a/distro/test_data/concavity.ds b/distro/test_data/concavity.ds index b4aa6e6e..a4f032cb 100644 --- a/distro/test_data/concavity.ds +++ b/distro/test_data/concavity.ds @@ -2,7 +2,7 @@ PARAM.PREDICTION_METHOD=concavity -# specifies that ligands are separated by TER record (relevant only for lagacy CHEN11 and derived datasets) +# specifies that ligands are separated by TER record (relevant only for legacy CHEN11 and derived datasets) PARAM.LIGANDS_SEPARATED_BY_TER=true HEADER: prediction protein diff --git a/distro/test_data/specified-ligands-2.ds b/distro/test_data/specified-ligands-2.ds index f3d2424b..336ef0e7 100644 --- a/distro/test_data/specified-ligands-2.ds +++ b/distro/test_data/specified-ligands-2.ds @@ -1,4 +1,4 @@ -# Dataset that explicitely specifies which ligands should be considered in training and evaluation phase. +# Dataset that explicitly specifies which ligands should be considered in the training and evaluation phase. # Ligands are declared as a list of PDB HET group names with optional specifiers. HEADER: protein ligands diff --git a/distro/test_data/specified-ligands.ds b/distro/test_data/specified-ligands.ds index da8e2e52..95c9a714 100644 --- a/distro/test_data/specified-ligands.ds +++ b/distro/test_data/specified-ligands.ds @@ -1,5 +1,5 @@ -# Dataset that explicitely specifies which ligands should be considered in training and evaluation phase. -# Ligands are declared as list of PDB HET group names separated with a coma without spaces. +# Dataset that explicitly specifies which ligands should be considered in the training and evaluation phase. +# Ligands are declared as a list of PDB HET group names separated with a comma without spaces. 
HEADER: protein ligands diff --git a/misc/tutorials/feature-setup.md b/misc/tutorials/feature-setup.md index 20626f68..82413964 100644 --- a/misc/tutorials/feature-setup.md +++ b/misc/tutorials/feature-setup.md @@ -9,10 +9,10 @@ P2Rank version: 2.3-dev.1 ## Introduction P2Rank is based on predicting scores of SAS points that are described by feature vectors. -Feature vector is basically an array of real numbers (`double[]`) with a header (i.e. each element has a unique name). +A feature vector is basically an array of real numbers (`double[]`) with a header (i.e. each element has a unique name). -P2Rank comes with a set of impelmented feature calculators. -Each calculator has a name and calculates an array of certain length (e.g. for `volsite` n=5, `bfactor` n=1). +P2Rank comes with a set of implemented feature calculators. +Each calculator has a name and calculates an array of a certain length (e.g. for `volsite` n=5, `bfactor` n=1). We will use the term *feature* for feature calculator (e.g. `chem`) and *sub-feature* for an individual element - single scalar number (e.g. `chem.atoms`). @@ -38,7 +38,7 @@ Note that the syntax for list-of-strings parameter value is different on the com #### Check enabled features -To check which features are enabled for particular configuraion run `print features` command: +To check which features are enabled for a particular configuration run `print features` command: ```bash ./prank print features ``` @@ -109,28 +109,28 @@ Effective feature vector header (i.e. 
enabled sub-features): If you want to add new features that are not implemented in P2Rank you have 3 options: * Implement a new feature calculator in Java or Groovy - * this is not too difficult and has an advantage that feature will be calculated automatically for new datsets + * this is not too difficult and has an advantage that feature will be calculated automatically for new datasets * For introduction see [new feature tutorial](new-feature-evaluation-tutorial.md) * Provide custom atom type and residue type tables for `atom_table` and `residue_table` features - * allow to define values for residue types and atom types + * allow defining values for residue types and atom types * residue types are: (ALA,ARG,ASN,...) * atom types are: (ALA.C,ALA.CA,ALA.CB,...) * useful only if the values are the same for all proteins in the dataset (for example: hydrophobicity index of amino acids). * see example tables: `aa-propensities.csv` and `atomic-properties.csv` * NOTE: providing custom tables is not implemented yet (planned for 2.3-dev.2) * Use `csv` feature - * allows to define values for evary protein residue and/or every protein atom (for each protein separately) via external csv files + * allows defining values for every protein residue and/or every protein atom (for each protein separately) via external csv files * disadvantage: csv files must be manually calculated for each dataset * Configuration: * looks for csv files named `{peorein_file_name}.csv` in directories defined in `-feat_csv_directories` parameter * enabled value columns from csv files must be declared in `-feat_csv_columns` - * `-feat_csv_ignore_missing` allows to ignore missing csv files, columns and rows + * `-feat_csv_ignore_missing` allows ignoring missing csv files, columns and rows * TODO: add more detailed documentation for csv feature ## Filtering features You can selectively enable/disable certain features and sub-features with `-feature_filters` parameter. 
-Filters are applied only to features that are first enabled by `-features` prameter. +Filters are applied only to the features that are first enabled by `-features` parameter. If the value of `-feature_filters` is empty, all sub-features are used (i.e no filtering is applied). Examples of individual filters: @@ -145,7 +145,7 @@ Filters are applied sequentially. If the first filter starts with `-`, everything is implicitly enabled. Otherwise, everything is implicitly disabled. For example: -* `-feature_filters '(-chem.atoms)'` - include everything excape `chem.atoms` +* `-feature_filters '(-chem.atoms)'` - include everything except `chem.atoms` * `-feature_filters '(chem.atoms)'` - include only `chem.atoms` @@ -201,5 +201,5 @@ Example: ``` ./prank ploop -t train.ds -e eval.ds -loop 10 -feature_filters '((-chem.*),(-chem.atoms,-chem.ploar),(protrusion.*,bfactor.*))' ``` -This command will run train-eval experiments for 3 dfferent feature setups by applying different list of feature filters. -For each feature setup it will run 10 train-eval cycles (using different random seed) and calculate average results. +This command will run train-eval experiments for 3 different feature setups by applying a different list of feature filters. +For each feature setup, it will run 10 train-eval cycles (using different random seed) and calculate average results. diff --git a/misc/tutorials/hidden-commands.md b/misc/tutorials/hidden-commands.md index 7f8d3261..65c46ad8 100644 --- a/misc/tutorials/hidden-commands.md +++ b/misc/tutorials/hidden-commands.md @@ -41,7 +41,7 @@ Analyze a dataset with an explicitly specified residue labeling. 
~~~ -# predict using model trained with conservation +# predict using the model trained with conservation ./prank.sh eval-predict ../p2rank-datasets/coach420.ds -l conserv -out_subdir CONS \ -c distro/config/conservation \ diff --git a/misc/tutorials/hyperparameter-optimization-tutorial.md b/misc/tutorials/hyperparameter-optimization-tutorial.md index 3a48dfa4..a79c29df 100644 --- a/misc/tutorials/hyperparameter-optimization-tutorial.md +++ b/misc/tutorials/hyperparameter-optimization-tutorial.md @@ -4,8 +4,8 @@ P2Rank has routines for optimizing arbitrary parameters with Grid and Bayesian o Here by hyper-parameters we mean actual hyper-parameters of the machine learning models (eg. number of trees in RF) but also any arbitrary parameter ot the whole algorithm. -Comprehensive list of all parameters with descriptions is in `Params.groovy`. - +To see the complete commented list of all (including undocumented) +parameters, see [Params.groovy](https://github.com/rdk/p2rank/blob/develop/src/main/groovy/cz/siret/prank/program/params/Params.groovy) in the source code. **Grid optimization**: * generates plots for all stats @@ -20,9 +20,9 @@ Comprehensive list of all parameters with descriptions is in `Params.groovy`. ## Grid optimization (ploop command) P2Rank allows you to iterate experiments (train/eval and crossvalidation) through lists of different parameter values on the command line. -For that you need to use `prank ploop` command and list or range expression instead of param value for one or more params. +For that, you need to use the `prank ploop` command and list or range expression instead of param value for one or more params. -Supported parameter types: numerical, boolean, string and 'list of strings' (e.g feature set). +Supported parameter types: numerical, boolean, string, and 'list of strings' (e.g. value of param `-features` has type 'list of strings'). 
#### Defining grid **List expression**: `(val1,val2,...)` @@ -50,7 +50,7 @@ Related parameters: ### R plots In case you optimize exactly 1 or 2 parameters, P2Rank will try to produce plots of various statistics using R language. -For that you need to have `Rscript` on the Path. Some libraries in R need to be installed. +For that, you need to have `Rscript` on the PATH. Some libraries in R need to be installed first. ~~~sh sudo apt install r-base sudo R -e "install.packages('ggplot2', dependencies=TRUE, repos='http://cran.us.r-project.org')" @@ -85,9 +85,9 @@ Feature set comparisons: ## Bayesian optimization (hopt command) -Hopt command (`p2rank hopt`) implements Bayesian optimization using program Speramint. -(Other optimization tools might be employed in similar fashion with little additional work. -See how integration wihth Spearmint is implemented in HSpearmintOptimizer.groovy). +Hopt command (`prank hopt`) implements Bayesian optimization using the program *Spearmint*. +(Other optimization tools might be employed in a similar fashion with little additional work. +See how integration with *Spearmint* is implemented in HSpearmintOptimizer.groovy). Supported parameter types: numerical, boolean. diff --git a/misc/tutorials/new-feature-evaluation-tutorial.md b/misc/tutorials/new-feature-evaluation-tutorial.md index 4bf47373..0670adc3 100644 --- a/misc/tutorials/new-feature-evaluation-tutorial.md +++ b/misc/tutorials/new-feature-evaluation-tutorial.md @@ -5,13 +5,13 @@ Read this if you want to implement a new feature and evaluate if it contributes ## Implementation New features can be added by implementing `FeatureCalculator` interface and registering the implementation in `FeatureRegistry`. -You can implement the feature by extending one of convenience abstract classes `AtomFeatureCalculator` or `SasFeatureCalculator`. +You can implement the feature by extending one of the convenience abstract classes `AtomFeatureCalculator` or `SasFeatureCalculator`. 
You need to decide if the new feature will be associated with protein surface (i.e. solvent exposed) atoms or with SAS (Solvent Accessible Surface) points. P2Rank works by classifying SAS point feature vectors. If you associate the feature with atoms its value will be projected to SAS point feature vectors by P2Rank from neighbouring atoms. -Some features are more easily defined for atoms than SAS points and other way around. See `BfactorFeature` and `ProtrusionFeature` for comparison. +Some features are more naturally defined for atoms rather than for SAS points and the other way around. See `BfactorFeature` and `ProtrusionFeature` for comparison. ## Evaluation diff --git a/misc/tutorials/training-tutorial.md b/misc/tutorials/training-tutorial.md index d6ec3d1d..4fb8d6e2 100644 --- a/misc/tutorials/training-tutorial.md +++ b/misc/tutorials/training-tutorial.md @@ -1,7 +1,7 @@ # P2Rank model training and optimization tutorial -This file provides introduction for people who want to train and evaluate their own models or optimize different parameters of the algorithm. +This file provides an introduction for people who want to train and evaluate their own models or optimize different parameters of the algorithm. ## Kick-start examples @@ -64,43 +64,43 @@ Related parameters: Memory consumption can be drastically influenced by some parameters. -Random Forest implementations train trees in parallell using number of threads defined in`-rf_threads` variable. -Ideally, this would be set to number of CPU cores in the machine. -However, required memory during training grows linearly with number trees trained in paralell (`-rf_threads`), +Random Forest implementations train trees in parallel using the number of threads defined in `-rf_threads` variable. +Ideally, this would be set to the number of physical CPU cores in the machine. 
+However, required memory during training grows linearly with the number of trees trained in parallel (`-rf_threads`), so you may need to lower the number of threads. Parameters that influence memory/time trade-off: * `-cache_datasets` determines whether datasets of proteins are kept in memory between runs**. Related parameters: - `-clear_prim_caches` clear primary caches (protein structures) between runs (when iterating params or seed) - `-clear_sec_caches` clear secondary caches (protein surfaces etc.) between runs (when iterating params or seed) -* `-rf_threads` number of trees trained in parallell +* `-rf_threads` number of trees trained in parallel * `-rf_trees`, `-fr_depth` influence the size of the model in memory * `-crossval_threads` when running crossvalidation it determines how many models are trained at the same time. Set to `1` if you don't have enough memory. * `-cache_datasets `: keep datasets (structures and SAS points) in memory between crossval/traineval iterations. - For single pass training (`-loop 1`) it does not make sense to keep it on. - Turn off when evaluating model on huge datasets that won't fit to memory (e.g. whole PDB). + For single-pass training (`-loop 1`) it does not make sense to keep it on. + Turn off when evaluating the model on huge datasets that won't fit in memory (e.g. whole PDB). When switched off it will leave more memory for RF at the cost of needing to parse all structure files (PDBs) again. Additional notes: -* Subsampling and supersampling influence the size of training vercor dataset and required memory (see _Dealing with class imbalances_). +* Subsampling and supersampling influence the size of training vector dataset and required memory (see _Dealing with class imbalances_). * Memory also grows linearly with "bag size" (`-rf_bagsize`) but this would generally be in range (50%-100%). * Keep in mind how JVM deals with compressed OOPs. Basically it doesn't make sense to have heap size between 32G and ~48G. 
### Historical note on the dataset format -(This secton should be moved no historical notes as soon as there will be new default P2Rank model.) +(This section should be moved to historical notes as soon as there will be a new default P2Rank model.) Parameter `-sample_negatives_from_decoys` determines how points are sampled from the proteins in a training dataset. If `sample_negatives_from_decoys = false` all of the points from the protein surface are used. -If `sample_negatives_from_decoys = true` only points from decoy pockets (not true ligand binding sites found by other method like Fpocket) are used. -For that **you need to supply a training dataset that contains pocket predictions by other method** (i.e. for predictions of Fpocket use `joined-fpocket.ds` instead of `joined.ds`). +If `sample_negatives_from_decoys = true` only points from decoy pockets (false-positive ligand binding sites found by other methods like Fpocket) are used. +For that **you need to supply a training dataset that contains pocket predictions by another method** (i.e. for predictions of Fpocket use `joined-fpocket.ds` instead of `joined.ds`). `sample_negatives_from_decoys = true` in combination with Fpocket predictions was historically giving slightly better results. It focuses the classifier to learn to distinguish between true and decoy pockets which is, in theory, a harder task than to distinguish between ligandable vs. unligandable protein surface. It also changes the ratio of sampled positives/negatives in favour of positives. -I recent versions it might be possible to achieve better results by training from whole protein surface in combination with class balancing techniques (see the next section). +In recent versions it might be possible to achieve better results by training from the whole protein surface in combination with class balancing techniques (see the next section). 
Note that default values of other parameters (related to feature extraction and classification results aggregation) were optimized for the case where `sample_negatives_from_decoys = true`. Here are the most relevant ones (for descriptions see `Params.groovy`): @@ -110,7 +110,7 @@ Here are the most relevant ones (for descriptions see `Params.groovy`): * `-pred_point_threshold` * `-pred_min_cluster_size` -Their values may need to be optimized again for case of `sample_negatives_from_decoys = false`. +Their values may need to be optimized again for the case when `sample_negatives_from_decoys = false`. ### Dealing with class imbalance @@ -134,7 +134,7 @@ Ways to deal with class imbalances: ## Crossvalidation -To run crossvalidation on a single dataset use `prank crossval` command. +To run crossvalidation on a single dataset use the `prank crossval` command. Example: ~~~sh @@ -149,7 +149,7 @@ Related parameters: ## Output directory location -Location of output directory for any given run is influenced by several parameters. You can organize results of your experiments with their help. +The location of the output directory for any given run is influenced by several parameters. You can organize the results of your experiments with their help. * `-output_base_dir `: top-level default output directory * `-out_subdir `: subdirectory of output_base_dir (optional)