Feature setup documentation
This file describes feature vector configuration and provides an introduction to adding new features. It is useful only for training and evaluating new models.
P2Rank version: 2.3-dev.1
Introduction
P2Rank is based on predicting scores of SAS points that are described by feature vectors.
A feature vector is basically an array of real numbers (double[]) with a header (i.e. each element has a unique name).
P2Rank comes with a set of implemented feature calculators.
Each calculator has a name and calculates an array of a certain length (e.g. for volsite n=5, bfactor n=1).
We will use the term feature for a feature calculator (e.g. chem) and sub-feature for an individual element, i.e. a single scalar number (e.g. chem.atoms).
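To make the terminology concrete, here is a minimal Python sketch (not P2Rank code) of how per-calculator arrays could be concatenated into one named feature vector. The `CALCULATORS` table and the `build_header` helper are illustrative assumptions with shortened sub-feature lists; only the `feature.sub_feature` naming scheme is taken from this document.

```python
# Illustrative sketch only; P2Rank itself is implemented in Java/Groovy.
# The sub-feature lists below are shortened, hypothetical samples.
CALCULATORS = {
    "chem": ["hydrophobic", "hydrophilic", "atoms"],
    "protrusion": ["protrusion"],
    "bfactor": ["bfactor"],
}

def build_header(enabled):
    """Concatenate sub-feature names of the enabled calculators, in order."""
    return [f"{feat}.{sub}" for feat in enabled for sub in CALCULATORS[feat]]

header = build_header(["chem", "bfactor"])
vector = [0.0] * len(header)  # one real number per header element
print(list(zip(header, vector)))
```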
Feature configuration
Composition of the feature vector is influenced by the following parameters:
- `-features` - lists enabled feature calculators
  - default: `(chem,volsite,protrusion,bfactor)`
  - `atom_table` and `residue_table` features are implicitly enabled by default
- `-atom_table_features` and `-residue_table_features` - determine which columns from atom type and residue type tables are enabled
- `-feature_filters` - see "Filtering features" section below
Configuration syntax
Note that the syntax for list-of-strings parameter value is different on the command line and in a *.groovy config file:
- command line: `-features '(chem,volsite,protrusion,bfactor)'`
- config file: `features = ['chem','volsite','protrusion','bfactor']`
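The difference between the two syntaxes can be illustrated with a small Python sketch that renders the same list both ways. The helper names `to_cli` and `to_groovy` are hypothetical and not part of P2Rank; they only reproduce the two forms shown above.

```python
def to_cli(values):
    """Render a list of strings in the command-line form: (a,b,c)."""
    return "(" + ",".join(values) + ")"

def to_groovy(values):
    """Render a list of strings in the config-file form: ['a','b','c']."""
    return "[" + ",".join(f"'{v}'" for v in values) + "]"

features = ["chem", "volsite", "protrusion", "bfactor"]
print("-features '" + to_cli(features) + "'")   # command-line argument
print("features = " + to_groovy(features))      # line for a *.groovy config file
```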
Check enabled features
To check which features are enabled for a particular configuration, run the print features command:
./prank print features
Example: default feature setup:
$ ./prank print features
----------------------------------------------------------------------------------------------
P2Rank 2.3-dev.1
----------------------------------------------------------------------------------------------
Effectively enabled features:
chem
volsite
protrusion
bfactor
atom_table
Effective feature vector header (i.e. enabled sub-features):
0: chem.hydrophobic
1: chem.hydrophilic
2: chem.hydrophatyIndex
3: chem.aliphatic
4: chem.aromatic
5: chem.sulfur
6: chem.hydroxyl
7: chem.basic
8: chem.acidic
9: chem.amide
10: chem.posCharge
11: chem.negCharge
12: chem.hBondDonor
13: chem.hBondAcceptor
14: chem.hBondDonorAcceptor
15: chem.polar
16: chem.ionizable
17: chem.atoms
18: chem.atomDensity
19: chem.atomC
20: chem.atomO
21: chem.atomN
22: chem.hDonorAtoms
23: chem.hAcceptorAtoms
24: volsite.vsAromatic
25: volsite.vsCation
26: volsite.vsAnion
27: volsite.vsHydrophobic
28: volsite.vsAcceptor
29: volsite.vsDonor
30: protrusion.protrusion
31: bfactor.bfactor
32: atom_table.apRawValids
33: atom_table.apRawInvalids
34: atom_table.atomicHydrophobicity
----------------------------------------------------------------------------------------------
finished successfully in 0 hours 0 minutes 1.044 seconds
----------------------------------------------------------------------------------------------
Adding new features
If you want to add new features that are not implemented in P2Rank, you have 3 options:
- Implement a new feature calculator in Java or Groovy
  - this is not too difficult and has the advantage that the feature will be calculated automatically for new datasets
  - for an introduction see the new feature tutorial
- Provide custom atom type and residue type tables for `atom_table` and `residue_table` features
  - allow defining values for residue types and atom types
    - residue types are: (ALA,ARG,ASN,...)
    - atom types are: (ALA.C,ALA.CA,ALA.CB,...)
  - useful only if the values are the same for all proteins in the dataset (for example: hydrophobicity index of amino acids)
  - see example tables: `aa-propensities.csv` and `atomic-properties.csv`
  - NOTE: providing custom tables is not implemented yet (planned for 2.3-dev.2)
- Use the `csv` feature
  - allows defining values for every protein residue and/or every protein atom (for each protein separately) via external csv files
  - disadvantage: csv files must be manually calculated for each dataset
  - Configuration:
    - looks for csv files named `{protein_file_name}.csv` in directories defined in the `-feat_csv_directories` parameter
    - enabled value columns from csv files must be declared in `-feat_csv_columns`
    - `-feat_csv_ignore_missing` allows ignoring missing csv files, columns and rows
  - TODO: add more detailed documentation for csv feature
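The file lookup described for the csv feature might work roughly as in the following Python sketch. This is an assumption-laden illustration, not P2Rank's implementation: the function `find_feature_csv`, the "first matching directory wins" order, and the made-up protein file name `1abc.pdb` are all hypothetical, and the csv content is left empty because its format is not documented here.

```python
import tempfile
from pathlib import Path

def find_feature_csv(protein_file_name, directories, ignore_missing=False):
    """Return the first existing {protein_file_name}.csv, searching directories in order."""
    for d in directories:
        candidate = Path(d) / f"{protein_file_name}.csv"
        if candidate.is_file():
            return candidate
    if ignore_missing:
        return None  # assumed analogue of -feat_csv_ignore_missing for missing files
    raise FileNotFoundError(f"no csv file found for {protein_file_name}")

# Demo with a temporary directory and a made-up protein file name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1abc.pdb.csv").write_text("")  # empty placeholder file
    found = find_feature_csv("1abc.pdb", [tmp])
    missing = find_feature_csv("2xyz.pdb", [tmp], ignore_missing=True)

print(found.name, missing)
```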
Filtering features
You can selectively enable/disable certain features and sub-features with the -feature_filters parameter.
Filters are applied only to the features that are first enabled by the -features parameter.
If the value of -feature_filters is empty, all sub-features are used (i.e. no filtering is applied).
Examples of individual filters:
- `*` - include all
- `chem.*` - include all with prefix "chem."
- `-chem.*` - exclude all with prefix "chem."
- `chem.hydrophobicity` - include particular sub-feature
- `-chem.hydrophobicity` - exclude particular sub-feature
Filters are applied sequentially.
If the first filter starts with -, everything is implicitly enabled. Otherwise, everything is implicitly disabled.
For example:
- `-feature_filters '(-chem.atoms)'` - include everything except `chem.atoms`
- `-feature_filters '(chem.atoms)'` - include only `chem.atoms`
Further examples:
- `-feature_filters '()'` - include all
- `-feature_filters '(*)'` - include all
- `-feature_filters '(*,-chem.*)'` - include all except those with prefix "chem."
- `-feature_filters '(-chem.*)'` - include all except those with prefix "chem."
- `-feature_filters '(-chem.*,chem.hydrophobicity)'` - include all except those with prefix "chem.", but include "chem.hydrophobicity"
- `-feature_filters '(chem.hydrophobicity)'` - include only "chem.hydrophobicity"
- `-feature_filters '(chem.*,-chem.hydrophobicity,-chem.atoms)'` - include only those with prefix "chem.", except "chem.hydrophobicity" and "chem.atoms"
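The filtering rules described above can be sketched in a few lines of Python. This is a minimal model of the documented behavior, not P2Rank's actual code: the functions `matches` and `apply_filters` are hypothetical, and the sketch assumes that `*` only ever appears at the end of a pattern, as in all examples in this section.

```python
def matches(pattern, name):
    """'*' matches everything, 'prefix*' matches by prefix, otherwise exact match."""
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    return name == pattern

def apply_filters(header, filters):
    """Return the sub-features left after applying the filters in order."""
    if not filters:
        return list(header)  # empty filter list: no filtering is applied
    # The first filter decides the implicit starting state:
    # leading '-' means everything starts enabled, otherwise disabled.
    enabled = {name: filters[0].startswith("-") for name in header}
    for f in filters:
        include = not f.startswith("-")
        pattern = f if include else f[1:]
        for name in header:
            if matches(pattern, name):
                enabled[name] = include
    return [name for name in header if enabled[name]]

# Shortened sample header using sub-feature names from the default setup.
header = ["chem.hydrophobic", "chem.atoms", "volsite.vsDonor", "bfactor.bfactor"]
print(apply_filters(header, ["-chem.*", "chem.atoms"]))
```

Because later filters override earlier ones, the order of filters matters: `(-chem.*,chem.atoms)` keeps `chem.atoms`, while `(chem.atoms,-chem.*)` would leave nothing enabled.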
Example: `-feature_filters '(chem.atoms,volsite.*,bfactor.*)'`:
$ ./prank print features -features '(chem,volsite,bfactor)' -feature_filters '(chem.atoms,volsite.*,bfactor.*)'
----------------------------------------------------------------------------------------------
P2Rank 2.3-dev.1
----------------------------------------------------------------------------------------------
Effectively enabled features (after filtering):
chem
volsite
bfactor
Effective feature vector header (i.e. enabled sub-features):
0: chem.atoms
1: volsite.vsAromatic
2: volsite.vsCation
3: volsite.vsAnion
4: volsite.vsHydrophobic
5: volsite.vsAcceptor
6: volsite.vsDonor
7: bfactor.bfactor
----------------------------------------------------------------------------------------------
finished successfully in 0 hours 0 minutes 1.043 seconds
----------------------------------------------------------------------------------------------
Filtering and grid optimization
You can use the -feature_filters parameter in combination with grid optimization (ploop command).
For details see the hyperparameter optimization tutorial.
Example:
./prank ploop -t train.ds -e eval.ds -loop 10 -feature_filters '((-chem.*),(-chem.atoms,-chem.polar),(protrusion.*,bfactor.*))'
This command will run train-eval experiments for 3 different feature setups, each defined by a different list of feature filters. For each feature setup, it will run 10 train-eval cycles (each with a different random seed) and calculate average results.
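To see why a value shaped like the one in the command above yields 3 feature setups, the following Python sketch splits a list-of-lists value into its individual filter lists. The `parse_filter_grid` function is a hypothetical helper written for this illustration; how P2Rank parses the value internally is not covered here.

```python
def parse_filter_grid(value):
    """Split '((a,b),(c))' into [['a', 'b'], ['c']] (one nesting level only)."""
    inner = value.strip()
    if not (inner.startswith("((") and inner.endswith("))")):
        raise ValueError("expected a list of lists like '((a),(b,c))'")
    groups = inner[2:-2].split("),(")
    return [[item for item in g.split(",") if item] for g in groups]

grid = parse_filter_grid("((-chem.*),(-chem.atoms,-chem.polar),(protrusion.*,bfactor.*))")
for i, filters in enumerate(grid, start=1):
    print(f"setup {i}: {filters}")  # each setup is one list of feature filters
```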