p2rank/documentation/export-pocket-descriptors.md
rdk 6fad858bc6 Audit follow-ups: bug fix, doc refresh, exception taxonomy, test hardening
Bug fix:
- PrincipalMomentsDescriptor.clampNonNegative now also clamps NaN. The
  v<0 check was false for NaN, so a NaN eigenvalue (possible if a future
  code path bypasses GridGenerator.isFiniteBox) would have propagated
  to the CSV output.

Doc refresh:
- breaking-changes.md: 2.6 entry for the multi-column descriptor
  migration + the -vis_pocket_grid / pocket_grid_vis_* renames.
- export-pocket-descriptors.md: step 4 rewrites a self-contradicting
  rationale — adding to the default list IS a breaking change for
  index-based parsers; recommends parse-by-name + breaking-changes.md
  note for future additions.
- export-pocket-grid.md: added "Adding a new per-grid-point descriptor"
  recipe (parallel to the per-pocket one); unified √3/2 precision to
  0.866 across docs and Params.groovy.
- README.md: added an "Opt-in tabular exports" subsection mentioning
  -export_pocket_descriptors, -export_pocket_grid, -vis_pocket_grid.
- testsets.sh "Full descriptor menu" now lists all seven shipped
  descriptors (was six).

Exception taxonomy:
- PocketDescriptorsRows.groovy and PocketGridBuilder.java now throw
  PrankException (was IllegalArgumentException) for user-facing config
  errors, matching the rest of the codebase.

Registry hardening:
- Both PocketDescriptorRegistry and PocketGridPointDescriptorRegistry
  now assert columnNames.size() == columnTypes.size() in register().
  A future descriptor with mismatched lists fails fast at class-load.

Quality fixes:
- PocketGridRows.getColumn uses BASE_COLS-1 instead of literal 3 for
  the pocket column. Removed dead 2-arg PocketGridRows constructor
  (only 3 test sites used it; now inlined).
- PocketGridPointContext gets a compact-constructor validator that
  rejects negative pointIndex/pocketRank, limiting blast radius of an
  int-arg swap.

Test hardening:
- VolsiteSmoothGridPointDescriptorTest + VolsiteGridPointDescriptorTest
  now pin sigma/radius in @BeforeEach AND restore in @AfterEach, so
  the Params singleton is clean for subsequent test classes.
- New tests: HIS ND1 double-flag (single atom setting donor+acceptor),
  PrincipalMoments at cardinality=2, PrincipalMoments two coincident
  points, GridGenerator NaN-box throw, PocketDescriptorRegistry
  register/unregister round-trip, MorphologicalCloser maxIters=1.
- Renamed respectsMaxIters → maxItersZeroIsNoOp (the test only covered
  the maxIters=0 case despite the general name); added maxIters=1
  companion that verifies one iteration of fill actually runs.
- Extracted RendererTestFixtures.tinyGrid (was byte-identical in both
  renderer test files); unified the volsite atomAt signatures so the
  parameter order can't get swapped between the two volsite tests.
2026-05-19 15:36:12 +02:00

# Exporting Pocket Descriptors
Per-pocket geometric/chemical descriptors (volume, sphericity, residue
counts, etc.) are written to a tabular file alongside any `predict` or
`rescore` run when `-export_pocket_descriptors` is on.
> **Cost note.** Most descriptors (`volume`, `sphericity`,
> `radius_of_gyration`, `num_grid_points`, `principal_moments`) are derived
> from the pocket grid, so selecting any of them triggers the full grid
> build (lattice generation + per-pocket assignment + shape fill) even with
> `-export_pocket_grid 0`: that flag only suppresses the per-protein grid
> file, not the computation. `num_residues` and `num_surface_atoms` do
> **not** need the grid; selecting only those two with
> `-export_pocket_grid 0` skips the grid build entirely, which makes the
> descriptors export nearly free.
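
The rule in the cost note can be read as a single predicate over the selected descriptor names. The sketch below is only an illustration of that rule, not the P2Rank orchestrator: `GridPlanning`, `NEEDS_GRID`, and `gridBuildRequired` are hypothetical names, and the grid dependence of each descriptor is taken from the descriptions in this document.

```java
import java.util.List;
import java.util.Map;

final class GridPlanning {
    // Grid dependence per descriptor name, as described in this document.
    static final Map<String, Boolean> NEEDS_GRID = Map.of(
        "volume", true,
        "sphericity", true,
        "radius_of_gyration", true,
        "num_grid_points", true,
        "principal_moments", true,
        "num_residues", false,
        "num_surface_atoms", false);

    /**
     * The grid is built when the grid file itself is exported, or when at
     * least one selected descriptor reads the grid.
     */
    static boolean gridBuildRequired(List<String> selected, boolean exportPocketGrid) {
        for (String name : selected) {
            if (!NEEDS_GRID.containsKey(name)) {
                throw new IllegalArgumentException("Unknown descriptor: " + name);
            }
        }
        return exportPocketGrid
            || selected.stream().anyMatch(name -> NEEDS_GRID.get(name));
    }
}
```

With this predicate, `gridBuildRequired(List.of("num_residues", "num_surface_atoms"), false)` is the only kind of call that returns `false`; adding any grid-derived name, or turning the grid export on, flips it to `true`.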
## Quick start
```bash
# Default: every shipped descriptor (num_residues, num_surface_atoms,
# num_grid_points, volume, sphericity, radius_of_gyration, principal_moments)
prank predict -f protein.pdb -export_pocket_descriptors 1
# Narrow set + tighter grid for more accurate volume/sphericity
prank predict dataset.ds -export_pocket_descriptors 1 \
-pocket_descriptors "volume,sphericity" \
-pocket_grid_spacing 0.75
# Rescoring path also supports it
prank rescore fpocket.ds -export_pocket_descriptors 1
```
## Output format
One row per predicted pocket.
| Column | Type | Notes |
|---|---|---|
| `name` | string | `pocket.name` (e.g. `pocket.1`) |
| `rank` | i32 | 1-based pocket rank |
| `score` | f64 | Raw P2Rank pocket score |
| `probability` | f64 | Calibrated probability from the score transformer. **Column is omitted entirely** when no transformer ran |
| `center_x`, `center_y`, `center_z` | f64 | Pocket centroid coordinates |
| *(one or more columns per requested descriptor)* | f64 / i32 | See descriptor catalog below |

Descriptor columns appear in the order given on the command line via
`-pocket_descriptors`. Most descriptors emit a single column whose header is
the descriptor name; multi-column descriptors emit N columns prefixed with
`"{name}."` (e.g. `principal_moments.lambda1`, `principal_moments.lambda2`,
`principal_moments.lambda3`).
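
Because the column set depends on `-pocket_descriptors` (and `probability` can be absent), consumers are safer looking cells up by header name. A minimal sketch of such a reader follows. It assumes the file has already been decompressed into plain comma-separated text with a header row and no quoted fields; `DescriptorTable` and its methods are hypothetical names, not part of P2Rank.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DescriptorTable {
    private final Map<String, Integer> columnIndex = new LinkedHashMap<>();
    private final List<String[]> rows = new ArrayList<>();

    /** Parses a header line plus data rows from plain CSV text (no quoted fields assumed). */
    DescriptorTable(String csv) {
        String[] lines = csv.trim().split("\\R");
        String[] header = lines[0].split(",", -1);
        for (int i = 0; i < header.length; i++) {
            columnIndex.put(header[i].trim(), i);
        }
        for (int r = 1; r < lines.length; r++) {
            rows.add(lines[r].split(",", -1));
        }
    }

    /** True if the header contains the column, e.g. to test for "probability". */
    boolean hasColumn(String name) {
        return columnIndex.containsKey(name);
    }

    int rowCount() {
        return rows.size();
    }

    /** Looks a cell up by column name, so reordered or added columns do not break callers. */
    double getDouble(int row, String column) {
        Integer idx = columnIndex.get(column);
        if (idx == null) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        return Double.parseDouble(rows.get(row)[idx].trim());
    }
}
```

A caller would then write `table.getDouble(0, "principal_moments.lambda1")` rather than relying on a fixed position, and guard optional columns with `table.hasColumn("probability")`.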
## Descriptor catalog
| Name | Columns | Definition |
|---|---|---|
| `volume` | 1 × f64 | Pocket volume in **ų**: `\|assigned grid points\| × pocket_grid_spacing³`. Accuracy scales with the lattice spacing (smaller `pocket_grid_spacing` → finer estimate). |
| `sphericity` | 1 × f64 ∈ [0, 1] | `V_pocket / V_bounding_sphere`. Bounding sphere is centered at the **centroid of the pocket's grid points** (not `pocket.centroid` which is atom-derived); radius is the max distance from that centroid. Quantization-free. 1 = perfect sphere; ≪ 1 = elongated / irregular. |
| `radius_of_gyration` | 1 × f64 | Radius of gyration in **Å**: `sqrt(mean(\|r_i - r_cm\|²))` over the pocket's grid points (equal weights). Absolute spatial extent — pairs well with `sphericity`, which only captures compactness. `0` for empty / single-point pockets. |
| `num_residues` | 1 × i32 | Number of distinct residues touching the pocket (reuses `Pocket.getResidues()`). |
| `num_surface_atoms` | 1 × i32 | Size of `pocket.surfaceAtoms`. |
| `num_grid_points` | 1 × i32 | Total grid points assigned to the pocket (cardinality of the BitSet after shape fill). Raw count complement to `volume`. |
| `principal_moments` | 3 × f64 | Three eigenvalues of the pocket grid points' gyration tensor (equal-weight PCA), sorted descending: `principal_moments.lambda1` ≥ `principal_moments.lambda2` ≥ `principal_moments.lambda3`. Unit: Ų. Shape signature: λ₁≈λ₂≈λ₃ → sphere; λ₁≫λ₂,λ₃ → rod; λ₁≈λ₂≫λ₃ → disk. Sum equals `radius_of_gyration²`. `0`s for pockets with <2 grid points. |

`-pocket_descriptors` defaults to **all of the above**. The grid-derived
scalar descriptors share the same pocket-grid input, so adding or removing
them costs essentially nothing once the grid is built. `principal_moments`
adds a small 3×3 eigendecomposition per pocket — also negligible relative
to the grid build itself. To narrow the set, list the wanted names
comma-separated. Unknown names cause a fail-fast error at startup with
the list of registered names.
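
The catalog formulas for `volume` and `radius_of_gyration`, and the statement that the principal moments sum to `radius_of_gyration²`, can be checked numerically. The sketch below is an independent illustration under the definitions in the table, not the shipped implementation; `GridShape` and its methods are hypothetical names. It uses the fact that the eigenvalues of a symmetric tensor sum to its trace, so the diagonal of the gyration tensor is enough and no eigendecomposition is needed.

```java
final class GridShape {
    /** Volume in cubic angstroms: number of assigned grid points times the cell volume. */
    static double volume(int numGridPoints, double spacing) {
        return numGridPoints * spacing * spacing * spacing;
    }

    /** Equal-weight centroid of the grid points (each point is {x, y, z}). */
    static double[] centroid(double[][] pts) {
        double[] c = new double[3];
        for (double[] p : pts) {
            for (int k = 0; k < 3; k++) {
                c[k] += p[k] / pts.length;
            }
        }
        return c;
    }

    /**
     * Diagonal (S_xx, S_yy, S_zz) of the equal-weight gyration tensor.
     * Its sum is the trace, which equals lambda1 + lambda2 + lambda3.
     * Returns zeros for fewer than two points, mirroring the catalog.
     */
    static double[] gyrationTensorDiagonal(double[][] pts) {
        double[] diag = new double[3];
        if (pts.length < 2) {
            return diag;
        }
        double[] c = centroid(pts);
        for (double[] p : pts) {
            for (int k = 0; k < 3; k++) {
                double d = p[k] - c[k];
                diag[k] += d * d / pts.length;
            }
        }
        return diag;
    }

    /** sqrt(mean(|r_i - r_cm|^2)), i.e. the square root of the tensor trace. */
    static double radiusOfGyration(double[][] pts) {
        double[] diag = gyrationTensorDiagonal(pts);
        return Math.sqrt(diag[0] + diag[1] + diag[2]);
    }
}
```

For two points at `(0,0,0)` and `(2,0,0)` the centroid is `(1,0,0)`, the tensor diagonal is `(1,0,0)`, and the radius of gyration is 1 Å, so the trace and `radius_of_gyration²` agree.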
## Parameters
The descriptors file shares all of the pocket-grid params (the grid is
built once and reused). The descriptor-specific knobs are:
| Parameter | Default | Notes |
|---|---|---|
| `export_pocket_descriptors` | `false` | Master gate |
| `pocket_descriptors` | all shipped descriptors | List of descriptor names to compute. See catalog above. |
| `pocket_grid_format` | `csv.gz` | Same allowed values as the grid file |

The grid generator's params (`pocket_grid_spacing`, `_max_dist`,
`_atom_buffer`, `_assign_cutoff`, `_fill`, `_fill_*`) directly affect
the `volume` and `sphericity` descriptors — see
[`export-pocket-grid.md`](export-pocket-grid.md).
## Adding a new descriptor
Implementations live under
`src/main/groovy/cz/siret/prank/program/routines/predict/output/descriptors/`.
1. Implement the `PocketDescriptor` interface:
```java
String name();                           // CLI token and multi-column header prefix
List<String> columnNames();              // sub-names; ignored at output for single-column (scalar) descriptors
List<ColumnType> columnTypes();          // parallel to columnNames()
double[] compute(PocketGridContext ctx); // same length as columnNames()
boolean needsGrid();                     // default true; override to false if compute()
                                         // doesn't read ctx.grid() or ctx.gridPointIndices()
```
`PocketGridContext` exposes `pocket`, `protein`, `grid`, and the
per-pocket `gridPointIndices` set. If your `compute()` only reads
`ctx.pocket()` (i.e., domain fields like `surfaceAtoms` or `residues`),
override `needsGrid()` to return `false`; that lets the orchestrator
skip the full grid build when only grid-free descriptors are selected.

For **scalar** descriptors (one column), extend `AbstractScalarPocketDescriptor`
instead of implementing the interface directly; it reduces the boilerplate
to `name()`, `scalarType()`, and `computeScalar(ctx)`. Of the seven shipped
descriptors, six use this adapter; the seventh, `principal_moments`, is
multi-column.

For **multi-column** descriptors (e.g. `principal_moments` with three
eigenvalues from a single decomposition), implement `PocketDescriptor`
directly; output column headers are `"{name()}.{columnNames()[i]}"`.

2. Register the implementation in `PocketDescriptorRegistry`'s static
initializer (Java; no auto-discovery). The registry rejects descriptors
that declare duplicate `columnNames` at registration time.
3. Users can opt into it by name via
`-pocket_descriptors "volume,my_new_descriptor"`.
4. **To include it in the default output**, also add the name to the
`pocket_descriptors` default list in `Params.groovy`. The default is
declared explicitly (rather than derived from `Registry.knownNames()`)
so each addition to the default schema is a conscious choice — but
adding to the default IS a user-visible breaking change for anyone
parsing the output by column index. Two recommendations:
- Parse the descriptors file by column **name**, not by column index.
- When you add a descriptor to the default list, note it in
[`breaking-changes.md`](../breaking-changes.md).

Skip step 4 if the new descriptor is opt-in only.

INT columns return their value as a `double` that the writer downcasts at
output time, matching the existing `TableData` convention. Implementations
must guarantee the value fits in i32.
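
To make the recipe and the INT-column rule concrete, here is a rough sketch of a single-column INT descriptor. Everything in it is a local stand-in so the example compiles on its own: `ColumnType`, `PocketGridContext`, and `PocketDescriptor` only mirror the method list from step 1 and may differ from the real P2Rank types, and `DemoPointCount` and `IntColumns` are hypothetical names. The only assumption about the context is that `gridPointIndices()` returns a `java.util.BitSet`.

```java
import java.util.BitSet;
import java.util.List;

/** Local stand-ins mirroring the method list in step 1; the real types may differ. */
enum ColumnType { INT, DOUBLE }

interface PocketGridContext {            // reduced to what this sketch needs
    BitSet gridPointIndices();
}

interface PocketDescriptor {
    String name();
    List<String> columnNames();
    List<ColumnType> columnTypes();
    double[] compute(PocketGridContext ctx);
    default boolean needsGrid() { return true; }
}

final class IntColumns {
    /** Returns the value as a double after checking it survives a downcast to i32. */
    static double toIntColumn(long value) {
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            throw new ArithmeticException("value does not fit in i32: " + value);
        }
        return (double) value; // every i32 value is exactly representable as a double
    }
}

/** Hypothetical descriptor: counts the grid points assigned to the pocket. */
final class DemoPointCount implements PocketDescriptor {
    @Override public String name() { return "demo_point_count"; }
    @Override public List<String> columnNames() { return List.of("value"); }
    @Override public List<ColumnType> columnTypes() { return List.of(ColumnType.INT); }

    @Override public double[] compute(PocketGridContext ctx) {
        // One value per declared column; the count is range-checked before it is returned.
        return new double[] { IntColumns.toIntColumn(ctx.gridPointIndices().cardinality()) };
    }
}
```

Routing every INT value through one range check keeps the "fits in i32" guarantee in a single place instead of relying on each `compute()` to get it right.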
## See also
- [`export-pocket-grid.md`](export-pocket-grid.md) — the underlying
grid that volume/sphericity are computed against
- [`export-points.md`](export-points.md) — SAS-points export