Drop frozen pocket-grid PLAN/SPEC; refine audit punch-list

PLAN.md and SPEC.md were pre-implementation design docs for the pocket-grid
feature. The feature has shipped, so they're frozen artifacts in the active
todo/ namespace. Delete them and strip the three "see SPEC.md" comments from
Main.groovy and the predict/rescore routines.

Also reassess the PyMOL rank-gap entry in the audit: P2Rank ranks pockets
contiguously throughout the predict path and all in-tree loaders (except
SiteHoundLoader), so the previously-listed "renderer ignores rank gaps" is
cosmetic-only (empty objects in the Models panel for small pockets whose
filled BitSet ended up empty). Downgrade to a parity nit under
Inconsistencies; promote the PUResNet surfaceAtoms re-linking to the Top-5.
rdk
2026-05-20 19:42:47 +02:00
parent 40d333f77f
commit af1d6eeb18
6 changed files with 19 additions and 822 deletions


@@ -33,14 +33,6 @@ focused cleanups.
validate. Fix: `volatile` + DCL, or do one-shot init under a synchronized guard
in `preProcessProtein`; update the concurrency test to construct under contention.
- **PyMOL pocket-grid renderer ignores rank gaps.**
`src/main/groovy/cz/siret/prank/program/visualization/renderers/PocketGridPymolRenderer.groovy:167,201,242`
loop `for (rank = 1; rank <= maxRank; rank++)`. If a pocket rank is missing
(`{1,3}`), the script emits an empty `pocket_grid_2` and shifts palette indices.
ChimeraX was fixed in commit `a3efd084`; PyMOL still has the latent bug.
Comment at lines 131-132 even acknowledges it. Fix: iterate
`grid.pocketToPointIndices.keySet()` like ChimeraX does.
- **Coulomb plumbing is dead code.**
`EnergyCalculator.getAtomCharge` always returns 0
(`src/main/groovy/cz/siret/prank/features/implementation/energy2/calc/EnergyCalculator.groovy:351-357`).
@@ -161,6 +153,17 @@ focused cleanups.
everything into a `TreeMap` (`EvalResults.groovy:189`) — insertion order is
lost. Either drop the misleading comment or use `LinkedHashMap` downstream.
- **PyMOL pocket-grid renderer iterates `1..maxRank`; ChimeraX iterates
`perPocketBasenames.keySet()`.**
`PocketGridPymolRenderer.groovy:167,201,242`.
Cosmetic-only: P2Rank ranks pockets contiguously (every `predict`-path and
in-tree loader except `SiteHoundLoader` assigns `i++`/`rank++`), and the
sidecar PDB strips ranks whose `filled` BitSet is empty. PyMOL therefore
emits empty `pocket_grid_N`/`pocket_vol_N`/`pocket_gauss_N`/`pocket_hull_N`
objects when the assigner produced no points for a small pocket — they
render as invisible but clutter the Models panel. Mirror the ChimeraX
iteration pattern (`a3efd084`) for parity; not a correctness fix.
- **PyMOL grid `solvent_radius=0` vs ChimeraX non-zero probe.**
`PocketGridPymolRenderer.groovy:189-190` vs `PocketGridChimeraXRenderer.groovy:264`.
`vis_pocket_grid_volume_radius` means different things to the two renderers.
@@ -219,9 +222,6 @@ focused cleanups.
- **`documentation/readme.md`** index misses `cofactors.md`, `conservation.md`,
`export-pocket-grid.md`, `export-pocket-descriptors.md`.
- **`misc/todo/pocket_grid/{SPEC,PLAN}.md`** still mention
`export_pocket_grid_pml` (renamed to `vis_pocket_grid`).
- **CI matrix is `17,21,25,26` only** (`.github/workflows/develop.yml:23`).
README claims "Java 17 or later (tested up to Java 25)"; 18-20/22/23/24 not
exercised; "tested up to Java 25" lags the now-present 26.
@@ -369,14 +369,15 @@ focused cleanups.
1. **Fix `VoxelHashAssigner` cell-prune lower bound** (or drop it and rely on
the post-fetch distance check). Restores the assigner-strategy equivalence
the docs promise.
2. **Apply the rank-gap fix to `PocketGridPymolRenderer`** — mirror what
commit `a3efd084` did for ChimeraX.
3. **Make energy-feature lazy-init actually thread-safe**
2. **Make energy-feature lazy-init actually thread-safe**
(`MethylEnergyFeature`, `AbstractProbeEnergyFeature`); fix `ConcurrencyTest`
to construct calculators under contention.
4. **Guard `AhojSiteInfo.fromCsvRecord` with `record.isMapped(...)`** for the
3. **Guard `AhojSiteInfo.fromCsvRecord` with `record.isMapped(...)`** for the
new `rg`/`n_unp_pockets[_multichain]` columns, so the parser doesn't crash
on older "full" CSVs.
4. **Re-link `PUResNetLoader.surfaceAtoms` to `queryProtein`** by PDB serial
(mirror `FPocketLoader.groovy:137`); same identity-mismatch class as the
Concavity fix.
5. **README/help.txt/`distro/prank.bat` trio**: bump the version badge, fix
the `./make-disro.sh` typo, regenerate `help.txt` to list current commands,
and bring Windows launcher JVM flags up to parity with the Bash launchers.


@@ -1,446 +0,0 @@
# Plan — Pocket grid points export + per-pocket descriptors
Companion to `SPEC.md`. Ordered, atomic phases. Each phase is a single
reviewable commit (or two if splitting tests helps). Compile + test must
be green at the end of every phase.
## Phase order rationale
Layered, foundation-first. Each phase only depends on phases above it.
```
1. TableData STRING refactor (foundation, no behavior change)
2. VdW radius helper + grid generator (foundation)
3. PocketGrid data class + fill strategies
4. PocketGridBuilder (orchestration)
5. Descriptors infrastructure + menu
6. Export-data classes + exporters
7. PyMOL renderer + PDB sidecar
8. Params + Main-startup validation
9. Wire into PredictPockets + RescorePockets routines
10. Documentation (2 new MD files + cross-ref)
11. Smoke test on real data
```
---
## Phase 1 — `TableData` STRING column-type refactor
**Goal:** Extend the export infrastructure to support string columns. No
behavioral change to existing SAS-points export.
**Changes:**
- `TableData.groovy` — add `ColumnType.STRING`; new method
`default String getString(int rowIndex, int colIndex) { throw ... }`
for STRING columns; default `getColumn` only meaningful for numeric.
- `TableExporter.groovy`:
- `writeCsv` — string branch with RFC 4180 quoting (escape `,`, `"`, newline).
- `writeArrow`: `VarCharVector` for STRING columns; `buildSchema` updated.
- `writeParquet`: `BINARY` with `LogicalTypeAnnotation.stringType()`;
`RowDehydrator` updated.
- `PointExportData.groovy` — no functional change; verify
`getColumnType` doesn't accidentally return STRING (it currently can't —
all columns are DOUBLE/INT).
**Tests:**
- `TableExporterTest` — new round-trip tests for a synthetic table with
one STRING column, one INT, one DOUBLE; csv, csv.gz, arrow, parquet.
- CSV quoting edge cases: value contains `,`, `"`, `\n`.
- Regression: existing `PointsExporterTest` still passes (no schema
changes to SAS export).
**Commit:** `Extend TableData with STRING column type`
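
The RFC 4180 quoting rule for STRING columns can be sketched in a few lines of Java. This is an illustration only, not the `TableExporter` code; the class and method names (`CsvQuotingSketch`, `quote`) are hypothetical.

```java
final class CsvQuotingSketch {

    /** Quotes a field only when it contains a comma, a double quote, or a line break. */
    static String quote(String value) {
        boolean needsQuoting = value.indexOf(',') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return value; // plain values stay unquoted
        }
        // RFC 4180: an embedded double quote is escaped by doubling it.
        return '"' + value.replace("\"", "\"\"") + '"';
    }
}
```

Because only values that need it are quoted, plain pocket names such as `pocket1` would be written exactly as before.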
---
## Phase 2 — VdW radius helper + `GridGenerator` extension
**Goal:** Make per-atom VdW radii available; extend the existing grid
sampler.
**Changes:**
- New `src/main/groovy/cz/siret/prank/program/routines/predict/output/grid/VdwRadiusTable.groovy`:
- `static double get(Atom atom)` — looks up via CDK `Elements` by
element symbol; if `null`, falls back to Krypton's 2.02 Å (matches
the existing pattern in `PatchedCdkNumericalSurface.groovy:54-56`).
- Caches `String elementSymbol → double radius` in a
`ConcurrentHashMap` (predict runs multi-threaded via
`Dataset.process(...)`, so the cache is shared across threads;
`computeIfAbsent` is safe and avoids races).
- `GridGenerator.java` — extend
`sampleGridPointsAroundAtoms(Atoms, edge, radius)` into a new variant
`sampleGridPointsBetween(Atoms, edge, maxDist, double atomBuffer)`:
- Keep existing method unchanged.
- New method uses `Atoms.withKdTreeConditional()`, walks the lattice,
for each cell computes `nearest = atoms.nearestSqrDist(p)`,
`vdw = VdwRadiusTable.get(nearestAtom)`, drops if
`sqrt(nearest) < vdw + atomBuffer` or `sqrt(nearest) > maxDist`.
- Note: `nearestSqrDist` returns squared distance only; for the per-atom
VdW check we need the actual nearest **atom**, not just distance.
Use `Atoms.findNearest(point)` (`Atoms.java:244`) which returns the
Atom; then compute `dist` once.
**Tests:**
- `VdwRadiusTableTest` — known elements (C, N, O, S, P, Fe, Cu, Co)
return non-null; Co/Ni/Cu use the Krypton fallback (2.02 Å); unknown
symbol → fallback.
- `GridGeneratorTest` (new file or existing if present) — synthetic
small `Atoms` set, verify min/max filtering on cubic lattice
produces expected count. Edge case: single-atom input.
**Commit:** `Add VdwRadiusTable and GridGenerator min/max sampler`
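
A minimal sketch of the cached radius lookup described in this phase, assuming a plain element-symbol key. The small hard-coded table of Bondi radii stands in for the CDK `Elements` lookup and is not what CDK returns; `VdwRadiusSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class VdwRadiusSketch {

    /** Krypton proxy radius from the plan, used when no radius is known. */
    static final double FALLBACK_RADIUS = 2.02;

    // Illustrative Bondi radii in angstroms; a stand-in for the CDK lookup.
    private static final Map<String, Double> KNOWN = Map.of(
            "C", 1.70, "N", 1.55, "O", 1.52, "S", 1.80);

    // Shared across worker threads; computeIfAbsent is atomic per key.
    private static final ConcurrentHashMap<String, Double> CACHE = new ConcurrentHashMap<>();

    static double get(String elementSymbol) {
        if (elementSymbol == null) {
            return FALLBACK_RADIUS; // ConcurrentHashMap rejects null keys
        }
        return CACHE.computeIfAbsent(elementSymbol,
                symbol -> KNOWN.getOrDefault(symbol, FALLBACK_RADIUS));
    }
}
```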
---
## Phase 3 — `PocketGrid` data class + fill strategies
**Goal:** Pure data + algorithms, no orchestration.
**Changes:**
- `PocketGrid.groovy`:
- Fields:
- `Atoms allPoints` — kept grid points after filtering, wrapped as
`Atoms` (since `Point implements Atom`). Reusing `Atoms` gives us
`cutoutShell`, `withKdTree`, `getByID` for free.
- `Map<Integer, Set<Integer>> pocketToPointIndices` (rank → indices
into `allPoints`).
- `Set<Integer> assignedIndices` (union of all per-pocket sets).
- `double spacing`.
- `Map<LatticeCoord, Integer> latticeIndex` — integer-lattice
coordinate `(i, j, k)` → point index. Computed from
`originX/Y/Z` + `spacing` during grid generation; **required by
`MorphologicalCloser`** for `O(1)` neighbor lookups (without it
morph closing degrades to all-pairs distance comparisons).
- `LatticeCoord` is a small immutable value class with proper
`equals`/`hashCode`.
- Provides: `Atoms pointsForPocket(int rank)`,
`Set<Integer> pocketsForPoint(int pointIndex)`,
`Set<Integer> neighborsOf(int pointIndex, int connectivity)` (where
`connectivity ∈ {6, 18, 26}` consults `latticeIndex`).
- `fill/PocketShapeFiller.groovy` — interface:
```groovy
Set<Integer> fill(Set<Integer> rawShellIndices,
List<Point> allPoints,
double spacing,
Params params)
```
- `fill/NoOpFiller.groovy` — returns input unchanged.
- `fill/MorphologicalCloser.groovy`:
- Operates on a `Map<(int,int,int) → Integer>` lattice index built from
allPoints. For each iteration, scans candidate cells (immediate
neighbors of assigned cells) and promotes those whose neighbor count
≥ `pocket_grid_fill_min_neighbors`. Stops at fixed-point or
`pocket_grid_fill_max_iters`.
- Neighborhood: 26-connectivity (configurable later if needed).
- `fill/ConvexHullFiller.groovy` — initial **stub** that throws
`UnsupportedOperationException("convex_hull fill not yet implemented")`
so users get a clear error if they select it. Real impl in a followup.
**Tests:**
- `MorphologicalCloserTest` — synthetic shapes:
- Pure sphere shell (3-cell-thick) → fills to solid sphere within
`max_iters`.
- U-shape with concavity → concavity filled in.
- Disconnected components → not merged when far apart.
- `NoOpFillerTest` — identity.
**Commit:** `Add PocketGrid data class and morph-closing fill strategy`
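
The closing loop described for `MorphologicalCloser` could look roughly like the following. This is a sketch under stated assumptions, not the shipped class: cells are modeled as a Java record, the lattice is a plain `Set`, and all names (`MorphFillSketch`, `Cell`, `fill`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MorphFillSketch {

    /** Integer lattice coordinate; the record supplies equals/hashCode. */
    record Cell(int i, int j, int k) {}

    /**
     * Promotes unassigned lattice cells that have at least minNeighbors assigned
     * cells in their 26-neighborhood. Stops at a fixed point or after maxIters.
     */
    static Set<Cell> fill(Set<Cell> rawShell, Set<Cell> lattice, int minNeighbors, int maxIters) {
        Set<Cell> assigned = new HashSet<>(rawShell);
        for (int iter = 0; iter < maxIters; iter++) {
            Set<Cell> promoted = new HashSet<>();
            for (Cell cell : assigned) {
                // Candidates are the immediate neighbors of assigned cells.
                for (Cell candidate : neighbors26(cell)) {
                    if (assigned.contains(candidate) || !lattice.contains(candidate)) {
                        continue;
                    }
                    if (countAssignedNeighbors(candidate, assigned) >= minNeighbors) {
                        promoted.add(candidate);
                    }
                }
            }
            if (promoted.isEmpty()) {
                break; // fixed point reached
            }
            assigned.addAll(promoted);
        }
        return assigned;
    }

    static List<Cell> neighbors26(Cell c) {
        List<Cell> result = new ArrayList<>(26);
        for (int di = -1; di <= 1; di++) {
            for (int dj = -1; dj <= 1; dj++) {
                for (int dk = -1; dk <= 1; dk++) {
                    if (di != 0 || dj != 0 || dk != 0) {
                        result.add(new Cell(c.i() + di, c.j() + dj, c.k() + dk));
                    }
                }
            }
        }
        return result;
    }

    static int countAssignedNeighbors(Cell c, Set<Cell> assigned) {
        int count = 0;
        for (Cell n : neighbors26(c)) {
            if (assigned.contains(n)) {
                count++;
            }
        }
        return count;
    }
}
```

Promotions within one iteration are counted against the set as it stood at the start of that iteration, so the result does not depend on hash-set iteration order.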
---
## Phase 4 — `PocketGridBuilder` (orchestration)
**Goal:** End-to-end grid generation + per-pocket assignment + fill.
**Changes:**
- `PocketGridBuilder.groovy`:
- `static PocketGrid build(Protein protein, List<? extends Pocket> pockets, Params params)`
- Steps:
1. Call the new sampler from Phase 2 →
`Atoms allPoints` of kept lattice points + their lattice
coordinates. Store both in the resulting `PocketGrid`.
2. Build a KdTree on `allPoints` (`allPoints.withKdTree()`) — cheap
once, reused by callers downstream.
3. For each pocket `p`:
- `p.surfaceAtoms.withKdTreeConditional()` (small set, KdTree
built on demand).
- Iterate `allPoints` once; for each point at index `i`, keep
`i` in the **raw shell** set if
`p.surfaceAtoms.nearestDist(allPoints.list[i]) <= params.pocket_grid_assign_cutoff`.
O(|allPoints| × log|surfaceAtoms|) per pocket.
- Pass the raw shell set + `latticeIndex` to
`filler.fill(...)` → final per-pocket index set.
4. Aggregate into `PocketGrid.pocketToPointIndices`; derive
`assignedIndices` as the union.
- Filler selection: dispatch on `params.pocket_grid_fill` enum value.
- All `@CompileStatic` + `@Slf4j`.
**Tests:**
- `PocketGridBuilderTest`:
- 1fbl.pdb fixture (small, fast). Predict pockets via existing
`PrankFacade`; build grid; assert:
- `allPoints` count is reasonable for the bounding box (sanity check).
- Each pocket has a non-empty point set after fill.
- Multi-pocket overlap can occur (count of `(point, pocket)` pairs
> count of distinct points).
- Edge case: protein with 0 predicted pockets → `PocketGrid` with
`allPoints` non-empty but `pocketToPointIndices` empty.
**Commit:** `Add PocketGridBuilder orchestrating grid + assignment + fill`
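
As an illustration of the per-pocket raw-shell step, here is a brute-force Java sketch that skips the KdTree and works on bare coordinate arrays. The names (`ShellAssignmentSketch`, `assign`, `union`) are hypothetical, and the real builder is planned to use `Atoms` and `nearestDist` instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ShellAssignmentSketch {

    /** Returns rank -> indices of grid points within cutoff of any surface atom of that pocket. */
    static Map<Integer, Set<Integer>> assign(double[][] gridPoints,
                                             Map<Integer, double[][]> surfaceAtomsByRank,
                                             double cutoff) {
        double cutoffSq = cutoff * cutoff;
        Map<Integer, Set<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, double[][]> pocket : surfaceAtomsByRank.entrySet()) {
            Set<Integer> shell = new TreeSet<>();
            for (int i = 0; i < gridPoints.length; i++) {
                if (nearestSqDist(gridPoints[i], pocket.getValue()) <= cutoffSq) {
                    shell.add(i);
                }
            }
            result.put(pocket.getKey(), shell);
        }
        return result;
    }

    /** Union of all per-pocket index sets (the assigned indices). */
    static Set<Integer> union(Map<Integer, Set<Integer>> byRank) {
        Set<Integer> all = new TreeSet<>();
        for (Set<Integer> indices : byRank.values()) {
            all.addAll(indices);
        }
        return all;
    }

    static double nearestSqDist(double[] p, double[][] atoms) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] a : atoms) {
            double dx = p[0] - a[0], dy = p[1] - a[1], dz = p[2] - a[2];
            best = Math.min(best, dx * dx + dy * dy + dz * dz);
        }
        return best;
    }
}
```

A point that lies within the cutoff of two pockets appears in both index sets, which is what produces more `(point, pocket)` pairs than distinct points.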
---
## Phase 5 — Descriptors infrastructure + initial 4
**Goal:** Pluggable descriptors with default `["volume"]`.
**Changes:**
- `descriptors/PocketGridContext.groovy` — data class: `pocket`, `protein`,
`gridPointsForPocket`, `pocketGrid`, `params`.
- `descriptors/PocketDescriptor.groovy` — interface:
```groovy
String name()
ColumnType columnType() // INT or DOUBLE
double compute(PocketGridContext ctx)
```
(Return type `double` — INT descriptors cast at write time, mirroring
TableData's int-as-double convention.)
- `descriptors/PocketDescriptorRegistry.groovy`:
- `static Map<String, PocketDescriptor> REGISTRY` — populated at
classload with the 4 shipped descriptors.
- `static PocketDescriptor get(String name)` — throws `PrankException`
on unknown.
- `static Set<String> knownNames()`.
- `VolumeDescriptor.groovy` — `count(gridPoints) × spacing³`.
- `SphericityDescriptor.groovy` — bounding-sphere variant. **Centroid is
the centroid of the pocket's assigned grid points**, not
`pocket.centroid` (which is derived from surfaceAtoms and would give
misleading numbers for asymmetric pockets):
- `gridCentroid = mean(p for p in ctx.gridPointsForPocket)`
- `r = max(dist(p, gridCentroid))`
- `V_sphere = (4/3) · π · r³`
- `result = V_pocket / V_sphere` (≤ 1 by construction; clamp is
defensive)
- `NumResiduesDescriptor.groovy` — `pocket.residues.size()`.
- `NumSurfaceAtomsDescriptor.groovy` — `pocket.surfaceAtoms.count`.
**Tests:**
- Per-descriptor unit tests using a synthetic small `PocketGridContext`:
- Volume: 8 grid cells @ 1Å spacing → V = 8 ų.
- Sphericity: solid sphere of N cells → sphericity ≈ 1.0 (within
tolerance for lattice quantization); flat disc → sphericity << 1.
- num_residues / num_surface_atoms: stub pockets.
- `PocketDescriptorRegistryTest` — known names resolve; unknown throws.
**Commit:** `Add pocket descriptor framework with 4 initial descriptors`
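
The volume and bounding-sphere sphericity formulas above can be worked through in a short Java sketch. It operates on raw coordinate arrays rather than a `PocketGridContext`, and the class and method names are hypothetical.

```java
final class GridDescriptorSketch {

    /** Volume = number of grid points times the cell volume. */
    static double volume(double[][] points, double spacing) {
        return points.length * spacing * spacing * spacing;
    }

    /** V_pocket / V_sphere, with the sphere centered on the grid-point centroid. */
    static double sphericity(double[][] points, double spacing) {
        if (points.length == 0) {
            throw new IllegalArgumentException("pocket has no grid points");
        }
        double[] centroid = new double[3];
        for (double[] p : points) {
            for (int d = 0; d < 3; d++) {
                centroid[d] += p[d] / points.length;
            }
        }
        double maxSq = 0.0;
        for (double[] p : points) {
            double dx = p[0] - centroid[0], dy = p[1] - centroid[1], dz = p[2] - centroid[2];
            maxSq = Math.max(maxSq, dx * dx + dy * dy + dz * dz);
        }
        double r = Math.sqrt(maxSq);
        double sphereVolume = 4.0 / 3.0 * Math.PI * r * r * r;
        // A single point gives r = 0 and an infinite ratio; the clamp maps it to 1.
        double ratio = volume(points, spacing) / sphereVolume;
        return Math.max(0.0, Math.min(1.0, ratio));
    }
}
```

In this sketch the eight corner points of a unit cube give a raw ratio above 1, because the voxel volume is counted per point while the sphere only passes through point centers, so the clamp does real work for small point sets.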
---
## Phase 6 — Export-data classes + exporters
**Goal:** Bridge `PocketGrid` and descriptor computations to `TableExporter`.
**Changes:**
- `PocketGridExportData.groovy` (implements `TableData`):
- Constructor takes `PocketGrid` and `boolean includeUnassigned`.
- Materializes long-format rows during construction (point-pocket pairs);
sort by `(pocket, x, y, z)`.
- Columns: `x`, `y`, `z` (DOUBLE), `pocket` (INT).
- `PocketDescriptorsExportData.groovy` (implements `TableData`):
- Constructor takes pockets, descriptor results, `boolean includeProbability`.
- Columns: `name` (STRING — uses Phase 1 refactor), `rank` (INT),
`score` (DOUBLE), `probability` (DOUBLE, conditional),
`center_x/y/z` (DOUBLE), then one column per descriptor (INT or
DOUBLE per the descriptor's `columnType()`).
- `PocketGridExporter.groovy`:
- `static void tryExport(PocketGrid grid, String outdir, String label, Params params)`
- Gated by `params.export_pocket_grid`; uses `params.pocket_grid_format`.
- Writes `{outdir}/{label}_pocket_grid.{format}`.
- `PocketDescriptorsExporter.groovy`:
- `static void tryExport(List<? extends Pocket> pockets, PocketGrid grid, Protein protein, Params params, String outdir, String label)`
- Derives `includeProbability` from the data itself:
`pockets.any { !Double.isNaN(it.probaTP) }`. No extra parameter
threaded through the wiring.
- Iterates `params.pocket_descriptors`, computes each, builds
`PocketDescriptorsExportData`, writes to file.
**Tests:**
- `PocketGridExportDataTest` — assert row count, sort order, column types
on a synthetic `PocketGrid`.
- `PocketDescriptorsExportDataTest` — STRING column round-trips through
CSV correctly (depends on Phase 1).
- Integration smoke: small fixture, export to all 7 formats, re-read with
the same reader paths used by `PointExportDataTest`.
**Commit:** `Add pocket grid and descriptors exporters`
---
## Phase 7 — PyMOL renderer + PDB sidecar
**Goal:** Visualization of the grid in PyMOL.
**Changes:**
- New util in `PocketGridPymolRenderer.groovy`:
- `static void render(PocketGrid grid, String outdir, String label, Params params)`
- Writes:
1. `{outdir}/visualizations/data/{label}_pocket_grid.pdb.gz` — one
HETATM per `(point, pocket)` pair; pocket rank in residue-sequence
column (cols 23-26); element column = `H` (or `D` for dummy).
Mirrors `PredictionVisualizer.writeLabeledPointsPdb:44-56`.
2. `{outdir}/visualizations/{label}_pocket_grid.pml`:
- `load data/{label}_pocket_grid.pdb.gz, pocket_grid`
- Per pocket rank N:
- `create pocket_grid_<N>, pocket_grid and resi <N>`
- `color <hex>, pocket_grid_<N>` (color via
`PredictionVisualizer.generatePocketColors(numPockets)`)
- `show spheres, pocket_grid_*`; `set sphere_scale, 0.3`
- `delete pocket_grid` (drop the bulk object).
- All paths via `Futils` for cross-platform safety.
**Tests:**
- `PocketGridPymolRendererTest` — synthetic small `PocketGrid` (3 pockets,
~20 points each); assert output files exist; spot-check PML contains
`load`, `create pocket_grid_1`, `color`, `show spheres`.
- Sanity: PDB output gzip-decompresses to valid HETATM records.
**Commit:** `Add PocketGridPymolRenderer with PDB sidecar`
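
To make the fixed-column layout concrete, here is a hedged Java sketch of one HETATM record with the pocket rank in columns 23-26. The residue name `GRD`, chain `A`, and the class name are illustrative assumptions, and handling of serial numbers above 99999 is deliberately left out.

```java
import java.util.Locale;

final class GridPdbLineSketch {

    /** Formats one 78-column HETATM record; pocketRank goes into the resSeq field (cols 23-26). */
    static String hetatm(int serial, int pocketRank, double x, double y, double z) {
        if (pocketRank < 0 || pocketRank > 9999) {
            throw new IllegalArgumentException("rank does not fit 4 columns: " + pocketRank);
        }
        if (serial < 0 || serial > 99999) {
            throw new IllegalArgumentException("serial does not fit 5 columns: " + serial);
        }
        // Locale.ROOT keeps '.' as the decimal separator on every platform.
        return String.format(Locale.ROOT,
                "HETATM%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s",
                serial, "H", "GRD", "A", pocketRank, x, y, z, 1.0, 0.0, "H");
    }
}
```

The range check on the rank reflects the 4-character limit of the residue-sequence column.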
---
## Phase 8 — Params + Main-startup validation
**Goal:** All 12 new params wired and validated.
**Changes:**
- `Params.groovy` — add 12 `@RuntimeParam` fields with javadoc, defaults
per spec table. Place near `export_points` / `export_points_format`.
- `Main.groovy` — extend the existing param-validation block (around
`:142-153`, same pattern used by cofactors):
- `pocket_grid_format` ∈ allowed enumeration.
- `pocket_grid_fill` ∈ {`morph_closing`, `convex_hull`, `none`}.
- Every name in `pocket_descriptors` ∈ `PocketDescriptorRegistry.knownNames()`.
- If `export_pocket_grid_pml` and `!export_pocket_grid` → throw
`PrankException("export_pocket_grid_pml requires export_pocket_grid=true")`.
**Tests:**
- `ParamsTest` — defaults match spec.
- `MainTest` (or wherever cofactor validation is tested) — each of the 4
validation failures triggers a fail-fast with a clear message.
**Commit:** `Add pocket grid params and startup validation`
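
The four fail-fast checks listed for `Main.groovy` might be sketched as below. This is a standalone Java illustration: `IllegalArgumentException` stands in for `PrankException`, the descriptor set is hard-coded instead of coming from the registry, and the class name is hypothetical.

```java
import java.util.List;
import java.util.Set;

final class PocketGridParamCheckSketch {

    static final Set<String> FORMATS = Set.of(
            "csv", "csv.gz", "csv.zst", "arrow", "arrow.gz", "arrow.zst", "parquet");
    static final Set<String> FILLS = Set.of("morph_closing", "convex_hull", "none");
    // Stand-in for PocketDescriptorRegistry.knownNames().
    static final Set<String> DESCRIPTORS = Set.of(
            "volume", "sphericity", "num_residues", "num_surface_atoms");

    static void validate(String format, String fill, List<String> descriptors,
                         boolean exportGrid, boolean exportGridPml) {
        if (!FORMATS.contains(format)) {
            throw new IllegalArgumentException("Unknown pocket_grid_format: " + format);
        }
        if (!FILLS.contains(fill)) {
            throw new IllegalArgumentException("Unknown pocket_grid_fill: " + fill);
        }
        for (String name : descriptors) {
            if (!DESCRIPTORS.contains(name)) {
                throw new IllegalArgumentException("Unknown pocket descriptor: " + name);
            }
        }
        if (exportGridPml && !exportGrid) {
            throw new IllegalArgumentException(
                    "export_pocket_grid_pml requires export_pocket_grid=true");
        }
    }
}
```

An empty descriptor list passes validation, which matches the plan's note that `-pocket_descriptors ""` is supported.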
---
## Phase 9 — Wire into routines
**Goal:** Call the new pipeline from prediction routines.
**Changes:**
- `PredictPocketsRoutine.groovy`:
- After score transformation and the existing
`PointsExporter.tryExportPoints(...)` call, insert:
```groovy
PocketGrid grid = null
if (params.export_pocket_grid || params.export_pocket_descriptors || params.export_pocket_grid_pml) {
grid = PocketGridBuilder.build(item.protein, prediction.pockets, params)
}
PocketGridExporter.tryExport(grid, outdir, item.label, params)
PocketDescriptorsExporter.tryExport(prediction.pockets, grid, item.protein, params, outdir, item.label)
if (params.visualizations && params.export_pocket_grid_pml) {
PocketGridPymolRenderer.render(grid, outdir, item.label, params)
}
```
- `RescorePocketsRoutine.groovy` — identical hook at the analogous point.
- Order is critical: build → grid file → descriptors (needs grid for
volume) → PML (needs grid).
**Tests:**
- `PredictPocketsRoutineTest` (extend existing) — run a small prediction
with `-export_pocket_grid 1 -export_pocket_descriptors 1
-export_pocket_grid_pml 1` on 1fbl.pdb; verify all four output files
appear at the right paths.
**Commit:** `Wire pocket grid/descriptors/PML into prediction routines`
---
## Phase 10 — Documentation
**Goal:** User-facing docs.
**Changes:**
- New `documentation/export-pocket-grid.md`:
- Sections: Overview, Output file format (long format, sort order,
formats), Algorithm summary (grid generation, assignment, fill),
Params table, CLI examples, PyMOL visualization, Notes.
- Mirrors the structure of `documentation/export-points.md`.
- New `documentation/export-pocket-descriptors.md`:
- Sections: Overview, Output file format, Descriptor catalog
(volume, sphericity, num_residues, num_surface_atoms — with
formulas), Extensibility (how to add a new descriptor), Params
relevant to descriptors.
- `documentation/export-points.md` — append a brief "See also" block at
the end pointing to the two new docs.
- `README.md` — single bullet in "What's new" for 2.7 (or whenever this
ships) referencing the two new docs.
**Tests:** none (docs only).
**Commit:** `Document pocket grid and descriptors export`
---
## Phase 11 — Smoke test on real data
**Goal:** End-to-end on real proteins; eyeball outputs.
**Changes:** none.
**Verification (manual):**
- Run on `distro/test_data/1fbl.pdb` with `-export_pocket_grid 1
-export_pocket_descriptors 1 -export_pocket_grid_pml 1`.
- Verify:
- Grid CSV row counts and centroid statistics look right (small protein
→ maybe 5k-15k assigned point-rows).
- Descriptors CSV — volume in 50-2000 ų range per pocket; sphericity
in [0, 1]; residue/atom counts non-zero.
- PyMOL: open the PML; visually confirm grid points cluster near
predicted pockets, colored consistently with the main pocket PML.
- Run on one of the SwinSite test proteins (1tjw_A) for cross-method
sanity.
- No regressions in existing SAS-points export.
**Commit:** none (or "Smoke test results: …" in a project log under `local/`).
---
## Risks / clarifications
Notes from the plan review that don't require code changes but are worth
flagging:
- **Sphericity clamp is redundant** — `V_pocket ≤ V_bounding_sphere`
always (covering sphere by construction). The `[0, 1]` clamp is purely
defensive; keep it.
- **Heavy Phase 4 integration test** — `PocketGridBuilderTest` uses
`PrankFacade` to predict pockets, which is slow. Keep the integration
test but also add a fast unit test that constructs `Pocket` instances
manually with a synthetic `surfaceAtoms` set.
- **Empty `pocket_descriptors`** — `-pocket_descriptors ""` (empty list)
is supported: descriptors file emits only the base columns
(`name, rank, score, [probability,] center_x/y/z`). Add a regression
test in Phase 6.
- **PDB residue-sequence column** is 4 chars (cols 23-26) → pockets are
capped at rank 9999 in the PML output. Real pockets stay well under
100; document the limit in the PML renderer's javadoc.
- **CSV string quoting** added in Phase 1 fires only for STRING columns.
Existing DOUBLE/INT writes stay unquoted — no CSV-format drift for
SAS-points export. Mention this in the Phase 1 commit message.
## Out-of-scope (followups noted in spec)
- Per-residue descriptors.
- `convex_hull` filler real implementation.
- Pocket overlap matrix output file.
- Long-format SAS-points export.


@@ -1,353 +0,0 @@
# Spec — Pocket grid points export + per-pocket descriptors
Status: spec, not plan. Author decisions captured in two rounds:
- **Initial 6 Qs:** (1) long-format grid CSV, (2) morph-closing proxy with
strategy switch, (3) defaults OK, (4) separate descriptors file,
(5) no standalone command, (6) initial descriptor menu accepted.
- **20-audit cross-check vs. code:** see below; all 20 decisions are applied
in this revision.
## Goals
Two new opt-in outputs, both produced by any `predict` or `rescore` run,
plus an optional PyMOL visualization:
1. **`{outdir}/{name}_pocket_grid.{format}`** — regular 3D grid of points
covering the empty space around the protein, in **long format**: one row
per `(point, pocket)` pair. By default only **assigned** points are
written (one or more rows per point, one per pocket they belong to).
Unassigned points (`pocket = 0`) can be opted in with
`pocket_grid_include_unassigned`.
2. **`{outdir}/{name}_pocket_descriptors.{format}`** — one row per predicted
pocket with score, rank, centroid, and an extensible list of
geometric/chemical descriptors (volume from grid-point count, plus
others).
3. **`{outdir}/visualizations/{name}_pocket_grid.pml`** — optional PyMOL
visualization, produced by a new renderer.
Both data files reuse the existing `TableExporter` (csv / csv.gz / csv.zst /
arrow / arrow.gz / arrow.zst / parquet), matching the SAS-points export
pattern documented at `documentation/export-points.md`. Decoupled from the
prediction algorithm: P2Rank still scores SAS points exactly as today; the
grid is a post-prediction geometric overlay used only for descriptor
computation.
## Prerequisite refactor
**`TableData` and the three writers (`writeCsv`/`writeArrow`/`writeParquet`)
must be extended to support a `STRING` column type** (audit #1). Currently
`TableData` only accepts `DOUBLE` and `INT`
(`src/main/groovy/cz/siret/prank/program/routines/predict/output/TableData.groovy:13-15`).
Without this, the descriptors file's `name` column cannot be written.
Scope of the refactor:
- Add `ColumnType.STRING` and a `String[] getStringColumn(int)` (or boxed
`Object` access path) to `TableData`.
- Extend `writeCsv` to emit strings with proper CSV quoting (escape `,`,
`"`, newlines per RFC 4180).
- Extend `writeArrow` to use `VarCharVector` for string columns.
- Extend `writeParquet` to use `BINARY` (UTF8) primitive type for string
columns.
- Update `PointExportData` to declare its columns via the new type system
(no functional change for SAS-point export — no strings used today).
## Algorithms
### Grid generation (once per protein)
1. Build a KdTree over `protein.proteinAtoms`
(`protein.proteinAtoms.withKdTreeConditional()`). Note: when
`CofactorHandler` is enabled, cofactor atoms are already merged into
`proteinAtoms` (`Protein.groovy:571-583`) — no separate union step
needed (audit #4).
2. Bounding box around `protein.proteinAtoms`, expanded by
`pocket_grid_max_dist` in every direction (reuses
`Box.aroundAtoms(...).withMargin(...)`).
3. Walk a regular cubic lattice with edge `pocket_grid_spacing` inside the
box (reuses `GridGenerator.forBox(box, edge)`).
4. Per-atom VdW radius via CDK `Elements` (audit #2). Reuse the same
accessor pattern as `PatchedCdkNumericalSurface` — when CDK returns
`null` for an element, fall back to the Krypton proxy (2.02 Å), matching
the existing null-VdW workaround. Implemented as a small helper
`VdwRadiusTable.get(Atom) → double`.
5. For each lattice point:
- **drop** if `min_dist_to_proteinAtoms < vdw_radius(nearest_atom) + pocket_grid_atom_buffer`
— overlaps the protein;
- **drop** if `min_dist_to_proteinAtoms > pocket_grid_max_dist` — too
far from the surface;
- **keep** otherwise.
**Implementation note** (audit #3): extend
`GridGenerator.sampleGridPointsAroundAtoms` (`GridGenerator.java:157-172`)
to accept both `minDist` (semantically per-atom: VdW + buffer) and
`maxDist`. The current method already does the `maxDist` side; the
extension is the per-atom-VdW exclusion check.
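
The grid-generation steps above can be sketched as a brute-force Java routine. It replaces the KdTree with a linear nearest-atom scan and carries the VdW radius on each atom, so it illustrates the keep/drop rule only; `GridSamplingSketch` and its `Atom` record are hypothetical and unrelated to the project's own `Atom` type.

```java
import java.util.ArrayList;
import java.util.List;

final class GridSamplingSketch {

    /** Atom position plus its van der Waals radius, all in angstroms. */
    record Atom(double x, double y, double z, double vdw) {}

    static List<double[]> sample(List<Atom> atoms, double spacing, double maxDist, double atomBuffer) {
        List<double[]> kept = new ArrayList<>();
        if (atoms.isEmpty()) {
            return kept;
        }
        // Bounding box around the atoms, expanded by maxDist in every direction.
        double[] min = {Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY};
        double[] max = {Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY};
        for (Atom a : atoms) {
            double[] c = {a.x(), a.y(), a.z()};
            for (int d = 0; d < 3; d++) {
                min[d] = Math.min(min[d], c[d] - maxDist);
                max[d] = Math.max(max[d], c[d] + maxDist);
            }
        }
        int[] n = new int[3];
        for (int d = 0; d < 3; d++) {
            n[d] = (int) Math.floor((max[d] - min[d]) / spacing);
        }
        for (int i = 0; i <= n[0]; i++) {
            for (int j = 0; j <= n[1]; j++) {
                for (int k = 0; k <= n[2]; k++) {
                    double[] p = {min[0] + i * spacing, min[1] + j * spacing, min[2] + k * spacing};
                    Atom nearest = null;
                    double best = Double.POSITIVE_INFINITY;
                    for (Atom a : atoms) {
                        double dx = p[0] - a.x(), dy = p[1] - a.y(), dz = p[2] - a.z();
                        double dist = Math.sqrt(dx * dx + dy * dy + dz * dz);
                        if (dist < best) {
                            best = dist;
                            nearest = a;
                        }
                    }
                    boolean overlapsProtein = best < nearest.vdw() + atomBuffer;
                    boolean tooFar = best > maxDist;
                    if (!overlapsProtein && !tooFar) {
                        kept.add(p);
                    }
                }
            }
        }
        return kept;
    }
}
```

With a single carbon-like atom (radius 1.70 Å), a 0.5 Å buffer, and a 6.0 Å upper bound, the kept points form a shell between 2.2 Å and 6.0 Å from the atom.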
### Per-pocket assignment (multi-valued)
1. For each pocket `p`, take all kept grid points within
`pocket_grid_assign_cutoff` of any atom in `p.surfaceAtoms`. That's the
*raw shell* — analogous to `SwinSiteLoader`'s `cutoutShell` at
`SwinSiteLoader.groovy:92-100`.
2. **Shape fill** (pluggable via `pocket_grid_fill`, runs **per-pocket**:
each pocket's raw shell is dilated independently, audit #6):
- `morph_closing` (default): morphological closing on the lattice. Mark
any unassigned lattice cell whose 6-/18-/26-neighborhood contains at least
`pocket_grid_fill_min_neighbors` already-assigned cells; iterate
until stable or `pocket_grid_fill_max_iters` reached. Integer-grid
native, no extra deps.
- `convex_hull`: build the 3D convex hull of the raw shell (Quickhull or
equivalent — TBD at plan time); include every lattice point inside.
Exact; pulls a hull dependency.
- `none`: keep the raw shell exactly.
The `PocketShapeFiller` strategy interface (see Extensibility) makes
adding alternatives a single-file change.
3. A grid point may belong to multiple pockets. In the output file each
`(point, pocket)` membership is a separate row.
### Descriptor computation
After assignment, for each pocket and each name in `pocket_descriptors`,
look up the registered `PocketDescriptor` and compute. See "Initial
descriptor menu" below.
## New params (additions to `Params.groovy`)
All carry `@RuntimeParam` (audit #7) — runtime / output concerns, not
training.
Allowed values for `pocket_grid_format` (audit #8, enumerated explicitly to
avoid drift): `csv`, `csv.gz`, `csv.zst`, `arrow`, `arrow.gz`, `arrow.zst`,
`parquet`.
| Param | Default | Notes |
|---|---|---|
| `export_pocket_grid` | `false` | gate for the grid-points file |
| `export_pocket_descriptors` | `false` | gate for the descriptors file |
| `export_pocket_grid_pml` | `false` | gate for the PyMOL visualization; requires `export_pocket_grid=true` (fail-fast otherwise, audit #16) |
| `pocket_grid_format` | `"csv"` | one of the enumerated values above |
| `pocket_grid_include_unassigned` | `false` | include `pocket = 0` rows in the grid file |
| `pocket_grid_spacing` | `1.0` (Å) | lattice edge; volume scales with this³ |
| `pocket_grid_max_dist` | `6.0` (Å) | upper bound: nearest-atom distance to keep a grid point |
| `pocket_grid_atom_buffer` | `0.5` (Å) | additive buffer on per-atom VdW exclusion: keep if `dist > vdw_radius(atom) + buffer` (audit #9) |
| `pocket_grid_assign_cutoff` | `4.5` (Å) | membership cutoff vs. `pocket.surfaceAtoms`; matches `SwinSiteLoader.SURFACE_ATOMS_CUTOFF` |
| `pocket_grid_fill` | `"morph_closing"` | one of `morph_closing`, `convex_hull`, `none` |
| `pocket_grid_fill_min_neighbors` | `3` | morph_closing only — neighbor count threshold |
| `pocket_grid_fill_max_iters` | `5` | morph_closing only — guard against runaway dilation |
| `pocket_descriptors` | `["volume"]` | list-param; each name selects a registered descriptor |
**Validation** (audit #10): unknown values in `pocket_descriptors`,
`pocket_grid_fill`, and `pocket_grid_format`, plus the
`export_pocket_grid_pml ⇒ export_pocket_grid` invariant, are checked at
Main startup. Same pattern as the cofactor validation at
`Main.groovy:142-153`.
## Output schemas
### `{name}_pocket_grid.{format}` (long format)
| Column | Type | Description |
|---|---|---|
| `x`, `y`, `z` | f64 | grid point coordinate |
| `pocket` | i32 | pocket rank this row belongs to; `0` only present if `pocket_grid_include_unassigned` is on |
**Sort order** (audit #5): rows sorted by `pocket` asc, then `x` asc,
`y` asc, `z` asc. `pocket=0` (if enabled) goes last so readers that only
care about assigned points can stop early. Deterministic and reproducible
across runs.
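
One way to express this sort order, including the "unassigned last" rule, is a chained comparator. The sketch below is illustrative Java with a hypothetical `Row` record, not the export-data class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GridRowOrderSketch {

    /** One long-format row: a grid point and the pocket rank it belongs to (0 = unassigned). */
    record Row(double x, double y, double z, int pocket) {}

    static final Comparator<Row> ORDER = Comparator
            .comparing((Row r) -> r.pocket() == 0)   // false sorts before true, so pocket 0 goes last
            .thenComparingInt(Row::pocket)
            .thenComparingDouble(Row::x)
            .thenComparingDouble(Row::y)
            .thenComparingDouble(Row::z);

    static List<Row> sorted(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(ORDER);
        return copy;
    }
}
```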
### `{name}_pocket_descriptors.{format}`
Base columns (always present), then one column per name in
`pocket_descriptors`:
| Column | Type | Source |
|---|---|---|
| `name` | string | `pocket.name` (requires `TableData` STRING support, prerequisite refactor) |
| `rank` | i32 | `pocket.rank` |
| `score` | f64 | `pocket.score` |
| `probability` | f64 | from score transformer; **column omitted entirely** when no transformer ran |
| `center_x`, `center_y`, `center_z` | f64 | `pocket.centroid` |
| `<descriptor>` | f64 / i32 | one per requested descriptor |
**`probability` column inclusion** (audit #19): controlled by a constructor
flag on the export-data class, mirroring `PointExportData.includeScore`
(`PointExportData.groovy:47-48`). Schema is fixed at construction; no
runtime branching on row write.
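
A small Java sketch of how the column list could be fixed at construction time, with `probability` present or absent as a whole. The NaN test mirrors the `probaTP` check mentioned in the plan; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DescriptorSchemaSketch {

    /** True when at least one pocket carries a real (non-NaN) probability. */
    static boolean includeProbability(double[] probabilities) {
        for (double p : probabilities) {
            if (!Double.isNaN(p)) {
                return true;
            }
        }
        return false;
    }

    /** Base columns, optional probability, then one column per requested descriptor. */
    static List<String> columns(boolean includeProbability, List<String> descriptorNames) {
        List<String> cols = new ArrayList<>(List.of("name", "rank", "score"));
        if (includeProbability) {
            cols.add("probability");
        }
        cols.addAll(List.of("center_x", "center_y", "center_z"));
        cols.addAll(descriptorNames);
        return cols;
    }
}
```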
## Initial descriptor menu
Shipped registry:
| Name | Output | Definition |
|---|---|---|
| `volume` | f64 (ų) | `\|assigned grid points\| × pocket_grid_spacing³` |
| `sphericity` | f64 in [0, 1] | `V_pocket / V_bounding_sphere`, where `V_bounding_sphere = (4/3)π · r³` with `r = max(\|p - centroid\|)` over the pocket's grid points. Quantization-free; 1 = perfect sphere. (audit #18; replaces the boundary-area formula) |
| `num_residues` | i32 | `pocket.residues.size()` (reuses existing accessor, audit #17) |
| `num_surface_atoms` | i32 | `pocket.surfaceAtoms.count` |
`volume` is the default value of `pocket_descriptors`. Others must be opted
in by name.
## Extensibility
All new Groovy classes carry `@CompileStatic` and `@Slf4j` per repo
convention (audit #20).
```
src/main/groovy/cz/siret/prank/program/routines/predict/output/descriptors/
├── PocketDescriptor.groovy # interface: String name(); Object compute(PocketGridContext ctx)
├── PocketDescriptorRegistry.groovy # name → factory; selection from Params.pocket_descriptors
├── VolumeDescriptor.groovy
├── SphericityDescriptor.groovy
├── NumResiduesDescriptor.groovy
└── NumSurfaceAtomsDescriptor.groovy
src/main/groovy/cz/siret/prank/program/routines/predict/output/grid/
├── PocketGrid.groovy # data: kept points + per-pocket assignment map
├── PocketGridBuilder.groovy # generation + assignment + fill orchestration
├── VdwRadiusTable.groovy # Atom → double, via CDK Elements + Krypton fallback
└── fill/
├── PocketShapeFiller.groovy # interface: Set<Point> fill(rawShell, allPoints, params)
├── MorphologicalCloser.groovy
├── ConvexHullFiller.groovy # may be stub initially
└── NoOpFiller.groovy
```
`PocketGridContext` exposes: the per-pocket grid-point set, the global
grid, the pocket, the protein, and `Params`. Adding a descriptor = drop one
file in `descriptors/` + register the name. Adding a fill strategy = drop
one file in `fill/` + extend the enum.
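The "drop one file + register the name" workflow can be sketched as a name-to-factory map. The Java below is a simplified illustration, not the real `PocketDescriptorRegistry`: `SketchContext` stands in for `PocketGridContext` with only three made-up fields, and all type names are hypothetical. Only the interface shape (`name()` and `compute(ctx)`) and the comma-separated selection string come from the spec.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical, reduced stand-in for PocketGridContext. */
record SketchContext(int numGridPoints, double spacing, int numResidues) {}

interface SketchDescriptor {
    String name();
    Object compute(SketchContext ctx);
}

final class SketchRegistry {
    private final Map<String, Supplier<SketchDescriptor>> factories = new LinkedHashMap<>();

    void register(String name, Supplier<SketchDescriptor> factory) {
        factories.put(name, factory);
    }

    /** Resolves a comma-separated selection, failing fast on unknown names. */
    List<SketchDescriptor> select(String csv) {
        List<SketchDescriptor> out = new ArrayList<>();
        for (String raw : csv.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) continue;
            Supplier<SketchDescriptor> factory = factories.get(name);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown pocket descriptor: " + name);
            }
            out.add(factory.get());
        }
        return out;
    }

    private static SketchDescriptor of(String name, Function<SketchContext, Object> fn) {
        return new SketchDescriptor() {
            public String name() { return name; }
            public Object compute(SketchContext ctx) { return fn.apply(ctx); }
        };
    }

    static SketchRegistry withDefaults() {
        SketchRegistry r = new SketchRegistry();
        r.register("volume", () -> of("volume", c -> c.numGridPoints() * Math.pow(c.spacing(), 3)));
        r.register("num_residues", () -> of("num_residues", c -> c.numResidues()));
        return r;
    }
}
```

Keeping the factories in insertion order means the output columns follow the order of the requested names rather than hash order.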
## Pocket grid visualization
Output:
- `{outdir}/visualizations/data/{name}_pocket_grid.pdb.gz` — one HETATM per
grid point; pocket rank stored in the residue-sequence column (mirrors
`writeLabeledPointsPdb` at `PredictionVisualizer.groovy:44-56`); generated
in long format (one HETATM per `(point, pocket)` pair so PyMOL can split
by residue).
- `{outdir}/visualizations/{name}_pocket_grid.pml` — small PyMOL script
that `load`s the PDB and colors by residue.

This **PDB-sidecar approach** (audit #11) replaces the earlier inline
`pseudoatom`-per-point design: at ~20k to 100k grid points, the inline
approach would take seconds to minutes for PyMOL to load.
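As an illustration of the sidecar format, the following Java sketch formats one fixed-column HETATM record with the pocket rank in the residue-sequence field (columns 23 to 26). It is not the code of `writeLabeledPointsPdb`; the class name, the atom name `C`, the residue name `GRD`, and chain `A` are placeholders chosen for this example, and wrapping the serial number at 100000 is an assumption made so the 5-column serial field does not overflow on large grids.

```java
import java.util.Locale;

final class GridPdbSketch {
    /**
     * Formats one 78-character HETATM record.
     * Atom name, residue name and chain are placeholders, not values from the codebase.
     */
    static String hetatm(int serial, int pocketRank, double x, double y, double z) {
        return String.format(Locale.ROOT,
            "HETATM%5d  C   GRD A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s",
            serial % 100000,   // serial field is 5 columns wide; wrap for large grids
            pocketRank,        // residue-sequence columns 23-26 carry the pocket rank
            x, y, z,           // coordinates in columns 31-54
            1.00, 0.00,        // occupancy and B-factor, unused here
            "C");              // element symbol, right-justified in columns 77-78
    }
}
```

In the long format described above, a grid point that belongs to two pockets would simply produce two such lines with different residue numbers.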

**Renderer:**
`src/main/groovy/cz/siret/prank/program/visualization/renderers/PocketGridPymolRenderer.groovy`,
parallel to `PymolRenderer` / `ChimeraXRenderer`. Takes the in-memory
`PocketGrid` (not the CSV file: the grid is already in memory and the PDB
sidecar is derived from it; audit #15 makes the format constraint moot).

**Colors:** reuse `PredictionVisualizer.generatePocketColors(numPockets)`
(`PredictionVisualizer.groovy:38`) so the grid PML matches the main pocket
PML palette (audit #13).

**Layout in the PML:**

- `load .../data/{name}_pocket_grid.pdb.gz, pocket_grid`
- Per pocket: `create pocket_grid_<rank>, pocket_grid and resi <rank>` and
`color <hex>, pocket_grid_<rank>`.
- `show spheres, pocket_grid_*` with small `sphere_scale` (e.g. 0.3).
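The layout above maps directly onto a short script generator. This Java sketch is illustrative only and is not `PocketGridPymolRenderer`: the class and parameter names are hypothetical, colors are passed in as `RRGGBB` strings and emitted in PyMOL's `0xRRGGBB` form, and iterating over the ranks that are actually present (rather than `1..maxRank`) is a choice made here so that no empty per-pocket object is created.

```java
import java.util.Map;
import java.util.TreeMap;

final class GridPmlSketch {
    /** Builds the PML text: load the sidecar, split and color per pocket, show spheres. */
    static String script(String pdbPath, Map<Integer, String> hexByRank, double sphereScale) {
        StringBuilder sb = new StringBuilder();
        sb.append("load ").append(pdbPath).append(", pocket_grid\n");
        // TreeMap gives ascending rank order regardless of the input map type.
        for (Map.Entry<Integer, String> e : new TreeMap<>(hexByRank).entrySet()) {
            int rank = e.getKey();
            sb.append("create pocket_grid_").append(rank)
              .append(", pocket_grid and resi ").append(rank).append('\n');
            sb.append("color 0x").append(e.getValue())
              .append(", pocket_grid_").append(rank).append('\n');
        }
        sb.append("set sphere_scale, ").append(sphereScale).append('\n');
        sb.append("show spheres, pocket_grid_*\n");
        return sb.toString();
    }
}
```

A call such as `GridPmlSketch.script("data/p_pocket_grid.pdb.gz", Map.of(1, "FF0000"), 0.3)` yields a five-line script that PyMOL can run next to the sidecar file.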

**Path layout** (audit #12): data files (`_pocket_grid.{fmt}`,
`_pocket_descriptors.{fmt}`) at the root of `outdir`, matching the SAS
points export. Visualization artifacts (`_pocket_grid.pdb.gz`,
`_pocket_grid.pml`) under `visualizations/` / `visualizations/data/`,
matching the existing main-PML layout.

**Master visualization switch** (audit #14): respects `visualizations=false`.
If visualizations are globally off, the grid PML + PDB sidecar are
skipped even when `export_pocket_grid_pml=true`. This is the single
off-switch for all visualization output.

**Independence from `vis_renderers`:** the new renderer has its own gate
(`export_pocket_grid_pml`) and does *not* tie into the
`["pymol", "chimerax"]` renderer list. The grid PML is a power-user output
that shouldn't be implicit. Easy to revisit if usage patterns argue
otherwise.
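The gating rules in this section reduce to two small checks, sketched below in Java. The class and method names are hypothetical; the boolean parameters mirror the `visualizations`, `export_pocket_grid`, and `export_pocket_grid_pml` params, and the startup check mirrors the `export_pocket_grid_pml ⇒ export_pocket_grid` invariant that the spec lists under the `Main.groovy` validation hooks.

```java
final class GridVizGateSketch {
    /** Startup invariant: export_pocket_grid_pml implies export_pocket_grid. */
    static void validate(boolean exportPocketGrid, boolean exportPocketGridPml) {
        if (exportPocketGridPml && !exportPocketGrid) {
            throw new IllegalArgumentException(
                "export_pocket_grid_pml requires export_pocket_grid");
        }
    }

    /**
     * The grid PML and PDB sidecar are written only when the master switch and
     * the feature gate are both on. The vis_renderers list is deliberately not consulted.
     */
    static boolean shouldWriteGridPml(boolean visualizations, boolean exportPocketGridPml) {
        return visualizations && exportPocketGridPml;
    }
}
```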
## CLI examples
```bash
# grid + default descriptors (just volume), parquet
prank predict -f protein.pdb -export_pocket_grid 1 -export_pocket_descriptors 1 \
-pocket_grid_format parquet
# custom descriptor list + tighter grid
prank predict dataset.ds -export_pocket_descriptors 1 \
-pocket_descriptors "volume,sphericity,num_residues,num_surface_atoms" \
-pocket_grid_spacing 0.75 -pocket_grid_max_dist 5
# rescore with grid export, arrow.zst
prank rescore fpocket.ds -export_pocket_grid 1 -pocket_grid_format arrow.zst
# switch fill strategy (e.g. for ablation studies)
prank predict -f protein.pdb -export_pocket_grid 1 -pocket_grid_fill none
# grid CSV + PyMOL visualization
prank predict -f protein.pdb -export_pocket_grid 1 -export_pocket_grid_pml 1
# also keep the unassigned envelope (e.g. for debugging the grid generator)
prank predict -f protein.pdb -export_pocket_grid 1 -pocket_grid_include_unassigned 1
```
## Files touched (preview, plan will refine)
New:
- `descriptors/` and `grid/` packages as above
- `PocketGridExporter.groovy` + `PocketDescriptorsExporter.groovy` next to
`PointsExporter.groovy`
- `PocketGridExportData` / `PocketDescriptorsExportData` data classes next
to `PointExportData.groovy`
- `PocketGridPymolRenderer.groovy` under `program/visualization/renderers/`
- Tests next to each new class
- **`documentation/export-pocket-grid.md`** — user-facing how-to for the
grid file: algorithm summary, sort order, params, format options, CLI
examples, PyMOL visualization details
- **`documentation/export-pocket-descriptors.md`** — descriptors file
format, descriptor catalog with formulas, extensibility for adding new
descriptors

Modified:
- `Params.groovy` — 11 new `@RuntimeParam` fields (table above)
- `Main.groovy` — startup validation hooks for `pocket_descriptors`,
`pocket_grid_fill`, `pocket_grid_format`, and the
`export_pocket_grid_pml ⇒ export_pocket_grid` invariant
- `PredictPocketsRoutine.groovy` + `RescorePocketsRoutine.groovy` — wire
the new exporters and renderer at the same hook point as
`PointsExporter.tryExportPoints`
- `TableData.groovy` + `TableExporter.groovy` + `PointExportData.groovy`:
STRING column-type support (prerequisite refactor)
- `GridGenerator.java` — extend `sampleGridPointsAroundAtoms` to accept a
per-atom minDist (VdW + buffer) alongside the existing maxDist
- `documentation/export-points.md` — cross-reference the two new docs from
the "See also" section

Not touched:
- `PredictionSummary.toCSV()` / `predictions.csv` schema — descriptors live
in their own file.
- `PocketStats.realVolumeApprox` — keep as-is; SwinSite still uses it. The
new grid-volume is independent.
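The `GridGenerator.java` change listed under "Modified" (a per-atom minDist alongside the existing maxDist) can be pictured as a per-point filter. The Java sketch below is a guess at the intended semantics, not the actual `sampleGridPointsAroundAtoms` signature or implementation: it assumes a point is kept when it lies outside every atom's own exclusion radius (VdW + buffer) and within `maxDist` of at least one atom. All names are hypothetical.

```java
final class GridExclusionSketch {
    /**
     * @param p        candidate grid point {x, y, z}
     * @param atoms    atom coordinates, one {x, y, z} per atom
     * @param minDist  per-atom exclusion radius (assumed: VdW radius + buffer)
     * @param maxDist  global envelope distance
     */
    static boolean keepPoint(double[] p, double[][] atoms, double[] minDist, double maxDist) {
        boolean nearSomeAtom = false;
        for (int i = 0; i < atoms.length; i++) {
            double dx = p[0] - atoms[i][0];
            double dy = p[1] - atoms[i][1];
            double dz = p[2] - atoms[i][2];
            double d2 = dx * dx + dy * dy + dz * dz;
            if (d2 < minDist[i] * minDist[i]) return false;   // inside atom i's exclusion shell
            if (d2 <= maxDist * maxDist) nearSomeAtom = true; // within the envelope of atom i
        }
        return nearSomeAtom;
    }
}
```

Comparing squared distances avoids a square root per atom, which matters when the filter runs over every candidate point of a dense grid.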
## Scope notes
- Cofactor atoms participate in the bounding box and the VdW exclusion via
their inclusion in `protein.proteinAtoms`. They do **not** affect
`pocket.surfaceAtoms` membership for assignment — the existing pocket
surface-atom set defines membership.
- Outputs are computed *after* score transformation so `probability` is
available when applicable.
- `breaking-changes.md` (2.7 or whenever this ships) gets a bullet for the
new param family and the new output files.
## Followups / not in this spec
- Per-residue descriptors (different file, different aggregation).
- Pocket overlap matrix (cheap byproduct of the long-format grid file —
group-by `pocket` and intersect, or compute eagerly and dump as
`{name}_pocket_overlap.csv`).
- Long-format SAS-points export (parallel change, separate spec).
- Real-3D-hull `convex_hull` filler (initial ship may stub it).
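The "group-by `pocket` and intersect" idea for the overlap matrix can be sketched from the long-format rows alone. The Java below is an illustration of that follow-up, not planned code: `OverlapRow` and `PocketOverlapSketch` are hypothetical names, points are matched by exact coordinate equality (which assumes both rows come from the same generated grid), and `pocket=0` rows are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical long-format row: one (point, pocket) pair. */
record OverlapRow(double x, double y, double z, int pocket) {}

final class PocketOverlapSketch {
    /** Returns shared-point counts keyed by [lowerRank, higherRank]. */
    static Map<List<Integer>, Integer> overlap(List<OverlapRow> rows) {
        // Group pocket ranks by grid point.
        Map<List<Double>, TreeSet<Integer>> pocketsByPoint = new HashMap<>();
        for (OverlapRow r : rows) {
            if (r.pocket() == 0) continue;  // skip the unassigned envelope
            pocketsByPoint
                .computeIfAbsent(List.of(r.x(), r.y(), r.z()), k -> new TreeSet<>())
                .add(r.pocket());
        }
        // Every pair of pockets sharing a point contributes one count.
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (TreeSet<Integer> pockets : pocketsByPoint.values()) {
            List<Integer> ranks = new ArrayList<>(pockets);
            for (int i = 0; i < ranks.size(); i++) {
                for (int j = i + 1; j < ranks.size(); j++) {
                    counts.merge(List.of(ranks.get(i), ranks.get(j)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Multiplying a count by `pocket_grid_spacing³` would turn it into an overlap volume in the same units as the `volume` descriptor.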

@@ -204,10 +204,7 @@ class Main implements Parametrized, Writable {
}
}
-/**
- * Fail-fast validation for the pocket-grid export feature
- * (see misc/todo/pocket_grid/SPEC.md).
- */
+/** Fail-fast validation for the pocket-grid export feature. */
private void validatePocketGridParams() {
// pocket_grid_format must be one of the values supported by TableExporter.
Set<String> allowedFormats = ['csv', 'csv.gz', 'csv.zst',

@@ -156,8 +156,7 @@ class PredictPocketsRoutine extends Routine {
new GetcleftOutputCalculator().generateGetcleftSasPdbFiles(pair.prediction, outdir)
}
-// Pocket grid + descriptors export + optional PyMOL viz
-// (see misc/todo/pocket_grid/SPEC.md).
+// Pocket grid + descriptors export + optional PyMOL viz.
PocketGridOutputs.exportIfEnabled(pair.prediction, item.protein, outdir, item.label)
}

@@ -129,8 +129,7 @@ class RescorePocketsRoutine extends Routine {
// Export SAS points with feature vectors and scores (pocket points only in rescore mode)
PointsExporter.tryExportPoints(rescorer.exportData, outdir, item.label)
-// Pocket grid + descriptors export + optional PyMOL viz
-// (see misc/todo/pocket_grid/SPEC.md).
+// Pocket grid + descriptors export + optional PyMOL viz.
PocketGridOutputs.exportIfEnabled(pair.prediction, item.protein, outdir, item.label)
}