Commit Graph

1939 Commits

Author SHA1 Message Date
rdk
ddd5d8a11c Add Seq2PocketLoader for Seq2Pocket pocket predictions
Parses per-protein <ID>_predictions.txt (semicolon CSV) and resolves
atom_ids against queryProtein.allAtoms by PDB serial. Empty/header-only
files produce 0 pockets gracefully. Prediction is bound to the
caller-supplied queryProtein, avoiding the ConcavityLoader bug class.

- Dataset.groovy: new case "seq2pocket"
- README.md: list SwinSite and Seq2Pocket in rescoring methods;
  cite pocketeer.ds + swinsite.ds in test_data/ examples
- CLAUDE.md: note that distro/README.md is a transient build artifact
- Test fixtures: 5 real predictions under distro/test_data/, plus
  unsorted/header-only/path-independence variants under src/test/resources/
- Seq2PocketLoaderTest: 10 tests, all passing
2026-05-16 12:40:36 +02:00
rdk
e9641680c1 Silence javac deprecation/unchecked notes
- GenericVector.toList(): replace deprecated DefaultGroovyMethods.toList
  (Groovy 5) with a plain Java loop; drop unused addTo() (no callers)
- Atoms(List<? extends Atom>): @SuppressWarnings("unchecked") for the
  intentional wrap-without-copy
- KdNode.splitLeafNode: @SuppressWarnings("unchecked") for casts from
  the Object[] backing store
2026-05-15 16:15:26 +02:00
rdk
c6ee163ece Audit cleanup: remove dead param, dead commented code, stale docs
- Drop dead mask_unknown_residues=true from default(_rescore).groovy
  (param removed from Params.groovy in 1b7809a6, 2019; configs missed)
- Rewrite distro/models/readme.md to match models on disk (add rescore_2024,
  rescore_conservation; remove nonexistent conservation.model)
- Remove broken documentation/rescoring.md link from distro/README.md
- distro/config/readme.md: drop nonexistent working.groovy reference,
  fix github link master->develop
- Delete dead commented-out method bodies in PdbUtils, RPlotter,
  PredictionVisualizer
- Fix typo in Main.groovy javadoc
2026-05-15 09:34:28 +02:00
rdk
9fd7ffe0db Bump gradle wrapper 9.5.0->9.5.1, slf4j 2.0.17->2.0.18, parquet-floor 1.65->1.69 2026-05-15 02:57:07 +02:00
rdk
c78519c98e Cofactor smoke harness, CDK VdW workaround, analyze-cofactors fixes
Bumps faster-molecular-surface 1.0 -> 1.1, vendored in
lib/local-mvn-repo/. The 1.1 release adds a VdW radius fallback for
elements whose CDK Elements enum entry is null (Co, Ni, Cu, Rh, Os, Ir,
plus radioactive/synthetic). Without the fix, cobalamin-bearing
structures crashed surface computation under -cofactors.

PatchedCdkNumericalSurface wraps the default CDK NumericalSurface (used
when -use_optimized_surface 0) with the same fallback, via a Krypton
proxy for null-VdW atoms. Surface.groovy switched over to it. Unit tests
mirror the FMS-side regressions.

AnalyzeRoutine.cmdCofactors: replace Struct.getHetGroups with
Struct.getLigandGroups (2 call sites) so GDP/GTP/ATP and other groups
that BioJava classifies as NUCLEOTIDE/AMINOACID don't get falsely
reported as "name not in structure" in cofactor_matches.csv or omitted
from het_groups.csv. Mirrors the M1 fix applied earlier to
CofactorHandler.extractCofactorAtoms.

testsets.sh: new cofactors_full() function exercising the cofactor
demo + full datasets in p2rank-datasets2/other/cofactors/ (predict,
analyze cofactors, -aa_mapping composition, visualizations,
export-points). Uses -fail_fast 1 so per-structure errors surface as
test failures rather than silent skips.
2026-05-15 00:35:08 +02:00
rdk
79cda78473 Add cofactor-as-protein-surface feature (Issue #79 part 2)
The -cofactors flag and dataset cofactors column accept LigandDefinition
specifiers ("FAD", "FAD[atom_id:N]", "FAD[contact_res_ids:A_T259,A_D246]").
Matched HET groups merge into the protein surface (proteinAtoms) and are
excluded from ligand listings; per-item resolution lets a dataset column
override the global Params.cofactors.

New: analyze cofactors subcommand (HETATM survey + specifier dry-run),
PyMOL teal-stick visualization (vis_highlight_cofactors), distant-cofactor
and chain-excluded WARN diagnostics, aa_mapping collision WARN (R19),
drop-in safety benchmark with byte-equality on a never-present specifier.

Documentation in documentation/cofactors.md (user-facing) and
documentation/dev/cofactors.md (engineering record with R1-R24 design choices
and post-merge audit fixes). Tests in CofactorHandlerTest,
CofactorIntegrationTest, CofactorPipelineTest, CofactorAnalyzeTest,
DataTableCsvTest plus a Log4jCapture test helper.
2026-05-14 07:58:14 +02:00
rdk
b2a23179f1 Bump groovy 5.0.5->5.0.6, log4j 2.25.4->2.26.0, zstd-jni 1.5.7-7->1.5.7-8 2026-05-12 01:56:07 +02:00
rdk
0e8bb0cb33 Add SwinSiteLoader for SwinSite pocket predictions
Registers `swinsite` as a third-party predictor in Dataset.groovy. The
loader reads grid<N>_score_<float>.mol2 (raw voxel points) per pocket,
parses score from the filename, computes pocket centroid from the grid,
and derives surfaceAtoms via cutoutShell against queryProtein.exposedAtoms
(4.5 -> 10 A expanding shell), mirroring ConcavityLoader.

Reads grid mol2 instead of pocket mol2: pocket mol2 atoms are standalone
copies with chain reset to 'A' and synthetic residue names, so they break
P2Rank's residue/conservation/ASA feature lookups. Grid + cutoutShell
keeps surfaceAtoms bound to real queryProtein atoms.

Mol2 parsing is a small inline @<TRIPOS>ATOM scan rather than CDK's
Mol2Reader: CDK has a lazy-init race in AtomTypeFactory that NPEs under
parallel dataset processing.

Ships swinsite.ds plus 6 protein PDBs (1tjw_A from SwinSite's
test_protein_only example, plus 1a26A/1a2kC/1afkA/1atlA/1bqoB from
coach420) covering 1/2/3/4/6-pocket cases. 1atlA's on-disk N-order is
non-monotonic in score (0.7288, 0.0664, 0.3433), exercising the rerank.
SwinSiteLoaderTest covers all six fixtures plus the
predictionIsBoundToQueryProtein contract and empty-dir tolerance.
2026-05-08 01:05:15 +02:00
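The filename-based scoring and rerank described in 0e8bb0cb33 can be sketched roughly as follows. This is only an illustration and not the SwinSiteLoader code: the class and method names are hypothetical, and it assumes scores are written as plain decimals in the `grid<N>_score_<float>.mol2` name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of P2Rank.
final class GridFileRanking {

    // Matches names such as grid2_score_0.0664.mol2 (plain decimal score assumed).
    private static final Pattern GRID_FILE =
            Pattern.compile("grid(\\d+)_score_(\\d*\\.?\\d+)\\.mol2");

    /** Returns the score encoded in the file name, or NaN if the name does not match. */
    static double parseScore(String fileName) {
        Matcher m = GRID_FILE.matcher(fileName);
        return m.matches() ? Double.parseDouble(m.group(2)) : Double.NaN;
    }

    /** Returns the matching file names ordered by descending score (best pocket first). */
    static List<String> rankByScore(List<String> fileNames) {
        List<String> ranked = new ArrayList<>();
        for (String name : fileNames) {
            if (!Double.isNaN(parseScore(name))) {
                ranked.add(name); // silently skip files that do not follow the pattern
            }
        }
        ranked.sort((a, b) -> Double.compare(parseScore(b), parseScore(a)));
        return ranked;
    }
}
```

With the three 1atlA scores quoted above (0.7288, 0.0664, 0.3433), the on-disk order 1, 2, 3 would come out as 1, 3, 2. The grid indices here are illustrative.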
rdk
59bc84c265 Mention pocket column alongside score in export-points docs
The score and pocket columns share the same predict/rescore-only
origin, so describe them together in the prose, the export-points
"not contained" caveat, the predict/rescore output description, and
the "Which command to use?" table.
2026-05-07 03:21:38 +02:00
rdk
f5ad22f604 Document 2.6 evaluation-metric fixes and note ligand-detection breaking change
Add documentation/dev/evaluation-metric-fixes-2.6.md covering DSO/DSWO integer-
division fixes, the ResidueSite DCC centroid fix, and the BioJava GroupType
ligand-detection fix. Mention the ligand-detection change in breaking-changes.md
since it shifts DCA/DCC on datasets containing GDP/GTP/ATP/SHR-like ligands.
2026-05-06 14:46:26 +02:00
rdk
15349bb48f Add pocket rank column to points export, fix overlap labeling
The points export (predict/rescore -export_points 1) now includes an
integer 'pocket' column matching newRank in *_predictions.csv, so users
can directly aggregate per-pocket descriptors without a spatial join.
Standalone 'export-points' (no prediction) omits the column.

Pocket-extension shells can overlap, so a single SAS point can sit in
multiple pocket.labeledPoints lists. Previously the assignment loop was
last-write-wins and so gave the worst rank to shared points, which was
counter-intuitive for both visualization (PredictionVisualizer PDB
output) and descriptor aggregation. PocketRescorer.setNewRanks now
iterates pockets best-first with a guard, so the lowest newRank wins;
the redundant lp.pocket write in PocketPredictor is removed.

TableData gains a per-column ColumnType (DOUBLE default, INT) so
TableExporter emits true integers in CSV (no decimals), Arrow (Int32),
and Parquet (INT32) for the pocket column.

Bump version to 2.6.0-dev.8.
2026-05-06 14:08:29 +02:00
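The overlap-labeling change in 15349bb48f comes down to "first claim wins" when pockets are visited best-first. A minimal sketch of that idea, using plain integer point ids instead of P2Rank's point and pocket classes (all names below are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the PocketRescorer code.
final class PocketRankAssignment {

    /**
     * pocketsBestFirst.get(i) holds the ids of the points labeled by the pocket
     * with rank i + 1. Returns a map from point id to assigned rank.
     */
    static Map<Integer, Integer> assignBestFirst(List<List<Integer>> pocketsBestFirst) {
        Map<Integer, Integer> rankOfPoint = new HashMap<>();
        for (int i = 0; i < pocketsBestFirst.size(); i++) {
            int rank = i + 1;
            for (Integer pointId : pocketsBestFirst.get(i)) {
                // Guard: a point already claimed by a better-ranked pocket keeps that rank.
                rankOfPoint.putIfAbsent(pointId, rank);
            }
        }
        return rankOfPoint;
    }

    /** The earlier behaviour for comparison: later (worse) pockets overwrite earlier ones. */
    static Map<Integer, Integer> assignLastWriteWins(List<List<Integer>> pocketsBestFirst) {
        Map<Integer, Integer> rankOfPoint = new HashMap<>();
        for (int i = 0; i < pocketsBestFirst.size(); i++) {
            for (Integer pointId : pocketsBestFirst.get(i)) {
                rankOfPoint.put(pointId, i + 1);
            }
        }
        return rankOfPoint;
    }
}
```

For two pockets that share point 2, the guarded version assigns it rank 1, while the unguarded loop leaves it with rank 2.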
rdk
ee8ff7b471 Bump Gradle wrapper 9.4.1->9.5.0 2026-04-30 12:07:55 +02:00
rdk
9fe0e28bc0 Bump gradle-versions-plugin 0.53.0->0.54.0, commons-io 2.21.0->2.22.0, guava 33.5.0->33.6.0, gson 2.13.2->2.14.0, parquet-floor 1.64->1.65 2026-04-29 22:33:45 +02:00
rdk
c143e0fa9c Fix ConcavityLoader to bind prediction to queryProtein
ConcavityLoader.loadPrediction was ignoring its queryProtein parameter
and binding the returned Prediction to a Protein loaded from
*_residue.pdb (a pocket-touching residue subset, not the full protein).
Downstream features keyed on prediction.protein.fileName then resolved
against the wrong basename — most visibly conservation lookup, which
searched for "<ID>_<submethod>_residue_<chain>.hom" instead of
"<ID>_<chain>.hom" and silently produced zero conservation features.
Other feature extractors were similarly reading the truncated atom set.

The residue subset is still loaded and used to define the per-pocket
surface-atom shell (no behaviour change there), but the Prediction is
now bound to queryProtein, matching FPocketLoader and PUResNetLoader.

Add ConcavityLoaderTest plus a matching test in FPocketLoaderTest that
assert the loader-contract invariant prediction.protein === queryProtein.
2026-04-29 00:41:01 +02:00
rdk
42dfe7fd6f Fix PUResNet pocket loader to handle shifted insertion codes
PUResNet pocket PDBs occasionally left-shift the residue insertion code
into column 26 instead of column 27, breaking BioJava's strict resSeq
parser with NumberFormatException and silently dropping affected
predictions (216 of 9955 entries on holo4k+pdbbind2020).

Add PUResNetPdbRepair which detects the malformed pattern and rewrites
it in memory before parsing. Wire PUResNetLoader through it. PdbUtils
and the rest of the load path are unchanged.
2026-04-28 22:25:44 +02:00
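The malformed pattern handled in 42dfe7fd6f can be pictured with the fixed-column PDB layout, where resSeq occupies columns 23-26 and the insertion code column 27. The sketch below is an assumed reconstruction of such a repair, not the actual PUResNetPdbRepair code; the class name and the exact detection rule are hypothetical.

```java
// Hypothetical illustration of an in-memory repair for a left-shifted insertion code.
final class InsertionCodeRepair {

    /**
     * If an ATOM/HETATM line has a letter in column 26 (where the last resSeq digit
     * belongs) and a blank in column 27 (the iCode column), move the letter to
     * column 27 and right-shift the residue number by one column.
     */
    static String repairLine(String line) {
        if (line.length() < 27) {
            return line;
        }
        if (!(line.startsWith("ATOM") || line.startsWith("HETATM"))) {
            return line;
        }
        char col26 = line.charAt(25); // 0-based index 25 is column 26
        char col27 = line.charAt(26);
        if (Character.isLetter(col26) && col27 == ' ') {
            String resSeqDigits = line.substring(22, 25); // columns 23-25
            // New columns 23-26 are " " + digits, the letter lands in column 27.
            return line.substring(0, 22) + " " + resSeqDigits + col26 + line.substring(27);
        }
        return line;
    }
}
```

The line length is preserved, so the coordinate columns that follow stay aligned. Well-formed lines pass through unchanged because column 26 then holds a digit or a blank.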
rdk
43b1f7dcf1 Fix pocket centroid calculation in ConcavityLoader and PUResNetLoader
Use centroid instead of centerOfMass in ConcavityLoader, set centroid
explicitly in PUResNetLoader, and change the type of POCKET_GRID_TO_SURFACE_DIST to int.
2026-04-03 19:30:27 +02:00
rdk
994ad45238 Bump groovy 5.0.4->5.0.5, log4j 2.25.3->2.25.4 2026-04-01 22:25:51 +02:00
rdk
17a4304d29 Add rg, n_unp_pockets, n_unp_pockets_multichain fields to AhojSiteInfo 2026-04-01 12:44:10 +02:00
rdk
858ba45fe7 Refactor AhojUbsSiteParser to use CSV library and add AhojSiteInfo data class
- Replace manual line.split(",") with Apache Commons CSV (column-name access)
- Support both reduced (9-col) and full (59-col) ahoj_ubs CSV formats
- Add AhojSiteInfo: typed data class for 14 pocket metadata fields
- Add secondaryData map to ResidueSite for extensible metadata
- Export AhojSiteInfo columns in observed_sites.csv when available
- Add comprehensive parser tests for both CSV formats
- Add test data files and format documentation
2026-04-01 10:22:43 +02:00
rdk
6cf293478a Add atom hybridization feature (one-hot sp2/sp3)
CSV-based lookup for standard amino acid atoms with tiered fallback
for non-standard residues (backbone name match, then element-based default).
2026-03-21 21:55:00 +01:00
rdk
1997ab948e switch CI Java distribution from temurin to oracle 2026-03-21 18:42:22 +01:00
rdk
1c636757d6 update CI Java version matrix: drop 23/24, add 26 2026-03-21 17:54:56 +01:00
rdk
b58726c27e bump arrow and parquet-floor dependencies 2026-03-21 17:52:37 +01:00
rdk
0a51f504d0 bump gradle 2026-03-21 16:04:31 +01:00
rdk
a66bea74be Add eval_output_prediction_files param to output per-protein prediction CSVs in eval commands 2026-03-17 18:59:13 +01:00
rdk
faddcfb70f Lazy-init EnergyCalculator and LJEnergyCalculator in energy features 2.6.0-dev.7 2026-03-16 07:55:16 +01:00
rdk
48cb681aaa Refactor DSO/DSWO: replace Tuple2 with OverlapCounts, cache counts instead of Atoms, simplify CdkUtils 2026-03-16 03:20:48 +01:00
rdk
5b4613c3a4 Extract FpocketAdHocHelper, add run_fpocket_ad_hoc param for eval-rescore and rescore commands 2026-03-16 03:20:41 +01:00
rdk
ba53b97e90 Add per-method CSVs and grouped summary to binding-site-centers, add DataTable filter/distinctValues/formatGroupedSummaryTable 2026-03-16 01:06:44 +01:00
rdk
91987129fe Bump version to 2.6.0-dev.7 2026-03-15 21:37:05 +01:00
rdk
8852739016 Add DCC_4 protein-centric success rate metrics 2026-03-15 21:35:53 +01:00
rdk
a814157e2b Minor cleanups: fix typos, normalize loop syntax and imports in Evaluation 2026-03-15 21:32:23 +01:00
rdk
f3616da217 Unify Protein.sites to contain all binding sites, add predictedPocket to BindingSite interface
Protein.sites now holds ground-truth binding sites for both ligand-defined
and explicit (residue-based) evaluation modes. Sites are populated from
ligands via populateSitesFromLigands() when no explicit sites are defined.

- Add predictedPocket and setSasPoints to BindingSite interface
- Add predictedPocket field to ResidueSite
- Rename assignPocketsToLigands to assignPocketsToSites (works on BindingSite)
- Update calcCoveragesProt to use BindingSite.predictedPocket
- Determine isLigandMode via instanceof instead of sites.isEmpty()
- Unify PymolRenderer sites/ligands branch into single BindingSite loop
- Simplify AnalyzeRoutine.cmdBindingSiteCenters to use p.sites directly
2026-03-15 21:25:49 +01:00
rdk
829cf9b8be Return typed result objects from calcConservationStats and calcOverlapStatsForPockets 2026-03-15 20:28:51 +01:00
rdk
8a516228e1 Fix @CompileStatic errors in Evaluation: destructuring assignment, int-to-Double casts 2026-03-15 19:59:15 +01:00
rdk
5ac9aab18a Refactor Evaluation: simplify avg/div methods, use Function instead of Closure, extract writeScoresToFileIfRequested 2026-03-15 19:27:15 +01:00
rdk
20236ef092 Refactor conservation/chains analysis, add @CompileStatic to Evaluation, rename criterium to criterion 2026-03-15 17:59:53 +01:00
rdk
d9de1fba7e Add contact_atoms_centroid site evaluation center method for ligand-defined sites 2026-03-15 17:09:04 +01:00
rdk
49a8430a7d Add binding-site-centers command, refactor center methods, consolidate error reporting
- Rename SiteCentroidMethod to SiteCenterMethod
- Extract getCenterForMethod(SiteCenterMethod) into BindingSite interface
  for thread-safe, param-independent center calculation
- Refactor Ligand/ResidueSite getCenterForEval() to delegate to getCenterForMethod()
- Add analyze binding-site-centers command comparing all center methods per site
- Add Dataset.Result.writeErrorsAndGetSummary() and use it across all
  AnalyzeRoutine commands for consistent error reporting to both console and CSV
2026-03-14 18:22:47 +01:00
rdk
0e0cb47907 Add ca_atoms_centroid site evaluation center method with tests 2026-03-14 15:57:41 +01:00
rdk
1ecb29f876 Add load_ligands_from_separate_files param for loading ligands from individual ligand_* files 2026-03-13 18:21:26 +01:00
rdk
0b5b61304d Add legacy conservation file name format fallback (e.g. 2ed4_A.) 2026-03-13 17:22:27 +01:00
rdk
e7fc457f6a Fix ligand detection for BioJava GroupType misclassifications
BioJava assigns GroupType based on its Chemical Component Dictionary,
not structural role. Ligands in non-polymer chains can get any GroupType:
- GDP, GTP, ATP -> GroupType.NUCLEOTIDE
- SHR and similar -> GroupType.AMINOACID
- Most others -> GroupType.HETATM

Previously only HETATM groups were detected as ligands, causing errors
like "Ligand definition 'GDP' matches no ligands" for nucleotide and
amino acid derivative ligands.

Fix: any non-water group in a NONPOLYMER chain is now a ligand
candidate, regardless of GroupType. Polymer chain groups (protein AA,
DNA/RNA) are only included if they have GroupType.HETATM.

Add test PDB files (1a2kC.pdb with GDP, 1e5qA.pdb with SHR) and
comprehensive tests for all three GroupType cases.
2.6.0-dev.6
2026-03-10 14:34:28 +01:00
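The ligand-candidate rule stated in e7fc457f6a can be written as a small predicate. The sketch uses stand-in enums rather than BioJava's own types so that it stays self-contained. The names are hypothetical, and excluding water inside polymer chains as well is an assumption not spelled out in the commit message.

```java
// Hypothetical illustration of the rule, independent of BioJava.
final class LigandCandidateRule {

    enum GroupKind { AMINOACID, NUCLEOTIDE, HETATM }

    enum ChainKind { POLYMER, NONPOLYMER }

    static boolean isLigandCandidate(GroupKind group, ChainKind chain, boolean isWater) {
        if (isWater) {
            return false; // water is never a ligand candidate (assumed for both chain kinds)
        }
        if (chain == ChainKind.NONPOLYMER) {
            return true; // any group type counts, e.g. GDP (NUCLEOTIDE) or SHR (AMINOACID)
        }
        // In polymer chains, only groups explicitly typed HETATM are included.
        return group == GroupKind.HETATM;
    }
}
```

Under the previous HETATM-only rule, the GDP and SHR cases would have returned false, which matches the "matches no ligands" errors described above.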
rdk
d78f80ee73 Extract writeCases() method, rename sites.csv to observed_sites.csv
Consolidate case CSV writing into Evaluation.writeCases(). Remove
duplicate DSO_0.1 criterion and stale TODO comments.
2026-03-10 03:24:44 +01:00
rdk
838b0a697f Fix integer division bug in DSO criterion and clean up
The Jaccard ratio was computed as int/int, always producing 0 or 1,
making fractional thresholds ineffective. Cast to double for correct
floating-point division. Also fix typo (cahe->cache), remove debug
comments, and update javadoc.
2026-03-10 02:27:11 +01:00
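The integer-division problem fixed in 838b0a697f is easy to reproduce in isolation. The following is a generic Java illustration with hypothetical names, assuming the counts are plain ints. It is not the DSO criterion code itself.

```java
// Hypothetical illustration of int/int truncation in a Jaccard-style ratio.
final class JaccardRatio {

    /** Buggy variant: the division happens on ints, so the fraction is truncated. */
    static double ratioIntDivision(int intersection, int union) {
        return intersection / union; // e.g. 3 / 10 == 0, widened to 0.0 afterwards
    }

    /** Fixed variant: cast one operand first so the division is floating-point. */
    static double ratio(int intersection, int union) {
        if (union == 0) {
            return 0.0; // empty sets: treated as no overlap in this sketch
        }
        return (double) intersection / union;
    }
}
```

With an intersection of 3 and a union of 10, the first variant yields 0.0 and the second 0.3, so a threshold such as 0.2 can only be met by the second.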
rdk
2de315e9e0 Rename API: PocketCriterium->PocketCriterion, getLigandAtoms->getAtoms, centroid->center
- Rename PocketCriterium to PocketCriterion (fix Latin spelling)
- Revert getLigandAtoms() back to getAtoms() in BindingSite interface
- Rename getCentroidForEval() to getCenterForEval()
- Rename explicitCentroid to explicitCenter in ResidueSite
- Rename SiteCentroidMethod values: explicit_centroid->explicit,
  sas_points_center_of_mass->sas_points_centroid
- Rename site_centroid_method param to site_eval_center_method
- Ligand.getCentroid() now delegates to getCenterForEval()
2026-03-10 02:02:47 +01:00
rdk
412c590dcb Fix CSV spacing consistency: remove padding and trailing spaces
Remove leading-space padding from fmt calls in getMiscStatsCSV and
FeatureImportances, fix header/data spacing mismatch in toPocketsCSV,
and remove trailing space in toLigandsCSV header.
2026-03-09 13:32:51 +01:00
rdk
fdebd71daf Add example Jupyter notebook for analyzing P2Rank output
Add notebook loading _predictions.csv and _residues.csv with example
data from predict_1fbl. Clean up CSV formatting: remove padding from
values, add fmtCsv() without leading spaces for CSV output.
2026-03-09 12:05:00 +01:00
rdk
61b8863c27 Simplify CSV output formatting and add null guard in CsvRow
Remove fixed-width column padding from PredictionSummary, fix spacing
in ResidueLabelings CSV output, and add null safety in CsvRow.add().
2026-03-09 11:17:59 +01:00
rdk
42ad4dfe9f Move centerOfMass and calculateCentroid to PerfUtils to avoid array allocation
Reimplements BioJava's centerOfMass and Atoms.calculateCentroid in
PerfUtils accepting Collection directly, avoiding temporary array
allocation. Adds delegate methods in Struct.
2026-03-09 02:22:48 +01:00
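The idea behind 42ad4dfe9f, computing a centroid straight from a Collection instead of first copying it into an array, can be sketched as below. Coordinates are represented as `double[3]` for self-containment. The class name and signature are hypothetical and do not reflect the PerfUtils API.

```java
import java.util.Collection;

// Hypothetical illustration; P2Rank works on Atom objects rather than double[].
final class CentroidSketch {

    /** Averages x, y, z over the collection in one pass, without an intermediate array copy. */
    static double[] centroid(Collection<double[]> coords) {
        if (coords.isEmpty()) {
            throw new IllegalArgumentException("cannot compute the centroid of no points");
        }
        double sumX = 0.0;
        double sumY = 0.0;
        double sumZ = 0.0;
        for (double[] c : coords) {
            sumX += c[0];
            sumY += c[1];
            sumZ += c[2];
        }
        int n = coords.size();
        return new double[] {sumX / n, sumY / n, sumZ / n};
    }
}
```

The only allocation is the three-element result. The input is iterated in place, which is the point of accepting a Collection directly.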