Parses per-protein <ID>_predictions.txt (semicolon CSV) and resolves
atom_ids against queryProtein.allAtoms by PDB serial. Empty/header-only
files produce 0 pockets gracefully. Prediction is bound to the
caller-supplied queryProtein, avoiding the ConcavityLoader bug class.
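
A minimal sketch of the serial-based resolution, for illustration only. The
column layout (name;score;atom_ids), the whitespace-separated id list, the
skipping of unknown serials, and every class/method name below are
assumptions, not the actual Seq2PocketLoader code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and file layout are assumed, not taken from P2Rank.
final class Seq2PocketSketch {

    record Atom(int pdbSerial, String name) {}
    record Pocket(String name, double score, List<Atom> atoms) {}

    static List<Pocket> parse(List<String> lines, List<Atom> allAtoms) {
        // Index the query protein's atoms by PDB serial once.
        Map<Integer, Atom> bySerial = new HashMap<>();
        for (Atom a : allAtoms) bySerial.put(a.pdbSerial(), a);

        List<Pocket> pockets = new ArrayList<>();
        // Line 0 is the header; empty or header-only input yields no pockets.
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) continue;
            String[] cols = line.split(";", -1);
            List<Atom> atoms = new ArrayList<>();
            for (String id : cols[2].trim().split("\\s+")) {
                if (id.isEmpty()) continue;
                Atom a = bySerial.get(Integer.parseInt(id));
                if (a != null) atoms.add(a); // assumption: unknown serials are skipped
            }
            pockets.add(new Pocket(cols[0].trim(), Double.parseDouble(cols[1].trim()), atoms));
        }
        return pockets;
    }
}
```

Because the returned atoms are the same objects held by the caller's atom
list, the pockets stay bound to the supplied protein rather than to a copy.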
- Dataset.groovy: new case "seq2pocket"
- README.md: list SwinSite and Seq2Pocket in rescoring methods;
cite pocketeer.ds + swinsite.ds in test_data/ examples
- CLAUDE.md: note that distro/README.md is a transient build artifact
- Test fixtures: 5 real predictions under distro/test_data/, plus
unsorted/header-only/path-independence variants under src/test/resources/
- Seq2PocketLoaderTest: 10 tests, all passing
- GenericVector.toList(): replace DefaultGroovyMethods.toList (deprecated in
Groovy 5) with a plain Java loop; drop unused addTo() (no callers)
- Atoms(List<? extends Atom>): @SuppressWarnings("unchecked") for the
intentional wrap-without-copy
- KdNode.splitLeafNode: @SuppressWarnings("unchecked") for casts from
the Object[] backing store
- Drop dead mask_unknown_residues=true from default(_rescore).groovy
(param removed from Params.groovy in 1b7809a6, 2019; configs missed)
- Rewrite distro/models/readme.md to match models on disk (add rescore_2024,
rescore_conservation; remove nonexistent conservation.model)
- Remove broken documentation/rescoring.md link from distro/README.md
- distro/config/readme.md: drop nonexistent working.groovy reference,
fix github link master->develop
- Delete dead commented-out method bodies in PdbUtils, RPlotter,
PredictionVisualizer
- Fix typo in Main.groovy javadoc
Bumps faster-molecular-surface 1.0 -> 1.1, vendored in
lib/local-mvn-repo/. The 1.1 release adds a VdW radius fallback for
elements whose CDK Elements enum entry is null (Co, Ni, Cu, Rh, Os, Ir,
plus radioactive/synthetic elements). Without the fix, cobalamin-bearing
structures crashed surface computation under -cofactors.
PatchedCdkNumericalSurface wraps the default CDK NumericalSurface (used
when -use_optimized_surface 0) with the same fallback, via a Krypton
proxy for null-VdW atoms. Surface.groovy switched over to it. Unit tests
mirror the FMS-side regressions.
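
The fallback idea in miniature, as a hedged sketch. The lookup table here is
a stand-in for the CDK element data, the class and method names are
hypothetical, and the radii are commonly quoted Bondi values used only for
illustration.

```java
import java.util.Map;

// Hypothetical sketch of a null-safe VdW radius lookup with a Krypton proxy.
final class VdwFallbackSketch {

    // Stand-in table (Angstrom); elements missing here model a null VdW entry.
    static final Map<String, Double> KNOWN = Map.of("C", 1.70, "N", 1.55, "O", 1.52);

    // Radius used for elements without an entry (Krypton, illustrative value).
    static final double PROXY_RADIUS = 2.02;

    static double radius(String symbol) {
        Double r = KNOWN.get(symbol);
        return r != null ? r : PROXY_RADIUS;
    }
}
```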
AnalyzeRoutine.cmdCofactors: replace Struct.getHetGroups with
Struct.getLigandGroups (2 call sites) so GDP/GTP/ATP and other groups
that BioJava classifies as NUCLEOTIDE/AMINOACID don't get falsely
reported as "name not in structure" in cofactor_matches.csv or omitted
from het_groups.csv. Mirrors the M1 fix applied earlier to
CofactorHandler.extractCofactorAtoms.
testsets.sh: new cofactors_full() function exercising the cofactor
demo + full datasets in p2rank-datasets2/other/cofactors/ (predict,
analyze cofactors, -aa_mapping composition, visualizations,
export-points). Uses -fail_fast 1 so per-structure errors surface as
test failures rather than silent skips.
The -cofactors flag and dataset cofactors column accept LigandDefinition
specifiers ("FAD", "FAD[atom_id:N]", "FAD[contact_res_ids:A_T259,A_D246]").
Matched HET groups merge into the protein surface (proteinAtoms) and are
excluded from ligand listings; per-item resolution lets a dataset column
override the global Params.cofactors.
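
For orientation, the three specifier shapes quoted above can be split with a
single regex. This is a sketch under assumptions: the grammar is inferred
only from those three examples, and the Spec record and parse method are
hypothetical, not the LigandDefinition implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: splits "NAME" or "NAME[key:v1,v2]" into its parts.
final class SpecifierSketch {

    record Spec(String name, String key, List<String> values) {}

    private static final Pattern SPEC =
        Pattern.compile("([^\\[\\]]+)(?:\\[([a-z_]+):([^\\]]*)\\])?");

    static Spec parse(String text) {
        Matcher m = SPEC.matcher(text.trim());
        if (!m.matches()) throw new IllegalArgumentException("bad specifier: " + text);
        // No bracket part: plain group name such as "FAD".
        if (m.group(2) == null) return new Spec(m.group(1), null, List.of());
        return new Spec(m.group(1), m.group(2), Arrays.asList(m.group(3).split(",")));
    }
}
```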
New: analyze cofactors subcommand (HETATM survey + specifier dry-run),
PyMOL teal-stick visualization (vis_highlight_cofactors), distant-cofactor
and chain-excluded WARN diagnostics, aa_mapping collision WARN (R19),
drop-in safety benchmark with byte-equality on a never-present specifier.
Documentation in documentation/cofactors.md (user-facing) and
documentation/dev/cofactors.md (engineering record with R1-R24 design choices
and post-merge audit fixes). Tests in CofactorHandlerTest,
CofactorIntegrationTest, CofactorPipelineTest, CofactorAnalyzeTest,
DataTableCsvTest plus a Log4jCapture test helper.
Registers `swinsite` as a third-party predictor in Dataset.groovy. The
loader reads grid<N>_score_<float>.mol2 (raw voxel points) per pocket,
parses score from the filename, computes pocket centroid from the grid,
and derives surfaceAtoms via cutoutShell against queryProtein.exposedAtoms
(4.5 -> 10 Å expanding shell), mirroring ConcavityLoader.
Reads grid mol2 instead of pocket mol2: pocket mol2 atoms are standalone
copies with chain reset to 'A' and synthetic residue names, so they break
P2Rank's residue/conservation/ASA feature lookups. Grid + cutoutShell
keeps surfaceAtoms bound to real queryProtein atoms.
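
The filename-score and centroid steps could be sketched as follows. Class and
method names are hypothetical; only the grid<N>_score_<float>.mol2 pattern
comes from the description above, and the cutoutShell step is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: score from file name, centroid from grid points.
final class SwinSiteNameSketch {

    private static final Pattern GRID =
        Pattern.compile("grid(\\d+)_score_([0-9]*\\.?[0-9]+)\\.mol2");

    static double scoreOf(String fileName) {
        Matcher m = GRID.matcher(fileName);
        if (!m.matches()) throw new IllegalArgumentException("unexpected name: " + fileName);
        return Double.parseDouble(m.group(2));
    }

    // Arithmetic mean of the points; callers must pass at least one point.
    static double[] centroid(double[][] points) {
        double[] c = new double[3];
        for (double[] p : points) {
            c[0] += p[0];
            c[1] += p[1];
            c[2] += p[2];
        }
        for (int i = 0; i < 3; i++) c[i] /= points.length;
        return c;
    }
}
```

Sorting pockets by the parsed score rather than by N is what makes an
out-of-order fixture such as 1atlA come out correctly ranked.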
Mol2 parsing is a small inline @<TRIPOS>ATOM scan rather than CDK's
Mol2Reader: CDK has a lazy-init race in AtomTypeFactory that NPEs under
parallel dataset processing.
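
A minimal sketch of such an inline scan, assuming the standard mol2 atom
record layout (atom_id, atom_name, x, y, z, atom_type, ...). Names are
hypothetical and this is not the loader's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect x/y/z from the @<TRIPOS>ATOM section only.
final class Mol2AtomScanSketch {

    static List<double[]> readCoords(List<String> lines) {
        List<double[]> coords = new ArrayList<>();
        boolean inAtoms = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("@<TRIPOS>")) {
                // Any other section header ends the atom block.
                inAtoms = line.equals("@<TRIPOS>ATOM");
                continue;
            }
            if (!inAtoms || line.isEmpty()) continue;
            String[] f = line.split("\\s+");
            if (f.length < 5) continue; // too short to hold coordinates
            coords.add(new double[] {
                Double.parseDouble(f[2]),
                Double.parseDouble(f[3]),
                Double.parseDouble(f[4])
            });
        }
        return coords;
    }
}
```

The scan keeps all state in local variables, so concurrent calls share
nothing, which is the property the parallel dataset run needs.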
Ships swinsite.ds plus 6 protein PDBs (1tjw_A from SwinSite's
test_protein_only example, plus 1a26A/1a2kC/1afkA/1atlA/1bqoB from
coach420) covering 1/2/3/4/6-pocket cases. 1atlA's on-disk N-order is
non-monotonic in score (0.7288, 0.0664, 0.3433), exercising the rerank.
SwinSiteLoaderTest covers all six fixtures plus the
predictionIsBoundToQueryProtein contract and empty-dir tolerance.
The score and pocket columns share the same predict/rescore-only
origin, so describe them together in the prose, the export-points
"not contained" caveat, the predict/rescore output description, and
the "Which command to use?" table.
Add documentation/dev/evaluation-metric-fixes-2.6.md covering DSO/DSWO integer-
division fixes, the ResidueSite DCC centroid fix, and the BioJava GroupType
ligand-detection fix. Mention the ligand-detection change in breaking-changes.md
since it shifts DCA/DCC on datasets containing GDP/GTP/ATP/SHR-like ligands.
The points export (predict/rescore -export_points 1) now includes an
integer 'pocket' column matching newRank in *_predictions.csv, so users
can directly aggregate per-pocket descriptors without a spatial join.
Standalone 'export-points' (no prediction) omits the column.
Pocket-extension shells can overlap, so a single SAS point can sit in
multiple pocket.labeledPoints lists. Previously the assignment loop was
last-write-wins, so shared points received the worst rank, which was
counter-intuitive for both visualization (PredictionVisualizer PDB
output) and descriptor aggregation. PocketRescorer.setNewRanks now
iterates pockets best-first with a guard, so the lowest newRank wins;
the redundant lp.pocket write in PocketPredictor is removed.
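
The best-first assignment with a guard can be sketched like this. The Point
and Pocket classes are simplified stand-ins, and using 0 as the "unassigned"
marker is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: shared points keep the lowest (best) pocket rank.
final class RankAssignSketch {

    static final class Point {
        int pocket = 0; // 0 = not assigned yet (assumed sentinel)
    }

    static final class Pocket {
        final int newRank;
        final List<Point> labeledPoints;

        Pocket(int newRank, List<Point> labeledPoints) {
            this.newRank = newRank;
            this.labeledPoints = labeledPoints;
        }
    }

    static void assign(List<Pocket> pockets) {
        List<Pocket> bestFirst = new ArrayList<>(pockets);
        bestFirst.sort(Comparator.comparingInt((Pocket p) -> p.newRank));
        for (Pocket p : bestFirst) {
            for (Point pt : p.labeledPoints) {
                // Guard: a point already claimed by a better pocket is left alone.
                if (pt.pocket == 0) pt.pocket = p.newRank;
            }
        }
    }
}
```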
TableData gains a per-column ColumnType (DOUBLE default, INT) so
TableExporter emits true integers in CSV (no decimals), Arrow (Int32),
and Parquet (INT32) for the pocket column.
Bump version to 2.6.0-dev.8.
ConcavityLoader.loadPrediction was ignoring its queryProtein parameter
and binding the returned Prediction to a Protein loaded from
*_residue.pdb (a pocket-touching residue subset, not the full protein).
Downstream features keyed on prediction.protein.fileName then resolved
against the wrong basename, most visibly in conservation lookup, which
searched for "<ID>_<submethod>_residue_<chain>.hom" instead of
"<ID>_<chain>.hom" and silently produced zero conservation features.
Other feature extractors were similarly reading the truncated atom set.
The residue subset is still loaded and used to define the per-pocket
surface-atom shell (no behaviour change there), but the Prediction is
now bound to queryProtein, matching FPocketLoader and PUResNetLoader.
Add ConcavityLoaderTest plus a matching test in FPocketLoaderTest that
assert the loader-contract invariant prediction.protein === queryProtein.
PUResNet pocket PDBs occasionally left-shift the residue insertion code
into column 26 instead of column 27, breaking BioJava's strict resSeq
parser with NumberFormatException and silently dropping affected
predictions (216 of 9955 entries on holo4k+pdbbind2020).
Add PUResNetPdbRepair which detects the malformed pattern and rewrites
it in memory before parsing. Wire PUResNetLoader through it. PdbUtils
and the rest of the load path are unchanged.
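
One plausible shape for the in-memory repair, offered as a sketch only. It
assumes the malformed pattern is a letter in column 26 with a blank column
27 and at most three resSeq digits before it; the real detection rule in
PUResNetPdbRepair may differ, and the names below are hypothetical.

```java
// Hypothetical sketch: move a left-shifted insertion code back to column 27.
final class InsertionCodeRepairSketch {

    static String repair(String line) {
        if (line.length() < 27) return line;
        if (!(line.startsWith("ATOM") || line.startsWith("HETATM"))) return line;

        // PDB columns are 1-based; column 26 is index 25, column 27 is index 26.
        char col26 = line.charAt(25);
        char col27 = line.charAt(26);
        String digits = line.substring(22, 25).trim();

        boolean malformed = Character.isLetter(col26)
                && col27 == ' '
                && digits.matches("-?\\d+");
        if (!malformed) return line;

        // Right-justify resSeq in columns 23-26, then the insertion code in 27.
        String fixed = String.format("%4s", digits) + col26;
        return line.substring(0, 22) + fixed + line.substring(27);
    }
}
```

Well-formed lines pass through unchanged, so the repair can be applied to
every record without a separate detection pass.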
- Replace manual line.split(",") with Apache Commons CSV (column-name access)
- Support both reduced (9-col) and full (59-col) ahoj_ubs CSV formats
- Add AhojSiteInfo: typed data class for 14 pocket metadata fields
- Add secondaryData map to ResidueSite for extensible metadata
- Export AhojSiteInfo columns in observed_sites.csv when available
- Add comprehensive parser tests for both CSV formats
- Add test data files and format documentation
Protein.sites now holds ground-truth binding sites for both ligand-defined
and explicit (residue-based) evaluation modes. Sites are populated from
ligands via populateSitesFromLigands() when no explicit sites are defined.
- Add predictedPocket and setSasPoints to BindingSite interface
- Add predictedPocket field to ResidueSite
- Rename assignPocketsToLigands to assignPocketsToSites (works on BindingSite)
- Update calcCoveragesProt to use BindingSite.predictedPocket
- Determine isLigandMode via instanceof instead of sites.isEmpty()
- Unify PymolRenderer sites/ligands branch into single BindingSite loop
- Simplify AnalyzeRoutine.cmdBindingSiteCenters to use p.sites directly
- Rename SiteCentroidMethod to SiteCenterMethod
- Extract getCenterForMethod(SiteCenterMethod) into BindingSite interface
for thread-safe, param-independent center calculation
- Refactor Ligand/ResidueSite getCenterForEval() to delegate to getCenterForMethod()
- Add analyze binding-site-centers command comparing all center methods per site
- Add Dataset.Result.writeErrorsAndGetSummary() and use it across all
AnalyzeRoutine commands for consistent error reporting to both console and CSV
BioJava assigns GroupType based on its Chemical Component Dictionary,
not structural role. Ligands in non-polymer chains can get any GroupType:
- GDP, GTP, ATP -> GroupType.NUCLEOTIDE
- SHR and similar -> GroupType.AMINOACID
- Most others -> GroupType.HETATM
Previously only HETATM groups were detected as ligands, causing errors
like "Ligand definition 'GDP' matches no ligands" for nucleotide and
amino acid derivative ligands.
Fix: any non-water group in a NONPOLYMER chain is now a ligand
candidate, regardless of GroupType. Polymer chain groups (protein AA,
DNA/RNA) are only included if they have GroupType.HETATM.
Add test PDB files (1a2kC.pdb with GDP, 1e5qA.pdb with SHR) and
comprehensive tests for all three GroupType cases.
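
The selection rule, reduced to a truth function. The enums are simplified
local stand-ins rather than the BioJava types, and treating water as excluded
in polymer chains too is an assumption of this sketch.

```java
// Hypothetical sketch of the ligand-candidate rule described above.
final class LigandCandidateSketch {

    enum GroupType { AMINOACID, NUCLEOTIDE, HETATM }
    enum ChainKind { POLYMER, NONPOLYMER }

    static boolean isLigandCandidate(ChainKind chain, GroupType type, boolean isWater) {
        if (isWater) return false;
        // Non-polymer chains: any GroupType qualifies (GDP, SHR, ...).
        if (chain == ChainKind.NONPOLYMER) return true;
        // Polymer chains: only HETATM groups, never regular residues.
        return type == GroupType.HETATM;
    }
}
```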
The Jaccard ratio was computed as int/int, always producing 0 or 1,
making fractional thresholds ineffective. Cast to double for correct
floating-point division. Also fix typo (cahe->cache), remove debug
comments, and update javadoc.
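
The bug class in isolation, as a small sketch with hypothetical method names.
The integer division truncates before the result is widened to double, so
the cast has to be applied to an operand, not to the quotient.

```java
// Hypothetical sketch contrasting integer and floating-point ratio.
final class JaccardSketch {

    // Buggy: int / int truncates, so any partial overlap collapses to 0.
    static double buggy(int intersection, int union) {
        return intersection / union;
    }

    // Fixed: casting one operand forces floating-point division.
    static double fixed(int intersection, int union) {
        return (double) intersection / union;
    }
}
```

With intersection 1 and union 2, the buggy form yields 0.0 while the fixed
form yields 0.5, which is why fractional thresholds had no effect before.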
- Rename PocketCriterium to PocketCriterion (fix spelling)
- Revert getLigandAtoms() back to getAtoms() in BindingSite interface
- Rename getCentroidForEval() to getCenterForEval()
- Rename explicitCentroid to explicitCenter in ResidueSite
- Rename SiteCentroidMethod values: explicit_centroid->explicit,
sas_points_center_of_mass->sas_points_centroid
- Rename site_centroid_method param to site_eval_center_method
- Ligand.getCentroid() now delegates to getCenterForEval()
Remove leading-space padding from fmt calls in getMiscStatsCSV and
FeatureImportances, fix header/data spacing mismatch in toPocketsCSV,
and remove trailing space in toLigandsCSV header.
Add notebook loading _predictions.csv and _residues.csv with example
data from predict_1fbl. Clean up CSV formatting: remove padding from
values, add fmtCsv() without leading spaces for CSV output.