Commit Graph

1880 Commits

Author SHA1 Message Date
rdk
c9ad8f71ff Add vis_site_centers param for rendering site/pocket centroids in PyMOL
- New vis_site_centers param (default false) renders centroids as hotpink
  pseudoatom spheres in both old (PymolRenderer) and new (NewPymolRenderer)
- Pass site centroids via RenderingModel.siteCentroids for analyze command
- Old renderer shows predicted pocket centroids and ligand centroids
- Fix empty visualizations/ dir in eval-predict: create vis dir under
  predDir instead of top-level outdir
2026-03-04 02:47:43 +01:00
rdk
d5715d9797 Fix PyMOL renderer: bulk selections, CIF-to-PDB conversion, site-based labeling
- Use bulk atom ID selections instead of per-residue named selections to
  avoid exceeding PyMOL's object limit on large proteins
- Convert CIF inputs to PDB format with correct .pdb extension (PyMOL
  can't reliably parse BioJava CIF and uses extension to pick parser)
- Rename PyMOL object from "protein" to "prot" to avoid reserved keyword
- Fix null interpolation in PML when no ligands or no labeling
- Build BinaryLabeling from explicit site residues for visualization
  (item.binaryLabeling doesn't support site-based datasets)
2026-03-04 01:13:48 +01:00
rdk
026be7eae5 Improve analyze binding-sites: visualizations, site radius, eager loading
- Add PyMOL visualizations using dataset.binaryResidueLabeler
- Add site_radius column (max distance from centroid to any site atom)
- Add excludeFromSummary param to DataTable.formatSummaryTable to skip
  center coordinates from numeric summary stats
- Load ExplicitSitesIndex eagerly during dataset loading (fail-fast)
- Skip CSV rows with empty residue/coordinate fields in AhojUbsSiteParser
- Write items without binding sites to separate file in outdir
2026-03-03 21:58:55 +01:00
rdk
9e9a500836 Bump version to 2.6.0-dev.4 2026-03-03 15:00:46 +01:00
rdk
c9fef83950 Use AtomKdTree interface in Atoms and minor cleanups
Switch Atoms.kdTree field and buildKdTree() to use the AtomKdTree
interface instead of AtomKdTreeV1 directly. Add @NonNull to iterator(),
improve initial capacity estimates, and fix whitespace.
2026-03-03 15:00:36 +01:00
rdk
997727e878 Add explicit sites loading and analyze binding-sites command
Implement ExplicitSitesIndex for loading binding site definitions from
external CSV files (pluggable format system, first format: ahoj_ubs).
Sites are resolved during item loading via DatasetItemLoader.

Add 'analyze binding-sites' sub-command producing unified CSV and summary
stats for both ligand-based and explicit site datasets, with unresolved
residue/site tracking for explicit datasets.

Remove unused SiteLoader stub.
2026-03-03 15:00:31 +01:00
rdk
8f5da9fdcd Add fused addWeighted and O(N²) single-linkage clusterer
Add GenericVector.addWeighted() for fused multiply-add, eliminating per-
neighbor array allocation in feature vector aggregation. Add SLinkClustererV2
using union-find with path compression, reducing single-linkage clustering
from O(N³) to O(N²). Wire V2 via factory methods on AtomClusterer and
AtomGroupClusterer.
2026-03-03 05:13:05 +01:00
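The union-find idea named in the commit above can be illustrated with a minimal sketch. It assumes 3D points and a fixed distance cutoff, so that single-linkage clusters are the connected components of the "closer than cutoff" graph. The names `SingleLinkageSketch`, `cluster`, and `find` are hypothetical and are not the project's SLinkClustererV2 API. The sketch also adds union by size, which the commit message does not mention.

```java
import java.util.Arrays;

class SingleLinkageSketch {

    // Find the set representative; path halving is a simple form of path compression.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Returns one label per point. Points linked directly or transitively
    // by a distance <= cutoff receive the same label.
    static int[] cluster(double[][] pts, double cutoff) {
        int n = pts.length;
        int[] parent = new int[n];
        int[] size = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
            size[i] = 1;
        }
        double sqrCutoff = cutoff * cutoff;

        // One pass over all pairs: O(N^2) distance checks, each followed
        // by a near-constant-time union.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = pts[i][0] - pts[j][0];
                double dy = pts[i][1] - pts[j][1];
                double dz = pts[i][2] - pts[j][2];
                if (dx * dx + dy * dy + dz * dz <= sqrCutoff) {
                    int ri = find(parent, i);
                    int rj = find(parent, j);
                    if (ri == rj) continue;
                    // Union by size: attach the smaller tree under the larger one.
                    if (size[ri] < size[rj]) {
                        int tmp = ri;
                        ri = rj;
                        rj = tmp;
                    }
                    parent[rj] = ri;
                    size[ri] += size[rj];
                }
            }
        }

        int[] labels = new int[n];
        for (int i = 0; i < n; i++) labels[i] = find(parent, i);
        return labels;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0, 0}, {1, 0, 0}, {2, 0, 0}, {10, 0, 0} };
        System.out.println(Arrays.toString(cluster(pts, 1.5))); // prints [0, 0, 0, 3]
    }
}
```

In the usage example the first three points chain together through neighbors 1.0 apart, while the fourth point stays in its own cluster.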
rdk
261dae09c9 Rename consolidate() to sparsify() and add surface_sparsify param
Rename Atoms.consolidate() to Atoms.sparsify() for clarity. Use mutable
V1 KdTree for O(N log N) incremental insertion instead of periodic
rebuilds. Add surface_sparsify runtime param (default true) to allow
disabling surface point sparsification. Hardcode AtomKdTreeV1 in
Atoms.buildKdTree() and delegate Dataset cache clearing to item methods.
2026-03-03 04:01:54 +01:00
rdk
a66a973e1c Refactor KdTree into AtomKdTree interface with V1/V2 implementations
Rewrite AtomKdTreeV1 from Groovy to Java to eliminate Groovy IndyInterface
monitor contention that serialized 16 parallel threads down to ~2.
Move V1 KdTree into v1/ subpackage, extract AtomKdTree as a Java interface
with factory method dispatching by kdtree_implementation param, and rename
the old v2 wrapper to AtomKdTreeV2 implementing the same interface.
2026-03-03 00:17:38 +01:00
rdk
6d47285116 Add kdtree_implementation param and fix quickselect duplicate-key hang
Add runtime parameter to switch between KdTree3D (default) and v1
AtomKdTree. Fix O(N²) quickselect degeneration on duplicate coordinates
by adding post-partition equal-range scan.
2026-03-02 22:20:59 +01:00
rdk
24b9f5f709 Optimize KdTree3D build: bottom-up bounds, eliminate redundant traversals
- Bounding boxes computed bottom-up from leaf scans instead of scanning
  full data range at every tree level (O(N) vs O(N log N))
- Approximate parent bounds passed down for split-axis selection (O(1)
  per node instead of O(range) scan)
- Remove findNodeCount() and dead code; buildNode returns max index
- Resolve split-axis array once in quickselect inner loop
2026-03-02 21:15:41 +01:00
rdk
76026b9297 Refactor Dataset item cache clearing and fix processItem typo
Rename processssItem to processItem. Add per-item conditional cache
clearing after processing to reduce peak memory. Refactor cleanCaches
into clearCache/clearPrimaryCache/clearSecondaryCache with null-safety.
2026-03-02 20:52:07 +01:00
rdk
7f4d37b5c4 Add comparative benchmark test for v1 vs v2 KdTree
Parametrized test generates random points, builds both trees, verifies
identical results for all query types, and measures relative performance.
Skipped during normal test runs; invoked via kdtree-benchmark.sh script.
2026-03-02 20:52:05 +01:00
rdk
6cce0eb016 Rewrite KdTree as immutable, hardcoded 3D implementation in v2 package
New KdTree3D.java uses SoA storage, linearized implicit-heap layout,
balanced quickselect build, and stack-based traversal. Immutable design
eliminates mutable node state, enabling thread-safe concurrent queries.

AtomKdTree.groovy provides drop-in API wrapper. Atoms.java switched to
v2 with invalidate-on-add pattern and periodic-rebuild consolidate().
2026-03-02 20:13:47 +01:00
rdk
5d9ec9eb58 Fix bugs and add error reporting to analyze subcommands
- Fix integer division in BinCounter.getPosRatio() (long/long → double)
- Fix broken NaN check in ConservationCloudFeature (== → Double.isNaN)
- Fix wrong variable in apo_protein error message (proteinFile → apoProteinFile)
- Fix outerLater typo → outerLayer in Atoms.SphereLayers and usages
- Fix xenegy_cloud2_layered typo → xenergy_cloud2_layered in Params and usages
- Add error reporting (writeErrorCsvs) to all analyze subcommands
- Add ignoreLigandsSwitch to doCmdFasta (doesn't need ligands)
2026-03-02 09:59:36 +01:00
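Two of the fixes listed in the commit above are classic Java pitfalls. The sketch below reproduces them in isolation; the class and method names are hypothetical and do not mirror the actual BinCounter or ConservationCloudFeature code.

```java
class AnalyzeBugSketch {

    // Buggy form: long / long truncates toward zero before the result
    // is widened to double.
    static double posRatioTruncated(long positives, long total) {
        return positives / total;
    }

    // Fixed form: casting one operand makes the division floating-point.
    static double posRatio(long positives, long total) {
        return (double) positives / total;
    }

    // Buggy form: NaN never compares equal to anything, including itself,
    // so this check is always false.
    static boolean isNaNBroken(double x) {
        return x == Double.NaN;
    }

    // Fixed form: use the dedicated library check.
    static boolean isNaNFixed(double x) {
        return Double.isNaN(x);
    }

    public static void main(String[] args) {
        System.out.println(posRatioTruncated(1, 4)); // prints 0.0
        System.out.println(posRatio(1, 4));          // prints 0.25
        System.out.println(isNaNBroken(Double.NaN)); // prints false
        System.out.println(isNaNFixed(Double.NaN));  // prints true
    }
}
```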
rdk
5a38f8f1de Avoid unnecessary allocations in hot paths
Cache aggregated errors in Dataset.Result to avoid recomputing.
Use direct x/y/z field access instead of getCoords() in
Atoms.copyPoints and PointExportData to avoid double[3] allocations.
2026-03-02 04:47:14 +01:00
rdk
1bbdcbc196 Split dataset by protein chain presence in analyze proteins command
Add Dataset.Item.getRow() to reconstruct dataset row strings.
In cmdProteins(), collect items into with/without protein chains
using ConcurrentLinkedQueue and write split .ds files when any
structures lack protein chains.
2026-03-02 02:23:30 +01:00
rdk
22a7dec4bc Bump version to 2.6.0-dev.3 and update xz dependency to 1.12 2026-03-01 21:04:37 +01:00
rdk
3a8e985eb4 Skip ligand loading in parse-proteins command 2026-03-01 20:48:00 +01:00
rdk
582d5ebf1f Optimize distance calculations to avoid getCoords() array allocations
Add Atom-based sqrDist/dist overloads in PerfUtils that use
getX/getY/getZ directly instead of allocating double[] via getCoords().
Refactor Point to store x/y/z as individual fields instead of a
double[] array. Fix Point.setCoords() which was previously a no-op.
Pre-build KD tree in Ligands.makeLigands() before the ligand loop.
Simplify KD tree usage in Atoms.dist/sqrDist by removing redundant
size threshold check.
2026-03-01 20:47:56 +01:00
rdk
4240d9e5c8 Add aggregated error reporting and writeErrorCsvs convenience method
Add writeAggregatedItemErrorsToCsv to Dataset.Result that groups errors
by message and outputs count/error sorted by frequency. Update
getErrorSummary to display an aggregated error table instead of just
the count. Add writeErrorCsvs(outdir) that writes all three error files
(per-item, aggregated, full stack traces) and use it in AnalyzeRoutine.
2026-03-01 19:28:12 +01:00
rdk
a8ab7e97a2 Add analyze proteins and parse-proteins commands with DataTable utility
Add 'analyze proteins' command that outputs per-protein stats CSV
(chain counts, residues, atoms, ligands, peptides) and a summary table
with min/max/avg/median. Add 'analyze parse-proteins' for parsing
all dataset items and reporting errors only.

Introduce DataTable — a lock-free, pre-registered-column table for
structured data collection across threads, with CSV and summary output.
2026-03-01 18:17:07 +01:00
rdk
6434a097f8 Clean up unused imports and sort import order across codebase 2026-02-26 03:46:34 +01:00
rdk
bab04a2a5e Avoid duplicate console output: skip stdout write when log_to_console is enabled 2026-02-26 01:22:27 +01:00
rdk
e923d199e6 Add external conservation provider with cache, health check, and documentation 2026-02-26 00:07:55 +01:00
rdk
34a742cd1b Add tests for ResidueSite and site-based evaluation 2026-02-26 00:07:55 +01:00
rdk
347d4e38d6 Implement site-metrics criteria and evaluation
- Add ResidueSite and SiteLoader
- Update pocket criteria (DCA, DCC, DPA, DSA, DSO, DSWO) for site evaluation
- Extend Evaluation with site-metrics support
- Bump version to 2.6.0-dev.1
2026-02-26 00:07:55 +01:00
rdk
bfdc87f55b replace ArrayList<PPred> with PredictedScores parallel-array structure
Memory optimization: per-prediction cost reduced from ~40 bytes (PPred object)
to ~9 bytes (parallel double[] + boolean[] arrays). For large datasets this
reduces prediction storage by ~77%.

PredictedScores provides: ArrayList-style growth, bulk addAll via arraycopy,
cached observedPositiveCount, stable descending merge sort (required for
reproducible metrics with tied RF scores), and direct backing array access
for hot loops in Metrics/Curves.
2026-02-26 00:07:55 +01:00
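The parallel-array layout described in the commit above can be sketched as follows. This is an illustration only: `ParallelScoresSketch` and its methods are hypothetical names, not the real PredictedScores API, and the sorting and bulk-copy features are omitted. Each entry costs one `double` (8 bytes) plus one `boolean` array slot (typically 1 byte on HotSpot), which is consistent with the ~9 bytes quoted in the commit.

```java
import java.util.Arrays;

class ParallelScoresSketch {
    // Two parallel arrays replace a list of per-prediction objects.
    private double[] scores = new double[16];
    private boolean[] observed = new boolean[16];
    private int size = 0;
    private int observedPositiveCount = 0; // maintained incrementally

    void add(double score, boolean isPositive) {
        if (size == scores.length) {
            // ArrayList-style growth: double both arrays together.
            int newCapacity = scores.length * 2;
            scores = Arrays.copyOf(scores, newCapacity);
            observed = Arrays.copyOf(observed, newCapacity);
        }
        scores[size] = score;
        observed[size] = isPositive;
        if (isPositive) observedPositiveCount++;
        size++;
    }

    int size() { return size; }

    int observedPositiveCount() { return observedPositiveCount; }

    double score(int i) { return scores[i]; }

    boolean observed(int i) { return observed[i]; }

    public static void main(String[] args) {
        ParallelScoresSketch s = new ParallelScoresSketch();
        for (int i = 0; i < 20; i++) {
            s.add(i / 20.0, i % 2 == 0); // every second entry is a positive
        }
        System.out.println(s.size());                  // prints 20
        System.out.println(s.observedPositiveCount()); // prints 10
    }
}
```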
rdk
65fc8f3676 update FasterForest to 2.10.2, add Weka RandomForest conversion support 2026-02-26 00:07:55 +01:00
rdk
5ec88309ef update FasterForest to 2.10.1 2026-02-26 00:07:55 +01:00
rdk
1f19bdd2a4 fix ModelConverterTest failing on macOS CI: skip NativePanamaFloat forest types when native library unavailable 2026-02-26 00:07:52 +01:00
rdk
2bf6bfa270 update FasterForest to 2.10.0, bump version to 2.5.2-dev.11 2026-02-23 02:18:41 +01:00
rdk
9fcce6156f add UseCompactObjectHeaders note to local-env.sh template 2026-02-23 02:11:29 +01:00
rdk
57fb214881 update local-env.sh template with throughput-oriented JVM options 2026-02-23 01:17:32 +01:00
rdk
40c7638bc2 implement ModelConverterTest with comprehensive forest conversion tests 2026-02-23 00:54:11 +01:00
rdk
b1a05d3097 bump version to 2.5.2-dev.10 2026-02-23 00:39:54 +01:00
rdk
f3fc9329bc update FasterForest to 2.9.1, bump JUnit Jupiter to 6.0.3, and add NativePanama flattened eval tests 2026-02-23 00:35:18 +01:00
rdk
4aaf212b9b update FasterForest to 2.8.1 with NativePanama support and improve eval time tracking
- Add NativePanamaForest/NativePanamaForestAvx2 availability checks in ModelConverter
- Refactor flattening logic to separate trainable forest preparation from conversion
- Track all eval times and compute average excluding first run (caching warmup)
- Rename TIME_M to TIME_TRAINEVAL_AVG_M, add TIME_EVAL_AVG_M stat
2026-02-22 21:11:16 +01:00
rdk
b5a8edc377 track last evaluation time in EvalResults for seed loop benchmarks 2026-02-22 17:44:56 +01:00
rdk
3ad261645c update FasterForest to 2.8.0 and support flattening of FlatBinaryForest models 2026-02-22 17:33:27 +01:00
rdk
8f7d71ffb3 update FasterForest to 2.7.0 2026-02-20 12:58:14 +01:00
rdk
b8f802b145 refactor model flattening to use FasterForestConverter API with configurable target types
Generalize Model classifier from Classifier to Object to support both
trainable classifiers and flat BinaryForest models. Add rf_flatten_target
parameter for selecting forest type (FlatBinaryForest, LegacyFlatBinaryForest,
InterleavedBfsForest, etc). Deprecate rf_flatten_as_legacy in favor of the
new target type selection.
2026-02-16 01:00:55 +01:00
rdk
de75ac6be1 upgrade FasterForest to 2.6.0 and fix GString compilation errors
Replace flat jar with local Maven repo dependency at correct path
(groupId/artifactId/version/). Fix GString-to-String type errors in
AnalyzeRoutine that broke compilation with @CompileStatic.
2026-02-14 07:35:29 +01:00
rdk
27caa5fe46 sort CSV output rows as strings in analyze commands (chains, chains-residues, labeled-residues) 2026-02-14 06:04:13 +01:00
rdk
93fd8e953a add experimental rescoring model section to rescoring docs 2026-02-11 18:44:04 +01:00
rdk
ad946de45e rephrase Requirements section in README 2026-02-11 18:31:13 +01:00
rdk
9711cc7192 fix aa-mapping docs: broken csv link, replace special characters, cleanup 2026-02-11 18:14:03 +01:00
rdk
4a42f664e2 update aa-mapping documentation: add links to pdbfixer source 2026-02-11 18:05:11 +01:00
rdk
652442a8d2 add JVM compatibility flags to run scripts, document all flags 2026-02-11 15:27:39 +01:00
rdk
e9f530ce37 make --sun-misc-unsafe-memory-access conditional on Java 23+ 2026-02-11 15:24:09 +01:00