16 Commits

Author SHA1 Message Date
Yakov Pechersky
c6cabf4153 Speed-up tautomer canonicalization, no API changes (#9134)
* Speed up tautomer canonicalization by deferring on SSSR calc

* Lazy kekulization for tautomer enumeration

Defer kekulization of tautomers until they are actually needed for
transform matching. This avoids creating kekulized copies for:
1. The initial tautomer (until first iteration)
2. New tautomers that may never be processed (if enumeration ends early)

The Tautomer class now supports lazy initialization of the kekulized
form via getKekulized() method.

Performance improvement: ~7% additional speedup (total ~22-24% from baseline)

* Use count-only substructure matching in tautomer scoring

* Add SubstructMatchCount regression test

* MolStandardize: reduce enumerate overhead

* MolStandardize: avoid per-tautomer ring recomputation

* Atom: cache PeriodicTable pointer in valence calcs

* Atom: reuse PeriodicTable in getEffectiveAtomicNum

* PeriodicTable: add atomic fast path for getTable

* GraphMol: reduce ROMol copy reallocations

* MolStandardize: use quickCopy for per-match product copies

Use RWMol(*kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry.

* MolStandardize: pre-filter scoring patterns by element/connectivity

For tautomer scoring, pre-compute which SubstructTerms are relevant for
a given input molecule. Since tautomerization only moves H atoms and
changes bond orders (never creates/destroys heavy-atom bonds), patterns
requiring missing elements or connectivity can be skipped for all
tautomers of that molecule.

Two-stage filtering:
1. Element check: skip patterns requiring atoms not in the molecule
2. Connectivity check: skip patterns whose bond-order-agnostic structure
   doesn't match the input molecule's connectivity

This reduces the number of VF2 substructure calls per tautomer from 12
to typically 3-5, depending on the molecule's composition.

* MolStandardize: preserve molecule properties for canonical tautomer

Copy molecule properties from the original input to the canonical tautomer
result. Since quickCopy during enumeration skips d_props to avoid overhead,
extended SMILES data like link nodes (LN) was lost. This restores them
on the final result.

* TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers

TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses
quickCopy for performance. This doesn't copy molecule properties like
_molLinkNodes. Without this fix, XQMol output would lose link node
extensions in the SMILES.

Copy properties from the original query molecule to all enumerated
tautomers before constructing the TautomerQuery. This preserves extended
SMILES data without impacting enumeration performance.

* MolStandardize: use parallel iteration and cache bond lookups

Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration
over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond
lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches.

* perf: add specialized matchers for simple tautomer scoring patterns

Replace VF2 graph matching with O(n) loops for 6 simple patterns:
- countDoubleOrAromaticBonds: C=O, N=O, P=O patterns
- countMethyls: [CX4H3] methyl groups
- countCarbonDoubleHetero: [C]=[/home/dcvuser/rdkit;Code/GraphMol/MolStandardize/Tautomer.h] aliphatic C=hetero
- countAromaticCarbonExocyclicN: [c]=aromatic C=exocyclic N
Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2.
Combined with the pre-filtering optimization, this achieves ~3.7x speedup
(~2500ms vs ~9300ms original) for tautomer canonicalization.

* Fix tautomer canonicalize dropping conformers from quickCopy

quickCopy (RWMol(*mol, true)) skips conformers, so tautomer
enumeration products lose 2D/3D coordinates. This causes InChI
generation to omit the /b (double bond E/Z stereo) layer, since
E/Z is derived from atomic coordinates.

Fix: copy conformers from the original molecule onto the canonical
tautomer after pickCanonical in TautomerEnumerator::canonicalize().

Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based
conformer preservation check in catch_tests.cpp.

* add test on canonicalize losing stereo

* add regression test for exocyclic C=C tautomer canonicalization

The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely
deduplicate distinct tautomers when their atom-index-ordered state
patterns happen to match, leading canonicalize() to pick the wrong
canonical form for molecules with STEREOTRANS-pinned exocyclic C=C
bonds after RemoveHs.

Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the
exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form
O=C1C=C(C=C2CC=COC2)C(=O)N1.

Currently expected to FAIL until the state key dedup bug is fixed.

* MolStandardize: expand tautomer connectivity SMARTS

* MolStandardize: scope tautomer pattern enum

* MolStandardize: trim tautomer pattern enum

* MolStandardize: use symmetric ring scoring
2026-03-31 06:42:40 +02:00
Greg Landrum
a9477d2694 Modernization of some substructure code (#8450)
* use std::span for substruct match callbacks

This removes a copy from every evaluation of potential matches

* some cleanup/modernization

* some modernization

* deprecate chiralAtomCompat

* small optimization

* remove naked pointers

* improve new_timings.py script

* changes suggested in review

* response to review

* response to review
2025-05-12 06:33:25 +02:00
Ricardo Rodriguez
35cb7809f2 Fixes #8492 (#8493) 2025-05-08 08:05:13 +02:00
Greg Landrum
b797137341 Make TautomerQuery serializable (#5248)
* add serialization to the TautomerQuery

* add to python wrapper

* changes in response to review

* put TautomerQueryCanSerialize in Python wrappers

* include serialization build dependency
2022-05-04 14:31:50 +02:00
Eisuke Kawashima
27f711a658 Run clang-tidy (readability-braces-around-statements) (#4977)
https://github.com/rdkit/rdkit/pull/3024#discussion_r526549843
2022-03-10 08:00:10 +01:00
Paolo Tosco
29ebf45f17 remove dead code (#4739)
Co-authored-by: Tosco, Paolo <paolo.tosco@novartis.com>
2021-11-29 08:00:21 +01:00
Ric
878c4c7ec0 save one search (#4566) 2021-09-28 04:37:24 +02:00
Ric
6db202aa0d Improve performance of removing substruct/tautomer duplicates (#4560)
* improve removeDuplicates performance

* improve removeTautomerDuplicates performance

* use std::set
2021-09-25 15:45:55 +02:00
Eisuke Kawashima
48f4f3ee82 Run clang-tidy (modernize-pass-by-value) (#4224) 2021-06-14 06:57:08 +02:00
Greg Landrum
f5a54af475 A collection of MolStandardize improvements (#4118)
* Swap to using a data structure for default normalization parameters

* bring the default fragment data into the code too

* cleanup

* add reionizer parameters via data

change fragment parse failures to ValueErrorExceptions

* tautomer parameters in the code

* got a little over-enthusiastic in that last cleanup

* use boost::flyweight to cache normalization and charge data params

* a bit more cleanup

* support reading params from JSON

* fragments from JSON
single-call for fragment removal

* add a one-liner for the canonical tautomer

* quick refactor

* Fixes #4115

* complete the parents

* docs

* move the definitions to a namespace and make them const

* see if switching to c++14 fixes the CI compile problems with g++ 5.5

* somewhat uglier way of solving the initalizer list problem
2021-05-19 09:11:23 +02:00
Paolo Tosco
106e9f7c37 Normalize line endings in source code files (#4104)
* set all source code files to have native line endings

* normalized all source code line endings

Co-authored-by: Paolo Tosco <paolo.tosco@novartis.com>
2021-05-13 14:31:39 +02:00
Brian Kelley
c8aa10c80f Add tautomer query to the substructlibrary (#3808)
* Fixes #3797

* [WIP] Add tautomer queries to the substruct library

* Add TautomerQuery to CMake

* Add missing TautomerQuery functions, python wrapper and tests

* Add python wrappers for Substruct Library Tautomer Queries

* Explictly label non-const pattern function now that we have both

* Use boost::shared_ptr not std::shared_ptr

* Fix java builds

* One more try to fix java builds

* Fix Java Tests

* Run clang format

* Reenable tests

* Fix annoyingly stupid bug and annoying commit of debug code

* Fix documentation

* reenable ifdef threadsafe check

* Throw warning and perform tautomer search instead of bailing with incorrect fingerprints

* Simplfy api with templates

* Fix SubstructLibrary java issues

* minor API cleanup

* simplify the SWIG wrappers

Co-authored-by: Brian Kelley <bkelley@relaytx.com>
Co-authored-by: greg landrum <greg.landrum@gmail.com>
2021-03-05 04:56:20 +01:00
Brian Kelley
d9033e4626 Fixes #3821 copy constructor by making the template molecule a shared… (#3822)
* Fixes #3821 copy constructor by making the template molecule a shared pointer

* Pushed a commit by accident, reverting

* Copy constructor now does a deep copy

* Add operator= test and ensure deep copies of template

* Update Code/GraphMol/TautomerQuery/catch_tests.cpp

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Remove extraneous .get()'s

* Add better testing names for catch test

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2021-02-23 07:41:59 +01:00
Paolo Tosco
527a7adf99 Some work on TautomerEnumerator (#3327)
* - Added a TautomerEnumerator constructor which allows passing CleanupParameters
- Added three configurable parameters to CleanupParameters
- Added a callback to TautomerEnumerator
- Fixed a bug where the same tautomer could be mapped by both isomeric and non-isomeric SMILES
- TautomerEnumerator::enumerate() now returns a TautomerEnumeratorResult and does not take
  dynamic_bitset pointers as optional parameters
- Added a missing transform from the Sitzmann paper
- General code cleanup and optimization

* - TautomerEnumeratorResult is now iterable in both C++ and Python
- further optimizations
- implemented a TautomerEnumerator.PickCanonical() Python wrapper
- added C++ and Python accessors to SMILES and SmilesTautomerMap

* - make sure the number of tautomers reported by rdLogger is correct and definitive

* make sure that if N maxTautomers are requested, N tautomers are returned if the theoretical number of tautomers is M>N

* avoid that sulfonic acids hit the formamidinesulfinic acid tautomerisation rule

* offer an option to allow the old API to still be used.

* Changes in response to review and following discussion with Gareth and Greg

* - made TautomerEnumeratorResult an enum class (was a plain C enum)
- made TautomerEnumeratorResult::const_iterator a bidirectional_iterator
- added tests to fully probe the TautomerEnumeratorResult::const_iterator functionality

* - change the difference_type definition
- added tests for the above

* - cosmetic change to improve code readability

Co-authored-by: greg landrum <greg.landrum@gmail.com>
2020-09-24 17:00:03 -04:00
jones-gareth
9a864f4238 Sgroup (#3390)
* Changes to use SubstanceGroups in Java

* Forgot to add SWIG file

* Java test for SubstanceGroup wrappers

* Added RDKit boilerplate
2020-09-09 04:59:08 +02:00
jones-gareth
21a8a263bd Tautomer search (#3205)
* TautomerQuery class

* working test

* Comment header

* Merge with master. Greg's suggestions. More tests. Python wrapper

* Updated Pattern Fingerprints to merge with master. Reset email

* Java/C# wrappers. Java test

* Java/C# wrappers. Java test

* Java/C# wrappers. Java test

* Greg suggestions of 6_2_2020

* Explicit types in Java TautomerQueryTests class

* Update Code/GraphMol/QueryOps.h

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* get windows dll builds working

* Removed tautomer query wrappers from RDKit namespace

* Fixes from evaluation

* Template molecule identification fix. Greg's suggestion

* Final check search functor for evaluating template matches as they are found

Co-authored-by: Gareth Jones <gjones@glysade.com>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2020-06-24 17:27:40 +02:00