43 Commits

Author SHA1 Message Date
Yakov Pechersky
c6cabf4153 Speed-up tautomer canonicalization, no API changes (#9134)
* Speed up tautomer canonicalization by deferring on SSSR calc

* Lazy kekulization for tautomer enumeration

Defer kekulization of tautomers until they are actually needed for
transform matching. This avoids creating kekulized copies for:
1. The initial tautomer (until first iteration)
2. New tautomers that may never be processed (if enumeration ends early)

The Tautomer class now supports lazy initialization of the kekulized
form via getKekulized() method.

Performance improvement: ~7% additional speedup (total ~22-24% from baseline)
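
The lazy-initialization idea described above can be sketched as follows. This is only an illustration in plain Python, not the RDKit implementation: `LazyTautomer`, `get_kekulized` and `fake_kekulize` are hypothetical names, and the "kekulization" here is a placeholder string operation.

```python
from typing import Callable, Optional


class LazyTautomer:
    """Holds a base form and builds the derived (kekulized) form on first use."""

    def __init__(self, base: str, kekulize: Callable[[str], str]):
        self.base = base
        self._kekulize = kekulize              # stand-in for the expensive step
        self._kekulized: Optional[str] = None  # not computed yet

    def get_kekulized(self) -> str:
        # Compute once, on first request, then reuse the cached result.
        if self._kekulized is None:
            self._kekulized = self._kekulize(self.base)
        return self._kekulized


calls = []


def fake_kekulize(smiles: str) -> str:
    # Hypothetical placeholder: records the call and upper-cases the string.
    calls.append(smiles)
    return smiles.upper()


taut = LazyTautomer("c1ccccc1", fake_kekulize)
never_used = LazyTautomer("c1ccncc1", fake_kekulize)  # never pays the cost
first = taut.get_kekulized()
second = taut.get_kekulized()  # served from the cache
```

Only the tautomer that is actually processed triggers the expensive step, and it does so once.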

* Use count-only substructure matching in tautomer scoring

* Add SubstructMatchCount regression test

* MolStandardize: reduce enumerate overhead

* MolStandardize: avoid per-tautomer ring recomputation

* Atom: cache PeriodicTable pointer in valence calcs

* Atom: reuse PeriodicTable in getEffectiveAtomicNum

* PeriodicTable: add atomic fast path for getTable

* GraphMol: reduce ROMol copy reallocations

* MolStandardize: use quickCopy for per-match product copies

Use RWMol(*kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry.

* MolStandardize: pre-filter scoring patterns by element/connectivity

For tautomer scoring, pre-compute which SubstructTerms are relevant for
a given input molecule. Since tautomerization only moves H atoms and
changes bond orders (never creates/destroys heavy-atom bonds), patterns
requiring missing elements or connectivity can be skipped for all
tautomers of that molecule.

Two-stage filtering:
1. Element check: skip patterns requiring atoms not in the molecule
2. Connectivity check: skip patterns whose bond-order-agnostic structure
   doesn't match the input molecule's connectivity

This reduces the number of VF2 substructure calls per tautomer from 12
to typically 3-5, depending on the molecule's composition.
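
A minimal sketch of the two-stage filter, assuming a toy molecule representation (a list of element symbols plus index pairs for bonds). The `Pattern` type and `relevant_patterns` function are hypothetical, and the connectivity stage is simplified to "required element pairs are bonded somewhere", whereas the commit describes a bond-order-agnostic structure match.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    name: str
    elements: frozenset      # elements the pattern requires
    bonded_pairs: frozenset  # element pairs that must be bonded (order ignored)


def relevant_patterns(mol_elements, mol_bonds, patterns):
    """Return the patterns that could match some tautomer of the molecule."""
    present = set(mol_elements)
    pairs = {frozenset((mol_elements[a], mol_elements[b])) for a, b in mol_bonds}
    keep = []
    for p in patterns:
        if not p.elements <= present:    # stage 1: element check
            continue
        if not p.bonded_pairs <= pairs:  # stage 2: connectivity check
            continue
        keep.append(p)
    return keep


def pair(a, b):
    return frozenset((a, b))


patterns = [
    Pattern("C=O", frozenset("CO"), frozenset([pair("C", "O")])),
    Pattern("N=O", frozenset("NO"), frozenset([pair("N", "O")])),
    Pattern("P=O", frozenset("PO"), frozenset([pair("P", "O")])),
    Pattern("C=N", frozenset("CN"), frozenset([pair("C", "N")])),
]

# Glycine-like heavy-atom skeleton: N-C-C(-O)-O
elements = ["N", "C", "C", "O", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (2, 4)]
kept = [p.name for p in relevant_patterns(elements, bonds, patterns)]
```

Because the heavy-atom graph is the same for every tautomer, the filter runs once per input molecule and the surviving list is reused for each tautomer.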

* MolStandardize: preserve molecule properties for canonical tautomer

Copy molecule properties from the original input to the canonical tautomer
result. Since quickCopy during enumeration skips d_props to avoid overhead,
extended SMILES data like link nodes (LN) was lost. This restores them
on the final result.

* TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers

TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses
quickCopy for performance. This doesn't copy molecule properties like
_molLinkNodes. Without this fix, XQMol output would lose link node
extensions in the SMILES.

Copy properties from the original query molecule to all enumerated
tautomers before constructing the TautomerQuery. This preserves extended
SMILES data without impacting enumeration performance.

* MolStandardize: use parallel iteration and cache bond lookups

Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration
over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond
lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches.

* perf: add specialized matchers for simple tautomer scoring patterns

Replace VF2 graph matching with O(n) loops for 6 simple patterns:
- countDoubleOrAromaticBonds: C=O, N=O, P=O patterns
- countMethyls: [CX4H3] methyl groups
- countCarbonDoubleHetero: aliphatic C=hetero
- countAromaticCarbonExocyclicN: [c]=aromatic C=exocyclic N
Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2.
Combined with the pre-filtering optimization, this achieves ~3.7x speedup
(~2500ms vs ~9300ms original) for tautomer canonicalization.
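
To illustrate what replacing a graph match with a single pass over the bonds looks like, here is a sketch in plain Python. The function name mirrors `countDoubleOrAromaticBonds` from the message, but the signature and the tuple-based molecule representation are assumptions made for this example only.

```python
def count_double_or_aromatic_bonds(atoms, bonds, elem_a, elem_b):
    """Count double or aromatic bonds that join elem_a and elem_b.

    atoms: list of element symbols, indexed by atom index
    bonds: list of (begin_idx, end_idx, order) tuples
    """
    wanted = {elem_a, elem_b}
    count = 0
    for begin, end, order in bonds:  # one pass, no backtracking search
        if order not in ("double", "aromatic"):
            continue
        if {atoms[begin], atoms[end]} == wanted:
            count += 1
    return count


# Acetic acid heavy atoms: C-C(=O)-O
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, "single"), (1, 2, "double"), (1, 3, "single")]
n_carbonyl = count_double_or_aromatic_bonds(atoms, bonds, "C", "O")
n_nitroso = count_double_or_aromatic_bonds(atoms, bonds, "N", "O")
```

The loop touches each bond once, so the cost is linear in the number of bonds regardless of how many patterns share the same shape.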

* Fix tautomer canonicalize dropping conformers from quickCopy

quickCopy (RWMol(*mol, true)) skips conformers, so tautomer
enumeration products lose 2D/3D coordinates. This causes InChI
generation to omit the /b (double bond E/Z stereo) layer, since
E/Z is derived from atomic coordinates.

Fix: copy conformers from the original molecule onto the canonical
tautomer after pickCanonical in TautomerEnumerator::canonicalize().

Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based
conformer preservation check in catch_tests.cpp.
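
The pattern behind this fix (and the property-preservation fixes earlier in this message) can be sketched as: enumerate on cheap copies, then restore the skipped data once on the final result. The `Mol` dataclass, `quick_copy` and `canonicalize` below are hypothetical stand-ins, not RDKit classes, and the property value is made up.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Mol:
    bonds: list
    props: dict = field(default_factory=dict)
    conformers: list = field(default_factory=list)


def quick_copy(mol: Mol) -> Mol:
    # Copy only the graph; properties and conformers are deliberately skipped.
    return Mol(bonds=list(mol.bonds))


def canonicalize(mol: Mol, enumerate_fn, pick_fn) -> Mol:
    candidates = enumerate_fn(quick_copy(mol))  # cheap copies while enumerating
    best = pick_fn(candidates)
    # Restore what the quick copy dropped, once, on the final result only.
    best.props = dict(mol.props)
    best.conformers = copy.deepcopy(mol.conformers)
    return best


original = Mol(
    bonds=[(0, 1, "double")],
    props={"_molLinkNodes": "hypothetical value"},
    conformers=[[(0.0, 0.0, 0.0), (1.3, 0.0, 0.0)]],
)
result = canonicalize(original, lambda m: [m], lambda cands: cands[0])
```

Restoring after selection keeps the per-candidate copies cheap while the returned molecule still carries coordinates and properties.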

* add test on canonicalize losing stereo

* add regression test for exocyclic C=C tautomer canonicalization

The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely
deduplicate distinct tautomers when their atom-index-ordered state
patterns happen to match, leading canonicalize() to pick the wrong
canonical form for molecules with STEREOTRANS-pinned exocyclic C=C
bonds after RemoveHs.

Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the
exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form
O=C1C=C(C=C2CC=COC2)C(=O)N1.

Currently expected to FAIL until the state key dedup bug is fixed.

* MolStandardize: expand tautomer connectivity SMARTS

* MolStandardize: scope tautomer pattern enum

* MolStandardize: trim tautomer pattern enum

* MolStandardize: use symmetric ring scoring
2026-03-31 06:42:40 +02:00
Greg Landrum
ef90a4bedf Allow adding custom atom and bond matcher functions for substructure searching (#8994)
* extra SSS match functions for atoms/bonds
initial implementation and testing

* add baseline to test

* add a functor for matching atom coords

* support the extra checks in python

* refactor the way the python callbacks are handled

* test tolerances

* expose the AtomCoordsMatcher to python

* allow the extra checks to override the default matching
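
As a rough illustration of an extra atom-matching functor with a tolerance, consider the sketch below. It is not the RDKit `AtomCoordsMatcher` API: the class name `CoordsMatcher`, the `atoms_match` helper and the `override` flag are assumptions introduced for this example.

```python
import math


class CoordsMatcher:
    """Callable that accepts an atom pair when their coordinates agree within tol."""

    def __init__(self, query_coords, target_coords, tol=1e-4):
        self.query_coords = query_coords
        self.target_coords = target_coords
        self.tol = tol

    def __call__(self, query_idx, target_idx):
        q = self.query_coords[query_idx]
        t = self.target_coords[target_idx]
        return math.dist(q, t) <= self.tol


def atoms_match(q_elem, t_elem, q_idx, t_idx, extra_check=None, override=False):
    """Combine a default element comparison with an optional extra check."""
    default = q_elem == t_elem
    if extra_check is None:
        return default
    extra = extra_check(q_idx, t_idx)
    # With override=True the extra check replaces the default comparison.
    return extra if override else (default and extra)


matcher = CoordsMatcher([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.00005), (1.0, 0.0, 0.0)])
same_place = matcher(0, 0)
far_apart = matcher(0, 1)
```

With `override=True`, a carbon query atom could match a nitrogen target atom at the same position, which is the kind of behaviour the last bullet refers to.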

---------

Co-authored-by: = <=>
2025-12-12 20:03:31 +01:00
Greg Landrum
a9477d2694 Modernization of some substructure code (#8450)
* use std::span for substruct match callbacks

This removes a copy from every evaluation of potential matches

* some cleanup/modernization

* some modernization

* deprecate chiralAtomCompat

* small optimization

* remove naked pointers

* improve new_timings.py script

* changes suggested in review

* response to review

* response to review
2025-05-12 06:33:25 +02:00
Greg Landrum
5976eead54 Fixes #8485 (#8490) 2025-05-05 08:57:18 +02:00
Greg Landrum
e77d4e3f6a allow specified chiral features to SSS match unspecified features (#8115) 2024-12-18 20:37:17 +01:00
Greg Landrum
4a69bc3493 Fixes #6017 (#6825)
* Fixes #6017

* a bit of cleanup work

* remove unused variable

* change in response to review
switch to using std::max(maxMatches,maxRecursiveMatches)

* test the case where maxSubstructMatches<maxMatches
2023-10-25 04:57:29 +02:00
Rachel Walker
70427aa9b4 Add atom and bond property parameters to substruct matching (#6453)
* Add atom and bond property parameters to substruct matching

* use getPropIfPresent in propertyCompat

* fix typo

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Update Code/GraphMol/Substruct/SubstructUtils.cpp

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Update Code/GraphMol/Substruct/SubstructUtils.cpp

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* added python tests

* Add PRECONDITIONs

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

---------

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2023-06-15 05:08:48 +02:00
Greg Landrum
71051cde10 Fixes #6211 (#6250)
* backup

* basic tests pass

* add JSON out to substruct match parameters

* serialize the substruct match parameters in reactions

* add that to the python wrapper

* more testing
2023-04-05 19:08:37 +02:00
Greg Landrum
4e1a590b9f Fixes #888 (#6018)
* Fixes #888

* support older versions of boost
support for hashing dynamic_bitset was not added until v1.71

* changes in response to review
2023-01-30 17:18:22 +01:00
Brian Kelley
866e0f19f0 silence warnings in MSVC compilations (#4796) 2021-12-15 04:54:11 +01:00
Greg Landrum
52f73e4be0 Add support for Beilstein generics when doing substructure queries (#4673)
* backup commit
This is maybe heading in the right direction and at least passes the basic tests which are there.

* some progress

* more tests and refactoring

* additional aliases
add carboaryl

* add CYC and ACY

* add ABC

* add AHC

* CBC and AOX

* add CHC and HAR

* add CXX

* cleanup: remove a bunch of nullptrs

* initial tagging support

* remove atom labels/sgroups after using them

* docs

* start handling writing

NOTE: this does not currently work: the generic code needs to move out of SubstructSearch

* move the generic groups to their own library

Signed-off-by: greg landrum <greg.landrum@gmail.com>

* make sure the generic groups end up in ctabs

* add forgotten CMakeLists.txt

* fix includes

* expose this stuff to Python

* CYC needs to initialize rings

* renaming

* add docs

* change in response to review
2021-12-01 06:01:53 +01:00
Eisuke Kawashima
78aac3c1bc Run clang-format against header files (#4143) 2021-06-08 07:57:51 +02:00
Gareth Jones
c2fb57c19f RGD - a fix for the cubane issue (single target atom matches 2 user R group attachments) (#4002)
* Most tests working

* All tests working

* Fixed tests after merge with master

* Create header and implementations for RCore

* Updated comments

* Removed old code

* DLL export for MolMatchFinalCheckFunctor

* Information line for failing Mac test

* Log replace core behaviour

* Ordering fix for OSX

* Possible fuzzer fix

* Removed debug output

* Fix unmatched user R group bug

* Code review changes

* Bug fix and ChemTransforms test
2021-05-23 15:16:03 -04:00
Greg Landrum
f829c877d8 MinimalLib: add CFFI interface (#4018)
* hello world works

* more

* more
minimallib needs to be tested

* parse substructure parameters from JSON

* add substruct search and parameters

* add descriptors

* register more descriptors

* fingerprints, first pass

* stop outputting tiny coord vals

* support generating 2d coords

* coordgen testing

* return nulls

* initial 3d support; add/removeHs; cleanup

* Embedding parameters from JSON

* update

* pattern fp, fps as bytes

* use json to configure MFP

* use json to configure rdkit and pattern fps

* aligned 2d coords

* parsing options

* options for writers

* rename remove_hs

* get this working on windows (kind of)

* silence some msvc warnings

* cmake updates

* update python tests

* add the CFFI code to CI builds

* cleanup line ending mess?

* a couple small fixes

* make this work with URF

* support coordMap in the 3D coordinate generation

* updates in response to review
2021-04-15 21:33:52 +02:00
Dan N
3dc1a220b7 Allow enhanced stereo to be used in substructure search (#3003)
* Test only commit for using enhanced stereo in substructure search

Adds some test cases to demonstrate what I'm planning.

When the test cases fail, the messages look like this:

    -------------------------------------------------------------------------------
    Enhanced stereochemistry
    AND and OR match their enantiomer
    -------------------------------------------------------------------------------
    /Users/wandschn/Documents/src/rdkit/Code/GraphMol/Substruct/catch_tests.cpp:216
    ...............................................................................

    /Users/wandschn/Documents/src/rdkit/Code/GraphMol/Substruct/catch_tests.cpp:218: FAILED:
    CHECK_THAT( *opposite_mol, IsSubstructOf(*mol_and, ps) )
    with expansion:
    CC[C@@H](F)[C@@H](C)O is not a substructure of CC[C@H](F)[C@H](C)O |&1:2,4|

    /Users/wandschn/Documents/src/rdkit/Code/GraphMol/Substruct/catch_tests.cpp:219: FAILED:
    CHECK_THAT( *opposite_mol, IsSubstructOf(*mol_or, ps) )
    with expansion:
    CC[C@@H](F)[C@@H](C)O is not a substructure of CC[C@H](F)[C@H](C)O |o1:2,4|

* rename parameter to include q and m to reduce my confusion

* Don't keep recreating a map

This map is the same in every loop. And actually, the desired
information is slightly different than what was formerly stored
in the map.

* Fix tests after our discussion.

Also adds more exciting tests of diastereomers and structures
with multiple stereo groups.

* Use enhanced stereochemistry in substructure searching

Allows use of enhanced stereochemistry in substructure searching
if `SubstructMatchParameters.useEnhancedStereo` is set.

The matching rules are pretty obnoxious, but a synopsis is:

* An achiral query/substructure matches everything, because it
  means "ignore chirality".
* An absolute query matches AND or OR, because they both include
  the molecule with an absolute center
* A query with an OR matches either an OR or an AND, because
  AND is more molecules.
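
The synopsis above can be written as a small decision function. This is a sketch of the stated rules only, in plain Python with hypothetical names; it decides whether the stereo group kinds are compatible and does not compare the actual chirality of the centers. Cases the synopsis does not mention (for example an AND query) fall through to a "same kind only" rule, which is an assumption of this sketch.

```python
def stereo_kinds_compatible(query_kind: str, mol_kind: str) -> bool:
    """query_kind / mol_kind are one of: 'achiral', 'abs', 'or', 'and'."""
    if query_kind == "achiral":
        # An achiral query means "ignore chirality".
        return True
    if query_kind == "abs":
        # AND and OR both include the molecule with the absolute center.
        return mol_kind in ("abs", "and", "or")
    if query_kind == "or":
        # AND describes more molecules than OR.
        return mol_kind in ("or", "and")
    # Not covered by the synopsis: assume the kinds must be identical.
    return query_kind == mol_kind


kinds = ("achiral", "abs", "or", "and")
table = {(q, m): stereo_kinds_compatible(q, m) for q in kinds for m in kinds}
```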

* add info about matching to the documentation

* expose extended stereo matching option to python

* Some updates/tweaks to the documentation of enhanced stereochemistry

especially about searching.

* Code review comments.

Co-authored-by: greg landrum <greg.landrum@gmail.com>
2020-03-21 05:12:40 +01:00
Greg Landrum
a2767d9f7d Allow custom post-match filters for substructure matching (#2927)
* backup, does not work

* working on the C++ side

* backup

* fix the API

* document the new functionality

* improve that example

* final bit of cleanup

* switch to std::function
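
A post-match filter of this kind can be pictured as a callback applied to each candidate match before it is returned. The sketch below is plain Python with hypothetical names (`filter_matches`, `no_charged_atoms`) and a toy "molecule" that is just a list of formal charges; it does not reproduce the RDKit signature.

```python
from typing import Callable, Iterable, Optional, Sequence

Match = Sequence[int]  # target atom indices, ordered by query atom


def filter_matches(mol, matches: Iterable[Match],
                   final_check: Optional[Callable] = None):
    """Keep only the candidate matches accepted by the optional callback."""
    if final_check is None:
        return [tuple(m) for m in matches]
    return [tuple(m) for m in matches if final_check(mol, m)]


# Hypothetical molecule: per-atom formal charges only.
charges = [0, -1, 0, 0]
candidates = [(0, 1), (2, 3), (1, 2)]


def no_charged_atoms(mol, match):
    # Reject any match that touches a charged atom.
    return all(mol[idx] == 0 for idx in match)


accepted = filter_matches(charges, candidates, no_charged_atoms)
```

Passing the callback as an ordinary callable keeps the matcher itself unaware of what the filter checks.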
2020-02-04 11:22:38 -05:00
Greg Landrum
ec31bea97b clang-tidy-7 pass (#2408) 2019-04-16 12:05:47 -04:00
Greg Landrum
a102eaf932 Add options for substructure searching (#2254)
* first pass at adding a SubstructMatchParameter struct

* start moving the rest of the backend to use the parameters

* backend at least mostly moved over

* add aromaticMatchesConjugated
add tests

* switch over the MolBundle too
Add templates to reduce duplicated code

* support older compilers

let's see if it works...

* add SubstructMatchParameters to Python wrapper

* remove some deprecations and warnings

* damn compilers

* parameter support for bundles in python wrapper

* add the parameters to the java wrappers

* response to review
2019-02-08 09:10:10 -05:00
Greg Landrum
2738c35178 Fixes #1903 (#1971)
* Fixes #1903

* update SWIG bindings too
2018-07-25 09:14:17 +02:00
Paolo Tosco
c08ea49bda - enable building DLLs on Windows (#1861)
* - enable building DLLs on Windows

* - export.h and test.h are now auto-generated by CMake
2018-05-16 08:42:41 +02:00
Greg Landrum
bbd615497a Add a MolBundle class (#1537)
* very basics

* add the version to get all matches

* better exceptions, including tests

* documentation and actually add the test code

* responses to review
2017-09-11 13:04:58 -04:00
Greg Landrum
769e6648e4 Fixes #1489 (#1556)
* move the describeQuery functions to the RDKit namespace.
They are generally useful

* Fixes #1489
2017-09-11 08:34:25 -04:00
Greg Landrum
e08e0d16d8 first pass, using google style 2015-11-14 14:58:11 +01:00
Greg Landrum
5992c6fd23 - made the ResonanceMolSupplier really lazy, i.e. resonance structure enumeration is only carried out when the user asks for a structure or explicitly requests it by calling the enumerate() member function. This makes object creation fast and enables calling the getNumConjGrps(), getBondConjGrpIdx() and getAtomConjGrpIdx() member functions without necessarily incurring the cost of enumerating resonance structures
- now bonds and atoms which do not belong to conjugated groups get a -1 index
- added a few Python wrappers
- added a few tests
2015-11-04 05:39:46 +01:00
Paolo Tosco
3d48ba72e1 - added threading support to ResonanceMolSupplier and relevant tests
- added threading support to the ResonanceMolSupplier-enabled
  SubstructMatch() and relevant tests
- modified/removed some code in O3AAlignMolecules.cpp which doesn't
  seem necessary anymore
- modified Code/GraphMol/CMakeLists.txt to allow building
  on Windows
2015-11-01 23:01:34 +00:00
Paolo Tosco
f43677b978 - fixed a problem with thiocarboxylates/thiolates not being perceived
as conjugated like their oxygen analogs
- fixed an issue with large numbers of resonance structures exceeding
  the unsigned int allowance
- implemented the uniquify feature properly
- uniquify now defaults to false when using the ResonanceMolSupplier-
  enabled SubstructMatch() version
- the concept of 'laziness' is now clearer
- TODO:
  * remove some debugging info
  * move classes from .h to .cpp
  * SWIG wrappers
  * improve resonance structure sorting for degenerate resonance
    structures
  I will do all of the above ASAP
2015-10-21 20:06:53 +01:00
Paolo Tosco
eaa187b03d - added ResonanceMolSupplier
- added overloaded SubstructMatch() version supporting ResonanceMolSupplier
- added relevant Python wrappers
- added C++/Python tests
2015-10-04 23:21:28 +01:00
Greg Landrum
4b8caf2ceb Fixes #409 2015-01-10 07:21:55 +01:00
Greg Landrum
34ab68ca2a introduce QueryBond::QueryMatch, as with QueryAtoms;
all tests passing;
performance tests still needed
2014-05-07 05:29:25 +02:00
Greg Landrum
4a14a52674 Fixes #153 2013-11-15 06:47:18 +01:00
Greg Landrum
f3fbef45c5 update copyright statements 2010-09-26 17:04:37 +00:00
Greg Landrum
4db8233db6 sync with trunk 2010-09-10 05:12:41 +00:00
Greg Landrum
052ec66542 cleanups:
remove x bit from headers and sources;
remove a couple empty files from Code/GraphMol
2010-09-08 04:25:57 +00:00
Greg Landrum
0ce95829a5 cleanup deprecated args 2010-08-21 00:16:55 +00:00
Greg Landrum
f42f479d28 enabling infrastructure for making repeated recursive
smarts queries run faster (like vector bindings). Though there is an addition to the smarts parser
exposed here, I do not recommend using it in client code.
2010-06-03 10:02:15 +00:00
Greg Landrum
ec2c2042e8 remove the vflib usage code from Substruct area 2010-04-19 07:45:22 +00:00
Greg Landrum
30fe77b609 initial commit: passes all tests and seems to be faster than the original code 2009-02-09 16:19:31 +00:00
Greg Landrum
e450f5beeb doc updates and some minor formatting changes 2007-10-24 16:37:37 +00:00
Greg Landrum
7cfa8cde0b another substruct caching try 2007-09-23 06:56:13 +00:00
Greg Landrum
d5ffea669d add support for chirality in substructure searches;
this is only going to work in cases where CIP codes have been (i.e. can be)
assigned to atoms.
2006-11-03 06:35:14 +00:00
Greg Landrum
88d596abca get the AR_MOLGRAPH caching write with substructs; the current implementation introduces a core leak, so it is disabled by default 2006-10-19 05:24:05 +00:00
Greg Landrum
a5d7fc550a try to get ChemTransforms checked in 2006-07-18 05:35:12 +00:00
Greg Landrum
75a79b6327 initial import 2006-05-06 22:20:08 +00:00