* Speed up tautomer canonicalization by deferring on SSSR calc
* Lazy kekulization for tautomer enumeration
Defer kekulization of tautomers until they are actually needed for
transform matching. This avoids creating kekulized copies for:
1. The initial tautomer (until first iteration)
2. New tautomers that may never be processed (if enumeration ends early)
The Tautomer class now supports lazy initialization of the kekulized
form via getKekulized() method.
Performance improvement: ~7% additional speedup (total ~22-24% from baseline)
* Use count-only substructure matching in tautomer scoring
* Add SubstructMatchCount regression test
* MolStandardize: reduce enumerate overhead
* MolStandardize: avoid per-tautomer ring recomputation
* Atom: cache PeriodicTable pointer in valence calcs
* Atom: reuse PeriodicTable in getEffectiveAtomicNum
* PeriodicTable: add atomic fast path for getTable
* GraphMol: reduce ROMol copy reallocations
* MolStandardize: use quickCopy for per-match product copies
Use RWMol(*kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry.
* MolStandardize: pre-filter scoring patterns by element/connectivity
For tautomer scoring, pre-compute which SubstructTerms are relevant for
a given input molecule. Since tautomerization only moves H atoms and
changes bond orders (never creates/destroys heavy-atom bonds), patterns
requiring missing elements or connectivity can be skipped for all
tautomers of that molecule.
Two-stage filtering:
1. Element check: skip patterns requiring atoms not in the molecule
2. Connectivity check: skip patterns whose bond-order-agnostic structure
doesn't match the input molecule's connectivity
This reduces the number of VF2 substructure calls per tautomer from 12
to typically 3-5, depending on the molecule's composition.
* MolStandardize: preserve molecule properties for canonical tautomer
Copy molecule properties from the original input to the canonical tautomer
result. Since quickCopy during enumeration skips d_props to avoid overhead,
extended SMILES data like link nodes (LN) was lost. This restores them
on the final result.
* TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers
TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses
quickCopy for performance. This doesn't copy molecule properties like
_molLinkNodes. Without this fix, XQMol output would lose link node
extensions in the SMILES.
Copy properties from the original query molecule to all enumerated
tautomers before constructing the TautomerQuery. This preserves extended
SMILES data without impacting enumeration performance.
* MolStandardize: use parallel iteration and cache bond lookups
Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration
over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond
lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches.
* perf: add specialized matchers for simple tautomer scoring patterns
Replace VF2 graph matching with O(n) loops for 6 simple patterns:
- countDoubleOrAromaticBonds: C=O, N=O, P=O patterns
- countMethyls: [CX4H3] methyl groups
- countCarbonDoubleHetero: [C]=[/home/dcvuser/rdkit;Code/GraphMol/MolStandardize/Tautomer.h] aliphatic C=hetero
- countAromaticCarbonExocyclicN: [c]=aromatic C=exocyclic N
Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2.
Combined with the pre-filtering optimization, this achieves ~3.7x speedup
(~2500ms vs ~9300ms original) for tautomer canonicalization.
* Fix tautomer canonicalize dropping conformers from quickCopy
quickCopy (RWMol(*mol, true)) skips conformers, so tautomer
enumeration products lose 2D/3D coordinates. This causes InChI
generation to omit the /b (double bond E/Z stereo) layer, since
E/Z is derived from atomic coordinates.
Fix: copy conformers from the original molecule onto the canonical
tautomer after pickCanonical in TautomerEnumerator::canonicalize().
Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based
conformer preservation check in catch_tests.cpp.
* add test on canonicalize losing stereo
* add regression test for exocyclic C=C tautomer canonicalization
The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely
deduplicate distinct tautomers when their atom-index-ordered state
patterns happen to match, leading canonicalize() to pick the wrong
canonical form for molecules with STEREOTRANS-pinned exocyclic C=C
bonds after RemoveHs.
Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the
exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form
O=C1C=C(C=C2CC=COC2)C(=O)N1.
Currently expected to FAIL until the state key dedup bug is fixed.
* MolStandardize: expand tautomer connectivity SMARTS
* MolStandardize: scope tautomer pattern enum
* MolStandardize: trim tautomer pattern enum
* MolStandardize: use symmetric ring scoring
* extra SSS match functions for atoms/bonds
initial implementation and testing
* add baseline to test
* add a functor for matching atom coords
* support the extra checks in python
* refactor the way the python callbacks are handled
* test tolerances
* expose the AtomCoordsMatcher to python
* allow the extra checks to override the default matching
---------
Co-authored-by: = <=>
* use std::span for substruct match callbacks
This removes a copy from every evaluation of potential matches
* some cleanup/modernization
* some modernization
* deprecate chiralAtomCompat
* small optimization
* remove naked pointers
* improve new_timings.py script
* changes suggested in review
* response to review
* response to review
* Fixes#6017
* a bit of cleanup work
* remove unused variable
* change in response to review
switch to using std::max(maxMatches,maxRecursiveMatches)
* test the case where maxSubstructMatches<maxMatches
* backup
* basic tests pass
* add JSON out to substruct match parameters
* serialize the substruct match parameters in reactions
* add that to the python wrapper
* more testing
* backup commit
This is mabye heading in the right direction and at least passes the basic tests which are there.
* some progress
* more tests and refactoring
* additional aliases
add carboaryl
* add CYC and ACY
* add ABC
* add AHC
* CBC and AOX
* add CHC and HAR
* add CXX
* cleanup: remove a bunch of nullptrs
* initial tagging support
* remove atom labels/sgroups after using them
* docs
* start handing writing
NOTE: this does not currently work: the generic code needs to move out of SubstructSearch
* move the generic groups to their own library
Signed-off-by: greg landrum <greg.landrum@gmail.com>
* make sure the generic groups end up in ctabs
* add forgotten CMakeLists.txt
* fix includes
* expose this stuff to Python
* CYC needs to initialize rings
* renaming
* add docs
* change in response to review
* Most tests working
* All tests working
* Fixed tests after merge with master
* Create header and implementations for RCore
* Updated comments
* Removed old code
* DLL export for MolMatchFinalCheckFunctor
* Information line for failing Mac test
* Log replace core behaviour
* Ordering fix for OSX
* Possible fuzzer fix
* Removed debug output
* Fix unmatched user R group bug
* Code review changes
* Bug fix and ChemTransforms test
* hello world works
* more
* more
minimallib needs to be tested
* parse substructure parameters from JSON
* add substruct search and parameters
* add descriptors
* register more descriptors
* fingerprints, first pass
* stop outputting tiny coord vals
* support generating 2d coords
* coordgen testing
* return nulls
* initial 3d support; add/removeHs; cleanup
* Embedding parameters from JSON
* update
* pattern fp, fps as bytes
* use json to configure MFP
* use json to configure rdkit and pattern fps
* aligned 2d coords
* parsing options
* options for writers
* rename remove_hs
* get this working on windows (kind of)
* silence some msvc warnings
* cmake updates
* update python tests
* add the CFFI code to CI builds
* cleanup line ending mess?
* a couple small fixes
* make this work with URF
* support coordMap in the 3D coordinate generation
* updates in response to review
* Test only commit for using enhanced stereo in substructure search
Adds some test cases to demonstrate what I'm planning.
When the test cases fail, the messages look like this:
-------------------------------------------------------------------------------
Enhanced stereochemistry
AND and OR match their enantiomer
-------------------------------------------------------------------------------
/Users/wandschn/Documents/src/rdkit/Code/GraphMol/Substruct/catch_tests.cpp:216
...............................................................................
/Users/wandschn/Documents/src/rdkit/Code/GraphMol/Substruct/catch_tests.cpp:218: FAILED:
CHECK_THAT( *opposite_mol, IsSubstructOf(*mol_and, ps) )
with expansion:
CC[C@@H](F)[C@@H](C)O is not a substructure of CC[C@H](F)[C@H](C)O |&1:2,4|
/Users/wandschn/Documents/src/rdkit/Code/GraphMol/Substruct/catch_tests.cpp:219: FAILED:
CHECK_THAT( *opposite_mol, IsSubstructOf(*mol_or, ps) )
with expansion:
CC[C@@H](F)[C@@H](C)O is not a substructure of CC[C@H](F)[C@H](C)O |o1:2,4|
* rename parameter to include q and m to reduce my confusion
* Don't keep recreating a map
This map is the same in every loop. And actually, the desired
information is slightly different than what was formerly stored
in the map.
* Fix tests after our discussion.
Also adds more exciting tests of disastereomers and structures
with multiple stereo groups.
* Use enhanced stereochemistry in substructure searching
Allows use of enhaced stereochemistry in substructure searching
if `SubstructMatchParameters.useEnhancedStereo` is set.
The matching rules are pretty obnoxious, but a synopsis is:
* An achiral query/substructure matches everything, because it
means "ignore chirality".
* An absolute query matches AND or OR, because they both include
the molecule with an absolute center
* An query with an OR matches either an OR or an AND, because
AND is more molecules.
* add info about matching to the documentation
* expose extended stereo matching option to python
* Some updates/tweaks to the documentation of enhanced stereochemistry
especially about searching.
* Code review comments.
Co-authored-by: greg landrum <greg.landrum@gmail.com>
* backup, does not work
* working on the C++ side
* backup
* fix the API
* document the new functionality
* improve that example
* final bit of cleanup
* switch to std::function
* first pass at adding a SubstructMatchParameter struct
* start moving the rest of the backend to use the parameters
* backend at least mostly moved over
* add aromaticMatchesConjugated
add tests
* switch over the MolBundle too
Add templates to reduce duplicated code
* support older compilers
let's see if it works...
* add SubstructMatchParameters to Python wrapper
* remove some deprecations and warnings
* damn compilers
* parameter support for bundles in python wrapper
* add the parameters to the java wrappers
* response to review
* very basics
* add the version to get all matches
* better exceptions, including tests
* documentation and actually add the test code
* responses to review
- added threading support to the ResonanceMolSupplier-enabled
SubstructMatch() and relevant tests
- modified/removed some code in O3AAlignMolecules.cpp which doesn't
seem necessary anymore
- modified Code/GraphMol/CMakeLists.txt to allow building
on Windows
as conjugated like their oxygen analogs
- fixed an issue with large numbers of resonance structures exceeding
the unsigned int allowance
- implemented the uniquify feature properly
- uniquify now defaults to false when using the ResonanceMolSupplier-
enabled SubstructMatch() version
- the concept of 'laziness' is now clearer
- TODO:
* remove some debugging info
* move classes from .h to .cpp
* SWIG wrappers
* improve resonance structure sorting for degenerate resonance
structures
I will do all of the above ASAP
smarts queries run faster (like vector bindings). Though there is an addition to the smarts parser
exposed here, I do not recommend using it in client code.