26 Commits

Author SHA1 Message Date
Yakov Pechersky
c6cabf4153 Speed-up tautomer canonicalization, no API changes (#9134)
* Speed up tautomer canonicalization by deferring on SSSR calc

* Lazy kekulization for tautomer enumeration

Defer kekulization of tautomers until they are actually needed for
transform matching. This avoids creating kekulized copies for:
1. The initial tautomer (until first iteration)
2. New tautomers that may never be processed (if enumeration ends early)

The Tautomer class now supports lazy initialization of the kekulized
form via getKekulized() method.

Performance improvement: ~7% additional speedup (total ~22-24% from baseline)

* Use count-only substructure matching in tautomer scoring

* Add SubstructMatchCount regression test

* MolStandardize: reduce enumerate overhead

* MolStandardize: avoid per-tautomer ring recomputation

* Atom: cache PeriodicTable pointer in valence calcs

* Atom: reuse PeriodicTable in getEffectiveAtomicNum

* PeriodicTable: add atomic fast path for getTable

* GraphMol: reduce ROMol copy reallocations

* MolStandardize: use quickCopy for per-match product copies

Use RWMol(*kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry.

* MolStandardize: pre-filter scoring patterns by element/connectivity

For tautomer scoring, pre-compute which SubstructTerms are relevant for
a given input molecule. Since tautomerization only moves H atoms and
changes bond orders (never creates/destroys heavy-atom bonds), patterns
requiring missing elements or connectivity can be skipped for all
tautomers of that molecule.

Two-stage filtering:
1. Element check: skip patterns requiring atoms not in the molecule
2. Connectivity check: skip patterns whose bond-order-agnostic structure
   doesn't match the input molecule's connectivity

This reduces the number of VF2 substructure calls per tautomer from 12
to typically 3-5, depending on the molecule's composition.

* MolStandardize: preserve molecule properties for canonical tautomer

Copy molecule properties from the original input to the canonical tautomer
result. Since quickCopy during enumeration skips d_props to avoid overhead,
extended SMILES data like link nodes (LN) was lost. This restores them
on the final result.

* TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers

TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses
quickCopy for performance. This doesn't copy molecule properties like
_molLinkNodes. Without this fix, XQMol output would lose link node
extensions in the SMILES.

Copy properties from the original query molecule to all enumerated
tautomers before constructing the TautomerQuery. This preserves extended
SMILES data without impacting enumeration performance.

* MolStandardize: use parallel iteration and cache bond lookups

Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration
over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond
lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches.

* perf: add specialized matchers for simple tautomer scoring patterns

Replace VF2 graph matching with O(n) loops for 6 simple patterns:
- countDoubleOrAromaticBonds: C=O, N=O, P=O patterns
- countMethyls: [CX4H3] methyl groups
- countCarbonDoubleHetero: [C]=[/home/dcvuser/rdkit;Code/GraphMol/MolStandardize/Tautomer.h] aliphatic C=hetero
- countAromaticCarbonExocyclicN: [c]=aromatic C=exocyclic N
Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2.
Combined with the pre-filtering optimization, this achieves ~3.7x speedup
(~2500ms vs ~9300ms original) for tautomer canonicalization.

* Fix tautomer canonicalize dropping conformers from quickCopy

quickCopy (RWMol(*mol, true)) skips conformers, so tautomer
enumeration products lose 2D/3D coordinates. This causes InChI
generation to omit the /b (double bond E/Z stereo) layer, since
E/Z is derived from atomic coordinates.

Fix: copy conformers from the original molecule onto the canonical
tautomer after pickCanonical in TautomerEnumerator::canonicalize().

Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based
conformer preservation check in catch_tests.cpp.

* add test on canonicalize losing stereo

* add regression test for exocyclic C=C tautomer canonicalization

The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely
deduplicate distinct tautomers when their atom-index-ordered state
patterns happen to match, leading canonicalize() to pick the wrong
canonical form for molecules with STEREOTRANS-pinned exocyclic C=C
bonds after RemoveHs.

Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the
exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form
O=C1C=C(C=C2CC=COC2)C(=O)N1.

Currently expected to FAIL until the state key dedup bug is fixed.

* MolStandardize: expand tautomer connectivity SMARTS

* MolStandardize: scope tautomer pattern enum

* MolStandardize: trim tautomer pattern enum

* MolStandardize: use symmetric ring scoring
2026-03-31 06:42:40 +02:00
Yakov Pechersky
0986d22c58 Deterministic kekulize, independent of atom and bond order (#9125)
* Make kekulization deterministic

* Add tautomer order-independence regression (python)

* Adjust tautomer tests for deterministic kekulization

* Update graphmol wedged-bond kekulization checks

* SmilesParse: update aromatic bond index expectations

* SmilesParse: refresh cxsmilesTest expected files

* Depictor: update testDepictor expected MolBlocks

* Depictor: update depictorCatch expectations

* Depictor Wrap: update expected MolBlock for pyDepictor

* MarvinParse: update testMrvToMol expected outputs

* FileParsers: refresh testAtropisomers expected outputs

* FileParsers: update tests for deterministic kekulization

* MolDraw2D: refresh brittle bond assertions

* RascalMCES: update expected cluster size

* MinimalLib: make cffi wedging check order-independent

* documentation fix

* MinimalLib: update Kekulé bond table in aligned-coords test

* Hoist duplicated lambdas to TEST_CASE scope

* Remove unused originalWedges variable

* Remove redundant bounds check; clarify wedge-end preference

* Pre-sort allAtms by wedge-end + rank

* Use mol.atomNeighbors() for neighbor iteration

* Check inAllAtms before linear-scanning done

* Drop redundant optsV/wedgedOptsV sorts

* Remove unused Canon.h include

* Add canonical parameter to Kekulize; skip ranking during sanitization

* Test canonical re-kekulization preserves stereo across atom orderings

* MinimalLib: update Kekulé bond orders in invertedWedges

* Change Kekulize canonical default to false, expose in Python wrappers

* keep rank order, push_back

* Revert "RascalMCES: update expected cluster size"

This reverts commit a81bb39495.

* docstring change

* expose new flag to python wrapper

* document changes in ReleaseNotes.md

* revert minimallib test changes again

* canonical = true defaults

* Revert "revert minimallib test changes again"

This reverts commit 039e1d84da.

* Reapply "RascalMCES: update expected cluster size"

This reverts commit 7b83a7a3e8.

---------

Co-authored-by: greg landrum <greg.landrum@gmail.com>
2026-03-19 08:43:13 +01:00
Yakov Pechersky
67b73acba4 when shifting double bonds in tautomerization, set double bond stereo to STEREOANY (#9119)
* when shifting double bonds in tautomerization, set double bond stereo to STEREOANY

fixes #9102

notably, do this only to non-ring bonds
move tests over to assert this
avoid index-based bond lookup in test assertions
since bond indexing can move in tautomers

* inchi unittest check

* fast rings
2026-02-19 19:29:17 +01:00
Greg Landrum
86141183c1 Moving towards getting all tests to pass when using the new stereo code (#8409)
* Fixes #8379

* check in some working tests

* test passes

* test passes

* test passes

* test passes

* test passes

* ensure that the invariants flush the streams on failure

* tests pass

* test passes

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* tests pass

* Fixes #8391

* tests pass

* fix a test with legacy
not clear why this was not causing problems before

* make a test work

* Fixes #8396

* gcc builds work

* fingerprint tests pass

* mention backwards incompatible change

* fix a problem with FindMolChiralCenters

* more testing details

* enable the test status output

* Fixes #8432

fix a bug in double-bond stereo handling for template matching

* all depictor tests pass

* use the new-stereo chiral ranks in the depiction code

* always assign new-stereo chiral ranks

* make _ChiralAtomRank a computed property
This is analogous to _CIPRank

* tweak to the way the atom ordering is computed for 2D coordinate generation

* update two expected results

* backup

* response to review

* tests pass

* tests pass

---------

Co-authored-by: = <=>
2025-04-15 14:00:32 +02:00
Yakov Pechersky
ad4ee83aec Fixes #7044 (#7137)
* oxime tests

* Support allenes in canonicalizing double bonds

* alternate solution to the problem

* expand comment

* reactivate conjugated nitro test

* Fix conjugated nitro tests, have a bondstereo test

* Empty commit to re-proc tests

---------

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2024-10-18 05:32:20 +02:00
Greg Landrum
c7c9ad3328 Add in place and multithread support for more of the MolStandardize code (#6970) 2023-12-12 17:21:18 +01:00
Greg Landrum
522811b8d4 Fixes #5402 (#5542)
* support transforms with branches

* improve output when doing verbose canonicalization

* Fixes #5402
2022-09-09 05:06:42 +02:00
Greg Landrum
310999674b Make the aliphatic imine rule more closely match the definition in the paper (#5270) 2022-05-12 17:16:28 +02:00
Greg Landrum
1431c48d50 Fixes #4700: changes default tautomer enumeration rules (#4727)
* backup

* all tautomer defs which involve z or R{} queries in the original paper have been updated
passes all tests

* add support for the original version of the parameters

* fix win dll builds
2021-11-27 05:02:39 +01:00
Paolo Tosco
0b744c8ba4 fixes #4736 (#4737)
Co-authored-by: Tosco, Paolo <paolo.tosco@novartis.com>
2021-11-26 10:50:11 +01:00
Greg Landrum
f5a54af475 A collection of MolStandardize improvements (#4118)
* Swap to using a data structure for default normalization parameters

* bring the default fragment data into the code too

* cleanup

* add reionizer parameters via data

change fragment parse failures to ValueErrorExceptions

* tautomer parameters in the code

* got a little over-enthusiastic in that last cleanup

* use boost::flyweight to cache normalization and charge data params

* a bit more cleanup

* support reading params from JSON

* fragments from JSON
single-call for fragment removal

* add a one-liner for the canonical tautomer

* quick refactor

* Fixes #4115

* complete the parents

* docs

* move the definitions to a namespace and make them const

* see if switching to c++14 fixes the CI compile problems with g++ 5.5

* somewhat uglier way of solving the initalizer list problem
2021-05-19 09:11:23 +02:00
Paolo Tosco
6373d78744 Fixes #3755 (#3758)
* - fixed VERBOSE_ENUMERATION build
- code cleanup

* - fixes #3755

* changes in response to review
2021-01-27 08:20:25 +01:00
Paolo Tosco
6053f62453 Added support for isotopic Hs to TautomerEnumerator (#3502)
* - added support for isotopic Hs to TautomerEnumerator

* review suggestions

Co-authored-by: greg landrum <greg.landrum@gmail.com>
2020-10-19 13:42:51 +02:00
Paolo Tosco
7d0d7df5f0 Fixes a number of issues flagged by clang (#3498)
* - fixes a number of issues flagged by clang

* - removed commented line
2020-10-15 15:03:34 +02:00
Paolo Tosco
e666eb74fd - silence deprecation warning as the test is meant to test the deprecated version (#3439) 2020-09-27 09:17:59 -04:00
Paolo Tosco
f6e5d7a823 fixes #3430 (#3436) 2020-09-26 05:24:44 +02:00
Paolo Tosco
527a7adf99 Some work on TautomerEnumerator (#3327)
* - Added a TautomerEnumerator constructor which allows passing CleanupParameters
- Added three configurable parameters to CleanupParameters
- Added a callback to TautomerEnumerator
- Fixed a bug where the same tautomer could be mapped by both isomeric and non-isomeric SMILES
- TautomerEnumerator::enumerate() now returns a TautomerEnumeratorResult and does not take
  dynamic_bitset pointers as optional parameters
- Added a missing transform from the Sitzmann paper
- General code cleanup and optimization

* - TautomerEnumeratorResult is now iterable in both C++ and Python
- further optimizations
- implemented a TautomerEnumerator.PickCanonical() Python wrapper
- added C++ and Python accessors to SMILES and SmilesTautomerMap

* - make sure the number of tautomers reported by rdLogger is correct and definitive

* make sure that if N maxTautomers are requested, N tautomers are returned if the theoretical number of tautomers is M>N

* avoid that sulfonic acids hit the formamidinesulfinic acid tautomerisation rule

* offer an option to allow the old API to still be used.

* Changes in response to review and following discussion with Gareth and Greg

* - made TautomerEnumeratorResult an enum class (was a plain C enum)
- made TautomerEnumeratorResult::const_iterator a bidirectional_iterator
- added tests to fully probe the TautomerEnumeratorResult::const_iterator functionality

* - change the difference_type definition
- added tests for the above

* - cosmetic change to improve code readability

Co-authored-by: greg landrum <greg.landrum@gmail.com>
2020-09-24 17:00:03 -04:00
Greg Landrum
edd922c99c Cleanup warnings from clang-10 (#3238)
* stop returning local memory in exceptions

* remove a couple unnecessary copies in loops

* fix a bug in the way the default MMFF aromatic parameters are constructed

* remove a bunch of loop-variable warnings

* remove a bunch of clang warnings

* disable clang warnings in python wrappers

* remove some warnings when building the python wrappers
2020-06-19 17:16:22 -04:00
Greg Landrum
c7e7d88940 Fixes #2990 (#3016)
* add the test

* Fixes #2990

Needs more testing!

* additional testing
2020-03-22 04:58:01 +01:00
Greg Landrum
ace523b1b5 allow retrieval of the atoms/bonds modified by the tautomerization (#3013)
* add modifiedBonds and modifiedAtoms to C++ API

* add modifiedAtoms/modifiedBonds interface to Python too

* update documentation
2020-03-17 14:13:54 +01:00
Greg Landrum
6f9ba35826 Tune the tautomer scoring (#2959)
* tautomer scoring tweaks
doc updates
expose tautomer score to Python

* fix leaks in tests
2020-02-19 07:38:04 -05:00
Greg Landrum
915471a079 Fix a problem with aromatic heteroatom tautomer enumeration (#2952)
* update transforms to enforce neutral nitrogens when doing the aromatic neteroatom transforms

* formatting, add a test

* remove unused #include
2020-02-13 06:35:01 +01:00
Greg Landrum
f8a4020789 Add MolVS tautomer canonicalization (#2886)
* first pass at implementing molvs-style tautomer scoring
This isn't optimal in terms of performance, but all the MolVS tests pass.

* clang format

* A bit of refactoring of the tautomer stuff

* first pass at python wrappers

* allow specifying the tautomer scoring function from C++

* EFF: use boost::flyweight so SMARTS is only parsed once

* improve the python API

* switch to boost::function instead of using function pointers

* allow user-provided tautomer scoring functions

* documentation and scorer version

* change in response to review
2020-01-17 15:25:17 -05:00
Ric
a6b26253ff Fix (most of) mem problems (#2123)
* do not use new on loggers

* del pointers in testDistGeom

* Update Dict hasNonPOD status on bulk update

* delete new Dicts in memtest1.cpp

* fixes in MolSuppliers and testFMCS

* PeriodicTable singleton as unique_ptr

* fix EEM_arrays leak

* fix leaks in testPBF

* fix ParamCollection leak in test UFF

* fix leaks in MMFF

* clear prop dict before read in in pickler

* fix leaks in testFreeSASA

* fix leaks in test3D

* modernize Dict.h & SmilesParse.cpp

* fix leaks in testQuery

* fix leaks in testCrystalFF

* fix leaks in cxsmilesTest

* fix leaks in Catalog & mol cat test

* fix leaks in ShapeUtils & tests

* fix leaks in testSubgraphs1

* fix leaks testFingerprintGenerators

* fix leaks in Catalog/FilterCatalog

* fix leaks in graphmolqueryTest

* these changes reduce bison parse leaks

* fixed leaks in testChirality.cpp

* fix leaks + 2 tests in testMolWriter

* fix 4m leaks in substructLibraryTest

* small improvements to molTautomerTest; still leaks

* fix leaks in testRGroupDecomp

* fix leaks in test; parser still leaks

* fix leaks in itertest

* fix 4m leaks in testDepictor

* fixes in smatest; still leaking due to parser

* fixes in testSLNParse; still leaking due to parser

* flex/bison: always add atoms with ownership; smarts error cleanup

* fix leaks in testReaction

* fix leaks in testSubstructMatch

* fix leaks in resMolSupplierTest

* fix leaks in testChemTransforms + bug in ChemTransforms

* fix leaks in testPickler

* fix leaks in testMolTransform

* fix leaks in testFragCatalog

* fix leak in testSLNParse. Still leaks due to Smiles

* fixed most leaks in testMolSupplier

* pre bison fix

* fix some atom & bond parse problems; others still fail

* bison smiles & smarts, atoms & bonds more or less fixed

* fix leaks in molopstest.cpp

* fix leaks in testFingerprints, MACCS.cpp & AtomPairs.cpp

* fix leaks in moldraw2Dtest1

* fix leaks in testDescriptors

* fix leaks in testInchi

* fix leaks in testUFFForceFieldHelpers

* fix leaks in hanoiTest & new_canon.h

* fix leaks in testMMFFForceField

* fix leaks in graphmolTest1

* fix leaks in testMMFFForceFieldHelpers

* fix leaks in testDistGeomHelpers

* fix leaks in testMolAlign

* initialize occupancy & temp facto with default values

* fix leak in TautomerTransform

* updated suppressions

* fix testStructChecker

* fix logging & py tests

* fix TautomerTransform class/struct issue

* remove misplaced delete in testSLNParse

* deinit in testAvalonLib1

* fix Avalon-triggered(?) bug in StructChecker/Pattern.cpp

* fix random testMolWriter/Supplier fails

- diversify output file names to avoid clashing.
- unify Writers close/destruct behavior.
- flushing/closing in tests.

* use reset in FFs Params.cpp

* comments on testMMFFForceField

* unrequired 'if's added to mol suppliers

* correct cast in FilterCatalog.h

* use unique_ptr in MACCS Patterns

* remove unrequred if in new_canon

* update & move suppressions
2018-10-29 14:33:26 +00:00
Ric
91008ff11d Address compile warnings & trivial improvements (#2097)
* Address compile warnings & trivial improvements

* revert unwanted initializers; use RDUNUSED_PARAM for unused params

* revert fix in testRDFcustom; marked with 'TO DO' comment
2018-10-12 06:39:32 -04:00
Susan Leung
956fdf268c Dev/GSOC2018_MolVS_Integration (#2002)
* short test file for MolVS standardize_sm

* short test file for MolVS fragment

* short test file for MolVS metals

* short test file for MolVS normalize

* short test file for MolVS reionize

* short test file for MolVS tautomer

* short test file for MolVS validate

* long test file for MolVS standardize smiles

* long test file for MolVS fragment

* long test file for MolVS metals

* long test file for MolVS normalize

* long test file for MolVS reionize

* long test file for MolVS tautomer

* long test file for MolVS validate

* Unit tests for MolVS steps

* dropping support for Python2

* molvs/__init__.py

* molvs/charge.py

* molvs/errors.py

* molvs/fragment.py

* molvs/metal.py

* molvs/normalize.py

* molvs/resonance.py

* molvs/standardize.py

* molvs/tautomer.py

* molvs/utils.py

* molvs/validate.py

* molvs/validations.py

* molvs/cli.py

* adapted and renamed molvs/cli.py to work within $RDBASE/Contrib/MolVS/

* setup MolStandardize directories, source with empty cleanup function, header, CMake files

* corrections to empty source, header and test1.cpp

* adding empty functions and initializers to MolStandardize

* empty Metal source, header and added test

* added most of Metal.cpp functionality and made some more tests

* empty functions and initializers to Normalize

* empty functions and initializers to Validate

* added most code for RDKitDefault mode, along with some tests

* restructure for abstract base class ValidateMethod

* written in isNoneValidation for MolVSValidation

* took out isNoneValidation, put in noAtomValidation, neutralValidation, isotopeValidation for MolVSValidation

* added in AllowedAtoms

* added in disallowedAtoms

* corrections to Validate

* added code for FragmentRemover

* extended fragment functionality to include choose largest fragment, added in tests for fragment catalog, fragment remover. Also added fragmentValidation method in MolStandardize

* added another test to testValidate test_fragment

* corrections to fragment

* corrections to Metal

* added code for Normalize

* added normalize member function to MolStandardize and added tests

* added multi fragment functionality to Normalize.cpp and additional tests

* TransformCatalog

* tests for Normalize.cpp

* first bit of cleanup

* added most of Charge functionality and some tests

* some corrections to Charge.cpp and some more tests to testCharge.cpp

* corrections to Charge.cpp

* start of Tautomer Enumerate with some tests

* added BondType option to Tautomer Enumeration

* correcting for some memory leakage

* a few alterations to formatting

* sorting out some memory leaks

* sorting out some memory leaks

* some corrections for PCS test set

* redo tests with updated RDKit

* fixing memory leak

* more fixes after 100kPCS set testing

* using tab as delimiter in CSVs rather than comma

* tutorial for MolStandardize

* still working on Tautomer enumeration

* deleted some empty tests

* starting writing tautomer canonicalize

* rename test_data -> data (the source still needs to be updated)

* automatic source reformatting

* adjust to directory rename

* move the fragment catalog test into the MolStandardize directory
do not create separate library for FragmentCatalog

* stop building separate libraries for the catalogs

* move the CleanupParameters into the MolStandardize namespace

* first pass at python wrapper

* move the py module to the correct dir;
add some python tests;
add standardizeSmiles to python wrapper

* disabling the compareMolVSTest since that requires command line arguments to run

* get this building on windows

* put the python lib in the right place

* further work on python wrapper for rdMolStandardize

* added get and set functions to Metal and wrapped them

* added get and set functions to Metal and wrapped them

* changed construstor of Reionizer class and input args for reionize, wrapped this default

* overload Reionizer constructor so user can input own AcidBaseFile from python

* added Uncharger class to Charge and added test for Uncharger

* wrapped Fragment, fixed some memory leakage, changed some args and return types, added some tests

* wrapped Normalized and changed how Normalizer class is initiated

* changing MolVSValidation structure so user can choose which MolVS submethod they want

* starting to write Wrap for Validate

* now it compiles with Wrap/Validate.cpp

* a couple refactorings around validate

* move the validate code into the rdMolStandardize module

* make sure a valid pointer is returned for standardizeSmiles

* rdMolStandardize.MolVSValidation done and tests added

* half way through AllowedAtomsValidation

* finished AllowedAtomsValidation and DisallowedAtomsValidation

* moved charge, fragment, metal, normalize into the rdMolStandardize module

* changed tutorial to use wrapped code

* added copyrights

* added copyrights

* move the data files

* modify source files to adjust to the move

* added validateSmiles functionality

* removed std::cout

* redid some of the 100k PCS tests

* working on the tutorial

* adding some documentation

* deleting some comment lines

* some changes after pull review

* More changes after pull review

* start of trying to make java wrap

* remove some warnings, add some questions

* additional warning removals, a bit more reporting

* some test cleanups

* enable testing of the java code
2018-09-28 11:24:25 +02:00