* Speed up tautomer canonicalization by deferring SSSR calculation
* Lazy kekulization for tautomer enumeration
Defer kekulization of tautomers until they are actually needed for
transform matching. This avoids creating kekulized copies for:
1. The initial tautomer (until first iteration)
2. New tautomers that may never be processed (if enumeration ends early)
The Tautomer class now supports lazy initialization of the kekulized
form via getKekulized() method.
Performance improvement: ~7% additional speedup (total ~22-24% from baseline)
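A minimal sketch of the lazy-initialization idea described above, written in plain Python rather than against the RDKit C++ API. The class name `LazyTautomer`, the `get_kekulized` method, and the `fake_kekulize` stand-in are illustrative assumptions, not the actual implementation; molecules are represented as plain strings.

```python
from typing import Callable, Optional


class LazyTautomer:
    """Holds a molecule and builds its kekulized copy only on first request."""

    def __init__(self, mol: str, kekulize: Callable[[str], str]):
        self.mol = mol
        self._kekulize = kekulize
        self._kekulized: Optional[str] = None  # not computed yet

    def get_kekulized(self) -> str:
        # Compute once, then reuse the cached copy on later calls.
        if self._kekulized is None:
            self._kekulized = self._kekulize(self.mol)
        return self._kekulized


calls = []


def fake_kekulize(mol: str) -> str:
    # Placeholder for the expensive step; it only tags the string and
    # records that it was invoked.
    calls.append(mol)
    return "kekulized:" + mol


taut = LazyTautomer("c1ccccc1", fake_kekulize)
unused = LazyTautomer("c1ccncc1", fake_kekulize)  # never processed, never kekulized
first = taut.get_kekulized()
second = taut.get_kekulized()  # served from the cache
```

In this sketch the tautomer that is never processed costs nothing beyond construction, which is the saving the commit describes for enumerations that end early.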
* Use count-only substructure matching in tautomer scoring
* Add SubstructMatchCount regression test
* MolStandardize: reduce enumerate overhead
* MolStandardize: avoid per-tautomer ring recomputation
* Atom: cache PeriodicTable pointer in valence calcs
* Atom: reuse PeriodicTable in getEffectiveAtomicNum
* PeriodicTable: add atomic fast path for getTable
* GraphMol: reduce ROMol copy reallocations
* MolStandardize: use quickCopy for per-match product copies
Use RWMol(*kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry.
* MolStandardize: pre-filter scoring patterns by element/connectivity
For tautomer scoring, pre-compute which SubstructTerms are relevant for
a given input molecule. Since tautomerization only moves H atoms and
changes bond orders (never creates/destroys heavy-atom bonds), patterns
requiring missing elements or connectivity can be skipped for all
tautomers of that molecule.
Two-stage filtering:
1. Element check: skip patterns requiring atoms not in the molecule
2. Connectivity check: skip patterns whose bond-order-agnostic structure
doesn't match the input molecule's connectivity
This reduces the number of VF2 substructure calls per tautomer from 12
to typically 3-5, depending on the molecule's composition.
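The first filtering stage can be sketched as a set comparison, assuming each pattern is annotated with the elements it requires. The `PATTERNS` table and the `relevant_patterns` helper below are hypothetical and written in plain Python; the connectivity stage is omitted because it needs a real substructure matcher.

```python
def relevant_patterns(mol_elements, pattern_requirements):
    """Return the names of patterns whose required elements all occur in the molecule."""
    present = set(mol_elements)
    return [
        name
        for name, required in pattern_requirements.items()
        if required <= present  # subset test: every required element is present
    ]


# Hypothetical requirement table, not the real scoring term list.
PATTERNS = {
    "C=O": {"C", "O"},
    "N=O": {"N", "O"},
    "P=O": {"P", "O"},
    "methyl": {"C"},
}

# Heavy atoms of acetamide, CC(=O)N: no phosphorus, so P=O is skipped.
acetamide_elements = ["C", "C", "O", "N"]
kept = relevant_patterns(acetamide_elements, PATTERNS)
```

Because tautomerization leaves the heavy-atom composition unchanged, a filter like this only has to run once per input molecule rather than once per tautomer.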
* MolStandardize: preserve molecule properties for canonical tautomer
Copy molecule properties from the original input to the canonical tautomer
result. Since quickCopy during enumeration skips d_props to avoid overhead,
extended SMILES data like link nodes (LN) was lost. This restores them
on the final result.
* TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers
TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses
quickCopy for performance. This doesn't copy molecule properties like
_molLinkNodes. Without this fix, XQMol output would lose link node
extensions in the SMILES.
Copy properties from the original query molecule to all enumerated
tautomers before constructing the TautomerQuery. This preserves extended
SMILES data without impacting enumeration performance.
* MolStandardize: use parallel iteration and cache bond lookups
Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration
over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond
lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches.
* perf: add specialized matchers for simple tautomer scoring patterns
Replace VF2 graph matching with O(n) loops for 6 simple patterns:
- countDoubleOrAromaticBonds: C=O, N=O, P=O patterns
- countMethyls: [CX4H3] methyl groups
- countCarbonDoubleHetero: aliphatic C=heteroatom double bonds
- countAromaticCarbonExocyclicN: [c]=aromatic C=exocyclic N
Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2.
Combined with the pre-filtering optimization, this achieves ~3.7x speedup
(~2500ms vs ~9300ms original) for tautomer canonicalization.
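To illustrate why a single pass over the bonds can replace a generic graph match for such simple patterns, here is a rough Python sketch. The tuple-based bond representation and the exact matching criteria are assumptions for illustration; the real matchers operate on RDKit atoms and bonds and may apply additional conditions.

```python
def count_double_or_aromatic_bonds(symbols, bonds, elem_a, elem_b):
    """Count double or aromatic bonds joining elem_a and elem_b in one O(n) pass.

    symbols: element symbol per atom index
    bonds:   (begin_idx, end_idx, order) tuples, order being 1, 2, 3 or "ar"
    """
    count = 0
    for begin, end, order in bonds:
        if order not in (2, "ar"):
            continue
        pair = (symbols[begin], symbols[end])
        # Accept either orientation of the bond.
        if pair == (elem_a, elem_b) or pair == (elem_b, elem_a):
            count += 1
    return count


# Heavy atoms of acetic acid: CH3-C(=O)-OH
symbols = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
n_carbonyl = count_double_or_aromatic_bonds(symbols, bonds, "C", "O")
```

The loop touches each bond once and needs no backtracking, whereas a general substructure search has to set up and explore candidate mappings even for a two-atom pattern.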
* Fix tautomer canonicalize dropping conformers from quickCopy
quickCopy (RWMol(*mol, true)) skips conformers, so tautomer
enumeration products lose 2D/3D coordinates. This causes InChI
generation to omit the /b (double bond E/Z stereo) layer, since
E/Z is derived from atomic coordinates.
Fix: copy conformers from the original molecule onto the canonical
tautomer after pickCanonical in TautomerEnumerator::canonicalize().
Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based
conformer preservation check in catch_tests.cpp.
* add test on canonicalize losing stereo
* add regression test for exocyclic C=C tautomer canonicalization
The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely
deduplicate distinct tautomers when their atom-index-ordered state
patterns happen to match, leading canonicalize() to pick the wrong
canonical form for molecules with STEREOTRANS-pinned exocyclic C=C
bonds after RemoveHs.
Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the
exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form
O=C1C=C(C=C2CC=COC2)C(=O)N1.
Currently expected to FAIL until the state key dedup bug is fixed.
* MolStandardize: expand tautomer connectivity SMARTS
* MolStandardize: scope tautomer pattern enum
* MolStandardize: trim tautomer pattern enum
* MolStandardize: use symmetric ring scoring
* parse templates as smarts
* accept ring templates in SMARTS format
* undo CLAUDE mistake
* rename files
* enable templating for macrocycles
* enable macrocycle templating
* Add test for macrocycle templating
Tests that ring system templates are used only for macrocycles (rings
with size > 8). The test verifies the exact threshold by generating
coordinates with and without templates for rings of size 4-14.
Addresses review feedback on PR #9203.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* Fixes #9143
There is still some weirdness in the matching of ET bonds in macrocycles, but it is not connected to this change.
* adjust test to work on win (and mac?)
* tweak expected results for win ci
* Remove Dict::getData() for a strict abstraction boundary
Replace direct access to Dict's internal std::vector<Pair> with
encapsulated methods: size(), empty(), const iteration via
begin()/end(), appendPair(), markNonPOD(), and getRawVal().
This enables future changes to Dict's internal representation
without breaking callers.
Ref: rdkit/rdkit#9112
* Harden Dict::appendPair to take a populated Pair by move
appendPair(Pair&&) now auto-detects non-POD status via
RDValue::needsCleanup(), eliminating markNonPOD() and the
risk of dangling references or uninitialized entries.
needsCleanup() is placed next to destroy() on RDValue to
keep the POD/non-POD distinction in one place.
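The encapsulation idea can be sketched in Python, with the caveat that Python has no POD concept; `needs_cleanup` here just classifies values by type. The `Value` and `PropDict` classes are hypothetical stand-ins for `RDValue` and `Dict`, not their real interfaces.

```python
class Value:
    """Stand-in for a tagged value that may or may not own resources."""

    def __init__(self, payload):
        self.payload = payload

    def needs_cleanup(self) -> bool:
        # Assumption for this sketch: plain scalars count as "POD",
        # anything else is treated as owning resources.
        return not isinstance(self.payload, (int, float, bool))


class PropDict:
    """Container that hides its storage and tracks non-POD status itself."""

    def __init__(self):
        self._pairs = []
        self._has_non_pod = False

    def append_pair(self, key, value: Value) -> None:
        # The flag is derived from the value, so callers cannot forget to set it.
        self._has_non_pod = self._has_non_pod or value.needs_cleanup()
        self._pairs.append((key, value))

    def __len__(self):
        return len(self._pairs)

    def __iter__(self):
        # Iterate over a snapshot so callers cannot mutate the internal list.
        return iter(tuple(self._pairs))

    @property
    def has_non_pod(self) -> bool:
        return self._has_non_pod


props = PropDict()
props.append_pair("charge", Value(1))
pod_only = props.has_non_pod
props.append_pair("name", Value("ethanol"))
```

Deriving the flag inside `append_pair` is what removes the need for a separate `markNonPOD()`-style call in this sketch.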
* Remove vestigial dictHasNonPOD param from streamReadProp
Both callers ignored the output. Non-POD detection is now handled
by Dict::appendPair via RDValue::needsCleanup().
* unbork java build
* Address PR review: bulk append, rename getRawVal, add custom data test
- Add Dict::append(vector<Pair>&&) for bulk insertion with reserve
- Use bulk append in streamReadProps to restore pre-allocation
- Rename getRawVal -> getRDValue per reviewer preference
- Add test verifying custom AnyTag data is destroyed through Dict lifecycle
* heed self-review
* don't manually implement vec.insert
* Add test: ExplicitBitVect round-trip through Dict serialization
Exercises the full streamWriteProps/streamReadProps path with an
ExplicitBitVect in an RDProps Dict, confirming the custom handler
is invoked and no memory is leaked (verified under valgrind).
* in anyTag test, assert destructors ran a specific number of times.
---------
Co-authored-by: bddap (Coding Agent) <andrew+bot@dirksen.com>
* implement consistency check
* add more consistency checks
* check direction consistency across double bond
* clean up directions for non-stereo bonds
* fix counts for second from atom dirs; add check
* handle inconsistent bond dirs
* add more tests, pubchem cases, and update existing
* drop statics
* fix typo
* make sourceBond arg const
* fix consistency check
* switch the Query infrastructure to use std::function
* add releasenotes mention
* refactor makeAtomInRingOfSizeQuery() to use lambdas and support range queries
* add 'k' atom query to SMARTS
* changes in response to review
V3000 parsing sets aromatic flags on bonds but not atoms. When removeHs
strips an explicit H from nitrogen in an aromatic ring, molRemoveH
checked heavyAtom->getIsAromatic() to decide whether to increment
numExplicitHs, but that flag was always false for V3000-parsed atoms.
Without the explicit H count, the kekulizer cannot distinguish pyrrole N
from pyridine N, causing "Can't kekulize mol" errors on valid
ChemDraw-exported molblocks.
Fix: use isAromaticAtom(), which checks both atom and bond aromatic
flags.
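The difference between the two checks can be shown with a small Python sketch. The list-and-tuple molecule representation and the function name `is_aromatic_atom` are assumptions for illustration; only the rule itself (atom flag or any aromatic bond) comes from the description above.

```python
def is_aromatic_atom(atom_idx, atom_aromatic, bonds):
    """True if the atom's own flag is set or any bond touching it is aromatic.

    atom_aromatic: per-atom aromatic flags
    bonds:         (begin_idx, end_idx, is_aromatic) tuples
    """
    if atom_aromatic[atom_idx]:
        return True
    return any(
        aromatic and atom_idx in (begin, end) for begin, end, aromatic in bonds
    )


# Pyrrole-like ring as a V3000 parse would leave it: aromatic bonds,
# no aromatic atom flags. Atom 0 is N, atom 5 is its explicit H.
atom_flags = [False] * 6
bonds = [
    (0, 1, True), (1, 2, True), (2, 3, True), (3, 4, True), (4, 0, True),
    (0, 5, False),  # N-H single bond
]

flag_only = atom_flags[0]  # what the old check effectively saw
n_is_aromatic = is_aromatic_atom(0, atom_flags, bonds)
h_is_aromatic = is_aromatic_atom(5, atom_flags, bonds)
```

Here the flag-only test reports the nitrogen as non-aromatic, while the combined test recognizes it through its ring bonds.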
* switch the Query infrastructure to use std::function
* add releasenotes mention
* response to review
Removed commented-out function pointer declarations for match and data functions.
* Add RDLog::CaptureLog for capturing log messages
Adds an RAII `CaptureLog` class to `namespace RDLog` (alongside the
existing `LogStateSetter`) that redirects an RDKit logger's output to an
internal `std::stringstream` for the duration of its lifetime. On
destruction the original stream destination and enabled state are fully
restored. Nesting is supported: an inner capture shadows the outer one
and each collects its own messages independently.
The default constructor captures `rdErrorLog`; an explicit constructor
accepts any `RDLogger`. Both enable the logger if it was previously
disabled and restore that state on destruction.
Python bindings expose `rdBase.CaptureLog` as a context manager with a
`messages` read-only property, mirroring the existing `rdBase.BlockLogs`
pattern. Messages remain accessible after the `with` block exits.
C++ tests are added to `catch_logs.cpp` (6 Catch2 sections covering
basic capture, empty state, enable/restore, stream restore, explicit
logger, and nested captures). Python tests are added to
`UnitTestLogging.py` (6 unittest cases covering the same scenarios).
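The capture-and-restore behaviour can be approximated with Python's standard `logging` module. This is only an analogy: `CaptureLogSketch` is a hypothetical class built on `logging.Logger`, not the `rdBase.CaptureLog` binding, and it swaps handlers rather than a C++ stream destination.

```python
import io
import logging


class CaptureLogSketch:
    """Redirect a logger into an in-memory buffer for the duration of a with block."""

    def __init__(self, logger: logging.Logger):
        self._logger = logger
        self._buffer = io.StringIO()
        self._handler = logging.StreamHandler(self._buffer)

    def __enter__(self):
        # Remember the previous destination and enabled state.
        self._saved_handlers = self._logger.handlers[:]
        self._saved_disabled = self._logger.disabled
        self._saved_propagate = self._logger.propagate
        self._logger.handlers = [self._handler]
        self._logger.disabled = False  # enable the logger while capturing
        self._logger.propagate = False  # keep messages out of parent loggers
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore everything exactly as it was.
        self._logger.handlers = self._saved_handlers
        self._logger.disabled = self._saved_disabled
        self._logger.propagate = self._saved_propagate
        return False

    @property
    def messages(self) -> str:
        return self._buffer.getvalue()


log = logging.getLogger("demo.errors")
log.setLevel(logging.ERROR)
log.disabled = True  # start disabled to show the enable/restore behaviour

with CaptureLogSketch(log) as outer:
    log.error("outer message")
    with CaptureLogSketch(log) as inner:  # shadows the outer capture
        log.error("inner message")
    log.error("outer again")
```

As in the description, each capture collects only its own messages, the buffers stay readable after the block exits, and the logger ends up disabled again.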
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* CaptureLog: add per-level properties (error_messages, warning_messages, etc.)
The Python CaptureLog wrapper now captures all four log levels
simultaneously. Per-level properties (error_messages, warning_messages,
info_messages, debug_messages) give access to messages from each logger
independently; the existing messages property returns them all combined.
The C++ RDLog::CaptureLog class is unchanged — it remains a clean
single-logger RAII type. The Python wrapper composes four instances of
it, one per log level.
Suggested by bp-kelley in PR review.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor CaptureLog: add named per-level subclasses
Add CaptureErrorLog, CaptureWarningLog, CaptureInfoLog, and CaptureDebugLog
as named convenience subclasses of CaptureLog, each capturing a specific
logger. Update Python bindings to expose the four named classes directly
(dropping the combined multi-capture approach), and update tests accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Simplify CaptureLog: no argument, captures rdErrorLog only
Remove the RDLogger argument overload, the four named subclasses, and the
PyCaptureLog template in favor of a single no-argument CaptureLog that
mirrors the Schrödinger CaptureRDErrorLog from which it was inspired.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* CaptureLog tests: add dp_dest restoration and LogStateSetter interaction
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Rename CaptureLog to CaptureErrorLog
The name CaptureLog was ambiguous; CaptureErrorLog is explicit about which
logger it captures and avoids redundancy within namespace RDLog.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Generalize CaptureErrorLog into CaptureLog with logger parameter
Replace CaptureErrorLog with CaptureLog, which accepts any RDLogger in
its constructor (e.g. rdErrorLog, rdWarningLog). Add CaptureErrorLog as
a convenience subclass that pre-fills rdErrorLog, preserving backward
compatibility for existing callers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix Qt 6.5.8 link failure on macOS due to removed AGL framework
Qt 6.5.8 injects the AGL (Apple Graphics Library) framework into the
WrapOpenGL::WrapOpenGL imported target's INTERFACE_LINK_LIBRARIES. AGL
was removed from macOS 14+ SDKs, causing a link error when building
MolDraw2DQt on modern macOS:
ld: framework not found AGL
This workaround filters AGL out of WrapOpenGL::WrapOpenGL's link
libraries after find_package(Qt6) populates them. The guard conditions
(APPLE and TARGET WrapOpenGL::WrapOpenGL) make it a no-op on other
platforms and Qt versions that do not create that target.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Broaden AGL workaround comment to not pin Qt 6.5.8 specifically
The issue may affect other Qt versions, not just 6.5.8.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* add a test
* change stereo bond canonicalization
* update canonicalization watch test with fixed cases
* make canonicalization test stricter (compare CIP codes)
* add reverse symmetry condition
* rewrite double bond canonicalization code
* update tests
* fix multiline comment
* update java tests
* update python test
* nix switchBondDir (unused)
* fix and rename flipBondDir
* refactor comment
* fix shadowed var name, casting
* fix neighbor sorting
* make seen_bonds a vector
* abstract setDirectionFromNeighboringBond
* handle the case where both sides of the bond have directions
* move getNeighboringStereoBond
* check seen_bonds after popping connectedBondsQ
* use references for arguments
* add release note
* add example required by Dan
* add example requested by Dan
* Defer numpy initialization to first use in rdchem, rdmolops, cDataStructs
`from rdkit import Chem` unconditionally bootstrapped numpy (~120ms) via
import_array()/boost::python::numpy::initialize() in module init functions,
even when no numpy-dependent APIs were called. This is costly in cold-start
environments like AWS Lambda.
Move numpy initialization behind lazy guards (static bool + first-call init)
in rdchem.so, rdmolops.so, and cDataStructs.so. Numpy now loads only when
an API that actually needs it is invoked (GetDistanceMatrix, GetPositions,
SetPositions, GetAdjacencyMatrix, ConvertToNumpyArray, etc.).
Also change Conformer::SetPos to accept python::object instead of
np::ndarray to prevent Boost.Python from requiring numpy type conversion
before the lazy guard runs.
Adds test_lazy_numpy.py with subprocess-based tests verifying:
- `from rdkit import Chem` does not load numpy
- SmilesToMol/MolToSmiles work without numpy
- numpy loads on demand when array APIs are called
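The guard pattern can be sketched in Python as a once-only initializer behind a lock. This is an analogy for the C++ guard, not the actual wrapper code: `_ensure_initialized`, `join_tokens`, and `distance_matrix` are hypothetical names, and a counter stands in for the expensive numpy bootstrap.

```python
import threading

_init_lock = threading.Lock()
_initialized = False
init_count = 0  # how many times the costly setup actually ran


def _ensure_initialized() -> None:
    """Run the costly setup once, on first use, even with concurrent callers."""
    global _initialized, init_count
    if _initialized:
        return
    with _init_lock:
        if not _initialized:  # re-check under the lock
            init_count += 1  # stand-in for the expensive bootstrap
            _initialized = True


def join_tokens(tokens):
    # Cheap API: never touches the guarded dependency.
    return "".join(tokens)


def distance_matrix(n):
    # Array-style API: triggers initialization on its first call only.
    _ensure_initialized()
    return [[abs(i - j) for j in range(n)] for i in range(n)]


join_tokens(["C", "C", "O"])
count_before = init_count
distance_matrix(3)
distance_matrix(3)
count_after = init_count
```

Callers that only use the cheap path never pay the initialization cost, and repeated array calls pay it once.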
* skip inchi tests if not available
* switch to threadsafe once_flag, like elsewhere
* finish ifdef style
* switch to magic static style
* Revert "switch to magic static style"
This reverts commit 7300188db7.
* when shifting double bonds in tautomerization, set double bond stereo to STEREOANY
fixes #9102
notably, do this only to non-ring bonds
move tests over to assert this
avoid index-based bond lookup in test assertions
since bond indexing can move in tautomers
* inchi unittest check
* fast rings
* Cannot push_back std::string to boost json array
`boost::json::array.push_back` expects a `value`
* prefer emplace_back to avoid casting
* Ensure every MINIMAL_LIB option is tested in CI
Also remove ChemDraw support from the compilation. This does not change the final size of the output (it is not exported anyway), but it reduces the compilation time by 1 min (-10% on my local machine).
* Expose MMPA with other options
* Fix MMPA compilation: Implicitly binding raw pointers is illegal
Applying same pattern as in `get_frags_helper`
* use queues instead of sets for trimBonds
* skip 2 last atoms: if these were in rings, we'd have already noticed
* refactor duplicate detection in findRingsD2nodes
* make smallestRingsBfs a free function
* move things around
* fix paper reference; fix other comments
* Fixes #9068
* fix a problem with empty labels in s-group parsing
* fix empty column names in smiles suppliers
* add the check to setPODVal()
---------
* Don't silently ignore missing atoms (and replace them with atom #0) in copyMolSubset
* Fail if explicitly set atoms/bonds are not present
* Add tests
* Simplify optimization (copy whole molecule); add test for no bonds