rdkit

mirror of https://github.com/rdkit/rdkit.git synced 2026-06-04 21:54:27 +08:00

Author	SHA1	Message	Date
Yakov Pechersky	c6cabf4153	Speed-up tautomer canonicalization, no API changes (#9134 ) * Speed up tautomer canonicalization by deferring on SSSR calc * Lazy kekulization for tautomer enumeration Defer kekulization of tautomers until they are actually needed for transform matching. This avoids creating kekulized copies for: 1. The initial tautomer (until first iteration) 2. New tautomers that may never be processed (if enumeration ends early) The Tautomer class now supports lazy initialization of the kekulized form via getKekulized() method. Performance improvement: ~7% additional speedup (total ~22-24% from baseline) * Use count-only substructure matching in tautomer scoring * Add SubstructMatchCount regression test * MolStandardize: reduce enumerate overhead * MolStandardize: avoid per-tautomer ring recomputation * Atom: cache PeriodicTable pointer in valence calcs * Atom: reuse PeriodicTable in getEffectiveAtomicNum * PeriodicTable: add atomic fast path for getTable * GraphMol: reduce ROMol copy reallocations * MolStandardize: use quickCopy for per-match product copies Use RWMol(kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry. MolStandardize: pre-filter scoring patterns by element/connectivity For tautomer scoring, pre-compute which SubstructTerms are relevant for a given input molecule. Since tautomerization only moves H atoms and changes bond orders (never creates/destroys heavy-atom bonds), patterns requiring missing elements or connectivity can be skipped for all tautomers of that molecule. Two-stage filtering: 1. Element check: skip patterns requiring atoms not in the molecule 2. Connectivity check: skip patterns whose bond-order-agnostic structure doesn't match the input molecule's connectivity This reduces the number of VF2 substructure calls per tautomer from 12 to typically 3-5, depending on the molecule's composition. 
* MolStandardize: preserve molecule properties for canonical tautomer Copy molecule properties from the original input to the canonical tautomer result. Since quickCopy during enumeration skips d_props to avoid overhead, extended SMILES data like link nodes (LN) was lost. This restores them on the final result. * TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses quickCopy for performance. This doesn't copy molecule properties like _molLinkNodes. Without this fix, XQMol output would lose link node extensions in the SMILES. Copy properties from the original query molecule to all enumerated tautomers before constructing the TautomerQuery. This preserves extended SMILES data without impacting enumeration performance. * MolStandardize: use parallel iteration and cache bond lookups Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches. * perf: add specialized matchers for simple tautomer scoring patterns Replace VF2 graph matching with O(n) loops for 6 simple patterns: - countDoubleOrAromaticBonds: C=O, N=O, P=O patterns - countMethyls: [CX4H3] methyl groups - countCarbonDoubleHetero: [C]=[/home/dcvuser/rdkit;Code/GraphMol/MolStandardize/Tautomer.h] aliphatic C=hetero - countAromaticCarbonExocyclicN: [c]=aromatic C=exocyclic N Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2. Combined with the pre-filtering optimization, this achieves ~3.7x speedup (~2500ms vs ~9300ms original) for tautomer canonicalization. * Fix tautomer canonicalize dropping conformers from quickCopy quickCopy (RWMol(mol, true)) skips conformers, so tautomer enumeration products lose 2D/3D coordinates. This causes InChI generation to omit the /b (double bond E/Z stereo) layer, since E/Z is derived from atomic coordinates. 
Fix: copy conformers from the original molecule onto the canonical tautomer after pickCanonical in TautomerEnumerator::canonicalize(). Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based conformer preservation check in catch_tests.cpp. add test on canonicalize losing stereo * add regression test for exocyclic C=C tautomer canonicalization The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely deduplicate distinct tautomers when their atom-index-ordered state patterns happen to match, leading canonicalize() to pick the wrong canonical form for molecules with STEREOTRANS-pinned exocyclic C=C bonds after RemoveHs. Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form O=C1C=C(C=C2CC=COC2)C(=O)N1. Currently expected to FAIL until the state key dedup bug is fixed. * MolStandardize: expand tautomer connectivity SMARTS * MolStandardize: scope tautomer pattern enum * MolStandardize: trim tautomer pattern enum * MolStandardize: use symmetric ring scoring	2026-03-31 06:42:40 +02:00
Yakov Pechersky	0986d22c58	Deterministic kekulize, independent of atom and bond order (#9125 ) * Make kekulization deterministic * Add tautomer order-independence regression (python) * Adjust tautomer tests for deterministic kekulization * Update graphmol wedged-bond kekulization checks * SmilesParse: update aromatic bond index expectations * SmilesParse: refresh cxsmilesTest expected files * Depictor: update testDepictor expected MolBlocks * Depictor: update depictorCatch expectations * Depictor Wrap: update expected MolBlock for pyDepictor * MarvinParse: update testMrvToMol expected outputs * FileParsers: refresh testAtropisomers expected outputs * FileParsers: update tests for deterministic kekulization * MolDraw2D: refresh brittle bond assertions * RascalMCES: update expected cluster size * MinimalLib: make cffi wedging check order-independent * documentation fix * MinimalLib: update Kekulé bond table in aligned-coords test * Hoist duplicated lambdas to TEST_CASE scope * Remove unused originalWedges variable * Remove redundant bounds check; clarify wedge-end preference * Pre-sort allAtms by wedge-end + rank * Use mol.atomNeighbors() for neighbor iteration * Check inAllAtms before linear-scanning done * Drop redundant optsV/wedgedOptsV sorts * Remove unused Canon.h include * Add canonical parameter to Kekulize; skip ranking during sanitization * Test canonical re-kekulization preserves stereo across atom orderings * MinimalLib: update Kekulé bond orders in invertedWedges * Change Kekulize canonical default to false, expose in Python wrappers * keep rank order, push_back * Revert "RascalMCES: update expected cluster size" This reverts commit `a81bb39495`. * docstring change * expose new flag to python wrapper * document changes in ReleaseNotes.md * revert minimallib test changes again * canonical = true defaults * Revert "revert minimallib test changes again" This reverts commit `039e1d84da`. 
* Reapply "RascalMCES: update expected cluster size" This reverts commit `7b83a7a3e8`. --------- Co-authored-by: greg landrum <greg.landrum@gmail.com>	2026-03-19 08:43:13 +01:00
Yakov Pechersky	67b73acba4	when shifting double bonds in tautomerization, set double bond stereo to STEREOANY (#9119 ) * when shifting double bonds in tautomerization, set double bond stereo to STEREOANY fixes #9102 notably, do this only to non-ring bonds move tests over to assert this avoid index-based bond lookup in test assertions since bond indexing can move in tautomers * inchi unittest check * fast rings	2026-02-19 19:29:17 +01:00
Greg Landrum	86141183c1	Moving towards getting all tests to pass when using the new stereo code (#8409 ) * Fixes #8379 * check in some working tests * test passes * test passes * test passes * test passes * test passes * ensure that the invariants flush the streams on failure * tests pass * test passes * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * tests pass * Fixes #8391 * tests pass * fix a test with legacy not clear why this was not causing problems before * make a test work * Fixes #8396 * gcc builds work * fingerprint tests pass * mention backwards incompatible change * fix a problem with FindMolChiralCenters * more testing details * enable the test status output * Fixes #8432 fix a bug in double-bond stereo handling for template matching * all depictor tests pass * use the new-stereo chiral ranks in the depiction code * always assign new-stereo chiral ranks * make _ChiralAtomRank a computed property This is analogous to _CIPRank * tweak to the way the atom ordering is computed for 2D coordinate generation * update two expected results * backup * response to review * tests pass * tests pass --------- Co-authored-by: = <=>	2025-04-15 14:00:32 +02:00
Yakov Pechersky	ad4ee83aec	Fixes #7044 (#7137 ) * oxime tests * Support allenes in canonicalizing double bonds * alternate solution to the problem * expand comment * reactivate conjugated nitro test * Fix conjugated nitro tests, have a bondstereo test * Empty commit to re-proc tests --------- Co-authored-by: Greg Landrum <greg.landrum@gmail.com>	2024-10-18 05:32:20 +02:00
Greg Landrum	c7c9ad3328	Add in place and multithread support for more of the MolStandardize code (#6970 )	2023-12-12 17:21:18 +01:00
Greg Landrum	522811b8d4	Fixes #5402 (#5542 ) * support transforms with branches * improve output when doing verbose canonicalization * Fixes #5402	2022-09-09 05:06:42 +02:00
Greg Landrum	310999674b	Make the aliphatic imine rule more closely match the definition in the paper (#5270 )	2022-05-12 17:16:28 +02:00
Greg Landrum	1431c48d50	Fixes #4700 : changes default tautomer enumeration rules (#4727 ) * backup * all tautomer defs which involve z or R{} queries in the original paper have been updated passes all tests * add support for the original version of the parameters * fix win dll builds	2021-11-27 05:02:39 +01:00
Paolo Tosco	0b744c8ba4	fixes #4736 (#4737 ) Co-authored-by: Tosco, Paolo <paolo.tosco@novartis.com>	2021-11-26 10:50:11 +01:00
Greg Landrum	f5a54af475	A collection of MolStandardize improvements (#4118 ) * Swap to using a data structure for default normalization parameters * bring the default fragment data into the code too * cleanup * add reionizer parameters via data change fragment parse failures to ValueErrorExceptions * tautomer parameters in the code * got a little over-enthusiastic in that last cleanup * use boost::flyweight to cache normalization and charge data params * a bit more cleanup * support reading params from JSON * fragments from JSON single-call for fragment removal * add a one-liner for the canonical tautomer * quick refactor * Fixes #4115 * complete the parents * docs * move the definitions to a namespace and make them const * see if switching to c++14 fixes the CI compile problems with g++ 5.5 * somewhat uglier way of solving the initializer list problem	2021-05-19 09:11:23 +02:00
Paolo Tosco	6373d78744	Fixes #3755 (#3758 ) * - fixed VERBOSE_ENUMERATION build - code cleanup * - fixes #3755 * changes in response to review	2021-01-27 08:20:25 +01:00
Paolo Tosco	6053f62453	Added support for isotopic Hs to TautomerEnumerator (#3502 ) * - added support for isotopic Hs to TautomerEnumerator * review suggestions Co-authored-by: greg landrum <greg.landrum@gmail.com>	2020-10-19 13:42:51 +02:00
Paolo Tosco	7d0d7df5f0	Fixes a number of issues flagged by clang (#3498 ) * - fixes a number of issues flagged by clang * - removed commented line	2020-10-15 15:03:34 +02:00
Paolo Tosco	e666eb74fd	- silence deprecation warning as the test is meant to test the deprecated version (#3439 )	2020-09-27 09:17:59 -04:00
Paolo Tosco	f6e5d7a823	fixes #3430 (#3436 )	2020-09-26 05:24:44 +02:00
Paolo Tosco	527a7adf99	Some work on TautomerEnumerator (#3327 ) * - Added a TautomerEnumerator constructor which allows passing CleanupParameters - Added three configurable parameters to CleanupParameters - Added a callback to TautomerEnumerator - Fixed a bug where the same tautomer could be mapped by both isomeric and non-isomeric SMILES - TautomerEnumerator::enumerate() now returns a TautomerEnumeratorResult and does not take dynamic_bitset pointers as optional parameters - Added a missing transform from the Sitzmann paper - General code cleanup and optimization * - TautomerEnumeratorResult is now iterable in both C++ and Python - further optimizations - implemented a TautomerEnumerator.PickCanonical() Python wrapper - added C++ and Python accessors to SMILES and SmilesTautomerMap * - make sure the number of tautomers reported by rdLogger is correct and definitive * make sure that if N maxTautomers are requested, N tautomers are returned if the theoretical number of tautomers is M>N * avoid that sulfonic acids hit the formamidinesulfinic acid tautomerisation rule * offer an option to allow the old API to still be used. * Changes in response to review and following discussion with Gareth and Greg * - made TautomerEnumeratorResult an enum class (was a plain C enum) - made TautomerEnumeratorResult::const_iterator a bidirectional_iterator - added tests to fully probe the TautomerEnumeratorResult::const_iterator functionality * - change the difference_type definition - added tests for the above * - cosmetic change to improve code readability Co-authored-by: greg landrum <greg.landrum@gmail.com>	2020-09-24 17:00:03 -04:00
Greg Landrum	edd922c99c	Cleanup warnings from clang-10 (#3238 ) * stop returning local memory in exceptions * remove a couple unnecessary copies in loops * fix a bug in the way the default MMFF aromatic parameters are constructed * remove a bunch of loop-variable warnings * remove a bunch of clang warnings * disable clang warnings in python wrappers * remove some warnings when building the python wrappers	2020-06-19 17:16:22 -04:00
Greg Landrum	c7e7d88940	Fixes #2990 (#3016 ) * add the test * Fixes #2990 Needs more testing! * additional testing	2020-03-22 04:58:01 +01:00
Greg Landrum	ace523b1b5	allow retrieval of the atoms/bonds modified by the tautomerization (#3013 ) * add modifiedBonds and modifiedAtoms to C++ API * add modifiedAtoms/modifiedBonds interface to Python too * update documentation	2020-03-17 14:13:54 +01:00
Greg Landrum	6f9ba35826	Tune the tautomer scoring (#2959 ) * tautomer scoring tweaks doc updates expose tautomer score to Python * fix leaks in tests	2020-02-19 07:38:04 -05:00
Greg Landrum	915471a079	Fix a problem with aromatic heteroatom tautomer enumeration (#2952 ) * update transforms to enforce neutral nitrogens when doing the aromatic heteroatom transforms * formatting, add a test * remove unused #include	2020-02-13 06:35:01 +01:00
Greg Landrum	f8a4020789	Add MolVS tautomer canonicalization (#2886 ) * first pass at implementing molvs-style tautomer scoring This isn't optimal in terms of performance, but all the MolVS tests pass. * clang format * A bit of refactoring of the tautomer stuff * first pass at python wrappers * allow specifying the tautomer scoring function from C++ * EFF: use boost::flyweight so SMARTS is only parsed once * improve the python API * switch to boost::function instead of using function pointers * allow user-provided tautomer scoring functions * documentation and scorer version * change in response to review	2020-01-17 15:25:17 -05:00
Ric	a6b26253ff	Fix (most of) mem problems (#2123 ) * do not use new on loggers * del pointers in testDistGeom * Update Dict hasNonPOD status on bulk update * delete new Dicts in memtest1.cpp * fixes in MolSuppliers and testFMCS * PeriodicTable singleton as unique_ptr * fix EEM_arrays leak * fix leaks in testPBF * fix ParamCollection leak in test UFF * fix leaks in MMFF * clear prop dict before read in in pickler * fix leaks in testFreeSASA * fix leaks in test3D * modernize Dict.h & SmilesParse.cpp * fix leaks in testQuery * fix leaks in testCrystalFF * fix leaks in cxsmilesTest * fix leaks in Catalog & mol cat test * fix leaks in ShapeUtils & tests * fix leaks in testSubgraphs1 * fix leaks testFingerprintGenerators * fix leaks in Catalog/FilterCatalog * fix leaks in graphmolqueryTest * these changes reduce bison parse leaks * fixed leaks in testChirality.cpp * fix leaks + 2 tests in testMolWriter * fix 4m leaks in substructLibraryTest * small improvements to molTautomerTest; still leaks * fix leaks in testRGroupDecomp * fix leaks in test; parser still leaks * fix leaks in itertest * fix 4m leaks in testDepictor * fixes in smatest; still leaking due to parser * fixes in testSLNParse; still leaking due to parser * flex/bison: always add atoms with ownership; smarts error cleanup * fix leaks in testReaction * fix leaks in testSubstructMatch * fix leaks in resMolSupplierTest * fix leaks in testChemTransforms + bug in ChemTransforms * fix leaks in testPickler * fix leaks in testMolTransform * fix leaks in testFragCatalog * fix leak in testSLNParse. 
Still leaks due to Smiles * fixed most leaks in testMolSupplier * pre bison fix * fix some atom & bond parse problems; others still fail * bison smiles & smarts, atoms & bonds more or less fixed * fix leaks in molopstest.cpp * fix leaks in testFingerprints, MACCS.cpp & AtomPairs.cpp * fix leaks in moldraw2Dtest1 * fix leaks in testDescriptors * fix leaks in testInchi * fix leaks in testUFFForceFieldHelpers * fix leaks in hanoiTest & new_canon.h * fix leaks in testMMFFForceField * fix leaks in graphmolTest1 * fix leaks in testMMFFForceFieldHelpers * fix leaks in testDistGeomHelpers * fix leaks in testMolAlign * initialize occupancy & temp factor with default values * fix leak in TautomerTransform * updated suppressions * fix testStructChecker * fix logging & py tests * fix TautomerTransform class/struct issue * remove misplaced delete in testSLNParse * deinit in testAvalonLib1 * fix Avalon-triggered(?) bug in StructChecker/Pattern.cpp * fix random testMolWriter/Supplier fails - diversify output file names to avoid clashing. - unify Writers close/destruct behavior. - flushing/closing in tests. * use reset in FFs Params.cpp * comments on testMMFFForceField * unrequired 'if's added to mol suppliers * correct cast in FilterCatalog.h * use unique_ptr in MACCS Patterns * remove unrequired if in new_canon * update & move suppressions	2018-10-29 14:33:26 +00:00
Ric	91008ff11d	Address compile warnings & trivial improvements (#2097 ) * Address compile warnings & trivial improvements * revert unwanted initializers; use RDUNUSED_PARAM for unused params * revert fix in testRDFcustom; marked with 'TO DO' comment	2018-10-12 06:39:32 -04:00
Susan Leung	956fdf268c	Dev/GSOC2018_MolVS_Integration (#2002 ) * short test file for MolVS standardize_sm * short test file for MolVS fragment * short test file for MolVS metals * short test file for MolVS normalize * short test file for MolVS reionize * short test file for MolVS tautomer * short test file for MolVS validate * long test file for MolVS standardize smiles * long test file for MolVS fragment * long test file for MolVS metals * long test file for MolVS normalize * long test file for MolVS reionize * long test file for MolVS tautomer * long test file for MolVS validate * Unit tests for MolVS steps * dropping support for Python2 * molvs/__init__.py * molvs/charge.py * molvs/errors.py * molvs/fragment.py * molvs/metal.py * molvs/normalize.py * molvs/resonance.py * molvs/standardize.py * molvs/tautomer.py * molvs/utils.py * molvs/validate.py * molvs/validations.py * molvs/cli.py * adapted and renamed molvs/cli.py to work within $RDBASE/Contrib/MolVS/ * setup MolStandardize directories, source with empty cleanup function, header, CMake files * corrections to empty source, header and test1.cpp * adding empty functions and initializers to MolStandardize * empty Metal source, header and added test * added most of Metal.cpp functionality and made some more tests * empty functions and initializers to Normalize * empty functions and initializers to Validate * added most code for RDKitDefault mode, along with some tests * restructure for abstract base class ValidateMethod * written in isNoneValidation for MolVSValidation * took out isNoneValidation, put in noAtomValidation, neutralValidation, isotopeValidation for MolVSValidation * added in AllowedAtoms * added in disallowedAtoms * corrections to Validate * added code for FragmentRemover * extended fragment functionality to include choose largest fragment, added in tests for fragment catalog, fragment remover. 
Also added fragmentValidation method in MolStandardize * added another test to testValidate test_fragment * corrections to fragment * corrections to Metal * added code for Normalize * added normalize member function to MolStandardize and added tests * added multi fragment functionality to Normalize.cpp and additional tests * TransformCatalog * tests for Normalize.cpp * first bit of cleanup * added most of Charge functionality and some tests * some corrections to Charge.cpp and some more tests to testCharge.cpp * corrections to Charge.cpp * start of Tautomer Enumerate with some tests * added BondType option to Tautomer Enumeration * correcting for some memory leakage * a few alterations to formatting * sorting out some memory leaks * sorting out some memory leaks * some corrections for PCS test set * redo tests with updated RDKit * fixing memory leak * more fixes after 100kPCS set testing * using tab as delimiter in CSVs rather than comma * tutorial for MolStandardize * still working on Tautomer enumeration * deleted some empty tests * starting writing tautomer canonicalize * rename test_data -> data (the source still needs to be updated) * automatic source reformatting * adjust to directory rename * move the fragment catalog test into the MolStandardize directory do not create separate library for FragmentCatalog * stop building separate libraries for the catalogs * move the CleanupParameters into the MolStandardize namespace * first pass at python wrapper * move the py module to the correct dir; add some python tests; add standardizeSmiles to python wrapper * disabling the compareMolVSTest since that requires command line arguments to run * get this building on windows * put the python lib in the right place * further work on python wrapper for rdMolStandardize * added get and set functions to Metal and wrapped them * added get and set functions to Metal and wrapped them * changed construstor of Reionizer class and input args for reionize, wrapped this default * 
overload Reionizer constructor so user can input own AcidBaseFile from python * added Uncharger class to Charge and added test for Uncharger * wrapped Fragment, fixed some memory leakage, changed some args and return types, added some tests * wrapped Normalized and changed how Normalizer class is initiated * changing MolVSValidation structure so user can choose which MolVS submethod they want * starting to write Wrap for Validate * now it compiles with Wrap/Validate.cpp * a couple refactorings around validate * move the validate code into the rdMolStandardize module * make sure a valid pointer is returned for standardizeSmiles * rdMolStandardize.MolVSValidation done and tests added * half way through AllowedAtomsValidation * finished AllowedAtomsValidation and DisallowedAtomsValidation * moved charge, fragment, metal, normalize into the rdMolStandardize module * changed tutorial to use wrapped code * added copyrights * added copyrights * move the data files * modify source files to adjust to the move * added validateSmiles functionality * removed std::cout * redid some of the 100k PCS tests * working on the tutorial * adding some documentation * deleting some comment lines * some changes after pull review * More changes after pull review * start of trying to make java wrap * remove some warnings, add some questions * additional warning removals, a bit more reporting * some test cleanups * enable testing of the java code	2018-09-28 11:24:25 +02:00

26 Commits