rdkit

mirror of https://github.com/rdkit/rdkit.git synced 2026-06-04 21:54:27 +08:00

Author	SHA1	Message	Date
Yakov Pechersky	c6cabf4153	Speed-up tautomer canonicalization, no API changes (#9134 ) * Speed up tautomer canonicalization by deferring on SSSR calc * Lazy kekulization for tautomer enumeration Defer kekulization of tautomers until they are actually needed for transform matching. This avoids creating kekulized copies for: 1. The initial tautomer (until first iteration) 2. New tautomers that may never be processed (if enumeration ends early) The Tautomer class now supports lazy initialization of the kekulized form via getKekulized() method. Performance improvement: ~7% additional speedup (total ~22-24% from baseline) * Use count-only substructure matching in tautomer scoring * Add SubstructMatchCount regression test * MolStandardize: reduce enumerate overhead * MolStandardize: avoid per-tautomer ring recomputation * Atom: cache PeriodicTable pointer in valence calcs * Atom: reuse PeriodicTable in getEffectiveAtomicNum * PeriodicTable: add atomic fast path for getTable * GraphMol: reduce ROMol copy reallocations * MolStandardize: use quickCopy for per-match product copies Use RWMol(kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry. MolStandardize: pre-filter scoring patterns by element/connectivity For tautomer scoring, pre-compute which SubstructTerms are relevant for a given input molecule. Since tautomerization only moves H atoms and changes bond orders (never creates/destroys heavy-atom bonds), patterns requiring missing elements or connectivity can be skipped for all tautomers of that molecule. Two-stage filtering: 1. Element check: skip patterns requiring atoms not in the molecule 2. Connectivity check: skip patterns whose bond-order-agnostic structure doesn't match the input molecule's connectivity This reduces the number of VF2 substructure calls per tautomer from 12 to typically 3-5, depending on the molecule's composition. 
* MolStandardize: preserve molecule properties for canonical tautomer Copy molecule properties from the original input to the canonical tautomer result. Since quickCopy during enumeration skips d_props to avoid overhead, extended SMILES data like link nodes (LN) was lost. This restores them on the final result. * TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses quickCopy for performance. This doesn't copy molecule properties like _molLinkNodes. Without this fix, XQMol output would lose link node extensions in the SMILES. Copy properties from the original query molecule to all enumerated tautomers before constructing the TautomerQuery. This preserves extended SMILES data without impacting enumeration performance. * MolStandardize: use parallel iteration and cache bond lookups Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches. * perf: add specialized matchers for simple tautomer scoring patterns Replace VF2 graph matching with O(n) loops for 6 simple patterns: - countDoubleOrAromaticBonds: C=O, N=O, P=O patterns - countMethyls: [CX4H3] methyl groups - countCarbonDoubleHetero: [C]=[/home/dcvuser/rdkit;Code/GraphMol/MolStandardize/Tautomer.h] aliphatic C=hetero - countAromaticCarbonExocyclicN: [c]=aromatic C=exocyclic N Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2. Combined with the pre-filtering optimization, this achieves ~3.7x speedup (~2500ms vs ~9300ms original) for tautomer canonicalization. * Fix tautomer canonicalize dropping conformers from quickCopy quickCopy (RWMol(mol, true)) skips conformers, so tautomer enumeration products lose 2D/3D coordinates. This causes InChI generation to omit the /b (double bond E/Z stereo) layer, since E/Z is derived from atomic coordinates. 
Fix: copy conformers from the original molecule onto the canonical tautomer after pickCanonical in TautomerEnumerator::canonicalize(). Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based conformer preservation check in catch_tests.cpp. add test on canonicalize losing stereo * add regression test for exocyclic C=C tautomer canonicalization The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely deduplicate distinct tautomers when their atom-index-ordered state patterns happen to match, leading canonicalize() to pick the wrong canonical form for molecules with STEREOTRANS-pinned exocyclic C=C bonds after RemoveHs. Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form O=C1C=C(C=C2CC=COC2)C(=O)N1. Currently expected to FAIL until the state key dedup bug is fixed. * MolStandardize: expand tautomer connectivity SMARTS * MolStandardize: scope tautomer pattern enum * MolStandardize: trim tautomer pattern enum * MolStandardize: use symmetric ring scoring	2026-03-31 06:42:40 +02:00
Ricardo Rodriguez	7b7a8a4e17	Refactor iostreams includes (#8846 ) * refactor iostreams includes * restore ostream to MonomerInfo.cpp	2025-10-08 16:08:01 +02:00
Greg Landrum	fa048eacc5	Replace GetImplicitValence() and GetExplicitValence() with GetValence() (#7926 )	2025-01-28 21:09:03 +01:00
Brian Kelley	9495dd5413	Expose tautomer scoring functions to python (#7994 ) * Expose tautomer scoring functions to python * Add more tests/documentation * Rename getDefaultTautomerSubstructs to getDefaultTautomerScoreSubstructs * Remove ROMOL_SPTR * Add full custom scoring function example * Run clang format * Use proper BOOST_PYTHON_FUNCTION_OVERLOADS * Use default copy constructor	2024-11-15 05:37:35 +01:00
Greg Landrum	7d2598267a	Fixes #7689 (#7851 )	2024-09-26 19:22:26 +02:00
Greg Landrum	9d26fc229d	Fixes #7642 (#7643 )	2024-07-25 04:48:00 +02:00
Greg Landrum	724716b2c6	Switch to isoelectronic valence model (#7491 ) * change valence model to use isolobal analogy Remove support for five-coordinate C+ and, by analogy, five-coordinate N+2 Removes support for charge states that take atoms past the end of the periodic table i.e. [Lv-4] is no longer supported * update the tests for that * remove valence state of 6 for Al * fix representation of phosphate in the mol2 parser this is a correction of what was done during #5973 * cleanup the exceptions for P, S, As, and Se * drop valence states: Si 6, P 7, As 7 * a couple of additional changes from #7397 * update java tests * fix an inconsistency: Rb now supports valence -1 * documentation * - replace operator[] with at() for bounds check - extract some code into a function to avoid duplication - use TAB as separator throughout in the periodic table data for consistency * removing the .at() usage We know that these vectors aren't empty, so there's no need for the bounds check. --------- Co-authored-by: ptosco <paolo.tosco@novartis.com>	2024-06-25 15:38:49 +02:00
Riccardo Vianello	24aba6904e	Fix the Uncharger 'force' option w/ non-neutralizable negatively charged sites (#7382 )	2024-04-24 09:19:27 +02:00
Riccardo Vianello	06d2e2e89f	Add a 'force' option to MolStandardizer::Uncharger (#7088 ) * Add a 'force' option to MolStandardize::Uncharger * update comment * add more test cases exercising MolStandardize::Uncharger * fix the neutralization of surplus negative charges * changes in response to review * Add a test case for MolStandardize::Uncharger * refactor the neutralization of negative charges in MolStandardize::Uncharger	2024-02-06 15:34:28 +01:00
Greg Landrum	c7c9ad3328	Add in place and multithread support for more of the MolStandardize code (#6970 )	2023-12-12 17:21:18 +01:00
Greg Landrum	15751b3651	Add multi-threaded versions of some MolStandardize operations (#6909 ) * initial addition of MT support to MolStandardize * do the other inplace functions * add mt ops to python wrappers including tests * release the GIL * remove exploratory code added during dev * make normalizer thread safe * refactor some repeated code	2023-11-24 18:36:17 -05:00
Greg Landrum	2957ab4576	switch to catch2 v3 (#6898 ) * switch to catch2 v3 Fixes #6894 * fix a couple of problems noticed in the CI builds * more warning cleanup * changes in response to review	2023-11-15 06:45:42 +01:00
Greg Landrum	ac54eb3209	Add an in place version of most of the MolStandardize functionality (#6491 ) * reionizer and uncharger and normalizer can now operate in place * add removeUnmatchedAtoms argument to in-place version of runReactant When set to false atoms which are not explicitly removed by the reaction are preserved * Fix a case where transforms were incorrectly updating atomic numbers * add more inplace operations to MolStandardize * support those in the Python layer * support inplace for the rest of the python wrappers * move a few more functions over to the inplace code	2023-07-21 08:44:41 +02:00
Greg Landrum	ff6451447a	Fixes #5784 (#5817 ) catch kekulization errors during the tautomer enumeration I have tested this on ~100K ChEMBL molecules and encountered no further problems.	2022-12-01 17:11:50 +01:00
Greg Landrum	522811b8d4	Fixes #5402 (#5542 ) * support transforms with branches * improve output when doing verbose canonicalization * Fixes #5402	2022-09-09 05:06:42 +02:00
Greg Landrum	fb49a33b5a	Fix a couple of problems with MolStandardize (#5319 ) * Fixes #5317 * Fixes #5318 * Fixes #5320 * Update Code/GraphMol/MolStandardize/Charge.cpp Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com> Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>	2022-05-30 06:00:09 +02:00
Greg Landrum	310999674b	Make the aliphatic imine rule more closely match the definition in the paper (#5270 )	2022-05-12 17:16:28 +02:00
Greg Landrum	dc7058ab1c	Fixes #5169 (#5191 )	2022-04-27 15:02:48 +02:00
Greg Landrum	54ff5ec5dd	Add H and X specification to P tautomerization rules (#5077 ) Fixes #5008	2022-03-14 04:41:09 +01:00
Greg Landrum	70dadbb21f	Fixes #4260 (#4267 )	2021-06-21 09:52:11 -04:00
Greg Landrum	f5a54af475	A collection of MolStandardize improvements (#4118 ) * Swap to using a data structure for default normalization parameters * bring the default fragment data into the code too * cleanup * add reionizer parameters via data change fragment parse failures to ValueErrorExceptions * tautomer parameters in the code * got a little over-enthusiastic in that last cleanup * use boost::flyweight to cache normalization and charge data params * a bit more cleanup * support reading params from JSON * fragments from JSON single-call for fragment removal * add a one-liner for the canonical tautomer * quick refactor * Fixes #4115 * complete the parents * docs * move the definitions to a namespace and make them const * see if switching to c++14 fixes the CI compile problems with g++ 5.5 * somewhat uglier way of solving the initializer list problem	2021-05-19 09:11:23 +02:00
Greg Landrum	158fe71ff2	Add some MolStandardize functionality to the CFFI library (#4062 ) * support CleanupParameters from JSON * add standardization to cffi * remove a bunch of repeated code from the new stuff	2021-04-22 05:23:25 +02:00
Paolo Tosco	f1119f3980	Make MetalDisconnector more robust against metallorganics (#3465 ) * Make MetalDisconnector more robust against metallorganics * - fixed misbehavior with radicals - added tests - code cleanup * - fixed MetalDisconnector with dative bonds - removed pointless test	2020-10-13 04:41:18 +02:00
shrey183	8ea1ac6112	[GSoC-2020] Generalized and Multithreaded File Reader (#3363 ) * fixed issue #2965 * added test case for issue #2965 * fixed formatting and added comment. * update * General Reader files * removed dependency on boost filesystems * removed class * clang-format * added-comments * further-cleanup * added clang-formatting * braces-for-if-else * changed error messages, added option for windows file path * fixed getFileName function * cleanup * option for filename without path * further-cleanup * added tests for determineFileFormat * cleanup, const arguments for validate function * init * cleanup * cleanup * clang-format does not work for CMake * added RDK_TEST_MULTITHREADED option * add-flag * cleanup * Delete ConcurrentQueue.h This PR deals with the Generalized File Reader. * Delete testConcurrentQueue.cpp This PR deals with the Generalized File Reader. * no change * concurrent queue * print values * Single Producer Multiple Consumer works * cleanup * Producer Consumer Example * update queue methods and tests * cleanup * test * fixed tests * cleanup, updated tests * Delete ProducerConsumer.h * Delete testProducerConsumer.cpp * cleanup * further cleanup * changes based on feedback * make queue non copyable * pseudocode * possible implementation * untested implementation * change class to typename * basic-setup * need to fix segfault * need to fix blocking * need to fix blocking * need to fix blocking * fix indentation * one possibility * without lambda function * possible fix with some test cases * performance tests * added support for record id and item text * cleanup * cleanup * fixed memory leak and added methods with tests for getting last id and item text * cleanup * added more test cases with different smi files * cleanup * SD mol supplier * modified the parsing for SDMolSupplier * cleanup * cleanup * new file for testing * added support for reading molecule properties with tests * thread-safe logging and exception handling * cleanup * without 
thread safe logging * cleanup * cleanup, modified MultithreadedSmilesMolSupplier * cleanup, made reader and writer functions private * move O2.sdf * basic python wrapper with tests * cleanup, added new methods for python wrappers * made changes suggested by Andrew * file and compression formats are case-insensitive * cannot open files with gzstream * cleanup * possible fix for opening compressed streams (SMILES) * removed seekg() and tellg() methods from multithreaded suppliers * cleanup * test cases for python wrappers * some wrapper cleanup * cleanup, removed unused functions * update the MT tests so that they actually do some work also includes some cleanup here * cleanup * remove iterator_next header include * added support for multithreaded readers * use getNumThreadsToUse for multithreaded suppliers * fixed documentation for multithreaded python wrappers * commented performance test * first draft of final evaluation report * removed inline variables * first draft getting started in python * fixed typos in getting started in python * fixed typos * fix documentation tests * fixed documentation tests * added links to important files and PR * added performance results * first version of wrappers with compressed streams * getting rid of streambuf stream method * modified General File Reader * make this work when building in non-threads mode * rename a test * rename a function in the python API * rearrange the python test a bit * disable the stream-based constructors in Python * mark the multithreaded classes as experimental Co-authored-by: greg landrum <greg.landrum@gmail.com>	2020-10-09 04:31:05 +02:00
Eisuke Kawashima	be9349b3bb	Correct TEST_CASE tags for Catch2 (#3069 ) https://github.com/catchorg/Catch2/blob/v2.1.2/docs/test-cases-and-sections.md#tags	2020-04-08 15:43:38 +02:00
Manan Goel	f3a6db2a02	This commit fixes the bug "segmentation fault/core dump when chargePar… (#3029 ) * This commit fixes the bug "segmentation fault/core dump when chargeParent is run with skip_standardize set to true" mentioned in #2970 * Fixed memory leaks in MolStandardize and deleted variables which aren't required	2020-03-24 07:48:38 +01:00
shrey183	00c6a7e370	Possible fix for issue #2965 (#3001 ) * fixed issue #2965 * added test case for issue #2965 * fixed formatting and added comment.	2020-03-14 14:28:17 +01:00
Greg Landrum	f2841ecf42	Fixes #2792 (#2793 )	2019-11-20 16:26:35 +01:00
Greg Landrum	c09fb2f3f4	fragments need to match bond counts too (#2768 )	2019-11-14 13:57:22 +01:00
Greg Landrum	cb55f6b979	Fixes #2749 (#2750 ) * Switch to using numTotalHs() instead of numExplicitHs() * Fixes #2749 * changes in response to review	2019-10-31 07:24:34 -04:00
Greg Landrum	02cff7dfe4	Fix #2722 and #2721 (#2723 ) * Fixes #2722 * Fixes and tests #2721	2019-10-17 11:35:32 -04:00
Greg Landrum	b87c629e10	fix a problem with normalize, ringinfo, and fragments (#2685 )	2019-10-03 15:28:33 +02:00
Greg Landrum	7ffd863c9b	A collection of bug fixes (#2608 ) * Fixes #2602 * Fixes #2605 * Remove vestigial isEarlyAtom() definition in Kekulize.cpp * Fixes #2606 * Fixes #2607 adds allowed valence 2 for Sn and Pb * Fixes #2610 * update in response to review	2019-08-15 04:53:23 +02:00
Greg Landrum	3ce2016039	Fixes #2452 (#2507 )	2019-06-24 23:07:19 -04:00
Greg Landrum	d0c8c3cf8f	Fixes #2411 and #2414 (#2412 ) * clang-tidy-7 pass * Fixes #2411 * Fixes #2414	2019-04-19 21:51:41 -04:00
Greg Landrum	941d7abb5f	Fixes #2392 (#2393 ) * Fixes #2392 * update release notes	2019-04-06 07:16:55 -04:00
Greg Landrum	1d01874678	improvements to the Uncharge functionality (#2374 ) * modify the uncharger to use a canonical atom ordering * add doCanonical cleanup parameter make canonical ordering the default document the change * Add neutralization of additional negative groups (not just acids). This may not be the right thing to do. * expose the new parameter to python * changes in response to review	2019-03-29 21:02:55 -04:00
Greg Landrum	55fb9034a6	Add a skip_all_if_match option to the FragmentRemover (#2338 ) * add SKIP_IF_ALL_MATCH argument to FragmentRemover Refactor FragmentRemover::remove() to make it more efficient * implement and test SKIP_IF_ALL_MATCH * expose the extra option to Python * add info to logger	2019-03-14 09:32:08 -04:00

38 Commits