38 Commits

Author SHA1 Message Date
Yakov Pechersky
c6cabf4153 Speed-up tautomer canonicalization, no API changes (#9134)
* Speed up tautomer canonicalization by deferring on SSSR calc

* Lazy kekulization for tautomer enumeration

Defer kekulization of tautomers until they are actually needed for
transform matching. This avoids creating kekulized copies for:
1. The initial tautomer (until first iteration)
2. New tautomers that may never be processed (if enumeration ends early)

The Tautomer class now supports lazy initialization of the kekulized
form via getKekulized() method.

Performance improvement: ~7% additional speedup (total ~22-24% from baseline)
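
The lazy-initialization idea described above can be sketched in a few lines of Python. This is only an illustration: `LazyTautomer`, `get_kekulized`, and `fake_kekulize` are hypothetical names, and the string transformation stands in for the real kekulization step; none of this is the RDKit implementation.

```python
from typing import Callable, Optional


class LazyTautomer:
    """Hypothetical stand-in for a tautomer that kekulizes only on demand."""

    def __init__(self, mol: str, kekulize: Callable[[str], str]) -> None:
        self.mol = mol
        self._kekulize = kekulize
        self._kekulized: Optional[str] = None  # nothing computed yet

    def get_kekulized(self) -> str:
        # Compute the kekulized form on first access, then reuse the result.
        if self._kekulized is None:
            self._kekulized = self._kekulize(self.mol)
        return self._kekulized


calls = []


def fake_kekulize(smiles: str) -> str:
    # Placeholder for the expensive step; it only records that it was called.
    calls.append(smiles)
    return smiles.upper()


taut = LazyTautomer("c1ccccc1", fake_kekulize)
```

A tautomer that is created but never processed never pays for the kekulized copy, which is the saving the message describes.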

* Use count-only substructure matching in tautomer scoring

* Add SubstructMatchCount regression test

* MolStandardize: reduce enumerate overhead

* MolStandardize: avoid per-tautomer ring recomputation

* Atom: cache PeriodicTable pointer in valence calcs

* Atom: reuse PeriodicTable in getEffectiveAtomicNum

* PeriodicTable: add atomic fast path for getTable

* GraphMol: reduce ROMol copy reallocations

* MolStandardize: use quickCopy for per-match product copies

Use RWMol(*kmol, true) in tautomer enumeration to avoid copying properties/bookmarks/conformers for each candidate. This reduces deep-copy overhead without changing chemistry.

* MolStandardize: pre-filter scoring patterns by element/connectivity

For tautomer scoring, pre-compute which SubstructTerms are relevant for
a given input molecule. Since tautomerization only moves H atoms and
changes bond orders (never creates/destroys heavy-atom bonds), patterns
requiring missing elements or connectivity can be skipped for all
tautomers of that molecule.

Two-stage filtering:
1. Element check: skip patterns requiring atoms not in the molecule
2. Connectivity check: skip patterns whose bond-order-agnostic structure
   doesn't match the input molecule's connectivity

This reduces the number of VF2 substructure calls per tautomer from 12
to typically 3-5, depending on the molecule's composition.
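
A rough Python sketch of the two-stage filter, under simplifying assumptions: molecules are plain lists of element symbols plus index pairs, and the connectivity check is reduced to "every bonded element pair the pattern needs exists in the molecule", which is a cheaper proxy than the bond-order-agnostic substructure match the message describes. `ScoringPattern` and `relevant_patterns` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringPattern:
    name: str
    required_elements: frozenset  # element symbols the pattern needs
    required_pairs: frozenset     # bonded element pairs, bond order ignored


def bonded_pairs(atoms, bonds):
    """Set of unordered element pairs that are bonded in the molecule."""
    return {frozenset((atoms[i], atoms[j])) for i, j in bonds}


def relevant_patterns(atoms, bonds, patterns):
    """Keep only patterns that could match some tautomer of this molecule."""
    elements = set(atoms)
    pairs = bonded_pairs(atoms, bonds)
    kept = []
    for pattern in patterns:
        # Stage 1: element check.
        if not pattern.required_elements <= elements:
            continue
        # Stage 2: simplified connectivity check (bond orders ignored).
        if not pattern.required_pairs <= pairs:
            continue
        kept.append(pattern)
    return kept


def make(name, a, b):
    return ScoringPattern(name, frozenset((a, b)), frozenset([frozenset((a, b))]))


patterns = [make("C=O", "C", "O"), make("N=O", "N", "O"), make("P=O", "P", "O")]

# Acetamide heavy atoms: C-C(=O)-N
atoms = ["C", "C", "O", "N"]
bonds = [(0, 1), (1, 2), (1, 3)]
kept_names = [p.name for p in relevant_patterns(atoms, bonds, patterns)]
```

Because tautomerization leaves the heavy-atom skeleton unchanged, the filter only has to run once per input molecule, and the surviving patterns can be reused for every enumerated tautomer.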

* MolStandardize: preserve molecule properties for canonical tautomer

Copy molecule properties from the original input to the canonical tautomer
result. Since quickCopy during enumeration skips d_props to avoid overhead,
extended SMILES data like link nodes (LN) was lost. This restores them
on the final result.

* TautomerQuery: preserve molecule properties (e.g. link nodes) in tautomers

TautomerQuery::fromMol() uses TautomerEnumerator::enumerate() which uses
quickCopy for performance. This doesn't copy molecule properties like
_molLinkNodes. Without this fix, XQMol output would lose link node
extensions in the SMILES.

Copy properties from the original query molecule to all enumerated
tautomers before constructing the TautomerQuery. This preserves extended
SMILES data without impacting enumeration performance.

* MolStandardize: use parallel iteration and cache bond lookups

Replace O(n) getAtomWithIdx/getBondWithIdx calls with parallel iteration
over atom/bond ranges in canonicalizeInPlace and enumerate. Cache bond
lookups in setTautomerStereoAndIsoHs to avoid repeated O(n) searches.
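
Both ideas can be illustrated with a small Python sketch. The data layout (dicts for atoms, index pairs for bonds) and the function names are assumptions made for the example, not the RDKit data structures.

```python
def copy_charges(src_atoms, dst_atoms):
    """Walk two atom sequences in lockstep instead of looking up each index."""
    for src, dst in zip(src_atoms, dst_atoms):
        dst["charge"] = src["charge"]


def find_bond_linear(bonds, a, b):
    """O(n) scan over the bond list for the bond joining atoms a and b."""
    for idx, (i, j) in enumerate(bonds):
        if {i, j} == {a, b}:
            return idx
    return None


def build_bond_cache(bonds):
    """Build the lookup table once; later queries are dictionary hits."""
    return {frozenset(pair): idx for idx, pair in enumerate(bonds)}


bonds = [(0, 1), (1, 2), (1, 3)]
cache = build_bond_cache(bonds)

src = [{"charge": 0}, {"charge": 1}]
dst = [{"charge": None}, {"charge": None}]
copy_charges(src, dst)
```

The cache costs one pass over the bonds, so it pays off as soon as the same molecule is queried more than a handful of times.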

* perf: add specialized matchers for simple tautomer scoring patterns

Replace VF2 graph matching with O(n) loops for 6 simple patterns:
- countDoubleOrAromaticBonds: C=O, N=O, P=O patterns
- countMethyls: [CX4H3] methyl groups
- countCarbonDoubleHetero: aliphatic C=hetero
- countAromaticCarbonExocyclicN: aromatic C=exocyclic N
Complex patterns (benzoquinone, oxim, guanidine, aci-nitro) still use VF2.
Combined with the pre-filtering optimization, this achieves ~3.7x speedup
(~2500ms vs ~9300ms original) for tautomer canonicalization.
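
As an illustration of replacing a graph match with a single pass, the following Python sketch counts double or aromatic bonds between two given elements. The representation (element list plus `(i, j, order)` tuples) and the function name are assumptions for the example only.

```python
def count_double_or_aromatic_bonds(atoms, bonds, elem_a, elem_b):
    """Count double/aromatic bonds joining elem_a and elem_b in one O(n) loop."""
    wanted = {elem_a, elem_b}
    count = 0
    for i, j, order in bonds:
        if order not in ("double", "aromatic"):
            continue
        if {atoms[i], atoms[j]} == wanted:
            count += 1
    return count


# Acetic acid heavy atoms: C-C(=O)-O
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, "single"), (1, 2, "double"), (1, 3, "single")]
carbonyl_count = count_double_or_aromatic_bonds(atoms, bonds, "C", "O")
```

A loop like this visits each bond once and needs no backtracking, which is why it suits two-atom patterns while multi-atom patterns still need a general matcher.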

* Fix tautomer canonicalize dropping conformers from quickCopy

quickCopy (RWMol(*mol, true)) skips conformers, so tautomer
enumeration products lose 2D/3D coordinates. This causes InChI
generation to omit the /b (double bond E/Z stereo) layer, since
E/Z is derived from atomic coordinates.

Fix: copy conformers from the original molecule onto the canonical
tautomer after pickCanonical in TautomerEnumerator::canonicalize().
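
The shape of the fix can be sketched in Python with a toy molecule type. `Mol`, `quick_copy`, and `canonicalize` here are hypothetical stand-ins, and the sketch assumes atom ordering is unchanged between the input and the chosen tautomer so the coordinates remain valid.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Mol:
    atoms: list
    conformers: list = field(default_factory=list)


def quick_copy(mol):
    # Cheap copy: topology only, conformers deliberately skipped.
    return Mol(atoms=list(mol.atoms))


def canonicalize(mol, pick):
    # Enumeration works on cheap copies (three identical ones in this toy case).
    candidates = [quick_copy(mol) for _ in range(3)]
    best = pick(candidates)
    # Restore coordinates once, on the final result only.
    best.conformers = copy.deepcopy(mol.conformers)
    return best


original = Mol(atoms=["C", "C", "O"], conformers=[[(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]])
result = canonicalize(original, pick=lambda cands: cands[0])
```

Copying the conformers only onto the selected tautomer keeps the per-candidate copies cheap while still returning coordinates to the caller.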

Tests: SMILES-based E/Z check in testTautomer.cpp, molblock-based
conformer preservation check in catch_tests.cpp.

* add test on canonicalize losing stereo

* add regression test for exocyclic C=C tautomer canonicalization

The getTautomerStateKey() pre-filter (commit 2595ef748) can falsely
deduplicate distinct tautomers when their atom-index-ordered state
patterns happen to match, leading canonicalize() to pick the wrong
canonical form for molecules with STEREOTRANS-pinned exocyclic C=C
bonds after RemoveHs.

Test verifies that O=C(CC1=CC2=CC=COC2)NC1=O canonicalizes to the
exocyclic form O=C1CC(=CC2=CC=COC2)C(=O)N1, not the endocyclic form
O=C1C=C(C=C2CC=COC2)C(=O)N1.

Currently expected to FAIL until the state key dedup bug is fixed.

* MolStandardize: expand tautomer connectivity SMARTS

* MolStandardize: scope tautomer pattern enum

* MolStandardize: trim tautomer pattern enum

* MolStandardize: use symmetric ring scoring
2026-03-31 06:42:40 +02:00
Ricardo Rodriguez
7b7a8a4e17 Refactor iostreams includes (#8846)
* refactor iostreams includes

* restore ostream to MonomerInfo.cpp
2025-10-08 16:08:01 +02:00
Greg Landrum
fa048eacc5 Replace GetImplicitValence() and GetExplicitValence() with GetValence() (#7926) 2025-01-28 21:09:03 +01:00
Brian Kelley
9495dd5413 Expose tautomer scoring functions to python (#7994)
* Expose tautomer scoring functions to python

* Add more tests/documentation

* Rename getDefaultTautomerSubstructs to getDefaultTautomerScoreSubstructs

* Remove ROMOL_SPTR

* Add full custom scoring function example

* Run clang format

* Use proper BOOST_PYTHON_FUNCTION_OVERLOADS

* Use default copy constructor
2024-11-15 05:37:35 +01:00
Greg Landrum
7d2598267a Fixes #7689 (#7851) 2024-09-26 19:22:26 +02:00
Greg Landrum
9d26fc229d Fixes #7642 (#7643) 2024-07-25 04:48:00 +02:00
Greg Landrum
724716b2c6 Switch to isoelectronic valence model (#7491)
* change valence model to use isolobal analogy

Remove support for five-coordinate C+ and, by analogy, five-coordinate N+2

Removes support for charge states that take atoms past the end of the periodic table
  i.e. [Lv-4] is no longer supported

* update the tests for that

* remove valence state of 6 for Al

* fix representation of phosphate in the mol2 parser

this is a correction of what was done during #5973

* cleanup the exceptions for P, S, As, and Se

* drop valence states:

Si 6, P 7, As 7

* a couple of additional changes from #7397

* update java tests

* fix an inconsistency: Rb now supports valence -1

* documentation

* - replace operator[] with at() for bounds check
- extract some code into a function to avoid duplication
- use TAB as separator throughout in the periodic table data for consistency

* removing the .at() usage

We know that these vectors aren't empty, so there's no need for the bounds check.

---------

Co-authored-by: ptosco <paolo.tosco@novartis.com>
2024-06-25 15:38:49 +02:00
Riccardo Vianello
24aba6904e Fix the Uncharger 'force' option w/ non-neutralizable negatively charged sites (#7382) 2024-04-24 09:19:27 +02:00
Riccardo Vianello
06d2e2e89f Add a 'force' option to MolStandardize::Uncharger (#7088)
* Add a 'force' option to MolStandardize::Uncharger

* update comment

* add more test cases exercising MolStandardize::Uncharger

* fix the neutralization of surplus negative charges

* changes in response to review

* Add a test case for MolStandardize::Uncharger

* refactor the neutralization of negative charges in MolStandardize::Uncharger
2024-02-06 15:34:28 +01:00
Greg Landrum
c7c9ad3328 Add in place and multithread support for more of the MolStandardize code (#6970) 2023-12-12 17:21:18 +01:00
Greg Landrum
15751b3651 Add multi-threaded versions of some MolStandardize operations (#6909)
* initial addition of MT support to MolStandardize

* do the other inplace functions

* add mt ops to python wrappers
including tests

* release the GIL

* remove exploratory code added during dev

* make normalizer thread safe

* refactor some repeated code
2023-11-24 18:36:17 -05:00
Greg Landrum
2957ab4576 switch to catch2 v3 (#6898)
* switch to catch2 v3
Fixes #6894

* fix a couple of problems noticed in the CI builds

* more warning cleanup

* changes in response to review
2023-11-15 06:45:42 +01:00
Greg Landrum
ac54eb3209 Add an in place version of most of the MolStandardize functionality (#6491)
* reionizer and uncharger and normalizer can now operate in place

* add removeUnmatchedAtoms argument to in-place version of runReactant

When set to false atoms which are not explicitly removed by the reaction are preserved

* Fix a case where transforms were incorrectly updating atomic numbers

* add more inplace operations to MolStandardize

* support those in the Python layer

* support inplace for the rest of the python wrappers

* move a few more functions over to the inplace code
2023-07-21 08:44:41 +02:00
Greg Landrum
ff6451447a Fixes #5784 (#5817)
catch kekulization errors during the tautomer enumeration
I have tested this on ~100K ChEMBL molecules and encountered
no further problems.
2022-12-01 17:11:50 +01:00
Greg Landrum
522811b8d4 Fixes #5402 (#5542)
* support transforms with branches

* improve output when doing verbose canonicalization

* Fixes #5402
2022-09-09 05:06:42 +02:00
Greg Landrum
fb49a33b5a Fix a couple of problems with MolStandardize (#5319)
* Fixes #5317

* Fixes #5318

* Fixes #5320

* Update Code/GraphMol/MolStandardize/Charge.cpp

Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>
2022-05-30 06:00:09 +02:00
Greg Landrum
310999674b Make the aliphatic imine rule more closely match the definition in the paper (#5270) 2022-05-12 17:16:28 +02:00
Greg Landrum
dc7058ab1c Fixes #5169 (#5191) 2022-04-27 15:02:48 +02:00
Greg Landrum
54ff5ec5dd Add H and X specification to P tautomerization rules (#5077)
Fixes #5008
2022-03-14 04:41:09 +01:00
Greg Landrum
70dadbb21f Fixes #4260 (#4267) 2021-06-21 09:52:11 -04:00
Greg Landrum
f5a54af475 A collection of MolStandardize improvements (#4118)
* Swap to using a data structure for default normalization parameters

* bring the default fragment data into the code too

* cleanup

* add reionizer parameters via data

change fragment parse failures to ValueErrorExceptions

* tautomer parameters in the code

* got a little over-enthusiastic in that last cleanup

* use boost::flyweight to cache normalization and charge data params

* a bit more cleanup

* support reading params from JSON

* fragments from JSON
single-call for fragment removal

* add a one-liner for the canonical tautomer

* quick refactor

* Fixes #4115

* complete the parents

* docs

* move the definitions to a namespace and make them const

* see if switching to c++14 fixes the CI compile problems with g++ 5.5

* somewhat uglier way of solving the initializer list problem
2021-05-19 09:11:23 +02:00
Greg Landrum
158fe71ff2 Add some MolStandardize functionality to the CFFI library (#4062)
* support CleanupParameters from JSON

* add standardization to cffi

* remove a bunch of repeated code from the new stuff
2021-04-22 05:23:25 +02:00
Paolo Tosco
f1119f3980 Make MetalDisconnector more robust against metallorganics (#3465)
* Make MetalDisconnector more robust against metallorganics

* - fixed misbehavior with radicals
- added tests
- code cleanup

* - fixed MetalDisconnector with dative bonds
- removed pointless test
2020-10-13 04:41:18 +02:00
shrey183
8ea1ac6112 [GSoC-2020] Generalized and Multithreaded File Reader (#3363)
* fixed issue #2965

* added test case for issue #2965

* fixed formatting and added comment.

* update

* General Reader files

* removed dependency on boost filesystems

* removed class

* clang-format

* added-comments

* further-cleanup

* added clang-formatting

* braces-for-if-else

* changed error messages, added option for windows file path

* fixed getFileName function

* cleanup

* option for filename without path

* further-cleanup

* added tests for determineFileFormat

* cleanup, const arguments for validate function

* init

* cleanup

* cleanup

* clang-format does not work for CMake

* added RDK_TEST_MULTITHREADED option

* add-flag

* cleanup

* Delete ConcurrentQueue.h

This PR deals with the Generalized File Reader.

* Delete testConcurrentQueue.cpp

This PR deals with the Generalized File Reader.

* no change

* concurrent queue

* print values

* Single Producer Multiple Consumer works

* cleanup

* Producer Consumer Example

* update queue methods and tests

* cleanup

* test

* fixed tests

* cleanup, updated tests

* Delete ProducerConsumer.h

* Delete testProducerConsumer.cpp

* cleanup

* further cleanup

* changes based on feedback

* make queue non copyable

* pseudocode

* possible implementation

* untested implementation

* change class to typename

* basic-setup

* need to fix segfault

* need to fix blocking

* need to fix blocking

* need to fix blocking

* fix indentation

* one possibility

* without lambda function

* possible fix with some test cases

* performance tests

* added support for record id and item text

* cleanup

* cleanup

* fixed memory leak and added methods with tests for getting last id and item text

* cleanup

* added more test cases with different smi files

* cleanup

* SD mol supplier

* modified the parsing for SDMolSupplier

* cleanup

* cleanup

* new file for testing

* added support for reading molecule properties with tests

* thread-safe logging and exception handling

* cleanup

* without thread safe logging

* cleanup

* cleanup, modified MultithreadedSmilesMolSupplier

* cleanup, made reader and writer functions private

* move O2.sdf

* basic python wrapper with tests

* cleanup, added new methods for python wrappers

* made changes suggested by Andrew

* file and compression formats are case-insensitive

* cannot open files with gzstream

* cleanup

* possible fix for opening compressed streams (SMILES)

* removed seekg() and tellg() methods from multithreaded suppliers

* cleanup

* test cases for python wrappers

* some wrapper cleanup

* cleanup, removed unused functions

* update the MT tests so that they actually do some work
also includes some cleanup here

* cleanup

* remove iterator_next header include

* added support for multithreaded readers

* use getNumThreadsToUse for multithreaded suppliers

* fixed documentation for multithreaded python wrappers

* commented performance test

* first draft of final evaluation report

* removed inline variables

* first draft getting started in python

* fixed typos in getting started in python

* fixed typos

* fix documentation tests

* fixed documentation tests

* added links to important files and PR

* added performance results

* first version of wrappers with compressed streams

* getting rid of streambuf stream method

* modified General File Reader

* make this work when building in non-threads mode

* rename a test

* rename a function in the python API

* rearrange the python test a bit

* disable the stream-based constructors in Python

* mark the multithreaded classes as experimental

Co-authored-by: greg landrum <greg.landrum@gmail.com>
2020-10-09 04:31:05 +02:00
Eisuke Kawashima
be9349b3bb Correct TEST_CASE tags for Catch2 (#3069)
https://github.com/catchorg/Catch2/blob/v2.1.2/docs/test-cases-and-sections.md#tags
2020-04-08 15:43:38 +02:00
Manan Goel
f3a6db2a02 This commit fixes the bug "segmentation fault/core dump when chargePar… (#3029)
* This commit fixes the bug "segmentation fault/core dump when chargeParent is run with skip_standardize set to true" mentioned in #2970

* Fixed memory leaks in MolStandardize and deleted variables which aren't required
2020-03-24 07:48:38 +01:00
shrey183
00c6a7e370 Possible fix for issue #2965 (#3001)
* fixed issue #2965

* added test case for issue #2965

* fixed formatting and added comment.
2020-03-14 14:28:17 +01:00
Greg Landrum
f2841ecf42 Fixes #2792 (#2793) 2019-11-20 16:26:35 +01:00
Greg Landrum
c09fb2f3f4 fragments need to match bond counts too (#2768) 2019-11-14 13:57:22 +01:00
Greg Landrum
cb55f6b979 Fixes #2749 (#2750)
* Switch to using numTotalHs() instead of numExplicitHs()

* Fixes #2749

* changes in response to review
2019-10-31 07:24:34 -04:00
Greg Landrum
02cff7dfe4 Fix #2722 and #2721 (#2723)
* Fixes #2722

* Fixes and tests #2721
2019-10-17 11:35:32 -04:00
Greg Landrum
b87c629e10 fix a problem with normalize, ringinfo, and fragments (#2685) 2019-10-03 15:28:33 +02:00
Greg Landrum
7ffd863c9b A collection of bug fixes (#2608)
* Fixes #2602

* Fixes #2605

* Remove vestigial isEarlyAtom() definition in Kekulize.cpp

* Fixes #2606

* Fixes #2607

adds allowed valence 2 for Sn and Pb

* Fixes #2610

* update in response to review
2019-08-15 04:53:23 +02:00
Greg Landrum
3ce2016039 Fixes #2452 (#2507) 2019-06-24 23:07:19 -04:00
Greg Landrum
d0c8c3cf8f Fixes #2411 and #2414 (#2412)
* clang-tidy-7 pass

* Fixes #2411

* Fixes #2414
2019-04-19 21:51:41 -04:00
Greg Landrum
941d7abb5f Fixes #2392 (#2393)
* Fixes #2392

* update release notes
2019-04-06 07:16:55 -04:00
Greg Landrum
1d01874678 improvements to the Uncharge functionality (#2374)
* modify the uncharger to use a canonical atom ordering

* add doCanonical cleanup parameter
make canonical ordering the default
document the change

* Add neutralization of additional negative groups (not just acids).
This may not be the right thing to do.

* expose the new parameter to python

* changes in response to review
2019-03-29 21:02:55 -04:00
Greg Landrum
55fb9034a6 Add a skip_if_all_match option to the FragmentRemover (#2338)
* add SKIP_IF_ALL_MATCH argument to FragmentRemover
    Refactor FragmentRemover::remove() to make it more efficient

* implement and test SKIP_IF_ALL_MATCH

* expose the extra option to Python

* add info to logger
2019-03-14 09:32:08 -04:00