29 Commits

Author SHA1 Message Date
David Cosgrove
b04a861ae7 Replace combineMols with RWMol::insertMol. (#9319)
Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2026-06-02 14:38:14 +02:00
Dan Nealschneider
226427e0bc Synthon substructure search 2x performance (#9307)
* synthon perf: replace O(N) haveEnoughHits scan with O(1) atomic counter

processPartHitsFromDetails called haveEnoughHits after each verified hit,
which scanned every slot of the pre-sized results vector (up to toTryChunkSize
= 2.5M entries) to count non-null entries via std::accumulate. With ~3000
verified hits per search that is ~7.5B pointer reads per query.

Replace with a std::atomic<int64_t> numHitsFound counter in makeHitsFromToTry,
incremented via fetch_add on each verified hit. The early-exit condition becomes
a single atomic read, O(1) per hit regardless of vector size. The atomic is
local to makeHitsFromToTry so it resets correctly per chunk and is safe for
the multi-threaded path without added synchronization.
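
The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the RDKit code: `Hit`, `countHitsByScan` and `HitCollector` are hypothetical names, and treating a negative `maxHits` as "no limit" is an assumption of the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a verified hit; the real code stores molecules.
struct Hit {
  int id;
};

// Old approach: count the non-null slots by scanning the whole pre-sized
// vector. This is O(N) in the vector size on every call.
std::int64_t countHitsByScan(const std::vector<std::unique_ptr<Hit>> &results) {
  return std::accumulate(results.begin(), results.end(), std::int64_t{0},
                         [](std::int64_t n, const std::unique_ptr<Hit> &h) {
                           return n + (h ? 1 : 0);
                         });
}

// New approach: bump a counter whenever a hit is stored, so the early-exit
// test is a single atomic load regardless of the vector size.
class HitCollector {
 public:
  HitCollector(std::size_t numSlots, std::int64_t maxHits)
      : d_results(numSlots), d_maxHits(maxHits) {}

  // Each slot is assumed to be written by at most one thread, so only the
  // counter needs to be atomic.
  void storeHit(std::size_t slot, int id) {
    assert(slot < d_results.size());
    d_results[slot] = std::make_unique<Hit>(Hit{id});
    d_numHitsFound.fetch_add(1, std::memory_order_relaxed);
  }

  // A negative maxHits means "no limit" in this sketch.
  bool haveEnoughHits() const {
    return d_maxHits >= 0 &&
           d_numHitsFound.load(std::memory_order_relaxed) >= d_maxHits;
  }

  const std::vector<std::unique_ptr<Hit>> &results() const {
    return d_results;
  }

 private:
  std::vector<std::unique_ptr<Hit>> d_results;
  std::int64_t d_maxHits;
  std::atomic<std::int64_t> d_numHitsFound{0};
};
```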

Measured on synthon_perf branch (42-rxn / 140B-product Freedom space,
maxHits=3000, hitStart=1000, before boost::unordered_flat_set change):
  search-several (9 queries): ~30s → ~16.5s (~1.8x)
  search-one (benzene):       ~3.5s → ~1.8s  (~1.9x)

All 4 synthon ctest cases pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style ++

* Update Code/GraphMol/SynthonSpaceSearch/SynthonSpaceSearcher.cpp

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2026-06-02 14:23:49 +02:00
Dan Nealschneider
76a32ef1ee synthon perf: replace sort+unique dedup with boost::unordered_flat_set (#9305)
sortAndUniquifyToTry previously built a parallel vector of (index, string)
pairs, sorted by string, erased duplicates, then rebuilt the original vector
— O(N log N) with one heap allocation per candidate product.

Replace with an erase-remove over a boost::unordered_flat_set<size_t> keyed
on buildProductHash (boost::hash_combine over synthon IDs + reaction ID).
Dedup is now O(N) average with no string allocations on the hot path.
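
A minimal sketch of hash-keyed dedup with erase-remove, under stated assumptions: `std::unordered_set` stands in for `boost::unordered_flat_set` so the block builds without Boost, `hashCombine` is modelled on the older `boost::hash_combine` recipe, and `Candidate`, `productHash` and `uniquifyByHash` are hypothetical names rather than the RDKit functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical candidate product: a reaction plus one synthon ID per set.
struct Candidate {
  std::size_t reactionId;
  std::vector<std::size_t> synthonIds;
};

// Mixing step modelled on the older boost::hash_combine recipe.
inline void hashCombine(std::size_t &seed, std::size_t value) {
  seed ^= std::hash<std::size_t>{}(value) + 0x9e3779b9 + (seed << 6) +
          (seed >> 2);
}

std::size_t productHash(const Candidate &c) {
  std::size_t seed = 0;
  hashCombine(seed, c.reactionId);
  for (auto id : c.synthonIds) {
    hashCombine(seed, id);
  }
  return seed;
}

// Remove duplicates in one pass, keeping the first occurrence of each hash
// and preserving the original order. No strings are built and nothing is
// sorted. Two different candidates whose hashes collide would be merged;
// that risk comes with keying on the hash alone.
void uniquifyByHash(std::vector<Candidate> &toTry) {
  std::unordered_set<std::size_t> seen;
  seen.reserve(toTry.size());
  toTry.erase(std::remove_if(toTry.begin(), toTry.end(),
                             [&seen](const Candidate &c) {
                               return !seen.insert(productHash(c)).second;
                             }),
              toTry.end());
  assert(seen.size() == toTry.size());
}
```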

Also switch SearchResults::d_molNames from std::unordered_set<std::string>
to boost::unordered_flat_set<std::string> for the same open-addressing cache
locality benefit during mergeResults.

Perf (42-rxn / 140B-product Freedom space, maxHits=3000, hitStart=1000,
9 queries; vanilla.log → 2unordered_flat_set.log):
  Benzene:       6.92s → 5.64s  (−19%)
  Tolueneish:    6.19s → 5.07s  (−18%)
  Acetaminophen: 4.50s → 3.63s  (−19%)
  Allopurinol:   4.41s → 3.94s  (−11%)
  Theophylline:  4.39s → 3.90s  (−11%)
  Nicotine:      4.87s → 3.97s  (−18%)
  Ciprofloxacin: 6.82s → 6.09s  (−11%)
  Aspirin:       4.51s → 3.42s  (−24%)
  Metoprolol:    5.11s → 4.07s  (−20%)
  Total:        48.40s → 40.33s (−17%)

Hit counts and MaxNumResults unchanged across all queries.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 17:13:03 +02:00
Eisuke Kawashima
e89c9f656a style: apply readability-braces-around-statements (#8136)
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2026-02-09 12:10:50 +01:00
David Cosgrove
8ef091f306 Github9007 (#9022) 2025-12-29 21:08:10 +01:00
David Cosgrove
4883485ac5 Fixes Github9009 (#9012)
* Mostly working.

* Clear out debugging writes.

* Avoid a mol copy.

* Left some debugging in.

* Move the fix into fragmentOnBonds.

* Add test.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-12-22 06:26:23 +01:00
Justin Gullingsrud
37a874acd9 Make incremental search callback take ownership of mols (#8940) 2025-11-11 17:10:29 +01:00
Justin Gullingsrud
bda9ffbeec Incremental synthon search (#8855)
* Iterated interface to substructure search

* Add a test

* Add python unit test

* Expose the toTryChunkSize parameter to python

* Respect the maxHits parameter; sort the hitset

* Treat maxHits=-1 as infinite

* Add callback versions of fp and rascal search; conform to C++ style

* Add fp and rascal C++ tests

* maxHits=-1 tripped me up again

* Add fp and rascal python wrappers.

Changed the name of the callback-based method to have "Incremental"
in the name because the overloaded versions with default arguments
can't be reliably selected by the boost python runtime.  Probably
better to have a different method name anyway since the return type
is None instead of a results object.

* Delete stray printf.

* Run clang-format

* Use std::int64_t instead of ssize_t for portability

* Make docstrings on callback-based methods more descriptive

* Stop incremental search if the callback returns true.

* Add an example of incremental synthon search to the getting started docs

* trivial commit to force CI rerun

* Reformat single line if statements.

* Make SearchResultsCallback take const ref input

* Fix another one-liner

* Oops - another one-liner
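
The callback behaviour listed above (hits delivered chunk by chunk, maxHits=-1 treated as unlimited, search stopped when the callback returns true) can be sketched roughly as below. This is an illustration only: `searchIncremental`, `ChunkCallback` and the use of plain integers in place of molecules are assumptions of the sketch, not the RDKit interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback type: receives one chunk of hits by const reference
// and returns true to stop the search early.
using ChunkCallback = std::function<bool(const std::vector<int> &)>;

// Walk the candidates in chunks, keep those that pass isHit, and hand each
// non-empty chunk of hits to the callback. maxHits == -1 means "no limit".
// Returns the total number of hits delivered.
std::int64_t searchIncremental(const std::vector<int> &candidates,
                               const std::function<bool(int)> &isHit,
                               std::size_t chunkSize, std::int64_t maxHits,
                               const ChunkCallback &callback) {
  assert(chunkSize > 0);
  std::int64_t numDelivered = 0;
  for (std::size_t start = 0; start < candidates.size(); start += chunkSize) {
    const std::size_t end = std::min(start + chunkSize, candidates.size());
    std::vector<int> hits;
    for (std::size_t i = start; i < end; ++i) {
      if (maxHits != -1 &&
          numDelivered + static_cast<std::int64_t>(hits.size()) >= maxHits) {
        break;
      }
      if (isHit(candidates[i])) {
        hits.push_back(candidates[i]);
      }
    }
    if (!hits.empty()) {
      numDelivered += static_cast<std::int64_t>(hits.size());
      if (callback(hits)) {
        break;  // the caller asked to stop
      }
    }
    if (maxHits != -1 && numDelivered >= maxHits) {
      break;
    }
  }
  return numDelivered;
}
```

Using a signed `std::int64_t` for `maxHits` keeps -1 available as the "unlimited" sentinel without relying on `ssize_t`, which is not available on every platform.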
2025-11-08 04:27:16 +01:00
Ricardo Rodriguez
7b7a8a4e17 Refactor iostreams includes (#8846)
* refactor iostreams includes

* restore ostream to MonomerInfo.cpp
2025-10-08 16:08:01 +02:00
David Cosgrove
34e80ab764 Fixes Github8650 - spaces in columns in tab-separated synthon space files (#8652)
* Try tab-separated before space-separated when reading synthon lines.
Fix bug in looking up reactions by name.
amino_acid.txt had a mixture of tab-separated and space-separated lines which is no longer allowed.

* Typo
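
A rough sketch of the "tab first, then space" parsing order, so that tab-separated fields may contain spaces. The function names and the `numColumns` parameter are hypothetical; the actual reader may decide when to fall back differently.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split on a single delimiter character, keeping empty fields.
std::vector<std::string> splitOn(const std::string &line, char delim) {
  std::vector<std::string> fields;
  std::string::size_type start = 0;
  while (true) {
    const auto pos = line.find(delim, start);
    fields.push_back(line.substr(start, pos - start));
    if (pos == std::string::npos) {
      break;
    }
    start = pos + 1;
  }
  return fields;
}

// Split on runs of whitespace.
std::vector<std::string> splitOnWhitespace(const std::string &line) {
  std::istringstream iss(line);
  std::vector<std::string> fields;
  std::string field;
  while (iss >> field) {
    fields.push_back(field);
  }
  return fields;
}

// Try tabs first so that a field such as a name may contain spaces; fall
// back to whitespace only when the tab split does not give the expected
// number of columns.
std::vector<std::string> splitSynthonLine(const std::string &line,
                                          std::size_t numColumns) {
  assert(numColumns > 0);
  auto fields = splitOn(line, '\t');
  if (fields.size() == numColumns) {
    return fields;
  }
  return splitOnWhitespace(line);
}
```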

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-07-25 06:35:35 +02:00
David Cosgrove
79403df1b0 Synthon space bad chiral count (#8550)
* Count chiral atoms just counts tetrahedral atoms.

* Put an actual check in.

* Randomly fix some unconnected warnings.

* Curb my enthusiasm for std::cmp_less.

* Trigger Build

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-06-03 13:09:09 +02:00
David Cosgrove
5861d503e5 Synthon space hit filters (#8473)
* Add filters to hits.

* Add safeSetattr to Python wrapper.

* Fix chiral centres filter.

* Defaults of -1 for int max filters.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-05-11 05:14:12 +02:00
David Cosgrove
06196b6912 Synthon space github8502 (#8509)
* Read SynthonSpace from stream.

* Fix bug when splitting diatomic molecule.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-05-08 06:54:24 +02:00
Eisuke Kawashima
26033b6578 style: apply modernize-use-override (#8137)
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2025-04-25 12:24:58 +02:00
Paolo Tosco
92c63eb98c Fix SynthonSpace build when RDK_USE_BOOST_SERIALIZATION is not defined (#8380)
* get SynthonSpace.cpp to build also when RDK_USE_BOOST_SERIALIZATION is
not defined

* test should not fail when RDK_USE_BOOST_SERIALIZATION is not defined
2025-04-08 09:47:48 +02:00
Greg Landrum
981360f55b add a missed RDKIT_SYNTHONSPACESEARCH_EXPORT (#8399)
this prevented some win64 DLL builds
2025-04-01 12:54:55 +02:00
Paolo Tosco
a3cb75b769 Change size_t into std::uint64_t in SearchResults (#8392) 2025-03-29 18:22:31 +01:00
Ricardo Rodriguez
643b13cba4 switch to std::mt19937 (#8378) 2025-03-23 18:45:57 +01:00
David Cosgrove
620a16108d Synthon Search Phase 2 (#8338)
* Function for converting text to db file.

* Do search looping first on reactions, then on fragments.

* Add lowMem mode so reactions are only read from database as required.

* Move fragment fingerprint generation out of the inner loop.

* Put positions of SynthonSets directly in DB file so no need to read the file on initialisation.

* Update test binary file.

* Fix SynthonSpace.summarise() for new lowMem mode.

* Extra bits in Python wrapper.

* Correct docstring.

* Compute pattern fingerprints ahead of search.

* Put Synthons into hitsets.

* First stage of re-factoring SynthonSpace.  Synthons are highly duplicated in the SynthonSets, so are held centrally in a pool in SynthonSpace and just the pointers kept in the SynthonSets.  The same Synthon, identified by SMILES string, can have multiple IDs in the SynthonSets so the ID is now held by the SynthonSet not the Synthon.

* Second stage - moved the synthon FPs into the Synthon as well.

* New binary file format.

* Tidying and fix because Synthons are shared across SynthonSets.

* Use shorter fingerprints for synthons.

* Don't exit with a bad file.

* Back out the fingerprint folding which made things worse.
Don't copy the synthon molecules into the hit sets, just take a pointer.
Put the fragments into the corresponding hit set, useful for debugging.

* Change way hit names are made to the manner preferred by Enamine.

* Only generate query connector regions once.

* Do some of the connector region checking by SMILES.

* Move where it gets the connector combinations so it's not done unnecessarily often.

* Fix tests.

* Don't make molecules for the connector combinations, a bitset is plenty.

* Make a pool of fragment fingerprints to reduce the number in total.
Use an upper bound on the Tanimoto Coeff to reduce need for full calculation.

* Fix splitMolecule, which wasn't producing all possible fragments.

* Take out old code.

* Back to using unique_ptr for fragments.
Abolish maxBondSplits option. Use the maximum number of synthons in the space to control the splitting.

* Don't fold the reaction connector region fps into 1.

* Streamline connector combinations in substructure search.

* Re-factor fragment fingerprint generation prior to multi-threading.

* Make checkConnectorRegions return false when it should.
Tweak AllProbeBitsMatch.cpp.

* Fix Python wrapper of text file reader.

* More complex query shenanigans - amino acid this time.

* More complex query shenanigans - amino acid this time.

* Tidy.

* Fix binary DB read bug.
New Idorsia space file.

* Correct/improve function documentation.

* Tidying up.

* Remove stray include.

* Fix CI Tests.

* Plug memory leak.
Revise python timeout test.

* Simplify way synthon searchMols are created.  Previous method gave incorrect results sometimes hence new test.

* Update idorsia space file.

* Update idorsia test result.

* Update idorsia test result.

* Changes after first review.

* Move getFormattedNumProducts to general function.

* Stash working version with maps and mutex.

* Working with sorted vectors rather than maps.
Reading Text DB presumably slow.

* Split out MemoryMappedFileReader.cpp.

* Fix ReadDBFile in Python wrapper.

* Streamline tests.

* Include filesystem.

* Replace many uses of std::map with sorted std::vector.

* Use more auto.

* Threaded build hits.

* Threaded search.

* Don't chunk threaded buildAllHits.

* Allow for different results in random sampling.

* Threaded splitMolecule.
Fix bug - apply removeQueryAtoms to all frags, not just one per unique SMILES.
Do largest fragment heuristic up front so as not to repeat on each thread.

* Streamline Python tests.

* Separate out time-consuming tests.

* Add Rascal similarity searching.

* Add extended queries.

* Make extended queries honour maxHits correctly.

* Extra extended query test.

* Hide really long tests on local files.

* Remove local test.

* Make random tests less strict.
Attempt to fix build issues.

* Attempt to fix build issues.

* Response to review.

* Fix no-threads version.

* Re-move re-formatting.

* Add move semantics to MemoryMappedFileReader.

* Move c'tor needs size as well.
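
The list above mentions using an upper bound on the Tanimoto coefficient to avoid full calculations. The commit does not say which bound is used; the sketch below shows the standard bit-count bound as one possibility. With a and b bits set in the two fingerprints and c bits in common, T = c / (a + b - c) and c <= min(a, b), so T <= min(a, b) / max(a, b). All names here are hypothetical, and returning 0 when a fingerprint is empty is a convention of the sketch.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fingerprint representation: packed 64-bit words.
using Fingerprint = std::vector<std::uint64_t>;

std::size_t popCount(const Fingerprint &fp) {
  std::size_t n = 0;
  for (auto w : fp) {
    n += std::bitset<64>(w).count();
  }
  return n;
}

// Cheap bound from the two bit counts alone: T <= min(a, b) / max(a, b).
double tanimotoUpperBound(std::size_t a, std::size_t b) {
  if (a == 0 || b == 0) {
    return 0.0;
  }
  return static_cast<double>(std::min(a, b)) /
         static_cast<double>(std::max(a, b));
}

// Full calculation: common bits divided by bits set in either fingerprint.
double tanimoto(const Fingerprint &x, const Fingerprint &y) {
  assert(x.size() == y.size());
  std::size_t common = 0;
  std::size_t either = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    common += std::bitset<64>(x[i] & y[i]).count();
    either += std::bitset<64>(x[i] | y[i]).count();
  }
  return either == 0 ? 0.0 : static_cast<double>(common) / either;
}

// Skip the full calculation whenever the bound already rules the pair out.
// The bit counts are passed in because they can be computed once per
// fingerprint and reused across many comparisons.
bool passesThreshold(const Fingerprint &x, std::size_t xCount,
                     const Fingerprint &y, std::size_t yCount,
                     double threshold) {
  if (tanimotoUpperBound(xCount, yCount) < threshold) {
    return false;
  }
  return tanimoto(x, y) >= threshold;
}
```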

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-03-21 13:09:34 +01:00
David Cosgrove
c36276a7c0 Optimisation of fingerprint Synthon Search (#8223)
* Change how synthonsToUse is stored in SynthonSpaceHitSet.

* Sort fragments by descending similarity.

* Sort fragments by ascending size.

* Use pair not tuple.

* Un-cringe Greg.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-01-30 04:59:19 +01:00
David Cosgrove
59b4152974 Making SynthonSearch respond to Ctrl-C (#8153) 2025-01-22 11:43:51 +01:00
David Cosgrove
c2944e7050 Optimisations to fingerprint search of Synthon Space (#8152)
* First pass at approximate FP check.

* Tidy and Python wrapper.

* More tidying.

* Add addFP and subtractFP to binary file.

* Minor tidy.

* In splits code, check for duplicate fragmentations.

* Update test results.

* Tidy.

* Set configurable limit on number of fragments generated from query.

* Stash prior to trying counts fps.

* Stash count fps.

* Back to bit fingerprints again.

* Extra comment.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2025-01-22 05:44:56 +01:00
David Cosgrove
b8effb8b25 SynthonSpace search timeout (#8070)
* Add timeout to searches.

* Correct docstring.

* Include chrono header.

* Get it compiling with gcc.

* And then clang didn't like it...

* Revert to tmpnam as mkstemp isn't available on Windows.

* Response to review.
Run time no longer saved in SearchResults.
Timeout check not tied to size of results.
Made the test timeout shorter.

* Fix the Python wrapper.

* Shamelessly steal the better timeout method from PR8110.

* suggested changes

* be more conservative about what does not time out

the CI machines can be surprisingly slow
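
A deadline check that is independent of how many results have been collected might look like the sketch below. `SearchTimer` and `processUntilTimeout` are hypothetical names, and treating a zero timeout as "no timeout" is an assumption of the sketch rather than documented behaviour.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical deadline helper built on a monotonic clock.
class SearchTimer {
 public:
  using Clock = std::chrono::steady_clock;

  // A zero timeout disables the check in this sketch. The start time is a
  // parameter so that the behaviour can be exercised without sleeping.
  explicit SearchTimer(std::chrono::milliseconds timeout,
                       Clock::time_point start = Clock::now())
      : d_enabled(timeout.count() > 0),
        d_deadline(start +
                   std::chrono::duration_cast<Clock::duration>(timeout)) {
    assert(timeout.count() >= 0);
  }

  bool timedOut(Clock::time_point now = Clock::now()) const {
    return d_enabled && now >= d_deadline;
  }

 private:
  bool d_enabled;
  Clock::time_point d_deadline;
};

// Process items until all are done or the timer expires. The check runs on
// every item, whether or not that item produced a result. Returns the number
// of items processed.
template <typename Work>
std::size_t processUntilTimeout(std::size_t numItems, const SearchTimer &timer,
                                Work doItem) {
  std::size_t done = 0;
  for (; done < numItems; ++done) {
    if (timer.timedOut()) {
      break;
    }
    doItem(done);
  }
  return done;
}
```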

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2024-12-21 09:04:23 +01:00
David Cosgrove
ce35b3c25b SynthonSearch synth check (#8109)
* Catch errors when creating products.

* Extra python test.

* Fix formatting.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2024-12-18 05:12:07 +01:00
David Cosgrove
403cd55e6a Synthon search fp bug (#8086)
* Fix bug - connector patterns weren't being matched to the synthon connector patterns.

* Tiny tweak.

* Typo in comment.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2024-12-12 18:24:01 +01:00
David Cosgrove
d985a44f26 Handle DOS files in SynthonSpaceSearch (#8075)
* Handle DOS files.

* Smaller test file.
Add DOS file to .gitattributes.

* Update Code/GraphMol/SynthonSpaceSearch/substructure_search_catch_tests.cpp

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2024-12-09 17:29:17 +01:00
David Cosgrove
bbac292b4c Synthon fingerprint search (#8025)
* First pass at splitting molecule.

* Interim commit.  Reading libraries from file in original format.

* Basic search seems to be working.

* Pattern fingerprint screening.

* Connector region heuristic.

* Fixed triazole (aromatic/non-aromatic connectors).

* Fix search with non-split parent query, where query is substructure of a single reagent.

* Remove duplicate hits by reaction/reagents used.

* Implement largest fragment heuristic.

* Extra test files.

* Read/write binary file.
Program for conversion from text format to binary format.

* Remove empty reagent sets on reading, probably due to synthon number counting from 1 rather than 0.

* Tidy SSSearch functions.

* Stash pending major surgery for triazole bug.

* Revert to using unique_ptr.
Correct use of reagent order.

* Function to summarise Hyperspace.

* Delay building hits till end and put cutoff on number.

* Earlier bale-out in getHitReagents.

* Streamline checkConnectorRegions.

* Remove free functions for search.

* Correct name of Python test.

* First stage of Python wrappers.

* Rename namespace.

* Parameters object.

* Mysterious windows export thing.

* Fix bug - not matching number of connectors in fragment and synthon.

* Back like it was.  The connector count wasn't the problem.

* Put the substructure results into their own class.

* gcc 14 didn't like my use of std::reduce.
Update expected test results.

* Remove write statement.

* Tidy.

* Tidy.

* Enable random sample of hits.

* Test that complex SMARTS works.
Update Python wrappers.

* Rename Hyperspace to SynthonSpace.

* More renaming.
Python test.

* Enable Python test.
Remove write.

* Plug memory leak.

* Response to Greg's initial look.

* More response to Greg's initial look.

* get the windows DLL builds working

* Do away with mutable.
Purge a few more uses of reagent in favour of synthon.
Remove the c++ exe for converting text to binary databases.

* Better Synthon c'tor.

* More feedback from Greg.

* Tidy the Python wrapper.

* Remove tags from catch tests.

* Don't allow copying of SubstructureResults.

* Revert to allow copying of SubstructureResults.  The Python wrapper needs it.

* Refinements based on CLion/clangd suggestions.

* Allow for map numbers in connectors in space file.

* Refactor to make the searcher a separate class from the space.

* Transfer Greg's review suggestions from Hyperspace merge.

* First cut of fingerprint searcher.

* Python wrapper.
Some tidying.

* Better random selection.

* Fix bug in preparing frags for fingerprints.
Re-factor.

* Minor-refactor.

* Sort hits by similarity if available.

* Option for a few different fingerprint types.  Pending a better solution.

* Write fingerprints to binary file.

* Use any fingerprint generator for similarity searching.  No Python wrapper yet.

* Python wrapper.

* Change random selection to use distribution weighted by number of hits in each reaction.

* Lots of suggestions from CLion/clang.

* Use boost discrete_distribution for cross-platform consistency.

* Tidy test up.

* Try boost rng as well.

* uniform_int_distribution to boost also.

* Small tidy.

* Method to write enumerated library.

* Windows export thing.

* Windows export thing.

* Allow for commas in tab-separated fields.

* win64 dll builds now work

* More aliphatic synthon, aromatic product joy.

* Force ring finding if it hasn't been done.

* Fingerprint hits not being sorted if maxHits reached.

* Remove debugging write.  Doh!

* Response to review of SynthonSpace2.

* Missed one.

* Add test file.

* Hand merge Greg's #8050.

* Discard nodiscard.

* Move include of export.h inside include guards.

* Response to review.

* Fix memory leaks.
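
The weighted random selection mentioned in the list above (reactions picked in proportion to their number of hits) can be sketched as follows. The list says Boost's distributions were chosen for cross-platform consistency; this sketch uses `std::discrete_distribution` only to stay dependency-free, so its output may differ between standard libraries. `sampleReactions` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Pick numSamples reaction indices, each drawn with probability proportional
// to the number of hits that reaction can produce. Reactions with zero hits
// are never picked.
std::vector<std::size_t> sampleReactions(
    const std::vector<std::uint64_t> &hitsPerReaction, std::size_t numSamples,
    unsigned int seed) {
  // The distribution needs at least one non-zero weight.
  assert(std::any_of(hitsPerReaction.begin(), hitsPerReaction.end(),
                     [](std::uint64_t n) { return n > 0; }));
  std::mt19937 rng(seed);
  std::discrete_distribution<std::size_t> dist(hitsPerReaction.begin(),
                                               hitsPerReaction.end());
  std::vector<std::size_t> picks;
  picks.reserve(numSamples);
  for (std::size_t i = 0; i < numSamples; ++i) {
    picks.push_back(dist(rng));
  }
  return picks;
}
```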

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2024-11-29 13:07:32 +01:00
David Cosgrove
43229cf933 Synthon space2 (#8048)
* Fix for connector regions and missing ringinfo.

* Merge in the fix for comma-separated names in tab-separated space files.

* Response to review.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
2024-11-28 13:12:35 +01:00
David Cosgrove
eaf544ab6f SynthonSpace Search (#7978)
* First pass at splitting molecule.

* Interim commit.  Reading libraries from file in original format.

* Basic search seems to be working.

* Pattern fingerprint screening.

* Connector region heuristic.

* Fixed triazole (aromatic/non-aromatic connectors).

* Fix search with non-split parent query, where query is substructure of a single reagent.

* Remove duplicate hits by reaction/reagents used.

* Implement largest fragment heuristic.

* Extra test files.

* Read/write binary file.
Program for conversion from text format to binary format.

* Remove empty reagent sets on reading, probably due to synthon number counting from 1 rather than 0.

* Tidy SSSearch functions.

* Stash pending major surgery for triazole bug.

* Revert to using unique_ptr.
Correct use of reagent order.

* Function to summarise Hyperspace.

* Delay building hits till end and put cutoff on number.

* Earlier bale-out in getHitReagents.

* Streamline checkConnectorRegions.

* Remove free functions for search.

* Correct name of Python test.

* First stage of Python wrappers.

* Rename namespace.

* Parameters object.

* Mysterious windows export thing.

* Fix bug - not matching number of connectors in fragment and synthon.

* Back like it was.  The connector count wasn't the problem.

* Put the substructure results into their own class.

* gcc 14 didn't like my use of std::reduce.
Update expected test results.

* Remove write statement.

* Tidy.

* Tidy.

* Enable random sample of hits.

* Test that complex SMARTS works.
Update Python wrappers.

* Rename Hyperspace to SynthonSpace.

* More renaming.
Python test.

* Enable Python test.
Remove write.

* Plug memory leak.

* Response to Greg's initial look.

* More response to Greg's initial look.

* get the windows DLL builds working

* Do away with mutable.
Purge a few more uses of reagent in favour of synthon.
Remove the c++ exe for converting text to binary databases.

* Better Synthon c'tor.

* More feedback from Greg.

* Tidy the Python wrapper.

* Remove tags from catch tests.

* Don't allow copying of SubstructureResults.

* Revert to allow copying of SubstructureResults.  The Python wrapper needs it.

* Refinements based on CLion/clangd suggestions.

* Allow for map numbers in connectors in space file.

* Response to review.

* update binary file spec

* Changes after review.

---------

Co-authored-by: David Cosgrove <david@cozchemix.co.uk>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
2024-11-17 08:13:54 +01:00