rdkit/Code
Dan N b5dcb21fef Improve performance of aromaticity detection for large molecules (#3253)
* remove trailing spaces

* 3256: Envelope aromaticity not detected in complex fused system

Removes the stopping point in aromaticity detection when all atoms
are "done". This also markedly improves the performance of
aromaticity detection for very large molecules - for example,
aromatization of 3EOH from the PDB was dominated by done-atom
checking before this commit.

Some aromatic bonds were missed before this commit in complex fused
systems. This happened if all atoms in the fused system were also
in some smaller aromatic ring and there was at least one fused edge
that was single in the Kekulé form.

Some example molecules for which envelope aromaticity failed
before this commit:

c1cc2n(c1)c1cccn1c1cccn21
-> became c1cc2n(c1)-c1cccn1-c1cccn1-2 before this commit
c1cc2c3cc[nH]c3n3cccc3n2c1
-> became c1cc2n(c1)-c1cccn1-c1[nH]ccc1-2 before this commit
c1cc2c3cc[nH]c3c3cc[nH]c3n2c1
-> became c1cc2n(c1)-c1[nH]ccc1-c1[nH]ccc1-2 before this commit

Here's a similar example that didn't fail even before this
commit. The central ring only shares double bonds with the
exterior rings.
* c1cc2c([nH]1)c1cc[nH]c1c1cc[nH]c21

Requires updates to some MQN descriptor tests because some
bonds become aromatic (MQN includes counts of the single and
double bonds of the Kekulé form).

FWIW, for the molecule that had a change in counts, the counts
were incorrect both before and after this commit, because
MQN uses an approximation (dividing aromatic bonds evenly
between single and double bonds) to avoid kekulization.
This approximation is invalid when many nitrogen lone
pairs participate in the aromatic bonds.
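
To make the even-split approximation concrete, here is a minimal Python sketch. It is not RDKit's MQN code: the function name `approx_bond_counts` is hypothetical, and how MQN rounds a half bond is not stated above, so the sketch keeps exact fractions. The split is exact for benzene but not for pyrrole, where the nitrogen lone pair takes part in the aromatic system.

```python
from fractions import Fraction


def approx_bond_counts(n_single, n_double, n_aromatic):
    """Split aromatic bonds evenly between the single and double counts.

    Hypothetical helper illustrating the approximation described above;
    it does not reproduce RDKit's MQN implementation or its rounding.
    """
    half = Fraction(n_aromatic, 2)
    return n_single + half, n_double + half


# Benzene ring bonds: 6 aromatic bonds, Kekule form has 3 single + 3 double.
benzene_approx = approx_bond_counts(0, 0, 6)
benzene_exact = (3, 3)

# Pyrrole ring bonds: 5 aromatic bonds, Kekule form has 3 single + 2 double,
# because the N lone pair supplies two of the six pi electrons.
pyrrole_approx = approx_bond_counts(0, 0, 5)
pyrrole_exact = (3, 2)

print(benzene_approx == benzene_exact)  # the even split is exact here
print(pyrrole_approx == pyrrole_exact)  # the even split is off by half a bond
```

Each additional lone-pair nitrogen in an aromatic ring shifts the true Kekulé counts further from the even split.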

(the failing line was 2558 in aromat_regress.txt: Cc1cc2n(n1)c1cc(C)nn1c1c(C=O)c(C)nn21)

* Detect envelope aromaticity in fused systems

In #3253, we proposed removing doneAtoms for performance, and it was
noted that it also fixed detection of envelope aromaticity in some
fused systems. However, when I completely removed doneAtoms, I saw
hangs in sanitization of things like nanotubes. Using doneBonds
allows envelope aromaticity, while preserving a reasonable brake
on runaway work for crazy molecules.
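
The difference between the two stopping criteria can be illustrated with a toy Python sketch on the first failing example, c1cc2n(c1)c1cccn1c1cccn21. This is not RDKit's implementation: the atom indices are hand-derived from the SMILES, and `ring_bonds`, `done_atoms` and `done_bonds` are illustrative names only. After the three pyrrole rings are marked, every atom of the central ring is already "done", but three of its bonds are not.

```python
def ring_bonds(ring):
    """Return the bonds of a ring (an ordered atom list) as a set of frozensets."""
    n = len(ring)
    return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}


# Atom indices hand-derived from c1cc2n(c1)c1cccn1c1cccn21 in SMILES order.
# This is an assumption of the sketch, not output from RDKit.
pyrroles = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]]
central = [2, 3, 5, 9, 10, 14]

# Pretend the three small rings have already been flagged aromatic.
done_atoms = set()
done_bonds = set()
for ring in pyrroles:
    done_atoms.update(ring)
    done_bonds.update(ring_bonds(ring))

# Atom-based criterion: the central ring looks finished and would be skipped.
skipped_by_atoms = set(central) <= done_atoms

# Bond-based criterion: the bonds joining the pyrroles are still unvisited.
remaining = ring_bonds(central) - done_bonds
skipped_by_bonds = not remaining

print(skipped_by_atoms)   # True
print(skipped_by_bonds)   # False
print(sorted(sorted(b) for b in remaining))  # [[2, 14], [3, 5], [9, 10]]
```

The three remaining bonds are the ones that were written as single bonds in the pre-commit output c1cc2n(c1)-c1cccn1-c1cccn1-2.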

The performance issue was addressed by caching the ring bond
count.

Here are some sanitize timings on proteins from the RCSB PDB:
Before this commit:
* 3eoh 1.21s
* 2j3n 0.77s
* 1nks 0.053s

Afterwards:
* 3eoh 0.42s
* 2j3n 0.15s
* 1nks 0.046s
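
For reference, the speedups implied by the timings above can be computed with a few lines of Python. Only the numbers quoted in this message are used; the dictionary names are illustrative.

```python
# Sanitize timings in seconds, copied from the lists above.
before = {"3eoh": 1.21, "2j3n": 0.77, "1nks": 0.053}
after = {"3eoh": 0.42, "2j3n": 0.15, "1nks": 0.046}

# Speedup factor = time before / time after, rounded to two decimals.
speedup = {pdb: round(before[pdb] / after[pdb], 2) for pdb in before}

for pdb, factor in speedup.items():
    print(f"{pdb}: {factor}x")
# 3eoh: 2.88x
# 2j3n: 5.13x
# 1nks: 1.15x
```

The gain is largest for the bigger structures and small for 1nks.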

* Use boost::dynamic_bitset instead of unordered_set

To count ring bonds.
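
As a rough illustration of the bitset idea, the Python sketch below uses an integer as a bit field in place of boost::dynamic_bitset. It assumes that "counting ring bonds" means counting the distinct bond indices covered by a group of rings; the function name `count_ring_bonds` and its arguments are hypothetical and do not mirror the RDKit C++ code.

```python
def count_ring_bonds(rings, n_bonds):
    """Count distinct bond indices covered by `rings` using a bit field.

    `rings` is a list of rings, each given as a list of bond indices.
    A Python int stands in for boost::dynamic_bitset here (an assumption
    of this sketch); one bit per bond replaces a hash-set insertion.
    """
    bits = 0
    for ring in rings:
        for bond_idx in ring:
            if not 0 <= bond_idx < n_bonds:
                raise IndexError(f"bond index {bond_idx} out of range")
            bits |= 1 << bond_idx  # setting a bit twice is harmless
    return bin(bits).count("1")  # population count


# Two five-membered rings sharing bond 4: ten entries, nine distinct bonds.
fused = [[0, 1, 2, 3, 4], [4, 5, 6, 7, 8]]
print(count_ring_bonds(fused, n_bonds=9))  # 9
```

Because bond indices are small, dense integers, a bit field avoids hashing and per-element allocation, which is the usual reason to prefer it over a hash set for this kind of counting.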
2020-08-13 05:57:16 +02:00