1) UoB
Abstract
Chemical molecular diagrams are commonly found in documents
from the chemical and life science disciplines. We present an
overview of the elements of these diagrams and of MolRec, our
system for analysing and recognising them. MolRec uses a
number of techniques to refine the scanned images and
precisely detect line segments and line junctions, structural
elements and the atomic formulae that commonly appear in such
diagrams. The output of our system is a chemical formula and
associated MOL file, a standard representation of molecular
structures used in cheminformatics that records precise
molecular spatial and connectivity information. When applied
to the TREC 2011 test set of 1000 molecular diagrams, MolRec
returned in two separate runs 949 and 950 correctly recalled
structures, respectively. We discuss these results and present
an analysis of MolRec’s performance on the test set.

[Figure 1, panels (a) to (i): (a) Single Planar, (b) Double
Planar, (c) Triple Planar, (d) Wedge, (e) Hollow wedge,
(f) Bold, (g) Dashed wedge, (h) Dashed, (i) Dashed bold]
from the shortest to the longest dash. For a dashed bond and
bold bond, the direction and stereo-centre are unspecified
and have to be identified.
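The ordering rule mentioned above (from the shortest to the longest dash) might be sketched as follows. This is only an illustration under the assumption that each dash of a dashed wedge has already been detected and reduced to a centre point and a length; the `Dash` type and the function name are hypothetical and are not taken from MolRec.

```python
from dataclasses import dataclass


@dataclass
class Dash:
    """One detected dash: centre coordinates and stroke length (hypothetical type)."""
    cx: float
    cy: float
    length: float


def dashed_wedge_direction(dashes):
    """Return (start, end) points of the bond direction.

    The direction runs from the centre of the shortest dash to the
    centre of the longest dash, following the rule stated in the text.
    """
    if len(dashes) < 2:
        raise ValueError("need at least two dashes to determine a direction")
    shortest = min(dashes, key=lambda d: d.length)
    longest = max(dashes, key=lambda d: d.length)
    return (shortest.cx, shortest.cy), (longest.cx, longest.cy)
```

For plain dashed and bold bonds no such length gradient exists, which is why their direction has to be resolved by other means.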
A wavy bond, as in Figure 1(j), is used to show an unspec-
ified configuration (mixture of up and down).
As Figure 1(k) shows, an arrow is used to illustrate a dative
(polar) bond. The direction of the arrow is from source-to-head
and it indicates the existence of a negatively charged atom at
the head of the arrow.

[Figure 5: Superatom]
Further 3-dimensional structure can be depicted with bridge
bonds, in case there are multiple different connection paths
between different parts of the molecule. These are typically
presented in a 2½-dimensional perspective drawing form. Such
diagrams have one or more foreground bonds drawn crossing one
or more background bonds, where foreground and background
bonds are not connected where they appear to touch in the
diagram. If the background bond is drawn with a gap to make
this clear, it is called an open bridge bond, otherwise it is
called a closed bridge bond (c.f. Figure 3).

2 Implementation
MolRec’s recognition procedure consists of a series of steps,
of which we present the most important ones in this section.
After initial binarisation of the input image, connected
components are labelled and fed into a simple metric space
based OCR engine to identify character symbols, which
are subsequently combined into character groups. Then we
recognise bonds based on a rule set for rewriting basic graph-
ical elements. This forms the basis of a graph structure,
which can be translated into the MOL output format, after
embedding of superatoms and further resolution of ambigu-
ous stereo bonds.
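The first two steps of this procedure, binarisation and connected component labelling, can be illustrated with a small dependency-free sketch. It assumes a greyscale image given as a list of rows, a fixed global threshold of 128 and 8-connectivity; these choices and all function names are assumptions made for illustration and are not necessarily those used in MolRec.

```python
from collections import deque


def binarise(gray, threshold=128):
    """Return a grid of booleans: True where a pixel is dark (ink)."""
    return [[value < threshold for value in row] for row in gray]


def label_components(ink):
    """Group ink pixels into 8-connected components using breadth-first search."""
    height = len(ink)
    width = len(ink[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not ink[y][x] or seen[y][x]:
                continue
            # Start a new component at the first unseen ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and ink[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a component."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return (min(rows), min(cols), max(rows), max(cols))
```

In a pipeline of the kind described, each component's bounding box and pixel set would then be handed to the character classifier, and components not recognised as characters would be treated as candidate bond graphics.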
A full specification of our rule set for bond recognition as
well as a more detailed description of the entire recognition
procedure can be found in [6].
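To make the final translation step more concrete, the following sketch serialises a small atom and bond graph into a MOL block. It assumes the V2000 flavour of the MOL format and 2D coordinates; the function name and the input tuples are hypothetical, and the stereo codes used (0 for none, 1 for wedge, 4 for wavy, 6 for dashed wedge) follow the V2000 convention for single bonds rather than anything stated in this paper.

```python
def mol_block(name, atoms, bonds):
    """Serialise a molecular graph as a V2000 MOL block (illustrative sketch).

    atoms: list of (symbol, x, y) tuples
    bonds: list of (from_index, to_index, order, stereo) with 1-based atom indices
    """
    # Three header lines: molecule name, program/timestamp line, comment line.
    lines = [name, "", ""]
    # Counts line: atom count, bond count, eight unused fields, then "999 V2000".
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    # Atom block: x, y, z coordinates, element symbol, then twelve zero fields.
    for symbol, x, y in atoms:
        lines.append(
            f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3}{0:2d}" + "  0" * 11
        )
    # Bond block: the two atom indices, bond order, stereo code, three zero fields.
    for a, b, order, stereo in bonds:
        lines.append(f"{a:3d}{b:3d}{order:3d}{stereo:3d}" + "  0" * 3)
    lines.append("M  END")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three atoms in a chain; the second bond is marked as a wedge (stereo code 1).
    atoms = [("C", 0.0, 0.0), ("C", 1.5, 0.0), ("O", 2.25, 1.3)]
    bonds = [(1, 2, 1, 0), (2, 3, 1, 1)]
    print(mol_block("example", atoms, bonds), end="")
```

Superatoms and ambiguous stereo bonds would have to be resolved before this point, since a block of this form only carries explicit atoms and bonds with a fixed stereo code.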
[Figure 13: Problematic Closed and Open Bridge Bonds. Panels:
(a) Closed Bridge Bond, (b) Open Bridge Bond]

One image, displayed in Figure 16, shows a dashed wedge
bond with a wavy line crossing it. We are not familiar
with this notation and do not know how to interpret it. An
analysis of the corresponding solution MOL file showed the
same contents as if the wavy line were not present at all.
MolRec interpreted the wavy line as a wavy bond (c.f. Figure
1(j)) and recognised the crossing of the wavy bond with
the dashed wedge as 4 bonds connected at the centre: two
dashed wedges and two wavy bonds.

3.8 Unhandled Bond Type
The dashed dative bond is shown in Figure 14. We have not
encountered such a symbol before and are not clear about its
intended interpretation. The solution MOL file interprets it
as a planar single bond. Since MolRec’s bond recognition