Performance of MolRec at TREC 2011

Overview and Analysis of Results


Noureddin M. Sadawi, Alan P. Sexton, and Volker Sorge

School of Computer Science


University of Birmingham
Email: N.M.Sadawi|A.P.Sexton|[email protected]
URL: www.cs.bham.ac.uk/~nms|~aps|~vxs

Abstract

Chemical molecular diagrams are commonly found in documents from the chemical and life science disciplines. We present an overview of the elements of these diagrams and of MolRec, our system for analysing and recognising them. MolRec uses a number of techniques to refine the scanned images and precisely detect line segments and line junctions, structural elements and the atomic formulae that commonly appear in such diagrams. The output of our system is a chemical formula and associated MOL file, a standard representation of molecular structures used in cheminformatics that records precise molecular spatial and connectivity information. When applied to the TREC 2011 test set of 1000 molecular diagrams, MolRec returned in two separate runs 949 and 950 correctly recalled structures, respectively. We discuss these results and present an analysis of MolRec’s performance on the test set.

1 Overview of Diagram Elements

Molecular diagrams generally consist of a combination of characters denoting names of atoms (e.g., O) or more complex molecules (e.g., HO) together with graphical elements depicting chemical bonds. The latter can be of a number of different types and their combination determines the overall 3-dimensional structure of the entire molecule.

An overview of the different common types of bonds is given in Figure 1. One or more normal line segments are used to show normal bonds (planar bonds) as in Figures 1(a), 1(b) and 1(c). Parallel line segments close together where each is of approximately the average bond length indicate a double (or triple) bond. Parallel line segments where at least some are shorter than the others indicate a sequence of separate bonds with an omitted carbon atom at the node, called an implicit node, identified by the end of the shorter line segments, as in Figure 2.

Figure 1: Common Bond Conventions. (a) Single Planar, (b) Double Planar, (c) Triple Planar, (d) Wedge, (e) Hollow wedge, (f) Bold, (g) Dashed wedge, (h) Dashed, (i) Dashed bold, (j) Wavy, (k) Dative.

To indicate 3-dimensional structure of molecular diagrams, bonds are drawn in different styles to indicate a direction with respect to the drawing surface. These styles determine what chemists term the stereo-centre of the bond. The solid wedge, hollow wedge or bold line segment, in Figures 1(d), 1(e) and 1(f) respectively, are used to show bonds coming out of the plane of the drawing surface (towards the viewer). The direction of the solid and hollow wedge bonds is determined by the tip-to-base direction, meaning the stereo-centre is at the narrow end. In the bold bond case both direction and stereo-centre are unspecified and have to be determined somehow, for example by using chemical domain knowledge, for a correct recognition of the diagram.

A dashed wedge, a dashed line segment and a dashed bold line segment, Figures 1(g), 1(h) and 1(i) respectively, are used to depict bonds going behind the plane of the drawing surface (away from the viewer). These also have directions to specify the stereo-centre. For a dashed wedge bond, the stereo-centre is at the shortest dash, so the direction is
from the shortest to the longest dash. For a dashed bond and bold bond, the direction and stereo-centre are unspecified and have to be identified.

Figure 2: Implicit Nodes (indicated by shaded disks) in Bond Sequences. Panels (a) to (d).

A wavy bond, as in Figure 1(j), is used to show an unspecified configuration (mixture of up and down).

As Figure 1(k) shows, an arrow is used to illustrate a dative (polar) bond. The direction of the arrow is from source-to-head and it indicates the existence of a negatively charged atom at the head of the arrow.

Further 3-dimensional structure can be depicted with bridge bonds, in case there are multiple different connection paths between different parts of the molecule. These are typically presented in a 2½-dimensional perspective drawing form. Such diagrams have one or more foreground bonds drawn crossing one or more background bonds, where foreground and background bonds are not connected where they appear to touch in the diagram. If the background bond is drawn with a gap to make this clear, it is called an open bridge bond, otherwise it is called a closed bridge bond (cf. Figure 3).

Figure 3: Closed and Open Bridge Bonds. (a) Closed Bridge Bond, (b) Open Bridge Bond.

Aromatic rings are sometimes drawn with a large circle inside a cycle of bonds, as in Figure 4, instead of separate planar bonds.

Figure 4: Aromatic Ring. Panels (a) and (b).

Superatoms are names that are embedded in a diagram as if it were an atom (cf. Figure 5). These names represent whole molecular substructures that can have multiple bonds to the surrounding molecule. Superatom names are typically meaningful to chemists but syntactically in the diagram give little or no clue to their actual content. Hence they can only be supported through some variety of dictionary mechanism.

Figure 5: Superatom

2 Implementation

MolRec’s recognition procedure consists of a series of steps, of which we present the most important ones in this section.

After initial binarisation of the input image, connected components are labelled and fed into a simple metric space based OCR engine to identify character symbols, which are subsequently combined into character groups. Then we recognise bonds based on a rule set for rewriting basic graphical elements. This forms the basis of a graph structure, which can be translated into the MOL output format, after embedding of superatoms and further resolution of ambiguous stereo bonds.

A full specification of our rule set for bond recognition as well as a more detailed description of the entire recognition procedure can be found in [6].

2.1 Character Grouping

Letters, numerals and some symbols are taken to indicate atoms or superatoms. Any such components identified during the OCR process are grouped to form labels. Grouping is performed horizontally, vertically and diagonally.

Let a and b be elements of N ∪ L ∪ S, where N is the set of digits, L the set of letters and S the set of non-letter, non-digit symbols. If a and b are within a preset distance of each other, a will be grouped horizontally, vertically or diagonally with b as follows:
Horizontal grouping is performed if the spatial relation between a and b is horizontal and one of the following conditions holds:

i) Both a and b are letters
ii) Both a and b are digits
iii) a is a letter and b is a symbol, or vice versa.

Vertical grouping is performed if the spatial relation between a and b is vertical and:

i) Both a and b are letters

Diagonal grouping is performed if the spatial relation between a and b is diagonal and one of the following conditions holds:

i) a is a letter and b is a digit, or vice versa.
ii) a is a letter and b is a charge sign, or vice versa.

Figure 6: Character Groups. Panels (a) to (e).

Some examples are shown in Figure 6. The top character group of Figure 6(a) shows a letter to the right of a symbol “(C”, a letter to the right of another letter “CH”, a digit to the bottom right of a letter “H2”, a symbol to the top right of a digit “2)”, a digit to the bottom right of a symbol “)1” and a digit to the right of another digit “16”. Also, these rules will not allow grouping of very close characters such as the ones shown in Figures 6(b), 6(c) and 6(d). The example shown in Figure 6(e), which is not entirely clear even to a human reader, is not disambiguated by the previous rules.

Some cases requiring disambiguation include the upper-case letter “O”, lower case “l”, upper case “I”, etc.

2.2 Line Finding and Recognition of Bonds

Any detected character groups are erased from the image and the new image is thinned to unit width thickness. The line segments, which can be free standing or connected as polylines, are traversed and split at junctions where more than 2 line segments join. An average line width is calculated during this stage. This results in a set of polylines that is used as input to the Douglas-Peucker line simplification algorithm [1]. The result of using this algorithm is a set of straight line segments and an average line length.

Parallel line segments indicating double or triple bonds or bond sequences with implicit nodes are all identified by clustering line segments of the same slope that are within a threshold distance of each other. In the bond sequence case, the long line segments are split at the points where the short line segments end, so that a node to hold the implied atom is created.

A sequence of short parallel line segments spaced apart at regular distances is detected by identifying short line segments whose centre points are regularly spaced within a certain tolerance. A sequence of short parallel line segments of monotonically changing length represents a dashed wedge where the stereo-centre is identified as the atom at the shortest line segment of the wedge, while short line segments of similar length represent a dashed bold bond. A more direct approach based on using their slope is not reliable because of the difficulty of accurately finding the slope of such short line segments. Dashed bonds are detected by identifying repeated short line segments of similar length whose centre points are collinear. Again, the stereo-centre in the case of dashed bold bond and dashed bond is unknown. Our method for identifying the stereo-centre is explained in section 3.1.

To extract precise geometric information about bonds in the shape of solid wedges, we use a disk of a radius larger than a measured average line thickness. The measured thickness is obtained dynamically by analysing discovered lines in the image. This disk can fit inside the base of a wedge (triangle) but not in a normal line. We grow the disk until it reaches the largest size possible while still covering only foreground pixels in the original image. Then we walk the position of the disk in any direction that allows it to continue to grow. If this object is indeed a wedge, then when it can grow no more we have found the base of the triangle, thus identifying the stereo-centre for this 3D bond. We can then walk the disk along in the direction of slowest decrease of disk size to find the opposite end of the bond. If it is not a triangle, then the disk size will not change appreciably over the length of the thinned line segment corresponding to this connected component. In this case we recognise the object as a thick solid line segment (cf. Figure 7), but then the stereo-centre is unknown, and we just need to identify the two end points (more on this in section 3.1).

Figure 7: Using Disk to Identify Wedge/Bold Bonds
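The line simplification step described in Section 2.2 can be sketched as follows. This is a minimal, generic version of the Douglas-Peucker algorithm [1], not MolRec's own code: the function names, the tolerance value `epsilon=0.5` and the sample `skeleton_path` are assumptions made for illustration, and distances are measured to the chord segment between the two end points.

```python
import math


def point_to_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # degenerate chord: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the chord and clamp the projection to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping only points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    split, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_to_segment_distance(points[i], first, last)
        if d > dmax:
            split, dmax = i, d
    if split == 0 or dmax <= epsilon:
        return [first, last]  # every interior point is close enough to the chord
    left = douglas_peucker(points[:split + 1], epsilon)
    right = douglas_peucker(points[split:], epsilon)
    return left[:-1] + right  # the split point is kept exactly once


def average_segment_length(points):
    """Mean length of consecutive segments (needs at least two points)."""
    lengths = [math.dist(p, q) for p, q in zip(points, points[1:])]
    return sum(lengths) / len(lengths)


# Hypothetical thinned path: a slightly noisy horizontal run followed by a steep run.
skeleton_path = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (4, 3), (5, 6)]
simplified = douglas_peucker(skeleton_path, epsilon=0.5)
```

With this tolerance the six-point path reduces to its three corner points, giving two straight segments whose mean length can then be taken with `average_segment_length`. In practice the tolerance would presumably be tied to the measured line width, but the text does not state how MolRec chooses it.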
Wavy bonds, which are drawn as a wave pattern, are reduced by our thinning and line simplification process to a sawtooth pattern polyline of connected short line segments. This is straightforward to identify when following the polyline.

2.3 Graph Construction

At this stage, an initial undirected graph is constructed where each bond is an edge and each junction is a node. This is done by grouping line segment endpoints by distance and by connectivity to the bounding box of character groups in order to construct each node. Each vertex of the graph is labelled with the character group at the corresponding position. Bonds and character groups in the graph are examined for common causes of ambiguity. Disambiguation of lower case “l”, upper case “I”, the digit “1” and a vertical single bond is carried out at this point.

2.4 Superatom Embedding

We mined MOL files in the OSRA dataset [2], and integrated the freely available Marvin abbreviation group collection superatom dictionary [4] to identify complete information about superatoms. Superatoms in MOL file structures identify the superatoms in situ, marking each atom of the whole structure that belongs to the superatom, together with the internal bonds in the superatom and the external bonds between the superatom and the surrounding structure. We then replace the superatom nodes in the graph with an embedding of the structure of the superatom.

This still leaves us with the connection permutation problem, but at least gives us unambiguous internal connection information and identification of the connecting atoms.

2.5 Stereo Bond Resolution

For 3D-bonds with unknown stereochemistry, as in Figure 8, MolRec needs to decide the stereo nature of the bond, i.e. which side of the bond is the stereo-centre, and, in the case of the wavy bond, whether the direction of the bond is towards the background or the foreground of the image.

Figure 8: Bonds with Unknown Stereochemistry. (a) Bold bond, (b) Dashed bond, (c) Dashed bold bond.

The syntax of the diagram does not give sufficient information to resolve this correctly so, in the absence of better domain knowledge, we employ a number of heuristics based on observed patterns in diagrams we have analysed. These heuristics use information about the numbers of neighbours on each end of the bond in question.

2.6 Output Generation

Finally, the MOL file [8] is generated from the graph. For training and evaluation, we used OpenBabel [5] which provides the ability to compare different MOL files semantically, ignoring unimportant syntactic differences.

3 Analysis of MolRec’s Performance

When run twice on the 1000 images in the TREC11 data set, MolRec achieved a 95% and a 94.9% correct recovery rate, respectively. This corresponds to 50 diagrams mis-recognised in the first run, and 51 in the second. In fact, because most of the diagrams mis-recognised in the first were also mis-recognised in the second, where the internal parameters in MolRec were slightly adjusted, there were a total of 55 different diagrams mis-recognised in one or other of the two runs. Some of these 55 diagrams failed for multiple reasons, so we were able to identify 61 reasons for diagram recognition failures in total.

Table 1 summarises the reasons for failed recognition together with the exact number of mis-recognised images for each reason. In the remainder of this section we discuss each of the cases with some suggestions for future improvements.

Reason                                      #Images
Incorrect stereochemistry                   10
Solid Circles without 3D Hydrogen Bond      5
Image has touching components               6
Image has broken characters                 3
Incorrect character grouping                5
Connectivity of superatoms                  3
Problematic Bridge Bonds                    3
Unhandled bond type                         1
Unrecognised syntax                         5
Dashed wedge bonds mis-identified           15
Diagram caption confusion                   5

Table 1: Reasons for Mis-Recognition of Molecules.

3.1 Incorrect Stereochemistry

The stereochemistry of some bonds is not derivable purely from the syntactic properties of the diagram and, in the absence of deeper domain knowledge, our use of heuristics reduces the number of incorrect choices of 3D bond direction but does not eliminate them. When our heuristics guessed the wrong stereochemistry in such cases, a mis-recognition occurs. Adding further domain knowledge should improve recognition rate here.
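The figures quoted at the start of Section 3 can be cross-checked against Table 1 with a few lines of arithmetic. The sketch below only restates numbers from the text; the variable names are illustrative, and the overlap between the two runs is derived by inclusion-exclusion rather than reported by the authors.

```python
# Failure reasons and image counts, copied from Table 1.
table1 = {
    "Incorrect stereochemistry": 10,
    "Solid Circles without 3D Hydrogen Bond": 5,
    "Image has touching components": 6,
    "Image has broken characters": 3,
    "Incorrect character grouping": 5,
    "Connectivity of superatoms": 3,
    "Problematic Bridge Bonds": 3,
    "Unhandled bond type": 1,
    "Unrecognised syntax": 5,
    "Dashed wedge bonds mis-identified": 15,
    "Diagram caption confusion": 5,
}

total_images = 1000
correct_per_run = (950, 949)   # 95% and 94.9% of 1000 images
distinct_failures = 55         # diagrams failing in at least one run

total_reasons = sum(table1.values())
failures_per_run = [total_images - c for c in correct_per_run]
recovery_rates = [100 * c / total_images for c in correct_per_run]

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
failed_in_both_runs = sum(failures_per_run) - distinct_failures

print(total_reasons, failures_per_run, failed_in_both_runs)
```

The reason counts sum to the 61 failure reasons stated in the text, and the per-run failures of 50 and 51 together with 55 distinct failures imply that 46 diagrams were mis-recognised in both runs.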
3.2 Solid Circles without 3D Hydrogen Bond

A number of diagrams in the test set use the solid circle notation that indicates a 3D Hydrogen bond, but without that stereo bond information appearing in the solution MOL files. This seems to be a particularity of some of the samples in the test set, an issue that is further discussed in section 4. MolRec always generates the stereo bond information in its output MOL files and thus does not match the solution MOL files in these cases.

3.3 Touching Components

MolRec does not currently handle touching components such as in Figure 9. These can include:

• Touching characters, although MolRec handles ligatures such as NH (where the serifs touch).
• Letters touching symbols
• Characters touching bonds, although MolRec handles the common pattern where a diagonal bond is glued to a character from the bottom.
• Some cases of bonds accidentally touching bonds, usually due to ink bleed between close parallel lines making a connection that should not be there.

Figure 9: Example of Touching Components. Panels (a) and (b).

While a number of solutions have been proposed in the literature to handle touching characters, this problem is notoriously difficult.

3.4 Broken Characters

MolRec does not currently handle broken characters such as in Figure 10. As with touching characters this is a known difficult problem in document analysis.

Figure 10: Example of Broken Characters. Panels (a) and (b).

3.5 Incorrect Character Grouping

Some characters were incorrectly grouped because they were too close to each other for MolRec to reliably separate, as in Figure 11. This could be improved by adding more knowledge on what molecule groupings are permissible.

Figure 11: Incorrect Character Grouping. Panels (a) and (b).

3.6 Connectivity of Superatoms

If there are two or more bonds between the superatom and the surrounding structure, it is sometimes unclear how to determine the appropriate permutation of connection possibilities to match the actual chemical structures in question.

Figure 12: Example of Superatoms. Panels (a) to (c).

MolRec essentially makes a random guess as to the correct permutation, and therefore gets it wrong in some cases. Again, additional knowledge on the structure of superatoms could reduce this problem.
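To make the size of the ambiguity in Section 3.6 concrete, the sketch below enumerates every way of attaching the external bonds of a superatom to its connecting atoms. It is purely illustrative and does not reproduce MolRec's code: the function `connection_candidates` and the bond and atom labels are hypothetical, and each external bond is assumed to attach to a distinct connecting atom.

```python
from itertools import permutations


def connection_candidates(external_bonds, connecting_atoms):
    """Yield every one-to-one assignment of external bonds to connecting atoms."""
    for order in permutations(connecting_atoms, len(external_bonds)):
        yield dict(zip(external_bonds, order))


# Hypothetical superatom with two external bonds and two connecting atoms.
external_bonds = ["bond_left", "bond_right"]
connecting_atoms = ["atom_1", "atom_4"]

candidates = list(connection_candidates(external_bonds, connecting_atoms))
for candidate in candidates:
    print(candidate)
```

With n external bonds and n connecting atoms there are n! candidates, so two bonds already leave two possible embeddings. Picking one at random, as the text describes, is correct only some of the time; additional structural knowledge would be needed to rank the candidates.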
3.7 Problematic Bridge Bonds

MolRec detects open bridge bonds when a broken straight line with another straight line passing through the gap is found. The broken line is reconnected. Closed bridge bonds are identified by checking the crossing junctions and checking that the lines forming the junction are part of irregularly shaped cycles.

Non-bridge bond cycles are always regularly shaped pentagons, hexagons etc. MolRec detects such irregularities by finding junction angles outside expected ranges and interprets the structure accordingly. However, in some cases in the test set, either the angles involved were outside the thresholds used by MolRec or the perspective was sufficiently extreme that MolRec confused which node should be associated with certain line segment endings. Figure 13 presents two such examples. Here further fine tuning of our recognition approach is needed.

Figure 13: Problematic Closed and Open Bridge Bonds. (a) Closed Bridge Bond, (b) Open Bridge Bond.

3.8 Unhandled Bond Type

The dashed dative bond is shown in Figure 14. We have not encountered such a symbol before and are not clear about its intended interpretation. The solution MOL file interprets it as a planar single bond. Since MolRec’s bond recognition is rule based, an extension to include currently unhandled bonds should be straightforward.

Figure 14: Dashed Dative Bond

3.9 Unrecognised Syntax

We classified a total of five images as possessing a syntax unrecognised by MolRec.

Three images in the dataset included user annotations (cf. Figure 15). The corresponding solution MOL files do not appear to treat them as part of the structure. In the case of Figure 15(c), the solution MOL file uses a Carbon atom in place of the question mark symbol, i.e. it ignores the symbol.

Figure 15: Unrecognised User Annotations. (a) Unknown Symbol (far left), (b) User Annotation, (c).

One image, displayed in Figure 16, shows a dashed wedge bond with a wavy line crossing it. We are not familiar with this notation and do not know how to interpret it. An analysis of the corresponding solution MOL file showed the same contents as if the wavy line were not present at all. MolRec interpreted the wavy line as a wavy bond (cf. Figure 1(j)) and recognised the crossing of the wavy bond with the dashed wedge as 4 bonds connected at the centre: two dashed wedges and two wavy bonds.

Figure 16: Unrecognised Wavy Line Syntax

Finally, MolRec does not currently recognise structures with frequency variations such as the one in Figure 17, which appears in the test set.

Figure 17: Repetition Structures
While it is unlikely that one can devise a recognition procedure that can take care of arbitrary user annotations, we are hopeful that at least regular Markush structures [3], such as the above frequency variations, could be handled within our rule based approach.

3.10 Dashed Wedge Bonds Mis-Identified

These include incorrect identification of some dashed wedge bonds and of some bridge bonds. For the dashed wedge bonds in question, the short dashes at the narrow end of the bond were considered by MolRec as part of a dashed bond, while the longer dashes were treated as part of a dashed wedge, or dashed bold, bond. This has led to interpreting some dashed wedge bonds as two connected bonds (a dashed bond and a dashed wedge, or dashed bold, bond), which meant an extra non-existent node and bond were added. Further honing MolRec’s recognition parameters should take care of this problem in the future.

3.11 Diagram Caption Confusion

On 5 images in the test set a diagram caption appears in the image (cf. Figure 18). As MolRec is aimed at recognising molecule structures only, it does not do any image segmentation. Consequently it fails to recognise the image because it cannot find a suitable interpretation for the caption as part of the molecule structure.

Figure 18: Diagram with Caption

4 Issues with the Test Set

We have not found a definitive graphical syntax specification for molecular diagrams, and it is clear that there exist diagrams which use some graphical elements in different and inconsistent ways from each other. Also there appear to be syntactic notations in these diagrams that do not give sufficient information on their own to uniquely determine the corresponding MOL file up to isomorphism. Finally there are many diagrams to be found in the literature which contain definite errors.

In such cases, it is a difficult choice for a molecular diagram recognition system as to what it should, or could, do. Here we consider the few cases of this nature in the TREC11 test set.

Solid Circles: The test set contains several structures with a solid circle covering a junction. We understand these solid circles to indicate the existence of a hydrogen atom connected to that node via a solid wedge bond (cf. Figure 19), an interpretation borne out by the corresponding MOL files in the solution set for a number of diagrams with such solid circles in the test set. However, some images in the test set have the solid circle but the provided solution MOL file indicates that the node in question does not have the corresponding solid wedge bond to a hydrogen atom, e.g. US06372153-20020416-C02522.

MolRec recognises and interprets these solid circles in the way we believe correct. However, for the problem cases described above, MolRec obviously produces MOL files that do not match the solution MOL file.

Figure 19: These two notations are equivalent. (a) Solid Circle Notation, (b) Common Equivalent Notation.

Dative Bonds: The test set contains several structures with dative (polar) bonds (a bond in the form of an arrow as in Figure 20). We understand dative bonds to indicate the existence of a negatively charged atom at the head of the arrow [7]. However, the training set sometimes interprets such arrows as double bonds, e.g. image US20020143030A1-20021003-C00004, and sometimes as a normal single bond, e.g. image US20020143030A1-20021003-C00004. In the test set’s solution MOL files, by contrast, all dative bonds seem to have been interpreted as normal planar bonds (i.e. as normal single bonds).

MolRec does not yet recognise arrows. Currently it simplifies them into simple line segments, which are then interpreted as normal single bonds. Serendipitously, this treatment agrees with the solution MOL files in TREC11, although one should really consider both MolRec and the solution set to be equally in error in such cases.

Figure 20: Dative (Polar) Bond

Over Connected Atoms: The test set contains one image, US06334922-20020101-C00005, where 3 carbon atoms had 5 bonds each (the circled atoms in Figure 21). We do not
have sufficient chemical domain knowledge to be sure our interpretation is correct but, as we understand it, a carbon atom normally has 4 bonds (or fewer with omitted hydrogen atoms). When a carbon atom has more than 4 bonds, this means it should be positively charged and a plus sign (+) is used to indicate this charge. However, there were no positive charge signs in the diagram to indicate this, meaning that the diagram is internally inconsistent. The solution MOL file does not indicate the extra positive charge as it should.

MolRec is not currently designed with sufficient domain knowledge to detect this inconsistency, and the MOL file it generates corresponds to the incorrect solution MOL file.

Figure 21: Circled carbon atoms have 5 bonds and no charge

5 Conclusions

Although MolRec’s 95% recognition rate in TREC 2011 is already high, there is still plenty of room for improvement. Some of the mis-recognition problems we faced are inherently uncorrectable, in the sense that, just like in the real world, some of the test cases either have errors or have incorrect solution MOL files. Such problems must simply be accepted. We believe many of the mis-recognition problems can be solved with some relatively simple enhancements of our system, e.g. the 15 dashed wedge bond mis-identifications or the 5 diagram caption confusion cases. For a significant number of the problems we need to incorporate more chemical domain knowledge into our system, e.g. for the 10 incorrect stereochemistry problems or the 3 superatom connectivity problems.

Overall, we are pleased that a number of approaches turned out to be very successful and we recommend them to any who work in this field:

Line finding and simplification: Early experiments using Hough transforms for line finding yielded disappointing results with poor robustness. Instead we start with connected component analysis, filter characters using OCR on each connected component, skeletonise the remaining components, and use the Douglas-Peucker algorithm to simplify the skeletons and remove skeletonisation artifacts, in order to produce clean paths along lines. We can then walk the lines in the original image using the cleaned skeletons to detect and analyse the various types of bonds. This approach has proven to be fast and particularly successful.

Solid triangle and bold line detection: Given our cleaned skeleton paths, we identify and orient solid triangles, and simultaneously detect bold lines, by finding components within which a disk of radius larger than the line width can fit, and then walking the disk along the direction of the component that allows maximal growth in the disk (or minimal shrinkage). This has also proven to be very fast and robust.

Mining superatoms in MOL files: Unavailability of comprehensive superatom dictionaries, and the lack of the level of detail of internal information about the superatoms that is necessary for use in MOL files, led us to mine collections of MOL files for their superatom content. This was a fairly simple process of extracting the superatom definitions from the MOL files and relabelling their contents so that they could be reused in other MOL files. This has significantly increased the number of diagrams we can recognise automatically, although we still face the open problem of connectivity permutations in superatoms which have more than one external connection from different internal atoms.

Breaking joints: Many joints in a molecular diagram touch end-to-end to indicate the presence of an unmentioned Carbon atom. However, often Carbon atoms are explicitly written in to a space separating the bond lines. Rather than have to deal with the combinatorial possibilities of the various ways that bonds might connect, we chose to explicitly break all such connected joints so that we could treat them all in a uniform way. This has significantly simplified our code and the logic for dealing with connections and has yielded unexpected dividends in, to name one example, dealing with implicit nodes such as in Figure 2.

References

[1] D. Douglas and T. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica, 10(2):112–122, 1973.
[2] I. Filippov and M. Nicklaus. Optical structure recognition software to recover chemical information: OSRA, an open source solution. J. Chem. Inf. Model., 49(3):740–743, 2009.
[3] J. Gasteiger. Handbook of Chemoinformatics: From Data to Knowledge. Advances in Electrochemical Sciences and Engineering Series. Wiley-VCH, 2003.
[4] The Marvin abbreviation group collection. http://atchimiebiologie.free.fr/marvin/chemaxon/marvin/templates/default.abbrevgroup.
[5] Open Babel: The open source chemistry toolbox. http://www.openbabel.org/.
[6] N. Sadawi, A. Sexton, and V. Sorge. Chemical structure recognition: A rule based approach. In 19th Document Recognition and Retrieval Conference (DRR 2012). SPIE, January 2012. To appear.
[7] Cambridge Soft. Chembiodraw v12.0 user documentation, 2010. http://www.cambridgesoft.com/software/ChemDraw/.
[8] Symyx. CTfile formats, 2010. http://www.symyx.com/downloads/public/ctfile/ctfile.jsp.
