OCR for printed Kannada text to machine editable format using database
approach

Article in WSEAS TRANSACTIONS ON COMPUTERS · June 2008



9th WSEAS International Conference on AUTOMATION and INFORMATION (ICAI'08), Bucharest, Romania, June 24-26, 2008

OCR for printed Kannada text to Machine editable format using Database approach
B.M. SAGAR (1), Dr. SHOBHA G (2), Dr. RAMAKANTH KUMAR P (3)
Information Science (1), Computer Science (2), Computer Science (3)
Visvesvaraya Technological University (1, 2, 3)
Lecturer (1), Professor (2), Professor (3), R.V.C.E, Bangalore-59, Karnataka, INDIA

Abstract: - This paper describes an Optical Character Recognition (OCR) system for printed text
documents in Kannada, a South Indian language. The proposed OCR system recognizes printed
Kannada text and can handle all types of Kannada characters. The system first extracts an image
of the Kannada script, then segments the image into lines and words, and finally segments the
words into sub-character level pieces. For character recognition we have used a database approach.
The level of accuracy reached 100%.

Key-words: - Optical Character Recognition, Segmentation, Kannada Scripts

1. Introduction
Optical character recognition (OCR) refers to reading text from paper and translating the images
into a form that the computer can manipulate. OCR systems have been effectively developed for
the recognition of printed characters of non-Indian languages. Until quite recently, the focus of
this endeavor has been on characters of the English language. Such systems are also available for
many European languages as well as some Asian languages such as Japanese and Chinese.
However, there are not many reported efforts at developing OCR systems for Indian languages,
especially for a South Indian language like Kannada.
Section 2 reviews earlier work on Kannada character recognition. Section 3 describes the
segmentation of lines, words and characters. Section 4 describes the proposed system for
Kannada character recognition. Section 5 describes the character recognition method. Section 6
presents the experimental results, followed by the conclusion and future work.

1.1 Introduction to Kannada Scripts
Kannada is one of the South Indian languages. The Kannada character set is almost identical to
that of other Indian languages. It is written horizontally from left to right, and the concept of
lower and upper case is absent [1].
The Kannada language has 16 vowels and 34 consonants as its basic alphabet. The number of
written symbols, however, is far greater than these 50 characters, because different characters
can be combined to form compound characters (ottaksharas).

2. Background Study
Owing to advances in information technology, more emphasis is now given in Karnataka to using
Kannada at all levels, and hence the use of Kannada in computer systems is also a necessity.
Efficient OCR systems for Kannada are therefore one of the present-day requirements. Many OCR
systems are currently available for handling printed English documents with reasonable levels of
accuracy [1], but it is difficult to find OCR systems for Kannada with high accuracy. A few
researchers have worked on Kannada character recognition with a novel set of features that are
computationally simple to extract. Recognition is achieved by employing a number of 2-class
classifiers based on the

ISBN: 978-960-6766-77-0 322 ISSN 1790-5117



Support Vector Machine (SVM) method. The recognition is independent of the font and size of the
printed text, and the system is seen to deliver reasonable performance [3]. Other researchers
have worked on Kannada character recognition using Hu's invariant moments and Zernike moments
to extract the features of printed Kannada characters. Neural classifiers have been effectively
used for the classification of characters based on moment features. An encouraging recognition
rate of 96.8% has been obtained [1].

In our system we have used a database approach for character recognition. Section 5 describes
the method of character recognition.

3. Segmentation Process
Due to the peculiarities of the Kannada script, the following segmentation scheme is proposed,
in which lines are segmented first, then words and finally characters. These are then put
together for the recognition of individual aksharas or characters.
As Kannada is a non-cursive script, the individual characters in a word are isolated. Spacing
between the characters can be used for segmentation.

3.1 Line Segmentation
Line segmentation is the process of identifying lines in a given image.
The steps for line segmentation are as follows:
1. Scan the BMP image horizontally (row by row) to find the first ON pixel and remember that y
coordinate as y1.
2. Continue scanning the BMP image; many ON pixels will be found, since the characters have
started.
3. Finally we reach the first row that contains no ON pixel; remember that y coordinate as y2.
4. y1 to y2 is the line.
5. Repeat the above steps till the end of the image.

Figure 3.1: Line and character segmentation.

3.2 Word Segmentation
Adjacent words are separated by a gap, and we use this gap for word segmentation. After line
segmentation, the image is scanned vertically for word segmentation.

The steps for word segmentation are as follows:
1. Scan the BMP image vertically (column by column) within the recognized line segment to find
the first ON pixel and remember that x coordinate as x1. Treat this as the starting coordinate
of the word.
2. Continue scanning the BMP image; many ON pixels will be found, since the word has started.
3. Finally we reach five successive OFF pixel columns (five columns is the assumed word
distance); remember that x coordinate as x2.
4. x1 to x2 is the word.
5. Repeat the above steps till the end of the line segment.
6. Repeat the above steps for all the recognized line segments.

3.3 Character Segmentation
1. Scan the BMP image vertically (column by column) within the recognized word segment to find
the first ON pixel and remember that x coordinate as x1. Treat this as the starting coordinate
of the character.
2. Continue scanning the BMP image; many ON pixels will be found, since the character has
started.
3. Finally we reach an OFF pixel column; remember that x coordinate as x2.
4. x1 to x2 is the character.
5. Repeat the above steps till the end of the word segment and of the line segment.
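
The scanning steps for line, word and character segmentation can be sketched in Python as follows. This is only a minimal illustration, not the authors' implementation: it assumes the BMP has already been loaded and binarized into a list of rows holding 1 for ON pixels and 0 for OFF pixels, and the names find_runs, segment_lines, segment_columns and segment_page are hypothetical. The five-column word gap is the value assumed in the paper; reporting each end coordinate as the first OFF row or column after the run is a choice made for this sketch.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ON (ink) pixel, 0 = OFF (background) pixel
Span = Tuple[int, int]   # (start, end) with end exclusive


def find_runs(profile: List[bool], min_gap: int = 1) -> List[Span]:
    """Return the ON runs of a row/column profile.

    A run is closed only after `min_gap` consecutive OFF entries,
    so shorter gaps stay inside the run.
    """
    runs: List[Span] = []
    start = None    # first ON index of the current run
    last_on = None  # most recent ON index
    for i, on in enumerate(profile):
        if on:
            if start is None:
                start = i
            last_on = i
        elif start is not None and i - last_on >= min_gap:
            runs.append((start, last_on + 1))
            start = None
    if start is not None:
        runs.append((start, last_on + 1))
    return runs


def segment_lines(image: Image) -> List[Span]:
    """Rows (y1, y2) of each text line: scan row by row for ON pixels."""
    return find_runs([any(row) for row in image], min_gap=1)


def segment_columns(image: Image, rows: Span, cols: Span, min_gap: int) -> List[Span]:
    """Column spans (x1, x2) inside the given rows and columns."""
    y1, y2 = rows
    x1, x2 = cols
    profile = [any(image[y][x] for y in range(y1, y2)) for x in range(x1, x2)]
    return [(x1 + a, x1 + b) for a, b in find_runs(profile, min_gap)]


def segment_page(image: Image, word_gap: int = 5):
    """Lines -> words -> characters, as nested coordinate spans."""
    width = len(image[0]) if image else 0
    page = []
    for line in segment_lines(image):
        words = []
        # Words end at `word_gap` successive OFF columns.
        for word in segment_columns(image, line, (0, width), word_gap):
            # Characters end at the first OFF column.
            words.append(segment_columns(image, line, word, 1))
        page.append((line, words))
    return page
```

Each entry of the result pairs a line span (y1, y2) with a list of words, and each word is a list of character spans (x1, x2), so the pixels of one character are the rows y1..y2-1 and columns x1..x2-1 of the image.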


The character segmentation steps are repeated for all the recognized line segments.

4. Proposed system
The OCR's task is to identify the characters of Kannada script, and the word processor provides
an interface for viewing and editing documents in Kannada. Figure 4.1 shows the details. In this
work, the sequence of operations carried out is as follows. A page of Kannada text is scanned.
The input to the system is a scanned image file, in BMP format, of a pure Kannada document. The
document is then segmented into lines and each line into individual characters: the document
image is scanned and a line is extracted from the image file. The extracted line is given as
input to the Character Segmentation module, and within each line the characters are segmented
one by one. Each extracted character that is still to be recognized is given as input to the
Character Recognizing Module.

5. Character Recognition
After character segmentation, we store each character image in a structure. This character has
to be identified against the predefined character set. Preliminary data is stored for all the
Kannada characters for a given font and size. This data contains the following information:
1. Character ASCII value
2. Character name
3. Character BMP image
4. Character width and length
5. Total number of ON pixels in the image.
For every segmented character the above-mentioned information is captured and compared with the
predefined data stored in the system.
As we are using the same font and size for recognition, there will be exactly one unique match
for the character. This match gives us the name of the character.
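
The comparison against the stored character data can be sketched as below. This is a hedged illustration under stated assumptions rather than the system's actual code: the paper does not say how the data is stored or compared, so the CharRecord fields simply mirror the five items listed above, the width, length and ON-pixel count are used as a cheap pre-filter, and the bitmap is then compared pixel for pixel. All names (CharRecord, make_record, build_index, recognize) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]
Key = Tuple[int, int, int]  # (width, height, ON-pixel count)


@dataclass(frozen=True)
class CharRecord:
    code: int        # the "ASCII value" stored for the character
    name: str        # character name
    bitmap: Bitmap   # character image, 1 = ON, 0 = OFF
    width: int
    height: int      # the paper calls this "length"
    on_pixels: int   # total number of ON pixels


def make_record(code: int, name: str, rows: Sequence[Sequence[int]]) -> CharRecord:
    """Derive width, height and ON-pixel count from the bitmap itself."""
    bitmap = tuple(tuple(r) for r in rows)
    return CharRecord(code, name, bitmap, len(bitmap[0]), len(bitmap),
                      sum(map(sum, bitmap)))


def build_index(records: List[CharRecord]) -> Dict[Key, List[CharRecord]]:
    """Group stored characters by their cheap-to-compare attributes."""
    index: Dict[Key, List[CharRecord]] = {}
    for rec in records:
        index.setdefault((rec.width, rec.height, rec.on_pixels), []).append(rec)
    return index


def recognize(rows: Sequence[Sequence[int]],
              index: Dict[Key, List[CharRecord]]) -> Optional[CharRecord]:
    """Return the stored record whose bitmap equals the segmented character."""
    probe = make_record(0, "", rows)
    candidates = index.get((probe.width, probe.height, probe.on_pixels), [])
    for rec in candidates:
        if rec.bitmap == probe.bitmap:  # exact pixel-for-pixel match
            return rec
    return None
```

Because the final test is an exact bitmap comparison, a lookup of this kind only succeeds when the segmented character is pixel-identical to a stored one, which is consistent with the paper's requirement that the input use the same font and size as the stored data.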
If the size of the character varies, it is scaled to the known standard size before the
recognition process is carried out.
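
The paper does not say how the scaling is done. One simple possibility, shown here purely as an assumption, is nearest-neighbour resampling of the binary character image to the standard width and height; the function name scale_to_standard and its parameters are hypothetical.

```python
from typing import List, Sequence


def scale_to_standard(rows: Sequence[Sequence[int]],
                      std_width: int, std_height: int) -> List[List[int]]:
    """Resample a binary character image to std_width x std_height.

    Each target pixel copies the source pixel at the proportionally
    equivalent position (nearest-neighbour sampling).
    """
    src_height = len(rows)
    src_width = len(rows[0])
    return [
        [rows[y * src_height // std_height][x * src_width // std_width]
         for x in range(std_width)]
        for y in range(std_height)
    ]
```

A resampled glyph will not always reproduce the stored bitmap pixel for pixel, so a system that scales its input would probably need a more tolerant comparison than an exact match; the paper does not discuss this point.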

6. Experimental Results
Figure 4.1 shows the input to the system; once we choose to recognize, the output appears at
the bottom.
Since we are using the database approach for character recognition, we get 100% accuracy. The
limitation of this approach is that for each character we need to store details such as the
character ASCII value, character name, character BMP image, character width and length, and
total number of ON pixels in the image. This takes a lot of space and involves a lot of
computation in recognizing the character, but we get 100% accuracy.

Figure 4.1: Interface for viewing and editing documents in Kannada (panel label: Output).

7. Conclusion & future work
In this paper, we have presented a database approach for recognizing Kannada characters.
Kannada is a widely used language in South India. Many applications need a Kannada


OCR which can give 100% accuracy. The database approach delivers the required accuracy, but
with the limitation described above. Recognition can also be carried out using neural networks
or Support Vector Machines, but not with the required accuracy. A dictionary approach could,
however, be used to increase that accuracy.

Reference:
[1] R. Sanjeev Kunte and R. D. Sudhaker Samuel, "A simple and efficient optical character
recognition system for basic symbols in printed Kannada text".

[2] Bharath A. and Sriganesh Madhvanath, "Hidden Markov Models for Online Handwritten Tamil
Word Recognition", HP Laboratories India, HPL-2007-108, July 6, 2007.

[3] T. V. Ashwin and P. S. Sastry, "A font and size-independent OCR system for printed Kannada
documents using support vector machines", Department of Electrical Engineering, Indian
Institute of Science, Bangalore 560 012, India.

[4] Rohana K. Rajapakse and A. Ruvan Weerasinghe, "A Neural Network based character recognition
system for Sinhala Script", Department of Statistics and Computer Science, University of
Colombo.

[5] Seethalakshmi R., "Optical Character Recognition for printed Tamil text using Unicode",
Thanjavur, Tamil Nadu.

About the authors

B.M. Sagar is a Lecturer in the Department of Information Science and Engineering. He obtained
his Master's degree in Computer Science & Engineering from VTU and his B.E. in Computer Science
& Engineering from VTU. His research interest is Pattern Recognition. He has guided more than
25 undergraduate projects. He has presented and published papers at national and international
conferences.

Dr. Shobha G. is a Professor of Computer Science & Engineering. She was awarded a Ph.D. for her
thesis titled "Knowledge Discovery in Transactional Database Systems" by Mangalore University,
Mangalore. She obtained her M.S. degree in Software Systems from BITS, Pilani and her B.E. in
Computer Science from Gulbarga University. Her research interests are Data Mining, DBMS,
Operating Systems and Networking. She has guided more than 30 undergraduate and 9 postgraduate
projects.

Dr. Ramakanth Kumar P. was awarded a Doctorate by Mangalore University and has around 14 years
of teaching experience in academics and industry. His areas of research are Artificial
Intelligence and Pattern Recognition. He has to his credit 3 national journal papers, 2
international journal papers, 12 conference papers and 15 research publications. He is guiding
4 M.Tech. students and 3 Ph.D. students.
