ocr_progress4
Submitted By
Saurabh Sutradhar
Research Scholar, Discipline of Computer Science,
Hiranya Chandra Bhuyan School of Science and Technology, KKHSOU
List of Figures:
SL. No. Name Page No.
1. Figure 1: Zonal Division of Assamese Character 3
2. Figure 2: Proposed Methodology 5
3. Figure 3: Out of the frame data 6
4. Figure 4: Mixed Word 6
5. Figure 5: Noises 7
6. Figure 6: Binarization 7
7. Figure 7: Comparison of Images 8
8. Figure 8: Screenshot of single character dataset ক 9
9. Figure 9: Screenshot of single character dataset খ 9
10. Figure 10: Proposed Segmentation Framework 16
11. Figure 11: Word Segmentation 27
1. Introduction
Handwritten Word Recognition (HWR) is a kind of optical character recognition (OCR)
system. It is a technique of detecting, segmenting, and identifying characters from images.
The main objective of handwritten character recognition is to replicate the human reading
capabilities so that the computer can understand, read, and work as humans do with text. It
has been one of the most interesting and laborious research areas in the field of image
processing and pattern recognition lately. Several research works have been focusing on new
methods that would reduce the processing time without compromising high recognition
accuracy. A considerable amount of work has been done on handwritten character recognition
for European, Chinese, and Arabic scripts. A genetic algorithm and an Artificial Neural Network (ANN)
were used as classifiers to recognize English digits and alphabets in [1]. Deep learning was
also used to recognize English digits and alphabets in [2]. A lot of research has likewise been
carried out for Indian languages such as Hindi, Bangla [3], Oriya [4], Malayalam [5], and Tamil [6].
Indic scripts like Oriya, Telugu, Bangla, and Roman have already been
explored extensively [3,4]. The Assamese language, however, has not been explored much owing to its
limited usage. Assamese is written in the Assamese-Bengali script, prominently
used in the North-Eastern region of India. The Assamese script consists of 11 vowels and 40
consonants [7]. It also has 122 conjunct characters [7]. Recognition of Assamese Characters
is a difficult task because of the complex features of the characters like convoluted edges
(ক্ষ, শ, ঋ, etc.), similarity in appearance (অ, আ, য, য়, etc.), presence of loops (ভ, ঞ, etc.),
etc. Unlike English script, the Assamese script does not have the concept of capitalization of
first characters. Also, different people have different handwriting styles; therefore, determining
the characters becomes a very laborious task. Nowadays, CNNs have been effectively
applied to pattern recognition, image classification, and forecast studies to name a few. For
the classification of Assamese digits and characters, a few versions of CNN have been used
so far [8,9]. This research area of recognizing Assamese characters has not been explored
much. The potential of this field is immense considering the popularity of the language in its
native state.
2. Literature Review
There is a large body of work available on handwritten character recognition, but due to
the limited time frame and scope of the topic it is not possible to go through all of it. We
therefore enforce some exclusion criteria for the literature survey. Under the exclusion criteria, we
have excluded all literature that is not related to the scripts of the Indian subcontinent.
Under the inclusion criteria, we focus on research work on Indic scripts, primarily Bengali
and Assamese, because of their correlation with our work.
The literature survey can be divided into two phases: the Pre-Implementation
phase and the Post-Implementation phase. In the Pre-Implementation phase, we
go through, to the best of our knowledge, all the literature that falls under the inclusion criteria to solidify
our research objectives. In the Post-Implementation phase, we go through literature related
to the implementation part only.
Pre-Implementation Phase:
In this phase we discuss different literature related to the Assamese or Bengali scripts.
Printed and handwritten Assamese digits were compared in [10]. For the feature extraction
and classification part, feed-forward neural networks and a tree-based classifier were used.
Digits were scanned from the document, and the pre-processed images were cropped with a
bounding box to obtain individual digits. Features were extracted from each bounded area of
the image. The grayscale documents were converted into binary images, and the background
noise was removed with the help of linear filtering, median filtering, and adaptive filtering.
After that, skew detection and correction were performed, followed by line, word, and character
segmentation.
Word-level script identification for six handwritten Indic scripts (Bangla, Devanagari,
Gurmukhi, Malayalam, Oriya, and Telugu) along with Roman was proposed in [4]. The paper
designed features based on an elliptical approximation approach. The original images are in
grayscale; Otsu's global thresholding method is used to convert them into binary images,
and the noisy pixels are removed from the binarized image using a Gaussian filter. For the
classification part, the feature sets were applied to seven different classifiers, namely Naïve
Bayes, Bayes Net, MLP, SVM, Random Forest, Bagging, and Multi-Class classifiers. The
multi-layer perceptron (MLP) achieved the highest accuracy of 94.35%.
A computational model for handwritten Assamese text was proposed in [11] to perform
pre-processing, text segmentation, and then extraction of different features from
individual characters. To segment a document image into various parts, a global projection
profile approach was used to identify the upper, lower, and middle zones of a word.
A combination of diagonal features using the zoning concept and texture features via the GLCM
(Grey Level Co-occurrence Matrix) was computed to extract various features of
individual characters. Mean and standard median filters (SMF) were used to clean noise
from the input image. Radon transform-based techniques were used for skew detection and
correction. An artificial neural network was used as the backend to perform the classification and
recognition tasks.
A modified ResNet-18 architecture (a convolutional neural network architecture) was
proposed in [3] to recognize Bangla handwritten characters. The paper considered three main
challenges: recognizing convoluted edges, distinguishing between repetitions of the same
pattern in different characters, and handling different handwritten patterns for the same
character. To provide a wider input for the generalized performance of the network, the input
images are pre-processed by removing noise with a median filter, applying an edge-thickening
filter, and resizing the image to a square shape with appropriate padding.
A hybrid approach for recognizing Malayalam handwritten characters was proposed in
[5]. It considers both the dependent and independent features of the language. MATLAB is
used to implement the proposed OCR system. The proposed method of OCR is a hybrid
approach for feature extraction combining both structural and statistical features. In this
paper, the curve features are extracted using the water reservoir principle and a decision tree
classifier is used for classification.
An Artificial Neural Network (ANN) based approach was used to segment handwritten
text in Assamese [12]. After the feature extraction part, the input was fed to an ANN model,
and the similarity measure was found. A text-to-speech synthesizer was used in [13] to
facilitate English text reading. The text was manually typed on the screen and the system was
built using MATLAB. Handwritten Sindhi numerals were similarly recognized in [14] using
K-NN and SVM. They evaluated their system using the correlation coefficient. The Sindhi
numeral 0 achieved an accuracy of 100% whereas the numeral 3 achieved 63%. Then for
Tamil script, features like character height, width, slope, etc. were considered in [6]. Zernike
moments were used for feature extraction and then fed to a backpropagation model.
Similarly, for the classification of Assamese handwritten digits, vowels, and consonants a
zoning feature was used in [15].
A feed-forward neural network model was used with a sigmoid function at every neuron
to calculate the output. For digits, vowels, and consonants, recognition accuracies of 70.6%,
69.62%, and 71.23% were achieved, respectively. In [9], Assamese handwritten digits were
recognized by using CNN (DigiNet model). This paper used six alternative Convolution and
Max-Pooling layers which were later followed by a Fully Connected layer and a SoftMax
classifier.
Post-Implementation Phase:
Here we discuss the research works that can help in our implementation
process. Generally, a document image may be contaminated with noise during transmission,
scanning, or conversion to digital form. An approach has been taken to categorize noises by
identifying their features, so that similar patterns can be searched for in a document image and
appropriate methods chosen for their removal [16].
Skew detection is one of the important parts of any document image processing or
character recognition system; its success can largely determine the success of the overall
system. Different skew detection techniques for Devanagari scripts have been discussed by
Trupti Jundale in his work [16].
3. Problem Statement:
All the works we have discussed mainly considered a zonal approach [11,12] [Figure 1], where
the Matra (the horizontal line above a character) is the main feature used to recognize Assamese
characters, despite the presence of characters with convoluted edges (ক্ষ, শ, ঋ, etc.),
similar-looking characters (য, য়, etc.), and loops (ভ, ঞ, etc.). We will try to implement a novel
vertical zonal recognition approach after word-level segmentation. In our approach, we segment
characters without detaching them from their modifiers, and compound characters are treated as
special characters rather than being decomposed into their constituents. So, our aim in this work is
to recognize Assamese handwritten characters using machine learning techniques and digitize them.
4. Objective of Study
Our aim is to recognize Assamese handwritten characters and digitize them, and further we will
try to recognize Assamese handwritten scripts. During the course of this work, we will study
different handwriting styles for Assamese and try to find common features to recognize
them. We will apply different algorithms in various phases of our work to find the best
one. We will place particular emphasis on compound and conjunct characters.
Quite often, due to poor writing quality, it is nearly impossible to recognize a
handwritten character; so, we will try to implement a word-level lexicon so that we can
predict a character within a word. We can summarize our objectives in the following points:
• To study different handwriting styles for Assamese scripts
• To create a novel dataset of handwritten Assamese characters
• To extract distinguishing features to identify Assamese handwritten scripts
• To implement various algorithms for segmentation and identification of characters
• To enhance the algorithms according to our needs for an optimal design
• To further customize the whole model into a system that can also identify conjunct and
complex characters
5. Methodology
Our methodology mainly has three phases:
i. Data Collection Phase
ii. Pre-processing Phase
iii. Machine Learning Phase
i. In the Data Collection phase, we will collect samples of Assamese handwritten
scripts from different users.
ii. The next phase is Pre-processing, where we will reduce the noise in the scanned
images of handwritten scripts and then convert them into binary images. We
will then detect any skew present in the image with various skew detection
algorithms and remove it.
iii. In the last phase we will implement various existing and modified segmentation
techniques using a machine learning approach. From there we will extract the
features needed to recognize the characters. We will apply various machine learning
techniques for feature extraction, compare them, and choose the most appropriate one.
We will also try to implement a character-level lexicon and a word-level corpus to
identify unrecognized characters on the basis of previously seen data.
Figure 2: Proposed Methodology
7. Data Collection
In this phase we discuss the various data collection methods and the problems we faced
with data collection. We have collected almost 100 samples of Assamese handwritten scripts
from different users. After a preliminary analysis we were able to identify the following problems
with the samples:
1. Out-of-frame data: Most of the data we collected are scanned or photographed
images of handwritten documents, and some of the samples are out of frame. This
presents the unique challenge of extracting only the necessary data from such
samples.
2. Mixed words: Our work tries to recognize Assamese characters, but nowadays, due to
the excessive use of English, users tend to write English words in Roman script within the
document. This can lead to the problem of unrecognizable characters.
3. Noise: Scanned or photographed images always incorporate some noise, such as
salt-and-pepper noise, Gaussian noise, etc. Furthermore, we have also treated blurring of
ink, disproportion in stroke thickness, and words struck out to correct errors as
noise.
Figure 5: Noises
Pre-processing
In our earlier pre-processing attempts, we first tried to binarize the image, then perform noise
removal, and then go for skew correction. However, direct binarization of the images results in poor quality.
Figure 6: Binarization
So we instead grayscale the image and then perform noise removal, which gives better image quality.
After that, we transform the image into a binary image using adaptive thresholding techniques,
which yields better results.
Figure 7: Comparison of Images
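A minimal sketch of this revised pipeline is given below; it assumes OpenCV (cv2) is used, and the file name sample.jpg is only a placeholder:

import cv2

# Read the scanned sample directly as a grayscale image
gray = cv2.imread('sample.jpg', cv2.IMREAD_GRAYSCALE)

# Suppress salt-and-pepper style noise with a median filter
denoised = cv2.medianBlur(gray, 3)

# Adaptive thresholding copes with uneven illumination better than a single
# global threshold, which is why it yields a cleaner binary image here
binary = cv2.adaptiveThreshold(denoised, 255,
                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2)

cv2.imwrite('sample_binary.png', binary)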
7.1 Discussion
The resultant images obtained from the pre-processing are not of good enough quality to
perform character segmentation. Research is ongoing to improve the quality, and to test our
hypothesis we have collected some data of single characters.
SVM Model
First, we import the necessary libraries: os for reading files from a directory, cv2 for image
processing, numpy for numerical operations, svm for the support vector machine algorithm,
and train_test_split for splitting the dataset into training and testing sets.
Next, we set the path to our dataset folder, which contains one subfolder per character class.
Inside each subfolder are images of handwritten samples of that character. We have seven
subfolders, one for each of these characters: ক, খ, গ, ঘ, ঙ, চ, ছ.
We load the images and labels from the dataset folder by iterating through each subfolder and
image file and appending the image and its corresponding label (the name of the subfolder) to
separate lists.
We convert the images and labels to numpy arrays for use in the SVM model.
We split the dataset into training and testing sets, with 80% of the data used for training and
20% for testing.
We train an SVM model on the training set, using a linear kernel.
We predict the characters in the testing set using the trained model.
We calculate the accuracy of the model by comparing the predicted labels to the actual labels
in the testing set.
Finally, we print out the accuracy of the model which comes around 51%.
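A minimal sketch of this pipeline is given below; it assumes scikit-learn and OpenCV are used, and the folder name dataset and the 64x64 resize are placeholders rather than the exact values of the original script:

import os
import cv2
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

dataset_path = 'dataset'  # one subfolder per character class

# Load every image, flatten it into a feature vector, and record its label
images, labels = [], []
for label in os.listdir(dataset_path):
    subfolder = os.path.join(dataset_path, label)
    for filename in os.listdir(subfolder):
        img = cv2.imread(os.path.join(subfolder, filename), cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (64, 64))   # assumed fixed size
        images.append(img.flatten())
        labels.append(label)

X = np.array(images)
y = np.array(labels)

# 80% of the data for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear-kernel SVM, as described above
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train)

# Compare predicted labels with the actual labels of the test set
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))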
CNN Model
First, the script imports the necessary libraries:
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten
from keras.utils import to_categorical
numpy is a library for working with arrays and matrices of numerical data.
matplotlib is a library for creating data visualizations, such as graphs and charts.
keras.datasets provides the MNIST dataset of handwritten digits.
keras.models and keras.layers are Keras modules for defining neural network models and
layers.
keras.utils provides utilities for working with data, such as one-hot encoding.
Next, the script loads the MNIST dataset and preprocesses the data:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape((X_train.shape[0], 28, 28, 1))
X_test = X_test.reshape((X_test.shape[0], 28, 28, 1))
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
mnist.load_data() loads the MNIST dataset and returns four NumPy arrays: X_train, y_train,
X_test, and y_test.
X_train and X_test are reshaped to 4D arrays with shape (batch_size, height, width,
channels), where batch_size is the number of images in each batch, height and width are the
dimensions of each image, and channels is the number of color channels (1 for grayscale, 3
for RGB).
y_train and y_test are one-hot encoded, meaning that each label is represented as a binary
vector with a 1 in the position corresponding to the true label and 0s elsewhere.
The script then defines the model architecture using Keras layers:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
Sequential() creates a new neural network model as a linear stack of layers.
Conv2D() adds a convolutional layer with a specified number of filters (32 or 64 in this case),
kernel size (3x3), activation function (ReLU), and input shape (28x28 grayscale images with
1 channel).
MaxPooling2D() adds a max pooling layer with a pool size of 2x2, which downsamples the
feature maps by taking the maximum value within each 2x2 window.
Flatten() flattens the output of the convolutional layers into a 1D array, which can be passed
to a fully connected layer.
Dense() adds a fully connected layer with a specified number of units (64 or 10 in this case)
and activation function (ReLU or softmax).
The script compiles the model with an optimizer and loss function:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
compile() configures the model for training: the Adam optimizer is used to update the weights,
categorical cross-entropy serves as the loss function for this multi-class problem, and accuracy
is tracked as the evaluation metric.
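The compiled model can then be trained and evaluated. A minimal sketch is given below; the epoch count and batch size are assumed values rather than settings taken from the original script:

# Train on the training set, holding out part of it for validation
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.1)

# Evaluate on the unseen test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)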
Segmentation
Earlier in our research we were able to recognize handwritten single characters, mainly
consonants, using a CNN model. But a handwritten script contains words and sentences built from
these characters, so it is advisable to first separate each character from a word before
recognizing it. Recognizing entire words or sentences requires segmenting individual
characters so they can be fed into our existing CNN model. Segmentation is the process of
dividing a continuous stream of text or characters into individual units or segments. Our
approach is to apply segmentation techniques to separate each character from the document.
Related Works in Segmentation:
In 2008, Y. Li et al. [17] have recommended a probability map, where each element
represents the probability that the underlying pixel belongs to a text line. The level set
method was then exploited to determine the boundary of neighboring text lines by evolving
an initial estimate. Unlike connected component based methods, the proposed algorithm does
not use any script-specific knowledge.
In 2022, Husam et al. [19] proposed using the word image's vertical linear
density to clarify character boundaries and distinguish between characters. In the proposed
method, three pre-processing steps were applied: filling closed and open holes (missing circles),
removing punctuation to clarify the area of ligature points and avoid character overlapping,
and cropping the word image to remove excess white space. The goal of filling closed and open
holes was to increase the characters' pixel density before applying the vertical linear density.
In 2010, Al Hamad and Abu Zitar [20] proposed and investigated a novel neural-based
technique, based on direction features, for validating prospective segmentation points in Arabic
handwriting. In particular, the vital process of handwriting segmentation was examined in great
detail. The classifier chosen for segmentation point validation was a feed-forward neural network
trained with the back-propagation algorithm.
Segmenting Assamese handwriting scripts is a complex task that plays a crucial role
in various applications, from digitizing historical manuscripts to developing robust OCR
systems tailored to the Assamese language. This process involves breaking down handwritten
text into its basic components, such as characters, words, and lines, to facilitate further
analysis and recognition. The uniqueness of the Assamese script, with its distinct characters,
modifiers, and conjuncts, presents both opportunities and challenges for the development of
effective segmentation techniques. Various research works on the segmentation of
handwritten scripts are discussed in Table 1. The research gaps of existing models for
segmenting Assamese handwriting scripts are given below.
Table 2: Features and challenges of existing deep learning-based models for segmenting
Assamese handwriting scripts
Proposed Methodology
After going through various related works, it has been established that character segmentation
in handwritten scripts can be a challenging task due to the variability in writing styles,
shapes, and sizes of characters. The effectiveness of segmentation methods depends on the
characteristics of the script and the quality of the handwriting. Here are some segmentation
methods that are commonly used for handwritten character segmentation:
Connected Component Analysis (CCA):
CCA is a simple and widely used method for character segmentation. It involves grouping
connected pixels into components, which can represent characters. This method can work
well for well-separated characters.
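For illustration, a minimal CCA sketch with OpenCV is given below; it assumes a binarized word image named word_binary.png (a placeholder) with white text on a black background:

import cv2

binary = cv2.imread('word_binary.png', cv2.IMREAD_GRAYSCALE)
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)

# Each row of stats describes one component: x, y, width, height, area
boxes = []
for x, y, w, h, area in stats[1:]:   # label 0 is the background, so skip it
    if area > 20:                    # assumed minimum area, to discard specks
        boxes.append((x, y, w, h))

boxes.sort(key=lambda b: b[0])       # order candidate characters left to right
print(boxes)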
Stroke Width Transform (SWT):
SWT can be effective in handling variations in stroke width in handwritten scripts. It helps in
identifying regions of similar stroke width, which can correspond to individual characters.
Projection Profiles:
Projection profiles can be applied horizontally and vertically to analyze the distribution of ink
in the image. Peaks in the profiles may indicate potential character boundaries.
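As an illustration, a minimal vertical projection profile sketch is given below; it assumes a binary word image (text pixels non-zero, word_binary.png is a placeholder) and treats runs of empty columns as candidate cut points:

import cv2
import numpy as np

binary = cv2.imread('word_binary.png', cv2.IMREAD_GRAYSCALE)
binary = (binary > 0).astype(np.uint8)   # normalise to 0/1, text pixels as 1

# Column-wise ink counts; columns without ink suggest gaps between characters
profile = binary.sum(axis=0)

cuts, in_gap, start = [], False, 0
for col, count in enumerate(profile):
    if count == 0 and not in_gap:
        start, in_gap = col, True
    elif count > 0 and in_gap:
        cuts.append((start + col) // 2)  # midpoint of each empty run
        in_gap = False
print(cuts)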
Contour-based Segmentation:
Analysing contours of connected components can help identify character boundaries. This
approach is useful when characters have distinct contours.
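A minimal contour-based sketch with OpenCV is given below (the OpenCV 4 return signature of findContours is assumed, and word_binary.png is a placeholder); each external contour's bounding box is treated as a candidate character region:

import cv2

binary = cv2.imread('word_binary.png', cv2.IMREAD_GRAYSCALE)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# One bounding box per external contour, ordered left to right
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])
for x, y, w, h in boxes:
    print(x, y, w, h)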
Machine Learning-Based Segmentation:
Convolutional Neural Networks (CNNs) can be trained to learn features relevant to
character segmentation. These models can capture complex patterns in handwriting and
generalize well to different writing styles.
Hybrid Approaches:
Combining multiple methods in a hybrid approach can enhance robustness. For example,
using a combination of CCA and machine learning techniques can leverage the strengths of
both approaches.
Word Segmentation Followed by Character Segmentation:
In some cases, it might be beneficial to first segment the words in a handwritten script and
then perform character segmentation within each word. This hierarchical approach can handle
cases where characters within a word are closely connected.
Historical Information:
Utilizing contextual information and the order in which characters are written in a script can
be helpful. Historical information about the stroke order or writing direction may guide the
segmentation process.
The choice of the best segmentation method depends on factors such as the complexity of the
script, the level of variability in handwriting styles, and the specific characteristics of the
documents being processed. It's often beneficial to experiment with multiple methods and
potentially combine them to achieve optimal results for a given dataset. Machine learning-
based approaches, particularly those leveraging deep learning, have shown promising results
in handling the complexity of handwritten character segmentation tasks.
The main objectives of this study are:
• Improve the accuracy of character recognition for Assamese handwritten script and
address the character confusion issue.
• Apply different algorithms to compare the performance.
• Enhance OCR capabilities to recognize different handwriting efficiently.
Figure 11: Word Segmentation
References
1. Agarwal, M., Kaushik, B.: Text Recognition from Image using Artificial Neural Network
and Genetic Algorithm. IEEE, 2015
2. Vaidya, R., Trivedi, D., Satra, S.: Handwritten Character Recognition Using Deep-
Learning. 2nd International Conference on Inventive Communication Technologies
(ICICCT). 2018
3. Alif, M., Ahmed, S., Hasan, M.: Isolated Bangla Handwritten Character Recognition with
Convolutional Neural Network. 20th International Conference of Computer and Information
Technology (ICCIT). IEEE, 2017.
4. Singh, P., Doermann, D.: Word-level Script Identification for Handwritten Indic Scripts.
13th International Conference on Document Analysis and Recognition (ICDAR). IEEE,
2015
5. Sujala, K., James, A., Saravanan, C.: A Hybrid Approach for Feature Extraction in
Malayalam Handwritten Character Recognition. Second International Conference on
Electrical, Computer and Communication Technologies (ICECCT). IEEE, 2017.
6. Wahi, A., Sundaramurthy, S., Poovizhi, P.: Handwritten Tamil Character Recognition.
Fifth International Conference on Advanced Computing (ICoAC). 2013
7. Wikipedia, https://en.wikipedia.org/wiki/Assamese_alphabet, last accessed 2021/10/12
8. Yadav, M., Mangal, D., Srinivasan, N., Ganzha, M.: Assamese Character Recognition
using Convolutional Neural Networks. May 2021
9. Dutta, P., Muppalaneni, N.: DigiNet: Prediction of Assamese Handwritten Digits using
Convolutional Neural Network. Concurrency and Computation: Practice and Experience.
June 2021.
10. Medhi, K., Kalita, S.: Assamese Digit Recognition with Feed Forward Neural Network.
International Journal of Computer Applications. Volume 109 - No. 1 (January 2015).
11. Bania, R.: Handwritten Assamese Character Recognition using Texture and Diagonal
Orientation features with Artificial Neural Network. International Journal of Applied
Engineering Research 13.10 (2018): 7797-7805.
12. Bhattacharya, K., Sarma, K.: ANN-based Innovative Segmentation Method for
Handwritten Text in Assamese. IJCSI International Journal of Computer Science Issues, Vol.
5, 2009
13. Gopinath, J., Aravind, S., Chandran, P., Saranya, S.: Text to Speech Conversion System
using OCR. International Journal of Emerging Technology and Advanced Engineering,
Volume 5, Issue 1, January 2015
14. Sanjrani, A., Baber, J., Bakhtyar, M., Noor, W., Khalid, M.: Handwritten Optical Character
Recognition System for Sindhi Numerals. IEEE, 2015.
15. Medhi, K., Kalita, S.: Assamese Character Recognition Using Zoning Feature. Advances
in Electronics, Communication, and Computing. January 2018
16. Farahmand, A., Sarrafzadeh, A., Shanbehzadeh, J.: Document Image Noises and Removal
Methods. Proceedings of the International MultiConference of Engineers and Computer Scientists
2013, Vol. I, IMECS 2013, March 13-15, 2013, Hong Kong.
17. Li, Y., Zheng, Y., Doermann, D., Jaeger, S.: Script-Independent Text Line Segmentation in
Freestyle Handwritten Documents. IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 30, no. 8, pp. 1313-1329, August 2008.
18. Kaur, S., Bawa, S., Kumar, R.: Heuristic-based Text Segmentation of Bilingual Handwritten
Documents for Gurumukhi-Latin Scripts. Volume 83, pages 18667–18697, 2024.
19. Al Hamad, H.A., Abualigah, L., Shehab, M., Al-Shqeerat, K.H.A., Otair, M.: Improved Linear
Density Technique for Segmentation in Arabic Handwritten Text Recognition. Multimedia Tools and
Applications, Volume 81, pages 28531–28558, 2022.
20. Al Hamad, H.A., Abu Zitar, R.: Development of an Efficient Neural-based Segmentation
Technique for Arabic Handwriting Recognition. Pattern Recognition, Volume 43, Issue 8,
August 2010.