OCR Model
Bachelor of Engineering
In
Computer Science and Engineering
Proposed By
Raj Singh
1. Text Extraction:
Aim: The primary goal is to accurately extract text from images or scanned documents.
Objectives: Develop algorithms that can identify and recognize characters, words, and
sentences within images with high precision and recall.
2. Accuracy Improvement:
Aim: Enhance the overall accuracy of OCR by minimizing errors in character recognition.
Objectives: Employ advanced machine learning and deep learning techniques to improve the
model's ability to correctly identify characters, even in challenging scenarios such as distorted
text or low-quality images (a minimal sketch of such a character-recognition network is given
after this list).
3. Language Support:
Objectives: Train the OCR model to recognize and process text in various languages, ensuring a
broader application range and accessibility for users globally.
4. Document Structure Preservation:
Objectives: Implement features that understand the organization of text in documents, including
headers, footers, paragraphs, tables, and other structural elements, to maintain document
integrity during the OCR process.
5. Continuous Improvement:
Objectives: Establish mechanisms for continuous learning and improvement, incorporating user
feedback and updating the model with new data to adapt to evolving patterns and challenges.
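As a concrete illustration of objective 2, the following is a minimal sketch of the kind of deep-learning model that could be used for character recognition: a small convolutional network that classifies 32x32 grayscale character images into character classes. PyTorch, the layer sizes, and the class count are illustrative assumptions, not the project's final design.

import torch
import torch.nn as nn

class CharacterCNN(nn.Module):
    """Toy convolutional classifier for single character images."""
    def __init__(self, num_classes: int = 62):  # e.g. digits + upper/lower-case letters (assumption)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                  # (batch, 64, 8, 8) for 32x32 inputs
        return self.classifier(x.flatten(1))  # class scores for each character image

# Usage: a batch of 16 grayscale 32x32 character crops.
scores = CharacterCNN()(torch.randn(16, 1, 32, 32))
print(scores.shape)  # torch.Size([16, 62])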
BACKGROUND STUDY
Deep learning solutions have taken the world by storm, and organizations of all kinds, from tech
giants and established companies to startups, are now trying to incorporate deep learning (DL)
and machine learning (ML) into their workflows. One such solution that has gained considerable
popularity over the past few years is the OCR engine.
OCR (Optical Character Recognition) is a technique for reading textual information directly
from digital and scanned documents without any human intervention. These documents can be in
any format, such as PDF, PNG, JPEG, or TIFF. Using OCR systems offers several advantages:
It increases productivity, as it takes far less time to process (extract information from)
the documents.
It saves resources, since an OCR program does the work and no manual effort is required.
It eliminates the need for manual data entry.
The chances of error are reduced.
Extracting information from digital documents is relatively easy, because they carry metadata
and an embedded text layer that provide the text directly, as sketched below. For scanned copies,
however, a different solution is required, since metadata does not help there. This is where deep
learning comes in, providing solutions for extracting text information from images.
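A minimal sketch of reading the embedded text of a digital (non-scanned) document, where no OCR is needed; the pypdf package and the file name are illustrative assumptions:

from pypdf import PdfReader

# Open a born-digital PDF; its text layer can be read directly without OCR.
reader = PdfReader("digital_document.pdf")  # hypothetical file name
for page in reader.pages:
    print(page.extract_text())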
This report discusses lessons for building a deep learning-based OCR model, so that common
issues faced during the development and deployment of such use cases can be avoided.
OCR has become very popular nowadays and has been adopted by several industries for faster
reading of text data from images. While techniques such as contour detection, image
classification, and connected component analysis work for documents with comparable text size
and font, ideal lighting conditions, and good image quality, they are not effective for irregular,
heterogeneous text, often called wild text or scene text. Such text may come from a car's license
plate, a house number plate, poorly scanned documents (with no predefined conditions), and so
on. For these cases, deep learning solutions are used. Using DL for OCR is typically a three-step
process: detecting the regions of an image that contain text, recognizing the characters within
each detected region, and post-processing the recognized text into a structured output. A minimal
pipeline sketch follows.
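A minimal sketch of such a deep-learning OCR pipeline, using the open-source EasyOCR library as one possible choice (the library and the input image are assumptions; the synopsis does not name a specific tool for this step):

import easyocr

# The Reader loads pretrained detection and recognition networks for the
# requested languages (which also illustrates the multi-language objective).
reader = easyocr.Reader(['en'])

# readtext runs text detection followed by text recognition and returns
# (bounding_box, recognized_text, confidence) for every detected region.
results = reader.readtext("license_plate.jpg")  # hypothetical input image
for bbox, text, confidence in results:
    print(f"{text} ({confidence:.2f}) at {bbox}")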
METHODOLOGY
In this work, a complete OCR methodology for recognizing historical documents, either printed
or handwritten, without any prior knowledge of the font is presented. The methodology consists
of three steps: the first two create a character database for training from a set of documents,
while the third recognizes new document images.
First, a pre-processing step that includes image binarization and enhancement takes place. In the
second step, a top-down segmentation approach is used to detect text lines, words, and
characters. A clustering scheme is then adopted to group characters of similar shape. This is a
semi-automatic procedure, since the user can interact at any time to correct possible clustering
errors and assign an ASCII label. After this step, a database is created to be used for recognition.
Finally, in the third step, the same segmentation approach is applied to every new document
image, and recognition is based on the character database produced in the previous step.
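A minimal sketch of the top-down line segmentation step, assuming the page has already been binarized with text as white pixels on a black background (the file name and the minimum line height are illustrative assumptions):

import cv2
import numpy as np

# Load a binarized page (text = white, background = black); the file name is hypothetical.
page = cv2.imread("binarized_page.png", cv2.IMREAD_GRAYSCALE)

# Horizontal projection profile: number of text pixels in every row of the page.
row_profile = np.count_nonzero(page > 0, axis=1)

# Runs of non-empty rows form text lines; empty rows separate them.
lines, start = [], None
for y, count in enumerate(row_profile):
    if count > 0 and start is None:
        start = y                           # a text line begins here
    elif count == 0 and start is not None:
        if y - start >= 5:                  # skip very thin noise bands
            lines.append(page[start:y, :])  # crop one text line
        start = None
if start is not None:
    lines.append(page[start:, :])           # line touching the bottom edge

print(f"Detected {len(lines)} text lines")
# The same idea applied to vertical profiles within each line yields words and characters.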
Working:
A scanner reads documents and converts them to binary data. The OCR software analyzes the
scanned image and classifies the light areas as background and the dark areas as text.
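A minimal sketch of that thresholding step, using Otsu's method in OpenCV (the file names are illustrative assumptions):

import cv2

# Read the scanned page as a grayscale image; the file name is hypothetical.
image = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method picks the threshold automatically; THRESH_BINARY_INV maps the
# dark areas (text) to white and the light areas (background) to black.
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

cv2.imwrite("binarized_page.png", binary)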
[Flow chart of the methodology]
TOOLS AND TECHNIQUES TO BE USED
PYTHON
BOTO3
AWS
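Since AWS and Boto3 are listed as tools, one possible way to use them is to call Amazon Textract for text detection; the specific service, region, and file name below are assumptions, not choices stated in this synopsis, and valid AWS credentials are assumed to be configured:

import boto3

# Create a Textract client; the region is an illustrative assumption.
textract = boto3.client("textract", region_name="us-east-1")

# Send a scanned page to the service; the file name is hypothetical.
with open("scanned_page.png", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Textract returns BLOCK objects; LINE blocks carry the recognized text lines.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])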
PROPOSED WORK
The large Amount of documents, either modern or historical, that we have in our possession
nowadays, due to the expansion of digital libraries, has pointed out the need for reliable and
accurate systems for processing them. Historical documents are of more importance because
they are a significant part of our cultural heritage. During the last decades a lot of research has
been done in the field of Optical Character Recognition (OCR). Numerous commercial
products have been released that convert digitized documents into text files, usually in ASCII
format. Although these products process machine printed documents successfully, when it
comes to handwritten documents the results are not satisfactory enough. Moreover, such
products are unable to process historical documents due to their low quality, lack of standard
alphabets and presence of unknown fonts.To this end, recognition of historical documents is
one of the most challenging tasks in OCR.