Raj Synopsis12

The document proposes developing an OCR model that can accurately extract text from images and scanned documents in multiple languages while preserving document structure. The project aims to improve OCR accuracy, support continuous learning, and extract text from historical documents. Methodologies will include pre-processing, segmentation, character clustering, and training an OCR database to recognize text in new documents using tools like Python, Boto3, and AWS.

SYNOPSIS FOR MAJOR PROJECT

OCR Model

Bachelor of Engineering
In
Computer Science and Engineering

Proposed By

Raj Singh

Registration No. 2012051299

Under the guidance of

Dr. Sarika Chaudhary


Associate Professor & HOD CSE

Department of Computer Science & Engineering


DPG Institute of Technology & Management
Gurugram, Haryana
Session: Jan 2024 – May 2024
AIM AND OBJECTIVE OF THE PROJECT

OCR, or Optical Character Recognition, is a technology that converts different types of
documents, such as scanned paper documents, PDFs, or images captured by a digital camera,
into editable and searchable data. The aim and objectives of an OCR model revolve around
enhancing the capabilities of extracting text information from images or scanned documents.
Here are the key objectives:

1. Text Extraction:

Aim: The primary goal is to accurately extract text from images or scanned documents.

Objectives: Develop algorithms that can identify and recognize characters, words, and
sentences within images with high precision and recall.
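
As a rough illustration of how "precision and recall" could be measured for extracted text, the sketch below compares OCR output with a ground-truth transcription at the word level. The function name and the bag-of-words matching are assumptions made for this example; the synopsis does not specify an evaluation protocol.

```python
from collections import Counter


def word_precision_recall(predicted, reference):
    """Bag-of-words precision and recall of OCR output against ground truth."""
    pred_counts = Counter(predicted.split())
    ref_counts = Counter(reference.split())
    # Multiset intersection: a word is matched at most as often as it occurs in both texts.
    matched = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = matched / pred_total if pred_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    return precision, recall


# Hypothetical OCR output with one misread word and one missing word.
precision, recall = word_precision_recall(
    "the quick brovvn fox", "the quick brown fox jumps"
)
```

Here precision is the share of recognized words that are correct, and recall is the share of ground-truth words that were recovered.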

2. Accuracy Improvement:

Aim: Enhance the overall accuracy of OCR by minimizing errors in character recognition.

Objectives: Employ advanced machine learning and deep learning techniques to improve the
model's ability to correctly identify characters, even in challenging scenarios like distorted text
or low-quality images.
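
One common way to quantify "errors in character recognition" is the character error rate (CER): the edit distance between the OCR output and the reference text, divided by the reference length. The sketch below is a minimal version under that assumption; the synopsis itself does not name a metric, and the helper names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(predicted, reference):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        return 0.0 if not predicted else 1.0
    return edit_distance(predicted, reference) / len(reference)


# Hypothetical misrecognition: "ni" read as "m".
cer = character_error_rate("recogmtion", "recognition")
```

A lower CER means fewer character-level mistakes, so tracking it across model versions gives a simple measure of accuracy improvement.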

3. Language Support:

Aim: Support recognition of text in multiple languages.

Objectives: Train the OCR model to recognize and process text in various languages, ensuring a
broader application range and accessibility for users globally.

4. Document Layout Understanding:

Aim: Recognize and preserve the layout structure of documents.

Objectives: Implement features that understand the organization of text in documents, including
headers, footers, paragraphs, tables, and other structural elements, to maintain document
integrity during the OCR process.

5. Continuous Improvement:

Aim: Maintain and enhance OCR performance over time.

Objectives: Establish mechanisms for continuous learning and improvement, incorporating user
feedback and updating the model with new data to adapt to evolving patterns and challenges.
BACKGROUND STUDY
Deep learning solutions have been adopted widely, and organizations of all kinds, from tech
giants and established companies to startups, are now trying to incorporate deep learning (DL)
and machine learning (ML) into their current workflows. One of these solutions that has gained
considerable popularity over the past few years is the OCR engine.
OCR (Optical Character Recognition) is a technique for reading textual information directly
from digital and scanned documents without human intervention. These
documents can be in any format, such as PDF, PNG, JPEG, or TIFF. Using an OCR system has
several advantages:
• It increases productivity, since extracting information from documents takes very
little time.
• It saves resources, since an OCR program does the work
and no manual effort is required.
• It eliminates the need for manual data entry.
• It reduces the chance of error.

Extracting information from digital documents is comparatively easy, because they carry metadata
that can provide the text information. Scanned copies require a different solution, because
metadata does not help there. This is where deep learning is needed: it provides solutions for
extracting text information from images.

In this article, you will learn lessons for building a deep learning-based OCR
model, so that when you work on any such use case you can avoid the issues that I
faced during development and deployment.

What is deep learning-based OCR?

OCR has become very popular and has been adopted by several industries for faster
reading of text data from images. Solutions such as contour detection, image classification,
and connected component analysis work for documents that have comparable text size and
font, ideal lighting conditions, and good image quality, but such methods are not effective for
irregular, heterogeneous text, often called wild text or scene text. This text could come from a car’s
license plate, a house number plate, or a poorly scanned document (with no predefined conditions).
For these cases, deep learning solutions are used. Using DL for OCR is a three-step process and
these steps are:
METHODOLOGY

Decision function:

In this paper, a complete OCR methodology for recognizing historical documents, either printed
or handwritten, without any knowledge of the font is presented. This methodology
consists of three steps: the first two steps create a database for training using a
set of documents, while the third step recognizes new document images.
First, a pre-processing step that includes image binarization and enhancement takes
place. In the second step, a top-down segmentation approach is used to detect
text lines, words and characters. A clustering scheme is then adopted to group
characters of similar shape. This is a semi-automatic procedure, since the user can
intervene at any time to correct possible clustering errors and assign an ASCII
label. After this step, a database is created to be used for recognition. Finally,
in the third step, the same segmentation approach is applied to every new document image,
and recognition is based on the character database produced in
the previous step.
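
The segmentation and clustering steps described above can be sketched as follows. This is a simplified illustration, not the methodology's actual algorithm: it assumes an already binarized page stored as a list of rows (1 = ink, 0 = background), splits it with projection profiles (one common top-down technique), skips the word level, and groups glyphs with a greedy pixel-difference rule. All function names and the `max_difference` parameter are hypothetical.

```python
def runs_of_ink(profile):
    """Return (start, end) pairs, end exclusive, where the profile is non-zero."""
    runs, start = [], None
    for index, value in enumerate(profile):
        if value and start is None:
            start = index
        elif not value and start is not None:
            runs.append((start, index))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a binary page into text lines using row sums."""
    row_profile = [sum(row) for row in image]
    return [image[top:bottom] for top, bottom in runs_of_ink(row_profile)]


def segment_characters(line):
    """Split one text line into glyphs using column sums."""
    column_profile = [sum(column) for column in zip(*line)]
    return [[row[left:right] for row in line]
            for left, right in runs_of_ink(column_profile)]


def pixel_difference(a, b):
    """Count differing pixels; glyphs of different size are treated as unrelated."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return None
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def cluster_glyphs(glyphs, max_difference=1):
    """Greedy clustering: a glyph joins the first cluster whose representative is close."""
    clusters = []  # each cluster is a list of glyphs; element 0 is the representative
    for glyph in glyphs:
        for cluster in clusters:
            difference = pixel_difference(glyph, cluster[0])
            if difference is not None and difference <= max_difference:
                cluster.append(glyph)
                break
        else:
            clusters.append([glyph])
    return clusters


# Tiny made-up page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
]
lines = segment_lines(page)
glyphs = [glyph for line in lines for glyph in segment_characters(line)]
clusters = cluster_glyphs(glyphs)
```

In the semi-automatic procedure, a user would then inspect each cluster, correct wrong assignments, and attach an ASCII label, which yields the character database used for recognition.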

Working:
A scanner reads documents and converts them to binary data. The OCR software analyzes the
scanned image and classifies the light areas as background and the dark areas as text.
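
The light/dark classification can be illustrated with a global threshold. The sketch below picks the threshold with Otsu's method, which is an assumption made for this example (the synopsis does not name a thresholding technique), and it operates on a greyscale image given as a list of rows with values from 0 to 255.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, so this split is not usable
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(image):
    """Mark dark pixels (at or below the threshold) as text (1), light ones as background (0)."""
    threshold = otsu_threshold([value for row in image for value in row])
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Made-up 2x3 greyscale patch: low values are ink, high values are paper.
patch = [[250, 20, 245],
         [30, 25, 240]]
binary = binarize(patch)
```

A single global threshold works for evenly lit scans; degraded historical pages may need local or adaptive thresholding instead.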
[Figure: flow chart for methodology]
TOOLS AND TECHNIQUES TO BE USED

The main technologies used are:

• Python
• Boto3
• AWS
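
As a hedged sketch of how these tools might fit together, the example below calls Amazon Textract through Boto3. The choice of Textract is an assumption: the synopsis lists only "AWS" and does not say which service is used. The file path, the default region, and the helper names are hypothetical, and running `ocr_image_file` requires configured AWS credentials.

```python
def extract_lines(textract_response):
    """Collect the text of every LINE block from a DetectDocumentText response."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def ocr_image_file(path, region_name="us-east-1"):
    """Send a local image to Amazon Textract and return its text lines."""
    import boto3  # imported here so extract_lines works without AWS set up

    client = boto3.client("textract", region_name=region_name)
    with open(path, "rb") as handle:
        response = client.detect_document_text(Document={"Bytes": handle.read()})
    return extract_lines(response)


# Hand-written stand-in for a Textract response, used to exercise the parser offline.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "SYNOPSIS FOR MAJOR PROJECT"},
        {"BlockType": "WORD", "Text": "SYNOPSIS"},
        {"BlockType": "LINE", "Text": "OCR Model"},
    ]
}
text_lines = extract_lines(sample_response)
```

Keeping the response parsing separate from the network call makes the parsing logic testable without an AWS account.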

PROPOSED WORK

The large number of documents, either modern or historical, that we have in our possession
nowadays, due to the expansion of digital libraries, has highlighted the need for reliable and
accurate systems for processing them. Historical documents are of particular importance because
they are a significant part of our cultural heritage. During the last decades, a lot of research has
been done in the field of Optical Character Recognition (OCR). Numerous commercial
products have been released that convert digitized documents into text files, usually in ASCII
format. Although these products process machine-printed documents successfully, when it
comes to handwritten documents the results are not satisfactory. Moreover, such
products are unable to process historical documents due to their low quality, lack of standard
alphabets and presence of unknown fonts. For these reasons, recognition of historical documents is
one of the most challenging tasks in OCR.

In the literature, historical document processing is mainly focused on document retrieval.
Word-spotting techniques for searching and indexing historical documents have been
introduced. In one approach, word images are grouped into clusters of similar words by using
image matching to find similarity. Then, by annotating “interesting” clusters, an index that links
words to the locations where they occur can be built automatically. Holistic word recognition
approaches for historical documents have also been presented, based on scalar and profile-based
features and on matching word contours, respectively.
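
The index-building idea behind word spotting can be sketched as follows. The example assumes the image-matching step has already grouped word images into clusters, and it represents each occurrence as a hypothetical (page, line) pair; the cluster ids, labels, and function name are illustrative only.

```python
from collections import defaultdict


def build_word_index(clusters, annotations):
    """Map each annotated word to the sorted locations of its cluster members."""
    index = defaultdict(list)
    for cluster_id, locations in clusters.items():
        label = annotations.get(cluster_id)
        if label is None:
            continue  # clusters nobody annotated are left out of the index
        index[label].extend(locations)
    return {word: sorted(places) for word, places in index.items()}


# Made-up clusters of word-image locations, given as (page, line) pairs.
clusters = {0: [(1, 3), (2, 7)], 1: [(1, 5)], 2: [(3, 1), (1, 9)]}
# Only the "interesting" clusters receive a label from the annotator.
annotations = {0: "treaty", 2: "river"}
word_index = build_word_index(clusters, annotations)
```

Annotating one cluster labels every occurrence in it at once, which is what makes the index cheap to build compared with transcribing each word.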
