The document outlines a comprehensive framework for transaction analysis, emphasizing keyword matching, merchant codes, and synthetic data generation for machine learning. It also covers secure API integration for authentication and authorization, HTTP request structures, and error handling in web applications. Finally, it discusses logging, monitoring, and data preprocessing techniques essential for maintaining secure and efficient systems.


Documentation: DAY 1(12/12/2024)

1. Keyword Matching and Merchant Code


• Keyword Matching: Transactions can be classified by matching keywords in the
transaction description. For example, terms like "grocery", "restaurant", or "fuel" can be
used to categorize expenses.
• Merchant Code: Merchant codes provide precise categorization by identifying the type
of business. They can also indicate where a transaction took place, offering
geographic insights.
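The keyword-matching idea above can be sketched in a few lines of Python; the category-to-keyword mapping is an illustrative assumption, not a fixed taxonomy:

```python
# Minimal sketch of keyword-based transaction categorization.
# The keyword lists below are illustrative assumptions.
CATEGORY_KEYWORDS = {
    "Groceries": ["grocery", "supermarket"],
    "Dining": ["restaurant", "cafe"],
    "Transport": ["fuel", "gas station"],
}

def categorize(description: str) -> str:
    desc = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in desc for kw in keywords):
            return category
    return "Uncategorized"  # fall back when no keyword matches

print(categorize("SHELL FUEL STATION #42"))  # Transport
```

In practice the keyword table would be much larger, and merchant codes (where available) would take precedence over keyword matches.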

2. Fetching Bank Details Through API


• API Integration:
  o Use secure bank APIs that provide transaction data.
• Advantages of Using APIs:
  o Real-time data retrieval.
  o Direct integration with banking systems ensures accuracy and reliability.
  o APIs allow selective access to data, enhancing security and privacy.
• Secure Transactions:
  o During login, provide a checkbox option for "Secure Transactions."
  o When enabled, this feature anonymizes sensitive details, such as names and
personal identifiers, from transaction data.
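One way the anonymization described above could work is to replace sensitive fields with salted hashes. This is a sketch of the idea only; the field names and the salt handling are assumptions:

```python
import hashlib

SALT = "demo-salt"  # assumption: a real system would use a per-user secret

def anonymize(record: dict, sensitive_fields=("name", "account_id")) -> dict:
    # Replace sensitive fields with short pseudonymous tokens,
    # leaving non-sensitive fields (e.g., amount) untouched.
    out = dict(record)
    for field in sensitive_fields:
        if field in out:
            digest = hashlib.sha256((SALT + str(out[field])).encode()).hexdigest()
            out[field] = digest[:12]
    return out

txn = {"name": "Alice", "account_id": "1234", "amount": 250.0}
print(anonymize(txn))
```

The same input always maps to the same token, so spending patterns can still be analyzed without exposing the underlying identity.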

3. Why Large Language Models (LLMs) Are Not Used


• Limitations of LLMs:
  o LLMs cannot fetch real-time transaction data as they are not connected to live
databases.
  o Dependence on pre-existing training data makes them less effective for specific,
dynamic financial insights.

4. Synthetic Data Generation


• Definition: Synthetic data involves creating demo data that mimics real transaction
records while hiding sensitive information. It is useful for training and testing machine
learning models.
• Sources for Synthetic Data:
  o Public Datasets: Utilize publicly available datasets from platforms like Kaggle or
Google Dataset Search.
  o Third-Party Applications:
     Plaid: Application that provides secure transaction data, with a free tier.
     Yodlee: Subscription-based application with advanced data
anonymization features.
  o Custom Dataset Generation:
     Use spreadsheet software to manually create datasets.
     Ensure the data looks realistic by including common transaction attributes
(e.g., amount, merchant, date).

5. Model Implementation
• Tools and Environment:
  o Use Visual Studio (VS) and Anaconda Navigator for development and testing.
• Training the Model:
  o Train machine learning models on categorized transaction data.
  o Incorporate features like keywords, merchant codes, and spending patterns.
• Analytics and Reporting:
  o Analyze spending behavior and generate reports.
  o Include metrics like total spending, category-wise breakdown, and transaction
frequency.

6. Pattern Analysis Without Merchant Code


• Keyword-Based Classification:
  o If merchant codes are unavailable, use keyword matching to classify transactions.
• Setting Limits:
  o Establish thresholds (e.g., amount ranges) to group transactions into categories.
• Repetitive Patterns:
  o Identify recurring transactions based on similar descriptions or amounts. Use
these patterns for classification.
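The recurring-pattern idea can be sketched by normalizing descriptions and counting repeats; the minimum-count threshold is an assumption:

```python
from collections import Counter

def find_recurring(transactions, min_count=2):
    # Group transactions on a normalized description and flag any
    # description that appears at least min_count times.
    counts = Counter(t["description"].lower().strip() for t in transactions)
    return {desc for desc, n in counts.items() if n >= min_count}

txns = [
    {"description": "NETFLIX.COM", "amount": 15.49},
    {"description": "netflix.com ", "amount": 15.49},
    {"description": "ACME HARDWARE", "amount": 42.00},
]
print(find_recurring(txns))  # {'netflix.com'}
```

A fuller version might also require the amounts to match within a tolerance, so that recurring subscriptions are distinguished from repeated visits to the same shop.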
7. Report Generation
• Spending Summary:
  o Calculate the total amount spent within a given period.
  o Display spending trends over time.
• Category Analysis:
  o Provide insights into the proportion of spending across various categories.
• Visual Representation:
  o Use charts and graphs for clear data visualization.

Conclusion
This framework ensures efficient transaction analysis by leveraging merchant codes, keyword
matching, and synthetic data. The use of machine learning for analytics and secure handling of
sensitive data fosters both functionality and compliance with privacy standards.

Documentation: DAY 2(13/12/2024)

Concepts that Keep Systems Secure: Authentication and Authorization


Authentication and authorization are fundamental components of a secure system, ensuring that
only legitimate users gain access to resources while maintaining controlled access levels. They
act as a bridge between users and systems, verifying and granting permissions accordingly.

Authentication
Authentication is the process of verifying the identity of a user or entity. It ensures that only
authorized individuals can access the system.
Role of Authentication
• Acts as a gateway to confirm the user's identity.
• Allows entry to the system after verification.
Methods of Authentication
1. Tokens
o Tokens are special digital keys that enable users to stay logged in during a
session.
o These temporary credentials eliminate the need to repeatedly enter passwords for
short periods.
2. Biometric Authentication
o Utilizes unique biological characteristics such as fingerprints, facial recognition,
or retinal scans to authenticate users.

Authorization
Authorization determines the specific resources or actions a verified user is allowed to access. It
focuses on defining permissions for authenticated users.
Role of Authorization
• Grants access to specific features or data based on user roles or privileges.
• Ensures controlled usage of system resources.
Techniques for Authorization
1. OAuth (Open Authorization)
o A widely-used protocol that enables secure delegated access to resources without
sharing credentials.
o Example: Allowing an app to access your social media account without exposing
your password.
2. Multi-Factor Authentication (MFA)
o Combines two or more authentication methods (e.g., password + OTP or
password + biometric) to enhance security.

Need for APIs


APIs (Application Programming Interfaces) play a critical role in implementing authentication
and authorization systems. They act as an intermediary between the user interface and the back-
end system, ensuring smooth verification and access control.
Key Roles of APIs
• Facilitate communication between the user interface and the back-end system.
• Enable seamless integration of authentication and authorization mechanisms.
• Allow developers to implement standardized, secure practices.
By integrating authentication and authorization through robust APIs, systems can achieve
enhanced security and a streamlined user experience.

Documentation: DAY 3(16/12/2024)

HTTP Request and API Documentation


HTTP (Hypertext Transfer Protocol) is the foundational protocol used to communicate with
servers. It facilitates data transfer over the web.

HTTP Structure
HTTP is divided into two main components:
1. Request - Initiated by the client to request resources from the server.
2. Response - Sent by the server as a reply to the client's request.

HTTP Methods
HTTP methods define the actions to be performed on resources. Common methods include:
1. GET: Retrieves data from the server.
2. POST: Sends data to the server to create a new resource.
3. PUT: Replaces or modifies the entire resource.
4. DELETE: Deletes the specified resource.

Additional HTTP Methods:


• HEAD: Retrieves only the response headers, without the body. Useful for checking
metadata (e.g., content type or length) before downloading.
• PATCH: Updates specific or partial information in a resource.

Components of an HTTP Request


HTTP requests consist of four key components:
1. URL:
o Uniform Resource Locator (URL) denotes the address of the resource.
2. Headers:
o Contain metadata about the request and provide additional information about the
data.
o Categories include:
 Content-Type: Specifies the format of the data being sent or requested, e.g.,
JSON or PDF.
 Authentication: Uses API keys or tokens to authenticate requests.
 Accept: Indicates which response formats the client can process.
3. Query Parameters:
o Provide additional information in the request.
4. Body:
o Contains data sent to the server, often in formats like JSON. Used with POST or
PUT methods.
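The four components can be seen together by constructing (without sending) a request with Python's standard urllib; the endpoint URL and API key are placeholders:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

params = {"verbose": "true"}                                   # query parameters
url = "https://api.example.com/v1/items?" + urlencode(params)  # URL (placeholder)
body = json.dumps({"name": "sample"}).encode()                 # body, JSON-encoded

req = Request(
    url,
    data=body,
    headers={"Content-Type": "application/json",  # headers: metadata about the request
             "x-api-key": "demo-key"},            # headers: authentication
    method="POST",                                # HTTP method
)
print(req.get_method())  # POST
print(req.full_url)      # query string appended to the URL
```

Nothing is sent over the network here; `urlopen(req)` would perform the actual POST.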

HTTP Response
HTTP responses are categorized into three main components:
1. Status Line:
o Includes the HTTP version and status code.
2. Header:
o Provides metadata about the response.
3. Body:
o Contains the requested resource or error message.

Common HTTP Status Codes:


• 1XX (Informational):
  o Server has received the request and is processing it.
• 200 (OK):
  o The request was successful.
• 301 (Moved Permanently):
  o The resource has been moved to a different URL.
• 400 (Bad Request):
  o The server cannot process the request due to a client error.
• 401 (Unauthorized):
  o Authentication is required.
• 404 (Not Found):
  o The requested resource does not exist.
• 500 (Internal Server Error):
  o An error occurred on the server.
• 503 (Service Unavailable):
  o The server is currently unavailable.

File Formats and API


APIs often use specific file formats to exchange data. Common formats include:
JSON (JavaScript Object Notation):
• Lightweight and easy to read and write.
• Machine-readable, uses key-value pairs.
Advantages:
• Simple, less verbose format.
• Efficient data transfer.
• Built-in support in most languages and browsers.
• Easy to debug.
Disadvantages:
• Does not support comments, making complex data harder to annotate.
• Supports only a limited set of data types.
XML (Extensible Markup Language):
• Encodes data in a flexible, nested structure.
Advantages:
• Extensible and allows user-defined tags.
• Schema support to validate data.
• Ensures data integrity.
Disadvantages:
• More verbose.
• Complex structure.
Other Formats:
• Plain Text: Used for raw data without structure.
• HTML: Markup language for structuring web pages.
• CSV: Used for tabular data with rows and columns.

Documentation: DAY 4(17/12/2024)

Code:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/", methods=["GET"])
def home():
    return jsonify(message="Welcome to the home page")

@app.route("/hello", methods=["GET"])
def hello_world():
    return jsonify(message="Hello, world")

if __name__ == "__main__":
    app.run(debug=True)

1. Importing Necessary Modules


• Flask: This is the class from the Flask module used to create the web application.
• jsonify: A helper function that converts Python objects into JSON format.

2. Creating the Flask Application


app = Flask(__name__)
• Flask(__name__): Creates an instance of the Flask application. The __name__ argument
tells Flask which module the application lives in, so it can locate resources correctly.

3. Defining Routes
@app.route("/", methods=["GET"])
In Flask, routes are defined using the @app.route() decorator, which links a URL to a function.
When the URL is accessed, the function is executed.
HTTP Methods: Routes can handle methods like GET, POST, PUT, and DELETE.

4. Running the Application


if __name__ == "__main__":
    app.run(debug=True)
• This block ensures that the Flask app runs only when the script is executed
directly, not when it is imported as a module.
• app.run(debug=True): Starts the Flask development server with debugging enabled. This
allows real-time error tracking and auto-reloading.
Documentation: DAY 5(18/12/2024)

ADDING SECURE KEY

CODE

• Secure Key: The application uses a secret API key (API_KEY) stored in the server-side
code.
• Request Handling: A decorator (require_api_key) checks the request's x-api-key header to
ensure the correct API key is provided before allowing access to the secure route.
• Wrapper Layer: The decorator acts as an extra layer of security, ensuring that all routes
requiring API key validation are protected consistently.
• Headers: The client must include the API key in the x-api-key header to successfully access
protected resources.
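The original code for this day is not reproduced in this copy. A framework-agnostic sketch of the decorator idea is shown below; the key value and the handler signature (a plain headers dict instead of Flask's request object) are assumptions made for illustration:

```python
from functools import wraps

API_KEY = "secret-key-123"  # assumption: kept server-side, never shipped to clients

def require_api_key(handler):
    # Wrapper layer: every decorated route gets the same x-api-key check.
    @wraps(handler)
    def wrapper(headers, *args, **kwargs):
        if headers.get("x-api-key") != API_KEY:
            return {"status": 401, "error": "invalid or missing API key"}
        return handler(headers, *args, **kwargs)
    return wrapper

@require_api_key
def secure_route(headers):
    return {"status": 200, "message": "access granted"}

print(secure_route({"x-api-key": "secret-key-123"}))  # status 200
print(secure_route({}))                               # status 401
```

In the Flask version described above, the wrapper would read the header from `request.headers` and return `jsonify(...), 401` instead of a plain dict.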
TASK: WEATHER DATA EXTRACTION USING API

CODE
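The extraction code itself is not included in this copy. Below is a sketch of how a response shaped like the output that follows could be summarized; the field names match the OpenWeatherMap-style output shown, and the temperatures in it are in kelvin:

```python
def summarize_weather(data: dict) -> str:
    # Pull a few fields out of an OpenWeatherMap-style response dict.
    city = data["name"]
    kelvin = data["main"]["temp"]
    celsius = kelvin - 273.15          # the API returns kelvin by default
    description = data["weather"][0]["description"]
    return f"{city}: {celsius:.1f}°C, {description}"

# Trimmed-down version of the response shown in the output below.
sample = {
    "name": "London",
    "main": {"temp": 279.57},
    "weather": [{"description": "scattered clouds"}],
}
print(summarize_weather(sample))  # London: 6.4°C, scattered clouds
```

Fetching the live data would be a `requests.get` call to the current-weather endpoint with the city name and an API key as query parameters.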

OUTPUT:
Enter the city name: LONDON
{
  "base": "stations",
  "clouds": {
    "all": 40
  },
  "cod": 200,
  "coord": {
    "lat": 51.5085,
    "lon": -0.1257
  },
  "dt": 1734605122,
  "id": 2643743,
  "main": {
    "feels_like": 275.25,
    "grnd_level": 1004,
    "humidity": 78,
    "pressure": 1008,
    "sea_level": 1008,
    "temp": 279.57,
    "temp_max": 280.01,
    "temp_min": 278.99
  },
  "name": "London",
  "sys": {
    "country": "GB",
    "id": 2075535,
    "sunrise": 1734595375,
    "sunset": 1734623556,
    "type": 2
  },
  "timezone": 0,
  "visibility": 10000,
  "weather": [
    {
      "description": "scattered clouds",
      "icon": "03d",
      "id": 802,
      "main": "Clouds"
    }
  ],
  "wind": {
    "deg": 280,
    "speed": 7.72
  }
}

Documentation: DAY 6(19/12/2024)

Rate Limiting and Throttling

• Rate Limiting: Specifies the number of requests allowed per unit of time (e.g., 5 requests per
minute). If the limit is exceeded, the server will respond with a 429 Too Many Requests status
code.
• Throttling: Often used interchangeably with rate limiting. It limits the rate at which a client
can interact with an API over a period of time.
CODE

Error Handling
Error handlers allow you to customize responses to different types of errors (e.g., 404,
500).

CODE
TASK: RATE LIMITING AND ERROR HANDLING

CODE:RATE LIMITING

from flask import Flask, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["5 per minute"])

@app.route('/login')
@limiter.limit("2 per minute")
def login():
    return jsonify(message="Too many requests will block you temporarily")

if __name__ == '__main__':
    app.run(debug=False)

CODE:ERROR HANDLING

from flask import Flask, jsonify

app = Flask(__name__)

@app.errorhandler(404)
def not_found(error):
    return jsonify({'error': 'Resource not found'}), 404

@app.errorhandler(500)
def server_error(error):
    return jsonify({'error': 'Something went wrong'}), 500

@app.errorhandler(400)
def bad_request(error):
    return jsonify({'error': 'Bad Request - Please check your input'}), 400

@app.errorhandler(403)
def forbidden(error):
    return jsonify({'error': 'Forbidden - You do not have permission to access this resource'}), 403

if __name__ == '__main__':
    app.run(debug=False)

Documentation: DAY 7(20/12/2024)

Logging and Monitoring API Usage


• Logging: Captures specific events (e.g., request access, errors) and saves them in a log file
for monitoring.
• Monitoring: Helps developers and administrators keep track of application behavior and
diagnose issues.

CODE
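The logging code for this day is not reproduced here. A minimal sketch of the idea using Python's standard logging module; the log file name and record format are assumptions:

```python
import logging

# Write events (request access, errors) to a log file with timestamps.
logger = logging.getLogger("api")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("api_usage.log")  # assumption: log file name
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

def handle_request(path: str, status: int) -> None:
    # One log line per request: path and response status.
    logger.info("path=%s status=%d", path, status)
    if status >= 500:
        logger.error("server error on path=%s", path)

handle_request("/hello", 200)
handle_request("/broken", 500)
```

In a Flask app, the same `logger.info` call would typically sit in a `before_request` or `after_request` hook so every route is logged automatically.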
API Versioning

Allows you to evolve your API while maintaining support for existing clients, facilitating the
introduction of new features without breaking the old ones.

CODE
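The versioning code is omitted in this copy. The idea of serving v1 and v2 side by side can be sketched with version-prefixed routes; the paths and payloads are illustrative assumptions:

```python
# Map version-prefixed paths to handlers so existing clients keep
# working while new clients get the extended response.
def users_v1():
    return {"users": ["alice", "bob"]}

def users_v2():
    # v2 adds ids without breaking the meaning of the v1 response
    return {"users": [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]}

ROUTES = {
    "/api/v1/users": users_v1,
    "/api/v2/users": users_v2,
}

def dispatch(path: str):
    handler = ROUTES.get(path)
    return handler() if handler else {"error": "not found", "status": 404}

print(dispatch("/api/v1/users"))
print(dispatch("/api/v2/users"))
```

In Flask the same effect is achieved by registering the two handlers under `/api/v1/...` and `/api/v2/...` routes (often via blueprints with a `url_prefix`).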

TASK: To Create a Google Login


Python code
OUTPUT:
Documentation: DAY 8(23/12/2024)
Token and Data Preprocessing Documentation
Token
Tokens are used in various systems to verify and authenticate users, maintain sessions, and
manage secure interactions between a client and a server.
1. Authentication Token
Purpose: To verify the identity of a user.
How it works:
• After a user logs in, the server issues an authentication token.
• This token is sent with subsequent requests to confirm that the request is from an
authenticated user.
2. Session Token
Purpose: To track a user's session.
How it works:
• A session token is generated when a user logs in and is stored either in a cookie or in
memory on the client side.
• The server uses this token to identify the user across different requests and maintain the
state (session) between requests.
Example:
• After logging in, the user is assigned a session ID. This session ID is sent with every
request as a cookie to identify the user on the server.
3. Refresh Token
Purpose: To obtain a new authentication token when the original one expires.
How it works:
• Refresh tokens are long-lived tokens that allow clients to get new short-lived access
tokens without requiring the user to log in again.
Example:
• When an access token expires, the client sends the refresh token to the server to obtain a
new access token.

Data Preprocessing
Data pre-processing is a critical step in preparing a dataset for analysis or machine learning. The
quality of the data directly influences the performance of machine learning models or statistical
analysis. Pre-processing ensures that the data is clean, consistent, and ready for modeling.
1. Handling Missing Data
Missing data can occur for various reasons and can affect the quality of your model. Handling
missing values is a key aspect of data pre-processing.
Steps:
1. Identify Missing Data:
o Check for missing values using descriptive statistics or visualization techniques.
2. Imputation:
o Fill missing values with specific values such as:
 Mean: Suitable for continuous data.
 Median: Useful when the data has outliers.
 Mode: For categorical data.
o Use advanced techniques like interpolation or machine learning models for
imputation.
3. Remove Rows or Columns:
o If the amount of missing data is substantial, consider removing rows or columns
with missing values.

Documentation: DAY 9(24/12/2024)

1. What is Data?
Data refers to raw facts, figures, or information that can be processed, analyzed, and interpreted
to derive meaning, insights, or make decisions. It is a collection of values, observations, or
measurements that can represent various things, such as:
• Numbers
• Text
• Images
• Sounds
• Sensor readings
2. Types of Data
a. Structured Data
Structured data is highly organized and typically stored in rows and columns, such as in
databases. Examples include:
• A table of employee names, IDs, and salaries.
• Sales records stored in a relational database.
b. Unstructured Data
Unstructured data lacks a predefined format and is more complex to analyze. Examples include:
• Text (e.g., social media posts, customer reviews)
• Images
• Audio files
• Videos
c. Semi-Structured Data
Semi-structured data is not strictly organized into rows and columns but has some structure or
markers for organization. Examples include:
• JSON files
• XML files
3. What is Data Transformation?
Data transformation refers to the process of converting one type or format of data into another.
This conversion can be essential to make the data usable for specific applications or systems.
Examples include:
• Converting semi-structured data into structured data.
• Transforming unstructured data into structured data for analysis.
Example: Parsing JSON files (semi-structured) into relational tables (structured) for querying
and analysis.

Documentation: DAY 10(30/12/2024)


Handling Missing Data in Machine Learning
1. Dealing with Missing Data Based on Percentage
• 5% Missing Data:
  o When the missing data constitutes only a small fraction (e.g., 5%) of the dataset, it
is generally safe to remove the missing data without significantly impacting the
overall analysis or model performance.
• 30% Missing Data:
  o When the missing data is substantial (e.g., 30%), removing it may result in a
significant loss of valuable information.
  o In such cases, the missing data is filled (imputed) to retain as much data as
possible for better model accuracy.

2. Imputation Techniques
Imputation involves filling missing data by estimating plausible values. For numerical data,
common methods include:
• Mean Imputation:
  o Fill missing values with the average of available data.
  o Formula: Mean = (x₁ + x₂ + … + xₙ) / n
• Median Imputation:
  o Arrange data in sequential order and choose the middle value as the replacement.
• Mode Imputation:
  o Replace missing values with the most frequently occurring value.
• Forward Fill:
  o Use the last known value to fill missing data.
• Backward Fill:
  o Use the next known value to fill missing data.
Note: These methods may not provide highly accurate results but can achieve 80-90% accuracy,
which is often sufficient for many applications.
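The numerical techniques above can be sketched in plain Python (a pandas `fillna`/`ffill` version would look similar; this is illustrative only):

```python
from statistics import mean, median

data = [10.0, None, 14.0, None, 18.0]  # None marks missing values

present = [x for x in data if x is not None]

# Mean and median imputation: fill gaps with a summary of the known values.
mean_filled = [x if x is not None else mean(present) for x in data]
median_filled = [x if x is not None else median(present) for x in data]

def forward_fill(values):
    # Carry the last known value forward into each gap.
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

print(mean_filled)         # [10.0, 14.0, 14.0, 14.0, 18.0]
print(forward_fill(data))  # [10.0, 10.0, 14.0, 14.0, 18.0]
```

Backward fill is the mirror image: iterate from the end and carry the next known value backward.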

3. Using Prediction Models


Prediction models leverage machine learning to estimate missing values by analyzing the
relationships and patterns in the data. For example:
Age   Height   Math Score
15    160 cm   85
16    170 cm   Missing
14    165 cm   80

• ML models can predict missing data (e.g., the Math Score for age 16) by identifying hidden
patterns and relations between attributes (e.g., Age, Height).

4. Interpolation
Interpolation involves estimating missing values based on trends observed in the data. For
example:
Day   Temperature
Mon   20°C
Tue   Missing
Wed   22°C

• Using interpolation, the missing value for Tuesday can be estimated as 21°C.
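The Tuesday estimate follows from simple linear interpolation between the neighbouring days; a plain-Python sketch (pandas' `Series.interpolate` does the same for tabular data):

```python
def linear_interpolate(values):
    # Fill each None by placing it on the straight line between its
    # nearest known neighbours (simple linear interpolation).
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            lo = next(j for j in range(i, -1, -1) if out[j] is not None)
            hi = next(j for j in range(i, len(out)) if out[j] is not None)
            frac = (i - lo) / (hi - lo)
            out[i] = out[lo] + frac * (out[hi] - out[lo])
    return out

temps = [20.0, None, 22.0]        # Mon, Tue (missing), Wed
print(linear_interpolate(temps))  # [20.0, 21.0, 22.0]
```

This sketch assumes each gap has known values on both sides; leading or trailing gaps would need forward/backward fill instead.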

5. Assigning an Unknown Category


• For categorical data, missing values can be replaced with an "Unknown" category.
• This approach ensures that the missing values are accounted for without introducing
inaccuracies.
When to Use Specific Methods
1. Remove Missing Data:
o When the percentage of missing data is very low (e.g., 5%).
2. Impute Missing Data:
o When dealing with numerical data within a reasonable range.
3. Use Prediction Models:
o When there are three or more attributes related to each other in ways that are not
easily discernible by humans but can be analyzed by machines.
4. Use Interpolation:
o When there is a discernible trend in the data.
5. Assign Unknown Category:
o When it is not meaningful to impute or remove the data, document the missing
data as "Unknown."
By carefully considering the nature of the data and the context, the appropriate method for
handling missing data can be selected to ensure optimal machine learning performance.

TASK: Implementing the data filling operation.

Removing Missing data

Output:
Imputation

Output:
Prediction Model
Output:

Interpolation
Output:

Assigning unknown to the missing value


Output:

Documentation: DAY 11(31/12/2024)


Optical Character Recognition (OCR)
Introduction
Optical Character Recognition (OCR) is a technology that enables the conversion of different
types of documents, such as scanned paper documents, PDFs, or images captured by a camera,
into editable and searchable data. OCR bridges the gap between human-readable documents and
machine-readable data by recognizing and converting characters into digital text.

Existence of OCR
The origins of OCR date back to the early 20th century. The first OCR machines were developed
for the visually impaired, allowing printed materials to be read aloud. The technology evolved
significantly with the advent of computers in the mid-20th century.
Milestones in OCR Development:
1. Early 1920s: Early machines identified characters by passing light through printed
text.
2. 1929: Gustav Tauschek developed the first OCR machine.
3. 1950s: The development of digital computers advanced OCR technology, allowing
machines to read printed text. These systems recognized letters, numbers, and words,
automatically reading and typing them.
4. 1970s: IBM developed OCR systems that compared scanned image pixels to identify
characters, enabling recognition of symbols and fonts. The converted text could be stored
in machine memory or even converted into audio.
5. 1990s: OCR technology expanded into various industries with the rise of digitization.
6. 2000s and beyond: With the advent of machine learning and deep learning, OCR
achieved higher accuracy and became integral to modern applications.

Components of OCR
OCR systems consist of several key components that work together to achieve text recognition:
1. Image Acquisition:
• Captures images of the text using scanners, cameras, or other devices.
2. Preprocessing:
• Improves the quality of the input image for better recognition. Common preprocessing
steps include:
  o Noise reduction
  o Binarization (converting to black and white)
  o Deskewing (correcting image alignment)
  o Normalization
3. Segmentation:
• Divides the image into smaller regions, such as characters, words, or lines.
4. Feature Extraction:
• Extracts essential features from the segmented characters, such as edges, strokes, and
contours.
5. Recognition:
• Applies algorithms, often powered by machine learning, to identify and convert the
characters into digital text.
6. Post-processing:
• Refines the recognized text to improve accuracy. This includes:
  o Spell checking
  o Contextual analysis

How OCR is Used in Real-Time Applications


OCR has found extensive use in real-world applications across various industries:
1. Document Digitization:
• Use Case: Libraries and archives digitize books, manuscripts, and records to preserve and
make them searchable.
2. Banking and Finance:
• Use Case: Automatic processing of checks and forms, extracting information from
invoices and receipts.
3. Healthcare:
• Use Case: Digitizing patient records, prescriptions, and handwritten medical notes for
easier storage and access.
4. Retail and E-commerce:
• Use Case: Extracting product details from images, barcodes, or QR codes for inventory
management.
5. Traffic Management:
• Use Case: Automatic Number Plate Recognition (ANPR) systems use OCR to identify
vehicle license plates for toll collection and law enforcement.
6. Translation Applications:
• Use Case: OCR identifies text in images, translating it into different languages for travel
or communication.
7. Accessibility Tools:
• Use Case: Assists visually impaired individuals by converting printed text into audio or
braille.
8. Identity Verification:
• Use Case: Extracting text from IDs, passports, and other documents for online KYC
(Know Your Customer) processes.

Conclusion
OCR is a transformative technology that bridges the gap between analog and digital data. With
its ability to process large volumes of text efficiently, OCR has become indispensable in many
industries, driving automation, accessibility, and innovation.

Documentation: DAY 12(2/1/2025)

OCR for Banking: Enhancing Accuracy in Scanned Document Processing


Introduction: Optical Character Recognition (OCR) technology has become a pivotal tool for
regional banks, particularly in processing scanned copies of documents. OCR enables banks to
extract text from scanned PDFs, images, and other unreadable formats, making it possible to
digitize and analyze critical data efficiently. A notable use case involves capturing specific
context from a group of contexts, leveraging OCR to enhance operational workflows.
How OCR Works:
1. Preprocessing:
o This step involves cleaning the image to improve OCR accuracy. Common
preprocessing techniques address blind spots, color palette differences, and blur.
These adjustments ensure the text is well-prepared for segmentation and
recognition.
2. Text Segmentation:
o OCR identifies text in three primary ways:
 Entire rows of text: Recognizing complete lines of text.
 Individual text with spaces: Separating text blocks based on spacing.
 Individual characters: Breaking down text into individual characters for
detailed recognition.
3. Character Recognition:
o OCR systems analyze each character within the segmented text and convert the
image-based text into a machine-readable format.
4. Postprocessing:
o Errors in OCR recognition may arise, such as confusing similar characters (e.g.,
the number “0” and the letter “O”, or Roman numerals). Postprocessing helps
rectify these errors by validating and refining the output.
Preprocessing Techniques for Better OCR Results:
1. Grayscale Conversion:
o Converts images to black and white, simplifying the detection of text and
improving accuracy.
2. Binarization:
o Sharpens text by converting it into fully black regions, removing background
noise, and enhancing visibility.
3. Noise Removal:
o Eliminates unnecessary spots or data that do not form part of the text.
4. Deskewing:
o Corrects tilted text by straightening the image, ensuring accurate recognition.

5. Thresholding:
o Enhances crucial parts of the image, such as text, by removing noise and
emphasizing relevant data. This differentiation allows OCR to isolate text more
effectively.
Applications in Banking:
• Extracting data from scanned financial statements, invoices, and handwritten forms.
• Converting large volumes of legacy documents into digital formats.
• Improving document management systems by enabling quick search and retrieval.
• Enhancing customer service by automating data extraction processes.

Documentation: DAY 13(3/1/2025)

What is an OCR Engine?


An OCR (Optical Character Recognition) engine is a software tool or library designed to extract
text from images, scanned documents, or other visual inputs. It works by analyzing the visual
data, identifying characters, and converting them into machine-readable text. OCR engines are
integral to applications that digitize text-based content for storage, editing, searching, or further
processing.
Main Tools in OCR:
• Tesseract
• Google Cloud OCR
• Microsoft OCR

Why Use Tesseract for OCR?


Tesseract is a widely-used, free, and open-source OCR engine initially developed by HP and
later improved by Google. It offers multiple language support and high accuracy, especially with
clean and structured text. Tesseract is particularly beneficial in applications such as:
 Document digitization
 Automated data entry
 Preprocessing for machine learning tasks
Key Features:
 Open-source and free for developers
 Multi-language support
 High accuracy for structured text

Versions of Tesseract
Tesseract 1.x (1980s)
• Initial development by HP
• Only capable of extracting one column of text
• Limited capabilities and basic OCR functionality
• Released as open-source in 2005
Tesseract 2.x (2005)
• Transitioned to open-source maintenance by HP
• Improved to extract multi-column text
• Enhanced layout analysis and processing pipelines
• Basic OCR functionality, primarily for English
Tesseract 3.x (2010)
• Maintained by Google
• Added support for multiple languages
• Introduced improved OCR engine accuracy
• Training capabilities for custom languages and fonts
Tesseract 4.x (2018)
• Major update with LSTM-based (Long Short-Term Memory) neural network integration
• Improved recognition accuracy for complex and unstructured text
• Enhanced performance using modern machine learning techniques
• Support for PDFs and multi-page documents
• Backward compatibility with the legacy OCR engine
Tesseract 5.x (2021)
• Continued improvements in LSTM-based OCR accuracy
• Optimizations for faster processing
• Broader support for additional languages and scripts
• Bug fixes and new features, including improved PDF generation

How to Check for Changes in Tesseract Code


Since Tesseract is open-source, its code can be modified by anyone. To identify changes made to
the code, compare the modified version with the original or official version using a version
control system like Git.
Steps to Check Changes:
1. Check Commit History:
o Run git log to view the commit history.
o Look for unusual commits, authors, or messages that indicate modifications.
2. Trace Specific Changes:
o Use git blame to trace changes to specific lines of code and identify who made
them.
By employing these methods, you can track and verify any modifications made to the Tesseract
source code.

TASK:
To Extract text from image using tesseract
INPUT

CODE

OUTPUT:
Documentation: DAY 14(6/1/2025)

Image Preprocessing Techniques for Accurate OCR Output


To achieve accurate Optical Character Recognition (OCR) results, preprocessing techniques are
crucial.

1. Grayscale Conversion
• Converts the RGB image into a grayscale image by reducing the three color channels
(Red, Green, and Blue) into a single intensity channel.
• This simplifies the image and reduces computational complexity.
Why:
• OCR algorithms focus on text structure and patterns, not colors.
• Grayscale conversion reduces image intensity to a single channel, which improves clarity
for OCR processing.
Process:
• Break the RGB image into pixels.
• Calculate the grayscale equivalent using a weighted sum of the R, G, and B values (e.g.,
0.2989*R + 0.5870*G + 0.1140*B).
• Convert the RGB pixels to grayscale.

2. Thresholding
Description:
Thresholding converts the grayscale image into a binary image (black and white). This step
focuses on distinguishing the text from the background for improved OCR detection.
Why:
 Binary images are simpler and more effective for OCR engines.
 Eliminates noise and non-essential details, focusing only on text regions.
Subtypes:
1. Global Thresholding:
o Applies a single threshold value across the entire image.
o Pixels with intensity above a chosen value (e.g., 127) are converted to white, and
those below it are converted to black.
Example: If a pixel intensity is greater than 127, it becomes white (1); otherwise, it becomes
black (0).
2. Adaptive Thresholding:
o Divides the image into smaller blocks and applies thresholding locally within
each block.
o This method handles varying lighting conditions and intensity levels better than
global thresholding.
Advantages:
o Reduces noise specific to regions.
o Enhances text detection in images with uneven illumination.

3. Otsu's Binarization
Description:
Otsu's Binarization is an advanced thresholding technique that automatically determines the
optimal threshold value from the image's intensity histogram, choosing the value that
minimizes the intra-class (within foreground/background) intensity variance.
Why:
 Eliminates the need for manually setting a threshold value.
 Ideal for images with varying intensity distributions.
Process:
 Calculate the histogram of grayscale intensities.
 Use the algorithm to determine a threshold that separates the foreground (text) from the
background.
 Apply the threshold to convert the image into binary format.
TASK:
Preprocess the image and extract the text
INPUT:

CODE: (Grayscale Conversion)
OUTPUT:

CODE: (Global Thresholding)
OUTPUT:
CODE: (Otsu's Binarization)

Output:
Documentation: DAY 15(7/1/2025)

OpenCV and Tesseract OCR


OpenCV (Open Source Computer Vision Library):
OpenCV is a powerful library for image and video processing. While it significantly enhances
image quality through advanced processing techniques, it cannot recognize text on its own.
Tesseract OCR:
Tesseract specializes in extracting text from images, but it only accepts image formats such as
JPG and PNG. It does not handle PDF or DOC files directly; these must first be converted to
images with a library such as pdf2image (text-based PDFs can instead be read directly with
PyPDF2 or pdfplumber).

Lifecycle of OCR
1. Input Preparation:
o Take an image directly or convert a PDF into images using libraries like
pdf2image.
2. Preprocessing (Enhance Image Quality):
o Improve accuracy by enhancing image quality.
o Steps include:
 Noise Removal: Remove unwanted noise from the image.
 Blurry Spot Removal: Enhance blurred regions using denoising
techniques.
3. Edge Detection (Boundary Detection):
o Detect boundaries and remove unwanted areas of the image.
o Image Resizing: Crop or resize the image if it is too large or too small.
4. Segmentation:
o OCR automatically detects and segments words and characters in the image.
OCR Tools Comparison
Tesseract:
 Performs well with structured data (e.g., bank data).
 Supports multiple languages.
 Simple to use.
 Disadvantage:
o Struggles with low-quality images.
EasyOCR:
 Excels at detecting handwritten text.
 Performs well with Asian languages.

Image Detection Workflow


1. OCR Detection Using Tesseract:
o Identify text from images.
2. Bounding Box:
o Locate specific regions in the image where text is detected (provides coordinates).
Libraries to Install:
 OpenCV
 Python-Tesseract (Pytesseract)
Postprocessing Steps
1. Text Accuracy Improvement:
o Remove special characters using reject filters.
o Use dictionary-based corrections to fix errors.
2. Regular Expressions:
o Extract specific information as required.
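Both postprocessing steps can be sketched in pure Python; the sample string and the token pattern below are illustrative:

```python
import re

raw = "Total: 1o00 INR on 12/O1/2025 ra1n damage claim"

# Repair letter/digit confusions only inside number-like tokens (digits mixed
# with o/O), so ordinary words such as "ra1n" are left for dictionary rules
def fix_numbers(text: str) -> str:
    def repair(m: re.Match) -> str:
        return m.group(0).replace("o", "0").replace("O", "0")
    return re.sub(r"\b[\dOo]*\d[\dOo]*\b", repair, text)

cleaned = fix_numbers(raw)
# Regular expression to extract specific information (here, a date)
date = re.search(r"\d{2}/\d{2}/\d{4}", cleaned)
print(cleaned)
print(date.group(0))
```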
TASK: Preprocessing Techniques
Denoising(input)

CODE:

Output:
Bounding Box
Output:
Documentation: DAY 16(8/1/2025)

Text Extraction Using Pytesseract and OpenCV

Preprocessing Techniques
Binarization
 Converts an image into a binary format (black and white).
 Useful when there is an uneven distribution of light or shadows in the image.
Thresholding
 Separates pixels into foreground and background based on intensity values.
Greyscaling
 Converts a colored image to grayscale to simplify the data for processing.
Denoising
 Used to handle blurry images by removing noise, enhancing text clarity.

Bounding Box
 Detects specific regions where text is present.
 Encases the text in an invisible box for easier detection and manipulation.
 Pytesseract returns these coordinates, which can be used for resizing, cropping, and
drawing bounding boxes.

Contours
 Available in the OpenCV library.
 Automatically detects boundaries of text in the image, improving accuracy.
 For further precision, Photoshop can be used to manually specify coordinates.

Postprocessing
 Corrects errors in text extraction by refining the output.
Example Corrections
 "1" -> "i" (e.g., "ra1n" -> "rain")
 "o" -> "0" (e.g., "1o00" -> "1000")
Regex and Custom Error Handling
 Regex (Regular Expressions): Predefined patterns that match characters or substrings
likely to be errors in the extracted text; matches are filtered out or repaired during
postprocessing.
o Example: Incorrect substitutions like "ra1n" instead of "rain" or "1o00" instead of
"1000" can be identified and corrected.
 Custom Error Rules: Implement rules to handle domain-specific corrections. For
example:
o Replacing numbers that resemble letters (e.g., "1" with "I" or "0" with "O") based
on the context of the extracted text.
o Defining reject lists or dictionaries to validate extracted words and replace them
with the correct terms.
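The reject-list/dictionary idea can be sketched as follows; the word list and swap table are illustrative, not a real domain dictionary:

```python
# Dictionary-based postprocessing: validate each OCR token against a word
# list and apply digit-to-letter swaps only when the repaired form is a
# known word, so genuine numbers are left alone.
VALID_WORDS = {"rain", "paid", "total"}          # illustrative dictionary
SWAPS = str.maketrans({"1": "i", "0": "o", "5": "s"})

def correct(token: str) -> str:
    if token.isdigit() or token.lower() in VALID_WORDS:
        return token                 # already a number or a known word
    repaired = token.translate(SWAPS)
    return repaired if repaired.lower() in VALID_WORDS else token

tokens = "ra1n t0tal pa1d 1000".split()
print([correct(t) for t in tokens])  # -> ['rain', 'total', 'paid', '1000']
```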

Improving Text Accuracy


OEM (OCR Engine Mode)
 Specifies the type of engine Pytesseract uses to extract text.
o 0: Legacy engine for simple text.
o 1: Neural network engine for complex text and varied fonts.
o 3: Combines legacy and neural network engines, choosing the best output.
PSM (Page Segmentation Mode)
 Helps Pytesseract understand the layout of text in an image.
o 3: Fully automatic page segmentation.
o 6: Assumes one uniform block of text.
o 11: Sparse text detection, useful for small scattered text regions.

Workflow
1. Load Image
o Input the image into the program.
2. Preprocess the Image
o Apply greyscaling, binarization, and thresholding techniques.
3. Text Extraction
o Use Pytesseract to extract text from the processed image.
4. Postprocessing
o Correct errors and refine text using:
 Regex lists.
 Custom rules (e.g., replacing "o" with "0").

TASK: Perform Rejects


CODE:
Output:

Documentation: DAY 17(9/1/2025)

OCR Multi-Language Support Documentation


OCR (Optical Character Recognition) is widely used to extract text from various types of
documents and images, including scanned documents, PDFs with images, handwritten notes, and
screenshots. To support text extraction in multiple languages, external language-specific models
must be installed and integrated into the OCR system.
Use Cases
1. Scanned Documents: Extracting text from scanned images or PDF files.
2. Handwritten Notes: Recognizing and converting handwritten content into digital text.
3. Screenshots: Converting text within image screenshots into editable formats.
4. Multilingual Support: Extracting text in various languages based on installed language
models.

File Formats and Accuracy


File Format Description Quality/Compression

JPG Low quality, lossy compression Low

PNG High quality, lossless compression High

TIFF Clear image format with enhanced quality High

GIF For stickers or short videos; low quality Low

PDF Supports text, images, and scanned docs Low (for scanned documents)

Handling Low-Quality Images


When low-quality images are encountered, preprocessing techniques can be applied to enhance
clarity and improve OCR accuracy. Common preprocessing methods include:
 Noise reduction
 Contrast adjustment
 Binarization (converting to black and white)
 Image sharpening
 Rescaling

Applications in Banking
OCR technology is extensively used in the banking sector to:
1. Digitize Documents: Convert physical documents into digital format.
2. KYC (Know Your Customer): Extract customer details from ID cards and other
documents.
3. Automated Form Filling: Automatically populate forms using extracted text.
Multilingual Document Extraction
Multilingual document extraction refers to the process of identifying and extracting text from
documents written in multiple languages. This is achieved by integrating OCR systems with
models trained for different languages. The system first detects the language(s) present in the
document and then applies the appropriate OCR model to extract the text. This process ensures:
 High accuracy in text extraction across various languages.
 Proper handling of language-specific characters and scripts.
 Seamless integration with multilingual workflows, such as translation and content
management.

TASK: Multiple language text extraction


Input:

CODE:
OUTPUT

TASK: Full code of automation(input)

CODE:
OUTPUT:
TASK: Web scraping
CODE:
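Since the scraped page itself is not shown, here is a self-contained sketch using only the standard library: it parses an illustrative HTML snippet and writes the rows as CSV. In real use the HTML would come from an HTTP client such as requests, and a parser such as BeautifulSoup would replace the hand-rolled one:

```python
import csv
import io
from html.parser import HTMLParser

# Illustrative HTML snippet standing in for a fetched page
html = """<ul>
<li class="item"><span>Pen</span><span>10</span></li>
<li class="item"><span>Book</span><span>120</span></li>
</ul>"""

class ItemParser(HTMLParser):
    """Collects the <span> texts of each <li> as one CSV row."""
    def __init__(self):
        super().__init__()
        self.rows, self._cell, self._in_span = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._in_span = True
    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "li":
            self.rows.append(self._cell)
            self._cell = []
    def handle_data(self, data):
        if self._in_span:
            self._cell.append(data.strip())

p = ItemParser()
p.feed(html)

# Store the scraped rows as CSV (an in-memory buffer here; use open() for a file)
buf = io.StringIO()
csv.writer(buf).writerows([["name", "price"]] + p.rows)
print(buf.getvalue())
```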

Data stored in csv file:

You might also like