Documentation
4. Model Implementation
Tools and Environment:
o Use Visual Studio (VS) and Anaconda Navigator for development and testing.
Training the Model:
o Train machine learning models on categorized transaction data.
o Incorporate features like keywords, merchant codes, and spending patterns.
Analytics and Reporting:
o Analyze spending behavior and generate reports.
o Include metrics like total spending, category-wise breakdown, and transaction
frequency.
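As a rough illustration of the category-wise reporting described above, the following sketch (hypothetical column names and values, not the project's actual code) aggregates a small set of categorized transactions with pandas:
import pandas as pd

# Hypothetical categorized transactions (column names are assumptions)
transactions = pd.DataFrame({
    "merchant": ["Cafe Aroma", "Metro Mart", "City Fuel", "Metro Mart"],
    "category": ["Food", "Groceries", "Transport", "Groceries"],
    "amount":   [250.0, 1200.0, 900.0, 640.0],
})

total_spending = transactions["amount"].sum()                    # total spending
category_breakdown = transactions.groupby("category")["amount"].sum()  # category-wise breakdown
transaction_frequency = transactions["category"].value_counts()  # transaction frequency

print("Total spending:", total_spending)
print(category_breakdown)
print(transaction_frequency)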
Conclusion
This framework ensures efficient transaction analysis by leveraging merchant codes, keyword
matching, and synthetic data. The use of machine learning for analytics and secure handling of
sensitive data fosters both functionality and compliance with privacy standards.
Authentication
Authentication is the process of verifying the identity of a user or entity. It ensures that only
authorized individuals can access the system.
Role of Authentication
Acts as a gateway to confirm the user’s identity.
Allows entry to the system after verification.
Methods of Authentication
1. Tokens
o Tokens are special digital keys that enable users to stay logged in during a
session.
o These temporary credentials eliminate the need to repeatedly enter passwords for
short periods (a minimal sketch follows after this list).
2. Biometric Authentication
o Utilizes unique biological characteristics such as fingerprints, facial recognition,
or retinal scans to authenticate users.
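A minimal sketch of the token idea from point 1 (illustrative only, using Python's standard library; the in-memory dictionary stands in for a real session store):
import secrets

sessions = {}  # token -> username (in-memory store for illustration)

def issue_token(username):
    token = secrets.token_urlsafe(32)   # unguessable temporary credential
    sessions[token] = username
    return token

def verify_token(token):
    return sessions.get(token)          # returns the user while the session is active

token = issue_token("alice")
print(verify_token(token))              # "alice"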
Authorization
Authorization determines the specific resources or actions a verified user is allowed to access. It
focuses on defining permissions for authenticated users.
Role of Authorization
Grants access to specific features or data based on user roles or privileges.
Ensures controlled usage of system resources.
Techniques for Authorization
1. OAuth (Open Authorization)
o A widely-used protocol that enables secure delegated access to resources without
sharing credentials.
o Example: Allowing an app to access your social media account without exposing
your password.
2. Multi-Factor Authentication (MFA)
o Combines two or more authentication methods (e.g., password + OTP or
password + biometric) to enhance security.
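As an illustration of the "password + OTP" combination, the sketch below generates and verifies a time-based one-time password with the pyotp library (assumed to be installed; not part of the original task code):
import pyotp

secret = pyotp.random_base32()      # shared secret, normally stored per user at enrolment
totp = pyotp.TOTP(secret)

code = totp.now()                   # 6-digit code, as shown in an authenticator app
print("Second factor valid:", totp.verify(code))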
HTTP Structure
HTTP communication is divided into two main components:
1. Request - Initiated by the client to request resources from the server.
2. Response - Sent by the server as a reply to the client's request.
HTTP Methods
HTTP methods define the actions to be performed on resources. Common methods include:
1. GET: Retrieves data from the server.
2. POST: Sends data to the server to create a new resource.
3. PUT: Modifies the entire resource.
4. DELETE: Removes the specified resource from the server.
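A minimal sketch of these methods using the requests library (httpbin.org is used here purely as a public test endpoint, not part of the project):
import requests

base = "https://httpbin.org"

requests.get(f"{base}/get")                          # GET: retrieve data
requests.post(f"{base}/post", json={"name": "new"})  # POST: create a new resource
requests.put(f"{base}/put", json={"name": "full"})   # PUT: replace the entire resource
requests.delete(f"{base}/delete")                    # DELETE: remove the resource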
HTTP Response
HTTP responses are categorized into three main components:
1. Status Line:
o Includes the HTTP version and status code.
2. Header:
o Provides metadata about the response.
3. Body:
o Contains the requested resource or error message.
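The three components can be inspected on a response object, for example (again using httpbin.org only as a test endpoint):
import requests

resp = requests.get("https://httpbin.org/get")

print(resp.status_code)               # status line: e.g. 200
print(resp.headers["Content-Type"])   # header: metadata about the response
print(resp.text)                      # body: the requested resource or an error message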
Code:
3. Defining Routes
@app.route("/", methods=["GET"])
In Flask, routes are defined using the @app.route() decorator, which links a URL to a function. When
the URL is accessed, the associated function is executed.
HTTP Methods: Routes can handle methods like GET, POST, PUT, and DELETE.
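A minimal runnable sketch of such a route (illustrative only, not the project's full application):
from flask import Flask

app = Flask(__name__)

@app.route("/", methods=["GET"])
def home():
    # Executed whenever the root URL is accessed with GET
    return "Welcome to the API"

if __name__ == "__main__":
    app.run(debug=False)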
CODE
Secure Key: The application uses a secret API key (API_KEY) stored in the server-side
code.
Request Handling: A decorator (require_api_key) checks the request's x-api-key header to
ensure the correct API key is provided before allowing access to the secure route.
Wrapper Layer: The decorator acts as an extra layer of security, ensuring that all routes
requiring API key validation are protected consistently.
Headers: The client must include the API key in the x-api-key header to successfully access
protected resources.
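A sketch of how such a decorator might look (the key value and route name are placeholders, not the actual project code):
from functools import wraps
from flask import Flask, request, jsonify

app = Flask(__name__)
API_KEY = "my-secret-key"   # placeholder; a real key would not be hard-coded like this

def require_api_key(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Reject the request unless the correct key is sent in the x-api-key header
        if request.headers.get("x-api-key") != API_KEY:
            return jsonify({"error": "Unauthorized"}), 401
        return func(*args, **kwargs)
    return wrapper

@app.route("/secure")
@require_api_key
def secure_route():
    return jsonify({"message": "Access granted"})

if __name__ == "__main__":
    app.run(debug=False)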
TASK: WEATHER DATA EXTRACTION USING API
CODE
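A minimal sketch of how the extraction might be done (it assumes the OpenWeatherMap current-weather endpoint and a placeholder API key, not the original task's exact code):
import json
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
city = input("Enter the city name:")

url = "https://api.openweathermap.org/data/2.5/weather"
resp = requests.get(url, params={"q": city, "appid": API_KEY})

# Pretty-print the JSON payload returned by the API
print(json.dumps(resp.json(), indent=2, sort_keys=True))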
OUTPUT:
Enter the city name:LONDON
{
"base": "stations",
"clouds": {
"all": 40
},
"cod": 200,
"coord": {
"lat": 51.5085,
"lon": -0.1257
},
"dt": 1734605122,
"id": 2643743,
"main": {
"feels_like": 275.25,
"grnd_level": 1004,
"humidity": 78,
"pressure": 1008,
"sea_level": 1008,
"temp": 279.57,
"temp_max": 280.01,
"temp_min": 278.99
},
"name": "London",
"sys": {
"country": "GB",
"id": 2075535,
"sunrise": 1734595375,
"sunset": 1734623556,
"type": 2
},
"timezone": 0,
"visibility": 10000,
"weather": [
{
"description": "scattered clouds",
"icon": "03d",
"id": 802,
"main": "Clouds"
}
],
"wind": {
"deg": 280,
"speed": 7.72
}
}
Rate Limiting: Specifies the number of requests allowed per unit of time (e.g., 5 requests per
minute). If the limit is exceeded, the server will respond with a 429 Too Many Requests status
code.
Throttling: Often used interchangeably with rate limiting. It limits the rate at which a client
can interact with an API over a period of time.
CODE
Error Handling
Error handlers allow you to customize the responses returned for different types of errors (e.g.,
404, 500).
CODE
TASK: RATE LIMITING AND ERROR HANDLING
CODE:RATE LIMITING
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["5 per minute"])

@app.route('/login')
@limiter.limit("2 per minute")   # stricter limit for this route
def login():
    return "Too many requests will block you temporarily"

if __name__ == '__main__':
    app.run(debug=False)
CODE:ERROR HANDLING
from flask import Flask, jsonify

app = Flask(__name__)

# Return JSON error responses instead of Flask's default HTML error pages
@app.errorhandler(404)
def not_found(error):
    return jsonify({'error': 'Resource not found'}), 404

@app.errorhandler(500)
def server_error(error):
    return jsonify({'error': 'Something went wrong'}), 500

@app.errorhandler(400)
def bad_request(error):
    return jsonify({'error': 'Bad Request - Please check your input'}), 400

@app.errorhandler(403)
def forbidden(error):
    return jsonify({'error': 'Forbidden - You do not have permission to access this resource'}), 403

if __name__ == '__main__':
    app.run(debug=False)
CODE
API Versioning
API versioning allows you to evolve an API while maintaining support for existing clients, making
it possible to introduce new features without breaking older integrations.
CODE
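One common approach is to put the version in the URL prefix; a minimal Flask sketch (route names and payloads are illustrative):
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1/users")
def users_v1():
    # Old clients keep working against v1
    return jsonify([{"name": "alice"}])

@app.route("/api/v2/users")
def users_v2():
    # v2 adds a field without breaking v1 consumers
    return jsonify([{"name": "alice", "active": True}])

if __name__ == "__main__":
    app.run(debug=False)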
Data Preprocessing
Data pre-processing is a critical step in preparing a dataset for analysis or machine learning. The
quality of the data directly influences the performance of machine learning models or statistical
analysis. Pre-processing ensures that the data is clean, consistent, and ready for modeling.
1. Handling Missing Data
Missing data can occur for various reasons and can affect the quality of your model. Handling
missing values is a key aspect of data pre-processing.
Steps:
1. Identify Missing Data:
o Check for missing values using descriptive statistics or visualization techniques.
2. Imputation:
o Fill missing values with specific values such as:
Mean: Suitable for continuous data.
Median: Useful when the data has outliers.
Mode: For categorical data.
o Use advanced techniques like interpolation or machine learning models for
imputation.
3. Remove Rows or Columns:
o If the amount of missing data is substantial, consider removing rows or columns
with missing values.
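A short pandas sketch of these three steps (hypothetical data):
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, 30, np.nan, 40],
                   "city": ["Delhi", None, "Chennai", "Mumbai"]})

print(df.isnull().sum())                               # 1. identify missing values per column

df["age"] = df["age"].fillna(df["age"].mean())         # 2. impute numeric column with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])   #    impute categorical column with the mode

df_dropped = df.dropna()                               # 3. or remove rows that still contain missing values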
1. What is Data?
Data refers to raw facts, figures, or information that can be processed, analyzed, and interpreted
to derive meaning, insights, or make decisions. It is a collection of values, observations, or
measurements that can represent various things, such as:
Numbers
Text
Images
Sounds
Sensor readings
2. Types of Data
a. Structured Data
Structured data is highly organized and typically stored in rows and columns, such as in
databases. Examples include:
A table of employee names, IDs, and salaries.
Sales records stored in a relational database.
b. Unstructured Data
Unstructured data lacks a predefined format and is more complex to analyze. Examples include:
Text (e.g., social media posts, customer reviews)
Images
Audio files
Videos
c. Semi-Structured Data
Semi-structured data is not strictly organized into rows and columns but has some structure or
markers for organization. Examples include:
JSON files
XML files
3. What is Data Transformation?
Data transformation refers to the process of converting data from one type or format into another.
This conversion is often essential to make the data usable for specific applications or systems.
Examples include:
Converting semi-structured data into structured data.
Transforming unstructured data into structured data for analysis.
Example: Parsing JSON files (semi-structured) into relational tables (structured) for querying
and analysis.
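For instance, a semi-structured JSON document can be flattened into a structured table with pandas (the records shown are hypothetical):
import pandas as pd

records = [
    {"id": 1, "name": "Alice", "address": {"city": "London", "zip": "E1"}},
    {"id": 2, "name": "Bob",   "address": {"city": "Paris",  "zip": "75"}},
]

# json_normalize expands the nested fields into ordinary columns
table = pd.json_normalize(records)
print(table)   # columns: id, name, address.city, address.zip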
2. Imputation Techniques
Imputation involves filling missing data by estimating plausible values. For numerical data,
common methods include:
Mean Imputation:
o Fill missing values with the average of available data.
o Formula: Mean = (x1 + x2 + ... + xn) / n
Median Imputation:
o Arrange data in sequential order and choose the middle value as the replacement.
Mode Imputation:
o Replace missing values with the most frequently occurring value.
Forward Fill:
o Use the last known value to fill missing data.
Backward Fill:
o Use the next known value to fill missing data.
Note: These methods may not provide highly accurate results but can achieve 80-90% accuracy,
which is often sufficient for many applications.
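A pandas sketch of these techniques on a toy series (values are illustrative):
import pandas as pd
import numpy as np

s = pd.Series([10, np.nan, 30, np.nan, 50])

mean_filled   = s.fillna(s.mean())     # mean imputation
median_filled = s.fillna(s.median())   # median imputation
mode_filled   = s.fillna(s.mode()[0])  # mode imputation
forward_fill  = s.ffill()              # forward fill: carry the last known value
backward_fill = s.bfill()              # backward fill: use the next known value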
3. Prediction Models

Age    Height    Math Score
15     160 cm    85
16     170 cm    Missing
14     165 cm    80

ML models can predict missing data (e.g., the Math Score for age 16) by identifying hidden
patterns and relations between attributes (e.g., Age, Height).
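A sketch of that idea with scikit-learn on the small table above (a linear model is assumed purely for illustration):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [15, 16, 14],
                   "height_cm": [160, 170, 165],
                   "math_score": [85, None, 80]})

known = df.dropna()                                   # rows where the score is present
model = LinearRegression().fit(known[["age", "height_cm"]], known["math_score"])

missing = df[df["math_score"].isna()]
df.loc[missing.index, "math_score"] = model.predict(missing[["age", "height_cm"]])
print(df)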
4. Interpolation
Interpolation involves estimating missing values based on trends observed in the data. For
example:
Day Temperature
Mon 20°C
Tue Missing
Wed 22°C
Using interpolation, the missing value for Tuesday can be estimated as 21°C.
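A pandas sketch of the same idea (the expected estimate is shown as a comment):
import pandas as pd
import numpy as np

temps = pd.Series([20, np.nan, 22], index=["Mon", "Tue", "Wed"])
print(temps.interpolate())   # Tue is estimated as 21.0 (linear interpolation)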
Output:
Imputation
Output:
Prediction Model
Output:
Interpolation
Output:
Existence of OCR
The origins of OCR date back to the early 20th century. The first OCR machines were developed
for the visually impaired, allowing printed materials to be read aloud. The technology evolved
significantly with the advent of computers in the mid-20th century.
Milestones in OCR Development:
1. Early 1920s: Manual activities involved passing light through text to identify characters
visually.
2. 1929: Gustav Tauschek developed the first OCR machine.
3. 1950s: The development of digital computers advanced OCR technology, allowing
machines to read printed text. These systems recognized letters, numbers, and words,
automatically reading and typing them.
4. 1970s: IBM developed OCR systems that compared scanned image pixels to identify
characters, enabling recognition of symbols and fonts. The converted text could be stored
in machine memory or even converted into audio.
5. 1990s: OCR technology expanded into various industries with the rise of digitization.
6. 2000s and beyond: With the advent of machine learning and deep learning, OCR
achieved higher accuracy and became integral to modern applications.
Components of OCR
OCR systems consist of several key components that work together to achieve text recognition:
1. Image Acquisition:
Captures images of the text using scanners, cameras, or other devices.
2. Preprocessing:
Improves the quality of the input image for better recognition. Common preprocessing
steps include:
o Noise reduction
o Binarization (converting to black and white)
o Deskewing (correcting image alignment)
o Normalization
3. Segmentation:
Divides the image into smaller regions, such as characters, words, or lines.
4. Feature Extraction:
Extracts essential features from the segmented characters, such as edges, strokes, and
contours.
5. Recognition:
Applies algorithms, often powered by machine learning, to identify and convert the
characters into digital text.
6. Post-processing:
Refines the recognized text to improve accuracy. This includes:
o Spell checking
o Contextual analysis
Conclusion
OCR is a transformative technology that bridges the gap between analog and digital data. With
its ability to process large volumes of text efficiently, OCR has become indispensable in many
industries, driving automation, accessibility, and innovation.
5. Thresholding:
o Enhances crucial parts of the image, such as text, by removing noise and
emphasizing relevant data. This differentiation allows OCR to isolate text more
effectively.
Applications in Banking:
Extracting data from scanned financial statements, invoices, and handwritten forms.
Converting large volumes of legacy documents into digital formats.
Improving document management systems by enabling quick search and retrieval.
Enhancing customer service by automating data extraction processes.
Versions of Tesseract
Tesseract 1.x (1980s)
Initial development by HP
Only capable of extracting one column of text
Limited capabilities and basic OCR functionality
Released as open-source in 2005
Tesseract 2.x (2005)
Transitioned to open-source maintenance by HP
Improved to extract multi-column text
Enhanced layout analysis and processing pipelines
Basic OCR functionality, primarily for English
TASK:
To extract text from an image using Tesseract
INPUT
CODE
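A minimal sketch of the task (the image path is a placeholder; the pytesseract and Pillow packages and the Tesseract engine are assumed to be installed):
from PIL import Image
import pytesseract

img = Image.open("sample.png")            # placeholder input image
text = pytesseract.image_to_string(img)   # run Tesseract OCR on the image
print(text)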
OUTPUT:
Documentation: DAY 14(6/1/2025)
1. Grayscale Conversion
Converts the RGB image into a grayscale image by reducing the three color channels
(Red, Green, and Blue) into a single intensity channel.
This simplifies the image and reduces computational complexity.
Why:
OCR algorithms focus on text structure and patterns, not colors.
Grayscale conversion reduces image intensity to a single channel, which improves clarity
for OCR processing.
Process:
Break the RGB image into pixels.
Calculate the grayscale equivalent using a weighted sum of the R, G, and B values (e.g.,
0.2989*R + 0.5870*G + 0.1140*B).
Convert the RGB pixels to grayscale.
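An OpenCV sketch of this step (the file name is a placeholder):
import cv2

img = cv2.imread("sample.png")                 # loaded as a BGR color image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # applies the weighted sum internally
cv2.imwrite("gray.png", gray)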
2. Thresholding
Description:
Thresholding converts the grayscale image into a binary image (black and white). This step
focuses on distinguishing the text from the background for improved OCR detection.
Why:
Binary images are simpler and more effective for OCR engines.
Eliminates noise and non-essential details, focusing only on text regions.
Subtypes:
1. Global Thresholding:
o Applies a single threshold value across the entire image.
o Pixels with intensity above a chosen value (e.g., 127) are converted to white, and
those below it are converted to black.
Example: If a pixel intensity is greater than 127, it becomes white (1); otherwise, it becomes
black (0).
2. Adaptive Thresholding:
o Divides the image into smaller blocks and applies thresholding locally within
each block.
o This method handles varying lighting conditions and intensity levels better than
global thresholding.
Advantages:
o Reduces noise specific to regions.
o Enhances text detection in images with uneven illumination.
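A sketch of both variants with OpenCV (threshold and block-size values are illustrative):
import cv2

gray = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

# Global thresholding: a single cut-off (127) for the whole image
_, global_bw = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Adaptive thresholding: the cut-off is computed per local 11x11 block
adaptive_bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 2)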
3. Otsu's Binarization
Description:
Otsu's Binarization is an advanced thresholding technique that automatically determines the
optimal threshold value from the image’s intensity histogram. It converts the image into a binary
format by choosing the threshold that minimizes the intra-class intensity variance.
Why:
Eliminates the need for manually setting a threshold value.
Ideal for images with varying intensity distributions.
Process:
Calculate the histogram of grayscale intensities.
Use the algorithm to determine a threshold that separates the foreground (text) from the
background.
Apply the threshold to convert the image into binary format.
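With OpenCV the steps above reduce to a single call (sketch; the file name is a placeholder):
import cv2

gray = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

# The threshold argument 0 is ignored: Otsu's method picks the optimal value from the histogram
otsu_value, otsu_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Chosen threshold:", otsu_value)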
TASK:
Preprocess the image and extract the text
INPUT:
CODE:(GreyScale Conversion)
OUTPUT:
CODE:(Global Thresholding)
OUTPUT:
CODE:( Otsu's Binarization)
Output:
Documentation: DAY 15(7/1/2025)
Lifecycle of OCR
1. Input Preparation:
o Take an image directly or convert a PDF into images using libraries like
pdf2image (a minimal sketch of this step follows after this list).
2. Preprocessing (Enhance Image Quality):
o Improve accuracy by enhancing image quality.
o Steps include:
Noise Removal: Remove unwanted noise from the image.
Blurry Spot Removal: Enhance blurred regions using denoising
techniques.
3. Boundary Detection:
o Detect boundaries and remove unwanted areas of the image.
o Image Resizing: Crop or resize the image if it is too large or too small.
4. Segmentation:
o OCR automatically detects and segments words and characters in the image.
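A sketch of the first steps of this lifecycle (pdf2image requires the Poppler utilities to be installed; the PDF path is a placeholder):
from pdf2image import convert_from_path
import cv2
import numpy as np
import pytesseract

pages = convert_from_path("document.pdf")                    # 1. input preparation: PDF -> images

for page in pages:
    img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)   # 2. preprocessing: grayscale
    img = cv2.fastNlMeansDenoising(img)                      #    noise removal
    print(pytesseract.image_to_string(img))                  # 4. segmentation and recognition by Tesseract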
OCR Tools Comparison
Tesseract:
Performs well with structured data (e.g., bank data).
Supports multiple languages.
Simple to use.
Disadvantage:
o Struggles with low-quality images.
EasyOCR:
Excels at detecting handwritten text.
Performs well with Asian languages.
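A minimal EasyOCR sketch for comparison (the library downloads its model files on first use; the image path is a placeholder):
import easyocr

reader = easyocr.Reader(["en"])            # load the English detection/recognition models
results = reader.readtext("sample.png")    # returns (bounding box, text, confidence) tuples

for box, text, confidence in results:
    print(text, confidence)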
CODE:
Output:
Bounding Box
Output:
Documentation: DAY 16(8/1/2025)
Preprocessing Techniques
Binarization
Converts an image into a binary format (black and white).
Useful when there is an uneven distribution of light or shadows in the image.
Thresholding
Separates pixels into foreground and background based on intensity values.
Greyscaling
Converts a colored image to grayscale to simplify the data for processing.
Denoising
Used to handle blurry images by removing noise, enhancing text clarity.
Bounding Box
Detects specific regions where text is present.
Encases the text in an invisible box for easier detection and manipulation.
Pytesseract automatically handles coordinates for resizing, cropping, and bounding
boxes.
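A sketch of drawing such boxes with pytesseract's word-level data (the image path is a placeholder):
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("sample.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

for i, word in enumerate(data["text"]):
    if word.strip():                                   # skip empty detections
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("boxed.png", img)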
Contours
Available in the OpenCV library.
Automatically detects boundaries of text in the image, improving accuracy.
For further precision, Photoshop can be used to manually specify coordinates.
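A contour-based sketch with OpenCV (threshold choices are illustrative):
import cv2

gray = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find the outer boundaries of dark regions (text) and box each one
contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]
print(len(boxes), "text regions detected")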
Postprocessing
Corrects errors in text extraction by refining the output.
Example Corrections
"1" -> "i" (e.g., "ra1n" -> "rain")
"o" -> "0" (e.g., "1o00" -> "1000")
Regex and Custom Error Handling
Regex: regular expressions describe character patterns that match likely errors in the extracted
text; the matches are corrected or filtered out during postprocessing.
o Example: Incorrect substitutions like "ra1n" instead of "rain" or "1o00" instead of
"1000" can be identified and corrected.
Custom Error Rules: Implement rules to handle domain-specific corrections. For
example:
o Replacing numbers that resemble letters (e.g., "1" with "I" or "0" with "O") based
on the context of the extracted text.
o Defining reject lists or dictionaries to validate extracted words and replace them
with the correct terms.
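A small sketch of such corrections with Python's re module (the rules are illustrative, not an exhaustive set):
import re

def postprocess(text):
    # a digit "1" between letters is usually a misread "i"  (ra1n -> rain)
    text = re.sub(r"(?<=[a-zA-Z])1(?=[a-zA-Z])", "i", text)
    # a letter "o" between digits is usually a misread "0"  (1o00 -> 1000)
    text = re.sub(r"(?<=\d)[oO](?=\d)", "0", text)
    return text

print(postprocess("ra1n in 1o00 ml"))   # -> "rain in 1000 ml"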
Workflow
1. Load Image
o Input the image into the program.
2. Preprocess the Image
o Apply greyscaling, binarization, and thresholding techniques.
3. Text Extraction
o Use Pytesseract to extract text from the processed image.
4. Postprocessing
o Correct errors and refine text using:
Regex lists.
Custom rules (e.g., replacing "o" with "0").
PDF: supports text, images, and scanned documents; direct text extraction accuracy is low for scanned documents.
Applications in Banking
OCR technology is extensively used in the banking sector to:
1. Digitize Documents: Convert physical documents into digital format.
2. KYC (Know Your Customer): Extract customer details from ID cards and other
documents.
3. Automated Form Filling: Automatically populate forms using extracted text.
Multilingual Document Extraction
Multilingual document extraction refers to the process of identifying and extracting text from
documents written in multiple languages. This is achieved by integrating OCR systems with
models trained for different languages. The system first detects the language(s) present in the
document and then applies the appropriate OCR model to extract the text. This process ensures:
High accuracy in text extraction across various languages.
Proper handling of language-specific characters and scripts.
Seamless integration with multilingual workflows, such as translation and content
management.
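A sketch of multilingual extraction with pytesseract (it assumes the corresponding Tesseract language packs, e.g. eng and tam, are installed; the image path is a placeholder):
from PIL import Image
import pytesseract

img = Image.open("multilingual.png")

# "+"-separated language codes tell Tesseract which trained models to apply
text = pytesseract.image_to_string(img, lang="eng+tam")
print(text)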
CODE:
OUTPUT
CODE:
OUTPUT:
TASK: Web Scraping
CODE:
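A minimal scraping sketch with requests and BeautifulSoup (example.com is used only as a stand-in URL, not the task's actual target site):
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")
soup = BeautifulSoup(resp.text, "html.parser")

print(soup.title.text)          # page title
for link in soup.find_all("a"):
    print(link.get("href"))     # every hyperlink on the page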