
A Seminar Report on

“Introduction to Computer Vision”


A seminar report submitted in partial fulfillment of the requirements for the
award of the degree of

BACHELOR OF TECHNOLOGY
IN
Computer Science and Engineering

Submitted By

M. SAI LOHITH 221710304034

Under the Esteemed guidance of


Mr. M. Raghavendra
(Asst. Professor, Dept. of CSE)
Mr. Y. Srinivasulu
(Asst. Professor, Dept. of CSE)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

GITAM (Deemed to be University)
Rudraram Mandal, Sangareddy district, Patancheru, Hyderabad,
Telangana 502329
Contents
Abstract………………………………………………………………………...1

1. Introduction 2
1.1 What is computer vision?.........................................................................2
1.2 A brief history…………………………………………………………...2
1.3 Algorithm……………………………………………………………….4

2. Image formation 5
2.1 Photometric image formation………………………………………….5
2.1.1 Lighting…………………………………………………………..6
2.1.2 Reflectance and shading………………………………………….7
2.1.3 Optics……………………………………………………………..7
2.3 The Digital Camera……………………………………………………..8
2.3.1 Color cameras…………………………………………………….9
2.3.2 Color filter arrays……………………………………………….10
2.3.3 Compression…………………………………………………….10

3 Image processing 12
3.1 Point operators……………………………………………………….12
3.1.1 Color Transforms……………………………………………….14
3.2 Linear Filtering……………………………………………………...15
3.3 Band pass and steerable filters……………………………………….15

4 Feature detection and matching 17


4.1 Points and patches……………………………………………………17
4.2 Feature Matching……………………………………………………..19
4.3 Edges…………………………………………………………………20
4.4 Edge Linking…………………………………………………………21
4.5 Lines………………………………………………………………….22
4.6 Layered Motion………………………………………………………23


5 Image Stitching 24
5.1 Motion model…………………………………………………………...24
5.2 Gap Closing……………………………………………………………..24
5.3 Video summarization and compression…………………………………26
5.4 Recognizing panoramas…………………………………………………26
5.5 Heads and faces…………………………………………………………28

6 Recognition 29
6.1 Face Detection………………………………………………………….29
6.2 Neural network in detection…………………………………………..29
6.3 Location recognition……………………………………………………30

7 Implementation 32
7.1 Object Detection……………………………………………………….32
7.2 Implementation of object detection code……………………………...33

8 Applications and purpose 36


8.1 Application and Usage……………………………………………….36
8.1.1 Optical character recognition (OCR)……………………………...36
8.2 Purpose……………………………………………………………….38
8.2.2 Disadvantages…………………………………………………..38
8.3 Components in computer vision……………………………………….38
8.4 Context and scene understanding……………………………………..39

9 Conclusion and Future Scope 41


9.1 Conclusion……………………………………………………………41
9.2 Future Scope………………………………………………………….41

10 Bibliography 42


Abstract:

Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images in order to produce numerical or symbolic information. A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image. Subdomains of computer vision include object detection, video tracking, object pose estimation, motion estimation, and image restoration.

Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and for extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions. Understanding in this context means the transformation of visual images (the input of the retina) into descriptions of the world that can interface with other thought processes and elicit appropriate action.

Areas of artificial intelligence deal with autonomous planning or deliberation for robotic systems to navigate through an environment. Information about the environment could be provided by a computer vision system, acting as a vision sensor and providing high-level information about the environment and the robot.

As a result, the field continues the attempt to reproduce the capabilities of human vision.

Keywords: Artificial Intelligence, Object Detection Algorithm.

M. Sailohith

221710304034


1. INTRODUCTION
1.1 What is computer vision?
Computer vision is an interdisciplinary scientific field that deals with how computers can
gain high-level understanding from digital images or videos. From the perspective
of engineering, it seeks to understand and automate tasks that the human visual system can do.
Computer vision tasks include methods for acquiring, processing, analyzing and
understanding digital images, and extraction of high-dimensional data from the real world in
order to produce numerical or symbolic information, e.g. in the forms of
decisions. Understanding in this context means the transformation of visual images (the input of
the retina) into descriptions of the world that make sense to thought processes and can elicit
appropriate action. This image understanding can be seen as the disentangling of symbolic
information from image data using models constructed with the aid of geometry, physics,
statistics, and learning theory.

1.2 A brief history

When computer vision first started out in the early 1970s, it was viewed as the visual
perception component of an ambitious agenda to mimic human intelligence and to endow robots
with intelligent behavior. At the time, it was believed by some of the early pioneers of artificial
intelligence and robotics (at places such as MIT, Stanford, and CMU) that solving the “visual
input” problem would be an easy step along the path to solving more difficult problems such as
higher-level reasoning and planning. According to one well-known story, in 1966, Marvin
Minsky at MIT asked his undergraduate student Gerald Jay Sussman to “spend the summer
linking a camera to a computer and getting the computer to describe what it saw”. We now know that the problem is slightly more difficult than that. What distinguished computer vision
from the already existing field of digital image processing (Rosenfeld and Pfaltz 1966; Rosenfeld
and Kak 1976) was a desire to recover the three-dimensional structure of the world from images
and to use this as a stepping stone towards full scene understanding. Winston (1975) and Hanson
and Riseman (1978) provide two nice collections of classic papers from this early period. Early
attempts at scene understanding involved extracting edges and then inferring the 3D structure of
an object or a “blocks world” from the topological structure of the 2D lines (Roberts 1965), and several line labeling algorithms were developed during this period.

This past decade has continued to see a deepening interplay between the vision and graphics
fields. In particular, many of the topics introduced under the rubric of image-based rendering, such as image stitching, light-field capture and rendering, and high dynamic range (HDR) image capture through exposure bracketing (Mann and Picard 1995; Debevec and Malik 1997), were re-christened as computational photography to acknowledge the increased use of such techniques in everyday digital photography. For example,
the rapid adoption of exposure bracketing to create high dynamic range images necessitated the
development of tone mapping algorithms to convert such images back to displayable results
(Fattal, Lischinski, and Werman 2002; Durand and Dorsey 2002; Reinhard, Stark, Shirley et al.
2002; Lischinski, Farbman, Uyttendaele et al. 2006a). In addition to merging multiple exposures,
techniques were developed to merge flash images with non-flash counterparts (Eisemann and
Durand 2004; Petschnigg, Agrawala, Hoppe et al. 2004) and to interactively or automatically
select different regions from overlapping images.

1.3 Algorithm


2. Image formation
2.1 Photometric image formation
In modeling the image formation process, we have described how 3D geometric features in
the world are projected into 2D features in an image. However, images are not composed of 2D
features. Instead, they are made up of discrete color or intensity values. Where do these values
come from? How do they relate to the lighting in the environment, surface properties and
geometry, camera optics, and sensor properties? In this section, we develop a set of models to
describe these interactions and formulate a generative process of image formation. A more
detailed treatment of these topics can be found in other textbooks on computer graphics and
image synthesis.
2.1.1 Lighting
Images cannot exist without light. To produce an image, the scene must be illuminated
with one or more light sources. (Certain modalities such as fluorescent microscopy and X-ray
tomography do not fit this model, but we do not deal with them in this book.) Light sources can
generally be divided into point and area light sources. A point light source originates at a single
location in space (e.g., a small light bulb), potentially at infinity (e.g., the sun). (Note that for
some applications such as modeling soft shadows (penumbras), the sun may have to be treated as
an area light source.) In addition to its location, a point light source has an intensity and a color
spectrum, i.e., a distribution over wavelengths L(λ).

Fig 2.2.1 A simplified model of photometric image formation. Light is emitted by one or more light sources and is then reflected from an object’s surface. A portion of this light is directed towards the camera. This simplified model ignores multiple reflections, which often occur in real-world scenes.

The intensity of a light source falls off with the square of the distance
between the source and the object being lit, because the same light is being spread over a larger
(spherical) area. A light source may also have a directional falloff (dependence), but we ignore
this in our simplified model. Area light sources are more complicated. A simple area light source
such as a fluorescent ceiling light fixture with a diffuser can be modeled as a finite rectangular
area emitting light equally in all directions (Cohen and Wallace 1993; Sillion and Puech 1994;
Glassner 1995). When the distribution is strongly directional, a four-dimensional lightfield can
be used instead (Ashdown 1993). A more complex light distribution that approximates, say, the
incident illumination on an object sitting in an outdoor courtyard, can often be represented using
an environment map (Greene 1986) (originally called a reflection map (Blinn and Newell 1976)).
This representation maps incident light directions v̂ to color values (or wavelengths, λ), L(v̂; λ).

2.1.2 Reflectance and shading


When light hits an object’s surface, it is scattered and reflected. Many different models have been developed to describe this interaction. In this section, we first describe the most general form, the bidirectional reflectance distribution function, and then look at some more specialized models, including the diffuse, specular, and Phong shading models. We also discuss how these models can be used to compute the global illumination corresponding to a scene. The most general model of light scattering is the bidirectional reflectance distribution function (BRDF). Relative to some local coordinate frame on the surface, the BRDF is a four-dimensional function that describes how much of each wavelength arriving at an incident direction v̂_i is emitted in a reflected direction v̂_r. The function can be written in terms of the angles of the incident and reflected directions relative to the surface frame as f_r(θ_i, φ_i, θ_r, φ_r; λ). The BRDF is reciprocal, i.e., because of the physics of light transport, you can interchange the roles of v̂_i and v̂_r and still get the same answer (this is sometimes called Helmholtz reciprocity).
2.1.3 Optics
Once the light from a scene reaches the camera, it must still pass through the lens before reaching the sensor (analog film or digital silicon). For many applications, it suffices to treat the lens as an ideal pinhole that simply projects all rays through a common center of projection. However, if we want to deal with issues such as focus, exposure, vignetting, and aberration, we need to develop a more sophisticated model, which is where the study of optics comes in (Möller 1988; Hecht 2001; Ray 2002). Fig 2.2.3 shows a diagram of the most basic lens model, i.e., the thin lens composed of a single piece of glass with very low, equal curvature on both sides. According to the lens law (which can be derived using simple geometric arguments on light ray refraction), the relationship between the distance to an object z_o and the distance behind the lens at which a focused image is formed z_i can be expressed as 1/z_o + 1/z_i = 1/f.
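As a quick illustration (not part of the original report), the lens law above can be rearranged to give the focusing distance z_i behind the lens. A minimal Python sketch:

def focus_distance(z_o, f):
    """Distance z_i behind a thin lens of focal length f at which an object
    at distance z_o is brought into focus, from 1/z_o + 1/z_i = 1/f."""
    return 1.0 / (1.0 / f - 1.0 / z_o)

# Example: a 50 mm lens focused on an object 2 m (2000 mm) away.
print(focus_distance(2000.0, 50.0))  # roughly 51.3 mm behind the lens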


Fig 2.2.3 A thin lens of focal length f focuses the light from a plane a distance z_o in front of the lens at a distance z_i behind it.

2.3 The digital camera


After starting from one or more light sources, reflecting off one or more surfaces in the world, and passing through the camera’s optics (lenses), light finally reaches the imaging sensor. How are the photons arriving at this sensor converted into the digital (R, G, B) values that we observe when we look at a digital image? In this section, we develop a simple model that accounts for the most important effects such as exposure (gain and shutter speed), nonlinear mappings, sampling and aliasing, and noise. Fig 2.3, which is based on camera models developed by Healey and Kondepudy (1994), Tsin, Ramesh, and Kanade (2001), and Liu, Szeliski, Kang et al. (2008), shows a simple version of the processing stages that occur in modern digital cameras. Chakrabarti, Scharstein, and Zickler (2009) developed a sophisticated 24-parameter model that is an even better match to the processing performed in today’s cameras.


Fig 2.3 Image sensing pipeline, showing the various sources of noise as well as typical digital post-
processing steps

In a CCD, photons are accumulated in each active well during the exposure time. Then, in a
transfer phase, the charges are transferred from well to well in a kind of “bucket brigade” until
they are deposited at the sense amplifiers, which amplify the signal and pass it to an analog-to-
digital converter (ADC). Older CCD sensors were prone to blooming, when charges from one
over-exposed pixel spilled into adjacent ones, but most newer CCDs have anti-blooming
technology (“troughs” into which the excess charge can spill). In CMOS, the photons hitting the
sensor directly affect the conductivity (or gain) of a photodetector, which can be selectively
gated to control exposure duration, and locally amplified before being read out using a
multiplexing scheme. Traditionally, CCD sensors outperformed CMOS in quality sensitive
applications, such as digital SLRs, while CMOS was better for low-power applications, but today
CMOS is used in most digital cameras. The main factors affecting the performance of a digital image sensor are the shutter speed, sampling pitch, fill factor, chip size, analog gain, sensor noise, and the resolution (and quality) of the analog-to-digital converter.

2.3.1 Color cameras


In fact, the design of RGB video cameras has historically been based around the
availability of colored phosphors that go into television sets. When standard-definition color
television was invented (NTSC), a mapping was defined between the RGB values that would
drive the three color guns in the cathode ray tube (CRT) and the XYZ values that unambiguously
define perceived color (this standard was called ITU-R BT.601). With the advent of HDTV and
newer monitors, a new standard called ITU-R BT.709 was created, which specifies the XYZ values of each of the color primaries.
Unless the manufacturer provides us with this data, or we measure the response of the camera to a whole spectrum of monochromatic lights, these sensitivities are not specified by a standard such as BT.709. Instead, all that matters is that the tri-stimulus values for a given color produce the specified RGB values. The manufacturer is free to use sensors with sensitivities that do not match the standard XYZ definitions, so long as they can later be converted (through a linear transform) to the standard colors. Similarly, while TV and computer monitors are supposed to produce RGB values as specified by the standard, there is no reason that they cannot use
digital logic to transform the incoming RGB values into different signals to drive each of the
color channels. Properly calibrated monitors make this information available to software
applications that perform color management, so that colors in real life, on the screen, and on the
printer all match as closely as possible.

2.3.2 Color filter arrays

Fig 2.3.2 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values, with unknown
(guessed) values shown as lower case.
The most commonly used pattern in color cameras today is the Bayer pattern (Bayer
1976), which places green filters over half of the sensors (in a checkerboard pattern), and red and
blue filters over the remaining ones (Figure 2.3.2). The reason that there are twice as many green
filters as red and blue is because the luminance signal is mostly determined by green values and
the visual system is much more sensitive to high frequency detail in luminance than in
chrominance (a fact that is exploited in color image compression—see Section 2.3.3). The
process of interpolating the missing color values so that we have valid RGB values for all the
pixels is known as demosaicing and is covered in detail. Similarly, color LCD monitors typically
use alternating stripes of red, green, and blue filters placed in front of each liquid crystal active
area to simulate the experience of a full color display. As before, because the visual system has
higher resolution (acuity) in luminance than chrominance, it is possible to digitally pre-filter
RGB (and monochrome) images to enhance the perception of crispness.
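To make the idea of demosaicing concrete, here is a minimal bilinear interpolation sketch in Python (not from the report). It assumes an RGGB Bayer layout; real camera pipelines use much more sophisticated, edge-aware algorithms.

import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw):
    """Bilinear demosaicing of a single-channel Bayer image (RGGB layout assumed)."""
    H, W = raw.shape
    r_mask = np.zeros((H, W)); r_mask[0::2, 0::2] = 1
    b_mask = np.zeros((H, W)); b_mask[1::2, 1::2] = 1
    g_mask = 1 - r_mask - b_mask
    # Averaging kernels: green has four direct neighbors, red/blue four diagonal ones.
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0
    rgb = np.zeros((H, W, 3))
    rgb[:, :, 0] = convolve(raw * r_mask, k_rb)   # interpolate missing red values
    rgb[:, :, 1] = convolve(raw * g_mask, k_g)    # interpolate missing green values
    rgb[:, :, 2] = convolve(raw * b_mask, k_rb)   # interpolate missing blue values
    return rgb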

2.3.3 Compression
The last stage in a camera’s processing pipeline is usually some form of image
compression (unless you are using a lossless compression scheme such as camera RAW or
PNG). All color video and image compression algorithms start by converting the signal into
YCbCr (or some closely related variant), so that they can compress the luminance signal with
higher fidelity than the chrominance signal.
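As a rough sketch of this color conversion (not from the report), the following applies the widely used BT.601 full-range RGB to YCbCr transform; the exact coefficients vary between standards.

import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (H, W, 3) RGB image with values in [0, 255] to YCbCr (BT.601, full range)."""
    # Luma Y is a weighted sum of R, G, B; Cb and Cr are scaled color differences.
    M = np.array([[ 0.299,     0.587,     0.114],
                  [-0.168736, -0.331264,  0.5],
                  [ 0.5,      -0.418688, -0.081312]])
    ycbcr = rgb @ M.T
    ycbcr[..., 1:] += 128.0  # center the chroma channels
    return ycbcr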

Fig 2.3.3 Color space transformations: (a–d) RGB; (e–h) rgb. (i–k) L*a*b*; (l–n) HSV. Note that the rgb,
L*a*b*, and HSV values are all re-scaled to fit the dynamic range of the printed page.


3. Image processing

3.1 Point operators

The simplest kinds of image processing transforms are point operators, where each output
pixel’s value depends on only the corresponding input pixel value (plus, potentially, some
globally collected information or parameters). Examples of such operators include brightness and
contrast adjustments (Figure 3.1) as well as color correction and transformations. In the image
processing literature, such operations are also known as point processes.
Now that we have seen how images are formed through the interaction of 3D scene
elements, lighting, and camera optics and sensors, let us look at the first stage in most computer
vision applications, namely the use of image processing to preprocess the image and convert it
into a form suitable for further analysis. Examples of such operations include exposure
correction and color balancing, the reduction of image noise, increasing sharpness, or straightening the image by rotating it. While some may consider image processing to be outside
the purview of computer vision, most computer vision applications, such as computational
photography and even recognition, require care in designing the image processing stages in order
to achieve acceptable results.
Another important class of global operators is geometric transformations, such as rotations, shears, and perspective deformations. Finally, we introduce global optimization approaches to image processing, which involve the minimization of an energy functional or, equivalently, optimal estimation using Bayesian Markov random field models.


Fig 3.1 Some local image processing operations: (a) original image along with its three color (per-
channel) histograms; (b) brightness increased (additive offset, b = 16); (c) contrast increased
(multiplicative gain, a = 1.1); (d) gamma (partially) linearized (γ = 1.2); (e) full histogram equalization; (f)
partial histogram equalization.
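As a small illustration (not part of the original report), the brightness and contrast adjustments shown in Fig 3.1 are point operators of the form g(x) = a·f(x) + b, where the gain a controls contrast and the offset b controls brightness. A minimal NumPy sketch:

import numpy as np

def adjust_brightness_contrast(image, gain=1.1, offset=16):
    """Apply the point operator g(x) = gain * f(x) + offset to every pixel."""
    out = gain * image.astype(np.float32) + offset
    return np.clip(out, 0, 255).astype(np.uint8)  # keep values in the valid 8-bit range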


Fig 3.1.1 Visualizing image data: (a) original image; (b) cropped portion and scanline plot using an
image inspection tool; (c) grid of numbers; (d) surface plot. For figures (c)–(d), the image was first
converted to grayscale.

3.1.1 Color transforms

While color images can be treated as arbitrary vector-valued functions or collections of independent bands, it usually makes sense to think about them as highly correlated signals with strong connections to the image formation process, sensor design, and human perception. Consider, for example, brightening a picture by adding a constant value to all three channels.
Similarly, color balancing (e.g., to compensate for incandescent lighting) can be
performed either by multiplying each channel with a different scale factor or by the more
complex process of mapping to XYZ color space, changing the nominal white point, and
mapping back to RGB, which can be written down using a linear 3 × 3 color twist transform matrix. Another fun project, best attempted after you have mastered the rest of the material, is to take a picture with a rainbow in it and enhance the strength of the rainbow.
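A hedged sketch (not from the report) of the simpler of the two color balancing options mentioned above, multiplying each channel by its own scale factor; the full 3 × 3 color twist through XYZ is not shown. The gray-world heuristic below is just one assumed way to pick the gains.

import numpy as np

def white_balance(image, gains=(1.0, 1.0, 1.0)):
    """Scale the R, G, B channels of an 8-bit image by per-channel gains."""
    out = image.astype(np.float32) * np.asarray(gains, dtype=np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)

def gray_world_gains(image):
    """Choose gains so that all three channel means become equal."""
    means = image.reshape(-1, 3).mean(axis=0)
    return means.mean() / means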

Fig 3.1.1.1 (a) source image; (b) extracted foreground object F; (c) alpha matte α shown in grayscale;
(d) new composite C.


3.2 Linear filtering

Locally adaptive histogram equalization is an example of a neighborhood operator or local operator, which uses a collection of pixel values in the vicinity of a given pixel to determine its final output value. In addition to performing local tone adjustment, neighborhood operators can be used to filter images in order to add soft blur, sharpen details, accentuate edges, or remove noise. Non-linear neighborhood operators, such as morphological filters and distance transforms, also exist, but in this section we look at linear filtering operators, which involve weighted combinations of pixels in small neighborhoods. The most commonly used type of neighborhood operator is a linear filter, in which an output pixel’s value is determined as a weighted sum of input pixel values,

g(i, j) = Σ_{k,l} f(i + k, j + l) h(k, l).

The entries in the weight kernel or mask h(k, l) are often called the filter coefficients. The above correlation operator can be more compactly notated as g = f ⊗ h. A common variant on this formula is the convolution

g(i, j) = Σ_{k,l} f(i − k, j − l) h(k, l) = Σ_{k,l} f(k, l) h(i − k, j − l).

Fig 3.2 Neighborhood filtering (convolution): The image on the left is convolved with
the filter in the middle to yield the image on the right. The light blue pixels indicate the source
neighborhood for the light green destination pixel
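The correlation sum above can be written directly in a few lines of NumPy. This is an illustrative sketch (not from the report); practical code would call an optimized routine such as scipy.ndimage.correlate or cv2.filter2D.

import numpy as np

def correlate2d(f, h):
    """Compute g(i, j) = sum_{k,l} f(i + k, j + l) h(k, l) for a small kernel h."""
    kh, kw = h.shape
    H, W = f.shape
    # Pad on the far side so the output keeps the input size.
    fp = np.pad(f, ((0, kh - 1), (0, kw - 1)), mode='edge').astype(np.float64)
    g = np.zeros((H, W))
    for k in range(kh):
        for l in range(kw):
            g += fp[k:k + H, l:l + W] * h[k, l]
    return g

# Example: a 3x3 box blur kernel.
box = np.ones((3, 3)) / 9.0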

3.3 Band-pass and steerable filters

The Sobel and corner operators are simple examples of band-pass and oriented filters. More sophisticated kernels can be created by first smoothing the image with a (unit area) Gaussian filter,

G(x, y; σ) = 1/(2πσ²) exp(−(x² + y²)/(2σ²)),

and then taking the first or second derivatives. Such filters are known collectively as band-pass filters, since they filter out both low and high frequencies. The (undirected) second derivative of a two-dimensional image,

∇²f = ∂²f/∂x² + ∂²f/∂y²,

is known as the Laplacian operator. Blurring an image with a Gaussian and then taking its Laplacian is equivalent to convolving directly with the Laplacian of Gaussian (LoG) filter,

∇²G(x, y; σ) = ((x² + y²)/σ⁴ − 2/σ²) G(x, y; σ).
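As a quick, hedged illustration (not from the report), the LoG response can be obtained either by sampling the kernel from the formula above or by calling an existing routine such as scipy.ndimage.gaussian_laplace.

import numpy as np
from scipy.ndimage import gaussian_laplace

def log_kernel(sigma, radius=None):
    """Sample the Laplacian of Gaussian filter on a square grid."""
    if radius is None:
        radius = int(3 * sigma)
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(np.float64)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return ((x**2 + y**2) / sigma**4 - 2 / sigma**2) * g

# Equivalent (up to discretization) library call on an image array img:
# response = gaussian_laplace(img, sigma=2.0)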

Fig 3.3 (a) original image of Einstein; (b) orientation map computed from the second-order oriented
energy; (c) original image with oriented structures enhanced.


4. Feature detection and matching

4.1 Points and patches


Point features can be used to find a sparse set of corresponding locations in different images, often as a precursor to computing camera pose, which is a prerequisite for computing a denser set of correspondences using stereo matching. Such correspondences can also be used to align different images, e.g., when stitching image mosaics or performing video stabilization. They are also used extensively to perform object instance and category recognition. A key advantage of keypoints is that they permit matching even in the presence of clutter (occlusion) and large scale and orientation changes.

Fig 4.1 Image pairs with extracted patches below. Notice how some patches can be localized
or matched with higher accuracy than others.
Adaptive non-maximal suppression (ANMS). While most feature detectors simply look
for local maxima in the interest function, this can lead to an uneven distribution of feature points
across the image, e.g., points will be denser in regions of higher contrast. To mitigate this
problem, Brown, Szeliski, and Winder (2005) only detect features that are both local maxima
and whose response value is significantly (10%) greater than that of all of its neighbors within a radius r. They devise an efficient way to associate suppression radii with all local maxima by first sorting them by their response strength and then creating a second list sorted by decreasing suppression radius. The figure below shows a qualitative comparison of selecting the top n features and using ANMS.
Measuring repeatability. Given the large number of feature detectors that have been
developed in computer vision, how can we decide which ones to use? Schmid, Mohr, and
Bauckhage (2000) were the first to propose measuring the repeatability of feature detectors,
which they define as the frequency with which keypoints detected in one image are found within
(say, ε = 1.5) pixels of the corresponding location in a transformed image. In their paper, they
transform their planar images by applying rotations, scale changes, illumination changes,
viewpoint changes, and adding noise. They also measure the information content available at
each detected feature point, which they define as the entropy of a set of rotationally invariant
local grayscale descriptors. Among the techniques they survey, they find that the improved
(Gaussian derivative) version of the Harris operator with σd = 1 (scale of the derivative
Gaussian) and σi = 2 (scale of the integration Gaussian) works best.
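A rough sketch (not from the report) of the adaptive non-maximal suppression idea described above: each keypoint's suppression radius is its distance to the nearest keypoint with a significantly stronger response, and the points with the largest radii are kept. This O(n²) version is adequate for a few thousand points.

import numpy as np

def anms(points, responses, num_keep, c_robust=0.9):
    """points: (N, 2) array of (x, y) locations; responses: (N,) corner strengths."""
    pts = np.asarray(points, dtype=np.float64)
    resp = np.asarray(responses, dtype=np.float64)
    n = len(pts)
    radii = np.full(n, np.inf)
    for i in range(n):
        # Neighbors whose response is significantly stronger than this point's.
        stronger = resp * c_robust > resp[i]
        stronger[i] = False
        if stronger.any():
            d2 = np.sum((pts[stronger] - pts[i]) ** 2, axis=1)
            radii[i] = np.sqrt(d2.min())
    return np.argsort(-radii)[:num_keep]  # indices of the selected keypoints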

Fig 4.1 The upper two images show the strongest 250 and 500 interest points, while the lower
two images show the interest points selected with adaptive non-maximal suppression, along with the
corresponding suppression radius r. Note how the latter features have a much more uniform spatial
distribution across the image.


4.2 Feature matching


Determining which feature matches are reasonable to process further depends on the
context in which the matching is being performed. Say we are given two images that overlap to a fair amount (e.g., for image stitching, or for tracking objects in a video). We know that
most features in one image are likely to match the other image, although some may not match
because they are occluded or their appearance has changed too much. On the other hand, if we
are trying to recognize how many known objects appear in a cluttered scene, most of the features
may not match. Furthermore, a large number of potentially matching objects must be searched,
which requires more efficient strategies, as described below. To begin with, we assume that the
feature descriptors have been designed so that Euclidean (vector magnitude) distances in feature
space can be directly used for ranking potential matches. If it turns out that certain parameters
(axes) in a descriptor are more reliable than others, it is usually preferable to re-scale these axes
ahead of time, e.g., by determining how much they vary when compared against other known
good matches (Hua, Brown, and Winder 2007). A more general process, which involves
transforming feature vectors into a new scaled basis, is called whitening and is discussed in more
detail in the context of eigenface-based face recognition. Given a Euclidean distance metric, the
simplest matching strategy is to set a threshold (maximum distance) and to return all matches
from other images within this threshold. Setting the threshold too high results in too many false
positives, i.e., incorrect matches being returned. Setting the threshold too low results in too many
false negatives, i.e., too many correct matches being missed.
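The thresholded nearest-neighbor matching described above can be sketched with OpenCV's ORB features and a brute-force matcher. This is an assumed setup for illustration: ORB uses Hamming rather than Euclidean distance, but the thresholding logic is the same.

import cv2

def match_features(img1, img2, max_distance=50):
    """Detect ORB keypoints in two grayscale images and keep matches below a distance threshold."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # Brute-force matcher with Hamming distance for binary ORB descriptors.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)
    # Too high a threshold admits false positives; too low discards correct matches.
    good = [m for m in matches if m.distance < max_distance]
    return kp1, kp2, sorted(good, key=lambda m: m.distance)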

Fig 4.2 The affine warp of each recognized database image onto the scene is shown as a larger
parallelogram


4.3 Edges
While interest points are useful for finding image locations that can be accurately
matched in 2D, edge points are far more plentiful and often carry important semantic
associations. For example, the boundaries of objects, which also correspond to occlusion events
in 3D, are usually delineated by visible contours. Other kinds of edges correspond to shadow
boundaries or crease edges, where surface orientation changes rapidly. Isolated edge points can
also be grouped into longer curves or contours, as well as straight line segments. It is interesting
that even young children have no difficulty in recognizing familiar objects or animals from such
simple line drawings.
The local gradient vector J points in the direction of steepest ascent in the intensity
function. Its magnitude is an indication of the slope or strength of the variation, while its
orientation points in a direction perpendicular to the local contour. Unfortunately, taking image
derivatives accentuates high frequencies and hence amplifies noise, since the proportion of noise
to signal is larger at high frequencies. It is therefore prudent to smooth the image with a low-pass
filter prior to computing the gradient. Because we would like the response of our edge detector to
be independent of orientation, a circularly symmetric smoothing filter is desirable. As we saw in
Section 3.2, the Gaussian is the only separable circularly symmetric filter and so it is used in
most edge detection algorithms.
In fact, it is not strictly necessary to take differences between adjacent levels when
computing the edge field. Think about what a zero crossing in a “generalized” difference of
Gaussians image represents. The finer (smaller kernel) Gaussian is a noise-reduced version of the
original image. The coarser (larger kernel) Gaussian is an estimate of the average intensity over a
larger region. Thus, whenever the DoG image changes sign, this corresponds to the (slightly
blurred) image going from relatively darker to relatively lighter, as compared to the average
intensity in that neighborhood. Once we have computed the sign function S(x), we must find its
zero crossings and convert these into edge elements (edgels). An easy way to detect and
represent zero crossings is to look for adjacent pixel locations x_i and x_j where the sign changes value, i.e., [S(x_i) > 0] ≠ [S(x_j) > 0].
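A hedged sketch (not from the report) of the difference-of-Gaussians idea above: smooth the image at two nearby scales, subtract, and mark pixels where the sign of the result differs from a horizontal or vertical neighbor.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_edges(image, sigma=1.0, k=1.6):
    """Mark zero crossings of a difference-of-Gaussians image as edge pixels."""
    img = image.astype(np.float64)
    dog = gaussian_filter(img, sigma) - gaussian_filter(img, k * sigma)
    s = dog > 0  # the sign function S(x)
    edges = np.zeros_like(s, dtype=bool)
    # A zero crossing occurs where the sign differs from the right or lower neighbor.
    edges[:, :-1] |= s[:, :-1] != s[:, 1:]
    edges[:-1, :] |= s[:-1, :] != s[1:, :]
    return edges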


Fig 4.3 The darkness of the edges corresponds to how many human subjects marked an object
boundary at that location.

4.4 Edge linking


While isolated edges can be useful for a variety of applications, such as line detection and
sparse stereo matching they become even more useful when linked into continuous contours. If
the edges have been detected using zero crossings of some function, linking them up is
straightforward, since adjacent edges share common endpoints. Linking the edges into chains
involves picking up an unlinked edge and following its neighbors in both directions. Either a
sorted list of edges (sorted first by x coordinates and then by y coordinates, for example) or a 2D
array can be used to accelerate the neighbor finding. If edges were not detected using zero
crossings, finding the continuation of an edge can be tricky. In this case, comparing the orientation (and, optionally, phase) of adjacent edges can be used.

Fig 4.4 Chain code representation of a grid-aligned linked edge chain. The code is represented
as a series of direction codes, e.g, 0 1 0 7 6 5, which can further be compressed using predictive and run-
length coding.


4.5 Lines
While edges and general curves are suitable for describing the contours of natural objects,
the man-made world is full of straight lines. Detecting and matching these lines can be useful in
a variety of applications, including architectural modeling, pose estimation in urban
environments, and the analysis of printed document layouts. In this section, we present some
techniques for extracting piecewise linear descriptions from the curves computed in the previous
section. We begin with some algorithms for approximating a curve as a piecewise-linear
polyline. We then describe the Hough transform, which can be used to group edgels into line
segments even across gaps and occlusions. Finally, we describe how 3D lines with common
vanishing points can be grouped together. These vanishing points can be used to calibrate a
camera and to determine its orientation relative to the scene.
RANSAC-based line detection. Another alternative to the Hough transform is the RANdom Sample Consensus (RANSAC) algorithm. In brief, RANSAC
randomly chooses pairs of edges to form a line hypothesis and then tests how many other edges
fall onto this line. (If the edge orientations are accurate enough, a single edge can produce this
hypothesis.) Lines with sufficiently large numbers of inliers (matching edges) are then selected
as the desired line segments. An advantage of RANSAC is that no accumulator array is needed
and so the algorithm can be more space efficient and potentially less prone to the choice of bin
size. The disadvantage is that many more hypotheses may need to be generated and tested than
those obtained by finding peaks in the accumulator array. In general, there is no clear consensus
on which line estimation technique performs best. It is therefore a good idea to think carefully
about the problem at hand and to implement several approaches (successive approximation,
Hough, and RANSAC) to determine the one that works best for your application.
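A minimal RANSAC line-fitting sketch (not from the report), assuming the edges are given simply as (x, y) points; a real implementation would also exploit edge orientations and refine each hypothesis with least squares.

import numpy as np

def ransac_line(points, num_iters=500, inlier_tol=2.0, rng=None):
    """Fit one line a*x + b*y + c = 0 (with (a, b) unit length) by random sampling.
    Returns the best line and a boolean inlier mask."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(points, dtype=np.float64)
    best_line, best_inliers = None, np.zeros(len(pts), dtype=bool)
    for _ in range(num_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        d = pts[j] - pts[i]
        norm = np.hypot(d[0], d[1])
        if norm == 0:
            continue
        a, b = -d[1] / norm, d[0] / norm           # unit normal of the hypothesized line
        c = -(a * pts[i, 0] + b * pts[i, 1])
        dist = np.abs(pts @ np.array([a, b]) + c)  # point-to-line distances
        inliers = dist < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_line, best_inliers = (a, b, c), inliers
    return best_line, best_inliers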
Once the Hough accumulator space has been populated, peaks can be detected in a
manner similar to that previously discussed for line detection. Given a set of candidate line
segments that voted for a vanishing point, which can optionally be kept as a list at each Hough
accumulator cell, a robust least squares fit can then be used to estimate a more accurate location for
each vanishing point.


Fig 4.5 Real-world vanishing points: (a) architecture (Sinha, Steedly, Szeliski et al. 2008), (b)
furniture © 2008 IEEE, and (c) calibration patterns (Zhang 2000).

4.6 Layered motion


To compute a layered representation of a video sequence, Wang and Adelson (1994) first
estimate affine motion models over a collection of non-overlapping patches and then cluster
these estimates using k-means. They then alternate between assigning pixels to layers and
recomputing motion estimates for each layer using the assigned pixels, using a technique first
proposed by Darrell and Pentland (1991). Once the parametric motions and pixel-wise layer
assignments have been computed for each frame independently, layers are constructed by
warping and merging the various layer pieces from all of the frames together. Median filtering is
used to produce sharp composite layers that are robust to small intensity variations, as well as to
infer occlusion relationships between the layers. The results of this process on the flower garden sequence show both the initial and final layer assignments for one of
the frames, as well as the composite flow and the alpha-matted layers with their corresponding
flow vectors overlaid.
Frame interpolation is another widely used application of motion estimation, often
implemented in the same circuitry as de-interlacing hardware required to match an incoming
video to a monitor’s actual refresh rate. As with de-interlacing,
information from novel in-between frames needs to be interpolated from preceding and
subsequent frames.


5. Image stitching
5.1 Motion models
The simplest possible motion model to use when aligning images is to simply translate
and rotate them in 2D. This is exactly the same kind of motion that you would use if you had
overlapping photographic prints. It is also the kind of technique favored by David Hockney to
create the collages that he calls joiners (Zelnik-Manor and Perona 2007; Nomura, Zhang, and
Nayar 2007). Creating such collages, which show visible seams and inconsistencies that add to
the artistic effect, is popular on Web sites such as Flickr, where they more commonly go under
the name panography. Translation and rotation are also usually adequate motion models to compensate for small camera motions in applications such as photo and video stabilization and merging. The mapping between two cameras viewing a common plane can be described using a 3 × 3 homography. Consider the matrix M_10 that arises when mapping a pixel in one image to a 3D point and then back onto a second image,

x̃_1 ∼ P̃_1 P̃_0^(-1) x̃_0 = M_10 x̃_0.

When the last row of the P_0 matrix is replaced with a plane equation n̂_0 · p + c_0 and points are assumed to lie on this plane, i.e., their disparity is d_0 = 0, we can ignore the last column of M_10 and also its last row, since we do not care about the final z-buffer depth. The resulting homography matrix H̃_10 (the upper left 3 × 3 sub-matrix of M_10) describes the mapping between pixels in the two images, x̃_1 ∼ H̃_10 x̃_0.
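To make the homography mapping concrete, here is a hedged OpenCV sketch (not part of the report) that estimates H from matched points between two overlapping images and warps one into the other's frame; cv2.findHomography runs RANSAC internally.

import numpy as np
import cv2

def align_pair(img0, img1, pts0, pts1):
    """Estimate the 3x3 homography mapping image 0 into image 1 and warp image 0.
    pts0, pts1: corresponding (N, 2) point arrays obtained from feature matching."""
    H, inlier_mask = cv2.findHomography(np.float32(pts0), np.float32(pts1), cv2.RANSAC, 3.0)
    h, w = img1.shape[:2]
    warped0 = cv2.warpPerspective(img0, H, (w, h))  # x1 ~ H x0
    return H, warped0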

Fig 5.1 Two-dimensional motion models and how they can be used for image stitching.

5.2 Gap closing


The techniques presented in this section can be used to estimate a series of rotation
matrices and focal lengths, which can be chained together to create large panoramas.
Unfortunately, because of accumulated errors, this approach will rarely produce a closed 360◦
panorama. Instead, there will invariably be either a gap or an overlap (Fig 5.2). We can solve this problem by matching the first image in the sequence with the last one. The difference between the two rotation matrix estimates associated with the repeated first image indicates the amount of misregistration. This error can be distributed evenly across the whole sequence by taking the quotient of the two quaternions associated with these rotations and dividing this “error quaternion” by the number of images in the sequence (assuming relatively constant inter-frame rotations). We can also update the estimated focal length based on the amount of misregistration. To do this, we first convert the error quaternion into a gap angle, θ_g, and then update the focal length using the equation f' = f(1 − θ_g/360°). Fig 5.2a shows the end of the registered image sequence and the first image. There is a big gap between the last image and the first, which are in fact the same image. The gap is 32° because the wrong estimate of focal length (f = 510) was used. Fig 5.2b shows the registration after closing the gap with the correct focal length (f = 468). Notice that both mosaics show very little visual misregistration (except at the gap), yet the first has been computed using a focal length that has 9% error.

Fig 5.2 Gap closing (Szeliski and Shum 1997) © 1997 ACM: (a) A gap is visible when the focal length is
wrong (f = 510). (b) No gap is visible for the correct focal length (f = 468).


5.3 Video summarization and compression


An interesting application of image stitching is the ability to summarize and compress
videos taken with a panning camera. This application was first suggested by Teodosio and
Bender (1993), who called their mosaic-based summaries salient stills. These ideas were then
extended by Irani, Hsu, and Anandan (1995), Kumar, Anandan, Irani et al., and Irani and
Anandan (1998) to additional applications, such as video compression and video indexing. While
these early approaches used affine motion models and were therefore restricted to long focal
lengths, the techniques were generalized by Lee, Chen, Lin et al. to full eight-parameter homographies and incorporated into the MPEG-4 video compression standard, where the stitched background layers were called video sprites. While video stitching is in many ways a straightforward generalization of multiple-image stitching (Baudisch, Tan, Steedly et al. 2006),
the potential presence of large amounts of independent motion, camera zoom, and the desire to
visualize dynamic events impose additional challenges. For example, moving foreground objects
can often be removed using median filtering. Alternatively, foreground objects can be extracted
into a separate layer (Sawhney and Ayer 1996) and later composited back into the stitched
panoramas, sometimes as multiple instances to give the impression of a “chronophotograph”.

Fig 5.3 Video stitching the background scene to create a single sprite image that can be
transmitted and used to re-create the background in each frame

5.4 Recognizing panoramas


If the user takes images in sequence so that each image overlaps its predecessor and also
specifies the first and last images to be stitched, bundle adjustment combined with the process of
topology inference can be used to automatically assemble a panorama (Sawhney and Kumar
1999). However, users often jump around when taking panoramas, e.g., they may start a new row
on top of a previous one, jump back to take a repeat shot, or create 360◦ panoramas where end-
to-end overlaps need to be discovered. Furthermore, the ability to discover multiple panoramas
taken by a user over an extended period of time can be a big convenience. To recognize
panoramas, Brown and Lowe (2007) first find all pairwise image overlaps using a feature-based
method and then find connected components in the overlap graph to “recognize” individual
panoramas. The feature-based matching stage first extracts scale invariant feature transform
(SIFT) feature locations and feature descriptors (Lowe 2004) from all the input images and
places them in an indexing structure. For each image pair under
consideration, the nearest matching neighbor is found for each feature in the first image, using
the indexing structure to rapidly find candidates and then comparing feature descriptors to find
the best match. RANSAC is used to find a set of inlier matches; pairs of matches are used to
hypothesize similarity motion models that are then used to count the number of inliers. (A more
recent RANSAC algorithm tailored specifically for rotational panoramas is described by Brown,
Hartley, and Nistér (2007).) In practice, the most difficult part of getting a fully automated stitching algorithm to work is deciding which pairs of images actually correspond to the same parts of the scene. Repeated structures such as windows can lead to false matches when using a feature-based approach. One way to mitigate this problem is to perform a direct pixel-based
comparison between the registered images to determine if they actually are different views of the
same scene. Unfortunately, this heuristic may fail if there are moving objects in the scene.

Fig 5.4 (a) composite with parallax; (b) after a single deghosting step (patch size 32); (c) after multiple
steps


5.5 Heads and faces


Perhaps the most widely used application of 3D head modeling is facial animation. Once
a parameterized 3D model of shape and appearance (surface texture) has been constructed, it can
be used directly to track a person’s facial motions and to animate a different
character with these same motions and expressions (Pighin, Szeliski, and Salesin 2002). An
improved version of such a system can be constructed by first applying principal component
analysis (PCA) to the space of possible head shapes and facial appearances. Blanz and Vetter
(1999) describe a system where they first capture a set of 200 colored range scans of faces,
which can be represented as a large collection of (X, Y, Z, R, G, B) samples (vertices). In order
for 3D morphing to be meaningful, corresponding vertices in different people’s scans must first
be put into correspondence. Once this is done, PCA can be applied to more naturally
parameterize the 3D morphable model. The flexibility of this model can be increased by
performing separate analyses in different subregions, such as the eyes, nose, and mouth, just as
in modular eigenspaces.

Fig 5.5 (a) set of five input images along with user-selected keypoints; (b) the complete set of
keypoints and curves; (c) three meshes—the original, adapted after 13 keypoints, and after an additional
99 keypoints; (d) the partition of the image into separately animatable regions.


6. Recognition
6.1 Face detection
Appearance-based approaches scan over small overlapping rectangular patches of the
image searching for likely face candidates, which can then be refined using a cascade of more
expensive but selective detection algorithms. In order to deal with scale variation, the image is
usually converted into a sub-octave pyramid and a separate scan is performed on each level.
Most appearance-based approaches today rely heavily on training classifiers using sets of labeled
face and non-face patches. Sung and Poggio (1998) and Rowley, Baluja, and Kanade (1998a)
present two of the earliest appearance-based face detectors and introduce a number of
innovations that are widely used in later work by others. To start with, both systems collect a set
of labeled face patches as well as a set of patches taken from images that are known not to
contain faces, such as aerial images or vegetation. The collected face images are augmented by
artificially mirroring, rotating, scaling, and translating the images by small amounts to make the
face detectors less sensitive to such effects. After an initial set of training images has been
collected, some optional pre-processing can be performed, such as subtracting an average
gradient (linear function) from the image to compensate for global shading effects and using
histogram equalization to compensate for varying camera contrast.
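A rough sketch (not from the report) of the pre-processing step described above: fit and subtract a best-fit linear intensity ramp from each training patch to remove global shading, then histogram-equalize to normalize contrast.

import numpy as np
import cv2

def preprocess_patch(patch):
    """patch: 2D uint8 array, e.g. a 20x20 face window."""
    p = patch.astype(np.float64)
    h, w = p.shape
    y, x = np.mgrid[0:h, 0:w]
    # Least-squares fit of a plane a*x + b*y + c to the patch intensities.
    A = np.column_stack([x.ravel(), y.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, p.ravel(), rcond=None)
    p -= (A @ coeffs).reshape(h, w)               # remove the linear shading component
    p -= p.min()
    p = (255.0 * p / max(p.max(), 1e-6)).astype(np.uint8)
    return cv2.equalizeHist(p)                    # normalize contrast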

Fig 6.1 (a) artificially mirroring, rotating, scaling, and translating training images for greater variability;
(b) using images without faces (looking up at a tree) to generate non-face examples; (c) pre-processing
the patches by subtracting a best fit linear function (constant gradient) and histogram equalizing

6.2 Neural networks in detection


Instead of first clustering the data and computing Mahalanobis distances to the cluster
centers, Rowley, Baluja, and Kanade apply a neural network (MLP) directly to the 20×20 pixel
patches of gray-level intensities, using a variety of differently sized hand-crafted “receptive
fields” to capture both large-scale and smaller-scale structure. The resulting neural network directly outputs the likelihood of a face at the center of every overlapping patch in a multi-resolution pyramid. Since several overlapping patches (in both space and resolution) may fire near a face, an additional merging network is used to merge overlapping detections. The authors also experiment with training several networks and merging their outputs. To make the detector run faster, a separate network operating on 30×30
patches is trained to detect both faces and faces shifted by ±5 pixels. This network is evaluated at
every 10th pixel in the image (horizontally and vertically) and the results of this “coarse” or
“sloppy” detector are used to select regions on which to run the slower single-pixel overlap
technique. To deal with in-plane rotations of faces, Rowley, Baluja, and Kanade (1998b) train a
router network to estimate likely rotation angles from input patches and then apply the estimated
rotation to each patch before running the result through their upright face detector.

Fig 6.2 Overlapping patches are extracted from different levels of a pyramid and then pre-processed. A three-layer neural network is then used to detect likely face locations.

6.3 Location recognition


Some approaches to location recognition assume that the photos consist of architectural
scenes for which vanishing directions can be used to pre-rectify the images for easier matching.
Other approaches use general affine covariant interest points to perform wide baseline matching; such ideas were later applied to large-scale image matching and (implicit) location
recognition from Internet photo collections taken under a wide variety of viewing conditions.
The main difficulty in location recognition is in dealing with the extremely large community
(user-generated) photo collections on Web sites such as Flickr or commercially captured
databases. The prevalence of commonly appearing elements such as foliage, signs, and common
architectural elements further complicates the task. Results have been demonstrated both on community photo collections and on denser, commercially acquired datasets. In the latter case, the overlap between adjacent database images can be used to
verify and prune potential matches using “temporal” filtering, i.e., requiring the query image to
match nearby overlapping database images before accepting the match. Another variant on
location recognition is the automatic discovery of landmarks, i.e., frequently photographed
objects and locations. Simon, Snavely, and Seitz (2007) show how these kinds of objects can be
discovered simply by analyzing the matching graph constructed as part of the 3D modeling
process in Photo Tourism. More recent work has extended this approach to larger data sets using
efficient clustering techniques as well as combining meta-data such as GPS and textual tags with
visual search It is now even possible to automatically associate object tags with images based on
their co-occurrence in multiple loosely tagged images.

Fig 6.3 Automatic mining, annotation, and localization of community photo collections. This figure does not show the textual annotations.


7. IMPLEMENTATION
7.1 Object detection
If we are given an image to analyze, such as a group portrait, we could try to apply a
recognition algorithm to every possible sub-window in this image. Such algorithms are likely to
be both slow and error-prone. Instead, it is more effective to construct special purpose detectors,
whose job it is to rapidly find likely regions where particular objects might occur. We begin this
section with face detectors, which are some of the more successful examples of recognition. For
example, such algorithms are built into most of today’s digital cameras to enhance auto-focus
and into video conferencing systems to control pan-tilt heads. We then look at pedestrian
detectors, as an example of more general methods for object detection. Such detectors can be
used in automotive safety applications, e.g., detecting pedestrians and other cars from moving
vehicles.
Of all the visual tasks we might ask a computer to perform, analyzing a scene and recognizing all of the constituent objects remains the most challenging. While computers excel at accurately reconstructing the 3D shape of a scene from images taken from different views, they cannot name all the objects and animals present in a picture, even at the level of a two-year-old
child. There is not even any consensus among researchers on when this level of performance
might be achieved.
The key to the success of boosting is the method for incrementally selecting the weak learners and for re-weighting the training examples after each stage. The AdaBoost (Adaptive Boosting) algorithm does this by re-weighting each sample as a function of whether it is correctly classified at each stage, and using the stage-wise average classification error to determine the final weightings α_j among the weak classifiers. While the resulting classifier is extremely fast in practice, the training time can be quite slow (on the order of weeks), because of the large number of feature (difference of rectangle) hypotheses that need to be examined at each
stage. To further increase the speed of the detector, it is possible to create a cascade of classifiers,
where each classifier uses a small number of tests (say, a two-term AdaBoost classifier) to reject
a large fraction of non-faces while trying to pass through all potential face candidates.
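A hedged sketch of the AdaBoost re-weighting loop described above (not the Viola–Jones implementation), assuming a list of pre-trained weak classifiers that each return labels in {-1, +1}.

import numpy as np

def adaboost(X, y, weak_learners, num_stages=10):
    """X: (N, d) features; y: (N,) labels in {-1, +1};
    weak_learners: callables h(X) -> predictions in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform sample weights
    chosen, alphas = [], []
    for _ in range(num_stages):
        # Pick the weak learner with the lowest weighted error.
        errors = [np.sum(w * (h(X) != y)) for h in weak_learners]
        j = int(np.argmin(errors))
        err = max(errors[j], 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # stage weight alpha_j
        pred = weak_learners[j](X)
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified samples
        w /= w.sum()
        chosen.append(weak_learners[j])
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, chosen)))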


Fig 7.1 Output of object detection (object-detection.jpg)

7.2 Implementation object detection code


import argparse
import numpy as np
import cv2

# Parse the command-line arguments: input image, YOLO config, weights and class names.
ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True,
                help='path to input image')
ap.add_argument('-c', '--config', required=True,
                help='path to yolo config file')
ap.add_argument('-w', '--weights', required=True,
                help='path to yolo pre-trained weights')
ap.add_argument('-cl', '--classes', required=True,
                help='path to text file containing class names')
args = ap.parse_args()


def get_output_layers(net):
    # Return the names of the network's output (unconnected) layers.
    layer_names = net.getLayerNames()
    try:
        return [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    except IndexError:
        # Newer OpenCV versions return a flat array of indices.
        return [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]


def draw_prediction(img, class_id, confidence, x, y, x_plus_w, y_plus_h):
    # Draw the bounding box and class label for a single detection.
    label = str(classes[class_id])
    color = COLORS[class_id]
    cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)
    cv2.putText(img, label, (x - 10, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)


# Read the input image and record its dimensions.
image = cv2.imread(args.image)
Width = image.shape[1]
Height = image.shape[0]
scale = 0.00392  # roughly 1/255, scales pixel values to [0, 1]

# Load the class names, one per line.
with open(args.classes, 'r') as f:
    classes = [line.strip() for line in f.readlines()]

# Assign a random color to each class for drawing.
COLORS = np.random.uniform(0, 255, size=(len(classes), 3))

# Load the pre-trained YOLO network and run a forward pass on the image blob.
net = cv2.dnn.readNet(args.weights, args.config)
blob = cv2.dnn.blobFromImage(image, scale, (416, 416), (0, 0, 0), True, crop=False)
net.setInput(blob)
outs = net.forward(get_output_layers(net))

class_ids = []
confidences = []
boxes = []
conf_threshold = 0.5
nms_threshold = 0.4

# Keep every detection whose confidence exceeds the threshold.
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > conf_threshold:
            center_x = int(detection[0] * Width)
            center_y = int(detection[1] * Height)
            w = int(detection[2] * Width)
            h = int(detection[3] * Height)
            x = center_x - w / 2
            y = center_y - h / 2
            class_ids.append(class_id)
            confidences.append(float(confidence))
            boxes.append([x, y, w, h])

# Non-maximum suppression removes overlapping duplicate boxes.
indices = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)

# Draw the surviving detections, then display and save the result.
for i in np.array(indices).flatten():
    x, y, w, h = boxes[i]
    draw_prediction(image, class_ids[i], confidences[i], round(x), round(y),
                    round(x + w), round(y + h))

cv2.imshow("object detection", image)
cv2.waitKey()
cv2.imwrite("object-detection.jpg", image)
cv2.destroyAllWindows()

The output is shown in Fig 7.1 (object-detection.jpg).
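
For reference, assuming the YOLOv3 configuration, weights, and class-name files have been downloaded locally (the script name and file names below are only placeholders), the program could be invoked as:

python object_detection.py --image dog.jpg --config yolov3.cfg --weights yolov3.weights --classes coco.names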

Fig 7.1.1 Person detection output of the trained model


8. Applications and Purpose

8.1 Applications and usage


• Optical character recognition (OCR)
• Machine inspection
• Retail (e.g. automated checkouts)
• 3D model building (photogrammetry)
• Medical imaging
• Automotive safety
• Match move (e.g. merging CGI with live actors in movies)
• Motion capture (mocap)
• Surveillance
• Fingerprint recognition and biometrics

8.1.1 Optical character recognition (OCR)


Optical character recognition is the electronic or mechanical conversion of images of
typed, handwritten or printed text into machine-encoded text, whether from a scanned document,
a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape
photo) or from subtitle text superimposed on an image (for example: from a television
broadcast).
Widely used as a form of data entry from printed paper data records – whether passport
documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of
static data, or any suitable documentation – it is a common method of digitizing printed text so that
it can be electronically edited, searched, stored more compactly, displayed online, and used in
machine processes such as cognitive computing, machine translation, (extracted) text-to-speech,
key data and text mining. OCR is a field of research in pattern recognition, artificial
intelligence and computer vision.
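
As a simple illustration of the idea (not a method prescribed in this report), the sketch below uses the open-source Tesseract engine through the pytesseract wrapper to convert a photographed or scanned page into machine-encoded text; the input file name is a placeholder, and the thresholding step is just a common pre-processing choice.

import cv2
import pytesseract  # Python wrapper around the open-source Tesseract OCR engine

# Load a scanned page or photo of text (placeholder file name) and convert to grayscale.
img = cv2.imread("document.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Binarize with Otsu's threshold to reduce the effect of uneven lighting.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Run OCR and print the recognized, machine-encoded text.
text = pytesseract.image_to_string(binary)
print(text)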


Fig 8.1.1 OCR Number Plate Recognition

Fig 8.1.2 Applications where computer vision is used


8.2 Purpose
As humans, we perceive the three-dimensional structure of the world around us with
apparent ease. Think of how vivid the three-dimensional percept is when you look at a vase of
flowers sitting on the table next to you. You can tell the shape and translucency of each petal
through the subtle patterns of light and shading that play across its surface and effortlessly
segment each flower from the background of the scene. Looking at a framed group portrait, you
can easily count (and name) all of the people in the picture and even guess at their emotions from
their facial appearance. Perceptual psychologists have spent decades trying to understand how
the visual system works and, even though they can devise optical illusions to tease apart some
of its principles, a complete solution to this puzzle remains elusive.

8.2.1 Disadvantages
• Necessity of specialists: there is a strong need for specialists in machine learning and
artificial intelligence, professionals who know how these devices work, can take full
advantage of computer vision, and can repair the systems when necessary. Although there
are many work opportunities for graduates of master's programmes in artificial
intelligence, companies are still waiting for these specialists.

• Spoiling: eliminating the human factor may be good in some cases, but when a machine
or device fails, it does not announce or anticipate the problem, whereas a human operator
can often warn in advance that something will go wrong.

• Failing in image processing: when a device fails because of a virus or other software
issue, it is highly probable that the computer vision and image processing functions will
fail as well. If the problem is not solved, the functions of the device can be lost entirely,
which can freeze entire production lines in the case of warehouses.

8.3 Components in Computer Vision


Fig 8.3.1 Components used in Computer Vision

8.4 Context and scene understanding


Thus far, we have mostly considered the task of recognizing and localizing objects in
isolation from that of understanding the scene (context) in which the objects occur. This is a
severe limitation, as context plays a very important role in human object recognition. As we will
see in this section, it can greatly improve the performance of object recognition algorithms, as
well as provide useful semantic clues for general scene understanding. Consider the two
photographs in the figure: can you name all of the objects, especially those circled in images (c–d)?
Now have a closer look at the circled objects. So much for our ability to recognize objects by their
shape! Another (perhaps more artificial) example of recognition in context is to try to name all of
the letters and numbers, and then see if you guessed right. Even though we have not addressed
context explicitly earlier in this chapter, we have already seen several instances of this general
idea being used. A simple way to incorporate spatial information into a recognition algorithm is
to compute feature statistics over different regions, as in the spatial pyramid system of Lazebnik,
Schmid, and Ponce. Part-based models use a kind of local context, where various parts need to
be arranged in a proper geometric relationship to constitute an object. The biggest difference
between part-based and context models is that the latter combine objects into scenes and the
number of constituent objects from each class is not known in advance. In fact, it is possible to
combine part-based and context models.
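
To make the idea of computing feature statistics over different regions concrete, the sketch below builds a simple spatial pyramid descriptor. It assumes word_map is a 2D array of quantized visual-word indices (one per patch), and it omits the level-dependent weighting used in the original spatial pyramid matching scheme.

import numpy as np

def spatial_pyramid_histogram(word_map, num_words, levels=2):
    # Concatenate visual-word histograms computed over successively finer
    # grids of cells (1x1, 2x2, 4x4, ...), the basic idea behind spatial pyramids.
    H, W = word_map.shape
    features = []
    for level in range(levels + 1):
        cells = 2 ** level  # number of cells along each axis at this level
        for i in range(cells):
            for j in range(cells):
                cell = word_map[i * H // cells:(i + 1) * H // cells,
                                j * W // cells:(j + 1) * W // cells]
                hist, _ = np.histogram(cell, bins=num_words, range=(0, num_words))
                features.append(hist / max(hist.sum(), 1))  # per-cell normalization
    return np.concatenate(features)

Concatenating per-cell histograms at several grid resolutions lets a classifier exploit the rough spatial layout of the scene while remaining robust to small shifts of the objects within it.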

Fig 8.4 (a) some street scenes and their corresponding labels (magenta = buildings, red =
cars, green = trees, blue = road); (b) some office scenes (red = computer screen, green =
keyboard, blue = mouse); (c) learned contextual models built from these labeled scenes. The top
row shows a sample label image and the distribution of the objects relative to the center red (car
or screen) object. The bottom rows show the distributions of parts that make up each object.

Applications range from tasks such as industrial machine vision systems which, say,
inspect bottles speeding by on a production line, to research into artificial intelligence and
computers or robots that can comprehend the world around them. The computer vision and
machine vision fields have significant overlap. Computer vision covers the core technology of
automated image analysis which is used in many fields. Machine vision usually refers to a
process of combining automated image analysis with other methods and technologies to provide
automated inspection and robot guidance in industrial applications. In many computer-vision
applications, the computers are pre-programmed to solve a particular task, but methods based on
learning are now becoming increasingly common.


9. CONCLUSION AND FUTURE SCOPE

9.1 Conclusion
The goal of computer vision is to understand the content of digital images. Typically, this involves
developing methods that attempt to reproduce the capability of human vision.

The research results published have formed a foundation for further investigation of the
ability of computer vision systems to better provide quantitative information that is unobtainable
subjectively, leading to the eventual replacement of human graders.

It is clear that computer vision technology will continue to be a very useful tool in
tackling a wide variety of challenges in meat classification and quality prediction. Good
prediction and classification results have been obtained on a large number of occasions using
simple, rapid and affordable technology, even as new challenges are encountered.

9.2 Future Scope


As computer power becomes cheaper, more accessible, and portable, we can expect an
increasing number of computer vision applications for businesses in the near future.
In the future, developing accurate, regulated and purpose-driven facial recognition
software is the challenge for computer vision engineering. It is also the future of how people do
their everyday tasks: from unlocking a phone to signing into a bank app or logging the time they
get to the office. Facial recognition keeps impressing with its new applications. Today, people
can go to a concert without the tickets they bought online; they just have to show their face at
the gate.
With the help of algorithms, computer vision is currently capable of recognizing and
tracking objects, but it can go much further than that. Neural networks are now at such a high
level that they are capable of restoring and even creating new images. All of these features,
which seemed to be available only in futuristic or science fiction movies, are now on the market
thanks to AI algorithms such as Generative Adversarial Networks (GANs).
Today, big cloud providers have released their own AI and vision models, including facial
recognition APIs. Nevertheless, these models are built with rather general-purpose algorithms
that may be sensitive to specific environmental details but do not offer the possibility of
retraining the models with data gathered in a production environment.


10. BIBLIOGRAPHY
1. https://machinelearningmastery.com/what-is-computer-vision/
2. https://en.wikipedia.org/wiki/Computer_vision
3. https://www.pcmag.com/news/what-is-computer-vision
4. https://www.sciencedaily.com/terms/computer_vision.htm
5. https://www.analyticsinsight.net/state-computer-vision-now-future/
6. https://www.innoplexus.com/blog/understanding-the-computer-vision-technology/
