"Introduction To Computer Vision": Submitted by
"Introduction To Computer Vision": Submitted by
BACHELOR OF TECHNOLOGY
IN
Computer Science and Engineering
Submitted By
GITAM (Deemed to be University)
Rudraram Mandal, Sangareddy district, Patancheru, Hyderabad,
Telangana 502329
Introduction to Computer Vision
Contents
Abstract
1. Introduction
   1.1 What is computer vision?
   1.2 A brief history
   1.3 Algorithm
2. Image formation
   2.1 Photometric image formation
      2.1.1 Lighting
      2.1.2 Reflectance and shading
      2.1.3 Optics
   2.3 The digital camera
      2.3.1 Color cameras
      2.3.2 Color filter arrays
      2.3.3 Compression
3. Image processing
   3.1 Point operators
      3.1.1 Color transforms
   3.2 Linear filtering
   3.3 Band-pass and steerable filters
4. Feature detection and matching
   4.3 Edges
   4.5 Lines
5. Image stitching
   5.1 Motion models
   5.2 Gap closing
   5.3 Video summarization and compression
   5.4 Recognizing panoramas
   5.5 Heads and faces
6. Recognition
   6.1 Face detection
   6.2 Neural networks in detection
   6.3 Location recognition
7. Implementation
   7.1 Object detection
   7.2 Implementation of object detection code
10. Bibliography
Abstract:
M. Sailohith
221710304034
CSE-B4
1. INTRODUCTION
1.1 What is computer vision?
Computer vision is an interdisciplinary scientific field that deals with how computers can
gain high-level understanding from digital images or videos. From the perspective
of engineering, it seeks to understand and automate tasks that the human visual system can do.
Computer vision tasks include methods for acquiring, processing, analyzing and
understanding digital images, and extraction of high-dimensional data from the real world in
order to produce numerical or symbolic information, e.g., in the form of
decisions. Understanding in this context means the transformation of visual images (the input of
the retina) into descriptions of the world that make sense to thought processes and can elicit
appropriate action. This image understanding can be seen as the disentangling of symbolic
information from image data using models constructed with the aid of geometry, physics,
statistics, and learning theory.
1.2 A brief history
When computer vision first started out in the early 1970s, it was viewed as the visual perception component of an ambitious agenda to mimic human intelligence and to endow robots with intelligent behavior. At the time, it was believed by some of the early pioneers of artificial intelligence and robotics (at places such as MIT, Stanford, and CMU) that solving the "visual input" problem would be an easy step along the path to solving more difficult problems such as higher-level reasoning and planning. According to one well-known story, in 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman to "spend the summer linking a camera to a computer and getting the computer to describe what it saw". We now know that the problem is slightly more difficult than that. What distinguished computer vision from the already existing field of digital image processing (Rosenfeld and Pfaltz 1966; Rosenfeld and Kak 1976) was a desire to recover the three-dimensional structure of the world from images and to use this as a stepping stone towards full scene understanding. Winston (1975) and Hanson and Riseman (1978) provide two nice collections of classic papers from this early period. Early attempts at scene understanding involved extracting edges and then inferring the 3D structure of an object or a "blocks world" from the topological structure of the 2D lines (Roberts 1965). Several line-labeling algorithms were developed at the time to make this kind of inference more reliable.
This past decade has continued to see a deepening interplay between the vision and graphics
fields. In particular, many of the topics introduced under the rubric of image-based rendering,
such as image stitching, light-field capture and rendering, and high dynamic range (HDR) image capture through exposure bracketing (Mann and Picard 1995; Debevec and Malik 1997), were re-christened as computational photography to
acknowledge the increased use of such techniques in everyday digital photography. For example,
the rapid adoption of exposure bracketing to create high dynamic range images necessitated the
development of tone mapping algorithms to convert such images back to displayable results
(Fattal, Lischinski, and Werman 2002; Durand and Dorsey 2002; Reinhard, Stark, Shirley et al.
2002; Lischinski, Farbman, Uyttendaele et al. 2006a). In addition to merging multiple exposures,
techniques were developed to merge flash images with non-flash counterparts (Eisemann and
Durand 2004; Petschnigg, Agrawala, Hoppe et al. 2004) and to interactively or automatically
select different regions from overlapping images.
1.3 Algorithm
2. Image formation
2.1 Photometric image formation
In modeling the image formation process, we have described how 3D geometric features in
the world are projected into 2D features in an image. However, images are not composed of 2D
features. Instead, they are made up of discrete color or intensity values. Where do these values
come from? How do they relate to the lighting in the environment, surface properties and geometry, camera optics, and sensor properties? In this section, we develop a set of models to describe these interactions and formulate a generative process of image formation. A more detailed treatment of these topics can be found in other textbooks on computer graphics and image synthesis.
2.1.1 Lighting
Images cannot exist without light. To produce an image, the scene must be illuminated
with one or more light sources. (Certain modalities such as fluorescent microscopy and X-ray
tomography do not fit this model, but we do not deal with them in this book.) Light sources can
generally be divided into point and area light sources. A point light source originates at a single
location in space (e.g., a small light bulb), potentially at infinity (e.g., the sun). (Note that for
some applications such as modeling soft shadows (penumbras), the sun may have to be treated as
an area light source.) In addition to its location, a point light source has an intensity and a color
spectrum, i.e., a distribution over wavelengths L(λ).
Fig 2.2.1 A simplified model of photometric image formation. Light is emitted by one or more light sources and is then reflected from an object's surface. A portion of this light is directed towards the camera. This simplified model ignores multiple reflections, which often occur in real-world scenes.
The intensity of a light source falls off with the square of the distance
between the source and the object being lit, because the same light is being spread over a larger
(spherical) area. A light source may also have a directional falloff (dependence), but we ignore
this in our simplified model. Area light sources are more complicated. A simple area light source
such as a fluorescent ceiling light fixture with a diffuser can be modeled as a finite rectangular
area emitting light equally in all directions (Cohen and Wallace 1993; Sillion and Puech 1994;
Glassner 1995). When the distribution is strongly directional, a four-dimensional lightfield can
be used instead (Ashdown 1993). A more complex light distribution that approximates, say, the
incident illumination on an object sitting in an outdoor courtyard, can often be represented using
an environment map (Greene 1986) (originally called a reflection map (Blinn and Newell 1976)).
This representation maps incident light directions v̂ to color values (or wavelengths, λ), L(v̂; λ).
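As a concrete illustration of the inverse-square falloff and the diffuse (Lambertian) shading term used in this simplified model, here is a minimal sketch; the light position, intensity, surface point, and normal are illustrative values, not part of the text above.
import numpy as np

# Irradiance from a single point light with inverse-square falloff and a
# Lambertian cosine term (all quantities are illustrative).
def point_light_irradiance(light_pos, light_intensity, surface_point, surface_normal):
    to_light = light_pos - surface_point
    distance = np.linalg.norm(to_light)
    direction = to_light / distance
    falloff = 1.0 / (distance ** 2)                       # inverse-square law
    cosine = max(np.dot(surface_normal, direction), 0.0)  # Lambert's cosine term
    return light_intensity * falloff * cosine

E = point_light_irradiance(np.array([0.0, 2.0, 0.0]), 100.0,
                           np.array([0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
print(E)  # 25.0 for this configuration (distance 2, head-on illumination)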
Fig 2.2.3 A thin lens of focal length f focuses the light from a plane a distance zo in front of the lens at a distance zi behind it.
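The geometry in this figure follows the thin-lens equation 1/zo + 1/zi = 1/f. A minimal numeric check, with illustrative values for an ordinary 50 mm lens focused on an object 1 m away:
# Thin-lens relation: solve for the image distance zi given zo and f.
def image_distance(zo, f):
    return 1.0 / (1.0 / f - 1.0 / zo)

print(image_distance(zo=1000.0, f=50.0))  # ~52.6 mm behind the lens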
2.3 The digital camera
Fig 2.3 Image sensing pipeline, showing the various sources of noise as well as typical digital post-processing steps.
In a CCD, photons are accumulated in each active well during the exposure time. Then, in a
transfer phase, the charges are transferred from well to well in a kind of “bucket brigade” until
they are deposited at the sense amplifiers, which amplify the signal and pass it to an analog-to-
digital converter (ADC). Older CCD sensors were prone to blooming, when charges from one
over-exposed pixel spilled into adjacent ones, but most newer CCDs have anti-blooming
technology (“troughs” into which the excess charge can spill). In CMOS, the photons hitting the
sensor directly affect the conductivity (or gain) of a photodetector, which can be selectively
gated to control exposure duration, and locally amplified before being read out using a
multiplexing scheme. Traditionally, CCD sensors outperformed CMOS in quality sensitive
applications, such as digital SLRs, while CMOS was better for low-power applications, but today
CMOS is used in most digital cameras. The main factors affecting the performance of a digital
image sensor are the shutter speed, sampling pitch, fill factor, chip size, analog gain, sensor
noise, and the resolution (and quality) of the analog-to-digital converter.
2.3.1 Color cameras
Properly calibrated monitors make the color response of the display available to software applications that perform color management, so that colors in real life, on the screen, and on the printer all match as closely as possible.
2.3.2 Color filter arrays
Fig 2.3.2 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values, with unknown (guessed) values shown as lower case.
The most commonly used pattern in color cameras today is the Bayer pattern (Bayer
1976), which places green filters over half of the sensors (in a checkerboard pattern), and red and
blue filters over the remaining ones (Figure 2.3.2). The reason that there are twice as many green
filters as red and blue is because the luminance signal is mostly determined by green values and
the visual system is much more sensitive to high frequency detail in luminance than in
chrominance (a fact that is exploited in color image compression—see Section 2.3.3). The
process of interpolating the missing color values so that we have valid RGB values for all the
pixels is known as demosaicing and is covered in detail. Similarly, color LCD monitors typically
use alternating stripes of red, green, and blue filters placed in front of each liquid crystal active
area to simulate the experience of a full color display. As before, because the visual system has
higher resolution (acuity) in luminance than chrominance, it is possible to digitally pre-filter
RGB (and monochrome) images to enhance the perception of crispness.
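Since demosaicing is only described conceptually here, the following is a minimal sketch of how it can be done in practice with OpenCV; the file name and the specific Bayer layout code (COLOR_BayerBG2BGR) are assumptions that depend on the actual sensor.
import cv2

# "raw.png" is a hypothetical single-channel image captured through a Bayer
# color filter array; cvtColor interpolates the missing color values.
raw = cv2.imread("raw.png", cv2.IMREAD_GRAYSCALE)
rgb = cv2.cvtColor(raw, cv2.COLOR_BayerBG2BGR)
cv2.imwrite("demosaiced.png", rgb)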
2.3.3 Compression
The last stage in a camera’s processing pipeline is usually some form of image
compression (unless you are using a lossless compression scheme such as camera RAW or
PNG). All color video and image compression algorithms start by converting the signal into
YCbCr (or some closely related variant), so that they can compress the luminance signal with
higher fidelity than the chrominance signal.
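A minimal sketch of this luminance/chrominance separation using OpenCV (which names the space YCrCb); the file name and the 2x chroma downsampling factor are illustrative.
import cv2

# Convert to a luminance/chrominance representation before compression.
bgr = cv2.imread("photo.jpg")
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
y, cr, cb = cv2.split(ycrcb)

# The chroma channels can be stored at lower resolution with little visible
# loss, which is the basis of JPEG-style compression.
cr_small = cv2.resize(cr, (cr.shape[1] // 2, cr.shape[0] // 2))
cb_small = cv2.resize(cb, (cb.shape[1] // 2, cb.shape[0] // 2))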
Fig 2.3.3 Color space transformations: (a–d) RGB; (e–h) rgb. (i–k) L*a*b*; (l–n) HSV. Note that the rgb,
L*a*b*, and HSV values are all re-scaled to fit the dynamic range of the printed page.
3. Image processing
Now that we have seen how images are formed through the interaction of 3D scene elements, lighting, and camera optics and sensors, let us look at the first stage in most computer vision applications, namely the use of image processing to preprocess the image and convert it into a form suitable for further analysis. Examples of such operations include exposure correction and color balancing, the reduction of image noise, increasing sharpness, or straightening the image by rotating it. While some may consider image processing to be outside the purview of computer vision, most computer vision applications, such as computational photography and even recognition, require care in designing the image processing stages in order to achieve acceptable results.
Another important class of global operators are geometric transformations, such as rotations, shears, and perspective deformations. Finally, global optimization approaches to image processing involve the minimization of an energy functional or, equivalently, optimal estimation using Bayesian Markov random field models.
3.1 Point operators
The simplest kinds of image processing transforms are point operators, where each output pixel's value depends on only the corresponding input pixel value (plus, potentially, some globally collected information or parameters). Examples of such operators include brightness and contrast adjustments (Figure 3.1) as well as color correction and transformations. In the image processing literature, such operations are also known as point processes.
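A minimal sketch of such point operators, using the gain, offset, and gamma values quoted in Fig 3.1; the input file name is illustrative.
import cv2
import numpy as np

# Brightness/contrast point operator g(x) = a*f(x) + b applied to every pixel.
image = cv2.imread("input.jpg")
a, b = 1.1, 16                                            # gain and offset
adjusted = cv2.convertScaleAbs(image, alpha=a, beta=b)    # clamps to [0, 255]

# Gamma correction, another per-pixel operator, implemented with a lookup table.
gamma = 1.2
lut = np.array([255 * (i / 255.0) ** (1.0 / gamma) for i in range(256)], dtype=np.uint8)
gamma_corrected = cv2.LUT(image, lut)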
Fig 3.1 Some local image processing operations: (a) original image along with its three color (per-
channel) histograms; (b) brightness increased (additive offset, b = 16); (c) contrast increased
(multiplicative gain, a = 1.1); (d) gamma (partially) linearized (γ = 1.2); (e) full histogram equalization; (f)
partial histogram equalization.
Fig 3.1.1 Visualizing image data: (a) original image; (b) cropped portion and scanline plot using an
image inspection tool; (c) grid of numbers; (d) surface plot. For figures (c)–(d), the image was first
converted to grayscale.
Fig 3.1.1.1 (a) source image; (b) extracted foreground object F; (c) alpha matte α shown in grayscale;
(d) new composite C.
3.2 Linear filtering
Fig 3.2 Neighborhood filtering (convolution): The image on the left is convolved with the filter in the middle to yield the image on the right. The light blue pixels indicate the source neighborhood for the light green destination pixel.
3.3 Band-pass and steerable filters
The Sobel and corner operators are simple examples of band-pass and oriented filters. More sophisticated kernels can be created by first smoothing the image with a (unit area) Gaussian filter,
G(x, y; σ) = 1/(2πσ²) exp(−(x² + y²)/(2σ²)),   (3.24)
and then taking the first or second derivatives. Such filters are known collectively as band-pass filters, since they filter out both low and high frequencies. The (undirected) second derivative of a two-dimensional image,
∇²f = ∂²f/∂x² + ∂²f/∂y²,   (3.25)
is known as the Laplacian operator. Blurring an image with a Gaussian and then taking its Laplacian is equivalent to convolving directly with the Laplacian of Gaussian (LoG) filter,
∇²G(x, y; σ) = ((x² + y²)/σ⁴ − 2/σ²) G(x, y; σ).
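A minimal sketch of this band-pass (LoG-style) filtering with OpenCV, using an illustrative σ = 2 and input file name:
import cv2

# Blur with a Gaussian, then take the Laplacian: together this approximates the
# Laplacian of Gaussian filter described by equations (3.24)-(3.25).
gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
smoothed = cv2.GaussianBlur(gray, ksize=(0, 0), sigmaX=2.0)   # G(x, y; sigma)
log = cv2.Laplacian(smoothed, ddepth=cv2.CV_32F)              # Laplacian of the blurred image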
Fig 3.3 (a) original image of Einstein; (b) orientation map computed from the second-order oriented
energy; (c) original image with oriented structures enhanced.
4. Feature detection and matching
Fig 4.1 Image pairs with extracted patches below. Notice how some patches can be localized or matched with higher accuracy than others.
Adaptive non-maximal suppression (ANMS). While most feature detectors simply look for local maxima in the interest function, this can lead to an uneven distribution of feature points across the image, e.g., points will be denser in regions of higher contrast. To mitigate this problem, Brown, Szeliski, and Winder (2005) only detect features that are both local maxima and whose response value is significantly (10%) greater than that of all of their neighbors within a radius r. They devise an efficient way to associate suppression radii with all local maxima by first sorting them by their response strength and then creating a second list sorted by decreasing suppression radius. A qualitative comparison of selecting the top n features versus using ANMS is shown in the figure below.
Measuring repeatability. Given the large number of feature detectors that have been developed in computer vision, how can we decide which ones to use? Schmid, Mohr, and Bauckhage (2000) were the first to propose measuring the repeatability of feature detectors, which they define as the frequency with which keypoints detected in one image are found within (say, ε = 1.5) pixels of the corresponding location in a transformed image. In their paper, they transform their planar images by applying rotations, scale changes, illumination changes, viewpoint changes, and adding noise. They also measure the information content available at each detected feature point, which they define as the entropy of a set of rotationally invariant local grayscale descriptors. Among the techniques they survey, they find that the improved (Gaussian derivative) version of the Harris operator with σd = 1 (scale of the derivative Gaussian) and σi = 2 (scale of the integration Gaussian) works best.
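A minimal NumPy sketch of adaptive non-maximal suppression, assuming points is an (N, 2) array of detected positions and responses the corresponding detector scores; both inputs, the 10% robustness factor, and the number of points kept are illustrative.
import numpy as np

def anms(points, responses, num_keep, robust_factor=0.9):
    # Sort by decreasing response so only stronger points can suppress weaker ones.
    order = np.argsort(-responses)
    points, responses = points[order], responses[order]
    radii = np.full(len(points), np.inf)
    for i in range(1, len(points)):
        # A point is suppressed by neighbours whose response is significantly stronger.
        stronger = responses[:i] * robust_factor > responses[i]
        if np.any(stronger):
            d = np.linalg.norm(points[:i][stronger] - points[i], axis=1)
            radii[i] = d.min()
    # Keep the points with the largest suppression radii (most spatially spread out).
    keep = np.argsort(-radii)[:num_keep]
    return points[keep]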
Fig 4.1 The upper two images show the strongest 250 and 500 interest points, while the lower
two images show the interest points selected with adaptive non-maximal suppression, along with the
corresponding suppression radius r. Note how the latter features have a much more uniform spatial
distribution across the image.
Fig 4.2 The affine warp of each recognized database image onto the scene is shown as a larger
parallelogram
4.3 Edges
While interest points are useful for finding image locations that can be accurately
matched in 2D, edge points are far more plentiful and often carry important semantic
associations. For example, the boundaries of objects, which also correspond to occlusion events
in 3D, are usually delineated by visible contours. Other kinds of edges correspond to shadow
boundaries or crease edges, where surface orientation changes rapidly. Isolated edge points can
also be grouped into longer curves or contours, as well as straight line segments. It is interesting
that even young children have no difficulty in recognizing familiar objects or animals from such
simple line drawings.
The local gradient vector J points in the direction of steepest ascent in the intensity
function. Its magnitude is an indication of the slope or strength of the variation, while its
orientation points in a direction perpendicular to the local contour. Unfortunately, taking image
derivatives accentuates high frequencies and hence amplifies noise, since the proportion of noise
to signal is larger at high frequencies. It is therefore prudent to smooth the image with a low-pass
filter prior to computing the gradient. Because we would like the response of our edge detector to
be independent of orientation, a circularly symmetric smoothing filter is desirable. As we saw in
Section 3.2, the Gaussian is the only separable circularly symmetric filter and so it is used in
most edge detection algorithms.
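A minimal sketch of this Gaussian-smoothed gradient computation with OpenCV; the σ value and file name are illustrative.
import cv2
import numpy as np

# Smooth with a Gaussian before differentiating, then compute the gradient
# magnitude (edge strength) and orientation at every pixel.
gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
smoothed = cv2.GaussianBlur(gray, ksize=(0, 0), sigmaX=1.5)
gx = cv2.Sobel(smoothed, cv2.CV_32F, 1, 0)   # dI/dx
gy = cv2.Sobel(smoothed, cv2.CV_32F, 0, 1)   # dI/dy
magnitude = np.sqrt(gx ** 2 + gy ** 2)       # strength of the local variation
orientation = np.arctan2(gy, gx)             # perpendicular to the local contour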
In fact, it is not strictly necessary to take differences between adjacent levels when
computing the edge field. Think about what a zero crossing in a “generalized” difference of
Gaussians image represents. The finer (smaller kernel) Gaussian is a noise-reduced version of the
original image. The coarser (larger kernel) Gaussian is an estimate of the average intensity over a
larger region. Thus, whenever the DoG image changes sign, this corresponds to the (slightly
blurred) image going from relatively darker to relatively lighter, as compared to the average
intensity in that neighborhood. Once we have computed the sign function S(x), we must find its
zero crossings and convert these into edge elements (edgels). An easy way to detect and
represent zero crossings is to look for adjacent pixel locations xi and xj where the sign changes value, i.e., [S(xi) > 0] ≠ [S(xj) > 0].
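A minimal sketch of this zero-crossing test on a difference-of-Gaussians image, with illustrative σ values and file name:
import cv2
import numpy as np

# Difference of Gaussians, then a search for sign changes between horizontally
# and vertically adjacent pixels (a simple zero-crossing edge detector).
gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
dog = cv2.GaussianBlur(gray, (0, 0), 1.0) - cv2.GaussianBlur(gray, (0, 0), 2.0)
s = dog > 0                                   # the sign function S(x)
zero_cross = np.zeros_like(s, dtype=bool)
zero_cross[:, 1:] |= s[:, 1:] != s[:, :-1]    # sign change between left/right neighbours
zero_cross[1:, :] |= s[1:, :] != s[:-1, :]    # sign change between top/bottom neighbours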
Fig 4.3 The darkness of the edges corresponds to how many human subjects marked an object
boundary at that location.
Fig 4.4 Chain code representation of a grid-aligned linked edge chain. The code is represented as a series of direction codes, e.g., 0 1 0 7 6 5, which can further be compressed using predictive and run-
length coding.
4.5 Lines
While edges and general curves are suitable for describing the contours of natural objects,
the man-made world is full of straight lines. Detecting and matching these lines can be useful in
a variety of applications, including architectural modeling, pose estimation in urban
environments, and the analysis of printed document layouts. In this section, we present some
techniques for extracting piecewise linear descriptions from the curves computed in the previous
section. We begin with some algorithms for approximating a curve as a piecewise-linear
polyline. We then describe the Hough transform, which can be used to group edgels into line
segments even across gaps and occlusions. Finally, we describe how 3D lines with common
vanishing points can be grouped together. These vanishing points can be used to calibrate a camera and to determine its orientation relative to the scene.
RANSAC-based line detection. Another alternative to the Hough transform is the RANdom Sample Consensus (RANSAC) algorithm. In brief, RANSAC
randomly chooses pairs of edges to form a line hypothesis and then tests how many other edges
fall onto this line. (If the edge orientations are accurate enough, a single edge can produce this
hypothesis.) Lines with sufficiently large numbers of inliers (matching edges) are then selected
as the desired line segments. An advantage of RANSAC is that no accumulator array is needed
and so the algorithm can be more space efficient and potentially less prone to the choice of bin
size. The disadvantage is that many more hypotheses may need to be generated and tested than
those obtained by finding peaks in the accumulator array. In general, there is no clear consensus
on which line estimation technique performs best. It is therefore a good idea to think carefully
about the problem at hand and to implement several approaches (successive approximation,
Hough, and RANSAC) to determine the one that works best for your application.
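A minimal NumPy sketch of the RANSAC line-fitting idea described above, assuming points is an (N, 2) array of edgel positions; the iteration count and inlier threshold are illustrative.
import numpy as np

def ransac_line(points, num_iters=500, inlier_thresh=2.0):
    # Repeatedly pick two edgels, hypothesize the line through them, and keep
    # the hypothesis supported by the largest number of inliers.
    best_inliers, best_line = None, None
    rng = np.random.default_rng()
    for _ in range(num_iters):
        p1, p2 = points[rng.choice(len(points), 2, replace=False)]
        direction = p2 - p1
        norm = np.linalg.norm(direction)
        if norm == 0:
            continue
        # Perpendicular distance of every point to the hypothesized line.
        normal = np.array([-direction[1], direction[0]]) / norm
        dist = np.abs((points - p1) @ normal)
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_line = inliers, (p1, p2)
    return best_line, best_inliers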
Once the Hough accumulator space has been populated, peaks can be detected in a
manner similar to that previously discussed for line detection. Given a set of candidate line
segments that voted for a vanishing point, which can optionally be kept as a list at each Hough accumulator cell, a robust least squares fit can then be used to estimate a more accurate location for each vanishing point.
Fig 4.5 Real-world vanishing points: (a) architecture (Sinha, Steedly, Szeliski et al. 2008), (b) furniture © 2008 IEEE, and (c) calibration patterns (Zhang 2000).
5. Image stitching
5.1 Motion models
The simplest possible motion model to use when aligning images is to simply translate and rotate them in 2D. This is exactly the same kind of motion that you would use if you had overlapping photographic prints. It is also the kind of technique favored by David Hockney to create the collages that he calls joiners (Zelnik-Manor and Perona 2007; Nomura, Zhang, and Nayar 2007). Creating such collages, which show visible seams and inconsistencies that add to the artistic effect, is popular on Web sites such as Flickr, where they more commonly go under the name panography. Translation and rotation are also usually adequate motion models to compensate for small camera motions in applications such as photo and video stabilization and merging.
The mapping between two cameras viewing a common plane can be described using a 3×3 homography. Consider the matrix M10 that arises when mapping a pixel in one image to a 3D point and then back onto a second image,
x̃1 ∼ P̃1 P̃0⁻¹ x̃0 = M10 x̃0.
When the last row of the P̃0 matrix is replaced with a plane equation n̂0 · p + c0 and points are assumed to lie on this plane, i.e., their disparity is d0 = 0, we can ignore the last column of M10 and also its last row, since we do not care about the final z-buffer depth. The resulting homography matrix H̃10 (the upper left 3×3 sub-matrix of M10) describes the mapping between pixels in the two images, x̃1 ∼ H̃10 x̃0.
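A minimal sketch of applying such a homography to map one image into the coordinate frame of another using OpenCV; the matrix entries and file names are purely illustrative (in practice H10 comes from feature matching or the plane-induced mapping above).
import cv2
import numpy as np

img0 = cv2.imread("image0.jpg")
img1 = cv2.imread("image1.jpg")
H10 = np.array([[1.0, 0.02, 30.0],
                [0.0, 1.00, 10.0],
                [0.0, 0.00,  1.0]])
# Warp image 0 into image 1's frame: x1 ~ H10 x0 applied to every pixel.
warped0 = cv2.warpPerspective(img0, H10, (img1.shape[1], img1.shape[0]))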
Fig 5.1 Two-dimensional motion models and how they can be used for image stitching.
5.2 Gap closing
When a sequence of images is registered one after another to form a 360° panorama, accumulated registration errors usually prevent the last image from lining up seamlessly with the first. Instead, there will invariably be either a gap or an overlap (Fig 5.2a). We can solve this problem by matching the first image in the sequence with the last one. The difference between the two rotation matrix estimates associated with the repeated first image indicates the amount of misregistration. This error can be distributed evenly across the whole sequence by taking the quotient of the two quaternions associated with these rotations and dividing this "error quaternion" by the number of images in the sequence (assuming relatively constant inter-frame rotations). We can also update the estimated focal length based on the amount of misregistration. To do this, we first convert the error quaternion into a gap angle, θg, and then update the focal length using the equation f′ = f(1 − θg/360°). Fig 5.2a shows the end of a registered image sequence and the first image. There is a big gap between the last image and the first, which are in fact the same image. The gap is 32° because a wrong estimate of the focal length (f = 510) was used. Fig 5.2b shows the registration after closing the gap with the correct focal length (f = 468). Notice that both mosaics show very little visual misregistration (except at the gap), yet the first has been computed using a focal length that has a 9% error.
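A quick numeric check of this focal-length update, using the gap angle and wrong focal length quoted above:
# f' = f * (1 - theta_g / 360)
f, theta_g = 510.0, 32.0
f_corrected = f * (1.0 - theta_g / 360.0)
print(f_corrected)   # ~464.7, close to the correct value of 468 quoted above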
Fig 5.2 Gap closing (Szeliski and Shum 1997) © 1997 ACM: (a) A gap is visible when the focal length is wrong (f = 510). (b) No gap is visible for the correct focal length (f = 468).
Fig 5.3 Video stitching the background scene to create a single sprite image that can be
transmitted and used to re-create the background in each frame
5.4 Recognizing panoramas
If the user takes images in sequence so that each image overlaps its predecessor and also
specifies the first and last images to be stitched, bundle adjustment combined with the process of
topology inference can be used to automatically assemble a panorama (Sawhney and Kumar
1999). However, users often jump around when taking panoramas, e.g., they may start a new row
on top of a previous one, jump back to take a repeat shot, or create 360° panoramas where end-
to-end overlaps need to be discovered. Furthermore, the ability to discover multiple panoramas
taken by a user over an extended period of time can be a big convenience. To recognize
panoramas, Brown and Lowe (2007) first find all pairwise image overlaps using a feature-based
method and then find connected components in the overlap graph to “recognize” individual
panoramas. The feature-based matching stage first extracts scale invariant feature transform
(SIFT) feature locations and feature descriptors (Lowe 2004) from all the input images and
places them in an indexing structure, as described in Section 4.1.3. For each image pair under
consideration, the nearest matching neighbor is found for each feature in the first image, using
the indexing structure to rapidly find candidates and then comparing feature descriptors to find
the best match. RANSAC is used to find a set of inlier matches; pairs of matches are used to
hypothesize similarity motion models that are then used to count the number of inliers. (A more
recent RANSAC algorithm tailored specifically for rotational panoramas is described by Brown, Hartley, and Nistér (2007).) In practice, the most difficult part of getting a fully automated
stitching algorithm to work is deciding which pairs of images actually correspond to the same
parts of the scene. Repeated structures such as windows can lead to false matches when using a
feature-based approach. One way to mitigate this problem is to perform a direct pixel-based
comparison between the registered images to determine if they actually are different views of the
same scene. Unfortunately, this heuristic may fail if there are moving objects in the scene.
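A minimal sketch of the pairwise matching stage described above (SIFT features, ratio-test matching, and RANSAC inlier counting) using OpenCV; the file names and thresholds are illustrative, and SIFT_create requires a reasonably recent OpenCV build.
import cv2
import numpy as np

img1 = cv2.imread("photo1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("photo2.jpg", cv2.IMREAD_GRAYSCALE)

# Extract SIFT keypoints and descriptors from both images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only matches that pass the ratio test.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# RANSAC fits a motion model and counts geometrically consistent inliers;
# a large inlier count suggests the two images overlap.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
num_inliers = int(mask.sum()) if mask is not None else 0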
Fig 5.4 (a) composite with parallax; (b) after a single deghosting step (patch size 32); (c) after multiple
steps
Fig 5.6 (a) set of five input images along with user-selected keypoints; (b) the complete set of
keypoints and curves; (c) three meshes—the original, adapted after 13 keypoints, and after an additional
99 keypoints; (d) the partition of the image into separately animatable regions.
6. Recognition
6.1 Face detection
Appearance-based approaches scan over small overlapping rectangular patches of the
image searching for likely face candidates, which can then be refined using a cascade of more
expensive but selective detection algorithms. In order to deal with scale variation, the image is
usually converted into a sub-octave pyramid and a separate scan is performed on each level.
Most appearance-based approaches today rely heavily on training classifiers using sets of labeled
face and non-face patches. Sung and Poggio (1998) and Rowley, Baluja, and Kanade (1998a)
present two of the earliest appearance-based face detectors and introduce a number of
innovations that are widely used in later work by others. To start with, both systems collect a set
of labeled face patches as well as a set of patches taken from images that are known not to
contain faces, such as aerial images or vegetation. The collected face images are augmented by
artificially mirroring, rotating, scaling, and translating the images by small amounts to make the
face detectors less sensitive to such effects. After an initial set of training images has been
collected, some optional pre-processing can be performed, such as subtracting an average
gradient (linear function) from the image to compensate for global shading effects and using
histogram equalization to compensate for varying camera contrast.
Fig 6.1 (a) artificially mirroring, rotating, scaling, and translating training images for greater variability;
(b) using images without faces (looking up at a tree) to generate non-face examples; (c) pre-processing
the patches by subtracting a best fit linear function (constant gradient) and histogram equalizing
6.2 Neural networks in detection
Instead of first clustering the data and computing Mahalanobis distances to the cluster centers, Rowley, Baluja, and Kanade apply a neural network (MLP) directly to the 20×20 pixel patches of gray-level intensities, using a variety of differently sized hand-crafted "receptive fields" to capture both large-scale and smaller-scale structure. The resulting neural network directly outputs the likelihood of a face at the center of every overlapping patch in a multi-resolution pyramid. Since several overlapping patches (in both space and resolution) may fire near a face, an additional merging network is used to merge overlapping detections. The authors also experiment with training several networks and merging their outputs, and show sample results from their face detector. To make the detector run faster, a separate network operating on 30×30
from their face detector. To make the detector run faster, a separate network operating on 30×30
patches is trained to detect both faces and faces shifted by ±5 pixels. This network is evaluated at
every 10th pixel in the image (horizontally and vertically) and the results of this “coarse” or
“sloppy” detector are used to select regions on which to run the slower single-pixel overlap
technique. To deal with in-plane rotations of faces, Rowley, Baluja, and Kanade (1998b) train a
router network to estimate likely rotation angles from input patches and then apply the estimated
rotation to each patch before running the result through their upright face detector.
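The scanning strategy described above can be summarized as a schematic sketch: extract overlapping fixed-size patches from every level of an image pyramid and hand each one to the trained classifier. The patch size, step, and scale factor below are illustrative, not the exact values used by Rowley, Baluja, and Kanade.
import cv2

def pyramid_patches(image, patch_size=20, step=10, scale=0.8, min_size=40):
    # Yield overlapping patches from successively smaller versions of the image,
    # as a sliding-window face detector would scan them.
    level = image
    while min(level.shape[:2]) >= min_size:
        for y in range(0, level.shape[0] - patch_size, step):
            for x in range(0, level.shape[1] - patch_size, step):
                yield level[y:y + patch_size, x:x + patch_size]
        level = cv2.resize(level, None, fx=scale, fy=scale)

# Each patch would then be pre-processed (lighting correction, histogram
# equalization) and passed to the trained classifier.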
Fig 6.1 Overlapping patches are extracted from different levels of a pyramid and then pre-processed. A three-layer neural network is then used to detect likely face locations.
6.3 Location recognition
A particularly challenging task is location recognition from Internet photo collections taken under a wide variety of viewing conditions. The main difficulty in location recognition is in dealing with the extremely large community (user-generated) photo collections on Web sites such as Flickr or commercially captured databases. The prevalence of commonly appearing elements such as foliage, signs, and common architectural elements further complicates the task. Results have been demonstrated both on community photo collections and on denser commercially acquired datasets. In the latter case, the overlap between adjacent database images can be used to
verify and prune potential matches using “temporal” filtering, i.e., requiring the query image to
match nearby overlapping database images before accepting the match. Another variant on
location recognition is the automatic discovery of landmarks, i.e., frequently photographed
objects and locations. Simon, Snavely, and Seitz (2007) show how these kinds of objects can be
discovered simply by analyzing the matching graph constructed as part of the 3D modeling
process in Photo Tourism. More recent work has extended this approach to larger data sets using
efficient clustering techniques as well as combining meta-data such as GPS and textual tags with visual search. It is now even possible to automatically associate object tags with images based on
their co-occurrence in multiple loosely tagged images.
Fig 6.2 Automatic mining, annotation, and localization of community photo collections. This figure does not show the textual annotations.
7. IMPLEMENTATION
7.1 Object detection
If we are given an image to analyze, such as the group portrait, we could try to apply a
recognition algorithm to every possible sub-window in this image. Such algorithms are likely to
be both slow and error-prone. Instead, it is more effective to construct special purpose detectors,
whose job it is to rapidly find likely regions where particular objects might occur. We begin this
section with face detectors, which are some of the more successful examples of recognition. For
example, such algorithms are built into most of today’s digital cameras to enhance auto-focus
and into video conferencing systems to control pan-tilt heads. We then look at pedestrian
detectors, as an example of more general methods for object detection. Such detectors can be
used in automotive safety applications, e.g., detecting pedestrians and other cars from moving
vehicles.
Of all the visual tasks we might ask a computer to perform, analyzing a scene and recognizing all of its constituent objects remains the most challenging. While computers excel at accurately reconstructing the 3D shape of a scene from images taken from different views, they cannot name all the objects and animals present in a picture, even at the level of a two-year-old
child. There is not even any consensus among researchers on when this level of performance
might be achieved.
The key to the success of boosting is the method for incrementally selecting the weak learners and for re-weighting the training examples after each stage. The AdaBoost (Adaptive Boosting) algorithm does this by re-weighting each sample as a function of whether it is correctly classified at each stage, and using the stage-wise average classification error to determine the final weightings αj among the weak classifiers. While the resulting classifier is extremely fast in practice, the training time can be quite slow (on the order of weeks), because of the large number of feature (difference of rectangle) hypotheses that need to be examined at each stage. To further increase the speed of the detector, it is possible to create a cascade of classifiers, where each classifier uses a small number of tests (say, a two-term AdaBoost classifier) to reject a large fraction of non-faces while trying to pass through all potential face candidates.
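OpenCV ships pre-trained cascades of this Viola-Jones style; a minimal face-detection sketch using the bundled frontal-face model is shown below (the input file name is illustrative).
import cv2

# Load the pre-trained frontal-face cascade shipped with the opencv-python package.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
img = cv2.imread("group_portrait.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Scan the image at multiple scales and reject non-face windows early.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected_faces.jpg", img)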
7.2 Implementation of object detection code
# Final stage of the detection script: loop over the detections kept by NMS,
# draw them, and display/save the result (earlier parts of the script define
# image, boxes, class_ids, confidences, indices and draw_prediction).
for i in indices:
    box = boxes[i]
    x = box[0]
    y = box[1]
    w = box[2]
    h = box[3]
    draw_prediction(image, class_ids[i], confidences[i], round(x), round(y),
                    round(x + w), round(y + h))
cv2.imshow("object detection", image)
cv2.waitKey()
cv2.imwrite("object-detection.jpg", image)
cv2.destroyAllWindows()
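The snippet above is only the final stage of the script. A hedged sketch of the earlier stages it assumes (loading a YOLO-style network with OpenCV's dnn module, running a forward pass, collecting boxes, and applying non-maximum suppression) might look like the following; the weight/config file names, thresholds, and the draw_prediction helper are illustrative stand-ins rather than the exact code used in the report.
import cv2
import numpy as np

def draw_prediction(img, class_id, confidence, x1, y1, x2, y2):
    # Minimal stand-in for the report's draw_prediction helper.
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(img, "%d: %.2f" % (class_id, confidence), (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")   # illustrative file names
image = cv2.imread("input.jpg")
height, width = image.shape[:2]

# Prepare the input blob and run a forward pass through the network.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())

# Collect candidate boxes, confidences, and class ids above a score threshold.
boxes, confidences, class_ids = [], [], []
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:
            cx, cy, w, h = detection[0:4] * np.array([width, height, width, height])
            boxes.append([int(cx - w / 2), int(cy - h / 2), int(w), int(h)])
            confidences.append(confidence)
            class_ids.append(class_id)

# Non-maximum suppression keeps only the strongest of the overlapping detections;
# the surviving indices feed the drawing loop shown above.
indices = np.array(cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)).flatten()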
8.2 Purpose
As humans, we perceive the three-dimensional structure of the world around us with
apparent ease. Think of how vivid the three-dimensional percept is when you look at a vase of
flowers sitting on the table next to you. You can tell the shape and translucency of each petal
through the subtle patterns of light and shading that play across its surface and effortlessly
segment each flower from the background of the scene. Looking at a framed group portrait, you
can easily count (and name) all of the people in the picture and even guess at their emotions from
their facial appearance. Perceptual psychologists have spent decades trying to understand how
the visual system works and, even though they can devise optical illusions to tease apart some
of its principles, a complete solution to this puzzle remains elusive.
8.2.1 Disadvantages
• Necessity of specialists: there is a great need for specialists in machine learning and artificial intelligence, professionals who know how these systems work, can take full advantage of computer vision, and can repair the devices when necessary. Although there are many job opportunities for graduates with a master's degree in artificial intelligence, companies are still waiting for such specialists.
• Spoiling: eliminating the human factor may be good in some cases, but when a machine or device fails, it does not announce or anticipate the problem, whereas a human operator can often give advance warning.
• Failing in image processing: when a device fails because of a virus or other software issues, it is highly probable that computer vision and image processing will fail as well. If the problem is not solved, the functions of the device can be lost, which can freeze an entire production line in the case of warehouses.
Fig 8.4 (a) some street scenes and their corresponding labels (magenta = buildings, red =
cars, green = trees, blue = road); (b) some office scenes (red = computer screen, green =
keyboard, blue = mouse); (c) learned contextual models built from these labeled scenes. The top
row shows a sample label image and the distribution of the objects relative to the center red (car
or screen) object. The bottom rows show the distributions of parts that make up each object.
Applications range from tasks such as industrial machine vision systems which, say,
inspect bottles speeding by on a production line, to research into artificial intelligence and
computers or robots that can comprehend the world around them. The computer vision and
machine vision fields have significant overlap. Computer vision covers the core technology of
automated image analysis which is used in many fields. Machine vision usually refers to a
process of combining automated image analysis with other methods and technologies to provide
automated inspection and robot guidance in industrial applications. In many computer-vision
applications, the computers are pre-programmed to solve a particular task, but methods based on
learning are now becoming increasingly common.
9.1 Conclusion
The goal of computer vision is to understand the content of digital images. Typically, this involves developing methods that attempt to reproduce the capability of human vision.
The research results published so far have formed a foundation for further investigation of the ability of computer vision systems to provide quantitative information that is unobtainable subjectively, leading to the eventual replacement of human graders.
It is clear that computer vision technology will continue to be a very useful tool in tackling a wide variety of challenges in meat classification and quality prediction. Good prediction and classification results have been obtained on a large number of occasions using simple, rapid, and affordable technology, even as new challenges are encountered.
10. BIBLIOGRAPHY
1. https://machinelearningmastery.com/what-is-computer-vision/
2. https://en.wikipedia.org/wiki/Computer_vision
3. https://www.pcmag.com/news/what-is-computer-vision
4. https://www.sciencedaily.com/terms/computer_vision.htm
5. https://www.analyticsinsight.net/state-computer-vision-now-future/
6. https://www.innoplexus.com/blog/understanding-the-computer-vision-technology/