
Object Detection and Identification for Blind People

in Video Scene
Hanen Jabnoun¹, Faouzi Benzarti¹, Hamid Amiri¹
¹ Signal, Images et Technologies de l'Information (LR-SITI-ENIT)
Université de Tunis El Manar, École Nationale d'Ingénieurs de Tunis
1002, Tunis Le Belvédère, Tunisie
[email protected] [email protected] [email protected]

Abstract- Vision is one of the most essential human senses, and it plays the most important role in human perception of the surrounding environment. Hence, thousands of papers have been published on these subjects, proposing a variety of computer vision products and services and developing new electronic aids for the blind. This paper introduces a proposed system that restores a central function of the visual system: the identification of surrounding objects. The method is based on the local features extraction concept. The simulation results using the SIFT algorithm and keypoints matching showed good accuracy for detecting objects. Thus, our contribution is to present the idea of a visual substitution system based on features extraction and matching to recognize and locate objects in images.

Keywords- video processing; pattern recognition; SIFT; keypoints matching; visual substitution system.

I. INTRODUCTION

According to the World Health Organization, there are approximately 285 million people with visual impairments; 39 million of them are blind and 246 million have decreased visual acuity. Almost 90% of the visually impaired live in low-income countries. In this context, Tunisia has identified 30,000 people with visual impairments, 13.3% of whom are blind.
These visual impairments have severe consequences on certain capabilities related to visual function:
- Daily living activities (which require vision at a medium distance)
- Communication, reading and writing (which require vision at close and average distance)
- Evaluation of space and displacement (which require far vision)
- The pursuit of any activity requiring prolonged visual attention.
In the computer vision community, developing visual aids for handicapped persons is one of the most active research projects. Mobility aids are intended to describe the environment close to the person with an appreciation of the surrounding objects. These aids are essential for fine navigation in an environment described in a coordinate system relative to the user. In this paper, we present an overview of vision substitution modalities [1-12] and their functionalities. Then, we introduce our proposed system and the experimental tests.

II. OVERVIEW

Related works show that visual substitution devices accept input from the user's surroundings, decipher it to extract information about entities in the user's environment, and then transmit that information to the subject via auditory or tactile means, or some combination of the two.
Among the various technologies designed for blind people, the majority are aids for mobility and obstacle detection [5, 8]. They are based on rules for converting images into tactile or auditory sensory substitution stimuli. These systems are efficient for mobility and localization of objects, although sometimes with a lower precision. However, one of the greatest difficulties of blind people remains the identification of their environment [6]. Indeed, these systems can only be used to recognize simple patterns and cannot be used as tools of substitution in natural environments. Also, they do not identify objects (e.g. whether an item is a table or a chair), and in some cases they detect small objects late. In addition, some of them require additional auditory information, and others require a sufficiently long period for learning and testing.
Among the problems in object identification, we note the variation of objects' appearance under different conditions: the change of viewpoint, the change of illumination and the change of size. There is also intra-class variability (e.g. there are many types of chairs) and inter-class similarity (e.g. television and computer).
For this reason, we are interested in the evaluation of a fast and robust computer vision algorithm to recognize and locate objects in a video scene. Thus, it is important to design a system based on the recognition and detection of objects that meets the major challenges of the blind in three main categories of needs: displacement, orientation and object identification.

III. OBJECT DETECTION BASED ON FEATURES EXTRACTION

Object recognition is a classical problem in computer vision: the task of determining whether the image data contains a specific object. General object recognition approaches exploit features extraction. The features that have received the most attention in recent years are local features. The main idea is to focus on the areas containing the most discriminative information.

978-1-4673-8709-5/15/$31.00 ©2015 IEEE 363


A. Features extraction
The main purpose of using features instead of raw pixel values as the input to a learning algorithm is to reduce the in-class variability and increase the out-of-class variability compared to the raw input data, thus making classification easier [9]. Visual substitution systems generally exploit a single camera to capture image data. Recognition is then performed based on various features extracted from that data. Features extraction is the process by which certain features of interest within an image are detected and represented for further processing. It is a critical step in most computer vision and image processing solutions because it marks the transition from pictorial to non-pictorial (alphanumerical, usually quantitative) data representation [13]. The types of features that can be extracted from an image depend on the type of image, the level of granularity desired, and the context of the application. Once the features have been extracted, their representation depends on the technique used.
The features extraction process should be precise, so that the same features are extracted from two images showing the same object [12]. The major algorithms used in recent research for features extraction are the Scale Invariant Feature Transform (SIFT) [13] and the Speeded Up Robust Features (SURF) [14] algorithms.

B. Features descriptors
Extracted features represent the interesting points found in the image, used to compare it with other images. Descriptors are used to describe these features. They are generally based around points of interest in the image and are often associated with a keypoint detector [10]. Descriptors can be global, local or semi-local:
- Global image descriptors: features describing the overall image, usually based on color indices; the most famous global color descriptor is the color histogram.
- Local image descriptors: local features are the ones that have received the most attention in recent years. The main idea is to focus on the areas containing the most discriminative information.
- Semi-local image descriptors: most shape descriptors fall into this category. These descriptors are based on extracting accurate contours of shapes in the image or in a region of interest. In this case, image segmentation is generally useful as a preprocessing step.

C. SIFT keypoints extraction
Several methods for features localization and description have been proposed in the literature. In this paper, we use the local features extraction algorithm SIFT. The SIFT (Scale Invariant Feature Transform) local descriptor proposed by Lowe [13] has been one of the most widely used descriptors. It operates in four major stages to detect and describe local features, or keypoints, in an image:
1. Detect the extrema in scale space
2. Localize the keypoints at sub-pixel precision and filter them
3. Assign the orientations to the keypoints
4. Compute the keypoint descriptors
In fact, SIFT was shown to perform better than other local descriptors. The SIFT descriptor is based on the idea of using the local gradient patch around the keypoint to build a representation for the point.
Different levels of the Gaussian-filtered image are subtracted to build the difference-of-Gaussian image for further calculation; the extrema are then found in the DoG images of each octave. Each octave is computed by convolution between down-sampled images and different Gaussian kernels. The difference of Gaussian is obtained from the difference of two adjacent scales that are separated by a factor of k [14]:

D(X, σ) = (G(X, kσ) - G(X, σ)) * I(X)    (1)

Where:
I: the input image.
X: the point X(x, y).
σ: the scale.
G(X, σ): the variable-scale Gaussian.

The DoG interest regions are defined as locations that are simultaneously extrema in the image plane and along the scale coordinate of the D(X, σ) function (1). Such points are found by comparing the D(X, σ) value of each point with its 8-neighborhood on the same scale level, and with the 9 closest neighbors on each of the two adjacent levels.
For each feature point in the image, a square patch around it is appropriately scaled and rotated. Then, 8-bin orientation histograms over a 4x4 grid of sub-regions are generated using the gradient magnitudes and orientations in the patch. Finally, the outputs of the 16 orientation histograms are concatenated to obtain the final 128-element feature descriptor.

IV. PROPOSED APPROACH FOR VISUAL SUBSTITUTION SYSTEM

For object recognition in videos, a lot of work has been done using feature extraction and detection along the video, or by extraction of visually salient regions which are supposed to contain the object of interest. Feature matching using invariant features has gained significant importance due to its application in various recognition problems. Such techniques enable matching images irrespective of various geometric and photometric transformations between them.

A. Proposed architecture
Our proposed visual substitution system is based on the identification of objects around the blind person. We propose a system that recognizes and locates objects in the video (Fig. 1). This system should find the characteristics of objects that are invariant to viewpoint changes, provide the recognition and reduce the complexity of detection. We propose a method based on keypoints extraction and matching in video.
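As an aside, the difference-of-Gaussian of equation (1) in Section III.C can be illustrated with a minimal, dependency-free sketch. It operates on a 1-D signal for brevity, and the function names and parameter values are illustrative rather than taken from the paper's implementation:

```python
import math

def gaussian_kernel(sigma, radius):
    # Sampled 1-D Gaussian G(x, sigma), normalised so the weights sum to 1.
    vals = [math.exp(-(x * x) / (2.0 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def convolve(signal, kernel):
    # Same-size convolution with edge replication at the borders.
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def difference_of_gaussian(signal, sigma, k=math.sqrt(2), radius=8):
    # Equation (1): D(X, sigma) = (G(X, k*sigma) - G(X, sigma)) * I(X),
    # i.e. blur the input at two adjacent scales and subtract.
    blur_sigma = convolve(signal, gaussian_kernel(sigma, radius))
    blur_ksigma = convolve(signal, gaussian_kernel(k * sigma, radius))
    return [hi - lo for hi, lo in zip(blur_ksigma, blur_sigma)]

# A step edge gives a strong DoG response near the discontinuity and a
# near-zero response in flat regions, which is why DoG extrema make good
# candidate keypoints.
step = [0.0] * 16 + [1.0] * 16
dog = difference_of_gaussian(step, sigma=1.6)
```

In the full 2-D algorithm, each DoG value would then be compared against its 8 neighbors on the same scale level and the 9 closest neighbors on each adjacent level, as described above.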

364 2015 15th International Conference on Intelligent Systems Design and Applications (ISDA)
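The 4x4 grid of 8-bin orientation histograms described in Section III.C can likewise be sketched in simplified form. The Gaussian weighting and trilinear interpolation of real SIFT are omitted here; the sketch only shows how 16 histograms concatenate into a 128-element vector:

```python
import math

def sift_like_descriptor(patch):
    # patch: a 16x16 grid of intensities around a keypoint (already
    # scaled and rotated).  Returns 128 values: a 4x4 grid of cells,
    # each holding an 8-bin histogram of gradient orientations
    # weighted by gradient magnitude.
    n = len(patch)                       # 16
    cell = n // 4                        # 4 pixels per cell side
    hist = [[[0.0] * 8 for _ in range(4)] for _ in range(4)]
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2.0 * math.pi)
            b = int(angle / (2.0 * math.pi) * 8) % 8
            hist[y // cell][x // cell][b] += mag
    desc = [v for row in hist for c in row for v in c]
    # Normalise to unit length for some illumination invariance.
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]

# A horizontal intensity ramp has all gradients pointing along +x,
# so every cell puts its weight into orientation bin 0.
ramp = [[float(x) for x in range(16)] for _ in range(16)]
desc = sift_like_descriptor(ramp)
```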
A comparison between the query frame and the database objects is made to detect objects in each frame. For each object detected, an audio file containing the information about it is activated. Hence, object detection and identification are simultaneously addressed.

Fig. 1. Flowchart of the proposed system (SIFT keypoints detection from frames; translation of detected objects to voice; output data)

B. Algorithm
We build the bag of keypoints of objects with SIFT and we calculate the local dissimilarity map between frames. The algorithm demonstrating the RVLDM used in the video analysis proceeds as follows:
(1) Obtain the set of keypoints of the objects:
  a. Select a large set of images of daily objects.
  b. Extract the SIFT feature points of all the images within the set and obtain the SIFT descriptor for each feature point extracted from each image.
(2) Obtain the keypoints descriptors for the first video frame:
  a. Extract the SIFT feature points of the given image.
  b. Acquire the SIFT descriptor for each feature point.
  c. Match the frame keypoints with those of the objects and identify the detected objects.
(3) For the next frame:
  a. If it contains the same objects, they will not be detected again.
  b. New objects will be detected and identified.
  c. Another method is used in this step to identify similar and dissimilar frames for further treatment.
(4) For each object detected in the video, an audio file is launched to notify the blind user of the identity of the object.
This algorithm highlights the use of SIFT for keypoints detection and matching to ensure the identification of the objects.

V. RESULTS

A. Testing data
Available video databases generally consist of videos of traffic vehicles or of face detection and tracking. In our work we need videos of daily life scenes containing images of all the objects. We proceed with four video sequences (Fig. 2); each of them contains a number of daily life objects. The videos do not have the same duration: Table I shows that each one contains a different number of frames and objects, so they will have different processing times.

Fig. 2. Example of selected frames from video scenes

TABLE I. DATA SELECTED FOR TESTS

Video    Number of frames    Number of objects
1        1410                6
2        1022                8
3        786                 6
4        685                 4

B. Object detection and identification in video
The first step is to load a video scene containing objects; then we perform the feature extraction. We use the difference-of-Gaussian feature detector introduced previously with the SIFT descriptor. The features we find are described in a way which makes them invariant to size changes, rotation and position. These are quite powerful features and are used in a variety of tasks. We use a standard Java implementation of SIFT [15]. The SIFT descriptor is a 128-dimensional description of a patch of pixels around the keypoint.
In the matching step, the basic matcher finds many matches, many of which are clearly incorrect.
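A brute-force nearest-neighbour matcher with Lowe's ratio test is one common way to discard such ambiguous matches. The sketch below is a dependency-free illustration with toy 4-D descriptors (real SIFT descriptors are 128-D); it is not the paper's Java implementation:

```python
import math

def euclidean(d1, d2):
    # Distance between two descriptor vectors (128-D for real SIFT).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def match_descriptors(frame_desc, object_desc, ratio=0.8):
    # For each frame descriptor, find its two nearest object descriptors
    # and keep the match only when the best is clearly better than the
    # runner-up (Lowe's ratio test).
    matches = []
    for i, d in enumerate(frame_desc):
        dists = sorted((euclidean(d, o), j) for j, o in enumerate(object_desc))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))   # (frame index, object index)
    return matches

# Toy data: frame keypoint 0 clearly matches object 0, and frame
# keypoint 1 clearly matches object 1.
objects = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0], [5.0, 0.0, 0.0, 0.0]]
frame = [[0.1, 0.0, 0.0, 0.0], [7.0, 7.0, 7.0, 7.0]]
matches = match_descriptors(frame, objects)
```

A descriptor lying halfway between two object descriptors fails the ratio test and is dropped, which is exactly the behaviour wanted for ambiguous keypoints.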


Fig. 3. Object detection in some video scenes

Fig. 4. Detection of a medical box

Fig. 5. Medical box detection in video with low luminosity

Fig. 6. Medical box detection in video with high luminosity

TABLE II. NUMBER OF TRUE MATCHES AND FALSE MATCHES DEPENDING ON THE SCALE NUMBER

Number of scales    True positive (%)    False positive (%)
3                   35                   65
5                   95                   5

An algorithm called Random Sample Consensus (RANSAC) [16] is used to fit a geometric model, an affine transform, to the initial set of matches. This is achieved by iteratively selecting a random set of matches, learning a model from this random set, and then testing the remaining matches against the learnt model. We can take advantage of this by transforming the bounding box of the object with the transform estimated by the affine model. Therefore, we can draw a polygon around the estimated location of the object within the frame.
Table II shows that the number of true positives depends on the number of scales used in the SIFT algorithm. In fact, 5 scales give a better matching rate than 3, and a number above 5 does not give better results while taking more time. So, in order to have strong matches, we work with a number of scales equal to 5.
In Fig. 3 we tried to find some objects, and we note that we detect all the objects in this video scene. We also tried to detect a medical box, because we know that identifying medicines is a delicate task for blind people. In Fig. 4 the box is well detected, and a short description of the medicine in an audio file notifies the blind user of what he holds in his hand.
In a video scene we can have different levels of illumination; even within the same video, the illumination can change. It was therefore important to detect objects in a video scene with low luminosity (Fig. 5) and in another with high luminosity (Fig. 6). The number of true matches decreases in the video with low luminosity; however, the object is still well detected. So we can conclude that SIFT is invariant to the change of luminosity in video, and the object can be detected and identified.

C. Discussions
The challenge in comparing keypoints is to establish the matching between keypoints from frames and those from target objects. We obtain a high percentage of detected objects, but we also tried to identify the reasons behind some failure cases.
The first cause of non-detection of an object was the quality of the images. For example, in Fig. 7, the pen was not detected because we had added some noise to the video scene.
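The RANSAC loop described above (sample a few matches, fit an affine model, test the remaining matches against it) can be sketched as follows. The synthetic data, threshold and iteration count are illustrative, not the paper's settings:

```python
import math
import random

def solve3(m, v):
    # Gaussian elimination with partial pivoting for a 3x3 system;
    # returns None when the sample is degenerate (e.g. collinear points).
    a = [row[:] + [val] for row, val in zip(m, v)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[piv][col]) < 1e-9:
            return None
        a[col], a[piv] = a[piv], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def fit_affine(sample):
    # Exact affine fit from 3 correspondences ((x, y), (x', y')).
    m = [[s[0][0], s[0][1], 1.0] for s in sample]
    row_x = solve3(m, [s[1][0] for s in sample])
    row_y = solve3(m, [s[1][1] for s in sample])
    if row_x is None or row_y is None:
        return None
    return row_x, row_y          # the two rows of the 2x3 affine matrix

def apply_affine(model, p):
    (a, b, tx), (c, d, ty) = model
    return (a * p[0] + b * p[1] + tx, c * p[0] + d * p[1] + ty)

def ransac_affine(matches, iters=200, thresh=3.0, seed=0):
    # Keep the model that explains the largest number of matches.
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        model = fit_affine(rng.sample(matches, 3))
        if model is None:
            continue
        inliers = [m for m in matches
                   if math.dist(apply_affine(model, m[0]), m[1]) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Synthetic data: 12 matches following a known affine transform plus
# 5 gross outliers, as produced by a noisy keypoint matcher.
true = ((1.1, 0.1, 5.0), (-0.2, 0.9, -3.0))
points = [(float(i), float(j)) for i in range(4) for j in range(3)]
good = [(p, apply_affine(true, p)) for p in points]
bad = [(p, (p[0] + 60.0, p[1] - 40.0)) for p in points[:5]]
model, inliers = ransac_affine(good + bad)
```

The recovered model can then be used to transform the object's bounding box and draw the polygon around its estimated location, as in the text above.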

Fig. 7. Failure of detection caused by the quality of the image

Then, in Fig. 8, the object in the target frame is very small compared to its original size. We know that SIFT is invariant to scale transformation, but when the object contains some texture and is very small in the target frame, the number of keypoints is not sufficient for matching the object to the frame, and we always obtain false matches.

Fig. 8. Failure of detection caused by the size of the target object

Furthermore, when the recorded video is very fast, SIFT does not have the time to detect all the keypoints and we generally obtain false matches, so the object is not well detected (Fig. 9). In some works the video has to be slowed down, but in our case we just have to move slowly while recording the scene.

Fig. 9. Failure of detection caused by the high speed of the video scene

VI. CONCLUSION AND FUTURE WORKS

In this paper, we present a visual substitution system for blind people based on object recognition in video scenes. This system uses SIFT keypoints extraction and features matching for object identification. We devote the experimental part to testing the application in order to detect objects in video scenes under different conditions.
At this stage of the work, we address the recognition of each object in the scene as an individual task; we do not consider the relationships between objects. Thus, in future works, we will consider these relationships for scene understanding, or for detecting everything that belongs to a given place or location. Finally, in order to help blind people benefit from new technologies, a mobile application could be the best solution.

REFERENCES

[1] Durette, B., Louveton, N., Alleysson, D., and Hérault, J., 2008. Visuo-auditory sensory substitution for mobility assistance: testing The VIBE. In Workshop on Computer Vision Applications for the Visually Impaired, Marseille, France.
[2] Hernández, A. F. R. et al., 2009. Computer Solutions on Sensory Substitution for Sensory Disabled People. In Proceedings of the 8th WSEAS International Conference on Computational Intelligence, Man-machine Systems and Cybernetics, pp. 134-138.
[3] Tang, H., and Beebe, D. J., 2006. An oral tactile interface for blind navigation. IEEE Trans Neural Syst Rehabil Eng, pp. 116-123.
[4] Auvray, M., Hanneton, S., and O'Regan, J. K., 2007. Learning to perceive with a visuo-auditory substitution system: Localisation and object recognition with 'The vOICe'. Perception, pp. 416-430.
[5] Kammoun, S. et al., 2012. Navigation and space perception assistance for the visually impaired: The NAVIG project. In IRBM, Numéro spécial ANR TECSAN, 33(2), pp. 182-189.
[6] Brian Katz, F. G. et al., 2012. NAVIG: Guidance system for the visually impaired using virtual augmented reality. In Technology and Disability, pp. 163-178.
[7] Abboud, S., Hanassy, S., Levy-Tzedek, S., Maidenbaum, S., and Amedi, A., 2014. EyeMusic: Introducing a "visual" colorful experience for the blind using auditory sensory substitution. In Restorative Neurology and Neuroscience.
[8] Martínez, B. D. C., Vergara-Villegas, O. O., Sánchez, V. G. C., De Jesús Ochoa Domínguez, H., and Máynez, L. O., 2011. Visual Perception Substitution by the Auditory Sense. In Beniamino Murgante, Osvaldo Gervasi, Andrés Iglesias, David Taniar, and Bernady O. Apduhan, editors, ICCSA (2), volume 6783 of Lecture Notes in Computer Science, pp. 522-533. Springer.
[9] Jabnoun, H., Benzarti, F., and Amiri, H., 2014. Visual substitution system for blind people based on SIFT description. In International Conference of Soft Computing and Pattern Recognition, Tunisia, pp. 300-305.
[10] Renier, L. et al., 2005. The Ponzo illusion with auditory substitution of vision in sighted and early-blind subjects. In Perception.
[11] Zhang, J., Ong, S. K., and Nee, A. Y. C., 2009. Design and Development of a Navigation Assistance System for Visually Impaired Individuals. In Proceedings of the 3rd International Convention on Rehabilitation Engineering & Assistive Technology, i-CREATe '09, New York, NY, USA.
[12] Jafri, R., Ali, S. A., Arabnia, H. R., and Fatima, S., 2013. Computer vision-based object recognition for the visually impaired in an indoors environment: a survey. In The Visual Computer, Springer Berlin Heidelberg, pp. 1-26.
[13] Lowe, D. G., 2004. Distinctive Image Features from Scale-Invariant Keypoints. In Int. J. Comput. Vision, 60(2), pp. 91-110.
[14] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L., 2008. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), pp. 346-359.
[15] Hare, J. S., Samangooei, S., and Dupplaw, D. P., 2011. OpenIMAJ and ImageTerrier: Java libraries and tools for scalable multimedia analysis and indexing of images. In Proceedings of the 19th ACM International Conference on Multimedia (MM '11), ACM, New York, NY, USA, pp. 691-694. DOI=10.1145/2072298.2072421.
[16] Chum, O., and Matas, J., 2005. Matching with PROSAC: progressive sample consensus. In Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference on, Vol. 1. IEEE.
