Object Detection in Video Scene
Hanen Jabnoun¹, Faouzi Benzarti¹, Hamid Amiri¹
¹ LR-11-ES17 Signal, Images et Technologies de l'Information (LR-SITI-ENIT)
Université de Tunis El Manar, École Nationale d'Ingénieurs de Tunis
1002, Tunis Le Belvédère, Tunisia
[email protected] [email protected] [email protected]
surrounding environment. Hence, over thousands of papers have been published on these subjects that propose a variety of computer vision products and services by developing new electronic aids for the blind. This paper aims to introduce a proposed system that restores a central function of the visual system: the identification of surrounding objects. The method is based on the local features extraction concept. The simulation results using the SIFT algorithm and keypoints matching showed good accuracy in detecting objects. Thus, our contribution is to present the idea of a visual substitution system based on features extraction and matching to recognize and locate objects in images.

Keywords: Video processing; pattern recognition; SIFT; keypoints matching; visual substitution system.

I. INTRODUCTION

According to the World Health Organization, there are approximately 285 million people with visual impairments; 39 million of them are blind and 246 million have decreased visual acuity. Almost 90% of the visually impaired live in low-income countries. In this context, Tunisia has identified 30,000 people with visual impairments, 13.3% of whom are blind.
Visual impairment has severe consequences on certain capabilities related to visual function:
- daily living activities (which require vision at a medium distance);
- communication, reading and writing (which require vision at close and average distance);
- evaluation of space and displacement (which require far vision);
- the pursuit of any activity requiring prolonged visual attention.
In the computer vision community, developing visual aids for handicapped persons is one of the most active research areas. Mobility aids are intended to describe the environment close to the person with an appreciation of the surrounding objects. These aids are essential for fine navigation in an environment described in a coordinate system relative to the user. In this paper, we present an overview of vision substitution modalities [1-12] and their functionalities. Then, we introduce our proposed system and the experimental tests.

II. RELATED WORKS

Related works show that visual substitution devices accept input from the user's surroundings, decipher it to extract information about entities in the user's environment, and then transmit that information to the subject via auditory or tactile means, or some combination of the two.
Among the various technologies used for blind people, the majority are aids for mobility and obstacle detection [5, 8]. They are based on rules for converting images into sensory substitution data as tactile or auditory stimuli. These systems are efficient for mobility and for the localization of objects, though sometimes with lower precision. However, one of the greatest difficulties of blind people remains the identification of their environment [6]. Indeed, such systems can only be used to recognize simple patterns and cannot be used as substitution tools in natural environments. They also do not identify objects (e.g. whether an item is a table or a chair), and in some cases they detect small objects late. In addition, some of them demand additional auditory attention, while others require a sufficiently long period of learning and testing.
Among the problems in object identification, we note the variability of objects under different conditions: change of viewpoint, change of illumination and change of size. There is also intra-class variability (e.g. there are many types of chairs) and inter-class similarity (e.g. television and computer).
For this reason, we are interested in the evaluation of a fast and robust computer vision algorithm to recognize and locate objects in a video scene. Thus, it is important to design a system based on the recognition and detection of objects that meets the major challenges of the blind in three main categories of needs: displacement, orientation and object identification.

III. OBJECT DETECTION BASED ON FEATURES EXTRACTION

Object recognition is a classical problem in computer vision: the task of determining whether the image data contains a specific object. General object recognition approaches exploit features extraction, and the features that have received the most attention in recent years are local features. The main idea is to focus on the areas containing the most discriminative information.
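As a rough illustration of this local-feature idea, the following sketch shows nearest-neighbour descriptor matching with the ratio test commonly used with SIFT to keep only discriminative matches. This is a minimal sketch in plain Python: the short toy vectors stand in for real 128-dimensional SIFT descriptors, and the 0.8 ratio threshold is illustrative.

```python
import math

def euclidean(d1, d2):
    # Euclidean distance between two descriptor vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def ratio_test_match(query_desc, train_descs, ratio=0.8):
    """Return the index of the best match in train_descs, or None when
    the best match is not clearly better than the second best
    (the ratio test): ambiguous matches are discarded."""
    order = sorted(range(len(train_descs)),
                   key=lambda i: euclidean(query_desc, train_descs[i]))
    best, second = order[0], order[1]
    if euclidean(query_desc, train_descs[best]) < ratio * euclidean(query_desc, train_descs[second]):
        return best
    return None
```

A descriptor that sits halfway between two database descriptors fails the ratio test and is rejected, which is exactly the behaviour that makes local-feature matching robust.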
364 2015 15th International Conference on Intelligent Systems Design and Applications (ISDA)
database objects is made to detect objects in each frame.

[Fig. 1. System overview (block diagram): SIFT keypoints detection from the frames; translation of the recognized object into voice; output data]

B. Algorithm
We build the bag of key-points of objects with SIFT and we calculate the local dissimilarity map between frames. The algorithm demonstrating the RVLDM used in the video analysis proceeds as follows:
(1) Obtain the set of key-points of objects:
    a. Select a large set of images of daily objects.
    b. Extract the SIFT feature points of all the images within the set and obtain the SIFT descriptor for each feature point extracted from each image.
(2) Obtain the keypoints descriptors for the first video frame.

V. RESULTS

Fig. 2. Example of selected frames from video scenes
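The bag-of-keypoints steps above can be sketched as follows. This is a minimal illustration in plain Python, not the paper's implementation: `extract` stands in for SIFT extraction, the descriptors are short toy vectors, and the distance threshold and minimum match count are illustrative.

```python
import math

def euclidean(d1, d2):
    # Euclidean distance between two descriptor vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def build_database(object_images, extract):
    # Step (1): one bag of descriptors per daily object,
    # pooled over all example images of that object.
    return {name: [d for img in imgs for d in extract(img)]
            for name, imgs in object_images.items()}

def detect_objects(frame, database, extract, thresh=1.0, min_matches=3):
    # Step (2) onwards: extract the frame's descriptors, count how many
    # of them find a close neighbour in each object's bag, and report
    # the objects with enough matches.
    frame_descs = extract(frame)
    hits = {}
    for name, descs in database.items():
        n = sum(1 for fd in frame_descs
                if any(euclidean(fd, od) < thresh for od in descs))
        if n >= min_matches:
            hits[name] = n
    return hits
```

With a real extractor, `extract` would return one descriptor per detected keypoint; here an identity function can be used so that "images" are already lists of descriptors.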
[Table II residue: matching results per number of SIFT scales; as recovered, S = 3: 35 / 65 and S = 5: 95 / 5]

An algorithm called Random Sample Consensus (RANSAC) [16] is used to fit a geometric model, an affine transform, to the initial set of matches. This is achieved by iteratively selecting a random set of matches, learning a model from this random set and then testing the remaining matches against the learnt model. We can take advantage of this by transforming the bounding box of the object with the transform estimated by the affine model. Therefore, we can draw a polygon around the estimated location of the object within the frame.

Table II shows that the number of true positives depends on the number of scales used in the SIFT algorithm. In fact, 5 scales give better matching than 3, while a number above 5 does not give better results and takes more time. So, in order to have strong matches, we work with a number of scales S equal to 5. In the figure above (Fig. 3) we tried to find some objects, and we note that we detect all the objects in this video scene. We then also tried to detect a medical box, because we know that identifying medicines is a delicate task for blind people. In Fig. 4 the box is well detected, and a short description of the medicine in an audio file notifies the blind user about what he holds in his hand.

In a video scene we can have different levels of illumination; even within the same video, the illumination can change. It was therefore important to detect objects in a video scene with high illumination (Fig. 5) and in another one with low illumination (Fig. 6). It is clear that the number of true matches decreases in the video with low lightness; however, the object is still well detected. So we can conclude that SIFT is invariant to the change in luminosity in video, and the object can be detected and identified.

C. Discussions
The challenge in comparing keypoints is to figure out the matching between keypoints from the frames and those from the target objects. We get a high percentage of detected objects, but we also tried to identify the reasons behind some failure cases.
The first cause of non-detection of an object was the quality of the images. For example, in Fig. 7 the pen was not detected because we had added some noise to the video scene.
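The RANSAC localization step described earlier (sample a few matches, fit an affine model, keep the model that explains the most matches) can be sketched in plain Python as follows. This is an illustrative sketch, not the paper's code: the affine model is fit from three sampled correspondences via Cramer's rule rather than any particular library, and the iteration count and inlier tolerance are arbitrary choices.

```python
import random

def _det3(m):
    # determinant of a 3x3 matrix
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def _solve3(A, b):
    # Cramer's rule for a 3x3 linear system; None if degenerate
    d = _det3(A)
    if abs(d) < 1e-12:
        return None
    out = []
    for j in range(3):
        M = [row[:] for row in A]
        for i in range(3):
            M[i][j] = b[i]
        out.append(_det3(M) / d)
    return out

def fit_affine(pairs):
    # Fit x' = a*x + b*y + tx, y' = c*x + d*y + ty from exactly
    # three correspondences ((x, y), (x', y')).
    A = [[x, y, 1.0] for (x, y), _ in pairs]
    px = _solve3(A, [xp for _, (xp, _yp) in pairs])
    py = _solve3(A, [yp for _, (_xp, yp) in pairs])
    if px is None or py is None:
        return None
    return px + py  # [a, b, tx, c, d, ty]

def apply_affine(T, p):
    a, b, tx, c, d, ty = T
    return (a * p[0] + b * p[1] + tx, c * p[0] + d * p[1] + ty)

def ransac_affine(matches, iters=200, tol=2.0, seed=0):
    # Repeatedly sample 3 matches, fit an affine model, and keep the
    # model with the most inliers among all matches.
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        model = fit_affine(rng.sample(matches, 3))
        if model is None:
            continue  # degenerate (collinear) sample
        inliers = []
        for src, dst in matches:
            ex, ey = apply_affine(model, src)
            if (ex - dst[0]) ** 2 + (ey - dst[1]) ** 2 <= tol ** 2:
                inliers.append((src, dst))
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

The estimated model can then be applied with `apply_affine` to the four corners of the object's bounding box to draw the localization polygon in the frame.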
Fig. 7. Failure detection caused by the quality of the image

Then, in Fig. 8, the object in the target frame is very small compared to its original size. We know that SIFT is invariant to scale transformation, but when the object contains some texture and is very small in the target frame, the number of keypoints is not sufficient for matching the object and the frame, and we get false matches all the time.

Fig. 8. Failure detection caused by the size of the target object

Furthermore, when the recorded video is very fast, SIFT does not have the time to detect all the keypoints and we generally get false matches (Fig. 9), so the object is not well detected. In some works the video has to be slowed down, but in our case we simply have to move slowly when recording the scene.

Fig. 9. Failure detection caused by the high speed of the video scene

VI. CONCLUSION AND FUTURE WORKS

In this paper, we present a visual substitution system for blind people based on object recognition in video scenes. This system uses SIFT keypoints extraction and features matching for object identification. We devote the experimental part to testing the application in order to detect some objects in video scenes under different conditions.
At this stage of the work, we address the recognition of each object in the scene as an individual task; we do not consider the relationships between objects. Thus, in future works, we will consider these relationships for scene understanding, or for detecting everything that belongs to a given place or location. Finally, in order to help blind people benefit from new technologies, a mobile application may be the best solution.

REFERENCES
[1] Durette, B., Louveton, N., Alleysson, D., and Hérault, J., 2008. Visuo-auditory sensory substitution for mobility assistance: testing The VIBE. In Workshop on Computer Vision Applications for the Visually Impaired, Marseille, France.
[2] Hernández, A. F. R. et al., 2009. Computer Solutions on Sensory Substitution for Sensory Disabled People. In Proceedings of the 8th WSEAS International Conference on Computational Intelligence, Man-Machine Systems and Cybernetics, pp. 134-138.
[3] Tang, H., and Beebe, D. J., 2006. An oral tactile interface for blind navigation. IEEE Trans. Neural Syst. Rehabil. Eng., pp. 116-123.
[4] Auvray, M., Hanneton, S., and O'Regan, J. K., 2007. Learning to perceive with a visuo-auditory substitution system: Localisation and object recognition with 'The vOICe'. Perception, pp. 416-430.
[5] Kammoun, S. et al., 2012. Navigation and space perception assistance for the visually impaired: The NAVIG project. In IRBM, Numéro spécial ANR TECSAN, 33(2), pp. 182-189.
[6] Brian Katz, F. G. et al., 2012. NAVIG: Guidance system for the visually impaired using virtual augmented reality. In Technology and Disability, pp. 163-178.
[7] Abboud, S., Hanassy, S., Levy-Tzedek, S., Maidenbaum, S., and Amedi, A., 2014. EyeMusic: Introducing a "visual" colorful experience for the blind using auditory sensory substitution. In Restorative Neurology and Neuroscience.
[8] Martínez, B. D. C., Vergara-Villegas, O. O., Sánchez, V. G. C., De Jesús Ochoa Domínguez, H., and Máynez, L. O., 2011. Visual Perception Substitution by the Auditory Sense. In Beniamino Murgante, Osvaldo Gervasi, Andrés Iglesias, David Taniar, and Bernady O. Apduhan, editors, ICCSA (2), volume 6783 of Lecture Notes in Computer Science, pp. 522-533. Springer.
[9] Jabnoun, H., Benzarti, F., and Amiri, H., 2014. Visual substitution system for blind people based on SIFT description. In International Conference of Soft Computing and Pattern Recognition, Tunisia, pp. 300-305.
[10] Renier, L. et al., 2005. The Ponzo illusion with auditory substitution of vision in sighted and early-blind subjects. In Perception.
[11] Zhang, J., Ong, S. K., and Nee, A. Y. C., 2009. Design and Development of a Navigation Assistance System for Visually Impaired Individuals. In Proceedings of the 3rd International Convention on Rehabilitation Engineering & Assistive Technology, i-CREATe '09, New York, NY, USA.
[12] Jafri, R., Ali, S. A., Arabnia, H. R., and Fatima, S., 2013. Computer vision-based object recognition for the visually impaired in an indoors environment: a survey. In The Visual Computer, Springer Berlin Heidelberg, pp. 1-26.
[13] Lowe, D. G., 2004. Distinctive Image Features from Scale-Invariant Keypoints. In Int. J. Comput. Vision, 60(2), pp. 91-110.
[14] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L., 2008. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), pp. 346-359.
[15] Hare, J. S., Samangooei, S., and Dupplaw, D. P., 2011. OpenIMAJ and ImageTerrier: Java libraries and tools for scalable multimedia analysis and indexing of images. In Proceedings of the 19th ACM International Conference on Multimedia (MM '11), ACM, New York, NY, USA, pp. 691-694. DOI=10.1145/2072298.2072421.
[16] Chum, O., and Matas, J., 2005. Matching with PROSAC - progressive sample consensus. In Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference on, Vol. 1, IEEE.