Vision For The Blind

Abstract—Augmented reality applications have focused on visually integrating virtual objects into real environments. In this paper, we propose an auditory augmented reality, where we integrate acoustic virtual objects into the real world. We sonify objects that do not intrinsically produce sound, with the purpose of revealing additional information about them. Using spatialized (3D) audio synthesis, acoustic virtual objects are placed at specific real-world coordinates, obviating the need to explicitly tell the user where they are. Thus, by leveraging the innate human capacity for 3D sound source localization and source separation, we create an audio natural user interface. In contrast with previous work, we do not create acoustic scenes by transducing low-level (for instance, pixel-based) visual information. Instead, we use computer vision methods to identify high-level features of interest in an RGB-D stream, which are then sonified as virtual objects at their respective real-world coordinates. Since our visual and auditory senses are inherently spatial, this technique naturally maps between these two modalities, creating intuitive representations. We evaluate this concept with a head-mounted device, featuring modes that sonify flat surfaces, navigable paths and human faces.

Index Terms—augmented reality, natural user interface, sonification, spatialization, blind, visually impaired.

I. INTRODUCTION

According to the World Health Organization, there are an estimated 39 million blind people in the world [1]. In the United States, there are an estimated 1.3 million legally blind individuals [2], with approximately 109,000 who use long canes and 7,000 who rely on guide dogs [3]. Since vision impairments hinder a wide variety of human activities, assistive devices have been designed to facilitate specific tasks or enhance mobility.

We start from the observation that our two highest-bandwidth senses – vision and hearing – have spatial structure. Using spatialized (3D) audio, we synthesize virtual acoustic objects, placing them at specific real-world coordinates. Thus, by leveraging the innate human ability for 3D sound source localization, we can relay location-dependent information without having to explicitly encode spatial coordinates.

In this paper, we combine this approach with computer vision techniques to create natural high-level scene representations for the visually impaired. For example, we use face recognition to detect known individuals, who can then virtually identify themselves using recordings of their own voices, which appear to originate from their real-world locations. On one hand, computer vision adds a layer of cognition which promotes relevance and produces tremendous bandwidth savings. On the other hand, spatial audio eliminates the overhead of encoding, transmitting and decoding spatial coordinates.

In recent years, there have been several proposals for translating visual information into audio by encoding low-level features. For instance, several methods were proposed to encode a bitmap image one pixel column at a time, using frequency-domain multiplexing for each column. The vOICe [4] was the first proposal for a wearable device of this kind. A camera acquires a bitmap image of up to 64x64 pixels with 16 shades of gray per pixel, and the system encodes columns in left-to-right order. Given a column, each pixel controls a sinusoidal generator, with its value determining amplitude, and its coordinate being proportional to the frequency (see Fig. 1). At a given moment, the user hears the superposition of all the sinusoids from a column. After all columns have been rendered, a synchronization click is generated and the process restarts.

Figure 1. Encoding used by The vOICe (row index maps to frequency, column index to scan time over the downsampled image)

The literature features several variations of this approach. In [5], an RGB camera image was first reduced to 1 bit per pixel, and pixels were associated with musical notes. A black and white image could then be mapped into a melody. SVETA [6] was a more recent proposal, which transduced a disparity image obtained from stereo matching. To reduce user fatigue, pixels were associated with major chords instead of pure sinewaves.

Unless the image is very sparse, encoding every pixel
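As a concrete illustration (not the original systems' code), the following numpy sketch implements a vOICe-style column scan as described above; the frequency range, column duration and click length are arbitrary placeholders.

```python
import numpy as np

def column_scan_encode(image, fs=44100, col_dur=0.02,
                       f_min=500.0, f_max=5000.0):
    """Illustrative vOICe-style scan: every pixel of a column drives a
    sinusoid (value -> amplitude, row -> frequency); columns are played
    left to right, followed by a synchronization click."""
    rows, cols = image.shape
    freqs = np.linspace(f_min, f_max, rows)        # one frequency per row
    t = np.arange(int(fs * col_dur)) / fs          # time axis of one column
    chunks = []
    for c in range(cols):                          # left-to-right order
        amps = image[:, c] / 255.0                 # assumes 8-bit grayscale
        tones = amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)
        chunks.append(tones.sum(axis=0))           # superpose the column
    chunks.append(np.ones(int(fs * 0.002)))        # synchronization click
    return np.concatenate(chunks)
```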
[Figure: audio rendering pipeline, with blocks labeled Source, PCM Buffer, TTS, FM Synth (fixed), HRTF, 1/r, Reverb, XAudio2, fs]
Listeners adapt to the reverberation characteristics of an environment, and improve their sound source localization performance with time [13].

HRTF filtering is performed with an FFT-accelerated convolution engine, with a typical latency of 10 ms. Each audio source is associated with a playlist of wave samples, speech utterances (produced by a text-to-speech synthesizer) or musical notes (produced by an FM synthesizer). To facilitate the description of real-world features, a source can also describe parameterized curves in real-world coordinates.
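To illustrate the spatialization path, the sketch below performs binaural rendering by FFT convolution of a mono buffer with a left/right HRTF pair (for example, taken from the CIPIC database [9]) plus 1/r distance attenuation. The function name and the omission of reverb are simplifications, not the device's actual engine.

```python
import numpy as np

def spatialize(mono, hrtf_l, hrtf_r, distance, fs=44100):
    """Render a mono buffer as binaural audio: convolve with the
    left/right HRTF impulse responses for the source direction and
    apply 1/r distance attenuation (reverb omitted for brevity)."""
    n = len(mono) + len(hrtf_l) - 1          # full convolution length
    nfft = 1 << (n - 1).bit_length()         # next power of two for the FFT
    spec = np.fft.rfft(mono, nfft)
    left = np.fft.irfft(spec * np.fft.rfft(hrtf_l, nfft), nfft)[:n]
    right = np.fft.irfft(spec * np.fft.rfft(hrtf_r, nfft), nfft)[:n]
    gain = 1.0 / max(distance, 0.1)          # 1/r attenuation, clamped near 0
    return gain * np.stack([left, right], axis=1)
```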
Head tracking is an integral component of audio spatialization. If the user moves, the virtual audio sources associated with real-world objects should not be dragged along with him. Thus, during the acoustic rendering of a scene (which lasts approximately 1 second) we perform head tracking, and update the coordinates of all 3D audio objects. To simplify tracking, we only estimate the relative rotation between computer vision updates. This rotation is given by the composition of rotation matrices of the form

$$
R(\theta_x, \theta_y, \theta_z) =
\begin{bmatrix}
1 & -\theta_z & \theta_y \\
\theta_z & 1 & -\theta_x \\
-\theta_y & \theta_x & 1
\end{bmatrix}. \quad (1)
$$
While the composition of rotations is not commutative, the composition of infinitesimal rotations is. We sample a 3D gyroscope at 40 Hz, such that a relative rotation matrix can be updated by iteratively multiplying by (1) and reorthonormalizing the result.
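Concretely, the update loop can be sketched as follows; the SVD-based reorthonormalization is one standard choice, since the paper does not specify the method used.

```python
import numpy as np

def integrate_gyro(R, omega, dt=1.0 / 40.0):
    """Update a relative rotation estimate R with one gyroscope
    sample omega (rad/s, length-3 array), using the small-angle
    matrix of Eq. (1) followed by reorthonormalization."""
    tx, ty, tz = omega * dt                 # infinitesimal rotation angles
    dR = np.array([[1.0, -tz,  ty],
                   [ tz, 1.0, -tx],
                   [-ty,  tx, 1.0]])
    R = dR @ R                              # compose with previous estimate
    # project back onto the rotation group; for a matrix already close
    # to a rotation, U @ Vt is the nearest orthogonal matrix
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt
```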
B. Plane detection

Plane detection is used for two purposes: to identify the floor, and to provide an environmental decomposition into flat surfaces. Indeed, planes are the dominant primitives in man-made environments, and can be conveniently used to identify walls and furniture. Our underlying assumption is that, given the decomposition of an environment into planes of known sizes and locations, a user is able to infer from contextual clues which classes of objects they belong to. For instance, the location of a table could be inferred from the description of a large horizontal plane. Likewise, a chair could be associated with a small pair of horizontal and vertical planes.

Our algorithm for fast plane detection is described in the Appendix. Figure 5 shows an example of plane segmentation, where each plane is drawn in a different color. The sampling rectangles used for each plane are also represented.

Figure 5. Plane segmentation example
Figure 6 shows how planes are represented acoustically. The plane detector associates each detected plane with its point cloud. Using the eigendecomposition of this flat set of points, we approximate it as a quadrilateral and produce an estimate of its 4 corners. A vertical plane is represented by a clockwise sequence of musical notes, rising in pitch, with each corner rendered by a virtual source at the corner's real-world location. A horizontal plane is distinguished by being represented by a counterclockwise sequence of musical notes, falling in pitch. While other representations are certainly possible, this proved sufficient to relay the concept of a plane.

Figure 6. Acoustic rendering for plane representation (vertex order for horizontal and vertical planes)
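The corner sonification can be sketched as follows; the note timing and MIDI base note are illustrative placeholders, since the paper does not specify the musical scale used.

```python
def plane_to_events(corners, horizontal, base_midi=60):
    """Map a quadrilateral's 4 corners (listed clockwise) to timed 3D
    note events: vertical planes -> clockwise order, rising pitch;
    horizontal planes -> counterclockwise order, falling pitch."""
    order = corners[::-1] if horizontal else corners   # CCW vs CW traversal
    events = []
    for i, xyz in enumerate(order):
        pitch = base_midi - i if horizontal else base_midi + i
        events.append({"time": 0.25 * i,       # one note every 250 ms
                       "position": xyz,        # real-world coordinates
                       "midi_note": pitch})
    return events
```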
[Figure: acoustic rendering of the floorplan, with labels: 60º FOV, navigable direction, obstacle, synchronization click, -90º to +90º]

Figure 11. Participant score histogram

C. Floorplan description

Several devices for the visually impaired have relied on ultrasound for describing local geometry. In comparison, depth cameras provide a dramatic improvement in terms of spatial resolution. Furthermore, their range can reach 10 m with current technology. Thus, one can perform local mapping to complement the short range and tactile feedback of a white cane. With this in mind, our device implements a floorplan description mode, which is intended to relay on demand a fast description of the navigable floor.

We define the navigable region to be the visible floor, bounded by obstacles. Visibility is important for safety reasons, as it prevents instructing the user to walk on potentially nonexistent ground. Depth cameras relying on infrared also produce offscale high pixels for glass and black surfaces, effectively classifying them as distant objects. By treating these offscale regions as non-navigable, we prevent collisions with undetectable obstacles.

The plane detector is first used to locate the floor, which is the largest plane with an orientation consistent with the gravity vector (given by the accelerometer). After plane detection, we rotate the entire point cloud so that y = 0 for all points on the floor. We then project the entire point cloud onto the xz plane, creating a 2D floorplan. We ignore points which are small in height and fall under the error threshold of the camera. We also ignore obstacles which the user can walk under. The obstacle floorplan is then convolved with a human-like
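The projection step of the floorplan construction described above can be sketched as follows, assuming the floor plane has already been detected and a leveling rotation R_floor computed; the cell size and height thresholds are placeholders, and the final convolution with a human-sized footprint is indicated only by a comment.

```python
import numpy as np

def build_floorplan(points, R_floor, cell=0.05, y_noise=0.03, y_head=1.9,
                    half_width=5.0, depth=10.0):
    """Rasterize obstacles onto the xz plane after leveling the cloud.
    R_floor rotates the (N x 3) cloud so the detected floor is y = 0."""
    p = points @ R_floor.T
    keep = (p[:, 1] > y_noise) & (p[:, 1] < y_head)  # drop floor noise and
    obs = p[keep]                                    # walk-under structures
    nx = int(2 * half_width / cell)
    nz = int(depth / cell)
    grid = np.zeros((nz, nx), dtype=bool)
    ix = np.clip(((obs[:, 0] + half_width) / cell).astype(int), 0, nx - 1)
    iz = np.clip((obs[:, 2] / cell).astype(int), 0, nz - 1)
    grid[iz, ix] = True                              # occupied cells
    # a convolution with a human-sized footprint would follow, marking
    # every cell where a person cannot fit as non-navigable
    return grid
```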
Our proof of concept device features modes for plane decomposition, floorplan estimation and face detection. Plane decomposition could be generalized with other primitives. A practical device would benefit from additional operating modes, such as optical character recognition followed by text-to-speech synthesis, and barcode recognition followed by product lookup. Outdoor urban use could benefit from crosswalk and traffic sign detectors, and GPS integration. Specific context-dependent tasks could also be modeled. In particular, entertainment applications could involve games such as bowling and billiards.

APPENDIX
Plane detection is performed using the 640x480 point cloud produced by the depth camera. We implemented a fully deterministic approach based on multi-scale sampling, designed to be computationally efficient and robust to noise. While RANSAC and its variants [16], [17] are very effective for fitting a wide range of primitives, plane detection can benefit from a more specialized approach. Our proposal samples a depth frame using uniform rectangular grids which are gradually refined. This approach promotes the fast and robust removal of large flat regions, and progressively searches for smaller plane sections.
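The sampling schedule can be sketched as follows; the scale extents and square windows are placeholders, since the text does not give exact rectangle sizes.

```python
def sampling_rects(width=640, height=480, coarsest=160, finest=20):
    """Generate (x, y, w, h) sampling rectangles for each scale, from
    coarse to fine, with 50% overlap between neighboring rectangles.
    Square windows are used here for simplicity."""
    scales = []
    side = coarsest
    while side >= finest:
        step = side // 2                      # 50% overlap
        rects = [(x, y, side, side)
                 for y in range(0, height - side + 1, step)
                 for x in range(0, width - side + 1, step)]
        scales.append(rects)
        side //= 2                            # refine the grid
    return scales
```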
At each sampling scale, the depth frame is divided into rectangles with 50% overlap. For each rectangle, a gradient fill is applied at its center, effectively identifying a connected region in 3D space. For sufficiently large connected regions with points $\{(x_i, y_i, z_i)\}_{i=1}^{N}$, we estimate the least-squares plane $ax + by + cz + d = 0$ using the least-squares solution to

$$
\underbrace{\begin{bmatrix}
x_1 & y_1 & z_1 & 1 \\
x_2 & y_2 & z_2 & 1 \\
\vdots & \vdots & \vdots & \vdots \\
x_N & y_N & z_N & 1
\end{bmatrix}}_{A}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix} = 0. \quad (2)
$$
Note that the optimal plane is given by the right singular vector of $A$ associated with its smallest singular value, which is also the eigenvector of $A^T A$ with the smallest eigenvalue. In practice, one can subsample the connected region to reduce N.
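In code, the fit of Eq. (2) is a single SVD; a minimal sketch:

```python
import numpy as np

def fit_plane(points):
    """Fit ax + by + cz + d = 0 to an (N x 3) array of points by taking
    the right singular vector of A = [x y z 1] associated with the
    smallest singular value (Eq. (2))."""
    A = np.hstack([points, np.ones((len(points), 1))])
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    plane = Vt[-1]                    # rows of Vt are sorted by decreasing
    n = np.linalg.norm(plane[:3])     # singular value; take the last one
    return plane / n                  # unit normal: residuals are distances
```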
tics,” J. Acoust. Soc. Am, vol. 111, no. 4, pp. 1832–1846, 2002.
Similarly to RANSAC, we obtain the set of inliers, i.e., [12] ——, “Direct-to-reverberant energy ratio sensitivity,” J. Acoust. Soc.
the subset of points which are close to a least-squares plane. Am., vol. 112, no. 5, pp. 2110–2117, 2002.
Unlike RANSAC, we do not iterate over multiple plane [13] B. Shinn-Cunningham, “Localizing sound in rooms,” in Proc. ACM
SIGGRAPH, 2001.
candidates in search for the largest plane. Instead, we either [14] C. Zhang and P. Viola, “Multiple-instance pruning for learning efficient
accept a plane candidate at the current scale if the ratio of cascade detectors,” in Proc. NIPS, 2007.
inliers to the total number of points in the sampling rectangle [15] Z. Cao, Q. Yin, X. Tang, and J. Sun, “Face recognition with learning-
based descriptor,” in Proc. CVPR, 2010.
is sufficiently large, and reject it otherwise. Using a model [16] M. Fischler and R. Bolles, “Random sample consensus: a paradigm
for the depth camera, we define the inliers such that this for model fitting with applications to image analysis and automated
ratio test produces plane estimates with a given false positive cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395,
1981.
probability. [17] R. Schnabel, R. Wahl, and R. Klein, “Efficient ransac for point-cloud
For the Kinect depth camera, we assume that the depth map noise is Gaussian, with a variance given by [18]

$$
\sigma_z^2 = \frac{\sigma_0^2\, z^4}{f_d^2\, B^2}, \quad (3)
$$

where $z$ is the depth coordinate, $f_d = (f_x + f_y)/2$ is the mean focal length, and $\sigma_0$ and $B$ are constants. Let $t_0$ be the inlier distance threshold at a reference depth $z_0$. One can use (3) to determine the probability $p_0$ of having a depth error exceeding $t_0$. We use (3) to produce a depth-dependent inlier threshold $t(z)$, such that for all $z$, the depth error exceeds $t(z)$ with constant probability $p_0$. Assuming the independence of depth errors, for sufficiently large $N$ one should have approximately $N(1 - p_0)$ inliers. By comparing $N(1 - p_0)$ with the actual number of inliers, one can accept or reject a plane candidate. After extracting inliers for the connected region, we recompute the least-squares estimate, and extract inliers for the entire depth map. The least-squares estimate is computed again, and produces the accepted plane estimate after a final inlier extraction.
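A sketch of this acceptance test follows. Since $\sigma_z$ in Eq. (3) grows as $z^2$, scaling $t_0$ by $(z/z_0)^2$ keeps the exceedance probability constant; the threshold values and the acceptance ratio are placeholders.

```python
import numpy as np

def inlier_threshold(z, t0=0.02, z0=1.0):
    """Depth-dependent threshold t(z): scaling t0 by (z / z0)^2 keeps
    the probability of a depth error exceeding t(z) constant at every
    depth, because sigma_z in Eq. (3) is proportional to z^2."""
    return t0 * (z / z0) ** 2

def accept_plane(depth_errors, depths, p0=0.05, ratio=0.8):
    """Accept a candidate when the observed inlier count is close to
    the N * (1 - p0) expected under the noise model."""
    inliers = np.abs(depth_errors) < inlier_threshold(depths)
    expected = len(depth_errors) * (1.0 - p0)
    return bool(inliers.sum() >= ratio * expected), inliers
```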
This method is more computationally efficient than RANSAC, since it does not require testing a large number of randomly sampled plane candidates. By using a multiscale approach, it is guaranteed to extract large planes first, dramatically reducing the size of the remaining point cloud.

REFERENCES

[1] World Health Organization (WHO), "Fact sheet 282: Visual impairment and blindness," Oct. 2011. Available: http://www.who.int/mediacentre/factsheets/fs282/en/
[2] National Federation of the Blind, "Blindness statistics." Available: http://www.nfb.org/nfb/blindness_statistics.asp
[3] American Foundation for the Blind, "Facts and figures on adults with vision loss." Available: http://www.afb.org
[4] P. Meijer, "An experimental system for auditory image representations," IEEE Trans. Biomed. Eng., vol. 39, no. 2, pp. 112–121, 1992.
[5] J. Cronly-Dillon, K. Persaud, and R. Gregory, "The perception of visual images encoded in musical form: a study in cross-modality information transfer," Proc. R. Soc. B, vol. 266, no. 1436, p. 2427, 1999.
[6] G. Balakrishnan, G. Sainarayanan, R. Nagarajan, and S. Yaacob, "Wearable real-time stereo vision for the visually impaired," Engineering Letters, vol. 14, no. 2, 2007.
[7] G. Bologna, B. Deville, T. Pun, and M. Vinckenbosch, "Transforming 3D coloured pixels into musical instrument notes for vision substitution applications," EURASIP J. Image Video Process., 2007.
[8] G. Bologna, B. Deville, and T. Pun, "On the use of the auditory pathway to represent image scenes in real-time," Neurocomputing, vol. 72, no. 4-6, pp. 839–849, 2009.
[9] V. Algazi, R. Duda, D. Thompson, and C. Avendano, "The CIPIC HRTF database," in Proc. WASPAA, 2001.
[10] D. Mershon and L. King, "Intensity and reverberation as factors in the auditory perception of egocentric distance," Attention, Perception, & Psychophysics, vol. 18, no. 6, pp. 409–415, 1975.
[11] P. Zahorik, "Assessing auditory distance perception using virtual acoustics," J. Acoust. Soc. Am., vol. 111, no. 4, pp. 1832–1846, 2002.
[12] ——, "Direct-to-reverberant energy ratio sensitivity," J. Acoust. Soc. Am., vol. 112, no. 5, pp. 2110–2117, 2002.
[13] B. Shinn-Cunningham, "Localizing sound in rooms," in Proc. ACM SIGGRAPH, 2001.
[14] C. Zhang and P. Viola, "Multiple-instance pruning for learning efficient cascade detectors," in Proc. NIPS, 2007.
[15] Z. Cao, Q. Yin, X. Tang, and J. Sun, "Face recognition with learning-based descriptor," in Proc. CVPR, 2010.
[16] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[17] R. Schnabel, R. Wahl, and R. Klein, "Efficient RANSAC for point-cloud shape detection," Computer Graphics Forum, vol. 26, no. 2, pp. 214–226, 2007.
[18] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang, "3D deformable face tracking with a commodity depth camera," in Computer Vision – ECCV 2010, pp. 229–242, 2010.