Vision For The Blind

Abstract—Augmented reality applications have focused on visually integrating virtual objects into real environments. In this paper, we propose an auditory augmented reality, where we integrate acoustic virtual objects into the real world. We sonify objects that do not intrinsically produce sound, with the purpose of revealing additional information about them. Using spatialized (3D) audio synthesis, acoustic virtual objects are placed at specific real-world coordinates, obviating the need to explicitly tell the user where they are. Thus, by leveraging the innate human capacity for 3D sound source localization and source separation, we create an audio natural user interface. In contrast with previous work, we do not create acoustic scenes by transducing low-level (for instance, pixel-based) visual information. Instead, we use computer vision methods to identify high-level features of interest in an RGB-D stream, which are then sonified as virtual objects at their respective real-world coordinates. Since our visual and auditory senses are inherently spatial, this technique naturally maps between these two modalities, creating intuitive representations. We evaluate this concept with a head-mounted device, featuring modes that sonify flat surfaces, navigable paths and human faces.

Index Terms—augmented reality, natural user interface, sonification, spatialization, blind, visually impaired.

I. INTRODUCTION

According to the World Health Organization, there are an estimated 39 million blind people in the world [1]. In the United States, there are an estimated 1.3 million legally blind individuals [2], with approximately 109,000 who use long canes and 7,000 who rely on guide dogs [3]. Since vision impairments hinder a wide variety of human activities, assistive devices have been designed to facilitate specific tasks or enhance mobility.

We start from the observation that our two highest-bandwidth senses – vision and hearing – have spatial structure. Using spatialized (3D) audio, we synthesize virtual acoustic objects, placing them at specific real-world coordinates. Thus, by leveraging the innate human ability for 3D sound source localization, we can relay location-dependent information without having to explicitly encode spatial coordinates.

In this paper, we combine this approach with computer vision techniques to create natural high-level scene representations for the visually impaired. For example, we use face recognition to detect known individuals, who can then virtually identify themselves using recordings of their own voices, which appear to originate from their real-world locations. On one hand, computer vision adds a layer of cognition which promotes relevance and produces tremendous bandwidth savings. On the other hand, spatial audio eliminates the overhead of encoding, transmitting and decoding spatial coordinates.

In recent years, there have been several proposals for translating visual information into audio by encoding low-level features. For instance, several methods were proposed to encode a bitmap image one pixel column at a time, using frequency-domain multiplexing for each column. The vOICe [4] was the first proposal for a wearable device of this kind. A camera acquires a bitmap image of up to 64x64 pixels with 16 shades of gray per pixel, and the system encodes columns in left-to-right order. Given a column, each pixel controls a sinusoidal generator, with its value determining amplitude, and its coordinate being proportional to the frequency (see Fig. 1). At a given moment, the user hears the superposition of all the sinusoids from a column. After all columns have been rendered, a synchronization click is generated and the process restarts.

Figure 1. Encoding used by The vOICe (row index maps to frequency, column index to scan time over the downsampled image)

The literature features several variations of this approach. In [5], an RGB camera image was first reduced to 1 bit per pixel, and pixels were associated with musical notes. A black and white image could then be mapped into a melody. SVETA [6] was a more recent proposal, which transduced a disparity image obtained from stereo matching. To reduce user fatigue, pixels were associated with major chords instead of pure sinewaves.

Unless the image is very sparse, encoding every pixel
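As a concrete illustration (not the original systems' code), the following numpy sketch implements a vOICe-style column scan as described above; the frequency range, column duration and click length are arbitrary placeholders.

```python
import numpy as np

def column_scan_encode(image, fs=44100, col_dur=0.02,
                       f_min=500.0, f_max=5000.0):
    """Illustrative vOICe-style scan: every pixel of a column drives a
    sinusoid (value -> amplitude, row -> frequency); columns are played
    left to right, followed by a synchronization click."""
    rows, cols = image.shape
    freqs = np.linspace(f_min, f_max, rows)        # one frequency per row
    t = np.arange(int(fs * col_dur)) / fs          # time axis of one column
    chunks = []
    for c in range(cols):                          # left-to-right order
        amps = image[:, c] / 255.0                 # assumes 8-bit grayscale
        tones = amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)
        chunks.append(tones.sum(axis=0))           # superpose the column
    chunks.append(np.ones(int(fs * 0.002)))        # synchronization click
    return np.concatenate(chunks)
```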
[Figure: audio rendering pipeline, with blocks labeled Source, PCM Buffer, TTS, FM Synth (fixed), HRTF, 1/r, Reverb, XAudio2, fs]
Listeners adapt to the reverberation characteristics of an environment, and improve their sound source localization performance with time [13].

HRTF filtering is performed with an FFT-accelerated convolution engine, with a typical latency of 10 ms. Each audio source is associated with a playlist of wave samples, speech utterances (produced by a text-to-speech synthesizer) or musical notes (produced by an FM synthesizer). To facilitate the description of real-world features, a source can also describe parameterized curves in real-world coordinates.
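To illustrate the spatialization path, the sketch below performs binaural rendering by FFT convolution of a mono buffer with a left/right HRTF pair (for example, taken from the CIPIC database [9]) plus 1/r distance attenuation. The function name and the omission of reverb are simplifications, not the device's actual engine.

```python
import numpy as np

def spatialize(mono, hrtf_l, hrtf_r, distance, fs=44100):
    """Render a mono buffer as binaural audio: convolve with the
    left/right HRTF impulse responses for the source direction and
    apply 1/r distance attenuation (reverb omitted for brevity)."""
    n = len(mono) + len(hrtf_l) - 1          # full convolution length
    nfft = 1 << (n - 1).bit_length()         # next power of two for the FFT
    spec = np.fft.rfft(mono, nfft)
    left = np.fft.irfft(spec * np.fft.rfft(hrtf_l, nfft), nfft)[:n]
    right = np.fft.irfft(spec * np.fft.rfft(hrtf_r, nfft), nfft)[:n]
    gain = 1.0 / max(distance, 0.1)          # 1/r attenuation, clamped near 0
    return gain * np.stack([left, right], axis=1)
```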
Head tracking is an integral component of audio spatialization. If the user moves, the virtual audio sources associated with real-world objects should not be dragged along with him. Thus, during the acoustic rendering of a scene (which lasts approximately 1 second) we perform head tracking, and update the coordinates of all 3D audio objects. To simplify tracking, we only estimate the relative rotation between computer vision updates. This rotation is given by the composition of rotation matrices of the form

$$
R(\theta_x, \theta_y, \theta_z) =
\begin{bmatrix}
1 & -\theta_z & \theta_y \\
\theta_z & 1 & -\theta_x \\
-\theta_y & \theta_x & 1
\end{bmatrix}. \quad (1)
$$
While the composition of rotations is not commutative, the composition of infinitesimal rotations is. We sample a 3D gyroscope at 40 Hz, such that a relative rotation matrix can be updated by iteratively multiplying by (1) and reorthonormalizing the result.
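Concretely, the update loop can be sketched as follows; the SVD-based reorthonormalization is one standard choice, since the paper does not specify the method used.

```python
import numpy as np

def integrate_gyro(R, omega, dt=1.0 / 40.0):
    """Update a relative rotation estimate R with one gyroscope
    sample omega (rad/s, length-3 array), using the small-angle
    matrix of Eq. (1) followed by reorthonormalization."""
    tx, ty, tz = omega * dt                 # infinitesimal rotation angles
    dR = np.array([[1.0, -tz,  ty],
                   [ tz, 1.0, -tx],
                   [-ty,  tx, 1.0]])
    R = dR @ R                              # compose with previous estimate
    # project back onto the rotation group; for a matrix already close
    # to a rotation, U @ Vt is the nearest orthogonal matrix
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt
```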
B. Plane detection

Plane detection is used for two purposes: to identify the floor, and to provide an environmental decomposition into flat surfaces. Indeed, planes are the dominant primitives in man-made environments, and can be conveniently used to identify walls and furniture. Our underlying assumption is that, given the decomposition of an environment into planes of known sizes and locations, a user is able to infer from contextual clues which classes of objects they belong to. For instance, the location of a table could be inferred from the description of a large horizontal plane. Likewise, a chair could be associated with a small pair of horizontal and vertical planes.

Our algorithm for fast plane detection is described in the Appendix. Figure 5 shows an example of plane segmentation, where each plane is drawn in a different color. The sampling rectangles used for each plane are also represented.

Figure 5. Plane segmentation example
Figure 6 shows how planes are represented acoustically. The plane detector associates each detected plane with its point cloud. Using the eigendecomposition of this flat set of points, we approximate it as a quadrilateral and produce an estimate of its 4 corners. A vertical plane is represented by a clockwise sequence of musical notes, rising in pitch, with each corner rendered by a virtual source at the corner's real-world location. A horizontal plane is distinguished by being represented by a counterclockwise sequence of musical notes, falling in pitch. While other representations are certainly possible, this proved sufficient to relay the concept of a plane.

Figure 6. Acoustic rendering for plane representation (vertex order for horizontal and vertical planes)
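The corner sonification can be sketched as follows; the note timing and MIDI base note are illustrative placeholders, since the paper does not specify the musical scale used.

```python
def plane_to_events(corners, horizontal, base_midi=60):
    """Map a quadrilateral's 4 corners (listed clockwise) to timed 3D
    note events: vertical planes -> clockwise order, rising pitch;
    horizontal planes -> counterclockwise order, falling pitch."""
    order = corners[::-1] if horizontal else corners   # CCW vs CW traversal
    events = []
    for i, xyz in enumerate(order):
        pitch = base_midi - i if horizontal else base_midi + i
        events.append({"time": 0.25 * i,       # one note every 250 ms
                       "position": xyz,        # real-world coordinates
                       "midi_note": pitch})
    return events
```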
[Figure: acoustic rendering of the floorplan, with labels: 60º FOV, navigable direction, obstacle, synchronization click, -90º to +90º]

Figure 11. Participant score histogram

C. Floorplan description

Several devices for the visually impaired have relied on ultrasound for describing local geometry. In comparison, depth cameras provide a dramatic improvement in terms of spatial resolution. Furthermore, their range can reach 10 m with current technology. Thus, one can perform local mapping to complement the short range and tactile feedback of a white cane. With this in mind, our device implements a floorplan description mode, which is intended to relay on demand a fast description of the navigable floor.

We define the navigable region to be the visible floor, bounded by obstacles. Visibility is important for safety reasons, as it prevents instructing the user to walk on potentially nonexistent ground. Depth cameras relying on infrared also produce offscale high pixels for glass and black surfaces, effectively classifying them as distant objects. By treating these offscale regions as non-navigable, we prevent collisions with undetectable obstacles.

The plane detector is first used to locate the floor, which is the largest plane with an orientation consistent with the gravity vector (given by the accelerometer). After plane detection, we rotate the entire point cloud so that y = 0 for all points on the floor. We then project the entire point cloud onto the xz plane, creating a 2D floorplan. We ignore points which are small in height and fall under the error threshold of the camera. We also ignore obstacles which the user can walk under. The obstacle floorplan is then convolved with a human-like
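The projection step of the floorplan construction described above can be sketched as follows, assuming the floor plane has already been detected and a leveling rotation R_floor computed; the cell size and height thresholds are placeholders, and the final convolution with a human-sized footprint is indicated only by a comment.

```python
import numpy as np

def build_floorplan(points, R_floor, cell=0.05, y_noise=0.03, y_head=1.9,
                    half_width=5.0, depth=10.0):
    """Rasterize obstacles onto the xz plane after leveling the cloud.
    R_floor rotates the (N x 3) cloud so the detected floor is y = 0."""
    p = points @ R_floor.T
    keep = (p[:, 1] > y_noise) & (p[:, 1] < y_head)  # drop floor noise and
    obs = p[keep]                                    # walk-under structures
    nx = int(2 * half_width / cell)
    nz = int(depth / cell)
    grid = np.zeros((nz, nx), dtype=bool)
    ix = np.clip(((obs[:, 0] + half_width) / cell).astype(int), 0, nx - 1)
    iz = np.clip((obs[:, 2] / cell).astype(int), 0, nz - 1)
    grid[iz, ix] = True                              # occupied cells
    # a convolution with a human-sized footprint would follow, marking
    # every cell where a person cannot fit as non-navigable
    return grid
```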
Our proof of concept device features modes for plane decomposition, floorplan estimation and face detection. Plane decomposition could be generalized with other primitives. A practical device would benefit from additional operating modes, such as optical character recognition followed by text-to-speech synthesis, and barcode recognition followed by product lookup. Outdoor urban use could benefit from crosswalk and traffic sign detectors, and GPS integration. Specific context-dependent tasks could also be modeled. In particular, entertainment applications could involve games such as bowling and billiards.

APPENDIX
Plane detection is performed using the 640x480 point cloud produced by the depth camera. We implemented a fully deterministic approach based on multi-scale sampling, designed to be computationally efficient and robust to noise. While RANSAC and its variants [16], [17] are very effective for fitting a wide range of primitives, plane detection can benefit from a more specialized approach. Our proposal samples a depth frame using uniform rectangular grids which are gradually refined. This approach promotes the fast and robust removal of large flat regions, and progressively searches for smaller plane sections.
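The sampling schedule can be sketched as follows; the scale extents and square windows are placeholders, since the text does not give exact rectangle sizes.

```python
def sampling_rects(width=640, height=480, coarsest=160, finest=20):
    """Generate (x, y, w, h) sampling rectangles for each scale, from
    coarse to fine, with 50% overlap between neighboring rectangles.
    Square windows are used here for simplicity."""
    scales = []
    side = coarsest
    while side >= finest:
        step = side // 2                      # 50% overlap
        rects = [(x, y, side, side)
                 for y in range(0, height - side + 1, step)
                 for x in range(0, width - side + 1, step)]
        scales.append(rects)
        side //= 2                            # refine the grid
    return scales
```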
At each sampling scale, the depth frame is divided into rectangles with 50% overlap. For each rectangle, a gradient fill is applied at its center, effectively identifying a connected region in 3D space. For sufficiently large connected regions with points $\{(x_i, y_i, z_i)\}_{i=1}^{N}$, we estimate the least-squares plane $ax + by + cz + d = 0$ using the least-squares solution to

$$
\underbrace{\begin{bmatrix}
x_1 & y_1 & z_1 & 1 \\
x_2 & y_2 & z_2 & 1 \\
\vdots & \vdots & \vdots & \vdots \\
x_N & y_N & z_N & 1
\end{bmatrix}}_{A}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix} = 0. \quad (2)
$$
Note that the optimal plane is given by the right singular vector of $A$ associated with its smallest singular value, which is also the eigenvector of $A^T A$ with the smallest eigenvalue. In practice, one can subsample the connected region to reduce N.
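In code, the fit of Eq. (2) is a single SVD; a minimal sketch:

```python
import numpy as np

def fit_plane(points):
    """Fit ax + by + cz + d = 0 to an (N x 3) array of points by taking
    the right singular vector of A = [x y z 1] associated with the
    smallest singular value (Eq. (2))."""
    A = np.hstack([points, np.ones((len(points), 1))])
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    plane = Vt[-1]                    # rows of Vt are sorted by decreasing
    n = np.linalg.norm(plane[:3])     # singular value; take the last one
    return plane / n                  # unit normal: residuals are distances
```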
tics,” J. Acoust. Soc. Am, vol. 111, no. 4, pp. 1832–1846, 2002.
Similarly to RANSAC, we obtain the set of inliers, i.e., [12] ——, “Direct-to-reverberant energy ratio sensitivity,” J. Acoust. Soc.
the subset of points which are close to a least-squares plane. Am., vol. 112, no. 5, pp. 2110–2117, 2002.
Unlike RANSAC, we do not iterate over multiple plane [13] B. Shinn-Cunningham, “Localizing sound in rooms,” in Proc. ACM
SIGGRAPH, 2001.
candidates in search for the largest plane. Instead, we either [14] C. Zhang and P. Viola, “Multiple-instance pruning for learning efficient
accept a plane candidate at the current scale if the ratio of cascade detectors,” in Proc. NIPS, 2007.
inliers to the total number of points in the sampling rectangle [15] Z. Cao, Q. Yin, X. Tang, and J. Sun, “Face recognition with learning-
based descriptor,” in Proc. CVPR, 2010.
is sufficiently large, and reject it otherwise. Using a model [16] M. Fischler and R. Bolles, “Random sample consensus: a paradigm
for the depth camera, we define the inliers such that this for model fitting with applications to image analysis and automated
ratio test produces plane estimates with a given false positive cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395,
1981.
probability. [17] R. Schnabel, R. Wahl, and R. Klein, “Efficient ransac for point-cloud
For the Kinect depth camera, we assume that the depth map noise is Gaussian, with a variance given by [18]

$$
\sigma_z^2 = \frac{\sigma_0^2\, z^4}{f_d^2\, B^2}, \quad (3)
$$

where $z$ is the depth coordinate, $f_d = (f_x + f_y)/2$ is the mean focal length, and $\sigma_0$ and $B$ are constants. Let $t_0$ be the inlier distance threshold at a reference depth $z_0$. One can use (3) to determine the probability $p_0$ of having a depth error exceeding $t_0$. We use (3) to produce a depth-dependent inlier threshold $t(z)$, such that for all $z$, the depth error exceeds $t(z)$ with constant probability $p_0$. Assuming the independence of depth errors, for sufficiently large $N$ one should have approximately $N(1 - p_0)$ inliers. By comparing $N(1 - p_0)$ with the actual number of inliers, one can accept or reject a plane candidate. After extracting inliers for the connected region, we recompute the least-squares estimate, and extract inliers for the entire depth map. The least-squares estimate is computed again, and produces the accepted plane estimate after a final inlier extraction.
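A sketch of this acceptance test follows. Since $\sigma_z$ in Eq. (3) grows as $z^2$, scaling $t_0$ by $(z/z_0)^2$ keeps the exceedance probability constant; the threshold values and the acceptance ratio are placeholders.

```python
import numpy as np

def inlier_threshold(z, t0=0.02, z0=1.0):
    """Depth-dependent threshold t(z): scaling t0 by (z / z0)^2 keeps
    the probability of a depth error exceeding t(z) constant at every
    depth, because sigma_z in Eq. (3) is proportional to z^2."""
    return t0 * (z / z0) ** 2

def accept_plane(depth_errors, depths, p0=0.05, ratio=0.8):
    """Accept a candidate when the observed inlier count is close to
    the N * (1 - p0) expected under the noise model."""
    inliers = np.abs(depth_errors) < inlier_threshold(depths)
    expected = len(depth_errors) * (1.0 - p0)
    return bool(inliers.sum() >= ratio * expected), inliers
```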
This method is more computationally efficient than RANSAC, since it does not require testing a large number of randomly sampled plane candidates. By using a multiscale approach, it is guaranteed to extract large planes first, dramatically reducing the size of the remaining point cloud.

REFERENCES

[1] World Health Organization (WHO), "Fact sheet 282: Visual impairment and blindness," Oct. 2011. Available: http://www.who.int/mediacentre/factsheets/fs282/en/
[2] National Federation of the Blind, "Blindness statistics." Available: http://www.nfb.org/nfb/blindness_statistics.asp
[3] American Foundation for the Blind, "Facts and figures on adults with vision loss." Available: http://www.afb.org
[4] P. Meijer, "An experimental system for auditory image representations," IEEE Trans. Biomed. Eng., vol. 39, no. 2, pp. 112–121, 1992.
[5] J. Cronly-Dillon, K. Persaud, and R. Gregory, "The perception of visual images encoded in musical form: a study in cross-modality information transfer," Proc. R. Soc. B, vol. 266, no. 1436, p. 2427, 1999.
[6] G. Balakrishnan, G. Sainarayanan, R. Nagarajan, and S. Yaacob, "Wearable real-time stereo vision for the visually impaired," Engineering Letters, vol. 14, no. 2, 2007.
[7] G. Bologna, B. Deville, T. Pun, and M. Vinckenbosch, "Transforming 3D coloured pixels into musical instrument notes for vision substitution applications," EURASIP J. Image Video Process., 2007.
[8] G. Bologna, B. Deville, and T. Pun, "On the use of the auditory pathway to represent image scenes in real-time," Neurocomputing, vol. 72, no. 4-6, pp. 839–849, 2009.
[9] V. Algazi, R. Duda, D. Thompson, and C. Avendano, "The CIPIC HRTF database," in Proc. WASPAA, 2001.
[10] D. Mershon and L. King, "Intensity and reverberation as factors in the auditory perception of egocentric distance," Attention, Perception, & Psychophysics, vol. 18, no. 6, pp. 409–415, 1975.
[11] P. Zahorik, "Assessing auditory distance perception using virtual acoustics," J. Acoust. Soc. Am., vol. 111, no. 4, pp. 1832–1846, 2002.
[12] ——, "Direct-to-reverberant energy ratio sensitivity," J. Acoust. Soc. Am., vol. 112, no. 5, pp. 2110–2117, 2002.
[13] B. Shinn-Cunningham, "Localizing sound in rooms," in Proc. ACM SIGGRAPH, 2001.
[14] C. Zhang and P. Viola, "Multiple-instance pruning for learning efficient cascade detectors," in Proc. NIPS, 2007.
[15] Z. Cao, Q. Yin, X. Tang, and J. Sun, "Face recognition with learning-based descriptor," in Proc. CVPR, 2010.
[16] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[17] R. Schnabel, R. Wahl, and R. Klein, "Efficient RANSAC for point-cloud shape detection," Computer Graphics Forum, vol. 26, no. 2, pp. 214–226, 2007.
[18] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang, "3D deformable face tracking with a commodity depth camera," in Computer Vision – ECCV 2010, pp. 229–242, 2010.