An HRTF Based Approach Towards Binaural Sound Source Localization
ABSTRACT
With the evolution of smart headphones, hearables, and hearing aids, there is a need for technologies that improve
situational awareness. The device needs to constantly monitor real-world events and cue the listener to stay
aware of the outside world. In this paper, we develop a technique to identify the exact location of the dominant
sound source using the unique spectral and temporal features of the listener's head-related transfer functions (HRTFs).
Unlike most state-of-the-art beamforming technologies, this method localizes the sound source using just two
microphones, thereby reducing the cost and complexity of the technology. An experimental framework is set up
in the EmbodyVR anechoic chamber, and hearing aid recordings are carried out for several different trajectories,
SNRs, and turn rates. Results indicate that the source localization algorithms perform well for dynamically moving
sources at different SNR levels.
1 Introduction

In the last few years, we have seen a surge of hearables in the audio market with a variety of smart features. Listeners wearing these devices are increasingly cut off from the outside world, which is a potential safety hazard. In this paper, we primarily focus on hearing aid applications. The hearing aid can first use this technology to locate the direction of the prominent speech, and then virtually synthesize the speech from the detected direction. Sound, unlike vision, can provide us complete 360-degree situational awareness; this can eventually enable a biker to be aware of the possible presence of vehicles around, thereby improving safety. We demonstrate a unique approach that uses just two microphones to "detect the direction" of an acoustic event. This method makes use of the unique spectral and temporal features of the listener's head-related transfer functions (HRTFs) [1]. Using just two microphones enormously reduces the cost and computational complexity of this technology compared to the multiple-microphone techniques that are currently the state of the art.

HRTF based sound localization methods have traditionally been applied to human centered robotic systems for telepresence operations [2]. Martin et al. [2] reviewed several HRTF based sound source localization algorithms, such as the Matched Filtering Approach, the Source Cancellation Approach, and
Sunder, Wang HRTF based binaural source localization
the Cross-convolution Approach. The idea behind most HRTF based localization algorithms is to identify the pair of HRTFs that corresponds to the emitting position of the source, such that the correlation between the left and right microphone observations is maximal [1]. In the matched filtering approach, the inverses of the HRTFs are first computed and then filtered with the microphone observations. The index leading to the highest cross-correlation is the source location of interest. The disadvantage of this approach is that the inversion of the HRTFs can be problematic due to instability, primarily because of the linear-phase component of the HRTFs that encodes the ITDs. Keyrouz et al. [3] proposed a solution using outer-inner factorization, converting an unstable inverse to an anti-causal and bounded inverse. Keyrouz & Diepold [4] and Usman et al. [5] proposed the source cancellation approach, an extension of the matched filtering approach, in which the cross-correlation between the ratio of the recorded observations and every pair of HRTF ratios is computed. The main improvement is that the HRTFs need not be inverted, and can be precomputed and stored in memory. The reference signal method [6] uses four microphones: two for the HRTF-filtered signals and two others outside the ear canal for the original sound signals. While this method has several advantages, it does not fall under the scope of binaural localization, as it requires four microphones. To avoid any instability issue, Usman et al. [5] exploited the associative property of the convolution operator and proposed the cross-convolution approach, which we use to predict the index of the HRTF pair of interest; this is explained in detail in Section 3.3.

In this paper, the left and right microphone signals are analyzed and compared against a reference HRTF database to determine the direction of the source. The reference HRTF database is measured with a Head and Torso Simulator (HATS) in the EmbodyVR semi-anechoic chamber. The database was measured for 541 unique directions, resulting in a resolution of 10 by 10 degrees. As a first step, a reduction of the search space is carried out by computing the ITD (interaural time difference) between the left and right signals recorded from the hearing aid. By narrowing the search space, the source localization algorithm does not need to search all 541 pairs of HRTFs in the reference database, which reduces the computational complexity of the localization algorithm. The ITD search-space reduction is one of the key novelties of this paper. Another important contribution is the creation of an average compensation filter, or difference filter, that compensates for the difference between the position of the hearing aid placement and the entrance of the ear canal, where a typical HRTF is measured. This difference filter also compensates for the positional errors of the hearing aid every time it is repositioned.

To further narrow down the search space to the exact location of the sound source, the cross-convolution based localization approach [5] is used, which avoids any instability problem. The direction corresponding to the index of maximum cross-correlation is the estimated direction of the source. In addition, a Kalman filter is used to improve the localization of a moving source in the presence of external noise. In this work, we are able to predict the location of the sound source without prior knowledge of the type of trajectory: the first few samples are analysed to predict the speed and direction of the sound source, which improves the Kalman estimation.

This paper is organized as follows: Section 2 explains the experimental setup used for measuring the reference HRTF database as well as for recording the sound sources. The different processing blocks, or methods, used in this paper are explained in detail in Section 3. Results of the simulations are presented in Section 4 for two types of trajectories (full circle and circular segments). Discussion and future work are described in Section 5.

2 Experimental Setup

Our measuring system contains a two-axis (azimuth/elevation) motor platform, a HATS humanoid recording manikin, a single-channel speaker assembly, and an infra-red camera head-tracking system, all assembled in a semi-anechoic chamber. The HATS manikin was mounted on the vertical rotational motor platform, which allows for motion in azimuth. The speaker was mounted on an arc rail, driven by a steel cable connected to the other motor, enabling motion in elevation.

The motions of both the azimuth and elevation motors were controlled from a PC terminal outside the chamber, where we could set the trajectory and speed of both axes and record the real-time position of each axis from the sensors built into the platforms.
Fig. 1: Semi-anechoic chamber with experimental setup

Fig. 2: Block diagram for our processing strategy. Mic 1 and Mic 2 feed data acquisition, low-pass filtering, search-space reduction, and convolution based localization, which outputs the predicted source location. The reference HRTFs (HRTF1 ... HRTFn), measured at the entrance of the ear canal, are converted by the difference filter into equivalent HRTFs at the hearing aid microphone position.
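As a concrete illustration of the "reduce search space" stage in Fig. 2 (detailed in Section 3.2), the ITD can be taken as the cross-correlation lag with maximum correlation within +/- 1 ms, after which only reference HRTF pairs whose ITD lies within a tolerance of that estimate are retained. A minimal pure-Python sketch; the function names and the brute-force correlation loop are our own illustration, not the paper's implementation:

```python
def itd_lag(left, right, fs, max_lag_ms=1.0):
    """ITD estimate in samples: the lag of maximum cross-correlation
    between the (low-pass filtered) left and right signals, searched
    within +/- max_lag_ms (about +/- 1 ms for human heads)."""
    max_lag = int(fs * max_lag_ms / 1000.0)
    n = len(left)
    best_lag, best_corr = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # correlate left against right shifted by `lag`
        c = sum(left[i] * right[i - lag]
                for i in range(max(0, lag), min(n, n + lag)))
        if c > best_corr:
            best_corr, best_lag = c, lag
    return best_lag

def reduce_search_space(est_itd, ref_itds, tol=2):
    """Keep only HRTF indices whose reference ITD lies within
    +/- tol samples of the ITD estimated from the hearing aid."""
    return [i for i, itd in enumerate(ref_itds) if abs(itd - est_itd) <= tol]
```

With a tolerance of +/- 2 samples this typically discards the large majority of the 541 candidate directions, consistent with the reduction reported in Section 3.2.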
The hearing aid (HA) microphones are usually placed at the front of or behind the hearing aid casing, not at the entrance of the blocked ear canal, where HRTFs are typically measured. Thus, the reference HRTF database cannot be directly compared with the left/right HA microphone signals; it first needs to be converted to a database with a measuring position equivalent to that of the hearing aid. This conversion is carried out using a difference filter, which compensates both for the difference between the hearing aid microphone position and the HRTF measurement position, and for the variations in hearing aid microphone position each time the user puts the device on.

v) Difference curves for these multiple positions of the hearing aid are taken, and a highly optimized difference curve is designed.

3.2 Search Space Reduction

Interaural time difference (ITD) is an essential spatial audio cue that human beings use for sound localization [1]. The ITD is typically calculated by computing the cross-correlation between the left and right channels and taking the lag of maximum correlation lying within +/- 1 ms. A low-pass filter with a cut-off frequency of 1500
Hz is first applied to the left and right hearing aid microphone signals before computing the ITD, as the ITD is primarily a low-frequency cue, usually present below 1500 Hz. The low-pass filter also eliminates some of the high-frequency noise from the HA microphone signals, for better prediction of the direction. To reduce the search space, the ITD of the hearing aid signals is compared with those of the reference HRTF database using a predefined tolerance. The amount of tolerance defines the amount of space reduction that can be obtained; for example, choosing an ITD tolerance of +/- 2 samples reduces the search space by up to 92%.

3.3 Cross-Convolution Based Localization

We implemented cross-convolution based localization (Usman et al. [5]) for this next step. Among the HRTF based sound source localization techniques, e.g. the matched filter approach, the source cancellation approach, and the convolution based approach, the cross-convolution method is the most robust, delivers the highest accuracy, and is best suited for this application. In this method, the left and right observations, or hearing aid signals, are filtered with a pair of contralateral HRTFs. The filtered signals are then converted to the frequency domain before calculating the index of the maximum cross-correlation. The direction corresponding to this index is the estimated direction of the source:

Ŝ_L,i = H_R,i · X_L = H_R,i · H_L,i0 · S
Ŝ_R,i = H_L,i · X_R = H_L,i · H_R,i0 · S
Ŝ_L,i = Ŝ_R,i ⇔ i = i0                    (2)
i0 = argmax_i F{Ŝ_L,i} · F{Ŝ_R,i}

A Kalman filter takes an input that is known to have some error, uncertainty, or noise, and extracts the useful parts of interest to reduce that uncertainty or noise. The Kalman filter model assumes that the state of a system at time step t evolves from the prior state at time step t − 1 according to Equation 3:

x̂_t|t−1 = F_t x̂_t−1|t−1 + B_t u_t
P_t|t−1 = F_t P_t−1|t−1 F_t^T + Q_t        (3)

where x̂_t is the state vector containing the terms of interest for the system (e.g. position, velocity, heading at time t), u_t is the vector containing any control inputs (steering angle, throttle setting, etc.), F_t is the state transition matrix, which applies the effect of each system parameter at time t − 1 to the system state at time t (e.g. the position and velocity at time t − 1 both affect the position at time t), B_t is the control input matrix, which applies the effect of each control input in the vector u_t to the state vector, and P_t is the covariance matrix corresponding to the state vector.

Due to the nature of the Kalman predictor, we do not need prior knowledge of the historical information of the input, which in our case is the initial position of the external sound source. For the Kalman filter we implemented, we used the coordinated turn model (Roth et al. [9]), in which the state vector is defined as [x1, x2, v, h, w]^T, where x1/x2 are the Cartesian coordinates, v is the velocity, h is the heading angle, and w is the angular velocity. The Kalman filter takes the results of the cross-convolution localization as input and returns its best prediction of the position of the moving sound source.
4.1 Full circle motion

For experiments with the full circle trajectory, we record the sound source travelling from -180 deg to +180 deg in azimuth with respect to the HATS. The real-time position of the source is recorded simultaneously and used as the ground truth for comparison.

In the simulation process, we fix the frame rate at 6 frames per second, i.e. a frame length of 1/6 s, which is also the time interval T at which the state vector is updated from the previous state. The captured binaural recordings were first sectioned according to the frame length, and the ITD detection and cross-convolutional localization were then performed on these sections. In figures 3 to 6, each top row shows the noisy localization results. The Kalman filter takes this result as the state estimation input and returns the predicted location at each corresponding time frame. There is no prior knowledge of the initial source location; we only use the first 12 frames to initialize the parameters in the state. For the state vector [x1, x2, v, h, w]^T, we set v to the value calculated from this 12-frame estimation, which contains both the speed and the direction information. After the Kalman filtering, the predicted localization results are shown in the lower row. An error analysis is performed (Table 1) by computing the RMS difference between the Kalman predicted output and the ground truth (true trajectory). For the 2 rpm case, the errors of the WGN trial are around 10 degrees for both the high and low SNR scenarios, while the male speech error is slightly higher. The female speech errors are higher still, especially in the low SNR case, probably due to front/back confusion at the estimation step. For the 6 rpm case, the results are similar in general: the WGN and male cases work better than the female case.

Table 1: Kalman prediction errors in deg for the full circle trajectory

Table 2: Kalman prediction errors in deg for the clockwise segmented circle trajectory

Table 3: Kalman prediction errors in deg for the counter-clockwise segmented circle trajectory

Conditions    WGN     Male    Female
2rpm SNR60    13.59   15.35   22.78
2rpm SNR10    10.90   18.90   58.33
6rpm SNR60    15.80    4.46   58.22
6rpm SNR10    16.59   22.85   61.90

4.2 Segmented circle motion

Furthermore, we performed experiments with a segmented circular trajectory. In this simulation, the sound source moves from -90 to +90 deg (a semicircle) with respect to the HATS. The Kalman predictions and the ground truths are plotted in figures 5 and 6. The prediction algorithms are applied to these recordings and a similar error analysis is carried out (shown in Table 2 for clockwise and Table 3 for counter-clockwise movements). Similar to the full circle trajectory case, in the clockwise 2 rpm trial both the WGN and male trials deliver an accuracy of around 10 degrees at high SNR, while the female speech trial shows a relatively higher error of 22 degrees. With lower SNR, the performance is generally worse; this holds for both the clockwise and counter-clockwise trials. In some cases (such as the 6 rpm male speech trial), the Kalman prediction delivers very accurate localization results, owing to the accurate convolution based estimation at the previous step. Higher localization errors occurred in the female speech trials, mainly due to the high front/back confusion from the estimation step.
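The coordinated-turn prediction step (Equation 3) that produces these estimates can be sketched as below. The Euler discretization and the Jacobian used for F_t are our own illustrative choices, since the paper does not spell out its discretization:

```python
import math

def ct_predict(x, P, T, Q):
    """One prediction step of Equation 3 for the coordinated-turn
    state [x1, x2, v, h, w] (position, speed, heading, turn rate),
    with no control input (B_t u_t = 0)."""
    x1, x2, v, h, w = x
    x_pred = [x1 + T * v * math.cos(h),   # position advances along heading
              x2 + T * v * math.sin(h),
              v,                          # constant speed
              h + T * w,                  # heading advances by turn rate
              w]                          # constant turn rate
    # Jacobian F_t of the motion model above
    F = [[1.0, 0.0, T * math.cos(h), -T * v * math.sin(h), 0.0],
         [0.0, 1.0, T * math.sin(h),  T * v * math.cos(h), 0.0],
         [0.0, 0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 1.0, T],
         [0.0, 0.0, 0.0, 0.0, 1.0]]
    # P_pred = F P F^T + Q, with plain nested lists
    FP = [[sum(F[i][k] * P[k][j] for k in range(5)) for j in range(5)]
          for i in range(5)]
    P_pred = [[sum(FP[i][k] * F[j][k] for k in range(5)) + Q[i][j]
               for j in range(5)] for i in range(5)]
    return x_pred, P_pred
```

With the paper's frame rate of 6 frames per second, T = 1/6 s per prediction step.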
Fig. 3: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR levels. The trajectory is a full circle at a constant turning rate of 2 rpm.

Fig. 4: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR levels. The trajectory is a full circle at a constant turning rate of 6 rpm.

Fig. 5: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR levels. The trajectory is a segmented circle from -90 to 90 deg at a constant turning rate of 2 rpm.

Fig. 6: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR levels. The trajectory is a segmented circle from -90 to 90 deg at a constant turning rate of 6 rpm.
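The error analyses reported in Tables 1 to 3 are RMS differences between the Kalman output and the ground-truth azimuth. A minimal sketch; wrapping each difference to the shorter way around the circle is our assumption, so that a prediction of -179 deg against a ground truth of +179 deg counts as a 2-degree error rather than 358:

```python
import math

def rms_angle_error_deg(pred_deg, truth_deg):
    """RMS of the per-frame azimuth error, with each difference
    wrapped into the range [-180, 180) degrees."""
    assert len(pred_deg) == len(truth_deg) and pred_deg
    total = 0.0
    for p, t in zip(pred_deg, truth_deg):
        d = (p - t + 180.0) % 360.0 - 180.0
        total += d * d
    return math.sqrt(total / len(pred_deg))
```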
turning rate/directions. Overall, WGN and male speech perform better than female speech, possibly due to differences in the higher frequencies. Although not presented in detail in this paper due to its limited scope, we also found that the source localization algorithms are fairly robust to the choice of HRTFs: the reference HRTF database used for the comparison can differ from the HRTF of the person wearing the hearing aids. For future development, the system can be improved in the following respects. There is still room for eliminating the front/back ambiguity in the cross-convolution step, which would deliver a much more accurate estimate as input to the Kalman filter. Also, the algorithm's performance in highly reverberant environments remains untested. Furthermore, more complex trajectories could be tested to simulate various scenarios; for example, the matrices in the Kalman filter could be optimized for objects such as an incoming car with a variable turning rate in our coordinate system. Due to the principles of the processing procedure, the latency of this algorithm is minimal, making it ideal for real-time applications such as implementation on DSP platforms.

References

[8] Yuan, X., Han, C., Duan, Z., and Lei, M., "Comparison and choice of models in tracking target with coordinated turn motion," in 2005 7th International Conference on Information Fusion, volume 2, pp. 6–pp, IEEE, 2005.

[9] Roth, M., Hendeby, G., and Gustafsson, F., "EKF/UKF maneuvering target tracking using coordinated turn models with polar/Cartesian velocity," in 17th International Conference on Information Fusion (FUSION), pp. 1–8, IEEE, 2014.