
Audio Engineering Society

Convention Paper 10289


Presented at the 147th Convention
2019 October 16 – 19, New York
This convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at
least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been
reproduced from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The
AES takes no responsibility for the contents. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights
reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the
Audio Engineering Society.

An HRTF based approach towards binaural sound source localization
Kaushik Sunder1 and Yuxiang Wang2
1 EmbodyVR, San Mateo, California
2 School of Electrical and Computer Engineering, University of Rochester
Correspondence should be addressed to Kaushik Sunder ([email protected])

ABSTRACT
With the evolution of smart headphones, hearables, and hearing aids, there is a need for technologies that improve
situational awareness. The device needs to constantly monitor real-world events and cue the listener so that they
stay aware of the outside world. In this paper, we develop a technique to identify the exact location of the dominant
sound source using the unique spectral and temporal features of the listener's head-related transfer functions (HRTFs).
Unlike most state-of-the-art beamforming technologies, this method localizes the sound source using just two
microphones, thereby reducing the cost and complexity of this technology. An experimental framework is set up
in the EmbodyVR anechoic chamber and hearing aid recordings are carried out for several different trajectories,
SNRs, and turn rates. Results indicate that the source localization algorithms perform well for dynamic moving
sources at different SNR levels.
1 Introduction

In the last few years, we have seen a surge of hearables in the audio market with a variety of smart features.
Listeners wearing these devices are increasingly cut off from the outside world, which is a potential safety hazard.
In this paper, we primarily focus on hearing aid applications. The hearing aid can first use this technology to locate
the direction of the prominent speech, and then virtually synthesize the speech from the detected direction. Sound,
unlike vision, can provide us with complete 360 degree situational awareness. This can eventually enable a biker to
be aware of the possible presence of vehicles around, thereby improving safety. We demonstrate a unique approach
that uses just two microphones to “detect the direction” of an acoustic event. This method makes use of the unique
spectral and temporal features of the listener's head-related transfer functions (HRTFs) [1]. Using just two
microphones enormously reduces the cost and computational complexity of this technology compared to the
multiple-microphone techniques that are currently the state of the art.
HRTF based sound localization methods have traditionally been applied to human centered robotic systems for
telepresence operations [2]. Rothbucher et al. [2] reviewed several HRTF based sound source localization
algorithms, such as the matched filtering approach, the source cancellation approach, and the cross-convolution
approach. The idea behind most HRTF based localization algorithms is to identify the pair of HRTFs that
corresponds to the emitting position of the source, such that the correlation between the left and right microphone
observations is maximum [1]. In the matched filtering approach, the inverse of the HRTFs is first computed and
then filtered with the microphone observations. The index leading to the highest cross-correlation is the source
location of interest. The disadvantage of this approach is that the inversion of the HRTFs can be problematic due to
instability, primarily because of the linear phase component of the HRTFs that encodes the ITDs. Keyrouz et al. [3]
proposed a solution using outer-inner factorization, converting an unstable inverse into an anti-causal and bounded
inverse. Keyrouz and Diepold [4] and Usman et al. [5] proposed the source cancellation approach, which is an
extension of the matched filtering approach. In this method, the cross-correlation between all pairs of ratios of
recorded observations and ratios of HRTFs is taken. The main improvement is that the HRTFs need not be inverted
and can be precomputed and stored in memory. The reference signal method [6] uses four microphones: two for the
HRTF-filtered signals and two others outside the ear canal for the original sound signals. While this method has
several advantages, it does not fall under the scope of binaural localization as it requires four microphones. In order
to avoid any instability issue, Usman et al. [5] exploited the associative property of the convolution operator and
proposed the cross-convolution approach. In this paper, we use the cross-convolution approach to predict the index
of the HRTFs that is of interest. This is explained in detail in Section 3.3.

In this paper, the left and right microphone signals are analyzed and compared against a reference HRTF database
to determine the direction of the source. The reference HRTF database is measured with a Head and Torso
Simulator (HATS) in the EmbodyVR semi-anechoic chamber. The HRTF database was measured for 541 unique
directions, which results in a resolution of 10 by 10 degrees. As a first step, a reduction of the search space is
carried out by computing the ITD (interaural time difference) between the left and right signals recorded from the
hearing aid. By narrowing the search space, the source localization algorithm does not need to search all 541 pairs
of HRTFs in the reference database, which reduces the computational complexity of the localization algorithm.
The ITD search space reduction is one of the key novelties of this paper. Another important contribution of this
paper is the creation of an average compensation filter, or difference filter, that compensates for the difference
between the hearing aid placement and the entrance of the ear canal, where a typical HRTF is measured. This
difference filter also compensates for the positional errors of the hearing aid every time it is repositioned.

In order to further narrow down the search space to the exact location of the sound source, the cross-convolution
based localization approach [5] is used to avoid any instability problem. The direction corresponding to the index
of maximum cross-correlation is the estimated direction of the source. In addition, a Kalman filter is used to
improve the localization in the presence of external noise for a moving source. In this work, we are able to predict
the location of the sound source without prior knowledge of the type of trajectory. The first few samples are
analysed and used to predict the speed and direction of the sound source, which improves the Kalman estimation.

This paper is organized as follows: Section 2 explains the experimental setup used in this paper for measuring the
reference HRTF database as well as for recording sound sources. The different processing blocks or methods used
in this paper are explained in detail in Section 3. Results of the simulations are presented in Section 4 for two types
of trajectories (full circle and circular segments). Discussion and future work are described in Section 5.

2 Experimental Setup

Our measuring system contains a two-axis (azimuth/elevation) motor platform, a HATS humanoid recording
manikin, a single-channel speaker assembly, and an infra-red camera head-tracking system, all assembled in a
semi-anechoic chamber. The HATS manikin was mounted on the vertical rotational motor platform, which allows
for motion in azimuth. The speaker was mounted on an arc rail, driven by a steel cable connected to the other
motor, enabling motion in elevation.

The motion of both the azimuth and elevation motors was controlled from a PC terminal outside the chamber,
where we could set the trajectory and speed of both axes and record the real-time position of each axis from the
sensors built into the platforms.

[Fig. 1: Semi-anechoic chamber with experimental setup]

[Fig. 2: Block diagram for our processing strategy. The Mic 1 and Mic 2 signals pass through data acquisition,
low-pass filtering, search-space reduction (using the reference HRTF ITDs), convolution based localization (using
the reference HRTFs), and predictors (Kalman filtering), which output the predicted source location. The reference
HRTF database (HRTF1 ... HRTFn), measured at the entrance of the ear canal, is converted by the difference filter
into an equivalent database at the hearing aid microphone position.]

With this experimental setup, we are able to simulate an external sound source moving with respect to the listener,
along with its corresponding real-time positions as the ground truth.

In order to obtain a reference HRTF database, HRTFs were measured on the HATS with the experimental setup
described above. HRTFs were measured for 541 unique directions, owing to a resolution of 10 degrees in both
azimuth and elevation. The same experimental setup is also used for the recordings simulating a moving sound
source, which are described in the Results section. The methods described in Section 3 were then applied to the
recordings.

3 Methods

3.1 Difference Filter

The hearing aid (HA) microphones are usually placed in front of or behind the hearing aid casing and not at the
entrance of the blocked ear canal, where HRTFs are typically measured. Thus, the reference HRTF database cannot
be directly compared with the left/right HA microphone signals; it needs to be converted to a database with a
measuring position equivalent to that of the hearing aid. This conversion is carried out using a difference filter. The
difference filter compensates both for the difference between the hearing aid mic position and the HRTF
measurement position, and for the variations in hearing aid mic position when the user puts the device on.

In order to calculate the difference filter, the following steps are performed:
i) HRTF measurements are first carried out at the entrance of the ear canal on the B&K Head and Torso Simulator
(HATS).
ii) Sivantos hearing aids are mounted on the ears of the HATS and the same HRTF measurements are carried out,
this time with the hearing aid front/rear mics.
iii) In order to account for the variability in these measurements, the hearing aids are repositioned several times and
a new measurement is taken each time.
iv) Inverse filtering of the HATS measurement is performed with a regularization parameter, and the solution is
then convolved with the hearing aid measurements in order to obtain the difference filter (see the sketch after this
list):

    Diff-Filter = HRTF_Hearing-aid / HRTF_HATS                                  (1)

v) The difference curves for these multiple positions of the hearing aid are taken and a highly optimized difference
curve is designed.
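The inverse filtering in step iv) and the division in Eq. (1) are typically carried out in the frequency domain with a
small regularization term, so that notches in the HATS response do not blow up the quotient. The sketch below is a
minimal illustration of that idea, not the authors' implementation; the FFT length, the regularization constant, and
the simple averaging over the repositioned fits (as a stand-in for step v) are assumptions.

import numpy as np

def difference_filter(hrir_hats, hrirs_hearing_aid, n_fft=512, eps=1e-3):
    """Estimate a difference filter per Eq. (1): HRTF_Hearing-aid / HRTF_HATS.

    hrir_hats:         impulse response measured at the blocked ear-canal entrance
    hrirs_hearing_aid: list of impulse responses measured at the hearing aid mic,
                       one per repositioning of the device
    Returns a single time-domain difference filter averaged over the repositionings.
    """
    H_hats = np.fft.rfft(hrir_hats, n_fft)
    # Regularized (Tikhonov-style) inverse of the HATS measurement.
    inv_hats = np.conj(H_hats) / (np.abs(H_hats) ** 2 + eps)
    # Apply the inverse to each hearing aid measurement (multiplication in frequency
    # corresponds to the convolution described in step iv).
    diffs = [np.fft.rfft(h, n_fft) * inv_hats for h in hrirs_hearing_aid]
    # Average the difference curves over the repositioned fits (a stand-in for step v).
    return np.fft.irfft(np.mean(diffs, axis=0), n_fft)

# Toy usage with synthetic impulse responses.
rng = np.random.default_rng(0)
hats = rng.standard_normal(256) * np.exp(-np.arange(256) / 32)
ha_fits = [np.convolve(hats, [1.0, 0.3, 0.1])[:256] for _ in range(3)]
print(difference_filter(hats, ha_fits).shape)  # (512,)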

3.2 Search Space Reduction

Interaural time difference (ITD) is an essential spatial audio cue that human beings use for sound localization [1].
The ITD is typically calculated by computing the cross-correlation between the left and right channels and taking
the maximum correlation lying within +/- 1 ms. A low-pass filter with a cut-off frequency of 1500 Hz is first
applied to the left and right hearing aid microphone signals before computing the ITD, as the ITD is primarily a
low-frequency cue that is usually present below 1500 Hz. The low-pass filter also eliminates some of the
high-frequency noise from the HA mic signals for a better prediction of the direction. In order to reduce the search
space, the ITD of the hearing aid signals is compared with the ITDs of the reference HRTF database using a
predefined tolerance. The amount of tolerance defines the amount of space reduction that can be obtained. For
example, choosing an ITD tolerance of +/- 2 samples reduces the search space by up to 92%.
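As a rough illustration of this step, the sketch below low-pass filters both channels at 1500 Hz, estimates the ITD
from the cross-correlation peak within +/- 1 ms, and keeps only the HRTF directions whose reference ITDs lie
within the chosen tolerance. This is not the authors' code: the Butterworth filter and its order, the helper names, and
the synthetic data are assumptions; the paper specifies the 1500 Hz cut-off, the +/- 1 ms correlation window, and a
+/- 2 sample tolerance as an example.

import numpy as np
from scipy.signal import butter, lfilter

def estimate_itd(left, right, fs, max_itd_s=1e-3, cutoff_hz=1500, order=4):
    """Estimate the ITD (in samples) from low-pass filtered binaural signals."""
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    l, r = lfilter(b, a, left), lfilter(b, a, right)
    max_lag = int(round(max_itd_s * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    # Cross-correlation restricted to physically plausible lags (within +/- 1 ms).
    xcorr = [np.dot(l[max(0, -k):len(l) - max(0, k)],
                    r[max(0, k):len(r) - max(0, -k)]) for k in lags]
    return lags[int(np.argmax(xcorr))]

def prune_directions(itd_obs, reference_itds, tol_samples=2):
    """Keep only the HRTF indices whose reference ITD lies within the tolerance."""
    return np.flatnonzero(np.abs(np.asarray(reference_itds) - itd_obs) <= tol_samples)

# Toy usage: a pure interaural delay stands in for a hearing aid recording.
fs = 16000
rng = np.random.default_rng(1)
src = rng.standard_normal(fs)
left, right = src, np.roll(src, 8)              # right channel delayed by 8 samples
itd = estimate_itd(left, right, fs)
candidates = prune_directions(itd, reference_itds=rng.integers(-12, 13, size=541))
print(itd, len(candidates))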
3.3 Cross-Convolution Based Localization

We implemented cross-convolution based localization (Usman et al. [5]) for this next step. Among all the HRTF
based sound source localization techniques, e.g. the matched filter approach, the source cancellation approach, and
the convolution based approach, the cross-convolution method is the most robust, delivers the highest accuracy,
and is best suited for this application. In this method, the left and right observations, i.e. the hearing aid signals, are
filtered with a pair of contralateral HRTFs. The filtered signals are then converted to the frequency domain before
calculating the index of the maximum cross-correlation. The direction corresponding to this index is the estimated
direction of the source:

    Ŝ_{L,i} = H_{R,i} · X_L = H_{R,i} · H_{L,i0} · S
    Ŝ_{R,i} = H_{L,i} · X_R = H_{L,i} · H_{R,i0} · S
    Ŝ_{L,i} = Ŝ_{R,i}  ⇔  i = i0                                                (2)
    i0 = argmax_i { F{Ŝ_{L,i}} ⋆ F{Ŝ_{R,i}} }

where i0 is the best estimate from this method, X_L and X_R are the left and right observations of the hearing aids,
H_{L,i} and H_{R,i} are the reference HRTFs obtained after applying the difference filter, F{·} denotes the Fourier
transform, and ⋆ denotes cross-correlation.
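A minimal sketch of the cross-convolution test in Eq. (2): for each candidate index, the left observation is filtered
with the right HRIR and vice versa, and the candidate is scored by the peak cross-correlation of the two
cross-convolved signals, evaluated in the frequency domain. This is an illustration rather than the authors' code; the
normalization of the score (so that level differences between candidate directions do not bias the argmax) and the
toy HRIRs are our own choices.

import numpy as np

def cross_convolution_localize(x_left, x_right, hrirs_left, hrirs_right, candidates=None):
    """Return the index i0 maximizing the correlation of (h_R,i * x_L) with (h_L,i * x_R), cf. Eq. (2)."""
    n = len(x_left) + hrirs_left.shape[1] - 1
    XL, XR = np.fft.rfft(x_left, n), np.fft.rfft(x_right, n)
    if candidates is None:                     # e.g. the indices kept by the ITD pruning step
        candidates = range(hrirs_left.shape[0])
    best_i, best_score = None, -np.inf
    for i in candidates:
        HL, HR = np.fft.rfft(hrirs_left[i], n), np.fft.rfft(hrirs_right[i], n)
        S_hat_L, S_hat_R = HR * XL, HL * XR    # F{S_L,i}, F{S_R,i}
        sl, sr = np.fft.irfft(S_hat_L, n), np.fft.irfft(S_hat_R, n)
        # Peak of the circular cross-correlation, normalized so that level
        # differences between candidate directions do not bias the argmax.
        xcorr = np.fft.irfft(S_hat_L * np.conj(S_hat_R), n)
        score = np.max(xcorr) / (np.linalg.norm(sl) * np.linalg.norm(sr) + 1e-12)
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Toy usage: random short HRIR pairs; direction 3 generates the observations.
rng = np.random.default_rng(2)
n_dirs, ir_len = 8, 16
hrirs_L = rng.standard_normal((n_dirs, ir_len))
hrirs_R = rng.standard_normal((n_dirs, ir_len))
s = rng.standard_normal(2048)
x_L, x_R = np.convolve(s, hrirs_L[3]), np.convolve(s, hrirs_R[3])
print(cross_convolution_localize(x_L, x_R, hrirs_L, hrirs_R))  # expected: 3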

3.4 Kalman Filter Prediction

In this step, a Kalman filter ([2], [7], [8]) is used to improve the localization accuracy in the presence of external
noise for a moving source, which in our case is the error in the estimates produced by the cross-convolution
localization.

A Kalman filter takes an input that is known to contain some error, uncertainty, or noise, and extracts the useful
parts of interest to reduce that uncertainty or noise. The Kalman filter model assumes that the state of a system at
time step t evolves from the prior state at time step t - 1 according to Equation 3:

    x̂_{t|t-1} = F_t x̂_{t-1|t-1} + B_t u_t
    P_{t|t-1} = F_t P_{t-1|t-1} F_t^T + Q_t                                     (3)

where x_t is the state vector containing the terms of interest for the system (e.g. position, velocity, heading at time
t), u_t is the vector containing any control inputs (steering angle, throttle setting, etc.), F_t is the state transition
matrix, which applies the effect of each system parameter at time t - 1 on the system state at time t (e.g. the position
and velocity at time t - 1 both affect the position at time t), B_t is the control input matrix, which applies the effect
of each control input in u_t on the state vector, and P_t is the covariance matrix corresponding to the state vector.

Due to the nature of the Kalman predictor, we do not need prior knowledge of the history of the input, which in our
case is the initial position of the external sound source. For the Kalman filter we implemented, we used the
coordinated turn model (Roth et al. [9]), where the state vector is defined as [x1, x2, v, h, w]^T, with x1/x2 the
Cartesian coordinates, v the velocity, h the heading angle, and w the angular velocity. The Kalman filter takes the
results from the cross-convolution localization as input and returns its best prediction of the moving sound source.
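The prediction step of Eq. (3) with the coordinated turn state [x1, x2, v, h, w]^T could look like the sketch below,
which propagates the state with the coordinated turn motion model and uses a finite-difference Jacobian for the
covariance propagation (an EKF-style linearization, since the model is nonlinear in h and w). This is a sketch under
our own assumptions, not the authors' implementation: the measurement update from the cross-convolution
azimuth estimate is omitted, and the noise values are placeholders.

import numpy as np

def ct_predict(x, P, T, Q):
    """Coordinated turn prediction step, cf. Eq. (3).
    State x = [x1, x2, v, h, w]: position, speed, heading angle, angular velocity."""
    def f(s):
        x1, x2, v, h, w = s
        if abs(w) < 1e-6:                      # near-zero turn rate: straight-line limit
            return np.array([x1 + v * T * np.cos(h), x2 + v * T * np.sin(h), v, h, w])
        return np.array([x1 + (v / w) * (np.sin(h + w * T) - np.sin(h)),
                         x2 + (v / w) * (np.cos(h) - np.cos(h + w * T)),
                         v, h + w * T, w])

    # Jacobian F_t of the motion model, approximated here by central finite
    # differences (an analytic Jacobian would normally be used in an EKF).
    eps = 1e-6
    F = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(5)])
    return f(x), F @ P @ F.T + Q

# Toy usage: one prediction over a 1/6 s frame with a 12 deg/s (2 rpm) turn rate.
T = 1 / 6
x = np.array([1.0, 0.0, 0.2, np.pi / 2, np.deg2rad(12.0)])
P, Q = np.eye(5) * 0.1, np.eye(5) * 1e-3
x_pred, P_pred = ct_predict(x, P, T, Q)
print(np.round(x_pred, 4))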

4 Results

In our simulation experiments, we used sound sources of different types (white Gaussian noise, male speech,
female speech) and made recordings at different turning rates in azimuth: 12 and 36 deg/s, which correspond to 2
and 6 rpm, respectively. After the recordings, we added non-directional ambient noise to simulate the best and
worst SNR scenarios (60 and 10 dB SNR). The simulations were carried out for two kinds of trajectories: full
circle and segmented circle.

Table 1: Kalman prediction errors in degrees for the full circle trajectory

    Conditions     WGN     Male    Female
    2 rpm SNR60    12.47   12.79   26.72
    2 rpm SNR10     9.87   16.48   50.17
    6 rpm SNR60    11.72   11.50   32.30
    6 rpm SNR10     7.41   13.90   38.44

4.1 Full Circle Motion

For the experiments with the full circle trajectory, we record the sound source travelling from -180 deg to +180 deg
in azimuth with respect to the HATS. The real-time position of the source is also recorded simultaneously and is
used as the ground truth for comparison.

In the simulation process, we fix the frame rate to 6 frames per second, i.e. the frame length is 1/6 s, which is also
the time interval T at which we update the current state from the previous state of the state vector. The captured
binaural recordings were first sectioned according to the frame length. The ITD detection and cross-convolution
localization were then performed on these recordings. In Figures 3 to 6, each top row shows the noisy localization
results. The Kalman filter then takes this result as the state estimation input and returns the predicted location at
each corresponding time frame. There is no prior knowledge of the initial source locations; we just use the first 12
frames to initialize the parameters in the state. For the state vector [x1, x2, v, h, w]^T, we set v to the value
calculated from this 12-frame estimation, which contains both the speed and direction information. After the
Kalman filter process, the predicted localization results are shown in the lower row. An error analysis is performed
(Table 1) by computing the RMS difference between the Kalman predicted output and the ground truth (true
trajectory). For the 2 rpm case, the errors of the WGN trials are around 10 degrees for both the high and low SNR
scenarios, while the male speech error is slightly higher. The female speech errors are higher still, especially in the
low SNR case, probably due to the influence of front/back confusion at the estimation step. For the 6 rpm case, the
results are similar in general; the WGN and male speech cases work better than the female speech cases.

Table 2: Kalman prediction errors in degrees for the clockwise segmented circle trajectory

    Conditions     WGN     Male    Female
    2 rpm SNR60    10.58   13.53   22.04
    2 rpm SNR10     8.58   18.99   57.78
    6 rpm SNR60     5.93    4.31   61.16
    6 rpm SNR10     3.69    8.72   35.47

Table 3: Kalman prediction errors in degrees for the counter-clockwise segmented circle trajectory

    Conditions     WGN     Male    Female
    2 rpm SNR60    13.59   15.35   22.78
    2 rpm SNR10    10.90   18.90   58.33
    6 rpm SNR60    15.80    4.46   58.22
    6 rpm SNR10    16.59   22.85   61.90

4.2 Segmented Circle Motion

Furthermore, we performed experiments with a segmented circular motion trajectory. In this simulation, the sound
source moves from -90 to +90 deg (a semicircle) with respect to the HATS. The Kalman predictions and the
ground truths are plotted in Figures 5 and 6. The prediction algorithms are applied to these recordings and a similar
error analysis is carried out (shown in Table 2 for clockwise and Table 3 for counter-clockwise movements).
Similar to the full circle trajectory case, in the clockwise 2 rpm trial both the WGN and male speech trials deliver
an accuracy of around 10 degrees in the high SNR case, while the female speech trial shows a relatively higher
error of 22 degrees. With lower SNR, the performance is generally worse. This conclusion holds for both the
clockwise and counter-clockwise trials. In some cases (such as the 6 rpm male speech trial), the Kalman prediction
is able to deliver very accurate localization results, owing to the accurate convolution based estimation at the
previous step. Higher localization errors occurred in the female speech trials, mainly due to the high front/back
confusion from the estimation step.
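One detail of the error analysis described above (the RMS difference between the Kalman output and the
ground-truth trajectory, Tables 1 to 3) worth making explicit is the circular nature of azimuth: a prediction of
-179 deg against a ground truth of +179 deg should count as a 2 degree error, not 358 degrees. The paper only
states that the RMS difference is computed; the wrap-around handling in the sketch below is therefore an
assumption.

import numpy as np

def rms_azimuth_error(predicted_deg, truth_deg):
    """RMS azimuth error in degrees, with differences wrapped to [-180, 180)
    so that the discontinuity at +/-180 deg does not inflate the error."""
    diff = (np.asarray(predicted_deg) - np.asarray(truth_deg) + 180.0) % 360.0 - 180.0
    return float(np.sqrt(np.mean(diff ** 2)))

# Example: a constant 2 deg offset, including one estimate that crosses the
# -180/+180 boundary, yields an RMS error of 2 deg rather than being inflated.
truth = np.array([-179.0, -90.0, 0.0, 90.0, 179.0])
pred = np.array([-177.0, -88.0, 2.0, 92.0, -179.0])
print(rms_azimuth_error(pred, truth))  # 2.0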

Fig. 3: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR
levels. The trajectory is a full circle at a constant turning rate of 2 rpm.

Fig. 4: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR
levels. The trajectory is a full circle at a constant turning rate of 6 rpm.

5 Discussion

The results show that even with severe front/back confusion in the cross-convolution results, the Kalman filters
that we used are able to return an accurate prediction of the corresponding source locations when compared to the
azimuth ground truth. The turning rate calculated from the Kalman filtered results shows good consistency with
the actual value across the different SNR and turning rate cases. The predicted localization error is as low as 10
degrees or less in some trials. The performance is robust across SNR levels and turning rates/directions.

Fig. 5: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR
levels. The trajectory is a segmented circle from -90 to +90 deg at a constant turning rate of 2 rpm.

Fig. 6: Comparison of the Kalman filter performance with WGN/male/female sound sources at different SNR
levels. The trajectory is a segmented circle from -90 to +90 deg at a constant turning rate of 6 rpm.

Overall, WGN and male speech perform better than female speech, possibly due to differences in the higher
frequencies. Although not presented in detail in this paper due to its limited scope, we also found that the source
localization algorithms are fairly robust to the choice of HRTFs. This means that the reference HRTF database
used for comparison can differ from the HRTF of the person wearing the hearing aids. For future development, we
can improve the system in the following aspects. There is still room for eliminating the front/back ambiguity in the
cross-convolution step, which would deliver a much more accurate estimate for the input of the Kalman filter.

Also, the algorithm's performance in highly reverberant environments remains untested. Furthermore, more
complex trajectories could be tested to simulate various scenarios; e.g. the matrices in the Kalman filter could be
optimized for objects such as an incoming car, with a variable turning rate in our coordinate system. Due to the
principles of the processing procedure, the latency of this algorithm is minimal, making it ideal for real-time
applications such as implementation on DSP platforms.

References

[1] Begault, D. R. and Trejo, L. J., “3-D sound for virtual reality and multimedia,” 2000.

[2] Rothbucher, M., Kronmüller, D., Durkovic, M., Habigt, T., and Diepold, K., “HRTF sound localization,” in
Advances in Sound Localization, INTECH Open Access Publisher, 2011.

[3] Keyrouz, F., Diepold, K., and Dewilde, P., “Robust 3D robotic sound localization using state-space HRTF
inversion,” in 2006 IEEE International Conference on Robotics and Biomimetics, pp. 245–250, IEEE, 2006.

[4] Keyrouz, F. and Diepold, K., “An enhanced binaural 3D sound localization algorithm,” in 2006 IEEE
International Symposium on Signal Processing and Information Technology, pp. 662–665, IEEE, 2006.

[5] Usman, M., Keyrouz, F., and Diepold, K., “Real time humanoid sound source localization and tracking in a
highly reverberant environment,” in 2008 9th International Conference on Signal Processing, pp. 2661–2664,
IEEE, 2008.

[6] Keyrouz, F. and Saleh, A. A., “Intelligent sound source localization based on head-related transfer functions,”
in 2007 IEEE International Conference on Intelligent Computer Communication and Processing, pp. 97–104,
IEEE, 2007.

[7] Li, X. R. and Jilkov, V. P., “Survey of maneuvering target tracking. Part I. Dynamic models,” IEEE
Transactions on Aerospace and Electronic Systems, 39(4), pp. 1333–1364, 2003.

[8] Yuan, X., Han, C., Duan, Z., and Lei, M., “Comparison and choice of models in tracking target with
coordinated turn motion,” in 2005 7th International Conference on Information Fusion, volume 2, 6 pp., IEEE,
2005.

[9] Roth, M., Hendeby, G., and Gustafsson, F., “EKF/UKF maneuvering target tracking using coordinated turn
models with polar/Cartesian velocity,” in 17th International Conference on Information Fusion (FUSION),
pp. 1–8, IEEE, 2014.
