
Abstract

Sound source localization is one of the basic and essential techniques for intelligent robots in terms of human-robot interaction and has been utilized in various engineering fields. This paper suggests a new localization method using an inter-channel time difference trajectory, which is a new localization cue for efficient 3-D localization. As one of the ways to realize the proposed cue, a two-channel rotating array is employed. Two microphones are attached on the left and right sides of the spherical head. One microphone is in a circular motion on the right side, while the other is fixed on the left side. According to the rotating motion of the array, the (source) direction-dependent characteristics of the trajectories are analysed using the Ray-Tracing formula extended for 3-D models. In simulation, the synthesized signals generated by the fixed and rotating microphone signal models were used as the output signals of the two microphones. The simulation showed that the localization performance is strongly dependent on the azimuthal position of a source, which is caused by the asymmetry of the trajectory amplitude. Additionally, the experimental results of the two experiments carried out in the room environment demonstrated that the proposed system can localize a Gaussian noise source and a voice source in 3-D space.

Keywords

Three-dimensional Sound Source Localization, Inter-channel Time Difference Trajectory, Rotating Microphone Array, Ray-tracing Formula, Human-Robot Interaction

1. Introduction

Recently, intelligent robots have been developed not only to support arduous human tasks, but also to interact with people in order to meet various human needs [1, 2]. As an independent object with its own intelligence [3], a robot needs to recognize environmental changes, such as the appearance of unidentified objects or acoustic events relevant to completing its missions. For example, robots working in households should detect user voices and simultaneously be aware of other acoustic events, such as noises emitted from home appliances and other voices from electric devices. As a result, they can pay attention to speakers with more natural human-robot interaction (HRI) skills. In this situation, the technology of sound source localization (SSL) is employed to estimate the acoustic source direction using the acoustic signals from the microphone array; this is one of the most important building blocks of HRI. In addition, intelligent robots need to estimate the azimuth and elevation angles of a source together (i.e., 3-D SSL), because a sound event can occur at an arbitrary direction in 3-D space. It is noteworthy, however, that a robot must run many tasks simultaneously with limited resources, so the computational power available for SSL is restricted. As a result, a computationally efficient 3-D SSL method is increasingly required. In this paper, the source direction is defined by the inter-aural polar coordinates shown in Figure 1.

Figure 1.

A sound source direction is defined in the inter-aural polar coordinates by both the azimuthal angle (φ_S) and the elevation angle (θ_S). The sagittal plane is a vertical plane dividing the space into right and left halves. Sources on each sagittal plane share the same azimuth. The median plane is the mid-sagittal plane that bisects the space symmetrically from left to right. The horizontal plane is perpendicular to the sagittal plane and passes through the centre of the coordinates.

In the last few decades, many different SSL algorithms that are applicable to intelligent robots [4, 5] and also other engineering systems (e.g., teleconference systems [6] and surveillance units [7]) have been proposed. Even if the microphone array size, shape and number of microphones differ due to the constraints of various applications, certain localization cues (or direction estimation cues) are commonly used: inter-channel time difference (ICTD), inter-channel level difference (ICLD) and inter-channel spectral difference (ICSD). ICTD is defined as the time difference between the arrivals of a sound wave-front at the microphones; ICLD is defined as the difference between the sound pressure levels at the microphones; and ICSD is the difference between the spectral contents at the microphones. ICTD has been used as the most powerful localization cue [8–10] in almost all applications. Comparatively, ICLD and ICSD have not been used as frequently for practical applications, but are nevertheless employed in biomimetic research, including that of ear-based SSL systems [11, 12].

In general, most SSL systems use more than four microphones for 3-D SSL. Except for circumstances in which directional microphones are used, if only two microphones are used and their locations do not change with time, 3-D SSL is not possible, because front-back confusion occurs due to the existence of many directions sharing the same localization cues, even for single-source SSL [13, 14]. This is called the cone-of-confusion in 3-D space. In the absence of an additional structure (e.g., a spiral-shaped structure), more than four microphones in different (imaginable) planes are necessary in order to solve the cone-of-confusion problem inherent to 3-D SSL [15]. However, when a two-channel moving array is used for 3-D SSL, the measured localization cue, such as ICTD, changes according to the array motion. From the changing pattern of the ICTDs, 3-D SSL is then expected to be achievable. The (microphone) position-dependent ICTDs can be represented as below:

τ (φ_{S}, θ_{S} | α)

(1)

where α is the parameter that identifies the location or the motion of the moving microphone. Here, the (source) direction- and (microphone) position-dependent ICTDs are named as the ICTD trajectory, which is the new concept for a localization cue for efficient 3-D SSL.

This paper proposes a 3-D SSL method using the ICTD trajectory induced by the circular motion of the two-channel rotating array. This array was selected as one of the possible ways to realize a specific ICTD trajectory. One of the two microphones is attached to the rotating plate on the right side of the spherical head, indicated by the red-coloured circle in Figure 2; the other microphone is fixed on the left side of the head, indicated by the blue-coloured circle. Figure 2 shows the schematic drawing of the suggested two-channel rotating microphone array installed on the spherical head. In this paper, in order to generate the known movement of the array, the circular motion is given to the right-sided plate.

Figure 2.

Schematic of the rotating microphone array installed on the spherical head. One of two microphones is attached to the rotating plate on the right side of the head (red-coloured circle). In this paper we call this component the “rotating microphone”. The rotating part is moving in a clockwise direction on the Y_RZ_R plane and the shift angle of this part is measured with respect to the + Z_R axis. The other microphone is fixed on the left side of the head (blue-coloured circle). It is called the “fixed microphone”.

This paper is organized as follows: in section 2, we introduce the new specific localization cue that is the ICTD trajectory relevant to the circular motion of the array. The mathematical derivation of the ICTD trajectory is presented using the extended Ray-Tracing formula for 3-D models. The relationship between the parameters of the ICTD trajectory and the source direction is also presented. Section 3 describes the proposed 3-D SSL algorithm based on the source direction estimator. In section 4, the localization performance of the proposed SSL algorithm is examined using simulations: the signal models of both rotating and fixed microphones are presented. Section 5 shows the experimental setup and the results using two kinds of sources. The discussion is presented in section 6 and the concluding remarks are given in section 7.

2. Localization Cue: ICTD Trajectory

Most of the conventional localization algorithms have been developed using a microphone array fixed at given positions [4, 5, 8–12, 15] and with constant direction-dependent cues, on the assumption that a source position does not vary. The most commonly used time delay estimation (TDE) method is the generalized cross-correlation (GCC) function, which is employed to estimate the ICTD between the selected pair of microphones [16, 17]. Among various GCC functions, the GCC-phase transform (PHAT) function is widely used because it is well known for its robust estimation of ICTD in the reverberant field [18].

If we use N omnidirectional microphones, N(N-1)/2 GCC-PHAT functions are obtainable. The GCC-PHAT function between the i^th and j^th microphones is defined as below [16]:

R_{x_{i} x_{j}} (τ) = \int_{- \infty}^{\infty} \frac{G_{x_{i} x_{j}} (f)}{| G_{x_{i} x_{j}} (f) |} e^{j 2 π f τ} d f

(2)

where f is the frequency and τ is a time delay variable. x_i and x_j are defined as the i^th and j^th microphone output signals. G_{x_ix_j} and R_{x_ix_j} are the cross-spectral density function and the GCC-PHAT function between x_i and x_j, respectively. The measured ICTD between x_i and x_j, denoted τ_ij, is calculated by τ_ij = argmax_τ R_{x_ix_j}(τ). After that, the estimated direction of the sound source can be found from the relationship between the geometry of the microphone array and the multiple ICTDs. For example, when two microphones are placed on the horizontal plane and apart from each other by L m, the azimuth of a source can be estimated by using equation (3) [19]:

{\hat{φ}}_{S} = \cos^{- 1} (\frac{c \cdot τ_{12}}{L})

(3)

where c is the speed of sound and φ̂_S is the estimated azimuth. For example, SSL using ICTD maps [20, 21] was applied to the microphone array fixed on the robot head.
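Equations (2) and (3) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `gcc_phat` and `azimuth_from_ictd`, the zero-padding length, the small constant that guards the division, the clipping of the arccos argument and the default c = 343 m/s are all assumptions made for illustration.

```python
import numpy as np

def gcc_phat(x_i, x_j, fs, max_tau=None):
    """Discrete sketch of equation (2): GCC-PHAT between two signals.

    Returns (tau_hat, lags, r). A positive tau_hat means x_i lags x_j.
    """
    n = len(x_i) + len(x_j)              # zero-pad to limit circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    G = X_i * np.conj(X_j)               # cross-spectral density estimate
    G /= np.maximum(np.abs(G), 1e-12)    # PHAT weighting (guard is an assumption)
    r = np.fft.irfft(G, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that lags run from -max_shift to +max_shift samples.
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1) / fs
    return lags[np.argmax(r)], lags, r

def azimuth_from_ictd(tau_12, spacing, c=343.0):
    """Equation (3): azimuth in radians from one ICTD and the spacing L in metres."""
    return np.arccos(np.clip(c * tau_12 / spacing, -1.0, 1.0))
```

For a two-microphone pair, `tau, _, _ = gcc_phat(x_1, x_2, fs)` followed by `azimuth_from_ictd(tau, L)` gives a single azimuth estimate; as the next subsection explains, that single value cannot resolve the cone-of-confusion.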

2.1 ICTD Trajectory

When using an immovable microphone array, the observation of the constant localization cues (e.g., the ICTDs) can be used to efficiently estimate a direction, provided there are sufficient sensors. However, like the two-channel array, the measurement of the single constant localization cue does not guarantee successful SSL due to the cone-of-confusion problem. Thus, it is concluded that the useful 3-D SSL cue should have other characteristics dependent on the source direction, i.e., azimuth and elevation angles.

When using the two-channel rotating array, if the angular velocity is given as w_R, the position-dependent ICTD must be a periodic function with a period of 2π / w_R. It is also noteworthy that the localization cue we want to suggest is similarly based on the ICTD concept. We assumed that the Doppler effect caused by the relative motion between the rotating microphone and a source can be ignored, because the speed of the rotating microphone, i.e., the radius of the rotating circle multiplied by the rotating angular speed is quite small compared with the speed of sound. Furthermore, when considering the application scenario where the talking person as a target source is walking inside a room, the sound source is supposed to move slowly. Thus, the new specific localization cue, including the position-dependent feature of the circular motion of the rotating array, is defined in equation (4) for the source at (φ_s, θ_s):

τ (φ_{S}, θ_{S} | θ_{Shift})

(4)

where θ_shift ∈ [0, 2π] is measured clockwise from the + Z_R axis. Also, θ_shift indicates the position of the microphone on the right side given a constant rotating radius (r_R).
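As a rough check of the assumption that the Doppler effect can be ignored, the tangential speed of the rotating microphone can be compared with the speed of sound. The numbers below use the array dimensions listed in Table 1 (r_R = 3.5 cm, w_R = 600 rpm) and an assumed speed of sound c ≈ 343 m/s, which is not stated in the text:

```latex
v_{R} = r_{R} \, w_{R} = 0.035 \ \text{m} \times \frac{2 π \times 600}{60} \ \text{rad/s} \approx 2.2 \ \text{m/s},
\qquad
\frac{v_{R}}{c} \approx \frac{2.2}{343} \approx 0.0064
```

Under these assumptions the microphone moves at well below 1% of the speed of sound, which is consistent with treating the Doppler shift as negligible.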

2.2 Extended Ray-Tracing Formula for 3-D Models

The well-known Ray-Tracing formula [22, 23] has been widely used for 2-D models to approximate the inter-aural time difference. However, in order to achieve the ICTD trajectories, this formula should be extended for 3-D models such as a two-channel rotating array installed on the sphere. Figure 3 shows the nomenclatures required to model the propagation distance between the source and the sensor locations. Once the propagation distance is derived, depending on the shift angle of the rotating part, the ICTD trajectory is obtainable with the assumption that the speed of the sound is independent of frequency in a non-dispersive medium. In order to derive the propagation distance, three direction vectors from C_H to the positions of the two microphones and the source need to be expressed as a function of the shift angle and the source direction.

Figure 3.

Nomenclatures related to the rotating microphone array on the spherical head. The centre of the head (C_H) is at the origin of the XYZ coordinates. The red circle represents the rotating microphone at the rotating part on the Y_RZ_R plane. This microphone is rotating with a constant speed (r_R × w_R), where r_R and w_R are the rotating radius and angular velocity, respectively. The rotation centre C_R is located at $(\sqrt{r_{H}^{2} - r_{R}^{2}}, 0, 0)$. The shift angle (θ_shift) of the rotating plate is defined as the angle from the + Z_R axis in a clockwise direction. The blue circle represents the fixed microphone that is located at (−r_H, 0, 0). r_H is the radius of the spherical head and θ_R is the angle between the X axis and the direction vector to the rotating microphone from C_H.

\vec{d_{R}} = (r_{H} \cos θ_{R}, r_{R} \sin θ_{shift}, r_{R} \cos θ_{shift})

(5)

\vec{d_{F}} = (- r_{H}, 0, 0)

(6)

\vec{d_{S}} = r_{S} (\sin φ_{S}, \cos φ_{S} \cos θ_{S}, \cos φ_{S} \sin θ_{S})

(7)

where ${\vec{d}}_{R}$ and ${\vec{d}}_{F}$ are the direction vectors from C_H to the rotating and fixed microphones respectively, r_S is the distance between a source and C_H, and ${\vec{d}}_{S}$ is the direction vector from C_H to the source.

The Ray-Tracing formula for the 3-D model includes the concept of the critical circle, which is the counterpart of the critical point in the 2-D model [22]. When the observation point is hidden by the head, the wave-front from the source initially propagates to the critical circle directly and then secondarily propagates along the surface to the observation point. These propagation steps are shown in Figure 4 and the critical circle is represented by the red-coloured line. If we denote the direction vector to the i^th microphone on the surface as ${\vec{d}}_{I}$ , then the angle between two direction vectors (i.e., ${\vec{d}}_{I}$ and ${\vec{d}}_{S}$ ) is denoted as $θ_{\vec{d}_{I} {\vec{d}}_{S}}$ and defined below:

Figure 4.

Two steps of wave propagation from the sound source to the hidden observation point are illustrated by the green and purple lines. If the observation point (indicated by the red point) is hidden by the sphere from the view of the source, the wave-front approaches the critical circle directly. After that, the wave-front reaches the observation point along the surface.

θ_{\vec{d_{I}} \vec{d_{S}}} = \cos^{- 1} (\frac{〈 \vec{d_{I}} \cdot \vec{d_{S}} 〉}{| \vec{d_{I}} | | \vec{d_{S}} |})

(8)

Then, $θ_{\vec{d}_{R} {\vec{d}}_{S}}$ and $θ_{\vec{d}_{F} {\vec{d}}_{S}}$ can be expressed by equations (9) and (10) with the assumption that r_S ≫ r_H:

\begin{array}{l} θ_{\vec{d_{R}} \vec{d_{S}}} (θ_{shift}) = \cos^{- 1} (\cos θ_{R} \sin φ_{S} \\ + \sin θ_{R} \cos φ_{S} \cos θ_{S} \sin θ_{shift} \\ + \sin θ_{R} \cos φ_{S} \sin θ_{S} \cos θ_{shift}) \end{array}

(9)

θ_{\vec{d_{F}} \vec{d_{S}}} = \frac{π}{2} + φ_{S}

(10)

If $θ_{\vec{d}_{R} {\vec{d}}_{S}} \leq θ_{c}$ , where θ_c is ${\cos}^{- 1} (| {\vec{d}}_{R} | / | {\vec{d}}_{S} |) \sim 90^{°}$ , then the wave propagates along the direct path only. Otherwise, if the microphone is hidden by the spherical head, which mathematically indicates that $θ_{\vec{d}_{R} {\vec{d}}_{S}} \geq θ_{c}$ , the propagation distance of the diffracted wave along the surface should also be considered. Here, equation (11) shows the propagation distance from the source to the rotating microphone, denoted by D_R, and equation (12) shows the distance from the source to the fixed microphone, represented by D_F:

D_{R} = \begin{cases} \sqrt{r_{S}^{2} + r_{H}^{2} - 2 r_{S} r_{H} \cos (θ_{\vec{d_{R}} \vec{d_{S}}})}, & θ_{\vec{d_{R}} \vec{d_{S}}} \leq θ_{c} \\ r_{S} \sin (θ_{c}) + r_{H} (θ_{\vec{d_{R}} \vec{d_{S}}} - θ_{c}), & θ_{c} \leq θ_{\vec{d_{R}} \vec{d_{S}}} \leq π \end{cases}

(11)

D_{F} = \begin{cases} \sqrt{r_{S}^{2} + r_{H}^{2} - 2 r_{S} r_{H} \cos (θ_{\vec{d_{F}} \vec{d_{S}}})}, & θ_{\vec{d_{F}} \vec{d_{S}}} \leq θ_{c} \\ r_{S} \sin (θ_{c}) + r_{H} (θ_{\vec{d_{F}} \vec{d_{S}}} - θ_{c}), & θ_{c} \leq θ_{\vec{d_{F}} \vec{d_{S}}} \leq π \end{cases}

(12)

If we denote D_d^∞ as $\lim_{r_{s} \to \infty} [D_{F} - D_{R}]$ , the ICTD trajectories can be obtained by dividing D_d^∞ by c. For example, when the physical dimensions of the rotating microphone array are those given in Table 1, the resulting ICTD trajectories of the frontal sources on the horizontal plane are those shown in Figure 5. It is clear that the mean of the ICTD trajectories varies according to the change in the azimuth angle of the source, because the Y _R Z _R plane, on which the circular motion of the rotating microphone occurs, is perpendicular to the X axis. Also, for the frontal sources, the resulting up-and-down pattern of the ICTD trajectories is expected because of the clockwise microphone motion from the + Z _R axis. On the other hand, we can conjecture that the pattern of the ICTD trajectories of the rear sources will be reversed. In addition, for the laterally biased sources, no significant features are visible because the source is on the X axis, which is the symmetric axis of the array's rotating motion. Figure 6 shows the ICTD trajectories for the sources on the median plane. The vertical axis has no physical meaning because we added a 0.3 ms offset to the original ICTD trajectories as the elevation angle of the source increases by 45°, in order to present all ICTD trajectories in a single graph. As shown in Figure 6, as the elevation angle is increased, the phase of the trajectory is shifted. The source elevation angle is defined on the sagittal plane and the shift angle of the rotating part is also defined on the sagittal plane, i.e., the Y _R Z _R plane; see Figures 1 and 3. This is why the ICTD trajectory is shifted as much as the change of the elevation angle of a source. Then, it seems possible to estimate a source's elevation angle by finding the phase shift of the trajectory. 
Therefore, we can expect that the mean value of the ICTD trajectories will be a useful cue for the azimuth estimation and an amount of the shift of the ICTD trajectory can be used to efficiently estimate the elevation.
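The construction of an ICTD trajectory from the direction vectors (5)–(7), the angle in equation (8) and the propagation distances (11) and (12) can be sketched numerically as follows. This is an illustrative sketch rather than the authors' code: it uses the finite source distance r_S = 10 m from Table 1 instead of the limit r_S → ∞, it evaluates the angle to the rotating microphone directly from the dot product of the direction vectors, and the speed of sound c = 343 m/s and all function names are assumptions.

```python
import numpy as np

C = 343.0                          # speed of sound in m/s (assumed, not given in the text)
R_H, R_R, R_S = 0.15, 0.035, 10.0  # head radius, rotating radius, source distance (Table 1)
THETA_R = np.arcsin(R_R / R_H)     # about 0.236 rad, as listed in Table 1

def path_length(angle, r_s=R_S, r_h=R_H):
    """Equations (11)/(12): direct path, or direct path plus surface arc."""
    theta_c = np.arccos(r_h / r_s)                 # critical angle, close to 90 degrees
    direct = np.sqrt(r_s**2 + r_h**2 - 2.0 * r_s * r_h * np.cos(angle))
    creeping = r_s * np.sin(theta_c) + r_h * (angle - theta_c)
    return np.where(angle <= theta_c, direct, creeping)

def ictd_trajectory(phi_s, theta_s, theta_shift):
    """ICTD trajectory (D_F - D_R) / c over an array of shift angles (radians)."""
    # Normalised dot product of the direction vectors in equations (5) and (7).
    cos_r = (np.cos(THETA_R) * np.sin(phi_s)
             + np.sin(THETA_R) * np.cos(phi_s) * np.cos(theta_s) * np.sin(theta_shift)
             + np.sin(THETA_R) * np.cos(phi_s) * np.sin(theta_s) * np.cos(theta_shift))
    ang_r = np.arccos(np.clip(cos_r, -1.0, 1.0))   # angle of equation (8), rotating mic
    ang_f = np.pi / 2 + phi_s                      # equation (10), fixed mic
    return (path_length(ang_f) - path_length(ang_r)) / C

shift = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
trajectory = ictd_trajectory(0.0, np.pi / 4, shift)  # median-plane source, 45 deg elevation
```

Evaluating `ictd_trajectory` for several elevations on the median plane gives sinusoid-like curves whose phase moves with the elevation, which is the behaviour described for Figure 6.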

Table 1.

The dimensions of the rotating microphone array and the angular speed of the rotating part are given

Nomenclatures	Numerical Values
r_S	10 m
r_R	3.5 cm
r_H	15 cm
w_R = dθ_shift/dt	600 rpm
θ_R	0.236 rad
C_R	(0.146 m, 0 m, 0 m)
C_H	(0 m, 0 m, 0 m)

Figure 5.

The ICTD trajectories of the frontal sources on the horizontal plane. The up-and-down motion of the ICTD trajectories is apparent because the shift angle of the rotating microphone increases in a clockwise direction from the +Z_R axis. Additionally, for the left- and right-sided sources, the ICTD trajectories described by the cyan dotted lines have no significant features because the propagation from the laterally-biased source to the rotating microphone does not change as the shift angle varies.

Figure 6.

The ICTD trajectories for sources on the median plane are shown. As the source elevation changes, the trajectory pattern shifts. For the source on the top of the head, the distance between the rotating microphone and the source is the shortest at θ_shift= 0°, which indicates that the ICTD is maximal. The distance increases up to θ_shift = 180° and goes back to the shortest distance within a single period.

2.3 Characteristics of the ICTD Trajectory of the Rotating Microphone Array

In section 2.2, examples of the ICTD trajectories obtained using the Ray-Tracing formula were shown. In section 2.3, we describe the characteristics of the ICTD trajectories of the rotating microphone array. First, the relation between the mean of the ICTD trajectory and the azimuth angle of the source will be derived. Second, the relation between the phase shift of the trajectory and the elevation angle will be shown. In addition, the amplitude of the ICTD trajectory will be presented as a function of the azimuth angle only.

First of all, the mean value of the ICTD trajectory is defined as equation (13):

\bar{ICTDT} = \frac{1}{2 π c} \int_{0}^{2 π} D_{d}^{\infty} (θ_{shift}) d θ_{shift}

(13)

The wave propagation from a source to a microphone is strongly dependent on the azimuth angle of a source; see Figure 4 and equation (10). For example, when a sound source is to the left of the head, only direct wave propagation occurs to the fixed microphone. On the other hand, the consecutive propagation along the direct and indirect paths occurs from the source to the rotating microphone because the rotating microphone is hidden by the head from the view of the source. If a source is to the right of the head, the wave propagation characteristics are reversed. In particular, for a source with an azimuth angle within [−θ_R, +θ_R], the propagation characteristic to the rotating microphone changes according to its shift angle. In order to represent the propagation characteristics more precisely, we divide them into three categories: (case 1) the wave propagation is along the direct path only, (case 2) the consecutive propagation is along the direct and indirect paths and (case 3) there is a transition between case 1 and case 2, depending on the shift angle. In terms of these categories, Table 2 shows the propagation characteristics according to the azimuth angle of the source. For the sources with azimuth angles within [−θ_R, 0], the propagation to the rotating microphone corresponds to case 3. The transition from case 1 to case 2 occurs when the rotating microphone passes θ_b and the subsequent transition from case 2 to case 1 occurs at π + θ_b, where θ_b is defined in equation (14). For the sources with azimuth angles within [0, +θ_R], the transition from case 1 to case 2 occurs at θ_b and the subsequent transition from case 2 to case 1 occurs at π − θ_b:

Table 2.

The wave propagation characteristics from the source to the rotating and fixed microphones

Azimuth Interval	Fixed Mic.	Rotating Mic.
[−π/2, −θ_R]	Case 1	Case 2
[−θ_R, 0]	Case 1	Case 3
[0, +θ_R]	Case 2	Case 3
[+θ_R, π/2]	Case 2	Case 1

θ_{b} = \min \underset{θ_{shift}}{\arg} [\sin (θ_{shift}) \leq \tan φ_{S} \cot θ_{R}]

(14)

where θ_shift ∈ [0, 2π]. With respect to the azimuth interval, the mean value of the ICTD trajectory is derived as a function of the azimuth angle of the source only, as shown in Figure 7. It is apparent that a one-to-one relationship exists between the mean value of the ICTD trajectory and the azimuth angle of the source. Therefore, it is possible to estimate the azimuth angle of the source once the mean value of the ICTD trajectory is obtained.

Figure 7.

The mean value of the ICTD trajectory as a function of the azimuth angle only is shown. The one-to-one relationship between the mean value of the ICTD trajectory and the azimuth angle is clearly defined. The vertical dashed lines indicate the azimuth angles ±θ_R, where θ_R = $\cos^{- 1} (\sqrt{r_{H}^{2} - r_{R}^{2}} / r_{H})$ = 13.4934° for the rotating microphone array with the dimensions in Table 1.

On the other hand, the specific shift angles, which correspond to the maximal or minimal values of the ICTD trajectory, are useful for finding the elevation angle of the source. These specific shift angles are defined as below:

\begin{array}{l} θ_{shift}^{\max} = \arg \max_{θ_{shift}} D_{d}^{\infty} (θ_{shift}) \\ θ_{shift}^{\min} = \arg \min_{θ_{shift}} D_{d}^{\infty} (θ_{shift}) \end{array}

(15)

which implies that θ^max_shift and θ^min_shift are equal to π/2 − θ_S and 3π/2 − θ_S, respectively. Note that the elevation angle increases from the + Y axis in an anticlockwise direction, while the shift angle of the rotating microphone increases from the + Z_R axis in a clockwise direction (see Figures 1 and 3). In summary, by finding the two parameters of the ICTD trajectory, i.e., $\bar{ICTDT}$ and θ^max/min_shift, the azimuth and elevation angles of the source can be found independently.

In addition, as shown in Figure 5, the amplitude of the ICTD trajectory changes as the azimuth angle is varied. Naturally, we can expect the trajectory amplitude to be dependent on the azimuth angle only. Its definition is given below:

{ICTDT}_{P P} = \frac{1}{c} | D_{d}^{\infty} (θ_{shift}^{\max}) - D_{d}^{\infty} (θ_{shift}^{\min}) |

(16)

We express the amplitude of the ICTD trajectory as its peak-to-peak value using the specific shift angles in equation (15). Figure 8 visualizes its amplitude as a function of the azimuth angle. It is notable that the ICTD trajectories of the left-sided sources have larger ICTDT_pp than those of the right-sided sources, except for the source at (φ_S, θ_S) = (−90°, 0°). The variation of the ICTD trajectory is caused by the motion of the rotating microphone only. When the sphere hides the entire path of the rotating microphone from the view of the source, the wave propagation of case 2 occurs and the variation of the propagation distance becomes the largest (see Table 2). Also, as the source moves from the left to the right, the portion of the direct wave propagation increases and ICTDT_pp decreases. Equation (17) shows ICTDT_pp according to the azimuth intervals:

Figure 8.

The values of ICTDT_pp as a function of the azimuth angle. In particular, the left-sided sources within [−π/2 + θ_R, −θ_R] have the same ICTDT_pp, which corresponds to the time taken for a wave-front to travel the length 2r_Hθ_R; 2r_Hθ_R / c is equal to 0.206 msec. 2r_Hθ_R is the greatest length made by the rotating motion on the surface within a full revolution. As a source approaches the right, ICTDT_pp decreases. Exceptionally, the ICTDT_pp values of the sources at (−90°, 0°) and (+90°, 0°) are zero, because these sources are located on the X axis, which is perpendicular to the Y_RZ_R plane.

{ICTDT}_{P P} = \frac{1}{c} \begin{cases} r_{H} \cos^{- 1} (\cos θ_{R} \sin φ_{S} - \sin θ_{R} \cos φ_{S}) - r_{H} \cos^{- 1} (\cos θ_{R} \sin φ_{S} + \sin θ_{R} \cos φ_{S}), & φ_{S} \in [- \frac{π}{2}, - θ_{R}] \\ r_{H} (\cos^{- 1} (\sin (φ_{S} - θ_{R})) - \frac{π}{2} + \sin (φ_{S} + θ_{R})), & φ_{S} \in [- θ_{R}, + θ_{R}] \\ 2 r_{H} \sin θ_{R} \cos φ_{S}, & φ_{S} \in [+ θ_{R}, + \frac{π}{2}] \end{cases}

(17)
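As a worked check of the first azimuth interval in equation (17), the path-length difference can be simplified with the identities cos θ_R sin φ_S ∓ sin θ_R cos φ_S = sin(φ_S ∓ θ_R) and cos⁻¹(sin x) = π/2 − x, the latter holding for x ∈ [−π/2, π/2], i.e., for φ_S ≥ −π/2 + θ_R. The symbol ΔD is introduced here only for illustration, and the numerical value assumes c ≈ 343 m/s together with the Table 1 dimensions:

```latex
\begin{aligned}
ΔD &= r_{H} \cos^{-1} (\sin (φ_{S} - θ_{R})) - r_{H} \cos^{-1} (\sin (φ_{S} + θ_{R})) \\
   &= r_{H} \left[ \left( \frac{π}{2} - φ_{S} + θ_{R} \right) - \left( \frac{π}{2} - φ_{S} - θ_{R} \right) \right] = 2 r_{H} θ_{R}, \\
\frac{ΔD}{c} &\approx \frac{2 \times 0.15 \ \text{m} \times 0.2355}{343 \ \text{m/s}} \approx 0.206 \ \text{ms}
\end{aligned}
```

The result does not depend on φ_S, which is why the left-sided sources in this interval share one peak-to-peak value, and it agrees with the 0.206 msec quoted in the Figure 8 caption.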

3. Localization Algorithm

The localization of a source can be achieved using the one-to-one relationship between the parameters of an ICTD trajectory and a source direction, as described in section 2.3. However, it is not easy to apply this approach to a real situation where a source and other noises are present simultaneously. In addition, the duration of a source varies and can be too short to calculate τ(θ_shift), even for a single source case. Therefore, to apply the practically feasible SSL to a real environment, a new SSL method is necessary. Section 3.1 presents the source direction estimator (SDE) based on the ICTD trajectory and section 3.2 summarizes the proposed 3-D SSL algorithm.

3.1 Source Direction Estimator

As mentioned before, we used the conventional GCC-PHAT function [16] to obtain the ICTD trajectories. Equation (18) redefines a GCC-PHAT function that is dependent on the shift angle of the rotating microphone:

R_{x_{F} x_{R}} (τ | θ_{shift}) = \int_{- \infty}^{\infty} \frac{G_{x_{F} x_{R}} (f | θ_{shift})}{| G_{x_{F} x_{R}} (f | θ_{shift}) |} e^{j 2 π f τ} d f

(18)

where G_{x_F x_R}(f | θ_shift) is calculated by using microphone signals that are collected while the rotating microphone is passing around θ_shift. Details about the measurement and the signal processing are presented in sections 4.1 and 4.2. Thus, G_{x_F x_R}(f | θ_shift) is strongly dependent on the shift angle of the rotating microphone. It should be noted that the relative motion between a sensor and a source is so small that the Doppler effect in the measured signals is negligible [24]. Therefore, it is reasonable to assume that R_{x_F x_R}(τ | θ_shift) should have time-varying peak positions. Based on the time- or (shift) angle-dependent feature, we can define the source direction estimator (SDE) as below:

\begin{array}{l} SDE (φ_{S}, θ_{S}) \\ = \frac{\int_{0}^{2 π} R_{x_{F} x_{R}} (τ (φ_{S}, θ_{S} | θ_{shift}) | θ_{shift}) | \frac{d τ (φ_{S}, θ_{S} | θ_{shift})}{d θ_{shift}} | d θ_{shift}}{\int_{0}^{2 π} | \frac{d τ (φ_{S}, θ_{S} | θ_{shift})}{d θ_{shift}} | d θ_{shift}} \end{array}

(19)

where τ(φ_S, θ_S | θ_shift) is the entry of the constructed ICTD trajectory database for a source at (φ_S, θ_S). SDE at (φ_S, θ_S) is in the form of a line integral of R_{x_F x_R}(τ | θ_shift) along the line of τ(φ_S, θ_S | θ_shift). For example, if R_{x_F x_R}(τ | θ_shift) is equal to 1 along the line of τ(φ^a, θ^a | θ_shift) only, then SDE is ideally 1 at (φ, θ) = (φ^a, θ^a) and 0 at other directions. Thus, once SDE is generated, it is possible to estimate the source direction via peak detection.
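A discrete counterpart of equation (19) can be sketched as a weighted average of GCC-PHAT values sampled along one candidate trajectory. The sketch below is only one possible discretization under stated assumptions: the function name `sde`, the linear interpolation over the lag axis, the finite-difference derivative from `np.gradient` (which does not treat the trajectory as periodic at the ends) and the handling of a flat trajectory are choices made here, not details given in the text.

```python
import numpy as np

def sde(R, lags, shift_angles, tau_traj):
    """Discrete sketch of equation (19) for one candidate direction.

    R            : (N_f, N_lag) array, GCC-PHAT per frame over the lag axis
    lags         : (N_lag,) ascending lag values in seconds
    shift_angles : (N_f,) ascending shift angle assigned to each frame (radians)
    tau_traj     : (N_f,) database trajectory tau(phi_S, theta_S | theta_shift)
    """
    # Sample each frame's GCC-PHAT at the lag predicted by the trajectory.
    samples = np.array([np.interp(t, lags, row) for t, row in zip(tau_traj, R)])
    # Weight |d tau / d theta_shift|, estimated by finite differences.
    weights = np.abs(np.gradient(tau_traj, shift_angles))
    total = weights.sum()
    if total == 0.0:          # flat trajectory: the ratio is undefined, return 0
        return 0.0
    return float(np.sum(samples * weights) / total)
```

Calling `sde` once per entry of the trajectory database yields a map over candidate directions. With the ideal R described above (equal to 1 only along one trajectory), the function returns 1 for that trajectory and 0 for a trajectory that never coincides with it.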

3.2 Localization Algorithm for Rotating Microphone Array

In this section, the proposed SSL algorithm is described. On the basis of the weak Doppler effect (due to the small relative motion), the collected signals of the fixed and rotating microphones within (at least) a single period are segmented into N_f frames, each including N_fft samples. In addition, the angle allocated to each frame is the shift angle, which is measured at the time the middle sample in the frame is collected. Sections 4.1 and 4.2 give more information about the segmentation process. In the real system, the shift angle is measured directly by the encoder signal; see Figures 17 and 18 for more details. Then, we can obtain $R_{x_{F} x_{R}} (τ | θ_{shift})$ and SDE(φ_S, θ_S) for every possible direction using equations (18) and (19). The final decision is made by detecting the peak in SDE; we assume that the number of dominant sources is given by the recognition group prior to the SSL process. If it is reported that a single source is recognized, then the estimation of the source direction can be done by equation (20):

({\hat{φ}}_{S}, {\hat{θ}}_{S}) = \arg \max_{φ_{S}, θ_{S}} S D E (φ_{S}, θ_{S})

(20)

where φ̂_s and θ̂_s are the estimated azimuth and elevation angles of a source, respectively. For multiple SSL, various peak detection strategies are applicable when multiple peaks in the SDE are present. However, since our research focused on a single SSL, we used the simplest global peak detection using equation (20). Figure 9 shows the procedure of the proposed SSL algorithm.

Figure 9.

The proposed SSL algorithm based on SDE. Two measured time-domain signals are divided into the given number of frames, N_f, and each frame has N_fft samples. The shift angle corresponding to the middle sample in each frame is allocated to each frame. The framed signals are then used to calculate $R_{x_{F} x_{R}} (τ | θ_{shift})$ . Next, by using the constructed ICTD trajectory database, the SDE can be obtained for every direction. Finally, by finding the dominant peak(s) in SDE in descending order, the source direction(s) can be estimated.
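The framing step and the final peak detection of equation (20) can be sketched as below. The sketch assumes non-overlapping frames and a constant angular speed, so that the shift angle of each frame can be computed from the time of its middle sample; in the real system described above the angle would come from the encoder signal instead. The function names and the `theta0` start-angle parameter are hypothetical.

```python
import numpy as np

def frame_with_shift_angles(x_f, x_r, fs, n_fft, rpm, theta0=0.0):
    """Split both signals into frames and assign a shift angle to each frame.

    Assumes non-overlapping frames and constant rotation speed (an
    illustrative stand-in for the encoder measurement).
    """
    n_frames = min(len(x_f), len(x_r)) // n_fft
    frames_f = np.reshape(np.asarray(x_f)[:n_frames * n_fft], (n_frames, n_fft))
    frames_r = np.reshape(np.asarray(x_r)[:n_frames * n_fft], (n_frames, n_fft))
    # Time at which the middle sample of each frame was collected.
    mid_times = (np.arange(n_frames) * n_fft + n_fft // 2) / fs
    omega = 2.0 * np.pi * rpm / 60.0                  # angular speed in rad/s
    shift_angles = (theta0 + omega * mid_times) % (2.0 * np.pi)
    return frames_f, frames_r, shift_angles

def pick_direction(sde_map, phi_grid, theta_grid):
    """Equation (20): global peak of an SDE map of shape (len(phi), len(theta))."""
    i, j = np.unravel_index(np.argmax(sde_map), sde_map.shape)
    return phi_grid[i], theta_grid[j]
```

For example, at 600 rpm and a hypothetical sampling rate of 48 kHz, one revolution spans 4800 samples, so `n_fft = 480` gives ten frames per revolution with shift angles 36° apart.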

4. Simulation

In section 4, we evaluate the performance of the proposed SSL algorithm using synthesized signals. To do this, signal models of the fixed and rotating microphones were needed. These models are given in section 4.1 and the results of the simulation for a single source are described in section 4.2. The localization performance is evaluated with respect to the localization error, which is defined as the angle between the true and perceived direction vectors. In this simulation, the physical dimensions of the rotating microphone array are given in Table 1.

4.1 Signal Models of the Fixed and Rotating Microphones

As shown in Figure 3, the rotating microphone array is installed on a spherical head with a radius of r_H. One of the two microphones is fixed at (−r_H, 0, 0) on the surface of the spherical head (this microphone is hereafter called the “fixed microphone” for convenience). Then, the output signal of the fixed microphone in a continuous time domain, denoted as x_F (t), can be modelled as below:

x_{F} (t) = h_{S}^{x_{F}} {(t)}^{T} * s (t | φ_{S}, θ_{S})

(21)

where h_S^{x_F}(t) is the spherical impulse response [25] from the source position to the fixed microphone position on the spherical head, s(t | φ_S, θ_S) is the source signal, and * indicates the convolution operator. As shown in equation (21), h_S^{x_F}(t) is not a function of θ_shift because this microphone does not move. However, the other microphone (i.e., the rotating microphone) is located on the rotating plate and moves in a circular motion on the Y_R Z_R plane (see Figures 2 and 3). The measured signal of the rotating microphone is therefore strongly dependent on the shift angle. The signal model of the rotating microphone, denoted as x_R(t), can be defined as:

x_{R} (t) = h_{S}^{x_{R}} {(t | θ_{Shift})}^{T} * s (t | φ_{S}, θ_{S})

(22)

where h_S^x_R(t | θ_shift) is the spherical impulse response from the source position to the rotating microphone position. In this case, h_S^x_R is a function of θ_shift due to its circular motion. The synthesized signal refers to the discrete-time domain signal. The generation of the synthesized signal of the fixed microphone, denoted as x_F[n], is carried out by simply discretizing x_F (t), as shown below:

x_{F} [n] = x_{F} (n Δ t_{S})

(23)

where Δt_S is the sampling time and x_F[n] is the n-th sample of the synthesized signal of the fixed microphone. On the other hand, the motion of the rotating microphone makes the generation of x_R[n] more complicated. For example, when we assume that the rotating microphone is shifted +θ_N° in a clockwise direction from the +Z_R axis and fixed during the measurement, then the output signal x_R(t) of the rotating microphone is:

x_{R} (t | θ_{shift} = θ_{N}) = h_{S}^{x_{R}} {(t | θ_{shift} = θ_{N})}^{T} * s (t | φ_{S}, θ_{S})

(24)

By using this notation, M x_R(·), which is the matrix of conditioned (continuous) output signals, can be modelled as equation (25) and is composed of N_f output signals with Δθ_N degree resolution:

M x_{R} (\cdot) = [\begin{array}{l} x_{R} (t | θ_{shift} = 0) \\ x_{R} (t | θ_{shift} = Δ θ_{N}) \\ ⋮ \\ x_{R} (t | θ_{shift} = (N_{f} - 1) \cdot Δ θ_{N}) \end{array}]

(25)

where x_R(t | θ_shift) is equal to x_R(t | θ_shift + 2π) due to the circular motion of the rotating microphone, which means that the process is cyclo-stationary when the source content is stationary [26]. We assume that the other dimensions do not vary. In this simulation, we set the sampling frequency (f_S) and the number of frames (N_f) as 44.1 kHz and 360, respectively. Thus, Δθ_N becomes 1°, and M x_R[·], which is the matrix of conditioned discretized signals, can be modelled as in equation (26):

M x_{R} [N_{f}, Δ θ_{N}] = [\begin{array}{l} x_{R} (n Δ t_{S} | θ_{shift} = 0) \\ x_{R} (n Δ t_{S} | θ_{shift} = Δ θ_{N}) \\ ⋮ \\ x_{R} (n Δ t_{S} | θ_{shift} = (N_{f} - 1) \cdot Δ θ_{N}) \end{array}]

(26)

From equation (26), the synthesized signal x_R[n] along the shift angle axis can be represented as follows:

x_{R} [n] = x_{R} (n Δ t_{S} | θ_{shift} = (n - 1) \cdot Δ θ_{N})

(27)

For instance, when the source is located in the direction of (φ_S, θ_S) = (0°, 0°) and its signal content is a Gaussian white noise signal, the resulting values included in M x_R[·] are presented in Figure 10. The amplitude of the synthesized signal increases as the shift angle of the rotating microphone gets close to 90° and generally decreases as the shift angle approaches 270°. This is a reasonable result: when the rotating microphone approaches the source direction, the measured signal is less attenuated by the spherical head. In this simulation model, the angular velocity of the rotating plate, ω_R, is 600 rpm. The synthesized output signal of the rotating microphone is collected along the signal detection line determined by ω_R. The synthesized microphone outputs for this case are presented in Figure 11.

Figure 10.

M x_R[·] of conditioned and discretized output signals of the rotating microphone with respect to the time and shift angle axes

Figure 11.

The synthesized output signals of the rotating and fixed microphones are x_R[n] (top) and x_F[n] (bottom), respectively
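The construction of x_R[n] from the conditioned matrix can be sketched as below. This is an illustrative reading of equations (22) and (26)–(27), not the authors' code: the impulse responses are taken as given inputs (the spherical-head model of [25] is not implemented here), and the mapping from sample index to shift angle through the rotation speed is an assumption of this sketch, as are all function names.

```python
import numpy as np

def conditioned_matrix(source, impulse_responses):
    """Row k is the rotating-microphone output with the microphone
    held at shift angle k * delta_theta (cf. equation (26)).

    impulse_responses is a list of N_f impulse responses, one per
    shift angle; they are assumed to be supplied by the caller.
    """
    n = len(source)
    return np.stack([np.convolve(source, h)[:n] for h in impulse_responses])

def shift_angle_index(n_samples, fs, rpm, n_f):
    """Row index (shift-angle bin) visited at each sample instant."""
    t = np.arange(n_samples) / fs
    angle_deg = (360.0 * (rpm / 60.0) * t) % 360.0
    return np.floor(angle_deg / (360.0 / n_f)).astype(int) % n_f

def rotating_mic_signal(matrix, fs, rpm):
    """Read the matrix along the signal detection line: at sample n,
    take the row that matches the current shift angle."""
    n_f, n_samples = matrix.shape
    rows = shift_angle_index(n_samples, fs, rpm, n_f)
    return matrix[rows, np.arange(n_samples)]
```

With f_S = 44.1 kHz and 600 rpm, one revolution spans 4,410 samples, so each 1° row of the matrix contributes roughly 12 consecutive samples to x_R[n] in this sketch.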

4.2 Simulation Results

Various criteria to evaluate the SSL performance have been suggested by previous researchers [4, 6, 11–12, 15, 20]. One of the most commonly used criteria is based on the absolute error between true and perceived directions and it can be applied to the evaluation of azimuth or elevation angle estimations separately. However, for the evaluation of 3-D SSL performance, it would be more reasonable to incorporate both azimuth and elevation together. If we express the perceived (or estimated) azimuth and elevation angles as φ̂_S and θ̂_S respectively, then the true and perceived direction vectors ($\vec{tdv}$, $\vec{pdv}$) are defined with respect to the inter-aural polar coordinate:

\begin{array}{l} \vec{tdv} = (\sin φ_{S}, \cos φ_{S} \cos θ_{S}, \cos φ_{S} \sin θ_{S}) \\ \vec{pdv} = (\sin {\hat{φ}}_{S}, \cos {\hat{φ}}_{S} \cos {\hat{θ}}_{S}, \cos {\hat{φ}}_{S} \sin {\hat{θ}}_{S}) \end{array}

(28)

Using these definitions, the localization error is defined as $\cos^{-1} \langle \vec{tdv}, \vec{pdv} \rangle$, i.e., the angle between the two unit direction vectors. For example, when a Gaussian white noise source is at (0°, 0°) and its duration is longer than the rotating period, the GCC-PHAT functions corresponding to the circular motion of the rotating microphone are shown in Figure 12. Each GCC-PHAT function was calculated by using the segments of the synthesized signals of the two microphones. These segments have 1,024 (i.e., N_fft) samples and there are 900 overlapping samples between adjacent frames. The meaningful region in the time domain is from −1.4 · 10⁻³ seconds to +1.4 · 10⁻³ seconds. The peak location of the GCC-PHAT function moves up and down in the time domain, as expected (see section 2.2). In the situation where one fixed sound source is at (0°, 0°), the peak position is the highest in the time domain when θ_shift is equal to 90° and is the lowest when θ_shift is 270°. No other significant features were found because additional noise signals were not included in the synthesized signals.
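As a small worked example, the localization error built from equation (28) can be computed as follows. The function names are chosen for this sketch only; the dot product is clipped to [−1, 1] to guard against rounding before the inverse cosine is taken.

```python
import numpy as np

def direction_vector(azimuth_deg, elevation_deg):
    """Unit direction vector in inter-aural polar coordinates (eq. 28)."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    return np.array([np.sin(az),
                     np.cos(az) * np.cos(el),
                     np.cos(az) * np.sin(el)])

def localization_error_deg(true_dir, est_dir):
    """Angle in degrees between the true and perceived direction vectors.

    true_dir and est_dir are (azimuth, elevation) pairs in degrees.
    """
    tdv = direction_vector(*true_dir)
    pdv = direction_vector(*est_dir)
    return np.degrees(np.arccos(np.clip(np.dot(tdv, pdv), -1.0, 1.0)))
```

Because the error is a single angle, azimuth and elevation deviations are combined in one figure instead of being reported separately.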

Figure 12.

The GCC-PHAT functions when the source is located directly at the front side (0°, 0°). As we expected, the up-and-down pattern of the peak location is clearly visible. In this noise-free simulation of a single source, there are no distinguishing local peaks along the time axis.
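A generic GCC-PHAT computation for one pair of frames can be sketched as below. This is a textbook-style formulation rather than the exact routine used in the paper: the zero-padding length, the small constant that protects the division, and the sign convention (a positive lag means the rotating-microphone frame lags the fixed-microphone frame) are assumptions of this example. The default `max_tau` follows the ±1.4 ms region mentioned above.

```python
import numpy as np

def gcc_phat(x_f, x_r, fs, max_tau=1.4e-3):
    """Generalized cross-correlation with phase transform.

    Returns the lag axis in seconds and the GCC-PHAT values
    restricted to |lag| <= max_tau.
    """
    n = len(x_f) + len(x_r)               # zero-pad to avoid circular wrap
    X_f = np.fft.rfft(x_f, n=n)
    X_r = np.fft.rfft(x_r, n=n)
    cross = X_r * np.conj(X_f)
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(round(max_tau * fs))
    # Reorder so that negative lags come first.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1) / fs
    return lags, cc
```

Applying this function frame by frame and stacking the results along the shift angle axis gives a map of the kind shown in Figure 12.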

By using equation (19), the SDE is obtained from the GCC-PHAT functions and the database of the approximated ICTD trajectories. Figure 13 shows the calculated SDE. The dominant peak is clearly visible, and bell-shaped side edges originating from the peak spread out primarily along the elevation angle axis. This result is due to several factors. First, if the time resolution were infinitesimally small, the bell-shaped edges would become invisible. However, the acquisition or processing system has its limitations, such as a finite f_S. As a result, adjacent ICTD trajectories may overlap each other. More specifically, the locations of the peaks of the GCC-PHAT functions partially match more than one ICTD trajectory in the time domain. Thus, the side edges become visible. Also, we can expect that as the time interval increases, the overlapped region will expand and the SDE values corresponding to the side edges will increase. Second, even if the SDE is calculated in the discrete-time domain with a denser time resolution, the side edges should still appear because the signal bandwidth is limited, so the calculated GCC-PHAT function is not equal to an ideal impulse. In addition, the effect of the rotational motion on the synthesized signals remains, although it is not remarkable. Therefore, the processing in the discrete-time domain and the motion of the rotating microphone cause the bell-shaped edges.

Figure 13.

SDE for the source at the front side of the rotating array is shown. It was found that the dominant peak is around the true direction of the source (0°, 0°). Additionally, the bell-shaped side edges originated from the peak due to the regional overlap of adjacent ICTD trajectories. The shape of the peak is stretched in the direction of the elevation angle axis due to the short up-and-down motion of the rotating microphone, compared with the width of the array.
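The way the SDE accumulates GCC-PHAT values along candidate trajectories can be illustrated with the sketch below. Equations (18–19) are not reproduced in this section, so the sketch assumes only what the text states: that the SDE for a direction is obtained by sampling the GCC-PHAT function of each frame at the ICTD predicted by the trajectory database and summing over the shift angles. The array layout and the nearest-lag lookup are assumptions of this example.

```python
import numpy as np

def sde_map(gcc, lags, trajectories):
    """Sum GCC-PHAT values along each candidate ICTD trajectory.

    gcc:          (n_frames, n_lags) GCC-PHAT values, one row per shift angle
    lags:         (n_lags,) lag axis in seconds
    trajectories: (n_dirs, n_frames) predicted ICTD in seconds for each
                  candidate direction (hypothetical database layout)
    """
    # Nearest lag bin for every (direction, frame) pair.
    idx = np.abs(trajectories[..., None] - lags).argmin(axis=-1)
    frame_idx = np.arange(gcc.shape[0])
    return gcc[frame_idx, idx].sum(axis=-1)
```

Because the lookup snaps each predicted ICTD to the nearest lag bin, neighbouring directions whose trajectories fall into the same bins over part of the rotation share some of the accumulated value, which matches the overlap explanation for the side edges given above.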

We examined the 3-D SSL performance of the proposed SSL algorithm for a Gaussian white noise source with respect to the localization error defined above. The range of the source direction is as follows: the azimuth angle spans from −90° to +90° at 10° intervals and the elevation angle varies from 0° to +330° (i.e., −30°) at 30° intervals. The number of source directions is therefore 228. It is assumed that the rotating microphone array system is located in a free field. Figure 14 shows the localization error distribution for all of the source directions. Generally, the performance improves as the source moves towards the left, opposite to the rotating microphone, due to the left and right asymmetry of the azimuth-dependent ICTD trajectory amplitude (see Figure 8). As expected, no elevation-dependent feature was visible. The distribution of the mean errors along the azimuth angle is shown in Figure 15.
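The test grid and the per-azimuth averaging can be written out as a short sketch. The variable and function names are illustrative only, and the errors array is assumed to be arranged with one row per azimuth and one column per elevation.

```python
import numpy as np

azimuths = np.arange(-90, 91, 10)    # 19 azimuth angles
elevations = np.arange(0, 360, 30)   # 12 elevation angles, 0 to 330 degrees
grid = [(az, el) for az in azimuths for el in elevations]  # 19 * 12 = 228

def mean_error_by_azimuth(errors):
    """Average an (n_azimuth, n_elevation) error array over elevation."""
    return np.asarray(errors).mean(axis=1)
```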

Figure 14.

Localization errors for 228 directions are depicted. As we can see, the elevation-dependent feature was not found. However, it was quite visible that the SSL performance is strongly dependent on the azimuth angle of a source only.

Figure 15.

The mean error along the azimuth angle of a source. The localization errors of the left-sided sources are almost the same, except for the leftmost source. The right-sided sources tend to be estimated with worse resolution compared with the left-sided sources.

4.3 Computational load comparison

To be an efficient 3-D SSL method, the signal processing costs must be light. In this section, the computational load of the proposed localization method is compared with those of the delay-and-sum beamformer [27] and the steered response power (SRP)-PHAT method [28]. For example, SRP-PHAT requires frequency-domain processing to perform the phase transform (PHAT). Here, if the number of microphones is denoted as M, the computation of all the possible GCC-PHAT functions requires M(M−1)/2 phase transforms. For a discrete Fourier transform size of N_fft, a single FFT takes 5N_fft log₂N_fft operations.

DFT of all the microphones: M × (5N_fft log₂N_fft)

Spectral processing: 7N_fft M(M−1)/2

Inverse DFT: M(M−1)/2 × (5N_fft log₂N_fft)

SRP-PHAT calculation for possible directions (N_φN_θ): M(M−1)/2 × N_φN_θ

Thus, the total SRP-PHAT processing cost is M(M+1)/2 × 5N_fft log₂N_fft + M(M−1)/2 × (N_φN_θ + 7N_fft). In the same way, the cost of the proposed localization algorithm is (3M−1)/2 × 5N_fft log₂N_fft + (M−1) × (N_φN_θ + 7/2 N_fft) and the cost of the delay-and-sum beamformer is MN × (N_φN_θ), where N is the frame length, 100. Therefore, the approximated costs are: (1) delay-and-sum beamformer, MN(N_φN_θ); (2) SRP-PHAT method, M²(N_φN_θ + 12N_fft); (3) the proposed method, M(N_φN_θ + 9/2 N_fft). It is noteworthy that if N_fft is smaller than N_φN_θ, the signal processing cost of the proposed localization method is roughly N times lower than that of the delay-and-sum beamformer and M times lower than that of the SRP-PHAT method. This is reasonable because the proposed localization cue can be computed from M microphone pairs when using an (M+1)-channel microphone array, whereas the SRP-PHAT method uses M(M−1)/2 microphone pairs for a single localization process. On the other hand, it is known that the SRP-PHAT method can be applied in situations where the signal-to-noise ratio (SNR) is below 0 dB. In the proposed localization method, however, the TDE error directly affects the localization performance because the proposed cue is based on the measured ICTD trajectory. Thus, it can reasonably be expected that the performance of the proposed method will degrade significantly more than that of the SRP-PHAT method when the SNR is below 0 dB.
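The three cost expressions can be evaluated numerically with a few lines of Python. The formulas are transcribed from the text; the example values in the usage note (M = 2, N_fft = 1,024, and a 2° grid of 91 × 180 directions) are assumptions chosen only to give concrete numbers.

```python
from math import log2

def fft_cost(n_fft):
    """Operation count of one FFT of size n_fft (5 N log2 N)."""
    return 5 * n_fft * log2(n_fft)

def srp_phat_cost(m, n_fft, n_dirs):
    """Total SRP-PHAT cost for m microphones and n_dirs directions."""
    return m * (m + 1) / 2 * fft_cost(n_fft) + m * (m - 1) / 2 * (n_dirs + 7 * n_fft)

def proposed_cost(m, n_fft, n_dirs):
    """Cost of the proposed method as stated in the text."""
    return (3 * m - 1) / 2 * fft_cost(n_fft) + (m - 1) * (n_dirs + 7 / 2 * n_fft)

def delay_and_sum_cost(m, frame_len, n_dirs):
    """Cost of the delay-and-sum beamformer, M * N * (N_phi * N_theta)."""
    return m * frame_len * n_dirs
```

For instance, with `m=2`, `n_fft=1024`, and `n_dirs=91*180`, the three functions can be compared directly to see how the direction-grid term and the FFT term trade off in each method.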

5. Experiment

We developed a rotating microphone array according to the proposed design (see Figure 2 and Table 1). It should be noted that the two microphone signals needed to be transmitted wirelessly for safety reasons. Thus, both a microphone and a transmitter needed to be placed inside the rotating block. An ultrasonic motor was chosen to make this block rotate inside the head. Details about the structure of the proposed array and the measurement process are provided in section 5.1. Section 5.2 shows the results of the two experiments for the feasibility test: one involving a Gaussian white noise source and the other involving a voice source.

5.1 Experimental Set-up

For our proposed rotating microphone array, we chose a wireless system (Q240, RFQ) consisting of a dual-channel receiver, two transmitters, and two microphones (QB686, RFQ). In order to put a transmitter unit and a microphone together in a rotating block, the electronic boards inside the transmitter unit had to be rearranged and installed in a cylindrical plastic block. Figure 16 shows the interior arrangement of the necessary blocks and other units inside the spherical head. There are two cylindrical blocks, two ultrasonic motors, one encoder, and one motor driver. The cylindrical block on the right side is called the “rotating block”; it contains the rearranged electronic boards used to transmit the microphone signal (No. 1) and the pin-type microphone located 3 cm from the centre of the cap. This block is connected to the ultrasonic motor (USR-E3T/24V, SHINSEI), which is driven by the motor driver (D6060E, SHINSEI). Additionally, the encoder is attached to the motor, so the shift angle of the microphone is measured using the encoder signal. The other block on the left side is hereafter called the “fixed block” for convenience. The pin-type microphone (No. 2) is attached at the centre of the cap. The transmitter unit is outside the block. The left and right side views are also presented in Figure 17. The physical dimensions are the same as those in Table 1, except the rotating radius r_R, which is 3 cm in the real array. Therefore, the ICTD trajectory database needed to be reconstructed.

Figure 16.

A top view of the hemisphere showing the interior arrangement of the rotating and fixed blocks, two ultrasonic motors, one encoder, and one motor driver. The rotating block on the right side contains the electronic boards for transmitting the microphone signal. The shift angle of the rotating microphone is measured by using the encoder.

5.2 Experimental Results

The experiments for the feasibility test were carried out in a room environment: the room size was 3.2 × 5.5 × 2.8 m³ (width × length × height) and the reverberation time (T₆₀) was 0.26 seconds. The input signal was produced through a full-range speaker (TC9FSD13, VIFA) on the speaker jig. Figure 18 shows the rotating array system placed in the room. Two experiments were conducted in order to check the feasibility: one involving a Gaussian white noise signal and the other using a male voice as a source signal.

5.2.1 Gaussian white noise source

First, an experiment using a Gaussian noise source as the input signal to the speaker was conducted. In this experiment, the SSL performance for a source in the median plane was evaluated. Only the elevation angle of the source was varied, from −30° to 210° at 10° intervals. The source content was a Gaussian white noise signal with frequency content from 1.5 kHz to 20 kHz, generated by a random noise generator (SF-06, RION), and it lasted longer than one rotating period. The rotation speed was set to 54 rpm. For example, when the source is at (0°, 0°), the measured microphone signals and the z-phase encoder signal are depicted in Figure 19. The total measurement time was 3 seconds and the signal duration was set to 2 seconds. By using the z-phase encoder signal, we collected the samples within a single rotating period and allocated N_fft samples to each frame. For the 25 directions in the median plane, the mean localization error was 1.75° and the standard deviation was 1.65°. Therefore, the experimental result showed that the proposed SSL algorithm is applicable to the SSL of a Gaussian noise source.
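Extracting a single rotating period from the z-phase signal can be sketched as follows. This sketch assumes that the z-phase channel emits one pulse per revolution and that the rotation speed is constant between two pulses, so that the shift angle can be interpolated linearly; the threshold value and the function name are hypothetical.

```python
import numpy as np

def single_period(z_phase, threshold=0.5):
    """Locate one revolution between two consecutive z-phase pulses.

    Returns the start and stop sample indices of the revolution and
    the shift angle (degrees) assigned to each sample in between.
    """
    high = np.asarray(z_phase) > threshold
    # Rising edges: sample is high while the previous one was low.
    rising = np.flatnonzero(~high[:-1] & high[1:]) + 1
    if len(rising) < 2:
        raise ValueError("need at least two z-phase pulses")
    start, stop = rising[0], rising[1]
    n = stop - start
    angles = 360.0 * np.arange(n) / n
    return start, stop, angles
```

The microphone signals sliced with `start:stop` can then be framed, with each frame taking the angle of its middle sample.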

Figure 17.

The right-side view of the spherical head is shown on the left and the left-side view is presented on the right

Figure 18.

The spherical head equipped with the rotating microphone array is set up in the measurement room

Figure 19.

The output signals of two microphones and the encoder signal (z-phase) when the source is at (0°, 0°)

5.2.2 Male voice source

The previous experiment employed a Gaussian white noise signal as a source. In this experiment, a male voice was used as the sound source, without using a speaker jig. The speaker's position was fixed during the measurement so that his mouth was at (45°, 0°) while speaking. The rotation speed of the rotating block was reduced to 21 rpm so that the measurement included a silent region. The output signals of the two microphones and the encoder signal are depicted in Figure 20. It is known that voice signals are not stationary with time. Also, the spectral modification is strongly dependent on the relative position of the sensor and the source. If the microphones are not attached to an object such as a sphere, but located in the free field, the spectral contents in the measured microphone signals will be the same. Figure 21 shows the GCC-PHAT functions along the shift angle of the rotating microphone. In the region where sufficient signal content was collected, the GCC functions were obtained quite reasonably, as the peak location changes in a sinusoidal form. The empty black circles show the estimated ICTDs.

Figure 20.

The output signals of two microphones and the encoder signal (z-phase) in the time domain when the voice source is at (45°, 0°). As shown, the voice signal is non-stationary.

Figure 21.

The GCC-PHAT functions along the shift angle of the rotating microphone. In the region with sufficient signal contents, the functions were obtained easily. This can be interpreted to mean that the peak location of each function is shifting up and down as the rotating microphone moves in a circular motion. The more smoothed peaks result from the comparatively narrow frequency band of the measured voice signals.

Figure 22 shows the SDE for all possible directions with 2° resolution in both the azimuth and elevation directions. The dominant peak in the SDE was found. As we examined earlier, bell-shaped side edges originate from the peak. Negative values were found in some regions. This result seems reasonable because a GCC function can have a negative value, which indicates that considerable content in the measured signals is out of phase. The final step, finding the location of the (positive) peak in the SDE, was carried out to estimate the direction of the source using equation (20).

Figure 22.

The source direction estimator when a source was at (45°, 0°). The estimated source direction was (39°, −1°) even though the silent region was included.

6. Discussion

The concept of the proposed localization cue, which is a (source) direction- and (microphone) position-dependent ICTD trajectory, can be applied to a circular microphone array as well. In general, if a microphone array is composed of (M+1) sensors, the information from every possible microphone pair is taken into consideration in order to improve the SSL resolution in practice. If an M-channel circular microphone array is located on the right side of the sphere and one additional microphone is fixed on the other side, the ICTD trajectory that depends on the position of each microphone in the circular array can be reproduced exactly as the proposed ICTD trajectory. Thus, the proposed localization cue-based 3-D SSL is also applicable to the circular microphone array. However, the more microphones that are used for SSL, the more costly it is to produce the microphone array, especially due to the price of the analog-to-digital converters (ADCs), which is proportional to the number of channels. Sequential sampling and signal processing could be an alternative that reduces the production cost.

On the other hand, the source position was supposed to be outside the rotating microphone array. However, noises emitted by the (ultrasonic) motor and its driver inside the sphere could be interior noisy sources. Thus, we needed to suppress the propagation of these noises into the microphone by combining the microphone and the electronic boards in a cylindrical block, as shown in Figures 16 and 17. In addition, the directivity of the pin-type microphone (QB686, RFQ) utilized in the research was compared with that of the omnidirectional 1/4 inch microphone (4178, B&K). It is generally known that the remote microphone is used for public speaking, i.e., the primary source is a speaker's voice. Thus, this type of microphone needs to have directionality. For comparison, two directivity patterns were measured and shown in Figure 23. The omni-directionality of the B&K microphone is clearly visible and the directivity pattern of the pin-type microphone is asymmetric with respect to the 90° direction. If we consider that the microphones are facing outward through the block cap and that the directivity pattern of the microphone is asymmetric, the interior noises are not a serious problem.

Figure 23.

Directivity patterns of the two microphones, i.e., 1/4 inch microphone (B&K) and pin-type microphone (RFQ). The asymmetry in the directivity of the pin-type microphone is clearly visible.

As mentioned before, we assumed that a sound source is fixed. In daily life, a source moves slowly compared with the rotation period of the array. However, in a situation where there is a fast-moving source, the patterns of the peak and the side edges in the SDE would be quite different compared with those in Figures 12 and 21. Usually, the movement of the source occurs along the azimuth angle axis. Therefore, the peak shape in the SDE would be stretched along the time axis according to the direction of the source movement and the magnitude of the peaks would be suppressed. In this case, without the information about the initial direction of the fast-moving source, its direction cannot be estimated using a single measurement because the peak shape in the SDE is not a time-dependent feature. Even though it is possible to track a fast-moving source when increasing the angular velocity of the rotating part, a safety issue can arise.

7. Conclusion

This paper proposed an ICTD trajectory as a new 3-D SSL cue and, as one of the possible ways to realize the proposed cue concept, a two-channel rotating microphone array was discussed. The characteristics of the ICTD trajectory induced by the circular motion of the rotating array were presented using the Ray-Tracing method: the mean value of the ICTD trajectory depends only on the azimuth angle of a source, and the shift angle corresponding to the maximum (or minimum) ICTD is directly related to the elevation angle of a source. Also, the amplitude of the ICTD trajectory is asymmetric with respect to the front side, which is caused by the circular motion of the rotating microphone on the right side of the sphere. The simulation results demonstrated that the amplitude of the ICTD trajectory is the essential factor for the SSL performance. The results of the two experiments carried out in the room environment demonstrated that the 3-D SSL method using the ICTD trajectory of the two-channel rotating microphone array can effectively localize a Gaussian white noise source and a voice source in 3-D space. It is noteworthy that the estimator takes the form of a line integral of GCC-PHAT functions, similar to the steered response power (SRP)-PHAT method [27, 28].

8. Acknowledgements

This work was supported by the second stage of the Brain Korea 21 Project, the Intelligent Robotics Development Program, one of the Frontier R&D Programs funded by the Ministry of Knowledge Economy (MKE) in 2012, and the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2010-0028680).

References

1. Kerstin (2007) Socially intelligent robots: dimensions of human-robot interaction. Phil. Trans. R. Soc. B. 362:679–704.

2. Fong, Illah, Kerstin (2003) A survey of socially interactive robots. Robot Auton. Syst. 42:143–166.

3. Anderson, M.L. (2003) Embodied cognition: A field guide. Artif. Intel. 149:91–130.

4. Valin, J. M., Michaud, Rouat, Létourneau (2003) Robust sound source localization using a microphone array on a mobile robot. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems 2:1228–1233.

5. Mumolo, Massimiliano, Gianni (2003) Algorithms for acoustic localization based on microphone array in service robotics. Robot Auton. Syst. 42:69–88.

6. Wang, Peter (1997) Voice source localization for automatic camera pointing system in videoconferencing. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 1:187–190.

7. Valenzise, Gerosa, Tagliasacchi, Antonacci, Sarti (2007) Scream and gunshot detection and localization for audio-surveillance systems. In: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance pp. 21–26.

8. Chen, Jacob, Yiteng, A.H. (2006) Time delay estimation in room acoustic environments: an overview. EURASIP J. Adv. Sig. Pr. 2006:1–19.

9. Chen, J.C., Kung, Ralph, E.H. (2002) Source localization and beamforming. IEEE Signal Proc. Mag. 19:30–39.

10. Georgiou, P.G., Chris, Panagiotis (1997) Robust time delay estimation for sound source localization in noisy environments. In: Proceedings of the IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics.

11. Hwang, Park (2011) Sound direction estimation using an artificial ear for robots. Robot Auton. Syst. 59:208–217.

12. Kim, Kazuhiro, Hiroshi, G. O. (2013) Improved Sound Source Localization and Front-Back Disambiguation for Humanoid Robots with Two Ears. In: Moonis, Tibor, Koen, V.H., Mark, Catholijn, M.J., Jan, editors. Recent Trends in Appl. Artif. Intell. Springer Berlin Heidelberg pp. 282–291.

13. Cheng, C. I., Gregory, H. W. (1999) Introduction to head-related transfer functions (HRTFs): Representations of HRTFs in time, frequency, and space. J. Audio. Eng. Soc. 49:231–249.

14. Shinn-Cunningham, B. G., Scott, Norbert (2000) Tori of confusion: Binaural localization cues for sources within reach of a listener. J. Acoust. Soc. Am. 107:1627–1636.

15. Huang, Ohnishi, Sugie (1998) Spatial Localization of Sound Sources: Azimuth and Elevation Estimation. In: Proceedings of IEEE Instrumentation and Measurement Technology Conference pp. 330–333.

16. Knapp, Glifford (1976) The generalized correlation method for estimation of time delay. IEEE Acoust. Speech Signal Proc. 24:320–327.

17. Azaria, David (1984) Time delay estimation by generalized cross correlation methods. IEEE Acoust. Speech Signal Proc. 32:280–285.

18. Gustafsson, Rao, B. D., Trivedi (2003) Source localization in reverberant environments: Modeling and statistical analysis. IEEE Speech Audio Proc. 11:791–803.

19. Brandstein, M. S., Harvey, F. S. (1997) A practical methodology for speech source localization with microphone arrays. Computer Speech and Language 11:91–126.

20. Kwon, Park (2009) Multiple sound source localization using the spatially mapped GCC functions. In: Proceedings of ICROS-SICE pp. 1773–1776.

21. Lee, Park (2014) Estimation of multiple sound source directions using artificial robot ears. Appl. Acoust. 77:49–58.

22. Woodworth, R. S., Schlossberg (1954) Experimental Psychology. New York: Holt.

23. Blauert (1997) Spatial hearing: the psychophysics of human sound localization. Cambridge: MIT Press.

24. Knapp, C.H., Carter, G.C. (1977) Estimation of time delay in the presence of source or receiver motion. J. Acoust. Soc. Am. 61:1545–1549.

25. Duda, R.O., William, L.M. (1998) Range dependence of the response of a spherical head model. J. Acoust. Soc. Am. 5:3048–3058.

26. Giannakis, G.B. (1998) Cyclostationary signal analysis. In: Madisetti, V.K., Williams, D.B., editors. Digital Signal Processing Handbook. Boca Raton, FL: CRC.

27. Cai, Wang (2010) Accelerated steered response power method for sound source localization using orthogonal linear array. Appl. Acoust. 71:134–139.

28. DiBiase, J. H., Silverman, H. F., Brandstein, M. S. (2001) Robust localization in reverberant rooms. In: Microphone Arrays. Springer Berlin Heidelberg.

Three-Dimensional Sound Source Localization Using Inter-Channel Time Difference Trajectory
