Abstract
Irregularities in microphone distribution enrich the diversity of spatial differences to decorrelate interferences from the beamforming target. However, the large degrees of freedom of irregular placements make it difficult to analyse and optimize array performance. This article proposes fast and feasible optimal irregular array design methods with improved beamforming performance for human speech. Important geometric features are extracted to be used as the input vector of the neural network structure to determine the optimal irregular arrangements of sensors. In addition, a hyperbola design method is proposed to directly cluster microphones in the hyperbola areas to produce rich differential distance entropies and yield significant signal-to-noise ratio improvements. These methods can be easily applied to guide non-computer-aided optimal irregular array designs for human speech in acoustic scenes such as immersive cocktail party environments.
Introduction
Microphone array processing uses time, spatial and spectral diversity to capture target acoustic signals and suppress interference and noise. Although regular arrays with uniformly spaced elements have been well studied, their performance is typically limited by problems that derive from the symmetrical array arrangement, such as spatial aliasing and inconsistent performance across the signal spectrum.1–3 It has been demonstrated that irregular arrays have the potential to outperform regular ones, especially for speech signals in immersive environments.4,5 However, the large degrees of freedom of irregular microphone placements make it difficult to analyse and optimize array performance. Although optimization approaches have been proposed for irregular arrays that minimize the residue between the gain pattern and a desired pattern shape, it is not clear which geometric features are crucial in determining the beamforming performance of a totally randomly distributed array, in the way that array aperture and element spacing are for regular arrays.6,7 Irregular array synthesis algorithms for antenna design have been proposed in the literature,8–10,11–13 but they are not readily applicable to broadband human speech signals with limited knowledge of possible source locations, dynamic acoustic scenes and unknown sound propagation models. In applications with moving sources and varied background noise, such as high-speed trains and crowded public scenes with irregular spaces, a direct and rapid array design method with stochastic arrangement of sensors based on prior knowledge of the acoustic environment remains lacking.
Therefore, this article proposes fast and feasible optimal irregular array design methods with enhanced signal-to-noise ratio (SNR) performance for human speech applications in immersive environments. Important distribution features, which are demonstrated theoretically and experimentally to have a strong impact on performance metrics, are extracted and applied as the input of a pre-trained neural network (NN) structure to predict array performance without time-consuming simulations. Optimal arrays with high probabilities of superior beamforming performance can be directly picked based on prior knowledge of the acoustic scene. A second, cluster-based design method is also developed from hyperbola theory. Optimal arrays can be directly generated for specified acoustic scenes by clustering microphones in the hyperbola areas to produce rich differential distance entropies and provide superior noise suppression capability, even without optimization.
Problem formulation
Assuming a three-dimensional (3D) space for the field of view (FOV),
where
To consider the impact of microphone positions, a delay-and-sum beamforming algorithm with inverse distance weighting is applied in this article. The expected optimal geometries should statistically enhance the performance of the array, regardless of the beamforming type. The power gain leaked between beamforming focal point
where
When searching for the optimal geometry features, the coefficients of the delay-and-sum beamformer can be considered as a function of the microphone coordinates. The distribution of differential paths from all microphone pairs to the potential source positions is an important statistical factor in determining the array's beamforming ability for noise suppression.7,14,15 By applying the expectation operations in equation (3), assuming the attenuation factors are uncorrelated with the pairwise distance differences of microphones and considering only direct path propagation,14–16 the output power of the beamformer for sound sources at and away from focal points can be expressed as
where the angular brackets represent the average power of source signals. As seen in equation (4), for the target source located at
As shown in equation (4), the key to noise source suppression is to increase the incoherence of the transmission phases, which are related to the differential-path distance (DPD) distribution of all microphone pairs with respect to the interfering sources and the focal point. For a fixed signal spectrum and a given distribution of possible sources, a limited range of DPD levels results in stronger partial coherence among the multichannel signals received from interfering sources and might degrade the SNR performance of the beamformer. Therefore, when searching for the optimal array geometry, instead of identifying the exact position of each microphone, the diversity and spread of DPDs are important for achieving the incoherence needed to suppress noise signals. DPD distribution with wide range and rich diversity (such as uniform distribution) with the phase terms spreading from
As shown in Table 1, combining with the typical array parameters of aperture and centroid, statistics based on DPD distributions can be considered as important geometric features to characterize similar arrays and predict the beamforming performance of arrays without any Monte Carlo experiments.14
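To make the role of DPD statistics concrete, the following is a minimal sketch of how they might be computed for one candidate array. It rests on assumptions that are not taken from the article: the DPD of a microphone pair is taken as the difference between the two microphones' residual path lengths (distance to the interfering source minus distance to the focal point), and the three summary statistics (range, standard deviation, histogram entropy) are illustrative choices rather than the exact feature set of Table 1. The function names `dpd_values` and `dpd_statistics` are hypothetical.

```python
import numpy as np


def dpd_values(mics, focal, source):
    """Differential-path distances over all microphone pairs (assumed definition).

    For microphone i the residual path is |source - m_i| - |focal - m_i|;
    the DPD of pair (i, j) is the difference of the two residual paths.
    """
    mics = np.asarray(mics, dtype=float)
    d_src = np.linalg.norm(mics - np.asarray(source, dtype=float), axis=1)
    d_foc = np.linalg.norm(mics - np.asarray(focal, dtype=float), axis=1)
    residual = d_src - d_foc
    i, j = np.triu_indices(len(mics), k=1)  # every unordered pair once
    return residual[i] - residual[j]


def dpd_statistics(dpd, bins=16):
    """Illustrative spread statistics of a DPD sample (not the article's exact features)."""
    counts, _ = np.histogram(dpd, bins=bins)
    p = counts[counts > 0] / counts.sum()
    entropy_bits = float(-(p * np.log2(p)).sum())
    return {
        "range": float(dpd.max() - dpd.min()),
        "std": float(dpd.std()),
        "entropy_bits": entropy_bits,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Eight microphones placed at random in a 10 x 10 x 2 m volume.
    mics = rng.uniform([0.0, 0.0, 0.0], [10.0, 10.0, 2.0], size=(8, 3))
    focal = np.array([5.0, 8.0, 1.0])
    noise = np.array([2.0, 3.0, 1.0])
    print(dpd_statistics(dpd_values(mics, focal, noise)))
```

With a prior distribution over source positions, the same statistics could be averaged over sampled source and focal-point pairs; a wider range and higher entropy indicate the richer DPD diversity that the text associates with better noise suppression.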
Table 1 also lists results from multi-way analysis of variance (ANOVA) to further demonstrate the strong correlation between the geometric features and key performance metrics of the array, such as mainlobe width (MLW) and mainlobe-to-peak-sidelobe ratio (MPSR).18
The proposed geometry features {
Key geometry features related to array beamforming performance.
ANOVA: analysis of variance; MPSR: mainlobe-to-peak-sidelobe ratio; MLW: mainlobe width; DPD: differential-path distance.
In the next section, the proposed geometric features are applied as the input vector for array optimization algorithms (e.g. a NN structure) to rapidly predict array SNR performance. For mutable acoustic applications, such as high-speed trains and crowded public scenes, a feasible cluster design method for stochastic arrays is proposed to directly generate optimal microphone clusters with good values of the proposed features and to guide fast non-computer-aided optimal array design.
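The delay-and-sum beamformer with inverse distance weighting used in this formulation can be illustrated with a short narrowband sketch. The model below is an assumption made for illustration, not the article's equations: free-field direct-path propagation with 1/distance attenuation, weights equal to the inverse of each microphone's distance to the focal point, a speed of sound of 343 m/s, and power averaged over a handful of tone frequencies as a crude stand-in for broadband speech. The names `das_response` and `relative_power_gain_db` are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def das_response(mics, focal, source, freq):
    """Complex delay-and-sum output for a unit-amplitude tone emitted at `source`.

    The beamformer is steered to `focal`. No microphone may coincide with
    the focal point or the source (division by zero).
    """
    mics = np.asarray(mics, dtype=float)
    d_f = np.linalg.norm(mics - np.asarray(focal, dtype=float), axis=1)
    d_s = np.linalg.norm(mics - np.asarray(source, dtype=float), axis=1)
    weights = 1.0 / d_f  # inverse distance weighting
    steering = np.exp(2j * np.pi * freq * d_f / SPEED_OF_SOUND)  # delay compensation
    received = np.exp(-2j * np.pi * freq * d_s / SPEED_OF_SOUND) / d_s
    return np.sum(weights * steering * received)


def relative_power_gain_db(mics, focal, source, freqs):
    """Power from `source` relative to a source at the focal point, in dB."""
    leaked = sum(abs(das_response(mics, focal, source, f)) ** 2 for f in freqs)
    target = sum(abs(das_response(mics, focal, focal, f)) ** 2 for f in freqs)
    return 10.0 * np.log10(leaked / target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mics = rng.uniform([0.0, 0.0, 0.0], [10.0, 10.0, 2.0], size=(16, 3))
    focal = np.array([5.0, 8.0, 1.0])
    noise = np.array([2.0, 3.0, 1.0])
    freqs = np.linspace(300.0, 3400.0, 32)  # rough speech band, assumed
    print(relative_power_gain_db(mics, focal, noise, freqs))
```

Because the phase seen by microphone i for an off-focus source depends on its residual path, the leaked power depends on the array only through pairwise residual differences, which is why the DPD distribution is the quantity of interest.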
Optimal geometries for stochastic arrays
NN method
Because the relationship between irregular array features and beamforming performance is complex and nonlinear, a deep NN, which is good at non-deterministic mapping, is applied in this section. Geometry features extracted from the acoustic scene along with microphone number are applied as the first layer of a NN structure to rapidly predict the array beamforming performance for human speech signals.
As shown in Figure 1, microphone positions and prior knowledge about the acoustic scene are considered as the inputs, including probability density functions of possible target and noise source locations, related to the usual moving tracks and speaking manners of sources’ behaviour. If no prior knowledge is available, uniform distribution is the default setting to evenly consider all the spatial points in FOV as the possible source location. The objective function is expressed as
where
The first layer of the optimization structure extracts five geometric features from the input vector, which are {

Neural network structure to predict array performance.
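The selection step described in this subsection, scoring candidate arrays from their geometric features and keeping the most promising ones, can be sketched as follows. Everything specific here is an assumption for illustration: the layer sizes, the ReLU activation, a six-element input (five features plus microphone count) and the randomly initialized weights, which stand in for a network that would in practice be pre-trained on simulated SNR data. The names `init_mlp`, `predict` and `pick_top` are hypothetical.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Random placeholder weights for a fully connected network (not trained)."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def predict(params, features):
    """Forward pass: ReLU on hidden layers, linear output (predicted score)."""
    h = np.asarray(features, dtype=float)
    for k, (w, b) in enumerate(params):
        h = h @ w + b
        if k < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h[..., 0]


def pick_top(params, feature_rows, k):
    """Indices and scores of the k candidate arrays with the highest predicted score."""
    scores = predict(params, feature_rows)
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_mlp([6, 32, 32, 1], rng)
    # One row per random candidate array: five features plus microphone count.
    candidates = rng.normal(size=(100, 6))
    best, scores = pick_top(params, candidates, k=5)
    print(best, scores)
```

The point of the design is that each candidate costs one feature extraction and one forward pass, rather than a full beamforming simulation.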
Hyperbola cluster design
It has been demonstrated that a high entropy and wide spread of the DPD distribution, derived from array geometric statistics and the acoustic scene, can increase the incoherence of noise components in the received multichannel signals and further improve beamforming SNR. However, because DPD statistics do not have a simple, intuitive geometric interpretation that can directly guide the allocation of microphones in mutable application environments, a cluster design method based on hyperbola areas is proposed in this section for non-computer-aided optimal array design.
By defining the hyperbola areas based on knowledge of acoustic scene, the hyperbola cluster (HC) method can be used to directly generate an optimal array with good values of geometry features and further guide non-computer-aided optimal microphone placements. As mentioned in equation (4), with pairwise microphones {
where different value of
As shown in Figure 2, a hyperbola is the locus of points with a constant absolute difference of distances to two foci. Given two spatial positions

Optimal array geometries. The blue circles represent distributed microphones. The red crosses represent the possible noise source locations. The red triangle represents desired target as focal point of beamformer. (a) Computer-aided GA array and (b) HC array.
Figure 2 gives the optimal arrays resulting from computer-aided heuristic searching19 and from the hyperbola cluster design method. The hyperbola areas are marked by dashed lines with different colours. In Figure 2(a), it can be seen that, in the optimal geometry obtained by genetic algorithm (GA) searching,5,19,20 most of the microphones are actually clustered in the grey hyperbola areas, which demonstrates the effectiveness of the hyperbola analysis. Figure 2(b) provides a corresponding HC array directly generated by the HC method. Simulations and real-case experiments with human speech signals have demonstrated that the designed HC arrays show comparable or even better beamforming SNR than the computer-aided optimized GA arrays.
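The geometric idea behind the HC method can be sketched with simple rejection sampling. This is an illustration under stated assumptions, not the article's procedure: a hyperbola area is taken to be the set of points whose differential distance to two foci (here a noise source and the focal point) falls inside a chosen band, the design space is an axis-aligned box, and microphones are spread over several bands so that their differential distances take diverse values. The names `differential_distance` and `sample_in_band` and the band limits are hypothetical.

```python
import numpy as np


def differential_distance(points, focus_a, focus_b):
    """|p - focus_a| - |p - focus_b| for one point or an array of points."""
    points = np.asarray(points, dtype=float)
    focus_a = np.asarray(focus_a, dtype=float)
    focus_b = np.asarray(focus_b, dtype=float)
    return (np.linalg.norm(points - focus_a, axis=-1)
            - np.linalg.norm(points - focus_b, axis=-1))


def sample_in_band(rng, low, high, focus_a, focus_b, band, n, max_tries=100_000):
    """Draw n points in the box [low, high] whose differential distance lies in `band`."""
    lo, hi = band
    kept = []
    for _ in range(max_tries):
        p = rng.uniform(low, high)
        if lo <= differential_distance(p, focus_a, focus_b) <= hi:
            kept.append(p)
            if len(kept) == n:
                return np.array(kept)
    raise RuntimeError("band too narrow for the design space")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = np.array([2.0, 5.0])   # focus a (top view, metres)
    target = np.array([8.0, 5.0])  # focus b, the beamformer focal point
    # By the triangle inequality the differential distance lies within
    # +/- the distance between the foci (6 m here), so bands are chosen inside it.
    bands = [(-5.0, -3.0), (-1.0, 1.0), (3.0, 5.0)]
    array = np.vstack([
        sample_in_band(rng, [0.0, 0.0], [10.0, 10.0], noise, target, band, n=4)
        for band in bands
    ])
    print(array)
```

Each band is bounded by two hyperbola branches with the noise source and the target as foci, so choosing the bands by hand gives direct control over the spread of differential distances without running an optimizer.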
Experimental results
Experiments in three acoustic scenes with different potential source distributions/spaces were performed to evaluate SNR performance for human speech signals. An audio cage measuring 10 × 10 × 2 m³ was used to simulate the indoor immersive environment for multi-source audio surveillance application cases. Three types of optimized arrays were employed: optimized arrays obtained by 100 GA iterations, arrays directly generated by HC and arrays selected from random distributions by a NN structure. In addition, the SNRs of a relevant random array set and of a regular array with the same centroid and dispersion are also provided for comparison.
Table 2 compares the SNR results of the random array set and regular arrays in cocktail party experiments. Sound sources transmitting broadband human speech signals are distributed in the audio cage and are randomly selected as the target and noise sources. For specified geometry sets with similar aperture, the average and top SNRs over 100 arrays are computed to demonstrate the effectiveness of the proposed array geometry optimization methods. All three types of optimal irregular geometries showed enhanced beamforming performance, which demonstrates the feasibility of the array design methods proposed in this article and the effectiveness of the proposed geometric features. Because the optimization is based on statistical parameters and probabilistic rules, an even stronger SNR improvement can be observed as the acoustic scene becomes more complicated (more potential speakers and more microphones in an overlapping noise/target space).
SNR (dB) comparisons for cocktail party experiments.
HC: hyperbola cluster; NN: neural network; SNR: signal-to-noise ratio.
Through heuristic searching optimization by GA, significant SNR improvements are observed in all cases. Superior arrays are identified that outperform regular arrays and random array sets (100 arrays for each set with similar aperture and design space). Moreover, even without time-consuming optimization or heuristic searching by GA, the direct design methods, HC and NN, directly generate optimal geometries with a higher probability of good beamforming performance. These directly designed optimal arrays show comparable or even better SNR results than computer-aided GA arrays, and large SNR improvements are observed compared with the corresponding regular arrays. Figure 3 gives the top-view gain patterns for real-case cocktail party experiments when targeting the top source. It can be seen that the optimal arrays showed superior noise suppression abilities in this scene, and their spatial resolutions are also improved in comparison with the regular ones.

Top-view gain patterns when beamforming at the top source. The red circles represent microphone positions. The triangles and crosses represent the target and noise source positions. (a) Regular array, (b) GA array, (c) HC array and (d) NN array.
Conclusion
This article has proposed feasible irregular array design methods with improved beamforming performance for cocktail party applications. Important geometric features have been proposed for use as NN structure inputs to predict array performance and directly pick optimal irregular geometries with good beamforming performance. In addition, in order to generate rich DPD entropy to better suppress noise signals, HC arrays derived from hyperbola areas can be directly generated based on prior knowledge of the acoustic scene, providing improved SNR performance comparable with that of other, more complex optimization methods. The proposed methods can be easily applied to guide non-computer-aided optimal irregular array design in dynamic multi-source acoustic applications such as mobile platforms with changing acoustic scenes and high-speed trains/aircraft with irregular spaces.
Footnotes
Handling Editor: Xi (Vincent) Wang
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Fundamental Research Funds for Central Universities (Grant No. 2018JBM008), National Natural Science Foundation of China (Grant No. 61501025) and Beijing Natural Science Foundation (Grant No. 4172045).
