Interior sound quality evaluation model based on convolutional neural networks and bi-directional long short-term memory

Abstract

Interior sound quality holds a central position in the vehicle quality evaluation system. It shapes users’ perception of the vehicle and significantly influences consumers’ purchasing decisions, so assessing it accurately is crucial. Numerous researchers have developed intelligent prediction models to measure in-vehicle sound quality. The deep convolutional neural network (CNN), owing to its ability to learn features automatically, has been widely applied to the analysis of noise and vibration problems. However, these studies face two issues: 1) CNN performs poorly in multi-dimensional feature extraction; 2) CNN has limited adaptability when dealing with dynamic data. To overcome these problems, an in-vehicle sound quality evaluation model integrating CNN and a bidirectional long short-term memory network (Bi-LSTM) was constructed. The results show that the model achieves a maximum prediction accuracy of 96% on the training set with no significant overfitting, demonstrating the generalization ability and prediction accuracy of the new model.

Keywords

interior sound quality evaluation; convolutional neural networks; bi-directional long short-term memory; machine learning for acoustics; sound signal processing

Introduction

As an indispensable means of transportation in modern society, the automobile greatly facilitates people’s daily lives. With the improvement of living standards, people’s requirements for automobile comfort are also increasing. Automotive noise, vibration, and harshness (NVH) performance is one of the key indicators of automotive comfort,¹ and in-vehicle sound quality is a core component of NVH performance evaluation. Therefore, the development of an effective sound quality evaluation model can not only significantly reduce the experimental cost but also simplify the sound quality evaluation process, which is of great practical significance and economic value to the development of the automotive industry.

In recent years, researchers have established many theoretical frameworks and computational models for evaluating in-vehicle sound quality. Most of them take physical acoustic indexes and psychoacoustic parameters as inputs and use multiple linear regression,² support vector machine,^3–6 radial basis function, BP neural network,^7,8 and wavelet neural network^9–11 as frameworks to build evaluation models, which have good predictive performance. However, because such modeling requires multiple parameter inputs and the processing of the parameters relies on complex acoustic theory and empirical knowledge,¹² scholars have begun to construct deep learning-based sound quality evaluation models. Deep learning-based methods use the time-frequency spectrogram of the noise as input, and deeper neural network architectures enable the extraction of hierarchical audio features that align better with human auditory perception. This approach captures the complexity of acoustic signals and approximates aspects of human auditory processing, thereby improving predictive accuracy in sound quality evaluation tasks. In 2017, Gauthier et al. compared three subjective evaluation models for sound quality, namely stepwise regression, elastic network, and the Lasso algorithm. It was found that the Lasso algorithm had the highest prediction accuracy, and these evaluation results can be used as a design guide for engineers for sound quality optimization.¹³ Ma Congjian et al. proposed a neural network-based sound quality evaluation method for the interior noise of pure electric vehicles; the average error of the prediction results was 9%, so the method can be used to predict and evaluate the sound quality of electric vehicle noise.¹⁴ In 2020, Huang Xiaorong et al. from Xihua University used time-frequency images of in-vehicle noise as inputs and extracted acoustic features with a deep convolutional neural network, which achieved good evaluation results. They also used a neuron visualization algorithm to analyze the feature maps learned by the deep CNN, revealing that the deep CNN feature learning process is similar to color filtering and Gabor filtering of noise images.¹⁵

The analysis of the above studies shows that the interior noise of automobiles contains rich information, and the sound quality evaluation method based on deep learning has good results. However, there are still the following problems to be solved:

1) Convolutional Neural Networks (CNNs) inherently excel in extracting spatial features through convolutional kernels, endowing the obtained representations with prominent localization properties. While CNNs demonstrate efficacy in processing temporal features when integrated with 1D convolutional layers (as evidenced by models such as Encodec, Wav2Vec, and WaveNet), their architecture tends to prioritize local receptive fields, which may hinder the capture of long-range temporal dependencies in complex acoustic signals. This limitation manifests not as an inherent incapability to process time-series data, but rather as a trade-off between local feature precision and global temporal context modeling.

2) Real-world acoustic datasets often exhibit dynamic characteristics (e.g., environmental noise variations under diverse operational conditions), requiring models to reconcile local feature extraction with global temporal dynamics. Although CNNs can handle certain dynamic scenarios via multi-scale convolutional designs, their reliance on static kernel operations may impede the modeling of nonlinear temporal evolutions and cross-dimensional dependencies. Notably, state-of-the-art frameworks like WaveNet have addressed this by integrating dilated convolutions to expand receptive fields, underscoring that CNN-based limitations in dynamic data processing are often mitigated through architectural innovations rather than inherent structural flaws.

To address the aforementioned challenges, this study proposes a sound quality evaluation model integrating Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks. The model’s predictive performance is validated using confusion matrix-based metrics, including precision, recall, F1-score, and overall accuracy, to quantify its classification performance across the sound quality levels. The results indicate that the proposed model can be used for the sound quality evaluation of the interior noise of special vehicles.

Interior noise sample collection and subjective evaluation experiment

Interior noise sample collection

In this study, interior noise samples were collected on an empty, smooth, hard road surface. The experimental conditions followed GB/T 18697-2002 “Acoustic-Measurement of noise inside motor vehicles”¹⁶ and GB/T 14365-2017 “Acoustic-Measurements of sound pressure level emitted by stationary road vehicles”,¹⁷ and the acquisition equipment was an LMS data acquisition front end. The windows of the vehicle were closed during the experiment, and the air conditioning and other auxiliary equipment were turned off. Figure 1 shows the layout of the measuring points at the co-driver’s position.

Figure 1.

Measurement points at the co-driver’s position.

Table 1 shows the key parameters of the test vehicle.

Table 1.

The key parameters of the test vehicle.

Parameter	Value/type
Complete vehicle mass	7.6 t
Engine displacement	6.7 L
Maximum power output	180 kW @ 2500 r/min
Maximum engine torque	925 N·m @ 1400 r/min
Drive type	4×4

Table 2 shows the model and parameters of the experimental acquisition equipment.

Table 2.

The model and parameters of the experimental equipment.

Equipment name	Model	Technical parameter
Sound pressure transducer	INV9202	Sensitivity: 47.4 mV/Pa
Sound pressure transducer	INV9202	Frequency response: 10 Hz to 8 kHz (±1 dB)
Multi-channel data acquisition system	LMS SCADAS SCM205	Maximum number of channels in the chassis: 24
Multi-channel data acquisition system	LMS SCADAS SCM205	Maximum sampling rate: 51.2 kHz
Acoustic calibrator	AHAI2601	Frequency: 1000 Hz
Acoustic calibrator	AHAI2601	Sound pressure level: 94 dB

Because interior noise is complex and influenced by multiple factors, three recordings of 11 s each were taken for each working condition. The test conditions are shown in Table 3.

Table 3.

Distribution of test conditions.

Working condition classification	Test condition
Engine idle condition	750 r/min, 1500 r/min, 2500 r/min
Constant speed condition	30 km/h, 40 km/h, 50 km/h, 60 km/h, 70 km/h, 80 km/h

After collection, the three recordings for each working condition were played back in the Testlab software, and the most stable one was selected for subsequent analysis. Finally, the noise samples were cropped to 5 seconds so that the evaluators would not lose concentration through prolonged listening.

Subjective evaluation of interior sound quality

The auditory experience of vehicle occupants serves as the ultimate gauge of cabin noise quality. Subjective evaluation experiments, grounded in human sound perception, provide a comprehensive assessment of noise samples. The experimental workflow is outlined in Figure 2. Before the experiment, the evaluators received listening training: in a quiet environment they listened to audio that was not part of the subsequent study and rated it using annoyance as the evaluation index. Evaluators with a normal hearing level were then selected on the basis of their feedback.

Figure 2.

Flow of subjective evaluation experiment.

The data generated by the rating method is relatively simple in form, easy to be processed and analyzed subsequently, and conducive to the subsequent establishment of a sound quality evaluation model. In summary, the rating method is selected as the subjective evaluation method in this paper.

Subjective evaluation experiment

The jury consists of 40 current graduate students in vehicle engineering, aged 22 to 26 with an average age of 24, all of whom have some understanding of automobile noise.

In this paper, annoyance is used as the evaluation index, and 10 scoring levels from 1 to 10 are defined: 1 to 4 indicates very dissatisfied, 5 indicates slightly dissatisfied, 6 to 7 indicates basically satisfied, 8 indicates very satisfied, and 9 to 10 indicates totally satisfied.

To give the jury members an overall impression of the noise samples and to keep the evaluation criteria consistent throughout the test, listening training was conducted before the formal test. A subset of the noise samples was shuffled into random order, and the jury members listened to and scored each of them in turn as a trial.

The evaluation experiment was conducted in an open office between 8 and 9 p.m. The LMS Testlab software was used to play the noise samples on the same computer, and each evaluator scored a sample according to their own perception after listening. The experimental atmosphere was relaxed and comfortable. Table 4 shows some of the evaluators’ ratings.

Table 4.

Table of partial evaluation scores.

Noise sample number	R1	R2	R3	R4	…	R40
T1	2	2	3	4	…	3
T2	3	3	2	3	…	4
T3	2	2	3	3	…	6
T4	4	3	5	5	…	8
T5	2	5	4	7	…	4
T6	5	6	4	9	…	6
…	…	…	…	…	…	…
T47	6	9	8	7	…	9
T48	7	6	6	8	…	7

Data correlation analysis

Because the subjective evaluation test for interior noise is highly subjective, the ratings must be checked after they are obtained. In this paper, the Pearson correlation coefficient is used to test the consistency of the jury’s evaluation results. Its range is [-1, 1]; a negative value indicates negative correlation, and the closer the absolute value is to 1, the stronger the correlation.

For each evaluator, the Pearson correlation coefficients between that evaluator’s scores and those of every other evaluator are computed, and their arithmetic mean is taken as that evaluator’s correlation coefficient. The maximum correlation coefficient obtained in this way is 0.73 and the minimum is 0.48, so the overall correlation is low. A likely reason is that the tested vehicle is a special vehicle whose interior noise environment is poor, which strongly affects the evaluators. Therefore, the scores of 10 evaluators were excluded, and the correlation coefficients of the remaining 30 evaluators were recalculated. After recalculation, the correlation coefficients of two evaluators were still below 0.7, and these were eliminated. Finally, the maximum correlation coefficient is 0.86, and the minimum is 0.70. This shows that the evaluation results of the retained evaluators are strongly correlated and can be used for subsequent calculation and analysis. Table 5 shows the correlation coefficients of 30 evaluators.
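The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each evaluator's coefficient is the mean of the pairwise Pearson coefficients against all other evaluators; the function names and the three-evaluator score lists are hypothetical and are not the paper's data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_pairwise_correlation(ratings):
    """For each evaluator, average the Pearson coefficient against every other evaluator.

    ratings: one score list per evaluator, all covering the same noise samples.
    """
    result = []
    for i, ri in enumerate(ratings):
        others = [pearson(ri, rj) for j, rj in enumerate(ratings) if j != i]
        result.append(sum(others) / len(others))
    return result

# Hypothetical scores of three evaluators over five samples (illustrative only).
ratings = [
    [2, 3, 2, 4, 5],
    [2, 3, 3, 5, 6],
    [5, 4, 4, 2, 1],  # rates in the opposite direction to the other two
]
coeffs = mean_pairwise_correlation(ratings)

# Index of the evaluator who agrees least with the rest of the jury.
least_consistent = min(range(len(coeffs)), key=coeffs.__getitem__)
```

Note that one strongly deviating evaluator also lowers the averaged coefficients of everyone else, which is one reason to recalculate the coefficients after each round of exclusions.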

Table 5.

Correlation coefficients for evaluators.

No.	Correlation coefficient	No.	Correlation coefficient	No.	Correlation coefficient	No.	Correlation coefficient
R1	0.72	R9	0.71	R17	0.70	R25	0.77
R2	0.73	R10	0.72	R18	0.72	R26	0.72
R3	0.75	R11	0.74	R19	0.72	R27	0.73
R4	0.75	R12	0.75	R20	0.86	R28	0.70
R5	0.76	R13	0.73	R21	0.72	R29	0.77
R6	0.78	R14	0.75	R22	0.74	R30	0.78
R7	0.85	R15	0.73	R23	0.75
R8	0.79	R16	0.80	R24	0.82

The arithmetic mean of the ratings for each condition was used as the final score, and the evaluation level for each condition was categorized according to the annoyance level to obtain a rating scale distribution graph (as shown in Figure 3).
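As a rough sketch of this labeling step, the mapping from a mean jury score to one of the five levels could look as follows. The level boundaries come from the scoring scheme above; how a non-integer mean is assigned to a level is not stated in the text, so rounding half up to the nearest integer score is an assumption made here, and `annoyance_level` is a hypothetical helper name.

```python
def annoyance_level(mean_score):
    """Map a mean jury score on the 1-10 scale to one of the five levels."""
    # Assumption: a non-integer mean is rounded half up to the nearest integer score.
    s = int(mean_score + 0.5)
    if s <= 4:
        return "very dissatisfied"
    if s == 5:
        return "slightly dissatisfied"
    if s <= 7:
        return "basically satisfied"
    if s == 8:
        return "very satisfied"
    return "totally satisfied"

# Illustrative scores for one working condition (not the full jury).
scores = [2, 2, 3, 4, 3]
final_score = sum(scores) / len(scores)
label = annoyance_level(final_score)
```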

Figure 3.

Distribution of evaluation levels.

As seen from the distribution graph, the number of “very dissatisfied” and “slightly dissatisfied” samples totaled 26, accounting for 54% of the overall samples. These classification levels can be used as the labeling basis for supervised learning.

Deep learning based evaluation model for in-vehicle sound quality

Audio data preprocessing

The noise audio data are preprocessed to convert the one-dimensional signal into a two-dimensional time-frequency representation, which enhances the features and facilitates deep feature extraction by the CNN. In this paper, the preprocessing combines the logarithmic Mel spectrum with time-frequency masking.

Logarithmic Mel spectrum

Studies have shown that people can easily distinguish between 500 and 1000 Hz sounds but have difficulty distinguishing between 7500 and 8000 Hz.^18–20 The Mel scale was therefore proposed. It is approximately linear in frequency in the low frequency band and logarithmic in the high frequency band, so that equal differences on the Mel scale correspond to roughly equal perceived differences in pitch.

F_{mel} = 2595 \log_{10} \left(1 + \frac{f}{700}\right)

(1)

f = 700 \left(10^{\frac{F_{mel}}{2595}} - 1\right)

(2)

Where $F_{mel}$ denotes the Mel frequency and $f$ denotes the actual frequency in Hz.
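Equations (1) and (2) translate directly into code. The sketch below (function names are illustrative) also checks the perceptual claim made earlier: the 500 Hz gap at low frequency spans far more Mel units than the same gap near 8 kHz.

```python
from math import log10

def hz_to_mel(f_hz):
    """Eq. (1): convert a frequency in Hz to the Mel scale."""
    return 2595.0 * log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Eq. (2): convert a Mel value back to a frequency in Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 500 Hz difference, measured in Mel units at low and high frequency.
low_gap = hz_to_mel(1000) - hz_to_mel(500)
high_gap = hz_to_mel(8000) - hz_to_mel(7500)
```

With these constants 1000 Hz maps to approximately 1000 Mel, which is the usual anchor point of this form of the scale.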

Using the relationship between the Mel scale and normal frequency, one designs a Mel filter bank, and by using the Mel filter bank, one can obtain a log-Mel spectrogram, whose operation flow is shown in Figure 4.

Figure 4.

Extraction process of logarithmic Mel spectrum.

Figure 5 illustrates a logarithmic Mel spectrum, where the horizontal axis is time, the vertical axis is frequency, and the color indicates the amplitude at a given frequency and point in time. Such an image provides both time and frequency information about the noise signal for deep learning feature extraction, which helps to better simulate the auditory sensation of the human ear. Considering the training requirements, we segmented the audio data into 1-second segments and generated 812 color spectrogram images of the noise samples by batch processing.

Figure 5.

Logarithmic Mel spectrum under a certain working condition.

Time-frequency masking

Time-frequency masking is a data augmentation method realized by zeroing the pixel values in some regions of the time-frequency image. It increases the sample size and perturbs the original data, thus reducing model overfitting as much as possible. The time-frequency image dimension is 80 (frequency) × 100 (time). The maximum mask width F in the frequency domain is 20, and the starting position f is uniformly sampled from the [0, 80-F] interval. The mask length f-len is randomly selected from the [1, F] interval. The maximum length of the time domain mask T is 25, the starting position t is uniformly sampled from the [0, 100-T] interval, and the mask length t-len is randomly selected from the [1, T] interval.²¹ Figure 6 shows the spectrogram after time-frequency masking, where the image is augmented by zeroing the specified frequency intervals and time intervals of the time-frequency image, and the dataset is expanded to 2436 samples.
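A minimal sketch of the masking procedure, using the sampling intervals given above. It assumes one frequency mask and one time mask per augmented image and represents the spectrogram as a plain list of frequency rows; the function name, the `rng` parameter, and the all-ones test image are illustrative choices rather than details from the paper.

```python
import random

def time_frequency_mask(spec, F=20, T=25, rng=random):
    """Return a copy of spec with one frequency band and one time band zeroed.

    spec: list of frequency rows, each a list of values over time (80 x 100 here).
    """
    n_freq, n_time = len(spec), len(spec[0])
    masked = [row[:] for row in spec]  # leave the original image untouched

    f0 = rng.randint(0, n_freq - F)    # start of frequency mask, in [0, 80 - F]
    f_len = rng.randint(1, F)          # frequency mask width, in [1, F]
    t0 = rng.randint(0, n_time - T)    # start of time mask, in [0, 100 - T]
    t_len = rng.randint(1, T)          # time mask length, in [1, T]

    for f in range(f0, f0 + f_len):
        masked[f] = [0.0] * n_time     # zero whole frequency rows
    for row in masked:
        row[t0:t0 + t_len] = [0.0] * t_len  # zero the time columns in every row
    return masked

rng = random.Random(0)                 # fixed seed so the example is repeatable
spec = [[1.0] * 100 for _ in range(80)]
augmented = time_frequency_mask(spec, rng=rng)
```

Because the start positions are drawn from [0, 80-F] and [0, 100-T], a mask of maximum length still fits inside the image, so no clipping is needed.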

Figure 6.

Time-frequency masked image under a certain working condition.

The network structure of CNN and BiLSTM

Convolutional neural networks (CNNs), a class of feed-forward neural networks, are extensively applied in image recognition, audio classification, and target detection.²² Their unidirectional architecture is adept at processing static data, making them well-suited for classification tasks. Each artificial neuron in a CNN responds only to a local region of the input, which is why CNNs excel in image processing.

As an improved version of the recurrent neural network (RNN), BiLSTM introduces bidirectional information flow and gating units (the forget gate, input gate, and output gate) to model forward and backward long-term dependencies in time series data. It performs well in natural language processing, speech recognition, and other sequence tasks. The core difference between the two is that a CNN focuses on the hierarchical extraction of spatial local features, which suits structured data with translation invariance, whereas BiLSTM models temporal dependencies along the sequence.

Convolutional layer

The 2D convolution generates the output feature map by sliding the convolution kernel over the input image and performing a weighted sum over each local region. At each location, the convolution kernel is multiplied elementwise with the corresponding portion of the input image, and the products are summed to obtain one pixel value of the output feature map. Adjusting the kernel size, padding, and stride controls the size of the feature map and how features are extracted. Multiple convolutional kernels capture different features to enhance network representation. Eventually, the feature map is passed to the next layer for processing. The feature maps handled by a convolutional layer are 3D tensors whose dimensions are written as $h \times w \times c$, where $h$, $w$, and $c$ denote the height, width, and number of channels, respectively. Assume that the feature map of the previous input layer has dimensions $h_{in} \times w_{in} \times c_{in}$ and that the feature map output from the current convolutional layer has dimensions $h_{out} \times w_{out} \times c_{out}$. Then the relationship between the spatial dimensions of the two is as follows:

\begin{cases} h_{out} = \frac{h_{in} - k + 2p}{s} + 1 \\ w_{out} = \frac{w_{in} - k + 2p}{s} + 1 \end{cases}

(3)

Where $s$ denotes the stride of the convolution kernels in the convolution layer when traversing the input feature map, $p$ is the padding, a parameter used to adjust the spatial size of the input and output feature maps, and $k$ is the side length of the convolution kernel. If the size of the convolution kernel of the current layer is $k \times k$, then the structure of the convolution layer is a 4-dimensional tensor of $k \times k \times m \times n$, which means that the convolution layer has a total of $m$ channels, and each channel contains $n$ convolution kernels.
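Equation (3) can be checked with a small helper. The kernel sizes, padding, and strides below are hypothetical values chosen for an 80 × 100 input, since the paper does not list them here; integer division is used so that the result is also defined when the stride does not divide evenly, which is the usual convention.

```python
def conv_output_size(size_in, k, p=0, s=1):
    """Eq. (3): spatial output size along one axis for kernel k, padding p, stride s."""
    return (size_in - k + 2 * p) // s + 1

# A 3 x 3 kernel with padding 1 and stride 1 preserves the 80 x 100 input size.
h_out = conv_output_size(80, k=3, p=1, s=1)
w_out = conv_output_size(100, k=3, p=1, s=1)

# A 2 x 2 window with stride 2 (for example, a pooling layer) halves each axis.
h_pooled = conv_output_size(h_out, k=2, p=0, s=2)
w_pooled = conv_output_size(w_out, k=2, p=0, s=2)
```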

Activation function and maximum pooling

Activation functions are typically applied after convolutional layers. These nonlinear functions allow the network to capture complex nonlinear relationships between input and output, which improves model generalization. Typical activation functions include Tanh, ReLU, and Sigmoid. Sigmoid and Tanh can suffer from vanishing gradients near saturation, whereas ReLU mitigates this issue. Consequently, this study exclusively employs ReLU as the activation function. Pooling is a downsampling operation that reduces the number of features in the model. Common pooling techniques include average, global, dynamic, and maximum pooling. Maximum pooling reduces feature map resolution and computational cost by keeping only the maximum value in each subregion. To extract image features effectively, maximum pooling is used for all pooling operations in this paper, and its mathematical expression is as follows:

y = \max_{x_{i} \in s} x_{i}

(4)

Where $x_{i}$ denotes a value in the specified subregion, $s$ denotes the subregion covered by the pooling window, and $\max$ denotes the maximum pooling operation.
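As an illustration of Eq. (4), the sketch below applies ReLU followed by maximum pooling to a small feature map. The 2 × 2 window with stride 2 and the 4 × 4 input are assumptions for demonstration; the paper does not state the pooling window size in this section.

```python
def relu(fmap):
    """Elementwise ReLU: negative activations become zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2d(fmap, size=2, stride=2):
    """Eq. (4): keep the maximum value of every size x size subregion."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [fmap[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(max(window))
        out.append(row)
    return out

fmap = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 6.0, 5.0, 1.0],
    [7.0, 2.0, 9.0, 8.0],
    [0.0, -1.0, 3.0, 4.0],
]
pooled = max_pool2d(relu(fmap))  # 4 x 4 input becomes a 2 x 2 output
```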

BiLSTM layer

The Long Short-Term Memory (LSTM) network is an advanced variant of the Recurrent Neural Network (RNN). Structurally akin to RNNs, LSTM introduces a long-term memory module within its hidden layers, enabling the retention of early time series features and mitigating the gradient vanishing issue inherent in RNNs.

The structure of the LSTM cell consists of three special gates, which are the input gate, the forget gate, and the output gate. The information transfer process of the LSTM cell at moment t is shown in Figure 7. In Figure 7: $h_{t - 1}$ is the output of the state unit at moment $t - 1$ ; $X_{t}$ is the input at moment $t$ ; $+$ is the addition operation; $\times$ is the product operation.

Figure 7.

Basic structure of the LSTM cell.

For a given time step $t$ and input $X_{t}$ , the LSTM network calculates the following values:

Forget gate:

f_{t} = σ (W_{f} \cdot [h_{t - 1}, X_{t}] + b_{f})

(5)

Input gate:

i_{t} = σ (W_{i} \cdot [h_{t - 1}, X_{t}] + b_{i})

(6)

Candidate cell state:

{\tilde{C}}_{t} = \tanh (W_{c} \cdot [h_{t - 1}, X_{t}] + b_{c})

(7)

Cell state:

C_{t} = f_{t} * C_{t - 1} + i_{t} * {\tilde{C}}_{t}

(8)

Output gate:

o_{t} = σ (W_{o} \cdot [h_{t - 1}, X_{t}] + b_{o})

(9)

Hidden state:

h_{t} = o_{t} * \tanh (C_{t})

(10)

Where:

$\sigma$: the sigmoid activation function;
$W_{f}$ and $b_{f}$: the weight matrix and bias vector of the forget gate;
$f_{t}$: the output vector of the forget gate at time $t$;
$i_{t}$: the output vector of the input gate at time $t$;
$W_{i}$ and $b_{i}$: the weight matrix and bias vector of the input gate;
$W_{c}$ and $b_{c}$: the weight matrix and bias vector of the candidate cell state ${\tilde{C}}_{t}$;
$C_{t - 1}$ and $C_{t}$: the information stored in the state unit at time $t - 1$ and time $t$;
$o_{t}$: the output value of the output gate;
$h_{t}$: the output value of the state unit at time $t$;
$W_{o}$ and $b_{o}$: the weight matrix and bias vector of the output gate.
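Equations (5) to (10) describe a single time step, which can be written out as a short sketch in plain Python. The `params` dictionary layout, the helper names, and the all-zero weights in the example are hypothetical; a real model would use learned weights and a tensor library such as PyTorch.

```python
from math import exp, tanh

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def affine(W, b, v):
    """Compute W · v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (5)-(10).

    params maps 'f', 'i', 'c', 'o' to (W, b) pairs acting on [h_prev, x_t].
    """
    z = h_prev + x_t                                           # concatenation [h_{t-1}, X_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]          # Eq. (5), forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]          # Eq. (6), input gate
    c_tilde = [tanh(a) for a in affine(*params["c"], z)]       # Eq. (7), candidate state
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]  # Eq. (8)
    o = [sigmoid(a) for a in affine(*params["o"], z)]          # Eq. (9), output gate
    h = [ot * tanh(ct) for ot, ct in zip(o, c)]                # Eq. (10), hidden state
    return h, c

# Hidden size 2, input size 1, all weights zero: every gate evaluates to 0.5.
zero_W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zero_b = [0.0, 0.0]
params = {name: (zero_W, zero_b) for name in ("f", "i", "c", "o")}
h_t, c_t = lstm_step([0.5], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights the forget gate is 0.5 and the candidate state is 0, so the cell state is simply halved; this makes the role of each term in Eq. (8) easy to verify by hand.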

The BiLSTM architecture improves model performance through a bidirectional processing mechanism. The structure contains two independent LSTM layers: the forward layer processes the input in chronological order, starting from the beginning of the sequence, and the backward layer processes it in reverse order, starting from the end. This bidirectional strategy enables the model to capture the forward and backward correlation features of the sequence data at the same time.
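The bidirectional wiring itself is independent of the cell used. In the sketch below, a simple exponential moving average stands in for the LSTM cell so that the data flow is easy to follow; this toy cell and the function names are assumptions for illustration only and do not reproduce the paper's layers.

```python
def run_recurrent(seq, step, state):
    """Apply a recurrent step function along seq and collect the state at each position."""
    outputs = []
    for x in seq:
        state = step(x, state)
        outputs.append(state)
    return outputs

def bidirectional(seq, step_fwd, step_bwd, init=0.0):
    """Pair the forward-pass and backward-pass outputs at every time step."""
    fwd = run_recurrent(seq, step_fwd, init)
    bwd = run_recurrent(seq[::-1], step_bwd, init)[::-1]  # re-align to original order
    return list(zip(fwd, bwd))

def toy_cell(x, h):
    """Stand-in for an LSTM cell: an exponential moving average of the inputs."""
    return 0.5 * h + 0.5 * x

# An impulse at the first time step followed by silence.
outputs = bidirectional([1.0, 0.0, 0.0], toy_cell, toy_cell)
```

At the second and third positions the forward output still carries a trace of the earlier impulse while the backward output, which has only seen later inputs, is zero; combining both gives each position context from both directions.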

Classifier

Following convolution and pooling, the resultant 3D array is flattened into a 1D array for input into the BiLSTM network, extracting spectrogram features in depth. These features are then classified by a classifier comprising fully connected layers and a Softmax function, as depicted in Figure 8. The classifier’s output is the probability value for the sound quality evaluation level.

Figure 8.

The structure diagram of the classifier.

The fully connected layer facilitates complete data propagation by interlinking the outputs of the final pooling layer with subsequent neurons. It learns the nonlinear mapping from input to output, aligning results with sample label space. For enhanced adaptability and generalization, this study employs two fully connected layers in the model architecture.

Using the Softmax function, the real vector output of the fully connected layer can be converted into a probability distribution so that the value range of each type of sound quality evaluation level is in [0,1], and its sum is 1. The specific function expression is as follows:

y_{i} = \frac{\exp (x_{i})}{\sum_{j = 1}^{N} \exp (x_{j})}

(11)

Where $y_{i}$ is the probability that the interior noise sample is classified as the $i$-th class, $x_{i}$ and $x_{j}$ are elements of the input vector, and $N$ is the number of classification levels.
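A short sketch of Eq. (11) for N = 5 classes follows. The logits are made-up values; subtracting the maximum before exponentiating is a common numerical-stability step that leaves the result of Eq. (11) unchanged.

```python
from math import exp

def softmax(logits):
    """Eq. (11): turn a real-valued vector into a probability distribution."""
    m = max(logits)                      # shift for numerical stability
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the last fully connected layer for the five levels.
logits = [2.0, 1.0, 0.1, -1.0, 0.5]
probs = softmax(logits)

# The predicted level is the class with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```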

The sound quality rating is divided into five categories, so here N = 5. The dataset is constructed by converting the audio into logarithmic Mel spectra and augmenting them with time-frequency masking. Based on the operations introduced above, this study constructs a hybrid model of two-dimensional convolution and BiLSTM. Through the Softmax function, the network output is mapped to the probability distribution over sound quality levels. Figure 9 shows the structure of the composite model.

Figure 9.

CNN-BiLSTM hybrid network structure.

Loss function

Loss functions measure the gap between the model’s predictions and the actual results and guide the weight updates that improve prediction accuracy. Common loss functions include mean square error, cross entropy, and logarithmic loss functions. In multi-class classification tasks, the cross-entropy loss function is often chosen to measure the difference between the predicted probability distribution and the true distribution over the categories, and its mathematical representation is as follows:

H (P, Q) = - \sum_{i = 1}^{n} P (x_{i}) \log (Q (x_{i}))

(12)

Where $P (x_{i})$ is the true distribution of the sample, which in this paper denotes the distribution of the true interior noise evaluation ratings, and $Q (x_{i})$ denotes the probability distribution predicted by the model, that is, the distribution of the interior noise evaluation ratings calculated by the Softmax function.
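Equation (12) can be evaluated directly for a one-hot label. In this sketch the true level is the third of five classes and the two predicted distributions are made-up values; the small `eps` guard against log(0) is an implementation detail added here, not part of Eq. (12).

```python
from math import log

def cross_entropy(p_true, q_pred, eps=1e-12):
    """Eq. (12): H(P, Q) = -sum_i P(x_i) * log(Q(x_i))."""
    return -sum(p * log(max(q, eps)) for p, q in zip(p_true, q_pred))

p = [0, 0, 1, 0, 0]                           # one-hot true label
q_confident = [0.02, 0.03, 0.90, 0.03, 0.02]  # mostly correct prediction
q_poor = [0.30, 0.30, 0.10, 0.20, 0.10]       # probability mass on wrong classes

loss_confident = cross_entropy(p, q_confident)
loss_poor = cross_entropy(p, q_poor)
```

For a one-hot label the sum reduces to the negative log of the probability assigned to the true class, so the loss falls toward zero as that probability approaches 1.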

Optimizer

In backpropagation, an optimizer minimizes the loss function by updating the neuron weights. Common optimizers include the gradient descent method, the momentum optimization method, and the adaptive learning rate method. Although the gradient descent method is a simple algorithm, it is very sensitive to the learning rate, so in this paper, we choose to use the adaptive learning rate method to update the network weights, and the specific algorithm is the Adam algorithm.

Adam’s algorithm adeptly manages the learning rate and gradient direction via first- and second-order moment estimations, offering robust hyperparameter interpretability and model stability, thereby facilitating training. Developed by Kingma and Ba (2015),²³ Adam computes first-order moments and second-order moments to scale updates, with bias correction for initial estimates.
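The update rule can be sketched for a single scalar parameter. The learning rate of 0.001 matches the value used in this paper; the decay rates `beta1 = 0.9`, `beta2 = 0.999` and `eps = 1e-8` are the defaults suggested by Kingma and Ba and are assumed here, since the paper does not report them. The quadratic objective is purely illustrative.

```python
from math import sqrt

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
first_theta, _, _ = adam_step(1.0, 2.0, 0.0, 0.0, t=1)

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```

Thanks to the bias correction, the very first step has a magnitude close to the learning rate regardless of the gradient scale, which is part of why Adam is less sensitive to the choice of learning rate than plain gradient descent.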

Stochastic inactivation and L2 regularization

To reduce the risk of overfitting, a Dropout layer is added after the two fully connected layers. Dropout assigns a retention probability to each neuron in the layer, so that each neuron may be randomly ‘inactivated’ during training. An inactivated neuron no longer takes part in forward propagation or back propagation. This reduces the interdependence of parameters in the model and enhances its generalization ability. At the same time, random inactivation reduces the effective complexity of the model during training.

In the deep layers of a network, the Dropout value generally does not exceed 0.5. Tests on a small subset of the data showed that the model in this paper generalizes better when Dropout equals 0.4, so the Dropout coefficient is set to 0.4.
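A minimal sketch of random inactivation follows. It assumes that the coefficient 0.4 is the probability of dropping a neuron, which is how PyTorch's `nn.Dropout(p)` interprets its argument, and it uses the inverted-dropout scaling that PyTorch also applies, so that no rescaling is needed at test time. The function name and the all-ones activations are illustrative.

```python
import random

def dropout(values, p_drop=0.4, training=True, rng=random):
    """Inverted dropout: zero each value with probability p_drop during training."""
    if not training:
        return list(values)            # at test time every neuron stays active
    keep = 1.0 - p_drop
    # Surviving activations are scaled by 1 / keep so the expected value is unchanged.
    return [v / keep if rng.random() >= p_drop else 0.0 for v in values]

rng = random.Random(0)                 # fixed seed so the example is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
```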

Beyond stochastic deactivation, regularization is a key strategy to mitigate overfitting. It enhances model robustness by incorporating parameter-related terms into the loss function. L1 and L2 are prevalent methods; L1 induces sparsity, while L2 amplifies the penalty on larger parameter values by squaring them. This study employs L2 regularization with a coefficient of 0.0001.
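The effect of the L2 term on the loss can be sketched as follows. The coefficient 0.0001 is the value used in this study; the weights and the data loss are made-up numbers, and the penalty is written as the coefficient times the sum of squared weights, which is one common convention (some formulations include an extra factor of 1/2).

```python
def l2_penalty(weights, coeff=0.0001):
    """L2 regularization term: coeff times the sum of squared weights."""
    return coeff * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, coeff=0.0001):
    """Total loss = data loss (for example, cross entropy) + L2 penalty."""
    return data_loss + l2_penalty(weights, coeff)

weights = [0.5, -2.0, 3.0]   # hypothetical model weights
total = regularized_loss(0.25, weights)
```

Because the weights are squared, the largest weight (3.0) contributes most of the penalty, which is what pushes the optimizer toward smaller, more evenly distributed weights.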

Model training and validation

This study constructs an objective in-vehicle sound quality evaluation model using Python and PyTorch, an open-source deep learning framework. The dataset is split into training (80%) and testing (20%) sets. The optimal network parameters are determined on the training set and stored, and the testing set is then evaluated with gradient computation disabled. Model accuracy is subsequently validated using a confusion matrix. The initial learning rate was set to 0.001, and the hyperparameters were selected as shown in Table 6.

Table 6.

Selection of hyperparameters.

Learning rate	Optimizer	Dropout	L2 regularization coefficient	Mini-batch size
0.001	Adam	0.4	0.0001	64

Figure 10 compares the accuracy of the CNN-BiLSTM model on the training and testing sets. Throughout training, the accuracy on the training set is higher than that on the testing set, but the gap is only 2.6 percentage points, indicating that the model shows no obvious overfitting. The best accuracy on the testing set is 93.4%. Figure 11 presents the loss values of the CNN-BiLSTM model on the training and testing sets, both of which remain low after 600 epochs.

Figure 10.

Accuracy curves for training and testing sets (CNN-BiLSTM).

Figure 11.

Loss value curves for training and testing sets (CNN-BiLSTM).

Figure 12 compares the accuracy of the CNN model on the training and testing sets; the best accuracy on the testing set is 89.9%. Figure 13 shows the loss values of the CNN model on the training and testing sets. After 300 epochs, both remain low.

Figure 12.

Accuracy curves for training and testing sets (CNN).

Figure 13.

Loss value curves for training and testing sets (CNN).

Figure 14 shows the accuracy of the CNN-Attention model with the SENet (Squeeze-and-Excitation Networks) mechanism on the training and testing sets, where the best accuracy on the testing set reaches 91.1%. Figure 15 illustrates the loss values for the CNN-Attention model’s training and testing sets. Following 250 training epochs, both stabilize at low levels.

Figure 14.

Accuracy curves for training and testing sets (CNN-Attention).

Figure 15.

Loss value curves for training and testing sets (CNN-Attention).

In summary, the highest accuracy of the CNN-BiLSTM model on the testing set is 3.5 and 2.3 percentage points higher than that of the CNN model and the CNN-Attention model, respectively. Thus, its predictive performance exceeds that of both.

For the accuracy assessment of the CNN-BiLSTM model, 200 log-Mel spectrograms, 40 randomly selected from each of the five sound quality classes, constituted the validation set. The predicted class for each sample was obtained with PyTorch’s argmax function with gradients disabled.

The confusion matrix, also known as an error matrix, is a standard tool in machine learning and statistics for assessing the performance of a classification model. It shows how well the model classifies each category by recording both correct and incorrect predictions. This study uses the confusion matrix to present the model validation results.

Figure 16 depicts the confusion matrix for model evaluation. True labels are shown in rows and predicted labels in columns. Diagonal entries give the number and percentage of correct predictions for each class, while off-diagonal entries show the misclassifications, that is, which classes are confused with one another.

Figure 16.

Confusion matrix for the evaluation model.
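How such a matrix is assembled and read can be sketched in plain Python. The helper names and the three-class toy labels below are illustrative assumptions, not the data behind Figure 16; the layout follows the convention stated above (rows are true labels, columns are predicted labels).

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_metrics(matrix):
    """Return a (precision, recall) tuple for every class."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        tp = matrix[k][k]                                # diagonal entry
        row_total = sum(matrix[k])                       # samples whose true class is k
        col_total = sum(matrix[r][k] for r in range(n))  # samples predicted as k
        recall = tp / row_total if row_total else 0.0
        precision = tp / col_total if col_total else 0.0
        metrics.append((precision, recall))
    return metrics


def overall_accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


# Toy labels for three classes (illustrative only, not the paper's data).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)                    # [[2, 1, 0], [0, 2, 1], [1, 0, 3]]
print(overall_accuracy(cm))  # 0.7
```

Recall is computed along a row and precision along a column, which is why the two can differ for the same class.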

The model predicts well at both ends of the evaluation scale. For the label ‘basically satisfied’, which lies in the middle of the scale, the accuracy on the validation set is about 85%. The overall accuracy on the validation set is 90%, indicating that the model has high predictive capability. Table 7 shows the precision, recall, and F1-score for each class.

Table 7.

Precision, recall, and F1-scores for each class.

Class	Precision	Recall	F1-score
Very dissatisfied	0.95	0.9	0.92
Slightly dissatisfied	0.78	0.9	0.84
Basically satisfied	0.87	0.85	0.86
Very satisfied	0.97	0.9	0.93
Totally satisfied	0.95	0.95	0.95
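As a consistency check on Table 7, the F1-score can be recomputed as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below uses only the values printed in the table; the function and variable names are chosen for illustration. Because the validation set holds the same number of samples in every class, the mean of the per-class recalls should also equal the overall accuracy.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 7, in row order.
table7 = [(0.95, 0.90), (0.78, 0.90), (0.87, 0.85), (0.97, 0.90), (0.95, 0.95)]

f1_values = [round(f1_score(p, r), 2) for p, r in table7]
mean_recall = sum(r for _, r in table7) / len(table7)

print(f1_values)              # [0.92, 0.84, 0.86, 0.93, 0.95]
print(round(mean_recall, 2))  # 0.9
```

The recomputed F1-scores match the table, and the mean recall of 0.90 agrees with the reported overall validation accuracy of 90%.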

Conclusion

In this study, to account for both the spatial and temporal characteristics of noise, we built a CNN model incorporating BiLSTM. The introduction of BiLSTM also enhances the model’s adaptability to dynamic changes in sound. We collected 68 in-vehicle noise samples in accordance with the national standard at a site that met the test conditions. After preprocessing, we obtained data samples suitable for subjective evaluation and then organized a jury to conduct subjective evaluation experiments on the in-vehicle noise. After removing data with poor correlation, we obtained the distribution of subjective evaluation ratings. The audio data were then converted into high-dimensional images by log-Mel spectrogram transformation, yielding feature maps that contain information in both the time and frequency domains, and the dataset required for the model was obtained using data augmentation. The results show that the model achieves 93.4% prediction accuracy on the testing set, 3.5 percentage points higher than the testing-set accuracy of the CNN model, demonstrating the usability and accuracy of the proposed sound quality evaluation model fusing CNN and BiLSTM.

Limitations and future work

Although the model performs well, it has limitations. The training data come only from a specific vehicle type, so coverage is limited and the model’s generalization to commercial and passenger vehicles has not been fully verified. In addition, the computational complexity of the fusion model is high, so response delays may occur in real-time evaluation scenarios.

The dataset of this study contains only interior noise samples from a specific type of military transport vehicle, which may limit the generalization performance of the model across a wider range of vehicle types. Military transport vehicles are unique in structural design and typical working conditions, and their noise source distribution and spectral characteristics differ from those of other vehicles. Therefore, if the model is applied directly to vehicles with different noise characteristics, its performance may decline. This is the main limitation of the current dataset in terms of scale and diversity.

However, research focusing on military transport vehicles has clear practical value. These vehicles are key equipment in special scenarios, and their interior sound quality directly affects driver fatigue and operating efficiency, yet research on the sound quality of such vehicles is still lacking. The 68 noise samples collected in this study cover typical working conditions such as idle, low speed, medium speed, and high speed, which supports the reliability and representativeness of the dataset for sound quality evaluation of such vehicles. This study provides a feasible technical solution for sound quality evaluation of this specific vehicle type and helps fill the research gap in this field.

To reduce the delay of real-time evaluation and improve the model’s generalization across a wider range of vehicles, future work will focus on three aspects: (1) improving inference speed through lightweight techniques such as model pruning and knowledge distillation; (2) expanding the scale and diversity of the dataset to include the interior noise of different brands and vehicle models under further working conditions, such as highway cruising and climbing, so as to cover more comprehensive noise spectrum characteristics; and (3) introducing transfer learning, in which a model pre-trained on military vehicles is fine-tuned with small-scale civil vehicle samples to reduce the domain differences between vehicle types and enhance cross-vehicle generalization.

Footnotes

ORCID iD

Hao Ran Feng

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.


Noise sample number	R1	R2	R3	R4	…	R40
T1	2	2	3	4	…	3
T2	3	3	2	3	…	4
T3	2	2	3	3	…	6
T4	4	3	5	5	…	8
T5	2	5	4	7	…	4
T6	5	6	4	9	…	6
…	…	…	…	…	…	…
T47	6	9	8	7	…	9
T48	7	6	6	8	…	7
