Abstract
Interior sound quality holds a central position in vehicle quality evaluation. It shapes users’ perception of the vehicle and significantly influences consumers’ purchasing decisions, so assessing it accurately is crucial. Numerous researchers have developed intelligent prediction models to measure in-vehicle sound quality. The deep convolutional neural network (CNN), owing to its excellent automatic feature learning ability, has been widely applied to the processing and analysis of noise and vibration problems. However, these studies have two issues: 1) CNN performs poorly in multi-dimensional feature extraction; 2) CNN has limited adaptability when dealing with dynamic data. To overcome these problems, an in-vehicle sound quality evaluation model integrating CNN and a bidirectional long short-term memory network (Bi-LSTM) was constructed. The results show that the model achieves a maximum prediction accuracy of 96% on the training set with no significant overfitting, demonstrating the generalization ability and prediction accuracy of the new model.
Keywords
Introduction
As an indispensable means of transportation in modern society, the automobile greatly facilitates people’s daily lives. With the improvement of living standards, people’s requirements for automobile comfort are also increasing. Noise, vibration, and harshness (NVH) performance is one of the key indicators of automotive comfort,1 and in-vehicle sound quality is a core component of NVH performance evaluation. Therefore, developing an effective sound quality evaluation model can not only significantly reduce experimental cost but also simplify the sound quality evaluation process, which is of great practical significance and economic value to the automotive industry.
In recent years, researchers worldwide have developed numerous theoretical frameworks and computational models for evaluating in-vehicle sound quality. Most of them take physical acoustic indexes and psychoacoustic parameters as inputs and use multiple linear regression,2 support vector machine,3–6 radial basis function, back-propagation (BP) neural network,7,8 and wavelet neural network9–11 as frameworks to build evaluation models, which have good predictive performance. However, because such modeling requires multiple parameter inputs and the processing of the parameters relies on complex acoustic theory and empirical knowledge,12 scholars have begun to construct deep learning-based sound quality evaluation models. Deep learning-based methods take the time-frequency spectrogram of the noise as input, and deeper network architectures allow hierarchical audio features to be extracted that align better with the characteristics of human auditory perception. This approach not only captures the intrinsic complexity of acoustic signals but also emulates aspects of auditory processing, thereby improving predictive accuracy in sound quality evaluation tasks. In 2017, Gauthier et al. compared three subjective evaluation models for sound quality, namely stepwise regression, elastic net, and the Lasso algorithm. The Lasso algorithm was found to have the highest prediction accuracy, and the evaluation results can serve as a design guide for engineers in sound quality optimization.13 Ma Congjian et al. proposed a neural network-based sound quality evaluation method for the interior noise of pure electric vehicles; the average error of the prediction results was 9%, so the method can be used to predict and evaluate the sound quality of electric vehicle noise.14 In 2020, Huang Xiaorong et al. from Xihua University used time-frequency images of in-vehicle noise as inputs and extracted acoustic features with a deep convolutional neural network, which achieved good evaluation results. They also used a neuron visualization algorithm to analyze the feature maps learned by the deep CNN, revealing that its feature learning process is similar to color filtering and Gabor filtering of the noise images.15
The analysis of the above studies shows that the interior noise of automobiles contains rich information, and the sound quality evaluation method based on deep learning has good results. However, there are still the following problems to be solved: 1) Convolutional Neural Networks (CNNs) inherently excel in extracting spatial features through convolutional kernels, endowing the obtained representations with prominent localization properties. While CNNs demonstrate efficacy in processing temporal features when integrated with 1D convolutional layers (as evidenced by models such as Encodec, Wav2Vec, and WaveNet), their architecture tends to prioritize local receptive fields, which may hinder the capture of long-range temporal dependencies in complex acoustic signals. This limitation manifests not as an inherent incapability to process time-series data, but rather as a trade-off between local feature precision and global temporal context modeling. 2) Real-world acoustic datasets often exhibit dynamic characteristics (e.g., environmental noise variations under diverse operational conditions), requiring models to reconcile local feature extraction with global temporal dynamics. Although CNNs can handle certain dynamic scenarios via multi-scale convolutional designs, their reliance on static kernel operations may impede the modeling of nonlinear temporal evolutions and cross-dimensional dependencies. Notably, state-of-the-art frameworks like WaveNet have addressed this by integrating dilated convolutions to expand receptive fields, underscoring that CNN-based limitations in dynamic data processing are often mitigated through architectural innovations rather than inherent structural flaws.
To address the aforementioned challenges, this study proposes a sound quality evaluation model integrating Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks. The model’s predictive performance is validated using confusion matrix-based metrics, including precision, recall, F1-score, and overall accuracy, to quantify its classification performance across diverse acoustic scenarios. The results indicate that the proposed model can be used to evaluate the sound quality of interior noise in special vehicles.
Interior noise sample collection and subjective evaluation experiment
Interior noise sample collection
In this study, interior noise samples were collected to build the initial data set. The experiment was conducted on an empty, smooth, hard road surface. The test conditions followed GB/T 18697-2002 “Acoustic-Measurement of noise inside motor vehicles”16 and GB/T 14365-2017 “Acoustic-Measurements of sound pressure level emitted by stationary road vehicles”,17 and the acquisition equipment was an LMS data acquisition front end. The windows of the vehicle were closed during the experiment, and the air conditioning and other auxiliary equipment were turned off. Figure 1 illustrates the layout of the measurement points at the co-driver’s position.
Figure 1. Measurement points at the co-pilot’s location.
The key parameters of the test vehicle.
The model and parameters of the experimental equipment.
Distribution of test conditions.
After collection, the three sets of sound samples recorded for each working condition were played back in Testlab software, and the most stable set was selected for subsequent analysis. Finally, the noise samples were cropped to 5 seconds so that the evaluators would not lose concentration during long listening sessions.
Subjective evaluation of interior sound quality
The auditory experience of vehicle occupants serves as the ultimate gauge of cabin noise quality. Subjective evaluation experiments, grounded in human sound perception, provide a comprehensive assessment of noise samples. The experimental workflow is outlined in Figure 2. Before the experiment, the evaluators received listening training: in a quiet environment, they listened to audio that was not part of the subsequent study and rated it using annoyance as the evaluation index. Evaluators with a normal hearing level were then screened on the basis of their feedback.
Figure 2. Flow of subjective evaluation experiment.
The data generated by the rating method are simple in form, easy to process and analyze, and well suited to the subsequent establishment of a sound quality evaluation model. The rating method is therefore selected as the subjective evaluation method in this paper.
Subjective evaluation experiment
The members of the jury are all current graduate students in vehicle engineering, totaling 40, all aged 22 to 26 years old, with an average age of 24 years old, and all of them have a certain degree of understanding of automobile noise.
In this paper, the annoyance degree is used as the evaluation index, and 10 scoring levels are set up, with the distribution of scores increasing from 1 to 10, in which 1 to 4 indicates very dissatisfied, 5 indicates slightly dissatisfied, 6-7 indicates basically satisfied, 8 indicates very satisfied, and 9-10 indicates totally satisfied.
To give the jury members an overall understanding of the noise samples and to avoid inconsistent evaluation criteria between the beginning and end of the test, listening training was conducted before the formal test. A subset of the noise samples was shuffled into random order, and the jury members listened to and scored them in turn as a trial.
Table of partial evaluation scores.
Data correlation analysis
Because the subjective evaluation test for interior noise is strongly subjective, the ratings must be checked further after they are obtained. In this paper, the Pearson correlation coefficient is used to test the correlation among the jury members’ evaluation results. Its range is [-1, 1]; a negative value indicates negative correlation, and the closer the absolute value is to 1, the stronger the correlation.
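A minimal sketch of such a consistency check is given below. The source does not state how each coefficient is formed, so correlating every juror with the mean rating of the remaining jurors is an assumption made here, as are the function names, the example ratings, and the 0.7 screening threshold.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equally long score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def juror_consistency(scores):
    """scores has shape (n_jurors, n_samples).

    Returns, for each juror, the correlation between that juror's ratings
    and the mean rating of all other jurors (an assumed procedure).
    """
    scores = np.asarray(scores, dtype=float)
    result = []
    for j in range(scores.shape[0]):
        others_mean = np.delete(scores, j, axis=0).mean(axis=0)
        result.append(pearson(scores[j], others_mean))
    return np.array(result)

# Hypothetical ratings: 4 jurors x 6 noise samples (not data from the study).
ratings = np.array([
    [3, 5, 6, 8, 4, 7],
    [4, 5, 7, 9, 4, 6],
    [3, 6, 6, 8, 5, 7],
    [8, 4, 5, 3, 7, 4],  # a juror who disagrees with the rest of the panel
])
r = juror_consistency(ratings)
keep = r >= 0.7  # hypothetical screening threshold
print(np.round(r, 3), keep)
```

Jurors whose coefficient falls below the chosen threshold would be candidates for removal before the mean scores are computed.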
Correlation coefficients for evaluators.
The arithmetic mean of the ratings for each condition was used as the final score, and each condition was assigned an evaluation level according to its annoyance score, giving the distribution of evaluation levels shown in Figure 3.
Figure 3. Distribution of evaluation levels.
As seen from the distribution graph, the number of “very dissatisfied” and “slightly dissatisfied” samples totaled 26, accounting for 54% of the overall samples. These classification levels can be used as the labeling basis for supervised learning.
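As an illustration of how mean scores could be turned into class labels, the sketch below maps a mean rating to one of the five levels of the 10-point scale described earlier. The source does not say how non-integer means are assigned to levels, so rounding half up before applying the level boundaries is an assumption, and the names are hypothetical.

```python
import math

LEVELS = [
    "very dissatisfied",      # scores 1 to 4
    "slightly dissatisfied",  # score 5
    "basically satisfied",    # scores 6 to 7
    "very satisfied",         # score 8
    "totally satisfied",      # scores 9 to 10
]

def level_from_score(mean_score):
    """Return the class index (0 to 4) for a mean jury score on the 1-10 scale."""
    s = math.floor(mean_score + 0.5)  # round half up (assumed rule)
    s = min(max(s, 1), 10)            # clamp to the valid score range
    if s <= 4:
        return 0
    if s == 5:
        return 1
    if s <= 7:
        return 2
    if s == 8:
        return 3
    return 4

label = level_from_score(6.3)
print(label, LEVELS[label])
```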
Deep learning based evaluation model for in-vehicle sound quality
Audio data preprocessing
The noise audio data are preprocessed to convert the one-dimensional signal into a higher-dimensional representation, which enhances the features and facilitates deep feature extraction by the CNN. In this paper, the preprocessing combines the logarithmic Mel spectrum with time-frequency masking.
Logarithmic Mel spectrum
Studies have shown that people can easily distinguish between 500 and 1000 Hz sounds but have difficulty distinguishing between 7500 and 8000 Hz.18–20 The Mel scale was therefore proposed. It is approximately linear in frequency in the low frequency band and approximately logarithmic in the high frequency band, so that equal distances on the Mel scale correspond to roughly equal perceived pitch differences.
Using the relationship between the Mel scale and ordinary frequency, a Mel filter bank is designed, and applying this filter bank to the signal spectrum yields the log-Mel spectrogram. The procedure is shown in Figure 4.
Figure 4. Extraction process of logarithmic Mel spectrum.
Figure 5 illustrates a logarithmic Mel spectrum, where the horizontal axis is time, the vertical axis is frequency, and the color of each point indicates the amplitude of that frequency at that time. This kind of image provides both time and frequency information about the noise signal for deep learning feature extraction, which helps to better simulate the auditory sensation of the human ear. Considering the training requirements, we segmented the audio data into 1-second segments and generated color images of 812 noise samples by batch processing.
Figure 5. Logarithmic Mel spectrum under a certain working condition.
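The extraction flow can be sketched in NumPy as framing, windowing, FFT, Mel filtering, and taking the logarithm. This is a sketch under stated assumptions rather than the processing chain actually used: the Hz-to-Mel formula is one common convention, and the sampling rate (16 kHz), FFT length (1024), and hop size (160 samples) are hypothetical values chosen so that a 1-second segment yields an 80 × 100 image.

```python
import numpy as np

def hz_to_mel(f):
    # A common Hz-to-Mel convention; the paper does not state which one it uses.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=1024, hop=160, n_mels=80):
    """Return a log-Mel spectrogram in dB with shape (n_mels, n_frames)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // hop
    padded = np.concatenate([signal, np.zeros(n_fft)])  # zero-pad the tail
    window = np.hanning(n_fft)
    frames = np.stack([padded[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # (n_frames, n_fft//2 + 1)
    mel_power = power @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)
    return 10.0 * np.log10(mel_power + 1e-10).T              # small offset avoids log(0)

# Hypothetical 1-second test tone at 1000 Hz standing in for a noise segment.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = log_mel_spectrogram(tone, sr=sr)
print(spec.shape)
```

With these assumed settings the output has 80 Mel bands and 100 frames, matching the image dimension used for time-frequency masking.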
Time-frequency masking
Time-frequency masking is a data augmentation method realized by zeroing the pixel values in some regions of the time-frequency image. It increases the sample size and deliberately corrupts part of the original data, thereby reducing model overfitting as much as possible. The time-frequency image dimension is 80 (frequency) × 100 (time). In the frequency domain, the maximum mask width F is 20, the starting position f is uniformly sampled from the interval [0, 80 − F], and the mask length f-len is randomly selected from the interval [1, F]. In the time domain, the maximum mask length T is 25, the starting position t is uniformly sampled from the interval [0, 100 − T], and the mask length t-len is randomly selected from the interval [1, T].21
Figure 6 shows the spectrogram after time-frequency masking, in which the specified frequency and time intervals of the time-frequency image are zeroed. With this augmentation, the dataset is expanded to 2436 samples.
Figure 6. Time-frequency masked image under a certain working condition.
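The sampling rule described above can be sketched as follows. The function name and the random input are hypothetical, and generating two masked copies per image (one way to arrive at a threefold expansion from 812 to 2436 samples) is an assumption, since the source does not state how many copies were produced.

```python
import numpy as np

def time_freq_mask(spec, F=20, T=25, rng=None):
    """Zero one frequency band and one time band of a (freq, time) image."""
    rng = np.random.default_rng() if rng is None else rng
    n_freq, n_time = spec.shape
    out = spec.copy()                       # leave the original image untouched
    f0 = rng.integers(0, n_freq - F + 1)    # start f in [0, n_freq - F]
    f_len = rng.integers(1, F + 1)          # length f-len in [1, F]
    t0 = rng.integers(0, n_time - T + 1)    # start t in [0, n_time - T]
    t_len = rng.integers(1, T + 1)          # length t-len in [1, T]
    out[f0:f0 + f_len, :] = 0.0             # frequency mask
    out[:, t0:t0 + t_len] = 0.0             # time mask
    return out

rng = np.random.default_rng(0)
image = rng.random((80, 100)) + 0.1         # hypothetical 80 x 100 spectrogram
augmented = [image] + [time_freq_mask(image, rng=rng) for _ in range(2)]
print(len(augmented), augmented[1].shape)
```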
The network structure of CNN and BiLSTM
Convolutional neural networks (CNNs), a class of feed-forward neural networks, are extensively applied in image recognition, audio classification, and target detection. 22 Their unidirectional architecture is adept at processing static data, making them well-suited for classification tasks. CNNs’ artificial neurons partially respond to local inputs, excelling in image processing.
As an improved version of the recurrent neural network (RNN), BiLSTM introduces two-way information flow and several gates. Control units such as the forget gate, input gate, and output gate are used to model the forward and backward long-term dependencies of time-series data, and the architecture performs well in natural language processing, speech recognition, and other sequence tasks. The core difference between the two is that the CNN focuses on the hierarchical extraction of spatial local features, which suits structured data with translation invariance, whereas BiLSTM models the forward and backward temporal dependencies of a sequence.
Convolutional layer
The 2D convolution generates the output feature map by sliding the convolution kernel over the input image and performing a weighted sum over each local region. At each location, the convolution kernel is elementwise multiplied with the corresponding portion of the input image and summed to obtain a pixel value for the output feature map. Adjusting the convolution kernel parameters, padding and step size can control the size of the feature map and how it is extracted. Multiple convolutional kernels capture features at different scales to enhance network representation. Eventually, the feature map is passed to the next layer of processing. During the operation of convolution, the dimension of the convolutional layer is the same as that of the input layer, and the general representation of its spatial dimensions is
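A small NumPy sketch of the sliding-window operation and of the usual relation between input size, kernel size, padding, and stride is given below. The relation shown, floor((n + 2p − k) / s) + 1, is the standard one for convolutional layers and is assumed to be what the text refers to; the function names are hypothetical.

```python
import numpy as np

def conv_output_size(n_in, kernel, padding=0, stride=1):
    """Standard spatial output size of a convolution along one dimension."""
    return (n_in + 2 * padding - kernel) // stride + 1

def conv2d(image, kernel, padding=0, stride=1):
    """Single-channel 2D convolution as used in CNNs (no kernel flip)."""
    image = np.pad(np.asarray(image, dtype=float), padding)  # zero padding
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise product, then sum
    return out

# A 3 x 3 kernel with padding 1 and stride 1 keeps an 80 x 100 input at 80 x 100.
feature_map = conv2d(np.ones((80, 100)), np.ones((3, 3)) / 9.0, padding=1)
print(feature_map.shape, conv_output_size(80, 3, padding=1, stride=1))
```

With padding 1 and stride 1 the output keeps the input size, which is consistent with the statement that the convolutional layer has the same dimension as the input layer.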
Activation function and maximum pooling
After the convolutional layers, activation functions are typically applied. These nonlinear functions map the convolutional neural network’s computations from input to output space, enabling the network to capture complex nonlinearities and enhancing model generalization. Typical activation functions include Tanh, ReLU, and Sigmoid. Sigmoid and Tanh can suffer from vanishing gradients near saturation, whereas ReLU mitigates this issue. Consequently, this study exclusively employs ReLU as the activation function. Pooling operations perform downsampling and reduce the number of features in the model. Common pooling techniques include average, global, dynamic, and maximum pooling. Maximum pooling reduces feature map resolution and computational demands by keeping the maximum value within each pooling window. To extract the features of the image effectively, maximum pooling is selected for all pooling operations in this paper, and its mathematical expression is as follows:
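In words, each output value of maximum pooling is the largest input value inside the corresponding window. A minimal NumPy sketch of ReLU followed by maximum pooling is shown below; the 2 × 2 window with stride 2 is an assumed setting, since the pooling size used in the model is not stated here.

```python
import numpy as np

def relu(x):
    """ReLU activation: negative values are set to zero."""
    return np.maximum(np.asarray(x, dtype=float), 0.0)

def max_pool2d(x, size=2, stride=2):
    """Maximum pooling over a single-channel 2D feature map."""
    x = np.asarray(x, dtype=float)
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()  # keep the largest value in the window
    return out

feature_map = np.array([
    [1.0, -2.0, 3.0, 0.0],
    [4.0, 5.0, -6.0, 7.0],
    [0.0, 1.0, 2.0, 3.0],
    [-1.0, -2.0, -3.0, -4.0],
])
pooled = max_pool2d(relu(feature_map))
print(pooled)
```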
BiLSTM layer
The Long Short-Term Memory (LSTM) network is an advanced variant of the Recurrent Neural Network (RNN). Structurally akin to RNNs, LSTM introduces a long-term memory module within its hidden layers, enabling the retention of early time series features and mitigating the gradient vanishing issue inherent in RNNs.
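The LSTM cell contains three special gates: the input gate, the forget gate, and the output gate. The information transfer process of the LSTM cell at moment t is shown in Figure 7.
Figure 7. Basic structure of the LSTM cell.
For a given time step t, with input $x_t$, previous hidden state $h_{t-1}$, and previous cell state $c_{t-1}$, the standard LSTM update is as follows, where $\sigma$ is the Sigmoid function, $\odot$ denotes element-wise multiplication, and $W$ and $b$ are the weights and biases of the corresponding gate.
Forget gate:
$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right)$$
Input gate:
$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right)$$
Candidate cell state:
$$\tilde{c}_t = \tanh\left(W_c [h_{t-1}, x_t] + b_c\right)$$
Cell state update:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
Output gate:
$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right)$$
Hidden state:
$$h_t = o_t \odot \tanh(c_t)$$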
The BiLSTM architecture improves model performance through a two-way processing mechanism. The structure contains two independent LSTM layers: the forward layer processes the input in chronological order, and the reverse layer parses the data in reverse order. This two-way processing strategy enables the model to capture the forward and backward correlation features of the sequence data at the same time: the forward layer analyzes the sequence step by step from its start, while the reverse layer works backward from its end.
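The two-way mechanism can be sketched in NumPy using the standard LSTM cell update. This is an illustrative sketch only: the stacked weight layout, the gate order, the random weights, and the sizes D, H, and T are all assumptions rather than the configuration of the model in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One standard LSTM step. W has shape (4H, H + D); gate order f, i, o, g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate
    i = sigmoid(z[H:2 * H])        # input gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell state
    c = f * c_prev + i * g         # cell state update
    h = o * np.tanh(c)             # hidden state
    return h, c

def lstm_forward(xs, W, b, H):
    """Run an LSTM over xs with shape (T, D); return hidden states (T, H)."""
    h, c = np.zeros(H), np.zeros(H)
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        hs.append(h)
    return np.stack(hs)

def bilstm(xs, params_fwd, params_bwd, H):
    """Concatenate forward-in-time and backward-in-time hidden states."""
    W_f, b_f = params_fwd
    W_b, b_b = params_bwd
    fwd = lstm_forward(xs, W_f, b_f, H)
    bwd = lstm_forward(xs[::-1], W_b, b_b, H)[::-1]  # re-align to original order
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
D, H, T = 16, 8, 100  # hypothetical feature size, hidden size, sequence length
params_fwd = (rng.normal(0.0, 0.1, (4 * H, H + D)), np.zeros(4 * H))
params_bwd = (rng.normal(0.0, 0.1, (4 * H, H + D)), np.zeros(4 * H))
xs = rng.normal(size=(T, D))
out = bilstm(xs, params_fwd, params_bwd, H)
print(out.shape)
```

At each time step the first H outputs depend only on earlier inputs and the last H outputs only on later inputs, which is what allows the model to use context from both directions.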
Classifier
Following convolution and pooling, the resultant 3D array is flattened into a 1D array for input into the BiLSTM network, which extracts spectrogram features in depth. These features are then classified by a classifier comprising fully connected layers and a Softmax function, as depicted in Figure 8. The classifier’s output is the probability value for each sound quality evaluation level.
Figure 8. The structure diagram of the classifier.
The fully connected layer facilitates complete data propagation by interlinking the outputs of the final pooling layer with subsequent neurons. It learns the nonlinear mapping from input to output, aligning results with sample label space. For enhanced adaptability and generalization, this study employs two fully connected layers in the model architecture.
Using the Softmax function, the real vector output of the fully connected layer can be converted into a probability distribution so that the value range of each type of sound quality evaluation level is in [0,1], and its sum is 1. The specific function expression is as follows:
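In its usual form, and writing $z_i$ for the i-th output of the last fully connected layer (the symbols $z_i$ and $p_i$ are notation assumed here, not taken from the original), the Softmax probability assigned to evaluation level i among N levels can be written as:

```latex
p_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \qquad i = 1, \dots, N
```

Each $p_i$ lies in [0, 1] and the N values sum to 1, which matches the properties stated above.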
The sound quality rating is divided into five categories, so here N = 5. The data set is constructed by converting the audio to logarithmic Mel spectra and augmenting them with time-frequency masking. Based on the operations introduced above, this study constructs a hybrid model of two-dimensional convolution and BiLSTM. Through the Softmax function, the network output is mapped to a probability distribution over the sound quality levels. Figure 9 shows the structure of the composite model.
Figure 9. CNN-BiLSTM hybrid network structure.
Loss function
Loss functions measure the gap between the model’s predictions and the actual results and guide the adjustment of the weights during training to improve prediction accuracy. Common loss functions include mean square error, cross entropy, and logarithmic loss functions. In multi-class classification tasks, the cross-entropy loss function is often chosen to measure the difference between the predicted probability distribution and the true distribution over the categories, and its mathematical representation is as follows:
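For one sample with one-hot true label y and predicted probabilities p over N classes, the usual cross-entropy is the negative sum of y_i log p_i, which reduces to the negative log-probability of the true class. A NumPy sketch under the assumptions that the loss is averaged over the batch and that labels are integer class indices:

```python
import numpy as np

def softmax(logits):
    """Row-wise Softmax; subtracting the row maximum avoids overflow."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy for logits (batch, N) and integer labels (batch,)."""
    p = softmax(np.asarray(logits, dtype=float))
    labels = np.asarray(labels)
    picked = p[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.mean(np.log(picked + 1e-12)))

# Hypothetical network outputs for two samples and N = 5 quality levels.
logits = np.array([[2.0, 0.5, 0.1, -1.0, -2.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 3])
loss = cross_entropy(logits, labels)
print(round(loss, 4))
```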
Optimizer
In backpropagation, an optimizer minimizes the loss function by updating the neuron weights. Common optimizers include the gradient descent method, the momentum optimization method, and the adaptive learning rate method. Although the gradient descent method is a simple algorithm, it is very sensitive to the learning rate, so in this paper, we choose to use the adaptive learning rate method to update the network weights, and the specific algorithm is the Adam algorithm.
The Adam algorithm manages the learning rate and gradient direction through first- and second-order moment estimates, offering interpretable hyperparameters and stable training. Developed by Kingma and Ba (2015),23 Adam computes first-order and second-order moments of the gradient to scale the updates, with bias correction for the initial estimates.
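The update rule can be sketched in NumPy as below. The default values of beta1, beta2, and eps are those commonly used with Adam and are assumptions here, as is the toy quadratic objective; the learning rate actually used for the model is not restated.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimise f(theta) = sum((theta - 3)^2), whose gradient is 2 (theta - 3).
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)
```

Because the update is scaled by the ratio of the two moment estimates, the first step has a magnitude close to the learning rate regardless of the gradient scale.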
Stochastic inactivation and L2 regularization
To reduce the risk of overfitting, a Dropout layer is added after the two fully connected layers. Dropout assigns a retention probability to each neuron in a connected layer, so each neuron may be randomly ‘inactivated.’ An inactivated neuron no longer takes part in forward propagation or back propagation for that training step. This reduces the interdependence of the model parameters and enhances the generalization ability of the model. At the same time, random inactivation can reduce the complexity of the model and improve computational efficiency.
In the deep layers of a network, the Dropout value generally does not exceed 0.5. Verification on a small amount of data showed that the model in this paper generalizes better when Dropout equals 0.4, so the Dropout coefficient is set to 0.4.
Beyond stochastic deactivation, regularization is a key strategy to mitigate overfitting. It enhances model robustness by incorporating parameter-related terms into the loss function. L1 and L2 are prevalent methods; L1 induces sparsity, while L2 amplifies the penalty on larger parameter values by squaring them. This study employs L2 regularization with a coefficient of 0.0001.
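Both regularizers can be sketched in NumPy as follows. Two points are assumptions: the value 0.4 is read as the probability of dropping a neuron (the text also mentions a retention probability, so the opposite reading is possible), and the L2 term is taken as the coefficient times the sum of squared weights without a factor of one half.

```python
import numpy as np

def dropout(x, drop_prob=0.4, training=True, rng=None):
    """Inverted dropout: zero units with probability drop_prob and rescale the rest."""
    if not training or drop_prob == 0.0:
        return x                              # no change at inference time
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= drop_prob   # True for retained units
    return x * keep / (1.0 - drop_prob)       # rescale to keep the expected value

def l2_penalty(weights, coeff=1e-4):
    """L2 regularization term added to the loss: coeff * sum of squared weights."""
    return coeff * sum(float(np.sum(w ** 2)) for w in weights)

rng = np.random.default_rng(0)
activations = np.ones(10000)                  # hypothetical layer output
dropped = dropout(activations, drop_prob=0.4, training=True, rng=rng)
penalty = l2_penalty([np.array([1.0, 2.0]), np.array([[3.0]])])
print(np.mean(dropped == 0.0), penalty)
```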
Model training and validation
Selection of hyperparameters.
Figure 10 compares the accuracy of the CNN-BiLSTM model on the training and testing sets. Throughout training, the accuracy on the training set is higher than that on the testing set, but the gap is only 2.6 percentage points, indicating that the model shows no obvious overfitting. The best accuracy on the testing set is 93.4%. Figure 11 presents the loss values of the CNN-BiLSTM model on the training and testing sets, both of which remain low after 600 rounds.
Figure 10. Accuracy curves for training and testing sets (CNN-BiLSTM).
Figure 11. Loss value curves for training and testing sets (CNN-BiLSTM).

Figure 12 compares the accuracy of the CNN model on the training and testing sets; the best accuracy on the testing set is 89.9%. Figure 13 shows the loss values of the CNN model on the training and testing sets. After 300 rounds, both remain at a low value.
Figure 12. Accuracy curves for training and testing sets (CNN).
Figure 13. Loss value curves for training and testing sets (CNN).

Figure 14 shows the accuracy of the CNN-Attention model with the SENet (Squeeze-and-Excitation Networks) mechanism on the training and testing sets, where the best accuracy on the testing set reaches 91.1%. Figure 15 illustrates the loss values for the CNN-Attention model’s training and testing sets. After 250 training epochs, both stabilize at low levels.
Figure 14. Accuracy curves for training and testing sets (CNN-Attention).
Figure 15. Loss value curves for training and testing sets (CNN-Attention).

In summary, the highest accuracy of the CNN-BiLSTM model on the testing set is 3.5 and 2.3 percentage points higher than that of the CNN model and the CNN-Attention model, respectively. Its predictive performance therefore exceeds that of the other two.
To assess the accuracy of the CNN-BiLSTM model, 200 log-Mel spectrograms, 40 randomly selected from each of the five sound quality classes, constituted the validation set. Predicted classes were obtained by applying PyTorch’s argmax function to the model outputs with gradient computation disabled.
The confusion matrix, alternatively termed a likelihood or error matrix, is a pivotal tool in machine learning and statistics for assessing classification model performance. It elucidates model classification efficacy across categories, highlighting both correct and incorrect classifications. This study employs the confusion matrix to elucidate model validation outcomes.
Figure 16 depicts the confusion matrix for model evaluation. True labels are shown in rows and predicted labels in columns. Diagonal numbers and percentages indicate correct predictions and accuracy, while off-diagonal values denote misclassifications and bias.
Figure 16. Confusion matrix for the evaluation model.
Precision, recall, and F1-scores for each class.
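For reference, the per-class metrics can be derived from a confusion matrix with true labels in rows and predicted labels in columns, as sketched below in NumPy. The labels in the example are hypothetical and use three classes for brevity; they are not the validation results of this study.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=5):
    """Count samples by (true label, predicted label)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true labels, columns: predicted labels
    return cm

def per_class_metrics(cm):
    """Return per-class precision, recall, F1-score, and overall accuracy."""
    tp = np.diag(cm).astype(float)
    pred_total = cm.sum(axis=0)  # column sums: samples predicted as each class
    true_total = cm.sum(axis=1)  # row sums: samples truly in each class
    precision = np.divide(tp, pred_total, out=np.zeros_like(tp), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2.0 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Hypothetical labels; predicted classes would come from an argmax over model outputs.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
precision, recall, f1, accuracy = per_class_metrics(cm)
print(cm)
print(np.round(precision, 3), np.round(recall, 3), np.round(f1, 3), round(accuracy, 3))
```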
Conclusion
In this study, to consider the spatial and temporal characteristics of noise simultaneously, we built a CNN model incorporating BiLSTM; the introduction of BiLSTM enhances the model’s adaptability to dynamic changes in sound. We collected 68 in-vehicle noise samples in accordance with the requirements of the national standard at a site that meets the test conditions. After preprocessing, we obtained samples suitable for subjective evaluation and organized a jury to conduct subjective evaluation experiments on the in-vehicle noise. After removing the data with poor correlation, the distribution of subjective evaluation ratings was obtained. In addition, the audio data were converted into images by the log-Mel spectral transformation, giving feature maps that contain information in both the time and frequency domains, and the dataset required for the model was obtained using data augmentation. The results show that the model achieves 93.4% prediction accuracy on the testing set, 3.5 percentage points higher than the CNN model, demonstrating the usability and accuracy of the sound quality evaluation model that fuses CNN and BiLSTM.
Limitations and future work
Although the model performs well, some limitations remain: the training data come only from a specific vehicle type, so coverage is limited, and the generalization ability to commercial and passenger vehicles has not been fully verified. In addition, the computational complexity of the fusion model is high, and response delay may occur in real-time evaluation scenarios.
The data set of this study contains only interior noise samples of a specific type of military transport vehicle, which may limit the generalization performance of the model across a wider range of vehicle types. Military transport vehicles are unique in structural design and typical working conditions, and their noise source distribution and spectrum characteristics differ from those of other vehicles. Therefore, if the model is applied directly to vehicle types with different noise characteristics, its performance may decline. This is the main limitation of the current data set in terms of scale and diversity.
However, research focusing on military transport vehicles has clear practical value. For such key equipment in special scenarios, interior sound quality directly affects the driver’s fatigue and operating efficiency, yet research on the sound quality of these vehicles is still lacking. The 68 sets of noise samples collected in this study cover typical working conditions such as idle speed, low speed, medium speed, and high speed, which supports the reliability and representativeness of the data set for sound quality evaluation of such vehicles. This study provides a feasible technical solution for the sound quality evaluation of this specific vehicle type and helps fill the research gap in this field.
To reduce the delay of real-time evaluation and improve the generalization ability of the model across a wider range of vehicle scenarios, future work will focus on the following three aspects: (1) improving inference speed through lightweight techniques such as model pruning and knowledge distillation; (2) expanding the scale and diversity of the data set to include the interior noise of different brands and vehicle models under various working conditions, such as highway cruising and climbing, so as to cover more comprehensive noise spectrum characteristics; and (3) introducing transfer learning, in which the model pre-trained on military vehicles is fine-tuned with small-scale civil vehicle samples to reduce the domain differences between vehicle types and enhance cross-vehicle generalization.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
