Abstract
The monitoring of large-scale civil infrastructure, such as railways and tunnels, using distributed sensing systems, for example, fiber-optic sensing, is cost-effective and hence highly promising. However, compared to conventional spot sensors, distributed sensing systems still face some challenges, for example, measurement accuracy and physical interpretability. In this study, we employ a distributed acoustic sensing (DAS) system to detect rail fastener failures and propose an interpretable cross-modal transfer learning approach to address these challenges. This method enables the fusion of acceleration and DAS signals within a deep learning framework, leveraging interpretability techniques to infer the rationale behind model predictions. Additionally, a field test focusing on rail fastener failures was conducted, during which the dynamic responses of the rail under various fastener failure conditions were independently measured using accelerometers and DAS. The proposed method was applied and validated using field test data. Results demonstrate that the deep learning model trained with the proposed cross-modal transfer learning method achieves superior accuracy compared to traditional methods using DAS data alone. Furthermore, interpretable features were successfully extracted in the proposed model for follow-up predictions, underscoring its enhanced generalization capability and interpretability for rail track monitoring. These results highlight not only the method’s effectiveness for rail monitoring but also its potential applicability to other civil structures requiring integrated distributed and point sensing.
Keywords
Introduction
Railway infrastructure, for example, railway tracks, is critical for modern societal and economic development, owing to its efficiency, safety, and environmental friendliness. However, railway infrastructure gradually deteriorates due to prolonged use and external environmental factors.1,2 Simultaneously, the intensifying effects of climate change, such as extreme storms and flooding, pose new challenges to the structural integrity and safety of railway tracks.3–6 Accordingly, ensuring the health and safety of railway tracks has become increasingly vital. Conventional track monitoring methods primarily depend on manual inspections, which are both labor-intensive and costly. The adoption of advanced technologies and intelligent maintenance solutions for automated real-time track monitoring remains an unmet need for both the industrial and academic communities of railway engineering.7–10 In particular, the failure of rail fasteners is considered one of the most common faults in railway infrastructure during operation, typically manifested as fastener loosening or breakage.11,12 When a fastener fails, it provides insufficient downward pressure on the rail, compromising the rail position and stability. Such abnormalities not only accelerate the wear of other components within the track system but also affect train operational smoothness, increasing maintenance costs and safety risks.13–15 More critically, in extreme cases, fastener failure can result in severe rail displacement or even fracture, potentially leading to derailments and other major operational accidents.16,17 Therefore, monitoring and maintaining the condition of rail fasteners is of paramount importance for ensuring the safe operation of railway systems.
In recent years, various efforts have been made to address these challenges, with several sensing and detection technologies being extensively explored for monitoring fastening failures. Among these, image processing technologies have attracted significant attention.18–20 For instance, Bai et al. 21 developed a convolutional neural network (CNN) model trained on image data of fasteners under diverse conditions, enabling the deep learning model to effectively identify fastening failures. However, despite the promising results in controlled environments, image-based technologies face several challenges in practical engineering applications. 22 External factors, such as variations in weather conditions, fluctuations in lighting intensity, surface contamination of the ballast, and dynamic interference caused by high-speed train operations, can adversely affect image quality.23,24 These factors may significantly reduce the accuracy of deep learning models in real-world scenarios, posing substantial limitations to their deployment in railway monitoring systems.
In addition, several researchers have studied the detection of fastener failure based on vibration responses, whose primary advantage lies in directly detecting changes in the mechanical properties of the rail.25,26 However, current studies are largely confined to laboratory settings. For instance, Wang et al. 27 analyzed vibration response variations in fasteners under different conditions through hammering tests, highlighting changes linked to fastener failure. While accelerometers offer high precision in vibration detection, their single-point measurement capability limits their applicability to real-world railway networks, particularly those extending over hundreds of kilometers. In recent years, various distributed sensing systems have been proposed for structural health monitoring.28–30 These systems operate based on the propagation characteristics of laser light within optical fibers, enabling the monitoring of environmental factors along the fiber path. For example, the distributed acoustic sensing (DAS) system can detect acoustic and vibration signals through distributed fiber-optic sensors.31–33 This makes the application of DAS technology in long-distance railway track health monitoring a promising concept. However, signals obtained from DAS systems often exhibit limited physical interpretability and high complexity, posing significant challenges in extracting meaningful information about structural health conditions.34–36 Although advanced deep learning methods can be used for training and prediction on these data, the internal mechanisms of these models are opaque, and their prediction results often lack sufficient interpretability, which limits their widespread adoption in engineering fields. 37
To address the limitations of current monitoring technologies, this study applies DAS technology to railway track health monitoring and proposes a rail fastener failure detection method based on an interpretable deep learning model. The method utilizes DAS technology and optimizes the deep learning model by integrating rail acceleration signals, thereby enhancing its generalization ability. Specifically, a cross-modal transfer learning approach is proposed, which extracts the mechanical characteristic information of the rail embedded in the acceleration data and transfers this knowledge to the deep learning model based on distributed sensing, achieving effective fusion of multimodal data. In addition, an interpretable deep learning framework is developed to explain the basis of the model’s detections. Finally, field experiments are conducted to validate the effectiveness and feasibility of the proposed method.
The rest of the article is organized as follows: the second section provides a detailed description of the research methodology, including the mechanical principles, the proposed deep learning approach, and the corresponding interpretability techniques. The third section presents a field experiment on rail fastener failures and corresponding results. The fourth section demonstrates the performance of the proposed method on experimental field data, including the accuracy of model detection and the rationale behind model predictions derived from interpretability techniques. Finally, the fifth section concludes the research and presents an outlook for future work.
Methodology
In this section, the mechanical basis for detecting the failure of rail fastener clips is introduced in “Mechanical foundation for rail fastener clip failure detection” section. “Deep learning network architecture” section presents the deep learning model used in this study, including its overall structure and details of each module. “Cross-modal transfer learning” section describes the proposed cross-modal transfer learning method, covering the model training process, the composition of the loss function, and its working principles. Finally, “Interpretable mechanisms” section introduces the principles and implementation process of the interpretability techniques used in this study.
Mechanical foundation for rail fastener clip failure detection
Figure 1 illustrates the fastening system and its mechanical model. As shown in Figure 1(a), a typical fastening system mainly consists of the clips, screw spikes, and pads. Among these, the fastening pads provide elastic support for the rail, while the screw spikes secure the baseplate and the pad firmly to the sleeper. The fastening clip applies downward pressure on the rail, ensuring close contact between the rail and the fastening pad. Two mechanical models 38 for the fastening system are shown in Figure 1(b) and (c). In the solid-contact model, the fastening pad is modeled as an elastic solid, while the mechanical effect of the fastening clip on the rail is represented by a spring-damper element. The spring-damper model, by contrast, is a simpler approach and the most commonly used model in track dynamic analysis. It is widely applied in engineering practice, such as track design and modeling. In this model, the effects of the clip and pad are integrated and simplified into a single spring-damper element, where the elastic support provided to the rail is governed by the stiffness k.

Figure 1. Fastening system and its mechanical model schematic diagrams: (a) typical fastening system, (b) solid-contact model for the fastening system, and (c) spring-damper model for the fastening system.
The primary role of the fastening clip in the mechanical system is thus to ensure close contact between the rail and the fastening system, while its own dynamic characteristics are not significant in track dynamic analysis. However, when the fastening clip becomes loose, the contact between the rail and the pad transitions to partial contact or even no contact. This can be represented in the spring-damper model as a reduction in the spring stiffness k, which in turn lowers the natural frequency of the rail system. Such changes are reflected in the free vibration response of the rail. Moreover, if the number of loose fasteners near a section of rail increases, the natural frequency of the rail will change further. Therefore, this study aims to detect loose fasteners in the rail system by monitoring the free vibration response of the rail after dynamic excitation.
DAS is an advanced sensing technology based on optical fiber, which measures external vibrations by analyzing the phase change of backscattered light in the fiber. 39 The basic principle is to use the optical fiber as the sensing medium: when the rail vibrates, the vibrations cause small local deformations of the optical fiber, which change the phase of the scattered light in the fiber. By demodulating the phase change, the DAS system can obtain the vibration response signal of the rail.40,41 Therefore, this article proposes to use the DAS system to measure the free vibration response of rails under different fastener conditions and thereby detect rail fastener failures. Specifically, a fastener failure index (FFI) is defined in this study, representing the number of loose fasteners within a certain range of the rail. When this index is 0, it indicates that all fasteners on the rail are securely tightened.
Deep learning network architecture
The proposed deep learning model aims to predict the rail FFI based on the measured vibration response of the rail. Since the FFI is treated as a continuous variable, the task is formulated as a regression problem to support model interpretability. This formulation enables the use of interpretation techniques such as LIME to identify which parts of the input signal contribute positively or negatively to the predicted FFI, offering deeper insights into the model’s decision-making process. The architecture of the deep learning network model proposed in this study is shown in Figure 2. It consists of a signal input layer, CNN blocks, transformer blocks, and an output layer. Specifically, the signal input layer can accept multichannel signals, such as multidirectional acceleration signals or DAS signals from multiple sensing points (SPs). Since this study uses the free vibration response of the rail to detect fastener failure, and the amplitude of the time-domain response typically varies with the excitation source, the frequency spectrum sequence of the free vibration response is considered more suitable as the model input.

Figure 2. The framework of the deep learning model used for identifying fastener failure.
The CNN architecture has been widely adopted in various signal processing and pattern recognition tasks due to its effectiveness in hierarchical feature extraction.42–44 In this research, the CNN blocks are used to reduce the dimensionality and compress the input raw signals, extracting key features from the signals to alleviate the computational complexity of subsequent network layers. The CNN blocks mainly consist of convolutional layers and pooling layers, which can be computed as follows45,46:
where
The transformer blocks are used to capture the global dependencies of the signal. By leveraging the self-attention mechanism, they can dynamically assign weights to different sections of the signal. Originally introduced for sequence modeling in natural language processing, the transformer architecture has since demonstrated strong capability in learning long-range relationships across various domains.47–49 Specifically, each transformer module consists of a multihead self-attention layer and a fully connected feed-forward layer.50–52 Each layer is followed by a residual connection and a normalization layer. The input to the attention function consists of queries and keys with dimension $d_k$ and values with dimension $d_v$. To enable parallel computation, a set of queries and keys is packed into matrices Q and K, respectively. The corresponding values are also packed into matrix V. Then, the attention function A is computed as follows:
$$A(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
The multihead attention mechanism splits the attention into h heads, with each head calculated independently. After computing the attention for each head, their outputs are concatenated and passed through a final linear transformation to obtain the output. The multihead attention function M is defined as follows:
$$M(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
where $\mathrm{head}_i = A(QW_i^{Q}, KW_i^{K}, VW_i^{V})$, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable projection matrices.
After the feature extraction of the raw signals by the CNN and transformer blocks in the encoder, a fully connected layer is used for information integration and to output the prediction result, which is the FFI.
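To make the attention computation concrete, the following is a minimal NumPy sketch of scaled dot-product attention and a multihead self-attention layer. It is an illustration under stated assumptions rather than the implementation used in this study: the function names, the random projection matrices, and the sequence length of 10 are hypothetical, while the model dimension of 64 and the four heads follow the configuration reported later in the article.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


def multihead_attention(X, W_q, W_k, W_v, W_o, h):
    """Self-attention over X of shape (T, d_model), split into h heads."""
    T, d_model = X.shape
    d_head = d_model // h

    def split(M):
        # (T, d_model) -> (h, T, d_head): one slice of the projection per head
        return M.reshape(T, h, d_head).transpose(1, 0, 2)

    heads, _ = attention(split(X @ W_q), split(X @ W_k), split(X @ W_v))
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_o  # final linear transformation


# Hypothetical example: 10 feature steps, model dimension 64, 4 heads.
rng = np.random.default_rng(0)
T, d_model, h = 10, 64, 4
X = rng.standard_normal((T, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multihead_attention(X, W_q, W_k, W_v, W_o, h)
_, w = attention(X, X, X)
```

Each row of `w` sums to one, so every output position is a weighted average of the values, which is what allows the block to relate distant parts of the spectrum sequence.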
Cross-modal transfer learning
Since the DAS system enables distributed monitoring, it is more suitable for fastener failure detection on railway tracks spanning tens of kilometers. However, its vibration monitoring primarily relies on the Rayleigh scattering effect, using the slight phase shift caused by vibration in the optical fiber to detect vibrations along the fiber. As a result, compared to traditional vibration measurement devices such as accelerometers, the data from DAS typically contains more noise and may be more difficult to interpret. Therefore, a promising approach is to transfer the features measured by the accelerometers into the DAS data, enabling monitoring based on DAS.
In traditional transfer learning tasks, knowledge is usually transferred between different domains with the same type of data.53,54 However, in this study, we aim to transfer the key features of rail vibration obtained from accelerometer signals into a network that uses DAS signals as input. The data acquisition methods for DAS signals and accelerometer signals are completely different, which means that the deep learning network extracts features from these two types of signals in fundamentally different ways. Therefore, cross-modal transfer learning is proposed in this study. It is important to note that, in machine learning research, the term “modality” refers to a type or form of data, which is not the same concept as a vibration mode in structural dynamics. The cross-modal transfer learning framework proposed in this article is shown in Figure 3.

Figure 3. The deep learning training process of the proposed cross-domain adaptation: (a) pretraining the model using multiple data sources and (b) fine-tuning the model on the target domain.
In our framework, two independent encoders,
where n represents the number of samples,
The second term is encoder discrepancy (ED) loss
which minimizes the direct
The overall pretraining loss is a weighted sum of these terms:
with hyperparameter
As shown in Figure 3(b), after pretraining, the weights of
where
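As a rough illustration of how a weighted pretraining loss of this kind can be assembled, the sketch below combines a prediction term with an encoder discrepancy term. Everything specific in it is an assumption made for illustration and not the setting used in this study: the function names, the use of mean squared error on both the acceleration and DAS branches for the prediction term, the squared Euclidean distance between the two encoders' features for the discrepancy term, and the weight `lam=0.5`.

```python
import numpy as np


def prediction_loss(y_true, y_pred):
    """Mean squared error between true and predicted FFI values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def encoder_discrepancy_loss(z_acc, z_das):
    """Mean squared Euclidean distance between paired encoder features.

    z_acc, z_das: arrays of shape (n_samples, feature_dim) produced by the
    acceleration encoder and the DAS encoder for the same events.
    """
    diff = np.asarray(z_acc, float) - np.asarray(z_das, float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def pretraining_loss(y_true, y_acc, y_das, z_acc, z_das, lam=0.5):
    """Weighted sum of the prediction and encoder discrepancy terms.

    lam is a hypothetical value for the weighting hyperparameter.
    """
    pred = prediction_loss(y_true, y_acc) + prediction_loss(y_true, y_das)
    return pred + lam * encoder_discrepancy_loss(z_acc, z_das)


# Toy example with three paired samples and two-dimensional features.
y_true = [0.0, 1.0, 2.0]
y_acc = [0.0, 1.0, 2.0]          # acceleration branch predicts perfectly
y_das = [0.0, 1.0, 3.0]          # DAS branch is off by one on the last sample
z_acc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_das = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
total = pretraining_loss(y_true, y_acc, y_das, z_acc, z_das)
```

Driving the discrepancy term toward zero pulls the DAS features toward the acceleration features for the same impact, which is the sense in which knowledge from the cleaner modality is transferred.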
Interpretable mechanisms
Typically, deep learning models are considered black boxes, meaning that their decision-making processes are often difficult to interpret.55,56 This lack of transparency has become one of the most significant barriers to applying deep learning methods in engineering fields such as structural health monitoring. Understanding the underlying reasoning behind a model’s predictions is crucial for trust and validation. 57
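To address this issue, local interpretable model-agnostic explanations (LIME) is applied in this study to explain the basis on which the deep learning models detect fastener failure. 58 The core concept of LIME is to assess how the output of the predictive model changes when the input data are perturbed. A new dataset is created by generating perturbed samples and their corresponding predictions. An interpretable model is then trained on this new dataset, where the instances are weighted based on their proximity to the instance being explained. This approach helps identify which input features the model relies on to make its prediction. This process can be described as follows59,60:
$$\xi(x) = \underset{g \in G}{\arg\min}\; \mathcal{L}(f, g, \pi_x) + \Omega(g)$$
where g denotes the explanation model for the instance x, G is the family of candidate interpretable models, f is the model being explained, $\pi_x$ is the proximity measure that weights perturbed samples around x, $\mathcal{L}$ measures how poorly g approximates f in the locality defined by $\pi_x$, and $\Omega(g)$ penalizes the complexity of g.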
In this study, the input to the deep learning model is signals, which are sequences of data that cannot be processed by traditional LIME. Based on traditional algorithms, we propose a signal-specific LIME technique. This technique divides the input signal into segments of equal length, with each segment’s response considered as a feature of the input, thus forming a new dataset. Specifically, let the original time series be
Divide X into K equal segments:
For each segment
The perturbed time series corresponding to a binary vector
Define the weight for a perturbed sample
where
where
To train the explanation model, the loss function is defined as follows:
where the summation is over the set of perturbed samples Z. The final optimization problem for the time series explanation is
Subsequently, LIME constructs and trains an interpretable model on this dataset, which can be used to assess the contribution of each signal segment to the prediction.
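The segment-wise procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study. The function name `signal_lime`, replacing switched-off segments with zeros, the exponential kernel on the fraction of switched-off segments, the kernel width `sigma`, and the small ridge penalty are all assumptions chosen for the example.

```python
import numpy as np


def signal_lime(f, x, K=8, n_samples=500, sigma=0.75, ridge=1e-3, seed=0):
    """Estimate the contribution of each of K segments of x to f(x).

    f: black-box model mapping a 1D sequence to a scalar prediction.
    Returns an array of K coefficients of the weighted linear surrogate.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    bounds = np.linspace(0, len(x), K + 1).astype(int)  # segment edges

    # Binary vectors: 1 keeps a segment, 0 switches it off.
    Z = rng.integers(0, 2, size=(n_samples, K))
    Z[0] = 1  # include the unperturbed instance

    preds = np.empty(n_samples)
    for i, z in enumerate(Z):
        xp = x.copy()
        for k in range(K):
            if z[k] == 0:
                xp[bounds[k]:bounds[k + 1]] = 0.0  # assumed baseline value
        preds[i] = f(xp)

    # Proximity weights: samples closer to the original count more.
    d2 = (1 - Z).sum(axis=1) / K
    pi = np.exp(-d2 / sigma ** 2)

    # Weighted ridge regression g(z) = w0 + w . z (intercept not penalized).
    A = np.column_stack([np.ones(n_samples), Z])
    Aw = A * pi[:, None]
    reg = ridge * np.eye(K + 1)
    reg[0, 0] = 0.0
    coef = np.linalg.solve(Aw.T @ A + reg, Aw.T @ preds)
    return coef[1:]


# Toy check: a "model" that only looks at samples 20..29 (segment index 2).
toy_model = lambda s: float(s[20:30].sum())
contrib = signal_lime(toy_model, np.ones(80), K=8)
```

In this toy case the surrogate assigns essentially all of the contribution to segment 2 and close to zero to the others. With a trained network in place of `toy_model`, positive and negative coefficients would indicate segments that raise or lower the predicted FFI.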
Field experiments and signal analysis
This section describes a field experiment on rail fastener failure. “Experimental setup” section provides a detailed introduction to the experiment site, equipment, deployment, and conditions. “Experimental results” section presents the responses measured by the accelerometers and the DAS system, along with a detailed comparison and analysis of these results.
Experimental setup
The conditions at the experimental site and the layout of the sensors are shown in Figure 4. Five accelerometers were installed on the rail waist and rail foot, each capable of measuring the acceleration response in three directions, and the DAS optical fiber was deployed on the rail foot, which was fixed by adhesive tapes in order to be tightly attached to the rail foot. The hammer loading position was located between the fastener #a and the fastener #b. The positions of the five accelerometers are shown in Figure 4(a). In the DAS system, the six SPs closest to the hammer impact (labeled SP-A to SP-F) were selected for response analysis in this study. Among them, SP-B to SP-E are explicitly marked in the figure, while SP-A and SP-F are located on either side of the impact point and are the farthest from it. It is important to note that each SP in the DAS system does not correspond to a single physical location. Instead, it represents an aggregated response over a certain segment of the optical fiber, obtained by spatially integrating the signal along that segment. Thus, each SP can be regarded as a virtual sensor covering a specific distance.

Figure 4. Hammer test setup for the track and sensor layout: (a) schematic diagram of sensor and hammering positions, (b) field view of accelerometers layout, (c) field view of accelerometers layout, and (d) field view of tight and loose fastening conditions.
The equipment involved in this experiment is shown in Figure 5. Specifically, a PCB model 086D20 impact hammer (PCB Piezotronics, Depew, NY, USA) was used to provide the excitation force for the rail, PCB model 356A25 accelerometers (PCB Piezotronics, Depew, NY, USA) with a measurement range of 200 g were used to measure the rail response, and an NI-9234 data acquisition module was used to acquire the acceleration signals. The AP Sensing N52 DAS system (AP Sensing GmbH, Böblingen, Germany), including the interrogator unit and the data processing unit, was used to send laser pulses into the fiber-optic cable and to collect and process the signal data.

Figure 5. Equipment involved in the field experiment: (a) impact hammer, (b) accelerometer, (c) accelerometer data acquisition module, (d) DAS interrogator unit and processing unit, and (e) fiber-optic cable.
In this experiment, six working conditions were set up to characterize various fastener failure conditions, as listed in Table 1. Specifically, in condition 1, all the fastener clips were tightened, which represents the normal condition and serves as the control group. The remaining conditions represent abnormal conditions, that is, some of the fastener clips were loosened. The highest number of loosened fastener clips occurred in condition 6, where three clips were loosened. A field view of the tight and loose fastener clips is shown in Figure 4(d).
Table 1. Mechanical parameters used in the cases.
Experimental results
Figure 6 shows the hammer impact load acquired in the experiment and the dynamic response measured by the accelerometer and DAS. Specifically, the dynamic response mainly consists of the forced vibration response of the rail under the hammer impact load and the free vibration response of the rail without external load afterward. From Figure 6(a), it can be seen that the load is applied to the structure for about 0.005 s, at which moment the responses measured by both accelerometers and DAS peak, while its dynamic response gradually decays for most of the subsequent time. The RMS values of the accelerometer and DAS measured responses are also shown in Figure 6(b) and (c), respectively. It can be noticed that the dynamic response of both sensors decays rapidly around the first 0.2 s, which is marked as a gray area, and gradually stabilizes thereafter. This stabilized value is considered as the background noise of the test site. It can be seen that the noise in the DAS measurements is significantly larger than that measured by the accelerometer, which may be due to the fact that the DAS is more sensitive to ambient sound, including sounds caused by mechanical equipment and human activities in the test site. Also, since it is a distributed measurement system, the signal from each sampling point is actually a combined reflection of vibration and sound within a certain range near that point, rather than a strictly single-point measurement, and thus may be more widely affected by environmental factors. To reduce the effect of ambient noise on the signal analysis, the time period of rapid decay of the response obtained from both sensors was extracted for subsequent response analysis, that is, the first 0.2 s.

Figure 6. The typical signal obtained from a hammer impact test: (a) hammer impact force signal, (b) acceleration signal, and (c) DAS signal. DAS: distributed acoustic sensing.
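The windowing step described above, followed by conversion to a frequency spectrum, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the sampling rate of 5120 Hz, the mean removal, and the normalization by the peak amplitude are hypothetical choices for the example, and the synthetic decaying sinusoid merely stands in for a measured response.

```python
import numpy as np


def free_vibration_spectrum(signal, fs, t_impact=0.0, window=0.2):
    """Amplitude spectrum of the first `window` seconds after the impact.

    signal: 1D time series; fs: sampling rate in Hz (assumed known);
    t_impact: time of the hammer impact in seconds.
    """
    start = int(round(t_impact * fs))
    stop = start + int(round(window * fs))
    seg = np.asarray(signal[start:stop], dtype=float)
    seg = seg - seg.mean()                      # remove any DC offset
    amp = np.abs(np.fft.rfft(seg))
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / fs)
    peak = amp.max()
    # Peak normalization (an assumption) reduces dependence on impact force.
    return freqs, (amp / peak if peak > 0 else amp)


# Synthetic stand-in for a rail response: decaying 265 Hz oscillation.
fs = 5120.0                                     # hypothetical sampling rate
t = np.arange(int(fs)) / fs                     # one second of samples
demo = np.exp(-20.0 * t) * np.sin(2.0 * np.pi * 265.0 * t)
freqs, spec = free_vibration_spectrum(demo, fs)
peak_freq = freqs[np.argmax(spec)]
```

With a 0.2 s window the frequency resolution is 5 Hz, which bounds how finely shifts in the natural frequency can be resolved without zero-padding or a longer window.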
The typical spectra of the free vibration responses of the rail in the vertical, lateral, and longitudinal directions under hammering loads are shown in Figures 7 to 9. The spectra of the responses measured at different locations follow a consistent trend: when the clips are tightened, the vertical vibration of the rail peaks near 265 Hz and the lateral vibration peaks near 112 Hz. These two frequencies are considered to be the first-order natural frequencies of the rail in the respective directions, which is consistent with the results of existing research.2–4 At this modal frequency, the rail undergoes in-phase motion across all positions, which can be mechanically modeled as a simplified mass-spring system. In the other conditions, the fastener clips of the rail are loosened and the natural frequency of the rail is reduced in both directions. Specifically, in condition 6, when three fastener clips are loosened, the vertical natural frequency of the rail is reduced to approximately 230 Hz and the lateral natural frequency is reduced to approximately 102 Hz. This is because, in the mechanical system of the rail, the rubber fastener pads provide elastic support for the rail. The main function of the fastener clip in the system is to provide sufficient downforce on the rail, thus ensuring close contact between the rail and the fastener pad. Therefore, when the clip is loosened, the support provided by the fastener pad to the rail is weakened, which can be regarded as a reduction in the stiffness of the elastic support in the mechanical system, resulting in a reduction in the natural frequency of the system.

Figure 7. The frequency spectrum of the vertical dynamic response of each accelerometer under different conditions (the condition is labeled as C in the legend): (a) accelerometer #1, (b) accelerometer #2, (c) accelerometer #3, (d) accelerometer #4, and (e) accelerometer #5.

Figure 8. The frequency spectrum of the lateral dynamic response of each accelerometer under different conditions (the condition is labeled as C in the legend): (a) accelerometer #1, (b) accelerometer #2, (c) accelerometer #3, (d) accelerometer #4, and (e) accelerometer #5.

Figure 9. The frequency spectrum of the longitudinal dynamic response of each accelerometer under different conditions (the condition is labeled as C in the legend): (a) accelerometer #1, (b) accelerometer #2, (c) accelerometer #3, (d) accelerometer #4, and (e) accelerometer #5.
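The reported frequency shifts can be related to support stiffness with a short calculation. The sketch below assumes the simplified single-degree-of-freedom mass-spring idealization, in which the natural frequency is proportional to the square root of k/m, and further assumes that the effective mass is unchanged when clips are loosened. Under those assumptions the stiffness ratio is the square of the frequency ratio. The function name is hypothetical, and the result is an order-of-magnitude indication rather than a measured stiffness.

```python
def stiffness_ratio(f_loose_hz, f_tight_hz):
    """Effective stiffness ratio k_loose / k_tight for a mass-spring system.

    Follows from f = sqrt(k / m) / (2 * pi) with the mass m held constant.
    """
    return (f_loose_hz / f_tight_hz) ** 2


# Approximate peak frequencies reported for condition 1 (tight) and condition 6.
vertical = stiffness_ratio(230.0, 265.0)
lateral = stiffness_ratio(102.0, 112.0)
print(f"vertical: {vertical:.3f}, lateral: {lateral:.3f}")
```

This gives ratios of about 0.753 and 0.829, that is, roughly a 25% and a 17% reduction in effective support stiffness in the vertical and lateral directions, respectively, under the stated idealization.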
In this experiment, the responses measured at the six DAS SPs closest to the hammering point were used for the analysis, and the spectra of the responses measured by the DAS under different conditions are illustrated in Figure 10. The DAS response for all conditions shows a significant peak near 50 Hz, where no significant response is observed in the acceleration signal. This phenomenon is considered to be related to the installation method of the fiber-optic cable: because the cable is attached to the surface of the rail, the vibration response may not be fully transmitted from the rail to the fiber, which may give rise to a significant 50 Hz response that is not measured by the rail accelerometers.

Figure 10. The frequency spectrum of the responses measured at each SP of the DAS system under various operating conditions (the condition is labeled as C in the legend): (a) SP-A, (b) SP-B, (c) SP-C, (d) SP-D, (e) SP-E, and (f) SP-F. SP: sensing point; DAS: distributed acoustic sensing.
In addition, compared to the acceleration response, the response measured by DAS is noisier, especially in the high-frequency band. It peaks in many frequency bands without showing a clear characteristic trend, which makes it difficult to interpret on the basis of mechanical knowledge. This is because DAS measurements are based on fiber-optic interferometry, which senses environmental vibration and sound through phase changes of the light transmitted in the fiber. Also, the signal at each DAS SP is not independent but is a composite reflection of vibrations and sounds in the area adjacent to a section of fiber, which makes it more susceptible to ambient noise. In addition, the signal at each SP is usually spatially smoothed by interpolation or filtering, which attenuates the high-frequency signal while amplifying the low-frequency noise component.
However, although not particularly pronounced, the peak of the DAS signal near 100–110 Hz also varies with the conditions, showing a trend similar to that of the lateral acceleration response of the rail. By contrast, at the vertical natural frequency of the rail, that is, around 250 Hz, the DAS signal is much noisier and therefore does not show a clear pattern. These observations show that DAS signals are not as easy to interpret as acceleration signals because of their different acquisition principle, but they still contain components related to the structural dynamic characteristics that are also measured by accelerometers. This implies that more sophisticated methods, such as machine learning, are needed in DAS-based structural anomaly diagnosis and detection in order to fully exploit the physical information embedded in the signals.
Model performance and analysis
In this section, the performance of the deep learning model on the data obtained from the field experiments is evaluated and discussed. Specifically, “Model implementation” section provides a comprehensive description of the model implementation, including dataset preprocessing, the configuration of deep learning model parameters, and performance evaluation metrics. “Performance based on individual data source” section presents the performance of models trained using acceleration signals and DAS signals independently, along with an evaluation of these results using interpretability techniques. “Performance based on cross-modal transfer learning” section discusses the performance of the deep learning model trained with the proposed cross-modal transfer learning method, accompanied by an in-depth analysis and discussion of the findings.
Model implementation
Based on the method proposed in the second section, deep learning models were developed using only DAS signals, only acceleration signals, as well as the proposed cross-modal transfer learning framework. The field experimental data obtained in the third section were randomly divided into training, validation, and test datasets in a ratio of 7:1:2. These datasets were used to train and evaluate the developed deep learning models. Specifically, the training set was used to fit the model, the validation set was used to monitor overfitting and determine the optimal model checkpoint during training, and the test set was used to evaluate the final performance of the trained models. To reduce the randomness introduced by a single data split and enhance the robustness of the evaluation results, the dataset was randomly partitioned five times, and model training and testing were conducted independently on each split. Specifically, for the acceleration signals, the input to the model consisted of the spectral data of the acceleration responses in three directions measured by one accelerometer. For the DAS signals, the data from the six SPs closest to the impact zone were used as the input to the deep learning model.
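The repeated random partitioning described above can be sketched as follows. This is a minimal illustration, assuming a simple shuffled split without stratification by condition. The function name, the seed, and the sample count of 200 are hypothetical and do not reflect the size of the field dataset.

```python
import numpy as np


def repeated_splits(n_samples, ratios=(0.7, 0.1, 0.2), n_repeats=5, seed=0):
    """Return n_repeats random (train, val, test) index partitions.

    ratios gives the train/validation/test proportions (7:1:2 here).
    The test set receives whatever remains after train and validation.
    """
    rng = np.random.default_rng(seed)
    n_train = int(round(ratios[0] * n_samples))
    n_val = int(round(ratios[1] * n_samples))
    splits = []
    for _ in range(n_repeats):
        idx = rng.permutation(n_samples)        # fresh shuffle for each repeat
        train = idx[:n_train]
        val = idx[n_train:n_train + n_val]
        test = idx[n_train + n_val:]
        splits.append((train, val, test))
    return splits


# Hypothetical dataset of 200 impact records.
splits = repeated_splits(200)
sizes = [(len(tr), len(va), len(te)) for tr, va, te in splits]
```

Training and testing on each partition independently and then summarizing the test errors across the five runs gives an indication of how sensitive the reported accuracy is to the particular split.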
The encoders for both acceleration and DAS signals are CNN and Transformer modules. Specifically, the CNN consists of three one-dimensional convolutional layers with kernel size 3 and filter sizes of 256, 128, and 64, respectively, enabling effective feature extraction and compression from the raw input. Each layer uses LeakyReLU activation followed by batch normalization to improve training stability. A max pooling layer is applied after the CNN to further reduce the temporal dimension. The transformer component comprises three stacked blocks, each containing a four-head multihead self-attention mechanism with a model dimension of 64, along with a feedforward neural network, residual connections, and layer normalization to enhance temporal modeling capability and stability. Dropout layers are incorporated in all transformer blocks to mitigate overfitting. After temporal features are extracted by the encoder, a fully connected regression module is applied. This module begins with a flattening layer, followed by two dense layers with 128 and 32 units, respectively, using LeakyReLU activation and dropout for regularization. Finally, a sigmoid-activated output layer predicts the FFI value.
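As a rough size check on the convolutional stack described above, the number of trainable convolution weights can be worked out by hand. The sketch below assumes that the three acceleration directions (or the six DAS SPs) are treated as input channels and that each convolution has a bias term; neither detail is stated in the text, and batch normalization parameters are left out.

```python
def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Trainable weights of one 1-D convolution (batch-norm parameters not included)."""
    return (in_channels * kernel_size + (1 if bias else 0)) * out_channels


def cnn_stack_params(in_channels, filters=(256, 128, 64), kernel_size=3):
    """Per-layer weight counts for the three stacked convolutions."""
    counts = []
    for out_channels in filters:
        counts.append(conv1d_params(in_channels, out_channels, kernel_size))
        in_channels = out_channels
    return counts


ACC_CHANNELS = 3  # three acceleration directions, assumed to be input channels
DAS_CHANNELS = 6  # six SPs, assumed to be input channels

acc_counts = cnn_stack_params(ACC_CHANNELS)
das_counts = cnn_stack_params(DAS_CHANNELS)

# Four attention heads sharing a model dimension of 64 gives 16 dimensions per head.
head_dim = 64 // 4
```

Under these assumptions only the first layer differs between the two encoders, since the later layers see the same 256 and 128 input channels in both cases.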
In the model training, the Adam optimizer 61 was employed in conjunction with a dynamic learning rate strategy, where the learning rate gradually decreased from 1e-3 to 1e-4 to enhance the convergence efficiency and stability of the model. The initial number of training iterations for each model was set to 2000, and the iteration count was further increased if the training loss had not stabilized. A batch size of 20 was used during training. Additionally, to mitigate overfitting to the training data, the model checkpoint with the lowest validation loss observed during training was selected as the final model. 62 To comprehensively evaluate the model performance, MSE was used as the evaluation metric, quantifying the model’s prediction accuracy on the test set, which is defined as follows:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
where N represents the number of samples, and $y_i$ and $\hat{y}_i$ denote the true and predicted FFI values of the i-th sample, respectively.
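Two of the training ingredients above, the decaying learning rate and the MSE metric, can be sketched in a few lines. The shape of the decay is not specified in the text, so a geometric (exponential) decay from 1e-3 to 1e-4 is assumed here purely for illustration; the function names are hypothetical.

```python
def exponential_lr(epoch, total_epochs=2000, lr_start=1e-3, lr_end=1e-4):
    """Learning rate decaying geometrically from lr_start (epoch 0) to lr_end (last epoch)."""
    progress = epoch / (total_epochs - 1)
    return lr_start * (lr_end / lr_start) ** progress


def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values, chosen only to exercise the functions.
lr_first = exponential_lr(0)
lr_last = exponential_lr(1999)
example_mse = mse([0.0, 1.0], [0.5, 1.0])
```

A linear or step-wise decay would satisfy the same description; only the end points are taken from the text.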
In addition to MSE, several classification metrics were adopted, including classification accuracy, macro-averaged precision, recall, and F1 score. Since the ground truth values are discrete while the predicted values are continuous, each predicted value was first mapped to the nearest discrete class label. Accuracy was then calculated as the ratio of the number of correct predictions to the total number of samples:
$$\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N}$$
where $N_{\mathrm{correct}}$ is the number of correctly classified samples and N is the total number of samples.
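To better capture performance across all classes, especially in the case of class imbalance, macro-averaged precision, recall, and F1 score were also calculated. These metrics are defined as follows:
$$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FP_i}$$
$$\mathrm{Recall}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FN_i}$$
$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\frac{2\,\mathrm{Precision}_i\,\mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$$
where C is the total number of classes, $TP_i$, $FP_i$, and $FN_i$ denote the numbers of true positives, false positives, and false negatives for class i, respectively, and $\mathrm{Precision}_i = TP_i/(TP_i + FP_i)$ and $\mathrm{Recall}_i = TP_i/(TP_i + FN_i)$ are the per-class precision and recall. Macro-averaging ensures that each class contributes equally to the overall score, regardless of its frequency in the dataset.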
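The classification metrics can be illustrated with a small worked example. The sketch assumes that the four FFI levels are scaled to 0, 1/3, 2/3, and 1 to match the range of the sigmoid output, and that macro F1 is the mean of the per-class F1 scores; both are assumptions, and the sample predictions are invented for illustration.

```python
def nearest_class(value, levels):
    """Index of the discrete level closest to a continuous prediction."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))


def macro_metrics(true_classes, pred_classes, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(true_classes, pred_classes))
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return (accuracy, sum(precisions) / n_classes,
            sum(recalls) / n_classes, sum(f1s) / n_classes)


LEVELS = [0.0, 1 / 3, 2 / 3, 1.0]  # assumed scaling of FFI = 0, 1, 2, 3
y_true = [0, 0, 1, 1, 2, 2, 3, 3]  # hypothetical class labels
y_cont = [0.02, 0.05, 0.30, 0.55, 0.70, 0.64, 0.98, 0.80]  # hypothetical outputs

y_pred = [nearest_class(v, LEVELS) for v in y_cont]
accuracy, precision, recall, f1 = macro_metrics(y_true, y_pred, len(LEVELS))
```

In this toy case two samples fall closer to the FFI = 2 level than to their true level, which lowers the recall of classes 1 and 3 and the precision of class 2.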
Moreover, the coefficient of the ED loss term
Performance based on individual data source
Figure 11 illustrates the loss curves of the accelerometer-based and DAS-based models during a representative training process. Both models show convergence in training and validation losses within 2000 epochs, indicating the adequacy of the selected training duration. Overall, the DAS-based model exhibits more pronounced overfitting: while its training loss continues to decrease, its validation loss initially declines but later fluctuates significantly with an upward trend. In contrast, the validation loss of the accelerometer-based model also fluctuates but generally shows a downward and convergent trend. This is likely due to higher noise levels in the DAS signals, which make the model more prone to fitting noise patterns in the training data, thereby reducing generalization ability and leading to overfitting. To mitigate this issue, the model checkpoints corresponding to the lowest validation loss were selected for subsequent evaluation, which occurred at epoch 1772 for the accelerometer-based model and epoch 1262 for the DAS-based model.
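The checkpoint selection rule used here amounts to picking the epoch with the lowest validation loss. A minimal sketch follows, with an invented loss history and a hypothetical function name; in practice the model weights would be saved whenever a new minimum is reached.

```python
def best_checkpoint(val_losses):
    """Return (epoch, loss) of the lowest validation loss; epochs are 1-indexed."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1, val_losses[best_index]


# Hypothetical validation losses that decline and then drift upward.
val_history = [0.050, 0.031, 0.024, 0.027, 0.022, 0.029, 0.035]
best_epoch, best_loss = best_checkpoint(val_history)
```

With a history that keeps fluctuating, as reported for the DAS-based model, the selected epoch can fall well before the end of training.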

Training and validation loss curves: (a) accelerometer-based model and (b) DAS-based model. DAS: distributed acoustic sensing.
Table 2 presents a performance comparison between the DAS-based and accelerometer-based models on the test set. In terms of MSE, the accelerometer-based model significantly outperforms the DAS-based model, with an average MSE of 3.86 × 10⁻⁵ compared to 4.52 × 10⁻³. Similarly, the classification accuracy of the accelerometer-based model is higher, achieving an average of 99.50%, while the DAS-based model reaches an average of 94.00%. These results indicate that the accelerometer-based model can accurately predict the FFI using rail response signals. Although the DAS-based model also achieves reasonable accuracy, its overall performance is lower. This is likely due to the higher signal-to-noise ratio and stability of accelerometer signals, which facilitate the extraction of informative features. In contrast, DAS signals are more susceptible to coupling quality and environmental noise, which may lead to overfitting during training and reduce the model’s generalization ability on unseen data.
Performance comparison between DAS-based and accelerometer-based models on the test set.
DAS: distributed acoustic sensing; Std: standard deviation.
Figures 12 and 13 show the contributions of different frequency bands in the acceleration signals to the model predictions, obtained using interpretability techniques. Positive contributions are represented by red bars, indicating that the feature increases the predicted FFI value, while negative contributions are represented by green bars, indicating that the feature decreases the predicted FFI value.

The contribution of various frequency features of the acceleration signal to the results of the model predictions, which were derived using the LIME technique. These features are represented by frequencies in the X, Y, and Z directions, ordered by their contribution to the predictions. The different panels represent samples from different conditions: (a) FFI = 0, (b) FFI = 1, (c) FFI = 2 and (d) FFI = 3. LIME: local interpretable model-agnostic explanations; FFI: fastener failure index.

The spectra of acceleration signals in the X, Y, and Z directions for each condition, with colored vertical bars overlaid on the spectra indicating the contribution of specific frequency bands identified by LIME: (a) FFI = 0, (b) FFI = 1, (c) FFI = 2 and (d) FFI = 3. LIME: local interpretable model-agnostic explanations; FFI: fastener failure index.
It can be observed from the figures that the response frequency bands contributing the most to the model predictions are primarily concentrated in the 220–270 Hz vertical response band and the 100–115 Hz lateral response band, which are consistent with the natural frequencies of the rail in both directions as analyzed in the third section. Specifically, as shown in Figure 13, the vertical response in the 220–250 Hz frequency band generally has a positive contribution to the FFI prediction, while in the 250–270 Hz frequency band, it has a negative contribution. This indicates that the greater the response in the 220–250 Hz band, the higher the predicted FFI, meaning a greater number of rail fastener failures. Conversely, the greater the response in the 250–270 Hz band, the lower the predicted FFI, meaning fewer rail fastener failures. This agrees with the analysis based on physical knowledge in the third section. The same phenomenon is also observed in the contributions of different frequency bands of the lateral acceleration response. From these results, it can be seen that the deep learning model trained with acceleration data relies on features for detecting the rail FFI that are consistent with the underlying physical knowledge. This consistency suggests good robustness and generalization, indicating that the trained model should be applicable to rail fastener failure detection in other datasets under similar conditions.
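The band-level reading of the LIME results above can be reproduced by summing the signed feature weights that fall inside each frequency band. The sketch below uses invented weights keyed by response direction and frequency; the numbers, the key layout, and the function name are hypothetical and do not come from the study.

```python
def band_contribution(weights, direction, f_low, f_high):
    """Sum the signed weights of one direction whose frequency lies in [f_low, f_high)."""
    return sum(w for (d, f), w in weights.items()
               if d == direction and f_low <= f < f_high)


# Hypothetical LIME weights keyed by (direction, frequency in Hz).
lime_weights = {
    ("vertical", 225): 0.12,
    ("vertical", 240): 0.08,
    ("vertical", 255): -0.10,
    ("vertical", 265): -0.05,
    ("lateral", 105): 0.06,
    ("lateral", 112): -0.04,
}

low_band = band_contribution(lime_weights, "vertical", 220, 250)
high_band = band_contribution(lime_weights, "vertical", 250, 270)
```

A positive sum for the lower band and a negative sum for the upper band would correspond to the pattern described in the text, where a stronger 220–250 Hz response raises the predicted FFI and a stronger 250–270 Hz response lowers it.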
Figures 14 and 15 illustrate the contributions of different frequency bands in the DAS signals to the model predictions, obtained using interpretability techniques. The figures show that the frequency bands contributing most to the model predictions in the DAS signals are primarily concentrated in the 40–70 Hz range for each node, which corresponds to the main peaks in the DAS signal spectrum. However, responses in frequency bands related to the natural frequencies of the rail were barely utilized by the deep learning model in the prediction. This indicates that although the model achieved high prediction accuracy on the test dataset, the basis for its predictions does not align with the physical principles underlying rail fastener failure. This could be because the DAS system captured other information near 50 Hz that is indirectly related to rail fastener failure. It might also suggest the presence of overfitting during model training. With small sample sizes, the model may rely excessively on specific frequency features, which could be incidental noise or unique patterns in the data rather than truly critical features. Such overfitting may result in performance degradation when the model encounters other unseen data, limiting its generalizability. Therefore, it is necessary to implement measures to mitigate overfitting in the deep learning model based on DAS signals, guiding the model to focus more on responses in frequency bands associated with the natural frequencies of the rail.

The contribution of various frequency features of the DAS signal to the results of the model predictions, which were derived using the LIME technique. These features are represented by the frequencies of the responses of the six nodes, ordered by their contribution to the predictions. The different panels represent samples from different conditions: (a) FFI = 0, (b) FFI = 1, (c) FFI = 2 and (d) FFI = 3. DAS: distributed acoustic sensing; LIME: local interpretable model-agnostic explanations; FFI: fastener failure index.

The spectra of DAS signals in six nodes for each condition, with colored vertical bars overlaid on the spectra indicating the contribution of specific frequency bands identified by LIME: (a) FFI = 0, (b) FFI = 1, (c) FFI = 2, (d) FFI = 3. DAS: distributed acoustic sensing; LIME: local interpretable model-agnostic explanations; FFI: fastener failure index.
Performance based on cross-modal transfer learning
Using the proposed cross-modal transfer learning method described in the second section, models were separately established, pretrained, and retrained. Figure 16 illustrates the loss evolution of the proposed cross-modal transfer learning model during a representative training process, including both the pretraining and re-training phases. In Figure 16(a), the model is pretrained using both accelerometer and DAS data by jointly optimizing the MSE and ED losses. To ensure that all loss components sufficiently converge, the total number of training epochs was set to 6000. The results show that the MSE loss converges around epoch 2000, while the ED loss requires a longer period and stabilizes only after approximately 5000 epochs. The best model was obtained at epoch 5872, highlighting the importance of a longer training schedule. Figure 16(b) presents the re-training phase on DAS data. The model, benefiting from pretraining, starts with low training and validation losses, indicating a strong initial capacity to handle DAS signals. The loss continues to decrease slightly in the early phase and then stabilizes. The final model used for evaluation corresponds to the checkpoint with the lowest validation loss at epoch 1848.
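The joint objective used in pretraining can be sketched as a weighted sum of the two loss terms. The sketch assumes that "ED" denotes a Euclidean distance between the feature vectors produced by the accelerometer and DAS encoders, which this section does not spell out; the weight `ed_weight`, the feature values, and the function names are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def euclidean_distance(features_a, features_b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))


def pretraining_loss(y_true, y_pred, acc_features, das_features, ed_weight):
    """Prediction error plus a weighted feature-alignment penalty (assumed form)."""
    alignment = euclidean_distance(acc_features, das_features)
    return mse(y_true, y_pred) + ed_weight * alignment


# Hypothetical values chosen so the arithmetic is easy to follow.
total = pretraining_loss(
    y_true=[0.0, 1.0],
    y_pred=[0.0, 0.5],
    acc_features=[1.0, 2.0, 2.0],
    das_features=[1.0, -1.0, -2.0],
    ed_weight=0.1,
)
```

Under this form the alignment term pulls the DAS features toward the accelerometer features, and the slower convergence of the ED loss reported above would show up as the second term shrinking after the first has already flattened.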

Training and validation loss curves of the proposed cross-modal transfer learning model: (a) loss evolution during the pretraining phase and (b) loss evolution during the re-training phase.
Table 3 compares the performance of the baseline model trained solely on DAS data with the proposed cross-modal transfer learning model on the test set. The results show that the proposed method outperforms the baseline in both MSE and accuracy. Specifically, the proposed model achieves an average MSE of 8.60 × 10⁻⁴ and an average accuracy of 97.50% across multiple evaluations. These results demonstrate that the proposed method, by introducing a constraint mechanism, enables the encoder trained on accelerometer signals to effectively guide feature extraction from DAS signals. This facilitates knowledge transfer and adaptation between different signal modalities, thereby improving prediction performance and enhancing the model’s ability to detect rail fastener failures.
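The size of the improvement can be made concrete with a short calculation. The sketch takes the DAS-only averages reported earlier (MSE of 4.52 × 10⁻³ and accuracy of 94.00%) as the baseline, on the assumption that the baseline in Table 3 is the same model.

```python
def relative_reduction(baseline, improved):
    """Fractional decrease of a quantity relative to its baseline value."""
    return (baseline - improved) / baseline


# Averages quoted in the text; the DAS-only baseline is assumed to match Table 2.
mse_reduction = relative_reduction(4.52e-3, 8.60e-4)
accuracy_gain_points = 97.50 - 94.00
error_rate_reduction = relative_reduction(100 - 94.00, 100 - 97.50)
```

Under that assumption the average MSE drops by roughly 81%, accuracy rises by 3.5 percentage points, and the misclassification rate falls from 6.0% to 2.5%, a relative reduction of about 58%.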
Performance comparison between the DAS-only model and the proposed cross-modal transfer learning model on the test set.
DAS: distributed acoustic sensing; MSE: mean squared error; Std: standard deviation.
To further investigate the underlying mechanisms behind the improved performance of the proposed cross-modal transfer learning model, we employed the LIME technique to analyze the factors contributing to its predictions. Figures 17 and 18 show the parts of the DAS signal that contribute most to the prediction results after applying the proposed cross-modal transfer learning. Compared with the model shown in Figure 15, which uses only DAS data for training, the model trained with the proposed cross-modal transfer learning method not only relies on the response in the 40–60 Hz frequency band but also extracts features from a wider range of frequency bands. Specifically, in the 100–120 Hz frequency band, which corresponds to the lateral natural frequency of the rail, the model trained with the proposed method is able to capture the information contained in the acceleration response of the rail. This allows the model to account for changes in the mechanical characteristics of the rail under different conditions, thereby improving its robustness.

The contribution of various frequency features of the DAS signal to the results of the model predictions using the proposed cross-domain adaptation method, which was derived using the LIME technique. These features are represented by the frequencies of the responses of the six nodes, ordered by their contribution to the predictions. The different panels represent samples from different conditions: (a) FFI = 0, (b) FFI = 1, (c) FFI = 2 and (d) FFI = 3. DAS: distributed acoustic sensing; LIME: local interpretable model-agnostic explanations; FFI: fastener failure index.

The spectra of DAS signals in six nodes for each condition, with colored vertical bars overlaid on the spectra indicating the contribution of specific frequency bands to the model prediction using the proposed cross-domain adaptation method: (a) FFI = 0, (b) FFI = 1, (c) FFI = 2 and (d) FFI = 3. DAS: distributed acoustic sensing; FFI: fastener failure index.
Discussion
As indicated by the above analysis, although acceptable accuracy can be achieved using DAS signals alone with deep learning for rail fastener failure detection, the extracted features often lack a strong connection to mechanical principles. Consequently, the generalization capability of such models may be limited. For example, a model trained on one section of the track may not perform well in other locations with different structural or environmental characteristics.
To address this limitation, we propose a cross-modal transfer learning framework that uses physically meaningful features extracted from high-precision, spatially limited point sensors (e.g., accelerometers) to guide feature learning in the DAS-based model. This strategy encourages the model to focus on generalizable characteristics, such as structural natural frequencies, rather than site-specific or noise-sensitive patterns, thereby improving both robustness and interpretability.
At the core of this approach is a physics-guided feature transfer mechanism that enhances the value of distributed sensing. By leveraging high-fidelity local measurements to inform the interpretation of lower-resolution, wide-coverage DAS data, the method is particularly effective in scenarios where dense deployment of high-quality sensors is impractical. This enables more reliable assessment of structural health over large-scale infrastructure systems. Given these advantages, the proposed framework shows strong potential for extension beyond railway applications. Other civil structures, such as bridges, tunnels, and underground facilities, often involve a similar combination of localized high-quality sensors and the need for large-scale distributed monitoring. The proposed method can be readily adapted to such contexts to enable efficient and generalizable anomaly detection.
For practical deployment, we suggest that, alongside the installation of the distributed sensing system, a small number of high-precision sensors be deployed at selected key sections. These point sensors can be used to extract physically relevant features that guide DAS-based analysis through transfer learning, thereby enhancing the model’s adaptability across different track segments and geographic environments.
One limitation of this study is that the experimental data were obtained under artificial excitation, resulting in low levels of ambient noise. Future work will involve field experiments under actual train operation, where the vibration responses are expected to contain more complex and noisy patterns. Addressing this challenge will be essential for promoting real-world adoption of the proposed method.
Conclusions
In this study, a novel DAS-based rail fastener failure detection method is proposed. Specifically, a new cross-modal transfer learning approach is developed and applied, enabling the transfer of physical features derived from the rail accelerometer responses to the deep learning model using DAS signals. This approach facilitates the fusion of the two data types in rail fastener monitoring, enhancing the generalization capability of the DAS-based deep learning model. Furthermore, interpretability techniques are employed to quantify the key features related to fastener failures used by the model in predictions, thereby improving its interpretability. In addition, a field experiment on rail fastener failures was conducted, in which the dynamic responses of the rail under various failure conditions were recorded using both accelerometers and the DAS system. The proposed method was validated on the field experimental data.
The results indicate that rail fastener failures lead to changes in the natural frequencies of the rail, particularly in its lateral and vertical modes. The deep learning model trained on acceleration responses accurately detected different fastener failure conditions, with the key features identified by interpretability techniques aligning with the natural frequencies of the rail. Although the deep learning model trained on DAS responses can also achieve a relatively high prediction accuracy, the key features used in its predictions lack a clear physical interpretation, and thus the model is considered to have poor generalization capability. In contrast, the deep learning model trained using the proposed cross-modal transfer learning method effectively transferred the key feature information from acceleration data to the DAS-based model. As a result, it achieved higher prediction accuracy, and the frequency bands of its key features aligned with the natural frequencies of the rail. These findings validate the effectiveness of the proposed method, demonstrating its ability to improve the generalization capability and interpretability of DAS-based deep learning models. Moreover, the proposed framework, which transfers critical structural features from point sensor data to distributed sensing data, shows promising potential for application in other structural monitoring scenarios such as bridges, tunnels, and underground facilities.
Footnotes
Author contributions
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research/project was supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG award no: AISG2-TC-2021-001), the Ministry of Education Tier 1 Grants, Singapore (no. RG136/22), and NTU Start-up Grant (03INS001210C120). Moreover, the authors would like to thank the Land Transport Authority and SBS Transit Ltd of Singapore for their kind support and assistance in the field experiments.
Data availability statement
The experimental vibration data used in this study cannot be shared publicly due to restrictions imposed by a collaboration agreement with a government agency.
