
Abstract

In recent years, the application of pretrained models in specialized domains has become increasingly important. Traditionally, adapting these models involved fine-tuning their parameters and structures through retraining. However, fine-tuning can be inefficient, particularly when the data come from a specific domain or when the lower layers of large-scale pretrained models must be modified. This study investigates the effectiveness of using pretrained models with frozen weights for a downstream task in the railway system, namely railway track detection. To this end, we employed large-scale semantic segmentation models that had been pretrained on extensive datasets. The weights of these models were kept fixed, eliminating the need for retraining. We conducted a comparative analysis of pretrained models sourced from different datasets to identify the most suitable model for the track detection system. The experiments report the precision, recall, and F1 score of the selected pretrained models, highlighting their effectiveness in the specific domain of railway track detection. Overall, this research demonstrates the practical applicability of pretrained models with frozen weights in specialized fields such as railway systems, offering insights into their usefulness and potential for improving detection algorithms in this domain.

Keywords

Downstream task, semantic segmentation, track detection, computer vision, smart railway system

Introduction

Recent deep-learning research has primarily focused on downstream tasks owing to their high generalization capability and adaptability to diverse environments.^1–3 Deep-learning models pretrained on large-scale datasets can extract useful high-level representations and patterns, allowing them to adapt easily to various situations and new data. These pretrained models significantly reduce the time and computational resources required to train on new data when applied to new domains, making them a popular choice for a wide range of tasks. The existing method of using deep-learning models for downstream tasks involves fine-tuning a part of the model that has previously learned to extract abstract features from large-scale datasets.^4–7 Fine-tuning methods include initializing the weights of the deeper layers that contain high-level features and then further training the model on new data similar to the pretraining dataset.^8,9 For datasets in which the high-level representations of the network cannot be easily used, more layers of the model are initialized before training.^10–12 These fine-tuning methods allow developers to obtain models adapted to the desired data. Moreover, it is efficient to modify only the lower layers rather than initializing the weights of the entire model to perform the downstream task. In addition, by attaching specific modules or networks to high-level feature maps obtained from large-scale datasets through several iterations, it is possible to create models that can perform other subtasks desired by the user.^13,14

Although this method can create deep-learning models capable of performing specific downstream tasks, there are limitations when using the latest pretrained deep-learning models.² Even though several large-scale datasets are publicly available, it is relatively difficult to learn weights when training on a dataset from a specific domain with low data similarity, requiring more layers to be newly trained. Moreover, recent deep-learning models that attain superior performance have grown in size, due to the trade-off between model complexity and accuracy. Therefore, they have significantly more parameters than the previous networks. The computational resources required to modify the lower layers of these latest models have increased dramatically compared to previous models, and there are limitations in learning and using them easily, depending on the model type and server environment.^15,16

A recent trend in various research areas, such as natural language processing, computer vision, and speech recognition, is the increasing size of datasets and models to achieve better performance. This has led to the rise of large-scale models such as Large Language Models, Large Vision Models, and Large Vision-Language Models (LVLMs), which are trained on massive datasets.^17–19 However, in large-scale models, it is often challenging to maintain a latent space that accurately reflects the characteristics of the original dataset when retraining on a new data domain, making it difficult to effectively learn the features of new datasets. As a result, recent deep-learning research trends have shifted toward using large-scale models with frozen weights directly for downstream tasks.^20–24

In this study, we applied pretrained large-scale semantic segmentation models, commonly used in computer vision, to railway track detection. We fed images acquired from railway vehicles to large segmentation models pretrained on large-scale datasets, selected the models that can detect railway tracks, and compared their performance. Such detection can enable real-time track monitoring, allowing early detection of and response to risks such as derailment and intrusion. Additionally, a detailed understanding of the condition of the track can help with maintenance planning, including the detection of track defects and obstacles, further improving the safety, efficiency, and sustainability of railway operations.

The proposed system offers significant advantages for railway maintenance by enabling continuous and real-time monitoring of track conditions. Specifically, maintenance vehicles equipped with an integrated track detection system can routinely patrol railway lines, identifying obstacles, debris, and structural damages such as cracks or misalignments. The early detection of these issues facilitates prompt interventions, thereby minimizing the risk of accidents and reducing service disruptions. Moreover, the system is capable of accumulating longitudinal data on track conditions, which allows for comprehensive analysis of wear and deterioration patterns over time. By comparing current track data with historical records, the system can identify gradual degradation, including the formation of minor cracks or uneven rail surfaces that may not be readily observable through manual inspections. This data-driven approach enables railway operators to accurately predict maintenance needs, supporting proactive maintenance scheduling rather than reactive repairs. Consequently, the implementation of this system enhances maintenance efficiency by reducing unplanned downtime and ensuring the consistent operation of trains. Early identification of track wear and structural defects also contributes to extending the operational lifespan of the tracks, as timely maintenance interventions prevent minor issues from escalating into more severe problems that would require extensive repairs or complete track replacements. Overall, this system not only improves safety standards but also optimizes maintenance costs and enhances the overall operational efficiency of railway services.

The remainder of this study is organized as follows. The “Railway track detection” section explains the semantic segmentation models and the various pretrained deep-learning models applied to track detection, and compares these models on the data actually acquired. The “Experimental results” section presents the numerical performance of each model and the detection results when an obstacle is present. Finally, the “Conclusions” section summarizes the study.

Railway track detection

Semantic segmentation

Semantic segmentation, a deep-learning-based computer vision technology, is widely used in several industries and is becoming a mainstream technology in deep learning.^{17–19,25–31} Figure 1 shows the inference results of DeepLabV3+,²⁹ a representative semantic segmentation model, using the ADE20K³² dataset for training.

Figure 1.

Semantic segmentation.

As illustrated in Figure 1, semantic segmentation classifies each pixel in an image, identifies the object to which the pixel belongs, and accurately detects the location, shape, and boundaries of the objects. This segmentation information is used in autonomous driving to accurately distinguish between roads and pedestrians or in radiology to detect abnormalities in tumors or tissues in medical images such as magnetic resonance imaging or computed tomography. It has become an important tool for solving complex problems in a wide range of applications.^33–39 If this technology is used in railway systems, it can support a maintenance system that detects and monitors the condition of tracks, or an operational management system that analyzes the flow of vehicles and passengers within railway stations. Furthermore, utilizing images captured by cameras mounted on railway vehicles in combination with segmentation models allows for the rapid real-time detection of potential hazards, such as pedestrians, animals, or other obstacles near the tracks, enabling the implementation of appropriate safety measures. Additionally, recognizing traffic signals and signs through semantic segmentation can enhance the performance of the signaling system. By identifying traffic lights and signs with known precise locations, the exact position of the railway vehicle can be determined, facilitating more efficient train scheduling and reducing the risk of collisions with preceding trains.

Semantic segmentation models with frozen weights for railway detection

Recent approaches to downstream tasks fix the weights of part or all of the model. Doing so reduces the computational cost required for training, reuses the high-dimensional representations obtained from a wide range of datasets, and prevents overfitting while achieving efficient learning. Therefore, for applications with specific environments and requirements, such as railway track detection, it is important to find models pretrained on datasets from a similar domain. To make learning for a new task more efficient, reduce the amount of data required, and achieve high performance, it is crucial to compare and select among various pretraining datasets and models.

This study aimed to apply deep-learning image segmentation to railway systems. Semantic segmentation technology is used to detect railway infrastructure, such as tracks, in real time from images acquired by cameras mounted on trains, so that obstacles, track defects, and other hazards can be detected. It can also support various auxiliary tasks, such as predicting the radius of curvature of the line. We conducted a large-scale experiment to detect railroad tracks using large-scale models developed for semantic segmentation, each pretrained on one of several datasets. The results are shown in Table 1.

Table 1.

Results of segmentation system for railway detection.

Model	ADE20K³²	Cityscapes⁴⁰	COCO-Stuff 10k⁴¹	COCO-Stuff 164k⁴¹	Mapillary Vistas⁴²	Pascal Context⁴³	Pascal Context 59⁴³	Pascal VOC⁴⁴
BiSeNetV1⁴⁵	None	F	None	S	None	None	None	None
BiSeNetV2⁴⁶	None	F	None	None	None	None	None	None
DeepLabV3³³	F	F	S	S	None	None	None	F
DeepLabV3+²⁹	F	F	None	None	S	F	S	F
FastFCN⁴⁷	F	F	None	None	None	None	None	F
FCN⁴⁸	F	F	None	None	None	F	None	F
ICNet⁴⁹	None	F	None	None	None	None	None	None
PIDNet⁵⁰	None	F	None	None	None	None	None	None
PSANet⁵¹	F	F	None	None	None	None	None	F
PSPNet⁵²	F	F	S	S	None	F	S	F
SAN⁵³	None	None	None	S	None	F	None	F
UNet⁵⁴	None	F	None	None	None	None	None	None
UperNet⁵⁵	F	F	None	None	None	None	None	F

The rows of Table 1 list the deep-learning models used in the experiment, and the columns list the datasets used for pretraining. For this experiment, 338 images were directly acquired from the Osong Railway Test Line in South Korea, consisting of 211 straight-track images and 127 curved-track images. The test line was established to evaluate new railway infrastructure and technologies, with the goal of addressing theoretical issues and gathering practical operational data before deploying them on the main line. This site was chosen for its capability to provide test images under a range of driving conditions. The pretrained models provided in the MMSegmentation open-source toolbox⁵⁶ were used. In Table 1, “S” (Success) indicates that the model pretrained on that dataset can detect tracks in the test dataset, and “F” (Fail) indicates that the model failed to detect tracks. “None” indicates that no pretrained model exists for that combination. Table 2 lists the specific configuration of the system used in the experiments.

Table 2.

System configuration.

Type	Main specification
OS	Windows 10
GPU	NVIDIA GeForce RTX 3060
CPU	Intel Core i9-11900F
RAM	DDR5, 64GB
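The S/F/None outcomes of Table 1 can be tallied mechanically. The following is a minimal sketch, assuming a compact string encoding of the table (one character per dataset, with "-" standing for "None"); the encoding and the function name `successes` are illustrative choices, not part of the original study.

```python
# Dataset columns of Table 1, in the order they appear in the table.
DATASETS = [
    "ADE20K", "Cityscapes", "COCO-Stuff 10k", "COCO-Stuff 164k",
    "Mapillary Vistas", "Pascal Context", "Pascal Context 59", "Pascal VOC",
]

# One string per model, one character per dataset in the order above:
# S = success, F = fail, - = no pretrained model ("None" in Table 1).
TABLE1 = {
    "BiSeNetV1":  "-F-S----",
    "BiSeNetV2":  "-F------",
    "DeepLabV3":  "FFSS---F",
    "DeepLabV3+": "FF--SFSF",
    "FastFCN":    "FF-----F",
    "FCN":        "FF---F-F",
    "ICNet":      "-F------",
    "PIDNet":     "-F------",
    "PSANet":     "FF-----F",
    "PSPNet":     "FFSS-FSF",
    "SAN":        "---S-F-F",
    "UNet":       "-F------",
    "UperNet":    "FF-----F",
}


def successes(table, datasets):
    """Map each model with at least one 'S' to the datasets that worked."""
    result = {}
    for model, flags in table.items():
        hits = [d for d, flag in zip(datasets, flags) if flag == "S"]
        if hits:
            result[model] = hits
    return result


winners = successes(TABLE1, DATASETS)
useful_datasets = sorted({d for hits in winners.values() for d in hits})
print(list(winners))
print(useful_datasets)
```

Under this encoding, the tally yields five architectures with at least one successful pretrained model and four pretraining datasets that produced a success.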

The models that successfully find the tracks are BiSeNetV1,⁴⁵ DeepLabV3,³³ DeepLabV3+,²⁹ PSPNet,⁵² and SAN.⁵³ These models are neural networks for semantic segmentation that classify pixels into meaningful classes and use the overall contextual information of the image to achieve more accurate segmentation results. They employ modules and structures such as the Pyramid Pooling Module (PPM), Atrous Spatial Pyramid Pooling (ASPP), and bilateral segmentation networks, and have shown excellent performance on several major benchmarks. Regarding the detailed differences among the five models, BiSeNetV1⁴⁵ is a bilateral segmentation network model developed for real-time tasks. It consists of two parallel network streams: the Spatial Path generates high-resolution feature maps to preserve object boundary information, whereas the Context Path, with a deeper network, learns rich semantic information. It aims to eliminate the bottlenecks that arise from using deep networks and has a simple, efficient structure with few parameters. Figure 2 shows a simplified structure of the BiSeNet model.

Figure 2.

BiSeNet architecture.

The second model that successfully detects the tracks is DeepLabV3,³³ which uses an ASPP module to achieve a wide range of receptive fields. Using Atrous Convolution layers with different dilation rates, the model can effectively recognize objects of various sizes while reducing the computational cost. The third model, DeepLabV3+,²⁹ is an extended version of DeepLabV3 in which a decoder module is added to the ASPP module, further improving its performance. This model uses a decoder module to combine low- and high-resolution representations, generating a more refined segmentation map and improving the accuracy around object boundaries. The fourth model, PSPNet,⁵² is a pyramid-scene parsing network. PSPNet does not rely solely on local information for prediction but uses a PPM to extract multiscale information with pooling windows of various sizes, as shown in Figure 3.

Figure 3.

PSPNet architecture.

The extracted information is then combined for use in the final prediction. This allows the model to combine local and global information effectively. Lastly, SAN⁵³ is the Side Adapter Network, which reflects the new trend in downstream tasks of adding learnable modules to LVLMs such as CLIP without fine-tuning the model. The side adapter network is attached to the frozen CLIP model, where it fuses some of the CLIP features and predicts the mask proposals and attention bias used to generate the final segmentation map.
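The multiscale pooling idea behind PSPNet's PPM, described above, can be illustrated on a toy feature map. The following is a minimal sketch, assuming plain average pooling into square grids and grid sizes of 1, 2, and 4 chosen only for this example; the upsampling, convolution, and concatenation steps of the actual module are omitted, and `avg_pool_bins` is a hypothetical helper name.

```python
def avg_pool_bins(feature, bins):
    """Average-pool a 2-D map (list of rows) into a bins x bins grid.

    Assumes the height and width are divisible by `bins`.
    """
    h, w = len(feature), len(feature[0])
    bh, bw = h // bins, w // bins
    pooled = []
    for i in range(bins):
        row = []
        for j in range(bins):
            # Collect every value that falls inside grid cell (i, j).
            cell = [feature[r][c]
                    for r in range(i * bh, (i + 1) * bh)
                    for c in range(j * bw, (j + 1) * bw)]
            row.append(sum(cell) / len(cell))
        pooled.append(row)
    return pooled


feature = [
    [1, 1, 3, 3],
    [1, 1, 3, 3],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]

# Coarse grids summarize global context; fine grids keep local detail.
pyramid = {bins: avg_pool_bins(feature, bins) for bins in (1, 2, 4)}
print(pyramid[1])
print(pyramid[2])
```

The 1 x 1 level holds a single global average, while the 2 x 2 level already separates the four quadrants of the map; combining such levels is what lets the model use both local and global information.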

The large-scale models were pretrained on eight different datasets: ADE20K,³² Cityscapes,⁴⁰ COCO-Stuff 10k,⁴¹ COCO-Stuff 164k,⁴¹ Mapillary Vistas,⁴² Pascal Context,⁴³ Pascal Context 59,⁴³ and Pascal VOC.⁴⁴ These datasets contain urban scenes and include classes important to our research, such as roads and tracks. Among these, ADE20K,³² Cityscapes,⁴⁰ and Pascal Context⁴³ contain diverse urban scenes and classes such as roads and tracks, but models trained on these datasets failed to detect the tracks successfully in the experiments. However, models trained on the Pascal Context 59⁴³ dataset, which includes 59 key classes out of 459 Pascal Context classes, can successfully detect tracks. The datasets that offered models with success in track detection, namely COCO-Stuff 10k,⁴¹ COCO-Stuff 164k,⁴¹ and Mapillary Vistas,⁴² also provided a variety of indoor and outdoor urban scenes, along with classes for tracks.

The ADE20K³² dataset, which included a variety of scenes, was used as a representative dataset for segmentation in the experiments. The models trained on this dataset have a high-level understanding of indoor, outdoor, natural, and urban scenes. Cityscapes⁴⁰ is a dataset designed for the semantic understanding of urban street scenes recorded in 50 cities and includes a variety of objects such as people, vehicles, and buildings, making it the most widely used dataset for pretraining models. COCO-Stuff 10k⁴¹ and COCO-Stuff 164k⁴¹ are datasets based on the well-known COCO 2017 dataset with additional annotations for semantic segmentation. COCO-Stuff 10k contains 10,000 images with pixel-level annotations, whereas COCO-Stuff 164k contains 164,000 images. Figure 4 shows sample images from the COCO-Stuff dataset.

Figure 4.

COCO-Stuff dataset sample⁴¹.

Mapillary Vistas⁴² is a dataset containing a variety of outdoor street scenes, as shown in Figure 5. This includes scenes with diverse geographic locations, weather conditions, times of day, and road types. It also provides information on vehicles, pedestrians, signs, roads, and railway tracks.

Figure 5.

Mapillary Vistas dataset sample⁴².

Pascal Context⁴³ is an extension of the Pascal VOC dataset, consisting of more than 400 classes, and is used to evaluate large-capacity models in semantic segmentation learning. The Pascal Context 59 dataset is a subset of the Pascal Context, which selects the 59 most frequent classes. Although it is limited to 59 classes and can be used for specific tasks, it mitigates learning difficulties owing to the sparse class distribution of the original Pascal Context. Pascal VOC⁴⁴ is a dataset provided by the PASCAL Visual Object Classes Challenge that includes various indoor and outdoor scenes, natural environments, and objects such as people.

Experiments on railway detection using pretrained models showed that the models trained on the COCO-Stuff 10k, COCO-Stuff 164k, Mapillary Vistas, and Pascal Context 59 datasets could detect railway tracks. However, the model trained on the original Pascal Context dataset, which has relatively few images for its large number of classes, could not detect the railway tracks. The model trained on the Pascal Context 59 dataset, which has fewer classes but provides more information on frequent objects, can detect railway tracks. Additionally, the models trained on COCO-Stuff 10k, COCO-Stuff 164k, and Mapillary Vistas, which provide a variety of street-level scenes and annotations, can detect railway tracks. Although the models pretrained on these four datasets successfully segmented the tracks, research on methods that enhance a model's generalization capability, such as transfer learning, data augmentation, and domain adaptation, remains a highly important area of study. These techniques can improve the performance of deep-learning models across various environments, including railway systems, and will facilitate their effective application in diverse fields. Table 3 presents some of the segmentation results of the deep-learning models that detect railway tracks successfully.

Table 3.

Segmentation result images.

The inference models used were further divided into various submodels based on factors such as the type and depth of the model backbone, the number of GPUs used for training, and the amount of training performed. For each model, the results of the submodel with the best qualitative performance are presented.

Experimental results

Numerical experiment results

To evaluate the numerical performance of each model, the railway tracks in the test videos were analyzed at the pixel level, and the precision, recall, and F1 score (harmonic mean of Precision and Recall) were calculated using equations (1), (2), and (3), respectively:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]

\[ \mathrm{F1\ score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3} \]

For performance evaluation, a true positive (TP) is the case where the deep-learning model correctly predicts a pixel as a track pixel. A false positive (FP) is a case where the model predicts a pixel as a track, but it is not actually a track. A false negative (FN) is the case where the model predicts a pixel as nontrack, but it is actually a track pixel. The experiments were conducted using 338 images acquired from the Osong Railway Test Line, and the ground truth was manually created at the pixel level, as shown in Figure 6.
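The pixel-level definitions of TP, FP, and FN and equations (1) to (3) can be written out directly. The following is a minimal sketch, assuming binary masks stored as nested lists (1 for track, 0 for nontrack) and a return value of 0.0 when a denominator is empty; both choices, and the name `pixel_metrics`, are assumptions made for illustration, since the evaluation code itself is not described in the text.

```python
def pixel_metrics(pred, truth):
    """Return (precision, recall, f1) for two binary masks of equal shape.

    pred and truth are lists of rows; 1 marks a track pixel, 0 a nontrack pixel.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1  # predicted track, actually track
            elif p == 1 and t == 0:
                fp += 1  # predicted track, actually nontrack
            elif p == 0 and t == 1:
                fn += 1  # predicted nontrack, actually track
    # Guard against empty denominators (an assumption of this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy 3 x 4 masks: 5 true positives, 1 false positive, 1 false negative.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
precision, recall, f1 = pixel_metrics(pred, truth)
print(precision, recall, f1)
```

In this toy case precision and recall are both 5/6, so the F1 score, being their harmonic mean, is also 5/6. Whether the reported scores are averaged per image or pooled over all pixels is not stated in the text, so the sketch handles a single mask pair only.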

We employed five pretrained semantic segmentation models containing 32 different variants for the numerical experiments. We then compared the inferred tracks from each model with the ground truth and calculated the precision, recall, and F1 scores, as shown in Table 4.

Figure 6.

Railroad test images and ground truth.

Table 4.

Experiments for each model.

Config	Backbone	Pre-trained dataset	Epoch	FLOPs	Params	Precision	Recall	F1 score
BiSeNetV1⁴⁵	R18-D32	COCO-Stuff 164k	160k	0.015T	13.315M	0.6547	0.7206	0.6861
	R18-D32-Pre	COCO-Stuff 164k	160k	0.015T	13.315M	0.5941	0.6302	0.6116
	R50-D32	COCO-Stuff 164k	160k	0.099T	57.029M	0.6881	0.6315	0.6586
	R50-D32-Pre	COCO-Stuff 164k	160k	0.099T	57.029M	0.5681	0.6587	0.6100
	R101-D32	COCO-Stuff 164k	160k	0.119T	76.021M	0.6893	0.8135	0.7462
DeepLabV3³³	R50-D8	COCO-Stuff 10k	20k	0.270T	65.826M	0.6434	0.6956	0.6685
	R50-D8	COCO-Stuff 10k	40k	0.270T	65.826M	0.7147	0.6766	0.6952
	R50-D8	COCO-Stuff 164k	80k	0.270T	65.826M	0.6669	0.7376	0.7005
	R50-D8	COCO-Stuff 164k	160k	0.270T	65.826M	0.6474	0.6996	0.6725
	R50-D8	COCO-Stuff 164k	320k	0.270T	65.826M	0.5959	0.7753	0.6739
	R101-D8	COCO-Stuff 10k	20k	0.347T	84.818M	0.5359	0.7191	0.6141
	R101-D8	COCO-Stuff 10k	40k	0.347T	84.818M	0.6568	0.8183	0.7287
	R101-D8	COCO-Stuff 164k	80k	0.347T	84.818M	0.6155	0.6680	0.6407
	R101-D8	COCO-Stuff 164k	160k	0.347T	84.818M	0.6795	0.7267	0.7023
	R101-D8	COCO-Stuff 164k	320k	0.347T	84.818M	0.6325	0.7053	0.6669
DeepLabV3+²⁹	R50-D8	Mapillary Vistas	300k	0.706T	41.249M	0.8099	0.6204	0.7026
	R101-D8	Pascal Context 59	40k	0.893T	60.238M	0.6744	0.7711	0.7195
	R101-D8	Pascal Context 59	80k	0.893T	60.238M	0.6363	0.8142	0.7143
PSPNet⁵²	R50-D8	COCO-Stuff 10k	20k	0.179T	46.689M	0.6602	0.5151	0.5787
	R50-D8	COCO-Stuff 10k	40k	0.179T	46.689M	0.7221	0.7353	0.7286
	R50-D8	COCO-Stuff 164k	80k	0.179T	46.689M	0.5797	0.6785	0.6253
	R50-D8	COCO-Stuff 164k	160k	0.179T	46.689M	0.7026	0.7602	0.7302
	R50-D8	COCO-Stuff 164k	320k	0.179T	46.689M	0.6642	0.7456	0.7025
	R101-D8	COCO-Stuff 10k	20k	0.256T	65.681M	0.6103	0.5117	0.5566
	R101-D8	COCO-Stuff 10k	40k	0.256T	65.681M	0.4777	0.4494	0.4631
	R101-D8	COCO-Stuff 164k	80k	0.256T	65.681M	0.6820	0.8369	0.7515
	R101-D8	COCO-Stuff 164k	160k	0.256T	65.681M	0.7412	0.8425	0.7886
	R101-D8	COCO-Stuff 164k	320k	0.256T	65.681M	0.5617	0.8411	0.6736
	R101-D8	Pascal Context 59	40k	0.256T	65.623M	0.7218	0.7022	0.7119
	R101-D8	Pascal Context 59	80k	0.256T	65.623M	0.7885	0.7190	0.7521
SAN⁵³	CLIP_ViT-B16	COCO-Stuff 164k	60k	7.210T	8.370M	0.5475	0.5641	0.5557
SAN⁵³	CLIP_ViT-L14	COCO-Stuff 164k	60k	16.332T	9.321M	0.8662	0.8399	0.8529

In Table 4, the pretrained models differ in model structure, backbone, number of training iterations, and pretraining dataset. R18 uses the ResNet18 backbone, whereas R50 and R101 use ResNet50 and ResNet101, respectively. Pre indicates that the backbone model used has been pretrained on the ImageNet-1000 dataset. ViT-B16 refers to the ViT Base model, and ViT-L14 is the ViT Large model with more parameters. The “D” notation indicates the degree of downsampling, where D8 means the output of the backbone is downsampled by a factor of 8. The Epoch column gives the number of training iterations. The models were trained using the COCO-Stuff 10k, COCO-Stuff 164k, Mapillary Vistas, and Pascal Context 59 datasets. To evaluate the model performance, precision, recall, and F1 scores are reported in Table 4, and the FLOPs and number of parameters are included to compare the model complexity.
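The backbone labels explained above follow a regular pattern, so they can be split into their parts programmatically. The following is a minimal sketch, assuming only the label forms that appear in Table 4; the function name `parse_backbone` and the dictionary keys are hypothetical, introduced for this example.

```python
import re

# ResNet-style labels such as "R101-D8" or "R18-D32-Pre".
_RESNET = re.compile(r"^R(?P<depth>\d+)-D(?P<down>\d+)(?P<pre>-Pre)?$")


def parse_backbone(label):
    """Split a Table 4 backbone label into its parts."""
    match = _RESNET.match(label)
    if match:
        return {
            "family": "ResNet",
            "depth": int(match.group("depth")),      # R18, R50, R101
            "downsample": int(match.group("down")),  # D8 means a factor of 8
            "has_pre_suffix": match.group("pre") is not None,
        }
    if label.startswith("CLIP_ViT-"):
        # e.g. "B16" (ViT Base) or "L14" (ViT Large)
        return {"family": "CLIP ViT", "variant": label[len("CLIP_ViT-"):]}
    raise ValueError(f"unrecognized backbone label: {label!r}")


print(parse_backbone("R101-D8"))
print(parse_backbone("R18-D32-Pre"))
print(parse_backbone("CLIP_ViT-L14"))
```

Raising an error on unknown labels, rather than guessing, keeps the sketch from silently misreading a label form that Table 4 does not contain.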

Within each model structure, the best-performing model used either the ResNet101 backbone or the CLIP ViT-L14 backbone. These are the largest and most complex backbones in their families; however, they can learn better representations for object discrimination, resulting in the highest segmentation performance. The model with the highest performance was the SAN model with the CLIP ViT-L14 backbone, which achieved an F1 score of 0.85. This model has the highest FLOPs but among the fewest learnable parameters. This is because only a small learnable module, attached in parallel to a large pretrained CLIP model, is trained, leading to high FLOPs but a small number of learnable parameters. The second-best-performing model was PSPNet, which achieved an F1 score of 0.79. The model with the lowest FLOPs (0.015 T in Table 4) and a below-average number of parameters is BiSeNetV1 with a ResNet18 backbone. There is a trade-off between model performance and complexity. The model that achieves the best trade-off is PSPNet with a ResNet101 backbone, whereas the CLIP ViT-L14 backbone-based SAN model outperforms the other models both qualitatively and quantitatively. Figure 7 presents a comparison of the FLOPs, parameters, and F1 scores for the different models.

Figure 7.

Experimental results.
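One way to make the trade-off between accuracy and computational cost concrete is to look for models that no other model beats on both FLOPs and F1 score at once (a Pareto front). The following is a minimal sketch using the highest-F1 row of each architecture in Table 4; the Pareto criterion and the name `pareto_front` are assumptions of this illustration, not the selection procedure used in the study.

```python
# (model, FLOPs in T, F1 score): the highest-F1 row per architecture in Table 4.
BEST = [
    ("BiSeNetV1 R101-D32", 0.119, 0.7462),
    ("DeepLabV3 R101-D8", 0.347, 0.7287),
    ("DeepLabV3+ R101-D8", 0.893, 0.7195),
    ("PSPNet R101-D8", 0.256, 0.7886),
    ("SAN CLIP_ViT-L14", 16.332, 0.8529),
]


def pareto_front(rows):
    """Keep the rows that no other row dominates.

    A row is dominated when another row has FLOPs that are no higher and an
    F1 score that is no lower, with at least one of the two strictly better.
    """
    front = []
    for name, flops, f1 in rows:
        dominated = any(
            o_flops <= flops and o_f1 >= f1 and (o_flops < flops or o_f1 > f1)
            for _, o_flops, o_f1 in rows
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(BEST))
```

On these five rows the front consists of BiSeNetV1, PSPNet, and SAN, which is consistent with PSPNet being a reasonable middle ground between the cheapest and the most accurate model.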

Results in the presence of obstacles

If an obstacle is present on the track, the track detection system is likely to be affected. To analyze the impact of obstacle size on detection performance, we introduced virtual obstacles and evaluated the performance of five pretrained semantic segmentation models. As shown in Figure 8, virtual obstacles with pixel dimensions of 32 × 32, 85 × 69, and 191 × 156 were placed in the images, and experiments were conducted using the models with the highest F1 scores.

Figure 8.

Obstacles on the tracks.
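Placing a virtual obstacle amounts to overwriting a rectangular region of the image. The following is a minimal sketch, assuming a grayscale image stored as nested lists, a flat fill value, a hypothetical 640 x 480 frame, and a width-by-height reading of the obstacle sizes given in the text; none of these details are specified in the study, and `place_obstacle` is a hypothetical helper.

```python
def place_obstacle(image, top, left, height, width, value=0):
    """Return a copy of a 2-D image (list of rows) with a filled rectangle.

    The rectangle is clipped to the image bounds. `value` is the fill
    intensity; a flat fill is an assumption made for this sketch.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    out = [list(row) for row in image]  # copy so the input stays untouched
    for r in range(max(top, 0), min(top + height, rows)):
        for c in range(max(left, 0), min(left + width, cols)):
            out[r][c] = value
    return out


# Obstacle sizes from the text, read here as (width, height) in pixels;
# the text does not say which dimension comes first, so this is assumed.
SIZES = {"small": (32, 32), "medium": (85, 69), "large": (191, 156)}

image = [[255] * 640 for _ in range(480)]  # hypothetical 640 x 480 frame
width, height = SIZES["medium"]
occluded = place_obstacle(image, top=300, left=280, height=height, width=width)
covered = sum(row.count(0) for row in occluded)
print(covered)
```

Each occluded image can then be passed through the frozen segmentation model and scored against the same ground truth as the unoccluded image.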

Table 5 presents the F1 scores for each obstacle size, as well as the performance differences compared to the case without obstacles. DeepLabV3 was minimally affected by small- and medium-sized obstacles but experienced a significant performance drop with large obstacles. In contrast, SAN retained the highest F1 score at every obstacle size and showed the smallest drop for large obstacles (0.2820). Table 6 presents the actual track detection results based on obstacle size.

Table 5.

Experimental results according to obstacle size.

Config	F1 score (no obstacle)	Diff	F1 score (small obstacle)	Diff	F1 score (medium obstacle)	Diff	F1 score (large obstacle)	Diff
BiSeNetV1⁴⁵	0.7462	-	0.7207	0.0255	0.7107	0.0355	0.1215	0.6247
DeepLabV3³³	0.7287	-	0.7033	0.0254	0.7108	0.0179	0.2364	0.4923
DeepLabV3+²⁹	0.7195	-	0.5412	0.1783	0.3881	0.3314	0.0026	0.7169
PSPNet⁵²	0.7886	-	0.7119	0.0767	0.5014	0.2872	0.2112	0.5774
SAN⁵³	0.8529	-	0.8225	0.0304	0.7681	0.0848	0.5709	0.2820
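The Diff columns of Table 5 are the no-obstacle F1 score minus the F1 score with an obstacle. The following is a minimal sketch that recomputes them from the tabulated scores; the function name `drops` and the dictionary layout are illustrative choices.

```python
# F1 scores from Table 5: (no obstacle, small, medium, large).
F1 = {
    "BiSeNetV1": (0.7462, 0.7207, 0.7107, 0.1215),
    "DeepLabV3": (0.7287, 0.7033, 0.7108, 0.2364),
    "DeepLabV3+": (0.7195, 0.5412, 0.3881, 0.0026),
    "PSPNet": (0.7886, 0.7119, 0.5014, 0.2112),
    "SAN": (0.8529, 0.8225, 0.7681, 0.5709),
}


def drops(scores):
    """Absolute F1 drop relative to the no-obstacle baseline, per size."""
    baseline, *with_obstacle = scores
    return [round(baseline - s, 4) for s in with_obstacle]


diffs = {model: drops(scores) for model, scores in F1.items()}

# Model with the smallest drop for the large obstacle (index 2).
most_robust_large = min(diffs, key=lambda m: diffs[m][2])
print(diffs["SAN"], most_robust_large)
```

Recomputing the differences this way reproduces the Diff values listed in the table and makes it easy to rank the models by robustness at each obstacle size.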

Table 6.

Segmentation results by obstacle size.

Conclusions

In this study, we investigated recent trends in deep-learning research and examined the applicability and utility of models pretrained on large-scale datasets within specific domains. These models demonstrate the capability to extract high-level features from diverse patterns and offer the advantage of retaining the latent characteristics of existing datasets while enhancing adaptability to new domains through weight freezing. As a case study, we applied pretrained semantic segmentation models to detect railway tracks in a railway system. We conducted both qualitative and quantitative comparisons of various pretrained models to identify the most suitable one for track detection and evaluated the practical applicability of these state-of-the-art deep-learning technologies.

Furthermore, by facilitating real-time monitoring of track conditions, this system enables the early detection of obstacles, cracks, and wear, thereby enhancing the operational safety of railway systems. Maintenance teams can respond in a timely manner to identified issues, reducing the likelihood of service disruptions and accidents. The system also supports proactive maintenance by collecting long-term data on track conditions, allowing operators to predict maintenance needs and prevent high-cost repairs.

In conclusion, this study demonstrates how pretrained deep-learning models can be effectively applied to specific downstream tasks, such as railway track detection. As deep-learning technology continues to advance, the adoption of such models is expected to expand across various industries. This research establishes a foundation for future applications and highlights the potential for improving both operational efficiency and safety in the railway sector.

Footnotes

Author contributions

All authors contributed to study conceptualization, statistical analysis, and writing of the study. Seungmin Lee participated in acquisition of data and designed the research. Beomseong Kim and Heesung Lee participated in the analysis and interpretation of data and funding acquisition.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIT) (No. RS-2022-00166346).

ORCID iDs

Beomseong Kim

Heesung Lee

References

Nai

Wen

, et al. Revisiting disentanglement in downstream tasks: a study on its necessity for abstract visual reasoning. In: Proceedings of the AAAI conference on artificial intelligence, Vancouver, Canada, 2024, pp. 14405–14413.

Lester

Al-Rfou

Constant

The power of scale for parameter-efficient prompt tuning. arXiv:2104.08691, 2021.

Sankaranarayanan

Balaji

Jain

, et al. Learning from synthetic data: addressing domain shift for semantic segmentation. In: Proc eedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 2018, pp. 3752–3761.

Devlin

Chang

Lee

, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.

Krizhevsky

Sutskever

Hinton

. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012; 25: 1106–1114.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 2016, pp. 770–778.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 5998–6008.

Yosinski

Clune

Bengio

, et al.

How transferable are features in deep neural networks?

Adv Neural Inf Process Syst 2014; 27: 3320–3328.

Islam

Rochan

Naha

, et al. Gated feedback refinement network for coarse-to-fine dense semantic image labeling. arXiv:1806.11266, 2018.

10.

Ramachandran

Liu

QV.

Unsupervised pretraining for sequence to sequence learning. arXiv:1611.02683, 2016.

11.

Rothe

Narayan

Severyn

. Leveraging pre-trained checkpoints for sequence generation tasks. Trans Assoc Comput Linguist 2020; 8: 264–280.

12.

Press

Smith

Levy

Improving transformer models by reordering their sublayers. arXiv:1911.03864, 2019.

13.

Liu

, et al. GLIGEN: open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp, 22511–22521.

14.

Zhu

Zhao

, et al. CORA: adapting CLIP for open-vocabulary detection with region prompting and anchor pre-matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 7031–7040.

15. Houlsby, Giurgiu, Jastrzebski, et al. Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th international conference on machine learning, Long Beach, California, USA, 2019;97:2790–2799.

16. Li, Liang. Prefix-tuning: optimizing continuous prompts for generation. arXiv:2101.00190, 2021.

17. Cha, Mun, Roh. Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 11165–11174.

18. Liang, Wu, Dai, et al. Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 7061–7070.

19. Zhou, Lei, Zhang, et al. ZegCLIP: towards adapting CLIP for zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 11175–11185.

20. Zhou, Yang, Loy, et al. Learning to prompt for vision-language models. Int J Comput Vis 2022; 130: 2337–2348.

21. Zhang, Fang, Zhang, et al. Tip-adapter: training-free CLIP-adapter for better vision-language modeling. arXiv:2111.03930, 2021.

22. Sung, Cho, Bansal. VL-adapter: parameter-efficient transfer learning for vision-and-language tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, LA, USA, 2022, pp. 5227–5237.

23. Zhang, Sax, Zamir, et al. Side-tuning: a baseline for network adaptation via additive side networks. In: Proceedings of the computer vision–ECCV 2020: 16th European conference, Glasgow, UK, 2020;16:698–714.

24. Sung, Cho, Bansal. LST: ladder side-tuning for parameter and memory efficient transfer learning. Adv Neural Inf Process Syst 2022; 35: 12991–13005.

25. Wang, Sun, Cheng, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell 2020; 43: 3349–3364.

26. Chen, Papandreou, Kokkinos, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 2017; 40: 834–848.

27. Xie, Wang, Yu, et al. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst 2021; 34: 12077–12090.

28. He, Gkioxari, Dollar, et al. Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 2017, pp. 2961–2969.

29. Chen, Zhu, Papandreou, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision, Munich, Germany, 2018, pp. 801–818.

30. Liu, Tian, Wang, et al. Delving into shape-aware zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 2999–3009.

31. Huang, Chen, Li, et al. Style projected clustering for domain generalized semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 3061–3071.

32. Zhou, Zhao, Puig, et al. Semantic understanding of scenes through the ADE20K dataset. Int J Comput Vis 2019; 127: 302–321.

33. Chen, Papandreou, Schroff, et al. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.

34. Zhou, Siddiquee MMR, Tajbakhsh, et al. UNet++: a nested U-Net architecture for medical image segmentation. In: Proceedings of the deep learning in medical image analysis and multimodal learning for clinical decision support, Granada, Spain, 2018, pp. 3–11.

35. Kirillov, Girshick, He, et al. Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 2019, pp. 6399–6408.

36. Wang, Zhu, Green, et al. Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. In: Proceedings of the European conference on computer vision, Glasgow, UK, 2020, pp. 108–126.

37. Myronenko. 3D MRI brain tumor segmentation using autoencoder regularization. In: Proceedings of the Brainlesion: glioma, multiple sclerosis, stroke and traumatic brain injuries, Shenzhen, China, 2019, pp. 311–320.

38. Fitzgerald, Matuszewski. FCB-SwinV2 transformer for polyp segmentation. arXiv:2302.01027, 2023.

39. Fitzgerald, Bernal, Histace, et al. Polyp segmentation with the FCB-SwinV2 transformer. IEEE Access 2024; 12: 38927–38943.

40. Cordts, Omran, Ramos, et al. The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 2016, pp. 3213–3223.

41. Caesar, Uijlings, Ferrari. COCO-Stuff: thing and stuff classes in context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 2018, pp. 1209–1218.

42. Neuhold, Ollmann, Rota Bulo, et al. The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 2017, pp. 4990–4999.

43. Mottaghi, Chen, Liu, et al. The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Columbus, OH, USA, 2014, pp. 891–898.

44. Everingham, Van Gool, Williams CKI, et al. The PASCAL visual object classes (VOC) challenge. Int J Comput Vis 2010; 88: 303–338.

45. Yu, Wang, Peng, et al. BiSeNet: bilateral segmentation network for real-time semantic segmentation. In: Proceedings of the European conference on computer vision, Munich, Germany, 2018, pp. 325–341.

46. Yu, Gao, Wang, et al. BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. Int J Comput Vis 2021; 129: 3051–3068.

47. Wu, Zhang, Huang, et al. FastFCN: rethinking dilated convolution in the backbone for semantic segmentation. arXiv:1903.11816, 2019.

48. Long, Shelhamer, Darrell. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 2015, pp. 3431–3440.

49. Zhao, Qi, Shen, et al. ICNet for real-time semantic segmentation on high-resolution images. In: Proceedings of the European conference on computer vision, Munich, Germany, 2018, pp. 405–420.

50. Xu, Xiong, Bhattacharyya SP. PIDNet: a real-time semantic segmentation network inspired by PID controllers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 19529–19539.

51. Zhao, Zhang, Liu, et al. PSANet: point-wise spatial attention network for scene parsing. In: Proceedings of the European conference on computer vision, Munich, Germany, 2018, pp. 267–283.

52. Zhao, Shi, Qi, et al. Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 2017, pp. 2881–2890.

53. Xu, Zhang, Wei, et al. Side adapter network for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 2023, pp. 2945–2954.

54. Ronneberger, Fischer, Brox. U-Net: convolutional networks for biomedical image segmentation. In: Proceedings of the medical image computing and computer-assisted intervention–MICCAI 2015, Munich, Germany, 2015, pp. 234–241.

55. Xiao, Liu, Zhou, et al. Unified perceptual parsing for scene understanding. In: Proceedings of the European conference on computer vision, Munich, Germany, 2018, pp. 418–434.

56. MMSegmentation Contributors. OpenMMLab semantic segmentation toolbox and benchmark, https://github.com/open-mmlab/mmsegmentation (accessed 11 November 2024).
