Abstract
Pavement cracks are the most common type of distress in transportation infrastructure, and accurately capturing long-range contextual information is crucial for the automatic assessment of road conditions. Although convolutional neural network (CNN)-based models perform robustly, they are significantly limited in their ability to capture the fine features that comprehensive crack detection requires. To address these limitations, this study introduces an architecture that combines CNN and Transformer modules in parallel branches, improving semantic segmentation by strengthening both feature extraction and the capture of long-range dependencies. The CNN branch captures pixel-level contextual representations and incorporates an additional head that generates boundary heatmaps, which facilitates enhanced regional interaction. In parallel, the Transformer branch employs multi-head self-attention and multilayer perceptron (MLP) modules to capture long-range contextual representations. A contextual attention module integrates boundary features with the normalized feature set, accentuating boundary regions and guiding the model to accurately delineate overlapping objects. Comprehensive experiments show that the proposed network outperforms state-of-the-art methods on public data sets, achieving an F1 score of 76.36%. The proposed model significantly reduces false detections in scenarios involving long and thin cracks while preserving its overall crack detection capability.
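For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score can be computed for a binary crack segmentation mask. It assumes a strict pixel-wise comparison, F1 = 2TP / (2TP + FP + FN); the abstract does not state the evaluation protocol, so any pixel tolerance or per-data-set averaging used in the study is not reflected here. The function name `pixel_f1` and the 4x4 masks are hypothetical and serve only as an illustration.

```python
def pixel_f1(pred, target):
    """Pixel-wise F1 for two binary masks given as nested lists of 0/1.

    Assumption: strict pixel matching with no distance tolerance.
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1          # crack pixel correctly detected
            elif p and not t:
                fp += 1          # false detection
            elif t and not p:
                fn += 1          # missed crack pixel
    denom = 2 * tp + fp + fn
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2 * tp / denom


# Hypothetical example: a thin vertical crack, with one missed pixel
# and one false detection in the prediction.
target = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
pred = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
score = pixel_f1(pred, target)  # TP = 3, FP = 1, FN = 1
```

Because a long, thin crack occupies few pixels, each false detection or missed pixel shifts the score noticeably, which is one reason the metric is sensitive to the failure case the abstract highlights.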
