
Abstract

Monitoring structural integrity through accurate crack detection is fundamental to ensuring the safety and longevity of civil engineering infrastructure. Vision-based methods, supported by advancements in deep learning, have gained prominence in structural health monitoring (SHM). However, these methods often suffer from limited performance due to insufficient diversity and scale in crack image datasets, which are costly and challenging to acquire. This study introduces a novel framework that integrates a text-to-image generative model with large language models to synthesize realistic crack images for training deep neural networks. A prompt engineering approach is utilized to generate high-quality textual descriptions, guiding the creation of a large-scale, diverse dataset that simulates a wide range of crack scenarios. The synthesized dataset significantly enhances model training, as demonstrated in two key SHM tasks: crack classification and crack object detection. Neural networks trained with the augmented dataset show up to a 60% improvement in precision over baseline models trained on real-world data alone. These results highlight the potential of generative models to address data scarcity in SHM, enabling more robust and accurate crack detection. This research provides a scalable and efficient solution for improving machine learning-based SHM applications and paves the way for further exploration of generative methods in structural monitoring tasks.

Keywords

Crack detection, structural health monitoring, synthesized image generation, ChatGPT, computer vision, deep learning

Introduction

Regular assessments are essential to maintain the safety of civil structures and safeguard these critical assets. A key aspect of these evaluations is the detection of cracks, which are often an early indicator of potential structural issues.^1,2 Traditional visual inspection methods for detecting cracks are ineffective and time-consuming, require significant manpower, and are prone to human error and subjective conclusions. These methods are also costly and often struggle with accessibility issues, especially in large-scale, long-span and high-rise structures.

Advancements in computer vision (CV) technologies have led to a shift towards using image-based methods for crack detection in structural health monitoring (SHM). These methods, which capture digital images of structures and analyse them with software to detect cracks, provide a faster, more cost-effective, and robust alternative to traditional inspection methods. Vision-based crack detection methods primarily fall into two categories: image processing and machine learning (ML). Image processing techniques employ hand crafted filters,³ morphological operations⁴ and percolation theories,⁵ among other techniques, to detect cracks without the necessity for training a model. Over the past decade, ML has emerged as a dominant technique in CV-based crack detection,⁶ known for its exceptional performance across various tasks and gaining popularity in numerous fields.^7–10 When provided with sufficient data, ML algorithms can autonomously identify complex features and hidden patterns, making them highly effective for precision-critical tasks, such as crack detection.

The task of crack detection within ML can be divided into three primary functions: crack image classification, crack object detection and crack segmentation. The goal of crack image classification^11–13 is to ascertain whether an image or image patch contains cracks. A notable study by Gopalakrishnan et al.¹³ utilized a Visual Geometry Group (VGG) model¹⁴ to detect cracks on hot-mix asphalt and Portland cement concrete surfaces, achieving successful classification of pavement crack images. Different from crack image classification, crack object detection tasks^15–19 involve generating bounding boxes around regions identified as containing cracks. Currently, prominent object detection models like Fast Region-based Convolutional Network²⁰ and Single Shot MultiBox Detector²¹ are employed for this task. Crack segmentation^22–24 provides the most detailed, pixel-level detection of images, classifying each pixel as either part of a crack or not. Among these methods, convolutional neural networks (CNNs) are popular for crack segmentation. Schmugge et al.²⁵ developed a CNN-based technique to segment cracks in nuclear power plant inspections by aggregating pixel-level classification confidence across video frames under different lighting conditions. Islam and Kim²⁶ developed a crack segmentation system using a fully convolutional network (FCN) with a VGG backbone. Trained on an open-source concrete crack dataset, the optimized FCN achieved around 92% in both accuracy and F1 scores on test data.

Despite the rapid advancements in ML methods for crack detection, the scale and diversity of training datasets remain limited. In the realm of ML, contemporary research increasingly supports that diverse large-scale datasets significantly enhance the robustness and effectiveness of Deep Neural Networks (DNNs). For instance, the popular deep learning model Segment Anything²⁷ used extensive datasets to enhance its segmentation capabilities, showing that with ample data, DNNs can achieve high-precision segmentation in various scenarios and adapt effectively to new environments. In the field of ML-based crack detection, there is a keen interest in gathering more crack images to train ML models. This demand stems from the need to train models that can accurately reflect the wide variability in crack appearances, which depend heavily on factors such as the material of structures, the types of damage they have sustained and environmental influences. However, gathering diverse crack images is challenging due to restricted access, high costs and environmental constraints. Restricted access means that safety regulations or property rights often limit entry to damaged structures. High costs arise from the need for specialized equipment and trained personnel. Environmental constraints include challenges such as lighting conditions and seasonal variations, which can affect the visibility and detectability of cracks. These challenges underscore the difficulties in collecting real crack images and the urgent need to develop methods for acquiring large-scale, diverse datasets. Consequently, enhancing existing datasets with synthesized data has become a promising area of research.

Many studies have explored the creation of synthesized datasets for training neural networks. Among these methods, physics-based approaches^28,29 employ intricate mathematical models to simulate the physical properties and behaviours of real-world structures. This often involves creating three-dimensional models of structures and then applying simulated forces (e.g., wind, earthquakes) to assess how these structures might react. Conversely, data-driven approaches, particularly generative adversarial networks (GANs),³⁰ have been employed to create visually realistic image datasets for SHM tasks.^31,32 For example, GANs have been utilized to generate synthesized data for wind turbine fault detection³³ and for railway crack detection.³⁴ Both physics-based methods and GANs present innovative approaches to synthesized dataset generation, yet each exhibits distinct limitations. Physics-based models produce data with high physical realism but require extensive domain knowledge and significant computational resources, which can limit their scalability and practical application in large-scale dataset generation. GANs have demonstrated success in producing realistic visual patterns; however, ensuring consistent performance across highly diverse structural scenarios can be challenging, and careful training is often needed to maintain image fidelity and diversity.

In response to these challenges, this article proposes a framework for synthesized crack image dataset generation that leverages a diffusion model in combination with a large language model (LLM). Diffusion models have emerged as a powerful tool for generating high-quality and diverse images from textual prompts. Compared to GANs, diffusion models offer improved controllability and variation in output, as they can be guided by richly descriptive prompts to generate images that align with specific structural contexts. Furthermore, modern diffusion models are typically large-scale and pretrained on vast and diverse datasets. This broad training foundation equips them with strong generalization capabilities, enabling them to synthesize realistic images even for unseen or complex scenarios. These advantages make diffusion models particularly suitable for applications such as SHM, where the ability to generate visually and contextually accurate damage scenarios is critical for training robust deep learning models. In addition, the integration of an LLM accelerates the creation of varied and precise prompts, substantially increasing the diversity of the generated images. This approach outperforms traditional physics-based methods, producing 1K high-resolution, realistic images in just a few hours with a single GPU, an achievement unattainable with conventional methods. The efficacy of the synthesized data is evaluated through two experiments, crack image classification and crack object detection, using selected public datasets as benchmarks. When the synthesized images are added to the training set, the same DNN models become significantly more robust and achieve much better evaluation results. Crack image classification precision increases from 33.6 to 93.96%, and crack object detection precision increases from 24.25 to 83.82%.

Figure 1 illustrates a comparison between crack images from public crack datasets and synthesized crack images generated using the proposed framework. The synthesized images demonstrate a high level of realism, capturing various crack types, textures and environmental conditions that closely mimic those found in real-world scenarios.

Figure 1.

Comparison of crack images from public datasets (right) and synthesized crack images generated using the proposed framework (left).

The organization of this article is as follows: the second section elaborates the methodology of the data generation framework, detailing the integration of the diffusion model and an LLM for crafting synthesized images. The third section presents the methodologies employed and the results obtained for crack image classification and crack object detection. The fourth section provides a conclusion, summarizing the principal findings and suggesting directions for future research.

Synthesized crack image generation

Figure 2 presents a schematic overview of the synthesized crack image generation framework, employing a large text-to-image rectified flow model (RFM)³⁵ that has been extensively trained on a vast dataset of billions of diverse text-image pairs, encompassing a wide array of scenes from urban environments to structural materials. Guided by textual prompts, the model can reconstruct images from noisy images to generate high-quality synthesized visualizations. To enrich the text prompts, the LLM Generative Pre-trained Transformer (GPT)-4³⁶ is employed to create varied and detailed text prompts. A crucial step involves a prompt engineering process that preprocesses raw text from GPT-4, selecting prompts that are preferred by the diffusion model for optimal results. These selected prompts are then processed by a text encoder, which converts the descriptions into a feature space representation that the diffusion model can utilize for image generation.

Figure 2.

Overview of the proposed synthesized crack image generation framework.

Text-to-image generation with RFM model

The diffusion process consists of two parts: forward diffusion, which adds noise to the image, and reverse diffusion, which removes it. Let $x_{0}$ denote the noise-free image and $x_{T}$ the fully noised sample. In the forward diffusion process, noise is incrementally added to the image over a sequence of $T$ steps until the image at step $T$ , denoted as $x_{T}$ , becomes predominantly noise. A single forward diffusion step is expressed as:

q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)

(1)

At each time step $t$ , the forward process maps the previous state variable $x_{t - 1}$ to the next state variable $x_{t}$ through a Gaussian transition. The conditional distribution $q (x_{t} | x_{t - 1})$ represents the Gaussian distribution of $x_{t}$ given the previous state variable $x_{t - 1}$ . It has a mean of $\sqrt{1 - β_{t}} x_{t - 1}$ and a covariance of $β_{t} I$ . Here, $β_{t}$ represents the variance parameter at step $t$ , and $I$ is the identity matrix, indicating that noise is added independently to each component of the image. The time variable $t$ ranges from 0 to $T$ , where $x_{0}$ is the initial, noise-free image and $x_{T}$ represents the image fully transformed by noise. The problem of forward diffusion is thus framed as determining the distribution of noise added at each step. This involves solving for the parameters of the Gaussian distributions that model the transition from $x_{t - 1}$ to $x_{t}$ over the sequence from 0 to $T$ .

The noisy sample produced by this process is denoted as $z$ , and it is generally assumed to follow a normal distribution, $z ~ N (μ, σ^{2})$ . Such a sample can be expressed in the following reparameterized form:

z = μ + σ ε

(2)

where $ε ~ N (0, 1)$ . Similarly, the sample image at time $t$ , denoted as $x_{t}$ , is rewritten as:

x_{t} = \sqrt{1 - β_{t}} x_{t - 1} + \sqrt{β_{t}} ε_{t - 1}

(3)

Assume $α_{t} = 1 - β_{t}$ , $\bar{α_{t}} = Π_{i = 1}^{t} α_{i}$ , then Equation (3) can be recursively organized as follows:

x_{t} = \sqrt{\bar{α_{t}}} x_{0} + \sqrt{1 - \bar{α_{t}}} ε

(4)
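The recursion from Equation (3) to Equation (4) can be made explicit by unrolling one step. The sketch below is a standard argument rather than a derivation taken from this article; the symbol $\bar{ε}$ is introduced here only to denote the merged standard normal noise.

```latex
\begin{align*}
x_{t} &= \sqrt{\alpha_{t}}\, x_{t-1} + \sqrt{1-\alpha_{t}}\, \varepsilon_{t-1} \\
      &= \sqrt{\alpha_{t}\alpha_{t-1}}\, x_{t-2}
         + \sqrt{\alpha_{t}(1-\alpha_{t-1})}\, \varepsilon_{t-2}
         + \sqrt{1-\alpha_{t}}\, \varepsilon_{t-1} \\
      &= \sqrt{\alpha_{t}\alpha_{t-1}}\, x_{t-2}
         + \sqrt{1-\alpha_{t}\alpha_{t-1}}\, \bar{\varepsilon},
         \qquad \bar{\varepsilon} \sim N(0, I)
\end{align*}
```

The last line uses the fact that two independent zero-mean Gaussian terms add to a Gaussian whose variance is the sum $α_{t} (1 - α_{t - 1}) + (1 - α_{t}) = 1 - α_{t} α_{t - 1}$ . Repeating the substitution down to $x_{0}$ replaces the product of the $α_{i}$ with $\bar{α_{t}}$ , which gives Equation (4).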

Through the application of Equation (4), the state of the sample image $x_{t}$ is obtained after the addition of noise at any time $t$ . Figure 3 illustrates an example of a forward diffusion process applied to a concrete crack image using a linear schedule.

Figure 3.

Progressive degradation of a concrete crack image using a linear noise schedule.
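As an illustration of Equation (4) with a linear schedule, the following minimal sketch draws $x_{t}$ directly from $x_{0}$ for a toy vector. The schedule endpoints (1e-4 and 0.02), the number of steps (1000) and all function names are assumptions chosen for the example; the article does not state the values it used.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (endpoint values are assumed)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), returned for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from x_0 in one shot via Equation (4); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [0.5, -0.25, 1.0, 0.0]  # toy stand-in for flattened, normalized pixels
x_mid = q_sample(x0, 500, alpha_bars, rng)   # partially noised
x_end = q_sample(x0, 1000, alpha_bars, rng)  # almost pure noise
```

Because $\bar{α_{t}}$ shrinks monotonically towards zero, the signal term fades while the noise term grows, which is the progressive degradation shown in Figure 3.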

The reverse diffusion process is the opposite of the forward diffusion process. It is no longer about finding the distribution of $x_{t}$ under the condition of $x_{t - 1}$ , but about finding the distribution of $x_{t - 1}$ under the condition of $x_{t}$ , that is, converting it into a mathematical problem to solve the distribution $q (x_{t - 1} | x_{t})$ . Assume that $x_{0}$ is known, that is, the real image is known. The probability distribution under the condition of $x_{0}$ is expressed as follows:

p (x_{t} | x_{t - 1}, x_{0}) = \sqrt{α_{t}} x_{t - 1} + \sqrt{1 - α_{t}} ε

(5)

p (x_{t - 1} | x_{0}) = \sqrt{\bar{α_{t - 1}}} x_{0} + \sqrt{1 - \bar{α_{t - 1}}} ε

(6)

p (x_{t} | x_{0}) = \sqrt{\bar{α_{t}}} x_{0} + \sqrt{1 - \bar{α_{t}}} ε

(7)

Equations (5)–(7) can be rewritten into the Gaussian function form as follows:

p (x_{t} | x_{t - 1}, x_{0}) \propto \exp (- \frac{{(x_{t} - \sqrt{α_{t}} x_{t - 1})}^{2}}{2 (1 - α_{t})})

(8)

p (x_{t - 1} | x_{0}) \propto \exp (- \frac{{(x_{t - 1} - \sqrt{\bar{α_{t - 1}}} x_{0})}^{2}}{2 (1 - \bar{α_{t - 1}})})

(9)

p (x_{t} | x_{0}) \propto \exp (- \frac{{(x_{t} - \sqrt{\bar{α_{t}}} x_{0})}^{2}}{2 (1 - \bar{α_{t}})})

(10)

According to the Bayesian formula, it can be derived that:

p (x_{t - 1} | x_{t}) = p (x_{t} | x_{t - 1}) \frac{p (x_{t - 1})}{p (x_{t})}

(11)

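Applying the Bayesian formula in Equation (11) with every term conditioned on $x_{0}$ , that is, $p (x_{t - 1} | x_{t}, x_{0}) = p (x_{t} | x_{t - 1}, x_{0}) \frac{p (x_{t - 1} | x_{0})}{p (x_{t} | x_{0})}$ , and substituting Equations (8)–(10) for the three factors, the following expression is obtained: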

p (x_{t - 1} | x_{t}, x_{0}) \propto \exp (- \frac{1}{2} (\frac{{(x_{t} - \sqrt{α_{t}} x_{t - 1})}^{2}}{β_{t}} + \frac{{(x_{t - 1} - \sqrt{\bar{α_{t - 1}}} x_{0})}^{2}}{1 - \bar{α_{t - 1}}} - \frac{{(x_{t} - \sqrt{\bar{α_{t}}} x_{0})}^{2}}{1 - \bar{α_{t}}}))

(12)

Rearranging Equation (12) into a general form obtains:

p (x_{t - 1} | x_{t}, x_{0}) \propto \exp (- \frac{1}{2} [(\frac{α_{t}}{β_{t}} + \frac{1}{1 - \bar{α_{t - 1}}}) {x_{t - 1}}^{2} - (\frac{2 \sqrt{α_{t}}}{β_{t}} x_{t} + \frac{2 \sqrt{\bar{α_{t - 1}}}}{1 - \bar{α_{t - 1}}} x_{0}) x_{t - 1} + C (x_{t}, x_{0})])

(13)

According to Equation (13), the mean and variance can be solved as follows:

σ^{2} = \frac{1 - \bar{α_{t - 1}}}{1 - \bar{α_{t}}} β_{t}

(14)

μ = \frac{\sqrt{α_{t}} (1 - \bar{α_{t - 1}})}{1 - \bar{α_{t}}} x_{t} + \frac{\sqrt{\bar{α_{t - 1}}} β_{t}}{1 - \bar{α_{t}}} x_{0}

(15)
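The step from the quadratic form to Equations (14) and (15) is a completion of the square. The following sketch spells it out; the shorthand symbols $a$ and $b$ are introduced here for convenience and do not appear in the article.

```latex
\begin{align*}
\exp\!\Big(-\tfrac{1}{2}\big[a\,x_{t-1}^{2} - 2b\,x_{t-1}\big]\Big)
  &\propto \exp\!\Big(-\frac{(x_{t-1}-\mu)^{2}}{2\sigma^{2}}\Big),
  \qquad \sigma^{2} = \frac{1}{a}, \quad \mu = \frac{b}{a}, \\
a &= \frac{\alpha_{t}}{\beta_{t}} + \frac{1}{1-\bar{\alpha}_{t-1}}
   = \frac{\alpha_{t}(1-\bar{\alpha}_{t-1}) + \beta_{t}}{\beta_{t}(1-\bar{\alpha}_{t-1})}
   = \frac{1-\bar{\alpha}_{t}}{\beta_{t}(1-\bar{\alpha}_{t-1})}, \\
b &= \frac{\sqrt{\alpha_{t}}}{\beta_{t}}\,x_{t}
   + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\,x_{0}.
\end{align*}
```

The simplification of $a$ uses $β_{t} = 1 - α_{t}$ and $α_{t} \bar{α_{t - 1}} = \bar{α_{t}}$ . Taking $σ^{2} = 1 / a$ reproduces Equation (14), and multiplying $b$ by $σ^{2}$ reproduces Equation (15).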

Since it is initially assumed that $x_{0}$ is known, but in reality $x_{0}$ is unknown and only $x_{t}$ is known, an expression for $x_{0}$ needs to be constructed as follows:

x_{0} = \frac{x_{t} - \sqrt{1 - \bar{α_{t}}} ε_{t}}{\sqrt{\bar{α_{t}}}}

(16)

Substituting Equation (16) into Equation (15), the mean expression is obtained as:

μ = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{1 - α_{t}}{\sqrt{1 - \bar{α_{t}}}} ε_{t})

(17)
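The equivalence of Equations (15) and (17) can be checked numerically for scalar values. The sketch below is only a sanity check under arbitrarily chosen numbers ( $α_{t}$ = 0.98, $\bar{α_{t - 1}}$ = 0.60); the function names are hypothetical.

```python
import math


def posterior_mean_from_x0(x_t, x0, alpha_t, a_bar_t, a_bar_prev):
    """Equation (15): posterior mean written in terms of x_t and x_0."""
    beta_t = 1.0 - alpha_t
    coef_xt = math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    coef_x0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    return coef_xt * x_t + coef_x0 * x0


def posterior_mean_from_eps(x_t, eps, alpha_t, a_bar_t):
    """Equation (17): the same mean written in terms of x_t and the noise."""
    scaled_noise = (1.0 - alpha_t) / math.sqrt(1.0 - a_bar_t) * eps
    return (x_t - scaled_noise) / math.sqrt(alpha_t)


# Arbitrary example values; a_bar_t must equal alpha_t * a_bar_prev.
alpha_t = 0.98
a_bar_prev = 0.60
a_bar_t = alpha_t * a_bar_prev

x0, eps = 0.7, -1.3
# Equation (4) links x_t, x_0 and the noise, which is what Equation (16) inverts.
x_t = math.sqrt(a_bar_t) * x0 + math.sqrt(1.0 - a_bar_t) * eps

mu_from_x0 = posterior_mean_from_x0(x_t, x0, alpha_t, a_bar_t, a_bar_prev)
mu_from_eps = posterior_mean_from_eps(x_t, eps, alpha_t, a_bar_t)
```

The two values agree up to floating-point rounding, which is why the network only needs to predict the noise rather than $x_{0}$ itself.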

Assume that the mean expression obtained by neural network learning is:

μ_{θ} = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{1 - α_{t}}{\sqrt{1 - \bar{α_{t}}}} ε_{θ} (x_{t}, t))

(18)

After obtaining the mean, the mean square error can be used to obtain the loss function:

L_{t} = E_{t, x_{0}, ε_{t}} [‖ ε_{t} - ε_{θ} (x_{t}, t) ‖^{2}]

(19)

where $x_{t}$ is obtained according to Equation (4). According to the loss function, the mean and variance can be trained until they approach the true probability distribution $q (x_{t - 1} | x_{t})$ .
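To make the roles of Equations (14), (18) and (19) concrete, the sketch below evaluates the noise-prediction loss for one sample and performs one reverse step. It is a toy illustration, not the article's training code: the noise predictor is replaced by an oracle that returns the true noise, and all names are hypothetical.

```python
import math
import random


def noise_loss(eps_true, eps_pred):
    """Single-sample estimate of Equation (19): squared L2 norm of the error."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))


def reverse_step(x_t, eps_pred, alpha_t, a_bar_t, a_bar_prev, rng):
    """Sample x_{t-1} ~ N(mu_theta, sigma^2 I) using Equations (18) and (14)."""
    beta_t = 1.0 - alpha_t
    sigma = math.sqrt((1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t)
    coef = beta_t / math.sqrt(1.0 - a_bar_t)
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


rng = random.Random(0)

# First diffusion step (t = 1): alpha_bar_0 = 1 and alpha_bar_1 = alpha_1.
alpha_1 = 0.9999
a_bar_1, a_bar_0 = alpha_1, 1.0

x0 = [0.5, -0.25, 1.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x1 = [math.sqrt(a_bar_1) * v + math.sqrt(1.0 - a_bar_1) * e
      for v, e in zip(x0, eps)]

oracle_eps = eps  # stand-in for eps_theta(x_t, t); a real model is learned
loss = noise_loss(eps, oracle_eps)
x0_recovered = reverse_step(x1, oracle_eps, alpha_1, a_bar_1, a_bar_0, rng)
```

With a perfect noise prediction the loss is zero, and at $t = 1$ the variance in Equation (14) vanishes, so the reverse step returns $x_{0}$ up to rounding. A trained network only approximates this behaviour.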

In the realm of diffusion models, the complexity of noise addition and reversal poses significant computational challenges. To address this, Rectified Flow³⁷ offers a simplified approach by employing a linear interpolation method for noise addition. Unlike traditional recursive Gaussian models, Rectified Flow introduces noise through a straightforward linear blend between the original image and noise across the diffusion timeline. This method is represented mathematically as:

z_{t} = (1 - t) x_{0} + t ε .

(20)

It simplifies both the understanding and computation of noise dynamics. The replacement of complex noise addition equations with Rectified Flow can lead to significant reductions in computational overhead and enhance the model’s transparency.
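A minimal sketch of Equation (20) follows. The velocity function is simply the time derivative of Equation (20); using it as the regression target is how rectified-flow models are commonly trained, which is stated here as an assumption rather than a detail given in this article. Function names are hypothetical.

```python
import random


def rf_interpolate(x0, eps, t):
    """Equation (20): z_t = (1 - t) * x_0 + t * eps, with t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def rf_velocity(x0, eps):
    """Time derivative of Equation (20): dz_t/dt = eps - x_0, constant in t."""
    return [b - a for a, b in zip(x0, eps)]


rng = random.Random(0)
x0 = [0.2, -0.4, 0.9]                      # toy stand-in for a latent image
eps = [rng.gauss(0.0, 1.0) for _ in x0]    # standard normal noise

z_half = rf_interpolate(x0, eps, 0.5)      # halfway between data and noise
velocity = rf_velocity(x0, eps)

# With the exact velocity, one Euler step of size 1 from pure noise (t = 1)
# lands back on the data (t = 0) because the path is a straight line.
x0_recovered = [z - 1.0 * v for z, v in zip(eps, velocity)]
```

The straight-line path is what removes the step-by-step Gaussian recursion of Equations (1)–(4): any intermediate state is obtained with one multiply-add per element.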

The training of the text-to-image RFM, guided by the framework in a previous study,³⁸ involves using a pretrained encoder to convert both images and text into latent representations. This approach ensures that both modalities are represented in a compressed format that retains essential information while reducing dimensionality. Images are transformed into a lower-dimensional latent space, which effectively captures the visual content in a more manageable form. Similarly, text conditioning is processed through pretrained text models that convert textual data into corresponding latent representations.

The text conditioning, denoted as $c$ , is encoded using pretrained text models: CLIP-L³⁹ and CLIP-G.⁴⁰ The outputs, with dimensions $768 \times 1$ and $1280 \times 1$ respectively, are concatenated to form a conditioning vector $c_{vec}$ . The penultimate hidden representations are concatenated channel-wise to form a CLIP context conditioning vector $c_{ctxt}^{CLIP}$ . The text conditioning $c$ is further encoded into the final hidden representation $c_{ctxt}^{T 5}$ , derived from the encoder of a T5-v1.1-XXL model.⁴¹ To align the dimensions with the T5 representation, $c_{ctxt}^{CLIP}$ is zero-padded along the channel axis to 4096 dimensions. Finally, these representations are concatenated along the sequence axis to produce the ultimate context representation $c_{ctxt} \in R^{154 \times 4096}$ . A pretrained autoencoder³⁵ is used to compress an RGB image $X \in R^{H \times W \times 3}$ into a latent representation $x \in R^{h \times w \times d}$ , where a spatial downsampling factor of 8 is applied ( $h = H / 8$ , $w = W / 8$ ) and $d$ is the number of latent channels. The forward process in the latent space is characterized by rectified flow. The latent representation $x$ is decoded back into pixel space $X = D (x)$ by a decoder $D (\cdot)$ .³⁵
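The dimension bookkeeping for the text conditioning can be traced with plain shape arithmetic. The sketch below works on shapes only and loads no model; the sequence length of 77 tokens per encoder is an assumption inferred from the stated total of 154, and the helper names are hypothetical.

```python
def concat_shapes(shapes, axis):
    """Shape obtained by concatenating arrays with these shapes along `axis`."""
    result = list(shapes[0])
    for shape in shapes[1:]:
        same_rank = len(shape) == len(result)
        if not same_rank or any(
            a != b for i, (a, b) in enumerate(zip(result, shape)) if i != axis
        ):
            raise ValueError("shapes differ outside the concatenation axis")
        result[axis] += shape[axis]
    return tuple(result)


def zero_pad_shape(shape, axis, target):
    """Shape after zero-padding `axis` up to `target` entries."""
    if shape[axis] > target:
        raise ValueError("cannot pad to a smaller size")
    padded = list(shape)
    padded[axis] = target
    return tuple(padded)


TOKENS = 77  # assumed tokens per text encoder (154 / 2)

# Conditioning vector: CLIP-L (768) and CLIP-G (1280) outputs concatenated.
c_vec = concat_shapes([(768,), (1280,)], axis=0)

# CLIP context: penultimate hidden states concatenated channel-wise, then
# zero-padded along the channel axis to match the T5 width of 4096.
clip_ctx = concat_shapes([(TOKENS, 768), (TOKENS, 1280)], axis=1)
clip_ctx = zero_pad_shape(clip_ctx, axis=1, target=4096)

# T5 context, then concatenation along the sequence axis.
t5_ctx = (TOKENS, 4096)
c_ctxt = concat_shapes([clip_ctx, t5_ctx], axis=0)
```

Here `c_vec` has 2048 entries and `c_ctxt` has shape (154, 4096), matching the context representation described in the text.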

The Multimodal Diffusion Transformer³⁵ architecture, based on the Diffusion Transformer,⁴² is used for image reconstruction. The Diffusion Transformer integrates the diffusion timestep with the class label, while the Multimodal Diffusion Transformer incorporates embeddings of both the diffusion timestep and text conditioning vectors into its process. In addition, a sequence that merges embeddings from both text and image inputs is constructed. Positional encodings are incorporated, and the latent pixel data is transformed by flattening pixel blocks into a sequence of patch encodings. These patch encodings, along with the text sequence, are then adjusted to a common dimensionality and concatenated. Following the Diffusion Transformer, a series of modulated attention mechanisms and multi-layer perceptrons (MLPs) are implemented, which are designed separately for the text and image data due to their distinct characteristics. The design mimics two transformers, each dedicated to one modality. Subsequently, the sequences from both modalities are combined for the attention mechanism. This allows for integrated processing in which each modality can influence and be influenced by the other.

Prompt generation and prompt engineering

Achieving diversity in generated images requires diverse prompts. Manually writing detailed prompts can be incredibly time-consuming. Fortunately, advancements in the field of natural language processing have enabled LLMs to efficiently generate vast, varied prompts. In this article, GPT-4 is employed to generate text prompts. GPT-4 is a state-of-the-art LLM developed by OpenAI, which includes numerous transformer layers⁴³ known for their ability to process and generate complex language patterns. Its proficiency in analysing vast amounts of text data enables it to produce varied text prompts, facilitating the creation of unlimited synthesized images.

Not all text prompts generated by GPT-4 can effectively guide the RFM to produce high-quality images. Although these prompts might be well-composed, they do not always resonate with the RFM’s optimal processing capabilities due to a lack of specificity in tuning to the RFM’s preferences. In this study, we use the term ‘RFM-preferred prompt’ to refer to prompts that empirically lead to consistently high-quality image outputs from the RFM. This preference is not derived from fixed linguistic rules or predefined semantic structures. Rather, it reflects the latent tendencies learnt by the RFM during pretraining on large-scale datasets. Since the internal functioning of the RFM is governed by high-dimensional neural representations, the exact preference criteria are not directly interpretable. Instead, we identify these prompts through systematic qualitative evaluation of the images they produce. Figure 4 illustrates an example of RFM-preferred and RFM-non-preferred prompts and the images they generate.

Figure 4.

Comparison of images generated using RFM. (a) shows an image generated from an RFM-preferred prompt, while (b) shows an image generated from an RFM-non-preferred prompt. RFM: rectified flow model.

To ensure that the RFM can consistently generate diverse, high-quality images that reflect reality and meet engineers’ preferences, a systematic prompt engineering approach is employed. This process selects the text prompts that the RFM interprets well. Figure 5 illustrates the prompt engineering process through which text prompts generated by GPT-4 are selected. Initially, a varied pool of raw prompts is automatically generated using GPT-4. This is done by directly interacting with GPT-4 with specific requests, such as: ‘Hi GPT-4, could you please generate 50 prompts for diffusion models? These prompts should focus on cracks in various civil engineering structures’. GPT-4 then produces a range of text prompts concerning cracks, which include both RFM-preferred and non-preferred types, collectively referred to as raw prompts. Each raw prompt generated by GPT-4 is subsequently used to produce an image using the RFM, with all hyperparameters set uniformly to maintain consistent testing conditions. After the images are generated, a rigorous quality check is performed manually as part of the proposed prompt engineering strategy. During this evaluation phase, each image is assessed to determine whether it meets the user preferences and quality standards. Prompts that lead to the creation of images favoured by users are retained for further use, while those that result in subpar or undesirable outcomes are removed from the pool. Once the quality check is completed, the selected prompts are designated as RFM-preferred prompts and are employed for further image generation.

Figure 5.

The prompt engineering process of selecting RFM-preferred text prompts from a pool generated by GPT-4. RFM: rectified flow model; GPT: generative pre-trained transformer.
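The selection loop in Figure 5 can be sketched as a simple filter. The code below is a schematic under stated assumptions: `generate_image` and `passes_quality_check` are hypothetical callables standing in for the RFM and for the manual review, and the stub implementations and example prompts exist only to make the sketch runnable.

```python
from typing import Callable, List


def select_preferred_prompts(
    raw_prompts: List[str],
    generate_image: Callable[[str, int], object],
    passes_quality_check: Callable[[object], bool],
    seed: int = 0,
) -> List[str]:
    """Keep the prompts whose generated image passes the (manual) review.

    Every prompt is rendered with the same seed and settings so that the
    prompt is the only factor that changes between images.
    """
    preferred = []
    for prompt in raw_prompts:
        image = generate_image(prompt, seed)
        if passes_quality_check(image):
            preferred.append(prompt)
    return preferred


# Stubs for illustration only: no real model or reviewer is involved.
def fake_generate(prompt, seed):
    return {"prompt": prompt, "seed": seed}


approved_by_reviewer = {"hairline crack on a weathered concrete pier"}


def fake_review(image):
    # Stands in for a human decision about the rendered image.
    return image["prompt"] in approved_by_reviewer


raw_prompts = [
    "hairline crack on a weathered concrete pier",
    "a crack",
]
rfm_preferred = select_preferred_prompts(raw_prompts, fake_generate, fake_review)
```

In the article the review is manual, so in practice the second callable would record a reviewer's accept or reject decision rather than compute one.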

To maintain a consistent quality across generated images, each RFM-preferred prompt is used to produce multiple images by randomly varying the RFM’s seed value. The seed value determines the initial noise pattern in the image generation process, setting the starting point for the noise that the RFM manipulates during its diffusion processes. While the RFM is governed by several hyperparameters, our research indicates that the text prompt itself is the paramount hyperparameter, exerting the most significant influence on the quality of the generated images. Modifying a text prompt can dramatically impact the resultant image quality. However, if the prompt is RFM-preferred, subsequent adjustments to other hyperparameters, especially the seed value of the RFM, typically have an insignificant impact on image quality. Although changing the seed may slightly alter image details, it does not significantly affect the overall quality and style of the images. Figure 6 presents an example where the RFM-preferred prompt from Figure 4 is used, maintaining all hyperparameters at their default settings while changing the seed value. It is evident that the image quality remains consistently high, and the style is unchanged but the details, such as the shape and size of the crack, differ.

Figure 6.

The effect of changing the seed value while using the same RFM-preferred prompt from Figure 4. RFM: rectified flow model.
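The role of the seed can be illustrated without any generative model: it fixes the pseudo-random initial noise from which sampling starts. The sketch below uses Python's standard generator as a stand-in; the function name and the noise size are assumptions for the example.

```python
import random


def initial_noise(seed, size):
    """Standard normal noise determined entirely by the seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


noise_a = initial_noise(seed=1, size=8)
noise_a_again = initial_noise(seed=1, size=8)
noise_b = initial_noise(seed=2, size=8)

same_seed_repeats = noise_a == noise_a_again   # identical starting point
different_seed_differs = noise_a != noise_b    # a new starting point
```

Keeping the prompt fixed while changing only the seed therefore changes the starting noise, which is consistent with the observation that crack shape and size vary while style and overall quality stay the same.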

Experimental validation of synthesized data for enhanced crack detection

Experiment setup

To evaluate the efficacy of the proposed image generation framework, two distinct experiments involving vision-based tasks, crack classification and crack object detection, are conducted. These experiments are designed to determine how synthesized data, when added to training datasets, can improve the accuracy and robustness of neural networks in detecting and categorizing crack features. The methodologies of evaluation for both the crack classification and crack object detection tasks are depicted in Figure 7. This figure outlines the workflow used for each of the two tasks, from data preparation through model training to the final testing phase.

Figure 7.

Experimental setups and workflow for the crack classification and object detection tasks.

The deep learning (DL)-based crack classification task involves training a neural network to classify images as either containing a crack (positive) or not (negative). The primary goal is to assess the network’s ability to accurately identify the presence of cracks based on the visual data provided in the images. Unlike simple classification, the DL-based object detection task requires the trained model not only to detect the presence of a crack but also to pinpoint its location within the image. The output includes bounding boxes around each detected crack, accompanied by confidence scores. To perform these evaluations, two public datasets are utilized, the Crack Classification Dataset (CCLA)⁴⁴ for crack image classification and the Comprehensive Crack Detection Dataset (CCDD)^45–47 for crack object detection, alongside a dataset synthesized by the proposed framework and the Web-sourced Crack Evaluation Dataset (WCED). The CCLA dataset consists of images of concrete structures from various buildings, with half of the images tagged as negative and the other half as positive. These images are captured using a 16-MP Nikon camera positioned at a working distance of 500 mm to ensure consistent image quality and scale. Each image within this dataset has a resolution of 256 × 256 pixels. The CCDD is compiled by aggregating three public crack object detection datasets from Roboflow,⁴⁸^,45–47 summing up to 1020 positive images for training. The CCLA and CCDD datasets serve as the base training datasets for the crack classification and crack object detection neural networks, respectively.

To test the generalization and robustness of the trained models, 400 diverse, real-scenario images from the web are collected and annotated, forming the WCED dataset, which is used only for testing the trained neural networks. The WCED dataset comprises 200 negative images of undamaged structures such as buildings and infrastructure, representing scenarios with no crack, and 200 positive images that include a range of crack manifestations, such as small cracks, occluded cracks and cracks obscured by moss. This diverse testing dataset presents a challenging environment for the trained neural networks in detecting and classifying cracks under varied conditions. A few positive images from WCED are displayed in Figure 8.

Figure 8.

Examples of positive images from the WCED dataset. WCED: Web-sourced Crack Evaluation Dataset.

A total of 4K synthesized images (2K positive and 2K negative) are generated using the proposed framework described in the second section to augment the training data. For the crack classification task, these images are integrated with the CCLA dataset. For the object detection task, 1200 positive synthesized images, also labelled using Roboflow,⁴⁸ are merged with the CCDD dataset. The integration follows a staged approach, allowing the impact of the synthesized data on model performance to be evaluated. The effectiveness of this strategy is determined by observing any improvements in the models’ performance when testing on the same web-sourced images (WCED), before and after the integration of synthesized data. A noticeable improvement in model performance upon integrating the synthesized images would validate the inclusion of the synthesized dataset and confirm the effectiveness of the proposed synthesized data generation framework.

Synthesized training data generation

For synthesized training data generation, the framework proposed in the second section, which integrates GPT-4 and the RFM,³⁵ is used. The synthesized crack dataset (SCD) generation begins with the creation of 600 raw prompts related to cracks using GPT-4. From them, 200 RFM-preferred prompts are selected manually through the prompt engineering method introduced in the “Prompt generation and prompt engineering” section to optimize prompt suitability for the RFM. Additionally, 600 raw prompts are created for images depicting undamaged structures, from which 200 are chosen following the same prompt engineering process. Upon establishing these 400 RFM-preferred prompts, all hyperparameters are fixed to ensure uniformity across data generation. For each of the selected prompts, 10 unique images are generated by changing the seed value across 10 distinct random numbers. As a result, a total of 2K positive images depicting various crack scenarios and 2K negative images showcasing various undamaged structures are produced. Figure 9 presents three examples produced using the positive prompts. Figure 9(a) displays images generated from a prompt designed for concrete, illustrating a ‘cluster of small, shallow cracks on a concrete wall’. Figure 9(b) shows images from a prompt for asphalt, depicting an ‘asphalt road surface with a branching crack filled with small vertical pebbles’. Figure 9(c) presents images of a ‘concrete bridge with visible cracks and misalignments from recent earthquake damage, under emergency inspection lights’. These images collectively demonstrate the capability of the proposed synthesized image generation framework to create detailed and context-specific images that reflect different materials and damage scenarios under various conditions.

Figure 9.

Examples of images from the synthesized training dataset generated using RFM with three distinct prompts. (a) Cluster of small, shallow cracks on a concrete wall, (b) asphalt road surface with a branching crack filled with small vertical pebbles, and (c) concrete bridge with visible cracks and misalignments from recent earthquake damage, under emergency inspection lights.

Dataset diversity evaluation

Ensuring dataset diversity is essential for training deep learning models that can generalize effectively to real-world SHM tasks. To evaluate how well different datasets capture variations in crack patterns, we compare the real training dataset (CCDD) and the synthetic dataset (synthetic positive images) against a subset of the real-world testing data (WCED positive images). To quantify the differences between these datasets, we leverage two widely used diversity assessment metrics: Fréchet Inception Distance (FID)^49,50 and Structural Similarity Index Measure (SSIM).⁵¹ FID measures the similarity between two datasets in the feature space of a pretrained neural network.⁵² A lower FID score indicates that the dataset is more similar to the real-world data. SSIM analyses the structural consistency between images by comparing luminance, contrast, and structure, with a higher score indicating greater similarity to real-world images.
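For reference, the two metrics are commonly written as below. These are the standard definitions from the cited literature, given here as a reading aid rather than as formulas restated by this study; the symbols are introduced for illustration only.

```latex
\mathrm{FID} = \lVert \mu_{r} - \mu_{g} \rVert_{2}^{2}
  + \mathrm{Tr}\left( \Sigma_{r} + \Sigma_{g} - 2 \left( \Sigma_{r} \Sigma_{g} \right)^{1/2} \right)

\mathrm{SSIM}(x, y) =
  \frac{(2 \mu_{x} \mu_{y} + C_{1})(2 \sigma_{xy} + C_{2})}
       {(\mu_{x}^{2} + \mu_{y}^{2} + C_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2})}
```

In the FID expression, \mu_{r}, \Sigma_{r} and \mu_{g}, \Sigma_{g} are the mean and covariance of the network features for the two image sets being compared. In the SSIM expression, \mu_{x} and \mu_{y} are local means, \sigma_{x}^{2} and \sigma_{y}^{2} local variances, \sigma_{xy} the local covariance of the two images, and C_{1}, C_{2} small stabilizing constants.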

Table 1 presents a quantitative comparison of dataset similarity using FID and SSIM. The results show that CCDD versus WCED (Pos) achieves an FID of 194.9 compared to Synthetic (Pos) versus WCED (Pos) with an FID of 232.5, suggesting that CCDD is more aligned with WCED (Pos) in feature space. However, SSIM, which evaluates structural consistency in the image domain, exhibits the opposite trend. Synthetic (Pos) versus WCED (Pos) achieves a higher SSIM score (0.0427) than CCDD versus WCED (Pos) (0.0244), indicating better local structural similarity to WCED. This apparent contradiction arises from the fundamentally different sensitivities of the two metrics. FID captures overall image structure but is relatively insensitive to fine-grained texture. In contrast, SSIM captures local structural details, such as crack edges and surface patterns, which are critical for SHM. Therefore, while FID suggests greater semantic divergence, the higher SSIM indicates that the synthetic data better reflects the fine-grained textures seen in real crack images.

Table 1.

Diversity metrics comparison.

Metric	CCDD vs. WCED (Pos)	Synthetic (Pos) vs. WCED (Pos)
FID	194.9	232.5
SSIM	0.0244	0.0427

CCDD: Comprehensive Crack Detection Dataset; WCED: Web-sourced Crack Evaluation Dataset; FID: Fréchet Inception Distance; SSIM: Structural Similarity Index Measure.

Moreover, the absolute values of both metrics (e.g., FID ∼195–230, SSIM ∼0.024–0.043) fall outside conventional interpretability ranges. This is likely due to the fact that both metrics are originally designed for tasks requiring near-exact image reproduction (e.g., super-resolution or denoising), whereas our goal is to generate diverse, structurally realistic, but not identical images to real data. Additionally, FID’s reliance on ImageNet-trained networks can underrepresent intra-domain relevance for crack imagery, as these networks are not optimized for civil infrastructure patterns. To mitigate these limitations and provide a more domain-relevant visual interpretation, we present a t-distributed Stochastic Neighbor Embedding (t-SNE) visualization of the image embeddings.

Figure 10 visualizes dataset clustering in feature space using t-SNE. The t-SNE visualization results indicate that Synthetic (Pos) versus WCED (Pos) demonstrates better clustering and alignment compared to CCDD versus WCED (Pos). This aligns with the higher SSIM score observed for Synthetic (Pos), suggesting that it maintains a closer resemblance to WCED (Pos) in terms of structural consistency. Conversely, CCDD versus WCED (Pos) shows a more dispersed distribution in the t-SNE space, which is consistent with its lower SSIM score. These results suggest that, despite its higher FID score, Synthetic (Pos) outperforms CCDD in both SSIM and t-SNE clustering, indicating a stronger overall alignment with WCED (Pos).

Figure 10.

t-SNE visualization of feature distributions, comparing CCDD and Synthetic (Pos) against WCED (Pos). CCDD: Comprehensive Crack Detection Dataset; WCED: Web-sourced Crack Evaluation Dataset.

While these metrics provide valuable insights, they have inherent limitations in capturing dataset diversity, particularly given the limited size of the real testing dataset. To further evaluate the effectiveness of the synthesized dataset, we assess its role in improving crack classification and detection performance.

Synthesized data evaluation – crack classification

The model employed for the crack classification task is a VGG-16 architecture,¹⁴ initially trained on the ImageNet dataset.⁵³ The architecture of the neural network is shown in Figure 11. The base model consists of five convolutional blocks, each followed by max pooling layers, effectively capturing hierarchical features from the input images. The first two blocks contain two convolutional layers with 64 and 128 filters, respectively, while the subsequent blocks have three convolutional layers each, with 256, 512 and 512 filters. These layers are activated by the Rectified Linear Unit (ReLU) function and progressively reduce the spatial dimensions of the input through max pooling. The flattened output from the last max pooling layer, a one-dimensional vector of 25088 elements, is fed into two fully connected layers, each with 4096 units and the ReLU activation. Finally, a dense layer with a single unit and the sigmoid activation function is added to output a probability value for binary classification.
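The 25088-element flattened vector follows directly from the block structure described above. The short sketch below traces the feature-map size through the five blocks; it assumes the standard VGG-16 settings (a 224 × 224 input, 3 × 3 convolutions that preserve spatial size, and 2 × 2 max pooling with stride 2), which are not stated explicitly in the text. The names `VGG16_BLOCKS` and `trace_feature_shapes` are illustrative.

```python
# (number of convolutional layers, output channels) for the five blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_feature_shapes(input_size=224):
    """Return (height, width, channels) after each block's max pooling.

    Assumption: convolutions keep the spatial size (3x3, padding 1) and
    each 2x2 max pooling halves height and width.
    """
    size = input_size
    shapes = []
    for _num_convs, channels in VGG16_BLOCKS:
        size //= 2  # only the pooling layer changes the spatial size
        shapes.append((size, size, channels))
    return shapes


shapes = trace_feature_shapes()
height, width, channels = shapes[-1]
flattened = height * width * channels

print(shapes)     # spatial size halves at every block
print(flattened)  # 7 * 7 * 512 = 25088
```

Under these assumptions the last pooling layer outputs a 7 × 7 × 512 tensor, which flattens to the 25088 elements fed into the first fully connected layer.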

Figure 11.

The architecture of VGG-16 neural network. VGG: Visual Geometry Group.

The CCLA dataset is initially employed to train the neural network, establishing a baseline performance. Subsequently, the synthesized dataset from the proposed framework, comprising both positive and negative images, is progressively integrated with the CCLA dataset. This integration follows a staged approach: starting with an addition of 500 positive and 500 negative images, followed by increments to 1K positive and negative, then 1.5K, and finally culminating with 2K of each. The progressive integration aims to quantitatively evaluate the effect of augmentation with purposely-generated synthesized images. All model evaluation is conducted on the WCED dataset. The performance of the models is assessed using three metrics: Precision, Recall and the F1-score. Precision measures the accuracy of the positive predictions for the cracked class:

Precision = \frac{TP}{TP + FP},

(21)

where True Positives (TP) is the number of correctly identified cracks, and False Positives (FP) is the number of cases in which the model incorrectly identifies cracks in non-cracked images or locations. Recall is used to evaluate the model’s effectiveness in identifying all actual instances of cracks:

Recall = \frac{TP}{TP + FN},

(22)

where False Negatives (FN) are cases where the model fails to identify actual cracks. The F1-score combines both Precision and Recall into a single metric by calculating their harmonic mean, providing a balanced measure of the model’s accuracy and completeness:

F_{1} = 2 \times \frac{Precision \times Recall}{Precision + Recall} .

(23)
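Equations (21)–(23) translate directly into code. The sketch below is illustrative: the function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something specified in the text.

```python
def precision(tp, fp):
    """Equation (21): share of predicted cracks that are real cracks."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Equation (22): share of real cracks that the model finds."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def f1_score(p, r):
    """Equation (23): harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical confusion counts, for illustration only.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)  # 0.9
r = recall(tp, fn)     # 0.75
print(round(f1_score(p, r), 4))

# Consistency check against the baseline Precision and Recall in Table 2.
print(round(f1_score(0.3360, 0.3587), 3))
```

Applying `f1_score` to the baseline Precision (0.3360) and Recall (0.3587) reported in Table 2 reproduces the tabulated F1-score of 0.347.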

A summary of the model performances is given in Table 2. Performance improves significantly at each stage of synthesized data augmentation. The baseline model, trained solely on the CCLA dataset, achieves an F1-score of 0.347, indicating modest performance that may stem from a limited diversity of training examples, potentially hindering the model’s ability to handle complex crack patterns. The introduction of 500 positive and 500 negative synthesized images (denoted as SCD(P&N)) markedly enhances the model’s capabilities, as evidenced by the F1-score rising to 0.740. This substantial increase suggests that even a moderate addition of synthesized images from the proposed framework to the training images can significantly boost the model’s ability to generalize across different scenarios. As more synthesized data are progressively added, the F1-score continues to climb, reaching 0.897 with 1K and 0.921 with 1.5K synthesized images of each class. This trend reflects the enhanced Precision and Recall, illustrating the model’s improving proficiency in accurately identifying and classifying crack images. The best performance is observed when the dataset is expanded to include 2K positive and 2K negative synthesized images, culminating in an F1-score of 0.932. The clear gains in Precision, Recall and overall F1-score through the synthesized data integration affirm the effectiveness of the proposed image generation framework in training better neural network models for the crack image classification task.

Table 2.

Performance of VGG-16 across different stages of synthesized data integration.

Training data	Testing data	Precision	Recall	F1-score
CCLA	WCED (P&N)	0.3360	0.3587	0.347
CCLA + 500 SCD(P&N)	WCED (P&N)	0.7596	0.7215	0.740
CCLA + 1K SCD(P&N)	WCED (P&N)	0.8996	0.8952	0.897
CCLA + 1.5K SCD(P&N)	WCED (P&N)	0.9215	0.9213	0.921
CCLA + 2K SCD(P&N)	WCED (P&N)	0.9396	0.9242	0.932

VGG: Visual Geometry Group; SCD: synthesized crack dataset; WCED: Web-sourced Crack Evaluation Dataset.

Synthesized data evaluation – crack detection

To further evaluate the efficiency of the proposed dataset generation framework, the synthesized dataset is employed to train a crack object detection model. Different from the crack image classification, object detection not only identifies the presence of cracks but also predicts bounding boxes around the cracks, making it a more complex challenge. For this purpose, the YOLOv5⁵⁴ neural network is utilized. YOLOv5 is a popular object detection model renowned for its speed and accuracy. The architecture of YOLOv5 is based on the Darknet model,⁵⁵ but it has been optimized and improved to enhance performance. It features a backbone designed for feature extraction, a neck responsible for feature integration, and a head for bounding box prediction. The backbone is based on the CSPDarknet53,⁵⁶ which reduces the computational complexity and improves the learning capability of the model through cross-stage partial connections that allow for more efficient feature propagation. In the neck, YOLOv5 uses a series of feature pyramid networks⁵⁷ and path aggregation networks⁵⁸ to merge features from different stages of the backbone, enhancing the model’s ability to detect objects at various scales.

Similar to the crack classification experiment, the object detection experiment begins by using the CCDD to fine-tune a YOLOv5 model, establishing a baseline crack detection model. Subsequently, 200 positive images from the WCED are annotated and used to conduct an initial evaluation of the model’s performance, providing base metrics for comparison. Following this initial assessment, synthesized images are progressively integrated into the training dataset in stages: 200, 400, 600, 800, 1K and finally 1.2K images. To evaluate the accuracy of the neural network for crack detection, four key metrics are utilized: Precision, Recall, mAP-0.5 and mAP-0.5:0.95. The metrics mAP-0.5 and mAP-0.5:0.95 evaluate the precision and reliability of object detection models, particularly their ability to accurately predict bounding boxes around detected objects. mAP-0.5 measures how closely the model’s predicted bounding boxes align with the ground truth boxes, requiring at least a 50% overlap. It calculates the Average Precision (AP) for each class at this threshold, where Precision is the proportion of correct positive predictions and Recall is the proportion of actual positives that are detected. The mAP-0.5:0.95 offers a more stringent assessment by averaging APs across a range of Intersection over Union (IoU) thresholds, from 0.5 to 0.95 in increments of 0.05. This metric evaluates the model’s accuracy at various levels of strictness in bounding box overlap, from the minimum acceptable overlap to near-perfect alignment.
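The overlap criterion behind both mAP metrics can be illustrated with a small sketch. It covers only the IoU computation and the threshold grid from 0.5 to 0.95; the full Average Precision calculation over a precision-recall curve is omitted. The `(x_min, y_min, x_max, y_max)` box format, the function names and the example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# The ten thresholds used by mAP-0.5:0.95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def match_by_threshold(predicted_box, ground_truth_box):
    """For each threshold, report whether the prediction counts as a match."""
    overlap = iou(predicted_box, ground_truth_box)
    return {threshold: overlap >= threshold for threshold in IOU_THRESHOLDS}


# Hypothetical boxes: the prediction covers 78% of the ground-truth height.
ground_truth = (0, 0, 100, 100)
prediction = (0, 0, 100, 78)

matches = match_by_threshold(prediction, ground_truth)
print(iou(prediction, ground_truth))
print(matches)
```

In this example the prediction counts as correct at the looser thresholds but not at the strictest ones, which is why mAP-0.5:0.95 is always at most as high as mAP-0.5 for the same model.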

Table 3 and Figure 12 depict the performance of a YOLOv5 neural network fine-tuned on an incrementally enlarged SCD dataset. Each training configuration within the table is differentiated by the number of synthesized images added to the initial CCDD dataset. All trained models are consistently tested on positive images from the WCED dataset. The consistent improvement in Precision and Recall metrics with the addition of synthesized data indicates that the extra images significantly enhance the model’s ability to accurately identify and classify crack features. The mAP at various IoU thresholds provides further insights into the model’s accuracy. Both mAP metrics show marked improvements as more synthesized data is integrated, highlighting a better alignment of predicted bounding boxes with actual crack positions. The progression of model performance metrics, from the baseline using only the CCDD to subsequent stages with added synthesized images, shows significant enhancements. Notable performance gains are observed with the first additions of synthesized data (200 and 400 images), though the rate of improvement slows slightly as more data is added. Figure 13 showcases some examples of crack detection performance on the WCED dataset. In these images, a label of ‘0’ indicates a detected crack, followed by a confidence value. For all analyses, a confidence threshold of 0.5 is established; only detections exceeding this threshold are considered valid. The depicted cracks are not detected by the YOLOv5 model when it is fine-tuned solely on the CCDD dataset. Subsequent fine-tuning of YOLOv5 with the addition of 1200 positive images from the SCD dataset markedly enhances the model’s detection capabilities.

Table 3.

Performance of YOLOv5 across different stages of synthesized data integration.

Training data	Testing data	Precision	Recall	mAP-0.5	mAP-0.5:0.95
CCDD	WCED(P)	0.2425	0.3818	0.2983	0.0762
CCDD + 200 SCD(P)	WCED(P)	0.5053	0.5636	0.5075	0.2632
CCDD + 400 SCD(P)	WCED(P)	0.6700	0.6873	0.6599	0.4346
CCDD + 600 SCD(P)	WCED(P)	0.7185	0.7818	0.7429	0.5252
CCDD + 800 SCD(P)	WCED(P)	0.7656	0.8009	0.7946	0.6325
CCDD + 1K SCD(P)	WCED(P)	0.8305	0.8272	0.8118	0.6644
CCDD + 1.2K SCD(P)	WCED(P)	0.8382	0.8454	0.8374	0.6797

CCDD: Comprehensive Crack Detection Dataset; WCED: Web-sourced Crack Evaluation Dataset; SCD: synthesized crack dataset.
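The slowing rate of improvement can be read directly from Table 3 by differencing consecutive rows. The sketch below does this for the mAP-0.5 column; it is simple arithmetic on the tabulated values, and the variable and function names are illustrative.

```python
# Number of synthesized images added to CCDD, and mAP-0.5, copied from Table 3.
SYNTHESIZED_IMAGES = [0, 200, 400, 600, 800, 1000, 1200]
MAP_50 = [0.2983, 0.5075, 0.6599, 0.7429, 0.7946, 0.8118, 0.8374]


def stage_gains(values):
    """Gain in the metric from each stage to the next, rounded to 4 decimals."""
    return [round(after - before, 4) for before, after in zip(values, values[1:])]


gains = stage_gains(MAP_50)
for added, gain in zip(SYNTHESIZED_IMAGES[1:], gains):
    print(f"up to {added:>4} synthesized images: mAP-0.5 gain {gain:+.4f}")
```

Each stage adds 200 images, so the gains are directly comparable: the first two stages contribute about 0.21 and 0.15 in mAP-0.5, while every later stage contributes less than 0.09.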

Figure 12.

Line graph showing the progression of Precision, Recall and mAP at different stages of synthesized data integration in YOLOv5 training. mAP: mean Average Precision.

Figure 13.

Examples of enhanced crack detection performance on the WCED dataset using YOLOv5, post-integration of the entire SCD dataset for fine-tuning. WCED: Web-sourced Crack Evaluation Dataset; SCD: synthesized crack dataset.

Conclusion and future work

In this study, a novel framework for synthesized image dataset generation is proposed to enhance the capabilities of DNN models in structural crack detection. This framework combines the RFM with GPT-4, along with a prompt selection process, to generate an ideally unlimited number of synthesized crack images. It enables the simulation of a broad spectrum of crack scenarios (from minor surface cracks to major structural damage) that can be inaccessible under normal operational conditions. Additionally, the image generation process is highly efficient, producing thousands of images in just a few hours, and hence significantly speeds up the training and development cycles for SHM applications. Experiments demonstrate significant improvements in crack image classification and object detection performance metrics. These enhancements validate the use of the synthesized datasets in addressing the challenges associated with the limited and homogeneous training data typically available in SHM. Despite the high efficiency of image generation, manual annotation of the synthesized images remains essential. Future advancements in automating the annotation process, such as for crack segmentation tasks, could significantly streamline the workflow. This would help reduce both time and labour costs, ultimately enhancing the overall effectiveness and scalability of SHM systems.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The support from Australia Research Council Laureate Fellow project FL180100196 is acknowledged.

ORCID iDs

Jun Li

Qilin Li

References

1. Wang, Leng, Zhang. A weakly-supervised transformer-based hybrid network with multi-attention for pavement crack detection. Constr Build Mater 2024; 411: 134134.

2. Xiong, Zayed, Abdelkader EM. A novel YOLOv8-GAM-Wise-IoU model for automated detection of bridge surface cracks. Constr Build Mater 2024; 414: 135025.

3. Salman, Mathavan, Kamal, et al. Pavement crack detection using the Gabor filter. In: 16th international IEEE conference on intelligent transportation systems (ITSC 2013), 2013, pp. 2039–2044. The Hague, Netherlands: IEEE.

4. Zhang, et al. Automatic crack detection and classification method for subway tunnel safety monitoring. Sensors 2014; 14(10): 19307–19328.

5. Yamaguchi, Nakamura, Hashimoto. An efficient crack detection method using percolation-based image processing. In: 2008 3rd IEEE conference on industrial electronics and applications, 2008, pp. 1875–1880. Singapore: IEEE.

6. Hsieh Y-A, Tsai YJ. Machine learning for crack detection: review and model performance comparison. J Comput Civ Eng 2020; 34(5): 04020038.

7. Wang, Shao, et al. A comparative study on the most effective machine learning model for blast loading prediction: from GBDT to Transformer. Eng Struct 2023; 276: 115310.

8. Shao, et al. Computer vision based target-free 3D vibration displacement measurement of structures. Eng Struct 2021; 246: 113040.

9. Cheng, Chen, Hao, et al. Prediction of BLEVE-induced response of road tunnel using Transformer network with modified self-attention (SAMT). Eng Struct 2024; 314: 118415.

10. Shao, et al. Out-of-plane full-field vibration displacement measurement with monocular computer vision. Autom Constr 2024; 165: 105507.

11. Zhang, Yang, Zhang, et al. Road crack detection using deep convolutional neural network. In: 2016 IEEE international conference on image processing (ICIP), Phoenix, AZ, 2016, pp. 3708–3712. USA: IEEE.

12. Hoai, Samaras. Large-scale continual road inspection: visual infrastructure assessment in the wild. In: Proceedings of the British machine vision conference (BMVC), London, UK, 4–7 September 2017. Durham, UK: BMVA Press.

13. Gopalakrishnan, Khaitan, Choudhary, et al. Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Constr Build Mater 2017; 157: 322–330.

14. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

15. Nie, Wang. Pavement distress detection based on transfer learning. In: 2018 5th international conference on systems and informatics (ICSAI), 2018, pp. 415–419. Nanjing, China: IEEE.

16. Xue. A fast detection method via region-based fully convolutional neural networks for shield tunnel lining defects. Comput Aided Civ Infrastruct Eng 2018; 33(8): 638–654.

17. Cha, Choi, Büyüköztürk. Deep learning-based crack damage detection using convolutional neural networks. Comput Aided Civ Infrastruct Eng 2017; 32(5): 361–378.

18. Mandal, Uong, Adu-Gyamfi. Automated road crack detection using deep convolutional neural networks. In: 2018 IEEE international conference on big data (Big Data), Seattle, WA, 2018. USA: IEEE.

19. Cheng, Wang. Automated detection of sewer pipe defects in closed-circuit television images using deep learning techniques. Autom Constr 2018; 95: 155–171.

20. Girshick. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448. Santiago, Chile: IEEE.

21. Liu, Anguelov, Erhan, et al. SSD: single shot multibox detector. In: Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14, 2016. Cham, Switzerland: Springer.

22. Wang, Shao, et al. A novel transformer-based semantic segmentation framework for structural condition assessment. Struct Health Monit 2024; 23(2): 1170–1183.

23. Guo, Qian, Liu, et al. Pavement crack detection based on transformer network. Autom Constr 2023; 145: 104646.

24. Ali, Chuah, Talip MSA, et al. Structural crack detection using deep convolutional neural networks. Autom Constr 2022; 133: 103989.

25. Schmugge, Rice, Lindberg, et al. Crack segmentation by leveraging multiple frames of varying illumination. In: 2017 IEEE winter conference on applications of computer vision (WACV), Santa Rosa, CA, 2017, pp. 136–144. USA: IEEE.

26. Islam, Kim J-M. Vision-based autonomous crack detection of concrete structures using a fully convolutional encoder–decoder network. Sensors 2019; 19(19): 4251.

27. Kirillov, Mintun, Ravi, et al. Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026. Paris, France: IEEE.

28. Narazaki, Hoskere, Yoshida, et al. Synthetic environments for vision-based structural condition assessment of Japanese high-speed railway viaducts. Mech Syst Signal Process 2021; 160: 107850.

29. Narazaki, Hoskere, Chowdhary, et al. Vision-based navigation planning for autonomous post-earthquake inspection of reinforced concrete railway viaducts using unmanned aerial vehicles. Autom Constr 2022; 137: 104214.

30. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial networks. Commun ACM 2020; 63(11): 139–144.

31. Shao, et al. 3DGEN: a framework for generating custom-made synthetic 3D datasets for civil structure health monitoring. Struct Health Monit 2024; 24(5): 2801–2817.

32. Branikas, Murray, West. A novel data augmentation method for improved visual crack detection using generative adversarial networks. IEEE Access 2023; 11: 22051–22059.

33. Liu, Hong, et al. A small-sample wind turbine fault detection method with synthetic fault data using generative adversarial nets. IEEE Trans Ind Inform 2018; 15(7): 3877–3888.

34. Kangwei, Xin, Qiushi, et al. Application of improved least-square generative adversarial networks for rail crack detection by AE technique. Neurocomputing 2019; 332: 236–248.

35. Esser, Kulal, Blattmann, et al. Scaling rectified flow transformers for high-resolution image synthesis. In: Proceedings of the 41st international conference on machine learning (ICML 2024), Vienna, Austria: PMLR.

36. Bubeck, Chandrasekaran, Eldan, et al. Sparks of artificial general intelligence: early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

37. Liu, Gong, Liu. Flow straight and fast: learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

38. Rombach, Blattmann, Lorenz, et al. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.

39. Radford, Kim, Hallacy, et al. Learning transferable visual models from natural language supervision. In: International conference on machine learning, 2021, pp. 8748–8763. Virtual: PMLR.

40. Cherti, Beaumont, Wightman, et al. Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 2818–2829. Vancouver, Canada: IEEE.

41. Raffel, Shazeer, Roberts, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 2020; 21(140): 1–67.

42. Peebles, Xie. Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4196–4206. Paris, France: IEEE.

43. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inform Process Syst 2017; 30: 5998–6008.

44. Özgenel ÇF, Sorguç. Performance comparison of pretrained convolutional neural networks on crack detection in buildings. In: Proceedings of the ISARC the international symposium on automation and robotics in construction, 2018. IAARC Publications.

45. Jimin. Crack detection v2 dataset, Roboflow, Editor. Roboflow, 2022.

46. project-mb5rm. Detection of crack dataset, Roboflow, Editor. Roboflow, 2023.

47. SCITSKKU. RI dataset dataset, Roboflow, Editor. Roboflow, 2023.

48. Dwyer, Nelson, Hansen, et al. Roboflow (version 1.0), https://roboflow.com (2024, accessed 15 February 2024).

49. Zhang, Huang, et al. Robust multitask compressive sampling via deep generative models for crack detection in structural health monitoring. Struct Health Monit 2024; 23(3): 1383–1402.

50. Heusel, Ramsauer, Unterthiner, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv Neural Inform Process Syst 2017; 30: 6629–6640.

51. Wang, Bovik, Sheikh, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 2004; 13(4): 600–612.

52. Szegedy, Vanhoucke, Ioffe, et al. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

53. Deng, Dong, Socher, et al. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, Miami, FL, 2009, pp. 248–255. USA: IEEE.

54. Jocher G. Ultralytics YOLOv5 (version 7.0). Ultralytics. https://github.com/ultralytics/yolov5 (2020).

55. Redmon, Farhadi. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, 2017, pp. 7263–7271. USA: IEEE.

56. Bochkovskiy, Wang C-Y, Liao H-YM. YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

57. Lin T-Y, Dollár, Girshick, et al. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, 2017, pp. 2117–2125. USA: IEEE.

58. Zhang, Wen, Bian, et al. Single-shot refinement neural network for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, 2018, pp. 4203–4212. USA: IEEE.

Advancing crack detection with generative AI for structural health monitoring
