Abstract
A large quantity of civil infrastructure in North America is near the end of its design life. Consequently, routine visual structural inspections are increasingly necessary to ensure the safety and efficient management of the infrastructure stock. The increasing need for inspections and the laborious nature of the work have strained the inspection industry. To improve inspection efficacy, various researchers have proposed novel deep learning methodologies to automatically classify, detect, and segment structural defects from images. After the defects are identified, it is often desirable to quantify their size for severity classification and repair cost estimation. Yet, measuring defects from a single image is not a trivial task, requiring supplementary data or sensor inputs that may not be practical or economical in the current inspection process. In this study, we propose to recover the three-dimensional geometry of a scene from a single image using deep learning-based monocular depth estimation. The monocular depth estimation field has made great progress by leveraging deep learning and a plethora of open red, green, blue, and depth (RGB-D) datasets. However, there is no publicly available in situ Light Detection and Ranging (LiDAR) RGB-D dataset for the civil engineering domain, which is a barrier for researchers developing and evaluating spatial computer vision methods in the civil engineering context. To bridge this gap, we build a LiDAR-based RGB-D dataset for training monocular depth estimators. Then, using this civil RGB-D dataset, we test a solution for the real-world application of monocular depth estimation to quantify defects in civil infrastructure.
Introduction
In North America, a large quantity of infrastructure was built in the mid-20th century and has reached the end of its design life. Presently, the deterioration of infrastructure is outpacing our ability to replace it. This has greatly increased the need for routine inspections that monitor the condition of the infrastructure to inform maintenance schedules and replacement decisions. The most common inspection method is visual inspection, since external changes in the appearance of a structure are often a good indicator of structural deterioration. By exploiting recent developments in computer vision and deep learning, numerous researchers have sought to automate visual inspection, which can improve inspection outcomes and reduce costs.1
A typical visual inspection involves inspectors studying the visual appearance of the infrastructure. The inspectors then note any defects and, if applicable, make measurements or visual estimates of the defects for records and repair estimates. Researchers have proposed classification algorithms to identify defects in an image.2–7 To localize defects in images, researchers have proposed various detection algorithms.8–11 To quantify the precise dimensions of defects in images (e.g., crack length/width), researchers have proposed various semantic segmentation algorithms.12–16
Automated visual inspection methods typically quantify defects in image coordinates (i.e., pixels). Converting defect measurements from pixels to actual (metric) units is challenging, since the inverse projection from image to world requires a scale prior (i.e., depth). The most direct way to obtain depth information is to utilize specialized depth sensors such as stereo cameras, Light Detection and Ranging (LiDAR), laser scanners, and so on.17 However, depth sensors are often expensive and may not always be available or practical. A popular alternative to depth sensors is leveraging correspondences between multiple images using structure from motion (SfM).18 SfM is a very popular method for large-scale civil surveys and asset inspection (i.e., building façades, bridges, and dams), typically paired with unmanned aerial vehicles (UAVs) for data collection.19 Then, using the UAV's inertial measurement units and real-time kinematic global positioning system, it is possible to calculate scale information to transform the SfM reconstruction to real scales. Thus, provided an image and depths from an SfM reconstruction, it is possible to quantify various defects.19–23 However, in practice, depth sensors and UAV-based SfM reconstructions are often not practical or economical for many inspections, such as short-medium span overpass bridges.
Currently, the Ministry of Transportation Ontario (MTO) in Canada requires inspectors to capture an image of each defect.24 In order to include a scale prior in the defect image, it is common for inspectors to place a known-size (reference) object (i.e., pen, notebook, ruler, etc.) near or co-planar to the defect.7 The reference object method allows inspectors to quickly collect data in the field, which reduces costs incurred due to traffic disruptions and the inspector's time on site. However, the reference object method can be ergonomically challenging when the defect is large (i.e., it is difficult to fit a large spalling in the camera frame while holding a measuring stick to the structure) and requires inspectors to manually inspect each image to find the appropriate pixel-to-metric scaling factor. Thus, it is desirable to infer metric depth from a single image, without reference objects, which can be directly utilized to quantify defects, similar to depths from sensors or UAV SfM reconstructions.
A promising field of research that attempts to infer depth from single images is monocular depth estimation, which learns depth cues such as known objects, vanishing lines, textures, and so on to regress depth. Numerous studies have shown that, given an image, a deep monocular depth estimation model can produce a depth map of the scene, like that from a depth sensor. In contrast to the reference object method, monocular depth estimation for defect quantification can significantly reduce data collection effort and costs, and eliminate the variability of visually estimating defect sizes.
Developments in monocular depth estimation methods are enabled by the availability of abundant red, green, blue, and depth (RGB-D) datasets, where, in addition to the red-green-blue channels of a color image, there is a depth channel, such that the intensity of each pixel indicates the distance between the imaged scene and the camera center. RGB-D data are valuable for civil inspection, for example, to quantify structural deformations.25 Additionally, RGB-D data have been shown to benefit semantic defect detection research through RGB-D fusion convolutional neural networks (CNNs).26 However, RGB-D datasets in the civil domain are rare; as a result, researchers are commonly forced to use RGB-D datasets from adjacent domains or create ad-hoc depth data standards that may not be applicable to other works.26–28 Currently, to the best knowledge of the authors, there is no LiDAR RGB-D dataset of in situ civil infrastructure scenes,29 which has significantly limited researchers' ability to benchmark deep monocular depth estimation methods for the civil domain.27,28
The purpose of this work is to evaluate the real-world application of monocular depth estimation for defect quantification on civil infrastructure, enabled by a novel LiDAR civil RGB-D dataset. First, a custom scanner is created and used to efficiently collect a civil RGB-D dataset of five railway overpass bridges with distinct construction styles. Second, the scan data are processed to produce densified RGB-D frames through scan accumulation. This is accomplished by utilizing local poses generated from simultaneous localization and mapping (SLAM) and a depth buffer. Finally, the collected civil RGB-D frames are used to evaluate the applicability of popular monocular depth estimation methods for monocular defect quantification. Due to the large variety of civil infrastructure and the challenges of collecting additional civil RGB-D data, model generalizability is proposed as an important performance indicator. The monocular depth estimation methods used in this work are evaluated on an unseen bridge, in zero-shot and few-shot configurations; their effects on model performance and spalling defect quantification accuracy are used as a case study.
The main contribution of this work is the development of an in situ LiDAR civil RGB-D dataset, which we use to fine-tune monocular metric depth estimation methods for the civil domain, specifically with the objective of measuring structural defects from single images. Our study finds a direct correlation between quantification accuracy and the fidelity of depth estimation, which implies that depth estimation performance metrics may be used as a proxy to compare models' defect quantification accuracy. However, due to the high cost of collecting civil RGB-D data, depth estimation models with high generalizability are necessary to enable monocular defect quantification. In this study, we find that pre-tuned vision transformer monocular depth estimation models greatly outperform fully convolutional neural network-based methods, potentially indicating an efficient path toward robust monocular defect quantification and other civil applications of monocular depth estimation.
Method overview
The objective of the monocular defect quantification method is to enable quantitative visual inspections from only a single image, without relying on extra sensor data. The basic premise is that, provided an image with sufficient contextual information, a deep learning method can be trained to estimate the three-dimensional (3D) geometry of the scene. Then, given the defect locations on the image, measurements can be obtained from the estimated 3D geometry. The benefit of this method is that the real size of a defect can be rapidly and directly estimated from a single image, and the proposed method can readily be incorporated with existing vision-based semantic visual inspection algorithms, with the goal of a fully automated quantitative structural defect inspection procedure.
Figure 1 presents an overview of the proposed monocular defect quantification methodology. As a preliminary step, it is necessary to calibrate the camera used to collect inspection images (RGB). Camera calibration ensures the images from the camera are minimally distorted and allows the calculation of the camera intrinsic matrix ($K$), which relates 3D points in the camera frame to image pixel coordinates.

3D defect measurement through monocular depth estimation. 3D: three-dimensional.
For the actual measurement, the selected pixel coordinates are reprojected into 3D via pinhole camera geometry. In Equation (7), the 3D coordinate ($P$) is computed by premultiplying the homogeneous pixel coordinate vector ($\tilde{p} = [u, v, 1]^T$) by the inverse camera intrinsic matrix and scaling by the estimated depth ($d$) at that pixel:

$$P = d\,K^{-1}\,\tilde{p} \qquad (7)$$
The ability to recover the 3D information at each pixel can be used to quantify the defect. As a potential application, inspectors quantify spalling using extreme points (i.e., maximum length and width), inspired by the Ontario Structural Inspection Manual (OSIM, pages 1–2–6), the bridge inspection manual utilized by MTO. Inspectors manually select pairs of extreme points of a defect on the RGB image; the points are projected into 3D, and their Euclidean distance is returned.24 The severity of the defect is classified based on these dimensions. Multiple defects can be measured if they are present in the same image.24 Using the precomputed depth image, additional measurements can be made at low marginal computational cost. Thus, in addition to the maximum dimension required by OSIM, users can also calculate a defect's shorter (i.e., secondary) dimension, which is useful to estimate repair cost, as shown in the "Experiment" section.
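To make this concrete, the following is a minimal sketch of the inverse projection and extreme-point measurement described above, assuming a pinhole camera with intrinsic matrix K and a metric depth map aligned to the RGB image (all names and example values here are illustrative, not taken from the released implementation):

```python
import numpy as np

def pixel_to_3d(u, v, depth_map, K):
    """Inverse-project pixel (u, v) to a 3D camera-frame point, per Equation (7)."""
    d = depth_map[v, u]                      # metric depth (m) at the pixel
    p_h = np.array([u, v, 1.0])              # homogeneous pixel coordinate
    return d * (np.linalg.inv(K) @ p_h)      # P = d * K^-1 * p~

def defect_dimension(pt_a, pt_b, depth_map, K):
    """Euclidean distance (m) between two selected extreme points of a defect."""
    return np.linalg.norm(pixel_to_3d(*pt_a, depth_map, K)
                          - pixel_to_3d(*pt_b, depth_map, K))

# Illustrative usage: primary (maximum) spalling dimension from two clicked pixels
K = np.array([[600.0, 0.0, 384.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])             # placeholder intrinsics
depth = np.full((480, 768), 5.0)            # placeholder metric depth map (m)
length_m = defect_dimension((120, 200), (410, 260), depth, K)
```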
Related work
The critical component of the proposed methodology is monocular depth estimation, which estimates depth from a single two-dimensional (2D) image. While it is well understood that binocular vision enables depth perception, humans can still navigate their surroundings with one eye, though at a reduced capacity. The widely accepted explanation is that humans learn visual cues that provide insight into the depth of a scene. Many researchers have attempted to tackle monocular depth estimation, often to different ends, which typically affects what "depth" means: relative (i.e., relatively consistent depths but not up to scale, such as from SfM), ordinal (i.e., foreground versus background), or metric (i.e., meters). Since the proposed application in this study is to estimate the real size of defects, only metric depth estimation methods are reviewed.
The monocular depth estimation field took off after the introduction of deep CNNs. Eigen et al. utilized a coarse-to-fine scale CNN, trained with a scale-invariant loss.32 Li et al. attempted to improve pixel-level depth estimation through local feature similarity by first estimating super-pixel-level, then pixel-level, surface normals and depths, which are postprocessed with conditional random fields (CRFs).33 Xu et al. utilized CRFs to integrate multiple-scale intermediate depths during depth up-sampling.34 CRFs are an effective way to enforce consistency and continuity in depth predictions. Laina et al. proposed utilizing convolutional residual networks (i.e., ResNets) instead of the AlexNet-inspired architecture first proposed by Eigen et al.35,36 The residual connections allowed the implementation of deeper networks and enabled the network to learn depth up-sampling without CRFs for postprocessing.35 Recently, deep residual networks have become the de facto standard for monocular depth estimation, with many researchers proposing incremental improvements. Fu et al. proposed to re-pose depth regression as ordinal classification.37 Yin et al. incorporated geometric constraints through an additional loss term that penalizes the difference in the normal computed from three randomly chosen points.38 Lee et al. proposed a local planar guidance (LPG) layer module, which helps up-sample encoded features under a local planar assumption.39
Most recently, transformer models have, as in many other computer vision tasks, begun to proliferate in monocular depth estimation studies.40–42 Transformer models have many advantages over deep CNNs, mainly the ability to learn long-range dependencies (i.e., a global receptive field) through tokenization and attention mechanisms, and a greater capacity to encode large amounts of data.43 Where CNN researchers previously proposed changes to model architectures, Ranftl et al. proposed a data-centric perspective for monocular depth estimation.40 Instead of training and testing on a single dataset, Ranftl et al. set state-of-the-art benchmarks by training with various public RGB-D datasets (up to 12). To accomplish this, they first normalized the depths of the datasets and used a multiobjective loss function, which is necessary to overcome loss instability due to differences in depths between datasets.40 A limitation of the method is that, at inference, their model can only output relative depth maps, which have limited utility. Recently, Bhat et al. extended the work of Ranftl et al. by pretraining with relative depth datasets and adding a metric-bins decoder, which allows fine-tuning on a metric depth dataset.42
A challenge with using larger models with an increased number of parameters is the risk of overfitting. In the context of a vision model, a sufficiently large model can "memorize" the instances in the training set. Overfitting can result in poor model performance on unseen data; the model is said to fail to generalize to the data. The ability of a model to generalize, known as "generalizability," is important to the real-world applicability of monocular depth estimation in the civil domain, due to the sometimes bespoke nature of civil infrastructure and the high cost of collecting additional data. For example, structural designs depend on local demand, collecting spatial data requires mobilization of sensors to site, and site conditions may require additional safety and regulatory considerations. Consequently, the lack of civil RGB-D datasets has necessitated that civil researchers in depth estimation study more constrained problems and utilize smaller models.27,28
Conventionally, in defect detection (i.e., defect detection and segmentation), researchers fine-tune a large model by freezing early parameters and allowing later parameters to be updated on the fine-tuning dataset. However, these conventional ideas of fine-tuning are not applicable to metric monocular depth estimation.44 This is because metric depth values can depend heavily on the camera parameters, and differences in the distributions of depths between datasets can create unstable losses, which are detrimental to model training.42
We recognize that the civil RGB-D dataset presented in this study is relatively small and limited in its range, particularly when compared to more established RGB-D benchmarks. Gathering a comprehensive RGB-D dataset for civil structures is inherently challenging due to the vast scale of structures and their restricted accessibility. This limitation raises the risk of overfitting, especially when employing large, state-of-the-art transformer-based models for monocular depth estimation. Therefore, for domain-specific tasks, it may be more suitable to train a smaller, CNN-based model for monocular depth estimation. However, Bhat et al. have recently suggested the feasibility of "fine-tuning" a model designed for relative depth estimation to perform metric depth estimation.42
Thus, in this work, the proposed civil RGB-D dataset is used to compare the performance and generalizability of a representative CNN model and "Zero-shot Transfer by Combining Relative and Metric Depth: ZoeDepth" (ZOE), proposed by Bhat et al. As the representative CNN-based monocular depth estimation model, we select "Big to Small: Multi-scale Local Planar Guidance for Monocular Depth Estimation," proposed by Lee et al., popularly known as Big To Small (BTS). BTS is a strong choice for learning monocular depth due to its performance on key benchmarks and its ease of training as a lightweight model. Additionally, BTS is uniquely suitable for civil applications due to its LPG module, a compatible inductive bias that can aid the prediction of the large planar surfaces common to civil structures. In contrast, ZOE makes no depth consistency assumptions and requires much greater computational power to train and run inference; yet, because it can leverage a much greater corpus of training data, it should exhibit greater model generalization than BTS and other representative CNN methods, which has a profound positive impact on real-world applications of monocular depth estimation such as monocular defect quantification.
BTS uses a typical encoder-decoder architecture, where a CNN is used for feature extraction, and squeezed features are recovered using atrous spatial pyramid pooling and residual connections from the CNN encoder. At various stages (scales) of the decoder, the up-sampled feature maps are used to predict depths via the LPG module.39 The purpose of the LPG module is to convert the intermediate decoder features at each scale to the full input resolution by explicitly regressing the four-dimensional plane coefficients at each pixel. The LPG module can better regularize depth predictions with fewer learned parameters, which may be beneficial to model generalization. Finally, the full-resolution depth maps at each scale are concatenated, and a final convolutional layer outputs the final depth prediction.39
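As a rough sketch of the LPG idea (following the plane-to-depth relation described in the BTS paper; the function and variable names here are ours):

```python
def lpg_depth(n1, n2, n3, n4, u, v):
    """Depth at local patch coordinates (u, v) from the regressed 4D plane
    coefficients, under the local planar assumption:
    depth = n4 / (n1*u + n2*v + n3), with (n1, n2, n3) a unit normal."""
    return n4 / (n1 * u + n2 * v + n3)
```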
In contrast, ZOE utilizes a transformer encoder-decoder architecture as proposed by Ranftl et al., followed by the proposed metric depth learning module (metric bins).42 The metric bins module utilizes multiscale features from the decoder and a multilayered perceptron (MLP) at each scale to predict which discrete depth range (depth bin) each pixel belongs to, such that each pixel's coarse depth is the center of its depth bin. To refine the depth prediction, Bhat et al. adjust the bin center predictions with attractor points using another MLP along the depth interval. Finally, the adjusted depth bin centers are linearly combined by their probabilities, calculated using the log-binomial probability predicted for each bin center. The metric bins module essentially poses depth estimation as a scale-and-shift problem, assuming the model already produces good relative depth predictions. This allows ZOE to learn the meta scale and shift parameters of a particular dataset, thus avoiding the potentially unstable losses during fine-tuning with metric depth data, while leveraging the greater spatial understanding gained from pretraining on a large corpus of diverse RGB-D datasets. In effect, a relatively small metric dataset can be used to fine-tune ZOE for a domain-specific monocular metric depth estimation application.
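A simplified sketch of this final combination step, assuming per-pixel bin centers and probabilities have already been predicted (the attractor refinement and log-binomial details are omitted):

```python
import numpy as np

def depth_from_bins(bin_centers, bin_probs):
    """Final metric depth as the probability-weighted linear combination
    of the refined depth bin centers.
    bin_centers, bin_probs: arrays of shape (num_bins, H, W)."""
    return np.sum(bin_probs * bin_centers, axis=0)   # (H, W) metric depth map
```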
RGB-D data collection
Hardware selection
Civil engineers inspect a wide variety of structures including bridges, culverts, dams, buildings, parking garages, and so on. We consider a hardware solution that is practical for collecting spatial data across this wide range of inspectable structures. For example, the spatial sensor should function both indoors, when inspecting from within a structure, and outdoors, when inspecting from its exterior. Indoor scenes may be low-light and feature-scarce, while inspecting from a structure's exterior can present access issues, which necessitates greater sensor range. Lastly, the sensor should be economical and easy to use for rapidly generating RGB-D frames from multiple perspectives.
Among the many considered spatial sensing paradigms, such as structured light, stereo cameras, SfM, and time-of-flight sensors, it was found that time-of-flight sensors best meet the requirements outlined previously. For example, structured light sensors are not suitable for outdoor use, while stereo cameras and SfM rely heavily on image features, which may be unreliable in civil scenes.18,45,46 Additionally, there are two popular types of time-of-flight sensors: mobile LiDAR and static terrestrial laser scanners.29 Terrestrial laser scanners are highly accurate and have a long sensing range; however, they are expensive and unsuitable for rapidly scanning structures with complex geometries, due to the time-consuming task of relocating the scanner to obtain optimal coverage and the diverse perspectives needed to create an RGB-D dataset for training.47,48 In contrast, mobile LiDAR is lower cost and can rapidly collect spatial data from the perspectives of the inspector, which makes it most suitable for creating an RGB-D dataset.49,50 With these considerations, we build a custom mobile LiDAR scanner to collect RGB-D data from real-world structures.
Mobile scanning system
An objective of this work is to reduce the barrier for researchers to replicate the hardware and contribute their own scan data. To this end, only commercially available sensors and hardware are used; the spatial sensors used in this work are the Livox Avia LiDAR and the Intel Realsense D455 camera, shown in Figure 2. The Intel Realsense camera is mounted on top of the Livox Avia LiDAR using a custom 3D-printed bracket; additionally, a bottom plate with a standard camera mount thread is attached below the LiDAR, allowing the sensor to be easily adapted to different platforms. Figure 2 shows the scanner in handheld and backpack configurations, which are ideal for scanning short-medium span overpass bridges, like those in the "Experiment" section.

User carrying LiDAR scanner in backpack format (top). Custom-built handheld scanner equipped with Intel Realsense D455 camera and the Livox Avia for collecting dense depth images (bottom).
The Intel Realsense D455 camera is a highly modular, economical, Robot Operating System (ROS)-compatible computer vision camera.51
The Intel Realsense's RGB camera is capable of capturing images with a resolution of 1280 × 800 pixels.
To collect the depth data, the Livox Avia LiDAR sensor was used. The Livox Avia, like other LiDARs (e.g., Ouster LiDARs), has a range of 3–200 m with an average depth measurement error of 1–2 cm.52 Unlike rotating LiDARs, the Livox Avia is a semisolid-state LiDAR with a 70° field-of-view that scans in a nonrepeating rosette pattern. The nonrepeating rosette pattern places more scan points within the camera's field-of-view, comparable to much more expensive spinning LiDARs. The high overlap between the camera and LiDAR fields-of-view greatly aids in creating denser depth images during the postprocessing described in the "Postprocessing" section.
For the camera and LiDAR to work as a unit, they need to be extrinsically calibrated. In this work, extrinsic calibration was done after affixing the camera to the LiDAR using a custom 3D-printed bracket, which bolts to existing mounting points on the LiDAR and camera housings. This allows the extrinsic calibration to be embodied in the 3D-printed mount, such that researchers can directly 3D print the mount and connect the camera and LiDAR without additional calibration steps.53 Finally, the sensors were plugged into an Intel NUC mobile computer for collecting and storing data. In the backpack configuration seen in Figure 2, the NUC, batteries, and miscellaneous hardware components are stowed in the backpack with a single wiring loom going over the user's right shoulder, allowing single-handed operation of the scanner. This setup enables a single operator to collect high-density RGB-D data of many types of civil infrastructure for training monocular depth estimation in the civil domain.
Postprocessing
A challenge of integrating camera and time-of-flight sensors is that cameras take a near-instantaneous snapshot of a scene, while LiDARs continuously send and evaluate the time-to-return of each laser pulse. Whereas a camera sensor's resolution is measured in millions of pixels, a LiDAR's sampling rate is measured in hundreds of thousands of points per second. The disparity between the LiDAR's sampling rate and the pixels per image results in very high sparsity in the depth image, which is unusable for deep learning tasks since sparse depth maps make inefficient supervision targets, resulting in poor training outcomes. To address this issue, depth accumulation was used, where a stream of LiDAR scan points is gathered to render a depth image corresponding to an RGB image.54
Since the pose of the sensor changes between sequential LiDAR scans, it is necessary to determine the relative poses of the sensor at each scan. To obtain the relative pose of the sensor, an open-source SLAM algorithm called R3LIVE was employed, which leverages the LiDAR scans, images, and IMU measurements to create a point cloud in the map frame.55 A well-known problem with SLAM methods is drift, where small errors are compounded over time, resulting in incorrect odometry and point cloud maps.55 A popular way to mitigate drift is loop closure, where the user revisits a scanned location (e.g., the starting point), and the difference between the relocalization pose and odometry is resolved using pose-graph optimization.56 However, relocalization in civil environments is challenging, and closing loops during inspection may not be practical (e.g., crossing the street).57 Thus, to mitigate the risk of drift, only the most recent LiDAR scans are considered when rendering the depth map, which assumes that local relative poses are accurate.54 In practice, the local point cloud map is a queue of fixed size, where new points displace the oldest points in the queue. In this work, the latest 250,000 points from the scans prior to an RGB image's timestamp are stored; this is referred to as depth accumulation. This ensures that there are minimal errors between the local LiDAR point cloud and camera poses due to drift. Finally, the resulting accumulated LiDAR point cloud is projected onto the image plane of the RGB camera, assuming the pinhole camera model, using the camera pose estimated from SLAM. Figure 3 shows a visual representation of the accumulated LiDAR point cloud being projected onto an image frame.
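A minimal sketch of this fixed-size accumulation queue (the structure and names are illustrative; the released postprocessing scripts may differ):

```python
from collections import deque
import numpy as np

class PointQueue:
    """FIFO of map-frame LiDAR points; once capacity is reached,
    newly added points displace the oldest points in the queue."""
    def __init__(self, max_points=250_000):       # capacity used in this work
        self._points = deque(maxlen=max_points)

    def add_scan(self, scan_xyz):
        """scan_xyz: (N, 3) scan points already transformed to the map frame."""
        self._points.extend(map(tuple, scan_xyz))

    def as_array(self):
        return np.asarray(self._points)           # (M, 3), M <= max_points
```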

Depth image rendering from accumulated LiDAR point cloud using the pinhole camera model.
To render a depth image, we find, at each pixel, the shortest length ($z$) of the ray from the camera center among the accumulated points that project onto that pixel. This is accomplished in three steps.
First, the accumulated point cloud needs to be transformed from the map frame to the camera frame. This transformation requires the extrinsic parameters, composed of the rotation matrix ($R$) and translation vector ($t$) from the map frame to the camera frame, derived from the SLAM-estimated camera pose, as shown in Equation (1):

$$P_C = R\,P_M + t \qquad (1)$$

The transformation in Equation (1) ensures that the XY-plane is parallel to the image plane, and thus the Z component of each transformed point represents its depth in the image.
Second, it is necessary to project each point in the camera frame onto the image plane, per the pinhole (projective) camera model, to obtain its pixel location. This projection requires the camera intrinsic matrix ($K$), as shown in Equation (2):

$$z_C\,\tilde{p} = K\,P_C \qquad (2)$$

where $\tilde{p} = [u, v, 1]^T$ is the homogeneous pixel coordinate of the point and $z_C$ is its depth. For each 3D point, the resulting x and y coordinates in the image plane are then rounded to the nearest integer, such that each depth value is assigned to a specific pixel coordinate in the image.
Lastly, since the mapping from the set of 3D points in the camera frame to image pixel coordinates is not bijective, ambiguity can arise about the depth at a pixel; for example, multiple 3D points may project to the same pixel. This is typically caused by occlusions due to ego-motion of the sensor. To resolve the correct depth image, we employed the depth buffer algorithm, which guarantees that when multiple 3D points project onto the same image pixel, the one with the smallest depth value is displayed.58 This approach prioritizes 3D points closer to the camera over those further away when producing the depth image.
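Putting the three steps together, a simplified NumPy implementation of the depth image rendering might look as follows (assuming map-to-camera extrinsics R, t and intrinsics K; bounds and validity checks are kept minimal):

```python
import numpy as np

def render_depth(points_map, R, t, K, h, w):
    """Render a depth image from accumulated map-frame points.
    Returns an (h, w) array; 0 marks pixels without depth."""
    # Step 1: transform points from the map frame to the camera frame (Equation (1)).
    P_c = points_map @ R.T + t
    P_c = P_c[P_c[:, 2] > 0]                     # keep points in front of the camera
    # Step 2: pinhole projection to pixel coordinates (Equation (2)), then round.
    uvw = P_c @ K.T                              # rows are (z*u, z*v, z)
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = P_c[:, 2]
    keep = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[keep], v[keep], z[keep]
    # Step 3: depth buffer -- when points collide on a pixel, the nearest wins.
    depth = np.zeros((h, w))
    far_to_near = np.argsort(-z)                 # write far points first...
    depth[v[far_to_near], u[far_to_near]] = z[far_to_near]  # ...near overwrites far
    return depth
```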
This three-step process is repeated for each image with an estimated pose from SLAM, which generates a dense depth image for each RGB image. We note that there are many configurations for postprocessing, such as the number of accumulated points, the use of points from past and future scans, and the image size, which affect the density of the depth image and the computational effort to render it. The specific configuration will depend on the user's application; thus, we provide the raw scan data and the postprocessing scripts along with this dataset.
For training monocular depth estimation in this work, images from the Intel Realsense are downscaled by a factor of 0.6, yielding a resolution of 768 × 480 pixels.
Experiment
Bridge RGB-D datasets
The custom mobile scanning hardware was used to collect data from five in-service bridges near downtown Kitchener, Ontario, Canada. These bridges support an elevated railway running east-west, crossing over Belmont St., Iron Horse Trail, Park St., King St., and Weber St., shown in Figure 4 (ordered from top to bottom and left to right). These bridges are labeled B1–B5.

Five bridges utilized to construct the RGB-D dataset for this study.
The data collection took place from pedestrian-accessible paths and crossings surrounding each bridge. An effort was made to point the scanner at the bridge and to avoid scanning areas closer than the minimum range of the LiDAR sensor (3 m). The sensor was held in front of the user at chest height, typically pointed in the direction of travel; the user can pitch the sensor at the elbow to ensure coverage of the bridge soffit. Images were collected at 30 Hz for the duration of each scanning session, which ranged from 3 to 7 min per bridge. In total, approximately 40,500 RGB-D frames were processed from all five bridges (B1: 7,294 frames; B2: 4,169 frames; B3: 7,024 frames; B4: 10,523 frames; B5: 11,474 frames).
Figure 5 shows sample RGB-D frames that include color and depth images. Lighter intensity pixels denote closer objects and vice versa, and the gray regions are pixels without depth data. The circular pattern of depth points results from the nonrepeating rosette scanning pattern of the Livox Avia LiDAR. Due to the difference in field-of-view between the Livox Avia LiDAR (70°) and the Intel Realsense D455 (90°), there may be a lack of depth values at the left and right borders of the depth image. This is most apparent when the sensor is stationary and less apparent as the user moves and rotates the scanner to scan the structure.

Sample RGB-D frames (RGB and depth images).
Due to the high sampling rate of the scanning system, the collected RGB-D frames may contain a significant number of redundant frames, such as subsequent frames with substantial overlap. Redundant frames do not significantly improve the accuracy of the depth estimation model and only increase the training time.40 To address this issue, the frames from each bridge are down-sampled based on the frame locations estimated from SLAM. Thus, to prepare the dataset, only frames whose relative displacement exceeded 10 cm from the previously chosen frame were selected. Consequently, 7,748 RGB-D bridge frames were selected for model development in the "Model development" section.
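One plausible reading of this down-sampling rule, as a short sketch (assuming per-frame camera positions from SLAM; here each frame is compared against the last kept frame):

```python
import numpy as np

def select_keyframes(positions, min_disp=0.10):
    """Keep a frame only if it moved at least min_disp (m) from the
    previously kept frame. positions: (N, 3) SLAM camera positions."""
    kept = [0]
    for i in range(1, len(positions)):
        if np.linalg.norm(positions[i] - positions[kept[-1]]) >= min_disp:
            kept.append(i)
    return kept   # indices of the selected RGB-D frames
```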
Model development
After postprocessing in the "Postprocessing" section, the downscaled RGB-D frames are used to train the BTS and ZOE models; the key training parameters are listed below.
Key parameters for model training.
The models were evaluated using standard depth estimation metrics that have been recommended in various previous studies:39 average absolute relative error (AbsRel) in Equation (3), root mean square error (RMSE) in Equation (4), average log10 error (Log10) in Equation (5), and delta under thresholds (%) in Equation (6):

$$\text{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{|d_i - \hat{d}_i|}{d_i} \qquad (3)$$

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_i - \hat{d}_i\right)^2} \qquad (4)$$

$$\text{Log10} = \frac{1}{N}\sum_{i=1}^{N}\left|\log_{10} d_i - \log_{10} \hat{d}_i\right| \qquad (5)$$

$$\delta_k = \frac{100}{N}\left|\left\{i : \max\!\left(\frac{\hat{d}_i}{d_i}, \frac{d_i}{\hat{d}_i}\right) < 1.25^k\right\}\right|, \quad k = 1, 2, 3 \qquad (6)$$

Here, $d_i$ and $\hat{d}_i$ denote the ground-truth and predicted depths at pixel $i$, and $N$ is the number of pixels with valid ground-truth depth.
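These metrics can be computed directly from the ground-truth and predicted depth maps; a minimal sketch over valid pixels (names are ours):

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth metrics (Equations (3)-(6)) over pixels with valid depth."""
    mask = gt > 0                      # only pixels with ground-truth depth
    d, d_hat = gt[mask], pred[mask]
    ratio = np.maximum(d_hat / d, d / d_hat)
    return {
        "AbsRel": float(np.mean(np.abs(d - d_hat) / d)),
        "RMSE": float(np.sqrt(np.mean((d - d_hat) ** 2))),
        "Log10": float(np.mean(np.abs(np.log10(d) - np.log10(d_hat)))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```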

Comparison of AbsRel heatmaps for four spalling defects on bridge B3 using few-shot ZOE models.
Model generalizability
The main challenge with the bridge RGB-D dataset is that there are relatively few sequences and they are visually and structurally distinct. This is to be expected because infrastructure is typically built to site requirements, and additional bridge RGB-D data are expensive and logistically challenging to obtain. Therefore, it is advantageous for a monocular depth model to learn robust depth cues that are relevant across various types of infrastructure. Essentially, the ability of the model to generalize is a crucial factor in its practical effectiveness for the real-world application of monocular defect quantification.
To evaluate the generalizability of BTS and ZOE in the civil domain, we test depth estimation performance in "zero-shot" and "few-shot" configurations using the bridge RGB-D dataset. In the zero-shot configuration, training was done on the B1, B2, B4, and B5 sequences and inference on the unseen B3 testing sequence. This showcases the capacity of the model to learn robust depth cues that transfer to new scenes. With the zero-shot performance as a baseline, a small number of samples (10, 20, and 30 RGB-D frames) from the testing set (B3) are mixed into the training process, which is termed "few-shot" in this work. The underlying idea is that few-shot performance can further demonstrate the depth estimation capabilities of the model by extrapolating from a few testing samples (approximately 0.5, 1, and 1.5% of the B3 sequence), mixed into the training set, to infer the remainder of the testing sequence. This simulates the effect of training with a more comprehensive bridge RGB-D dataset and provides insights about the applicability of monocular depth estimation models to inspections of more regular infrastructure (e.g., highway overpass bridges, parking garages).
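The split construction can be expressed concisely as follows (a sketch; how the few-shot frames are drawn from B3 is illustrative, here a seeded random sample):

```python
import random

def build_split(frames_by_bridge, num_shots=0, seed=0):
    """Train on B1, B2, B4, B5; test on the unseen B3. For few-shot
    configurations, num_shots frames from B3 are moved into training."""
    train = [f for b in ("B1", "B2", "B4", "B5") for f in frames_by_bridge[b]]
    test = list(frames_by_bridge["B3"])
    if num_shots > 0:
        shots = random.Random(seed).sample(test, num_shots)
        train += shots
        test = [f for f in test if f not in shots]   # infer the remainder
    return train, test
```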
Table 2 shows the model performance evaluation using the metrics in Equations (3)–(6) for BTS and ZOE in zero-shot, 10-shot, 20-shot, and 30-shot configurations. From Table 2, it is clear that ZOE greatly outperforms the BTS model, demonstrating superior generalizability from the training sequences to the testing sequence (zero-shot) and a greater ability to adapt to the testing sequence with few samples (few-shot). Comparing zero-shot performance, ZOE demonstrates a 4% reduction (i.e., approximately 20% relative improvement) in AbsRel error and a 13% increase in the percentage of pixels under the δ < 1.25 threshold. Similarly, in the few-shot setting, ZOE performs significantly better than BTS. For example, from Table 2, ZOE 10-shot outperforms BTS 30-shot, and the ZOE 30-shot model nearly halves the AbsRel and Log10 errors of the BTS 30-shot model. This can also be observed in the improvement between zero-shot and 30-shot performance: the AbsRel error of BTS improved by 35.7%, while that of ZOE improved by 52.3%. These findings indicate that ZOE is much more efficient and applicable for learning robust depth cues in civil engineering settings, like bridges, which are diverse in construction and have a high cost of data collection. This suggests that ZOE-like methods could be valuable in developing monocular depth estimation applications, such as defect quantification, within the civil engineering domain.
Evaluation of the model performance using the metrics in Equations (3)–(6).
RMSE: root mean square error.
Defect quantification
The testing bridge B3 is a concrete slab bridge with a center pier, built in 1931. A major issue for B3 is spalling of concrete on the abutment wall, central pier, and soffit. Spalling defects have the potential to expose the underlying steel reinforcement in the structure, which results in accelerated structural degradation. Inspectors typically use a ruler to measure the length and width of a spalling defect, where the maximum length or width is used to classify severity and the total area is used to create repair estimates.24 However, this manual method can be complex and costly, in this case requiring traffic control and elevating equipment for inspectors to safely access and evaluate the damage. In contrast, we utilize monocular depth estimation to quantify spalling defects from single images taken from the sidewalk.
From a preliminary walk-through, four spalling defects were identified on bridge B3, labeled S1–S4, as depicted in Figure 6. To find a relationship between model performance and spalling quantification accuracy, the ZOE few-shot models are used. Pairs of 2D points corresponding to the largest primary and secondary dimensions of each spalling (its length and width) are manually selected from the testing images. The same 2D points are projected into 3D using both the ground-truth and predicted depth images, to ensure consistency. The largest dimension is marked as the primary (in red) and the smaller as the secondary (in yellow). These points' pixel coordinates are then converted into 3D coordinates in metric units using the inverse projection formula in Equation (7), where the 3D point is obtained by premultiplying the homogeneous pixel coordinate by the inverse camera intrinsic matrix and scaling by the depth at that pixel.
Table 3 reports the measurements of S1–S4 using the ZOE models described in the preceding section. Unlike the global depth metrics in Table 2, each defect measurement results from sampling two pixel values in the depth prediction, which can sometimes produce outlying measurements. However, the average measurement errors of the ZOE zero-shot, 10-shot, 20-shot, and 30-shot models are 32, 20, 13, and 6%, respectively. We believe these results can be meaningfully improved with further research; encouragingly, even with 10 testing samples, at 20% average measurement error, the ZOE model can provide a rough estimate of the size of spalling damage, which may be useful with quantized damage severity classes and estimation methods. For example, OSIM classifies spalling severity into four categories by the maximum measurement: light (less than 150 mm), medium (between 150 and 300 mm), severe (between 300 and 600 mm), and very severe (more than 600 mm).24 Moreover, for spalling repair estimates, it is common for the defect area to be rounded up to a convenient dimension (e.g., the nearest half meter (500 mm)), since loose concrete around the spalling may need to be chipped prior to the repair.59
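For reference, mapping a measured maximum dimension to these OSIM severity classes is a simple lookup (thresholds as cited above; the function name is ours):

```python
def osim_spalling_severity(max_dim_mm):
    """OSIM spalling severity class from the maximum measured dimension (mm)."""
    if max_dim_mm < 150:
        return "light"
    if max_dim_mm < 300:
        return "medium"
    if max_dim_mm < 600:
        return "severe"
    return "very severe"
```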
Spalling measurements (mm) using various ZOE models.
To further understand the potential causes of measurement error, AbsRel heatmaps are plotted in Figure 6, providing the per-pixel absolute percentage difference between the ground-truth and predicted depths. Regions of low AbsRel error are shown in dark blue, and errors are clipped at a maximum of 100%, shown in red.
The average defect measurement error is roughly correlated with the AbsRel metric of the model in Table 3. This indicates that a model's performance can be used as a proxy for its quantification accuracy. However, it is also observed from Figure 6 that depth prediction errors are not equally distributed. While the sidewalk, roadway, bridge slab, and pier are generally well estimated, there are error-prone regions, for example, at sharp depth transitions (i.e., bridge soffit to sky, bridge pier to abutment, and vegetation) and on the bridge soffit. In particular, the models have difficulty inferring the correct angle between the bridge soffit and pier: when the soffit and pier are not perfectly square, the error in the soffit grows as the prediction diverges from the ground truth, as seen in scenes S2 and S4 in Figure 6. The spatial context of the bridge soffit is challenging to learn, since images of the soffit typically contain fewer depth cues; this difference can be observed by comparing the predictions for spalling S3 and S4 in Figure 6, where spalling S3, near the side of the soffit, is more accurately measured than spalling S4 in the middle of the soffit. Encouragingly, S4 appears to be learnable, albeit requiring additional samples compared to S1, S2, and S3.
Lastly, we utilized the ZOE models to measure the same spalling defects, S1–S4, from images collected with a personal smartphone, since smartphones are commonly used by inspectors to photograph defects. In this experiment, a Google Pixel 6a was used; however, any modern device may be utilized, as long as the camera intrinsics can be obtained from its manufacturer. To control for the metric depth ambiguity caused by differences in focal length, the cellular images were scale- and crop-augmented such that the transformed image intrinsics match those of the Intel Realsense training images.
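One way to perform such a scale-and-crop augmentation is sketched below, assuming known source (smartphone) and target (Intel Realsense) intrinsics; this is our illustration, not necessarily the exact augmentation used:

```python
import cv2
import numpy as np

def match_intrinsics(img, K_src, K_tgt, target_wh):
    """Rescale so the focal length matches the target camera, then crop
    about the principal point to the target resolution (bounds assumed valid)."""
    s = K_tgt[0, 0] / K_src[0, 0]              # focal-length scale factor
    resized = cv2.resize(img, None, fx=s, fy=s)
    cx, cy = K_src[0, 2] * s, K_src[1, 2] * s  # principal point after resizing
    w, h = target_wh
    x0 = int(round(cx - K_tgt[0, 2]))          # align principal points
    y0 = int(round(cy - K_tgt[1, 2]))
    return resized[y0:y0 + h, x0:x0 + w]
```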

Raw (top) and augmented (bottom) images collected from the Google Pixel 6a; S1–S4 are the spalling defects for evaluating quantification performance.
Spalling measurements (mm) from Google Pixel 6a images using ZOE.
Despite transforming the camera intrinsics of the Pixel 6a, there exist many perceivable (i.e., sharpness, contrast, lighting) and imperceivable (i.e., rolling shutter, postprocessing artifacts) differences from the Intel Realsense images. Due to the challenge of synchronizing the camera and LiDAR, the training set contains only Intel Realsense RGB-D frames, so we expect depth estimation, and thus defect measurement, on Pixel 6a images to be worse. On average, we found a minor increase in measurement error between the Intel Realsense and Pixel 6a images. However, there are many confounding variables, such as the time of day and the relative position of the camera, that can produce unexpected results. For example, we found that the zero-shot and 10-shot predictions of S4 improved when using the Pixel 6a, likely due to the greater dynamic range of the image and the image containing more of the pier and road, which likely helped the depth estimation.39 This implies that a higher-quality camera could benefit future development of the mobile LiDAR system for collecting civil RGB-D data.
In summary, recent developments in monocular depth estimation make it an effective method to measure defects from single images with considerable accuracy. In this work, we focus on the ability of a monocular depth estimation model to generalize, as we believe generalizability is the biggest challenge for real-world application of monocular depth estimation in the civil domain. This is because there is a great diversity of structures in the civil domain and often a high cost to collect spatial data of those structures. In other words, models with high generalizability allow us to make the most of limited data, and we are optimistic that, through collaboration with the research community, a more extensive civil RGB-D dataset can be developed, facilitating a broad range of monocular depth estimation applications and research within the civil sector.
Conclusion
Defect quantification is an important part of visual structural inspections, as it informs defect severity and repair cost estimates. In this work, it is shown that monocular depth estimation can be used to quantify spalling defects from only a single image. This is enabled by a civil RGB-D dataset collected using a custom 3D mobile LiDAR scanner. The civil RGB-D dataset is used to evaluate the generalizability of two metric monocular depth estimation models: BTS and ZOE. It was experimentally found that ZOE, based on a large vision transformer, significantly outperformed BTS, a CNN model, in zero-shot and few-shot configurations. The high generalizability of ZOE makes it most suitable for civil domain applications, where RGB-D data of civil infrastructure are expensive to collect and the diversity of the infrastructure is large. Lastly, the few-shot ZOE models were used to measure the size of the spalling damage, which was compared to the LiDAR-measured ground truth. While the measurement accuracies are fit for purpose, there is room for improvement with additional training sequences, ZOE-like models with higher generalizability, and better hardware.
Depth estimation has many applications in the civil domain, such as semantic segmentation, path planning, and localization. We hope that other researchers will share their scan data to create a comprehensive spatial civil infrastructure dataset to advance the development of vision-based infrastructure inspections.
Supplemental Material
sj-pdf-1-shm-10.1177_14759217251316532: Supplemental material for "Learning monocular depth estimation for defect measurement from civil RGB-D dataset" by Max Midwinter, Zaid Abbas Al-Sabbag, Rishabh Bajaj and Chul Min Yeum, Structural Health Monitoring.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We acknowledge the support from Rogers Communications, Mitacs through Mitacs Accelerate Program, and the Natural Sciences and Engineering Research Council of Canada [RGPIN-2020-03979].
Supplemental material
Supplemental material for this article is available online.
References
