Abstract
The accuracy of underwater target recognition by autonomous underwater vehicles (AUVs) is essential for underwater detection, rescue, and security. Recently, deep learning has brought significant improvements to digital image processing for target recognition and classification, which has made underwater target recognition an active research field. This article systematically reviews the application of deep learning to underwater image analysis over the past few years and briefly explains the basic principles of various underwater target recognition methods. The applicable conditions, advantages, and disadvantages of each method are pointed out. The technical problems of AUV-based recognition of dangerous underwater targets are analyzed, and corresponding solutions are given. Finally, we discuss future development trends in AUV underwater target recognition.
Introduction
With the development of technology and the continuous growth of military power, countries around the world are shifting their military priorities to the ocean. Limited by the natural conditions of the ocean and the physical limits of human beings, it is impossible to exploit marine resources by human effort alone. An autonomous underwater vehicle (AUV) can perform underwater tasks independently. AUVs equipped with visual image acquisition equipment are therefore often used for real-time detection in the underwater environment, and they are widely used in military applications such as mine detection, intelligence collection, and offshore defense.
In 1997, the Naval Surface Warfare Center (Dahlgren, Virginia, USA) extracted and classified the features of each detected minefield using the k-nearest neighbor attractor neural network classifier and the optimal discriminatory filter classifier. 1 This method reduced false alarms and laid the foundation for deep learning in underwater dangerous target recognition. In 2003, to improve the efficiency of clearing landmines and unexploded ordnance, Carnegie Mellon University proposed a method for dealing with sensor uncertainty that uses geometrical and topological features instead of sensor uncertainty models, 2 which speeds up the demining process. To reduce the impact of the complex and changeable underwater environment, Cao et al. 3 proposed stacked autoencoder (SAE)-softmax, a method that combines an SAE with a softmax classifier to classify underwater targets. It achieved a recognition rate of up to 94.12%, higher than the radial basis function support vector machine (SVM) and probabilistic neural network (PNN) methods. Based on transfer-reinforcement learning, Cai et al. 4 proposed a multi-AUV target recognition approach that reduces the impact of the complex environment while improving the efficiency of target recognition. Its average recognition accuracy of 82.82% was higher than that of the six other methods compared under turbid water, object occlusion, insufficient light, complex backgrounds, and overlapping targets.
Recently, deep learning has gained great attention in the field of target recognition, where recognition accuracy is improved by training a model on a large number of samples. However, underwater target recognition differs from target recognition on land or in the air. The water medium scatters the target information, leaving it blurred or distorted. At the same time, the complex underwater environment, with time-varying ocean currents, uneven illumination, and turbid water, makes it difficult to collect target images. Moreover, the appearance and shape of hostile dangerous targets are diverse, so there are too few samples to train the model, which reduces recognition accuracy.
At present, to address the lack of datasets, transfer learning 5–7 can be used to train a model on a dataset with a large number of land or air targets and then transfer the model to the underwater domain. The generative adversarial network (GAN) 8,9 is a newer method that can generate underwater target images autonomously to increase the number of samples. Image preprocessing, image restoration, and reinforcement learning can also be used to reduce the impact of underwater environmental interference.
Most researchers have focused on image processing to improve the accuracy of target recognition. However, the information that can be extracted from a single image is limited. To address insufficient acquisition of target information, Cai et al. 10–12 introduced multiview light field reconstruction into the target recognition field. Target information can be collected through multiple views, 13 that is, multiple AUVs are used to recognize underwater dangerous targets. Luo et al. 14 introduced GANs to multi-AUV target recognition, which not only increased the accuracy of target recognition but also reduced the impact of the complex underwater environment.
The methods and studies above mainly introduce deep learning algorithms into underwater target recognition. However, recognizing dangerous targets in a real underwater environment raises various difficulties, including environmental conditions, interference, information collection, and missing sample information. In this article, we discuss different types of underwater dangerous target recognition technology, summarize the existing methods, identify the problems and technical difficulties of each technology, and look ahead to future development directions.
Underwater dangerous target recognition technology
Mine recognition technology
Mines are widely used in modern naval battles and play an important role. Mines can not only strike submarines and block maritime traffic routes but can also cause serious psychological burdens on enemy personnel. The application of advanced technology in mine weapons makes modern mines more concealed and intelligent. It is very difficult to accurately find and eliminate them in the vast sea area.
Automatic mine recognition is the current development trend. 15 In 1997, Shin et al. 16 proposed a method that integrates wide-band compression and mine detection in shallow water. It combines the target recognition algorithm with image compression to achieve excellent detection performance while minimizing computational complexity. Gleckler and Fetzer 17 integrated an underwater laser rangefinder with a digital camera to detect and measure mines, so that dangerous targets can be both located and recognized. Miao et al. 18 introduced a mine recognition approach based on basic vision, motivated by the essential shape characteristics of mine targets. Following the physical meaning of the geometric moment, it combines regional and boundary features to construct three shape descriptors suited to mine targets and then recognizes the target by threshold judgment. This method has higher accuracy (more than 94%) and better stability than the method based on moment invariants. It is better suited to recognizing underwater targets with specific shapes and to cases where the targets are partially occluded.
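As a rough illustration of recognition by moment-based shape descriptors and threshold judgment, the sketch below computes one region descriptor and one boundary descriptor from a binary mask and accepts round, mine-like blobs. The two descriptors, the function names, and the threshold values are assumptions chosen for illustration; they are not the three descriptors constructed by Miao et al.

```python
import numpy as np

def shape_descriptors(mask):
    """Return (eccentricity, radial_spread) for a binary target mask.

    Both descriptors are illustrative stand-ins, not those of Miao et al.
    """
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no target pixels")
    cx, cy = xs.mean(), ys.mean()            # centroid from first-order moments

    # Region feature: second-order central moments give the eccentricity.
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    half_sum = (mu20 + mu02) / 2
    half_diff = np.hypot((mu20 - mu02) / 2, mu11)
    lam_max, lam_min = half_sum + half_diff, half_sum - half_diff
    eccentricity = 0.0 if lam_max == 0 else np.sqrt(max(0.0, 1.0 - lam_min / lam_max))

    # Boundary feature: relative spread of boundary-to-centroid distances.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    by, bx = np.nonzero(mask & ~interior)
    radii = np.hypot(bx - cx, by - cy)
    radial_spread = 0.0 if radii.mean() == 0 else radii.std() / radii.mean()
    return float(eccentricity), float(radial_spread)

def is_round_target(mask, max_eccentricity=0.3, max_radial_spread=0.1):
    """Threshold judgment (assumed thresholds): accept disc-like regions."""
    ecc, spread = shape_descriptors(mask)
    return ecc <= max_eccentricity and spread <= max_radial_spread

yy, xx = np.mgrid[:64, :64]
disc = (xx - 32) ** 2 + (yy - 32) ** 2 <= 20 ** 2   # round, mine-like blob
bar = np.zeros((64, 64), dtype=bool)
bar[29:35, 12:52] = True                            # elongated clutter
print(is_round_target(disc), is_round_target(bar))
```

Combining a region statistic with a boundary statistic means that a blob must pass both tests, which is one plausible reason such descriptors tolerate partial occlusion better than a single global invariant.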
With the development of deep learning, recognition accuracy has risen steadily, and deep learning has become one of the main methods of target recognition. Some researchers 19 used unsupervised processing to detect mine-like targets in the collected images. An AUV equipped with sonar detection equipment is used to detect changes in image texture and intensity in the area and thereby locate mine targets buried under the sea, as shown in Figure 1. Although this method can detect mine targets through unsupervised training, its error rate is relatively high.

Figure 1. Mine target recognition based on sonar image. The figure shows a vehicle turn and two mine-like features.
Williams and Fakiris 20 constructed a set of classifiers in which a modulation factor controls the relative importance of each target during the learning phase of a given classifier. The number of classifiers and all other relevant model parameters are inferred automatically from the training dataset. This method makes better use of underwater target information and significantly improves the accuracy of target recognition, as shown in Figure 2. To extract multilayer features from sonar images, Guo and Chen 21 proposed the naïve Bayes Poisson gamma belief network (PGBN) model, based on the PGBN and Bayes’ theorem, which improves the training efficiency of the model. Its recognition accuracy reaches 93.85%, better than the PGBN and other models such as the three-layer restricted Boltzmann machine, similarity deep belief network (DBN), DBN, SVM, and kernel SVM.

Figure 2. Example synthetic aperture sonar image chips of a truncated-cone-shaped target (a) on the seabed, (b) on the border between flat seabed and ripples, and (c) on seabed characterized by sand ripples.
Reducing background interference is also extremely challenging in underwater target detection. Based on an unsupervised network, Xie et al. 22 proposed a feature extraction approach to extract the intrinsic attributes of mines. They constructed a spectrum regularization unsupervised network (SRUN) to distinguish target information from background information. Target detection relies not only on image features but also on the difference between the known target spectrum and the spectrum of each collected pixel. Figure 3 shows the schematic of the proposed approach, which comprises the following steps. First, the SRUN extracts compact features from hyperspectral images. Then, the effective nodes are selected and further weighted adaptively. Finally, the background information is suppressed to obtain the detection map. Experimental results on several datasets indicate that the SRUN-based target detection algorithm is better suited to targets at the subpixel level and those with structural information.

Figure 3. Schematic of the SRUN-based target detection method. SRUN: spectrum regularization unsupervised network.
To increase the accuracy of mine target recognition, Giovanneschi et al. 23 proposed drop-off minibatch decentralized online dictionary learning. It takes advantage of the fact that much of the training data may be correlated. The model is trained iteratively on small batches, and samples that are no longer relevant are dropped. This method is faster than classical online dictionary learning and its correlation-based variant while retaining similar classification performance.
Most of the methods above recognize targets based on the physical meaning of geometric moments. Because mines have prominent shape characteristics, researchers can obtain a better model either by combining region and boundary characteristics into a descriptor suited to the shape of mine targets or by using deep learning. Future work should focus on improving recognition accuracy for mines of various shapes, as well as on anti-interference ability and timeliness.
Underwater manmade target recognition technology
At present, in addition to lethal mines, there are also many manmade devices with detection, inspection, and strike capabilities. Accurate recognition of underwater manmade equipment is one of the current key research directions. Olmos et al. 24 proposed an approach for detecting manmade targets in unconstrained underwater videos. This algorithm can only detect targets with known contours, and its recognition accuracy drops directly when image quality is poor.
In recent years, scholars have used deep learning for underwater target recognition, which improves recognition accuracy and allows more types of targets to be recognized. To reduce the impact of different environments on target recognition, Parma University used multiple datasets to study the potential of vision-based target detection algorithms in underwater scenes. 25 Trained on multiple datasets, the algorithm can accurately recognize targets in different underwater environments, and the work provides new ideas for subsequent research on multidata information fusion. Yu et al. 26 built a model composed of five convolutional layers and three fully connected layers based on convolutional neural network (CNN) deep learning theory. Both labeled in-air images and unlabeled underwater images are used to train the model. In the last two layers, the maximum mean distance feature metric is added as a regularizer. This method shows good robustness when recognizing underwater targets, with a recognition accuracy of up to 55.07%. The specific process is shown in Figure 4.

Figure 4. The training process of CNN-based target recognition. Conv means the convolutional layer and fc means the fully connected layer. CNN: convolutional neural network.
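To make the idea of a distribution-distance regularizer between in-air and underwater features concrete, the following sketch computes a biased estimate of the squared maximum mean discrepancy (MMD) with a Gaussian kernel. It assumes that the "maximum mean distance" metric mentioned above corresponds to the MMD commonly used in domain adaptation; that reading, the kernel choice, the bandwidth `sigma`, and the function names are assumptions, not details taken from Yu et al.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two batches of feature vectors."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    k_ss = rbf_kernel(source, source, sigma).mean()
    k_tt = rbf_kernel(target, target, sigma).mean()
    k_st = rbf_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
in_air = rng.normal(size=(100, 4))                   # stand-in for in-air features
underwater_near = rng.normal(size=(100, 4))          # same distribution
underwater_far = rng.normal(loc=2.0, size=(100, 4))  # shifted distribution
print(mmd2(in_air, underwater_near), mmd2(in_air, underwater_far))
```

In a training loop, a term of this kind would typically be added to the classification loss with a weighting factor, so that the features of the two domains are pulled together while the labeled in-air images drive the classifier.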
In underwater target recognition, accurate extraction of target feature information is the main factor that affects recognition accuracy. Hussain and Zaidi 27 deblurred the image by reducing the noise in the image and predicting the Euclidean shape. Ma et al. 28 extracted the targets of interest in underwater images by applying color-based algorithms. Then, they used an improved two-dimensional (2D) Otsu algorithm to remove the background color noise. Furthermore, a robust algorithm based on shape signature was used to recognize the shape type of a regular object. The experimental results indicate an average shape-type recognition rate of approximately 90%. To improve the real-time performance of underwater target recognition, Qing et al. 29 proposed a new method based on the wavelet transform and an improved Hough transform. According to the experimental results, the proposed algorithm can accurately detect the straight lines present on manmade objects against a complex underwater background. It has excellent real-time performance, taking only 17.22 ms per image in the best case.
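For readers unfamiliar with Otsu thresholding, the sketch below implements the classical one-dimensional variant, which picks the gray level that maximizes the between-class variance of foreground and background. It is not the improved 2D Otsu algorithm of Ma et al.; the function name and the 8-bit input assumption are illustrative.

```python
import numpy as np

def otsu_threshold(gray):
    """Classical Otsu threshold for an 8-bit grayscale image.

    Pixels with value > threshold form one class, the rest the other.
    """
    values = np.asarray(gray, dtype=np.uint8).ravel()
    hist = np.bincount(values, minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)             # weight of the class with value <= t
    m0 = np.cumsum(prob * levels)    # cumulative first moment up to t
    mean_total = m0[-1]
    w1 = 1.0 - w0
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mean_total * w0 - m0) ** 2 / (w0 * w1)
    # Thresholds that leave one class empty carry no information.
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))

image = np.zeros((10, 10), dtype=np.uint8)
image[:, :5] = 50      # dark background
image[:, 5:] = 200     # bright object
t = otsu_threshold(image)
foreground = image > t
print(t, int(foreground.sum()))
```

The 2D variant additionally uses each pixel's neighborhood mean, which makes the threshold less sensitive to isolated noise at the cost of a larger histogram.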
Recognizing dangerous targets requires not only that the target be recognized accurately but also that its status information, such as position, movement direction, and travel speed, be calculated. Chen and Xu 30 established a DBN model and a stacked denoising autoencoder model. They compared the recognition results of the DBN and other models (SVM, GRNN, PNN, and SDAE) on simulated underwater acoustic data covering different types of targets and different states of one target, and on experimental data covering different states of one target. Table 1 presents the details of the experimental results.
Table 1. Experimental results of different algorithms.
SVM: support vector machine; GRNN: general regression neural network; PNN: probabilistic neural network; SDAE: stacked denoising autoencoder.
The introduction of deep learning has promoted the development of underwater target detection research. On this basis, numerous researchers have proposed more powerful models. All of the above algorithms target specific application scenarios; they have a degree of generality but also limitations. We summarize their advantages and disadvantages in Table 2.
Table 2. Comparison of underwater dangerous target recognition methods.
CNN: convolutional neural network.
Few-shot target recognition
Because underwater artificial devices take diverse shapes in the real environment, it is hard to collect target images and train a satisfactory model, which leads to low recognition accuracy in practice. Transfer learning can effectively transfer a model trained in the source domain to the target domain. Because samples are easier to collect on land and in the air, a model can be trained on those existing targets and then transferred to underwater targets. Based on this theory, Xiamen University integrated deep learning and transfer learning to recognize underwater manmade targets. 26 This method is superior to traditional methods in underwater manmade target recognition tasks and is suitable for long-term research and development.
Based on a cycle-consistent adversarial network and a conditional generative adversarial network, Li et al. 31 proposed a trainable end-to-end underwater multistyle generative adversarial network to address the shortage of underwater image datasets. The system can generate diverse underwater images from aerial images using hybrid countermeasures and unpaired methods. Chen et al. 32 proposed a new two-level feature alignment method, with which a typical deep domain adaptation network can deal with the domain shift between two modalities in the data generation process. To evaluate the quality of the generated images, Liu et al. 33 used similarity values, the structural similarity index, and the multiscale structural similarity index to calculate the level of color and structure similarity. Rao et al. 34 introduced a multimodal model that can complete the recognition task based on experience when few training images are available. The method proposed by Cho et al. 35 generates simulated images from simple underwater images, which makes target recognition more accurate. To compute the similarity between the template image and the sonar image, they define a correlation array between the two.
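The problem of few-shot image recognition can be addressed not only by generating new samples but also by transfer learning. Jin and Liang 36 proposed a framework for underwater few-shot image recognition based on transfer learning. First, an improved median filter was used to suppress the noise in fish images. The denoising results are described quantitatively with a classical measure, the peak signal-to-noise ratio (PSNR), which for RGB images is computed using the standard formula
\[ \mathrm{PSNR} = 10 \log_{10} \frac{255^{2}}{\frac{1}{3|\Omega|} \sum_{c \in \{R,G,B\}} \sum_{x \in \Omega} \left( u_c(x) - \hat{u}_c(x) \right)^{2}} \]
where x is a 2D spatial coordinate that belongs to the image domain \(\Omega\), \(u_c\) and \(\hat{u}_c\) denote channel c of the reference image and the denoised image, and 255 is the peak value of an 8-bit channel.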
Then, the neural network is pretrained on ImageNet, one of the largest image recognition databases in the world. Finally, they used the preprocessed target images to fine-tune the pretrained network. The recognition accuracy on the test dataset reaches 85.08%, a significant improvement.
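The denoising and evaluation steps of this pipeline can be sketched in a few lines. The code below uses a plain 3x3 median filter as a stand-in for the improved median filter of Jin and Liang, whose details are not given here, and computes the PSNR over all three color channels with an assumed peak value of 255 for 8-bit images. Function names and the synthetic test image are illustrative assumptions.

```python
import numpy as np

def median_filter3(img):
    """Plain 3x3 median filter per channel, with edge replication."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    # Nine shifted views cover every position of the 3x3 window.
    windows = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(windows, axis=0), axis=0)

def psnr(reference, estimate, peak=255.0):
    """PSNR in dB; the mean squared error is taken over pixels and channels."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

clean = np.full((21, 20, 3), 100.0)   # synthetic flat image
noisy = clean.copy()
noisy[::7, ::5] = 255.0               # isolated impulse (salt) noise
denoised = median_filter3(noisy)
print(psnr(clean, noisy), psnr(clean, denoised))
```

A higher PSNR after filtering indicates that the filtered image is closer to the reference, which is how the denoising step can be checked before the images are passed to the fine-tuning stage.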
Traditional point-based feature methods often perform poorly because of biofouling, corrosion, and other effects that lead to dramatic changes in target visual appearance. Li et al. 37 used supervised learning to relearn the target and combined the particle filtering framework to automatically recognize the target. The solutions for few-shot target recognition are given in Table 3.
Table 3. Comparison of different methods for few-shot underwater target recognition.
Target recognition under environmental interference conditions
Due to the harsh underwater environment, the quality of the collected target images is poor. Changes in target state and occlusion by other objects also have a large impact on target recognition. Zhou et al. 38 introduced a compound convolutional neural network based on shared latent sparse features and a DBN. Experimental results show that this approach is more stable across different datasets and achieves the highest accuracy, up to 93.34%. The results are presented in Table 4.
Table 4. Comparison of five methods on different datasets.a
CSDN: compound convolutional neural network; VGG: visual geometry group; DBN: deep belief network; SSD: single shot multibox detector; RFCN: region-based fully convolutional network; SCDAE: stacked convolutional denoising auto-encoder; CNN: convolutional neural network.
a Dataset A was collected in the Philippine Sea and includes air gun samples and bomb samples at depths of 50 and 220 m. Dataset B was collected in the South China Sea and contains only bomb samples at depths of 7, 50, and 300 m.
To effectively recognize targets at different depths and reduce radiation noise, Yang et al. 39 combined a deep long short-term memory network (DLSTM) with a deep autoencoder neural network. They used a pretrained DLSTM model and a softmax classifier to classify ship radiation noise. Based on the long short-term memory network, Zhang and Xing 40 proposed a novel method that integrates multiple feature data and a softmax classifier to effectively remove underwater noise interference. Across multiple experiments, the best result reaches an accuracy of 97%. The feature fusion schematic is shown in Figure 5.

Figure 5. Multiclass feature fusion recognition.
In underwater target recognition, the light intensity changes greatly as depth increases. When the target surface is unevenly illuminated, shadows are generated, and part of the appearance information is lost. Zhang and Negahdaripour 41 combined shadow information to reconstruct the shape of three-dimensional targets and minimize the impact of shadows. Song et al. 42 used an AUV equipped with visual image acquisition equipment to compensate for varying light intensity on the target so that the algorithm could extract image and color features from the target image. In this way, the influence of uneven illumination on target information is reduced.
To address the shortcomings of the traditional backpropagation (BP) neural network, such as slow convergence and a tendency to fall into local minima, Tang et al. 43 proposed a novel BP neural network design based on an immune genetic algorithm. This algorithm overcomes the problems of the genetic algorithm in search efficiency, individual diversity, and premature convergence, and it effectively improves convergence performance.
Because of the turbidity, absorption, and scattering of water, images collected underwater become blurred, which greatly affects the accuracy of target recognition. To reduce these effects, Li et al. 44 proposed an effective defogging model to restore the visibility, color, and natural appearance of underwater images. Ding et al. 45 proposed a new underwater image enhancement strategy combining adaptive color correction and model-based defogging. Compared with the original underwater images, the enhanced images reveal more feature points. This strategy effectively improves the quality of underwater images and makes underwater target recognition more accurate.
Because of low contrast, blue–green projection, and low visibility, captured underwater images appear green and blue, 46 which distorts them. Ahn et al. 47 proposed a data enhancement method based on the principle of retina to improve the visibility of captured images. Chuang et al. 48 used feature learning and error-proof classifiers to preprocess the collected images and improve image clarity. Zhang et al. 49 applied visual inspection to underwater image feature extraction. Before underwater image preprocessing, the dark channel is applied to eliminate haze and enhance the contrast of underwater images. The robustness and real-time performance of the algorithm are greatly improved. Yu et al. 50 proposed a novel framework named underwater GAN for image restoration. It uses a convolutional patchGAN classifier to learn structure loss. Based on the underwater image generator model, a more realistic image is generated through simulation, which reduces the influence of abnormal image contrast on target recognition.
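The dark channel used for haze removal can be illustrated with a short sketch. The code follows the generic dark channel prior from the dehazing literature, in which the transmission is estimated as one minus a scaled dark channel of the image normalized by the ambient light; it is not the specific pipeline of Zhang et al., and the patch size, the factor `omega`, and the function names are assumptions. Underwater variants often treat the red channel differently, which is not modeled here.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Minimum over color channels, then minimum over a patch x patch window.

    `patch` is assumed to be odd; borders use edge replication.
    """
    img = np.asarray(img, dtype=float)
    min_rgb = img.min(axis=2)
    r = patch // 2
    padded = np.pad(min_rgb, r, mode="edge")
    h, w = min_rgb.shape
    windows = [padded[i:i + h, j:j + w]
               for i in range(patch) for j in range(patch)]
    return np.min(np.stack(windows, axis=0), axis=0)

def estimate_transmission(img, ambient, omega=0.95, patch=3):
    """Transmission map t = 1 - omega * dark_channel(img / ambient)."""
    normalised = np.asarray(img, dtype=float) / np.asarray(ambient, dtype=float)
    return 1.0 - omega * dark_channel(normalised, patch)

scene = np.full((5, 5, 3), 0.8)     # bright, hazy region
scene[2, 2] = [0.1, 0.9, 0.9]       # one pixel with a dark red channel
dc = dark_channel(scene)
haze = np.full((4, 4, 3), 0.7)      # image equal to the ambient light everywhere
t_map = estimate_transmission(haze, [0.7, 0.7, 0.7])
print(dc[2, 2], dc[0, 0], t_map[0, 0])
```

Regions where the dark channel is high are interpreted as heavily veiled, so their estimated transmission is low and a restoration step would amplify their contrast the most.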
Since the underwater environment is subject to time-varying ocean currents, the signal obtained by the imaging sensor depends on time. When variable ocean currents cause fluctuations in the refractive index along the imaging path, target recognition becomes more difficult. 51 Florida Atlantic University 52 used compressed line sensing to reconstruct images after ocean current interference so that the imaging system can recover target information under various turbulence intensities. The network in reference 53 follows the network structure of Kupyn et al. 54 and restores distorted underwater image sequences with a GAN. The training process is guided by the Wasserstein distance: the smaller the distance, the higher the similarity between real and generated images. This method can effectively restore distorted images to a higher degree of fidelity, which reduces the impact of time-varying ocean currents on target information collection. The network architecture is shown in Figure 6.

Figure 6. Network structure of He’s method.
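To show why a smaller Wasserstein distance indicates a closer match between real and generated images, the sketch below computes the Wasserstein-1 distance between two equal-size one-dimensional samples, where it reduces to the mean absolute difference of the sorted values. This is a toy illustration only: a GAN trained with the Wasserstein distance approximates the distance with a learned critic network rather than by sorting, and the function name and sample values are assumptions.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1D samples."""
    a = np.sort(np.asarray(a, dtype=float).ravel())
    b = np.sort(np.asarray(b, dtype=float).ravel())
    if a.size != b.size:
        raise ValueError("samples must have the same size")
    # With equal sample sizes, the optimal transport plan pairs sorted values.
    return float(np.mean(np.abs(a - b)))

real = np.array([0.0, 1.0, 2.0, 3.0])          # e.g. pixel intensities of a real image
well_restored = np.array([0.1, 1.1, 1.9, 3.0])
distorted = np.array([2.0, 3.0, 4.0, 5.0])
print(wasserstein_1d(real, well_restored), wasserstein_1d(real, distorted))
```

Unlike divergences that saturate when two distributions do not overlap, this distance keeps decreasing as the generated samples move toward the real ones, which is one reason it is used to guide GAN training.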
In the future, multi-AUV underwater target recognition will inevitably develop in the direction of real-time, high accuracy, and autonomy. The accuracy of target recognition in complicated underwater environments needs to be further improved. Table 5 summarizes the classification of target recognition methods to reduce the impact of the complex underwater environment.
Table 5. Summary of target recognition methods in complex underwater environments.
Different algorithms on the same dataset
With the development of deep learning, target recognition technology has also made considerable progress. Deep learning has a strong learning ability. It can learn useful information from a large amount of training data and effectively detect objects in images. Since R-CNN 55 was proposed by Girshick, the field of target recognition has gained great attention and become an emerging research hotspot. Many new models have since been proposed, such as fast R-CNN, 56 faster R-CNN, 57 FPN, 58 YOLO, 59 SSD, 60 and so on. These algorithms have their own advantages and disadvantages. Some researchers apply them to the same dataset to compare their capabilities.
Wang et al. 61 proposed a new underwater target detection dataset, called UDD, which contains three categories (sea cucumber, sea urchin, and scallop) with a total of 2227 images. YOLOv3, RetinaNet, and other networks were selected for comparison. The comparison results are given in Table 6. To make the comparison fair, all models were trained from scratch with the same hyperparameters, and the same data augmentation methods were used with the same parameter settings.
Table 6. Comparison of different algorithms on the UDD dataset.
As presented in Table 6, CenterNet was the fastest at 33 FPS, followed by YOLOv3 and Foveabox at 32 FPS and 28 FPS, respectively. In terms of accuracy, YOLOv3 performed best at 46.8%, followed by RPDet at 45.1% and FCOS at 44.9%. Overall, YOLOv3 performed best, ranking near the top in both accuracy and detection speed.
The 2018 Underwater Robot Picking Contest (URPC) provided an underwater target detection dataset that includes sea urchins, sea cucumbers, scallops, and starfish. To test the detection performance of different algorithms, Zhang et al. 71 tested the regular faster R-CNN, FPN, and R-FCN. They also tested faster R-CNN and R-FCN with deformable convolution. All the models used the same hyperparameters and were tested on the same computer. The experimental results are presented in Table 7.
Table 7. Comparison of different algorithms on the URPC 2018 dataset.
Among the conventional detection algorithms, R-FCN has the best performance, with mAP reaching 66.5%, which is much better than faster R-CNN and FPN. Moreover, the performance of each network improves over the original once deformable convolution is adopted. The best performer was deformable R-FCN at 85.7%.
In recent years, the field of underwater target recognition has developed rapidly, and new algorithms are proposed every year. These algorithms offer fast detection and high accuracy, but they generally target specific application scenarios. Poor generality remains a major problem in this field, and future research should focus on developing algorithms that generalize well.
Summary
With the continuous development of underwater weapons and equipment, researchers are paying more attention to underwater safety, and underwater dangerous target recognition has become a focus of research. Because the sea is vast and the underwater environment is complex, it is difficult to collect images of dangerous targets. To solve this problem, many scholars have studied few-shot target recognition, for example, transferring models trained on aerial or land targets to underwater dangerous targets through transfer learning, increasing the number of samples through reinforcement learning, and using GANs to generate images of dangerous targets. The recognition accuracy of deep learning can be improved by increasing the number of training samples. Other scholars have used methods such as target reconstruction, image defogging, and image restoration to reduce the effect of underwater interference (such as uneven illumination, turbid water, and time-varying ocean currents) on the target image and thereby improve recognition accuracy.
With the development of the cluster system, multiple AUVs are used for collaborative work to collect target information from different angles and reduce the limitation of collecting information from a single perspective. The comprehensive utilization of various marine information can offset or reduce the impact caused by the special underwater environment of the ocean. It will be an important research direction to further improve underwater target recognition.
As the shapes of underwater dangerous targets diversify, and given that the shapes of enemy dangerous targets may be unknown, recognition accuracy cannot be guaranteed by a training dataset alone. Metalearning gives algorithms the ability to learn how to learn, so target recognition methods based on metalearning may achieve higher recognition accuracy for underwater dangerous targets.
In summary, the underwater target recognition method will develop in the direction of intelligence, autonomy, high precision removal rate, strong robustness, and real-time performance. It will play a more powerful role in the military and civilian fields.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Project [2019YFB1311000].
