Abstract
Object tracking with deep learning follows an object's movement through a video, and its main challenge is to estimate or forecast the locations and other pertinent details of moving objects. Tracking typically builds on object detection. The detection, classification, and tracking of objects play a vital role in computer vision applications, so an overview of the available techniques is valuable. This research performs a systematic literature review of object detection techniques by analyzing, summarizing, and examining existing works. State-of-the-art works are collected from standard journals; their methods, advantages, limitations, and challenges are identified; and the research questions are formulated on that basis. Overall, around 50 research articles are collected. The evaluation based on various metrics shows that most of the works use deep convolutional neural networks (Deep CNN) and that object detection helps improve the performance of these networks during tracking. The important open issues are also discussed to help advance object-tracking techniques.
Introduction
Object detection is a crucial and challenging subject of research in the domains of computer vision and digital image processing [29]. The goal of object tracking is to automatically locate the target across the entire video. In aerial and satellite image processing, object detection is likewise crucial but challenging [34]. Detecting objects in an image may be quite simple for a human brain, but it is not for a machine. Distinguishing things in an image requires computer vision, in which the machine processes data that demand large memory and graphics capabilities [2]. The term “object detection” refers to the process of identifying objects based on characteristics like color, shape, and size. To recognize vehicles, vehicle queues are typically extracted from satellite images [52]. With the continued advancement of improved compressed sensing technologies in recent years, satellite video is increasingly used in several settings, including humanistic surveys, education, emergency rescue, disaster assistance, and military objectives [21]. Some approaches treat object detection as a classification problem and have demonstrated good performance on several specific object recognition tasks in recent years, owing to advances in machine learning techniques, notably sophisticated feature representations and classifiers [44].
Long-term monitoring is not feasible for human operators. It can be challenging to identify objects such as vehicles, trucks, planes, and ships in high-resolution satellite images. There is no universally accepted solution to this problem, even though numerous approaches seek to tackle it [42]. CNNs, a type of neural network, can be used to detect cars and other vehicles [52, 56, 51, 48, 32, 28, 17]. The primary difficulty in tracking moving vehicles is the small size of the vehicle, which makes stable tracking features hard to obtain. Up to this point, vehicles have been the main targets of moving object detection and tracking in satellite videos [6]. Object detection and tracking are extensively employed in numerous application domains, including robotic navigation, intelligent video surveillance, industrial detection, aerospace, military surveillance, homeland security, transportation planning and management, and intelligent traffic guidance systems, among others [29, 4]. Conventional detectors and trackers struggle to resolve targets in imagery from satellites and high-altitude drones [35, 37]. As deep CNNs have attained better performance for image classification [42], CNN-based approaches for object identification have been drawing more and more attention from researchers [15, 47, 24, 20, 3, 30, 38, 54, 5, 18, 33, 14, 12, 8, 45, 13, 19, 7, 49, 55, 27, 9, 51].
Many studies on object detection, classification, and tracking have emerged in recent times. Accurate detection and classification are essential for applying this research in real-time scenarios. Developed models need to demonstrate their adequacy on multiple datasets, and developing a novel object detection framework requires deeper insight into the available detection techniques. This research aims to provide a systematic review of the methods in order to answer the research questions stated below.
RQ1: What are the techniques that are widely used for the tracking of objects in recent research? RQ2: Do efficient object detection and classification play a major role in object tracking mechanisms? RQ3: What are the challenges that are associated with the object tracking mechanisms?
A systematic literature review of object detection techniques is carried out in this research; papers from 2015 to 2022 are analyzed and observations are drawn from them. The study aims to provide information about recent advancements in object tracking mechanisms and concentrates on software-only methods.
The research questions are formulated from the information gathered from the various articles and are described as follows:
RQ1: What are the techniques that are widely used for the tracking of objects in recent research?
Which methods have been used for tracking objects since 2015? Which metrics are used to demonstrate the efficacy of the models?
RQ2: Do efficient object detection and classification play a major role in object tracking mechanisms?
Which methods are used for object detection and classification? How do existing works deal with the issues arising in object tracking?
RQ3: What are the challenges that are associated with the object tracking mechanisms?
Which challenges exist in the research? Which challenges have researchers resolved, and which factors still need improvement?
The paper is organized as follows: the related works on object detection, classification, and tracking are reviewed in Section 2. The analyses based on the various metrics are presented in Section 3. The potential challenges to be overcome are discussed in Section 4, and the research is concluded in Section 5.
The works on object detection, classification, and tracking methods are reviewed in the following sections. The systematic flow of the research across the different techniques is shown in Fig. 1.
Taxonomy representation.
Kamini Goyal and Dapinder Kaur [52] implemented a deep neural network (DNN) model for traffic surveillance, using a median filter to remove salt-and-pepper noise and non-negative matrix factorization (NMF) to reduce the descriptor size. The hybrid DNN classified pedestrians well; however, its performance is hindered because the detection and tracking networks work in isolation. Other approaches, such as that of Alisa Makhmutavo et al. [15], addressed occlusion issues but are limited to non-moving objects. ShiJie Sun et al. [47] used deep learning for crowded vehicle tracking, yet the separate operation of the detection and tracking networks negatively impacted overall efficiency. Yujia Guo et al. [10] introduced an algorithm combining a correlation filter and a Kalman filter, enhancing tracking speed but occasionally producing low confidence scores. Bo Du et al. [20] fused the Kernel Correlation Filter with a three-frame-difference algorithm, improving small object detection but with potential limitations in tracking efficiency.
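The three-frame-difference idea mentioned for [20] can be illustrated with a minimal sketch. It assumes grayscale frames stored as NumPy arrays; the function name `three_frame_difference` and the threshold value are illustrative choices, not details taken from the reviewed paper. A pixel is marked as moving only if it differs from both the previous and the next frame, which suppresses the "ghost" left behind at the object's old position.

```python
import numpy as np

def three_frame_difference(prev, curr, nxt, threshold=25):
    """Return a boolean motion mask for the middle frame.

    A pixel counts as moving only when it changed between prev->curr
    AND between curr->nxt (hypothetical threshold in gray levels).
    """
    # Cast to a signed type so uint8 subtraction cannot wrap around.
    prev, curr, nxt = (f.astype(np.int16) for f in (prev, curr, nxt))
    changed_before = np.abs(curr - prev) > threshold
    changed_after = np.abs(nxt - curr) > threshold
    return changed_before & changed_after

# Toy example: one bright pixel moving one column to the right per frame.
frames = [np.zeros((5, 5), dtype=np.uint8) for _ in range(3)]
for t, frame in enumerate(frames):
    frame[2, 1 + t] = 255
mask = three_frame_difference(*frames)
print(np.argwhere(mask))  # location of the object in the middle frame
```

In practice the resulting mask would still need morphological cleanup and would feed a tracker such as a correlation filter; those steps are outside this sketch.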
Da Zhang et al. [30] applied deep reinforcement learning for offline object tracking, ensuring suitability beyond real-time scenarios. Xingping Dong et al. [54] implemented a real-time tracking method with a kernel classifier, addressing drifting issues but encountering challenges with low confidence scores. Kuan Fang et al. [33] executed an autoregressive method with internal and external memory, demonstrating high robustness in occluded and crowded areas. Ahilan Appathurai et al. [14] developed a hybrid method using an artificial neural network and an oppositional gravitational search optimization algorithm, enhancing performance with optimally selected weight values. Guanghan Ning et al. [6] tracked visual objects using a recurrent network, effectively handling occlusion challenges. Xu Chen and Haigang Sui [35] presented an efficient method for detecting moving objects in real-time satellite videos, utilizing a discriminative correlation filter and a Kalman filter for precise position detection. Fahime Farahi and Hadi Sadoghi Yazdi [45] employed a probabilistic Kalman filter for improved tracking estimation, demonstrating the ability to handle occlusion and track abnormal behavior. Jahongir Azimjonov et al. [19] used You Only Look Once (YOLO) and Kalman filters for vehicle detection and tracking, achieving effective tracking but with lower accuracy for trucks. Xu Chen et al. [9] introduced adaptive motion separation and a differential accumulated trajectory for moving vehicle detection in satellite videos, improving accuracy in distinguishing moving vehicles from pseudo-motion backgrounds. Shiyu Xuan et al. [11] developed a novel tracking algorithm combining a correlation filter and motion estimation, overcoming challenges in tracking fast-moving objects. Renxi Chen et al. [36] employed adaptive filtering and lightweight CNN models for moving vehicle detection, achieving noise reduction but experiencing a slight degradation in recall.
Shiyu Xuan et al. [22] addressed the tracking of moving, rotating objects in satellite video using an adaptive correlation filter tracking algorithm, demonstrating efficiency in handling changes in bounding boxes due to rotation. Bing Sui et al. [16] presented a lightweight network for object detection in satellite videos, overcoming the limitations of CNN-based trackers and achieving efficient object tracking with parameter identification and network parameter identification. Niharika Goswami et al. [39] utilized the U-set deep learning system for object detection in high-resolution satellite images, demonstrating simplicity but facing challenges with low scores in detecting certain objects. Xiaofeng Li et al. [46] presented a real-time algorithm for tracking vehicles from aerial images, employing image offset calibration, transfer learning, and filter set construction for accurate target motion detection. Hyungjun Kim [41] developed a traffic monitoring system using a CNN for vehicle type classification together with background modeling, edge detection, and object tracking, effectively tracking vehicles but facing low-score issues for certain objects. Eric Price et al. [43] presented real-time continuous DNN-based tracking and detection from multiple cooperating robots, utilizing an optimal control problem (OCP) and graphics processing units (GPUs) for multi-robot cooperative detection and tracking. Sayed Majid Azimi et al. [53] addressed multi-object tracking from aerial images using a Siamese neural network, long short-term memory, and a graph convolutional network, achieving accurate and stable tracking. Zhaopeng Hu et al. [25] extended deep learning for object tracking in satellite videos using a regression network, integrating a gradient descent algorithm, convolutional layers, and a regression model for improved performance, and leveraging the Visual Geometry Group (VGG-16) network for effective feature extraction.
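Several of the trackers reviewed above pair a detector or correlation filter with a Kalman filter. As a rough illustration of what the Kalman filter contributes, the following is a minimal constant-velocity sketch for a 2D object center. The class name, the noise variances, and the initial covariance are assumptions made for this example and do not come from any of the cited works.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter with state [x, y, vx, vy] and position-only measurements."""

    def __init__(self, x0, y0, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])   # initial state, zero velocity
        self.P = np.eye(4) * 10.0               # initial uncertainty (assumed)
        self.F = np.array([[1, 0, dt, 0],       # constant-velocity motion model
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],        # only the position is observed
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_var        # process noise (assumed)
        self.R = np.eye(2) * meas_var           # measurement noise (assumed)

    def predict(self):
        """Propagate the state one frame ahead; returns the predicted position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2].copy()

    def update(self, z):
        """Correct the prediction with a measured position z = (x, y)."""
        z = np.asarray(z, dtype=float)
        innovation = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2].copy()

# Usage: feed detections frame by frame, then predict the next position,
# e.g. to bridge a short occlusion where the detector returns nothing.
kf = ConstantVelocityKalman(0.0, 0.0)
for t in range(1, 31):
    kf.predict()
    kf.update((2.0 * t, 1.0 * t))
print(kf.predict())  # should lie close to (62, 31)
```

The prediction step is what lets a tracker keep a plausible position during missed detections, while the update step limits the influence of noisy detections.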
Object detection
Chenchen Jiang et al. [24] introduced the You Only Look Once (YOLO) model for object detection from Unmanned Aerial Vehicle (UAV) Thermal Infra-Red (TIR) images and videos, performing multi-scenario object detection with various YOLO models. YOLOv5 efficiently detected small objects in real time against frequently changing and complex backgrounds in UAV TIR videos; however, it works only within certain viewing angles. Peng Ding et al. [26] enhanced a deep CNN for optical remote sensing by employing dilated convolution and Online Hard Example Mining (OHEM) for efficient bootstrapping. The Faster R-CNN technique, combined with an enhanced VGG16-net, improved object detection accuracy; nonetheless, the detection ability of the network remains limited. Ying Ya et al. [29] utilized an arbitrary-oriented region CNN along with a fusion object detection framework for detecting objects in satellite images. The method used pan-sharpening to fuse multi-source images and Faster R-CNN to detect objects in large-scale satellite images. However, the complexity of deep learning techniques with multi-layer models poses computational challenges.
Atakan Korez and Necaattkin Barisci [34] presented a multi-scale Faster R-CNN method for a graphics processing unit (GPU) system, utilizing the Weight Standardization (WS) technique for weight calculation in normalization. The combination of deformable convolution and ResNet-50 extracted high-resolution features efficiently, but the model is applicable only to small batch-sized images. Wenming Cao et al. [3] developed a method for real-time video object detection using a fast DNN with knowledge-guided training. The deep NN is trained with a cross-network knowledge projection framework and a Support Vector Machine (SVM) for low-complexity object detection; however, the model’s applicability is limited to small batch-sized images. Ivan V. Saetchnikov et al. [38] compared various CNN methods for object detection, highlighting YOLO v3’s superior performance. Additional dropout layers with empirical optimization mitigated over-learning during the segmentation task, but the methods were validated only on a limited dataset. Gong Cheng et al. [44] introduced a Rotation Invariant CNN (RICNN) model for improved object detection in very high-resolution (VHR) optical remote sensing images. Utilizing a simple rotation function and generic object proposal detection, RICNN efficiently detected vehicles; however, the model’s effectiveness is not guaranteed for all scenarios.
Joshua Bapu et al. [2] employed an adaptive CNN for spatial object recognition with N-grams, using Sobel edge detection and gray-level co-occurrence matrices (GLCM) for object detection in satellite images. The complexity of deep learning techniques and the need for multiple noise-reduction processes may reduce computational efficiency. Junfeng Lei et al. [21] presented a method for detecting tiny vehicles in satellite video using spatial-temporal information. The Gaussian filter used in detection and the constraints used to eliminate false detections add complexity and require more constraint details. Xiaofei Liu et al. [5] used a CNN-based method for real-time ground vehicle detection in infrared images, capturing a greater number of features in infrared imagery. However, manually labeling training samples increased processing time. Tao Yang et al. [18] presented a detection method for small moving vehicles in satellite video of urban areas. The saliency background model improved the accuracy of moving vehicle detection and reduced false detections; however, the model’s effectiveness depends on pre-segmented regions.
Saleh Javadi et al. [12] introduced a method for heavy vehicle detection from aerial images using a DNN and depth maps. While depth map analysis improved detection, the model’s effectiveness is contingent on the modified CNN architecture and the selected detector network. Yuanlin Zhang et al. [11] utilized a Hierarchical and Robust CNN (HRCNN) for enhanced object detection accuracy in remote sensing images. HRCNN efficiently performed four tasks using a combination of a greedy algorithm, AlexNet for feature extraction, and a Support Vector Machine; however, the model’s applicability may be limited to specific datasets. Gong Cheng et al. [37] presented a Rotation Invariant and Fisher Discriminative CNN (RIFD-CNN) for object detection with improved performance. Optimizing a new objective function that applies a rotation-invariant regularizer and a Fisher discrimination regularizer to the CNN proved efficient, particularly on datasets where rotation invariance is essential. Yapeng Guo et al. [7] introduced the orientation-aware feature fusion Single-stage Detection (OAFF-SSD) deep learning technique for dense construction vehicle detection from unmanned aerial vehicles (UAV). The model’s multilevel feature extraction and orientation-aware bounding box regression contributed to more precise detection.
Wenhua Zhang et al. [49] presented the Laplacian Feature Pyramid Network (LFPN) for combining low- and high-frequency features, enhancing object detection performance in very high-resolution optical remote sensing (VHR-ORS) images. The use of the Feature Pyramid Network (FPN) and CNN demonstrated efficiency on the NWPU VHR-10 dataset. Ali Tourani et al. [40] presented a Faster Region-based CNN (Faster R-CNN) for vehicle detection from video, using a low-pass filter in image pre-processing. The residual learning framework with ResNet-50 and Faster R-CNN was effective in vehicle detection, though further improvement is required. Yongzheng Xu et al. [1] introduced the Faster R-CNN method for detecting cars in low-altitude UAV imagery. The method’s two-module approach, incorporating a Fast R-CNN detector and a Region Proposal Network (RPN), achieved high-speed vehicle detection but with potential limitations in completeness.
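The core idea behind Online Hard Example Mining, mentioned for [26], is to spend each training step on the examples the network currently handles worst. A minimal sketch of that selection step is given below; the function name `select_hard_examples` and the per-example loss values are hypothetical, and a full detector would additionally deduplicate overlapping regions before selection.

```python
import numpy as np

def select_hard_examples(losses, k):
    """Return indices of the k examples with the highest current loss."""
    losses = np.asarray(losses, dtype=float)
    order = np.argsort(losses)[::-1]   # indices sorted by descending loss
    return order[:k]

# Hypothetical per-region losses from one forward pass.
region_losses = [0.1, 2.0, 0.5, 1.5]
hard = select_hard_examples(region_losses, k=2)
print(hard)  # only these regions would contribute to the backward pass
```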
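Sobel edge detection, used as a pre-processing step in [2], estimates horizontal and vertical intensity gradients with two 3x3 kernels and combines them into an edge strength. The following NumPy-only sketch shows the computation on a grayscale image; the helper names are illustrative, and borders are simply cropped ("valid" filtering) to keep the example short.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel, cropping the border."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * image[i:i + h - 2, j:j + w - 2]
    return out

def sobel_magnitude(image):
    """Gradient magnitude of a grayscale image using the Sobel kernels."""
    image = image.astype(float)
    gx = filter3x3(image, SOBEL_X)   # responds to vertical edges
    gy = filter3x3(image, SOBEL_Y)   # responds to horizontal edges
    return np.hypot(gx, gy)

# Toy image: dark left half, bright right half, so there is one vertical edge.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
print(sobel_magnitude(img))
```

Thresholding the magnitude yields a binary edge map; texture descriptors such as GLCM would then be computed separately.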
Object classification
S. Vasavi et al. [42] presented a neural network-based classification method that overcomes the overfitting and low-performance problems of deep learning techniques. An appearance-based multi-block local binary pattern and a model-based algorithm were implemented to detect and classify the objects present in satellite images. Invariant features from the Darknet architecture of YOLO were combined with Faster Region-Based CNN (Faster RCNN) at different spatial locations to count the total number of vehicles. The advantages of combining YOLO with Faster RCNN were the prediction of more vehicle classes and the detection of small objects.
Object detection and tracking
Hyochang Ahn et al. [4] used a knowledge-based CNN that tracked objects using an optical flow algorithm. The object positions are updated frequently from frame to frame and more accurate features are extracted, but the processing time is considerably high. Chandan G. et al. [55] detected and tracked objects using a Faster RCNN. This model can be used in different situations to locate, follow, and react to targeted objects in video surveillance, and the trained model produced good detection and tracking outcomes. Camlo Aguilar et al. [27] described a two-step, motion-based CNN method for detecting and tracking objects in satellite videos. First, the rough target location was identified with a lightweight motion detection operator, and the detected results were then refined and combined with a CNN. A Probability Hypothesis Density (PHD) filter then used the detections to track the vehicles. The multi-object Bayesian data-association framework performed well relative to other Bayesian filters by continuing to track missed targets.
Muhammad Rashid et al. [13] presented a method combining a CNN and the Scale-Invariant Feature Transform (SIFT) that overcomes complex backgrounds, congested situations, and similarity problems. The deep CNN models VGG and AlexNet extracted the features, after which a Rényi entropy-controlled method selected robust features from the DCNN pooling output and the SIFT point matrix. The selected features are combined into a matrix, passed to an ensemble classifier for recognition, and evaluated on the Barkley 3D, Caltech101, and Pascal 3D datasets.
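For readers unfamiliar with the entropy measure named in [13], the Rényi entropy of order α of a discrete distribution p is log(Σ p_i^α) / (1 − α), and it reduces to the Shannon entropy as α approaches 1. The sketch below only computes this quantity; how the reviewed method turns entropy values into a feature-selection rule is not described here, so no selection logic is shown. The function name and the default α are assumptions.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """Rényi entropy (natural log) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()          # normalize so the values sum to 1
    p = p[p > 0]             # zero-probability bins do not contribute
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))   # Shannon limit
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# A flat histogram has maximal entropy; a single-bin histogram has none.
print(renyi_entropy([1, 1, 1, 1]), renyi_entropy([1, 0, 0, 0]))
```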
Others
Jiasong Zhu et al. [31] presented a UAV-based method for deep learning-based detection, tracking, and counting of vehicles to estimate traffic in urban areas. The counting framework comprised two parts: deep learning-based vehicle detection with a single shot multibox detector (SSD), and vehicle tracking and counting. It was evaluated on the UAV city Traffic Video Dataset (UAV CT). Seonkyeong Seong et al. [23] presented a CNN-based method that tracked vehicle direction using bounding boxes. Vehicle trajectories were extracted from the images received from the camera at an intersection: the YOLOv2 model performed object detection, while an Intersection-over-Union (IoU) tracker and a Kalman filter vehicle tracking algorithm estimated the trajectories. Debojit Biswas et al. [50] described a three-step method for estimating the speed of multiple moving objects from a UAV platform. Faster R-CNN detected the objects; channel and spatial reliability tracking (CSRT), which includes a discriminative correlation filter, tracked them; and feature-based image alignment (FBIA) provided the object location in each frame.
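An IoU tracker of the kind mentioned for [23] links detections across frames purely by bounding box overlap. The sketch below shows one greedy variant under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the 0.3 threshold is arbitrary, and tracks that find no match are dropped immediately. Function names are illustrative and not taken from the cited work.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def update_tracks(tracks, detections, next_id, iou_threshold=0.3):
    """Greedily match existing tracks to new detections by IoU.

    tracks maps track id -> last known box. Returns (new_tracks, next_id).
    """
    unmatched = list(detections)
    new_tracks = {}
    for track_id, box in tracks.items():
        if not unmatched:
            break
        best = max(unmatched, key=lambda det: iou(box, det))
        if iou(box, best) >= iou_threshold:
            new_tracks[track_id] = best      # same vehicle, updated position
            unmatched.remove(best)
    for det in unmatched:                    # leftover detections start new tracks
        new_tracks[next_id] = det
        next_id += 1
    return new_tracks, next_id

# Two vehicles detected in frame 1 keep their ids in frame 2.
tracks, next_id = update_tracks({}, [(0, 0, 10, 10), (50, 50, 60, 60)], 0)
tracks, next_id = update_tracks(tracks, [(52, 51, 62, 61), (1, 0, 11, 10)], next_id)
print(tracks)
```

Because matching relies only on overlap, this approach needs a high frame rate and reliable detections; combining it with a motion model such as a Kalman filter helps when objects move quickly or are briefly missed.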
Bibliographic analysis
The analyses based on methods and on datasets are presented as follows.
Analysis based on methods
This review intends to give new researchers the background needed to better understand the object tracking methods currently being developed. For scholars and researchers, it covers the most prevalent methods for object detection, classification, and tracking. This section examines 50 state-of-the-art deep learning methods. Because every study focuses on a distinct set of parameters, it is challenging to pinpoint which approaches are preferable. To answer that question, we first examine each study’s methodology and then identify the strategies that are most effective overall. Each paper’s architecture information is extracted based on its framework. The interpretations are shown in Table 1 and Fig. 2.
Analysis concerning methods
Analysis based on dataset
Analysis concerning methods.
This analysis covers the databases used in the reviewed works and gives details about the availability of the datasets. Most of the methods used data from standard public repositories, some used manually collected data, and others did not provide information about their data, as summarized in Table 2.
Potential challenges
Handling occlusions between frames, especially in scenarios involving complex motions, poses a significant challenge for both linear and non-linear models [47]. Object detection is hindered by complex scene information, low resolution, and the absence of publicly available datasets and training models [24]. Accurate dense object detection with a strong classifier is demanding and requires robust methods [42]. Recognizing vehicles is complicated by obscured objects and shadow zones, which add a layer of difficulty to the recognition process [42]. Tracking objects in satellite videos is challenging due to factors such as the tiny size of moving objects, lack of texture, and background similarity [10]. Object detection in remote sensing images, applied in fields including agriculture, city monitoring, and traffic monitoring, faces challenges such as limited datasets and high costs [34]. The scarcity of datasets for specific object classes is an obstacle to object detection, with aerial imaging facing additional restrictions due to high costs [38].
Despite their adaptability, online classifiers often struggle with the drifting issue caused by noisy updates [54]. Poor resolution in aerial images and the complexity of vehicle recognition make it difficult to extract notable features and to handle stance variations, view changes, and ambient radiation [5]. Accurately recognizing moving vehicles while suppressing false alarms from objects of the same size remains a challenging task [18]. In some scenarios, tracking algorithms must rely on visual cues rather than bounding box motions, for example when pedestrians in front of the camera have similar speeds and sizes [33]. Addressing occlusion and significant appearance fluctuation in visual object tracking is challenging because unknown features are difficult to evaluate [17]. Rapidly and precisely identifying cars in aerial images remains a challenging aspect of object tracking [12]. Visual tracking in computer vision is challenged by target deformations, lighting variations, size changes, rapid motions, occlusions, motion blur, and background clutter [6]. Object tracking encounters difficulties in crowded environments and complicated backgrounds, where distinguishing between various objects becomes challenging [13]. Choosing the threshold that separates foreground from background across different satellite video images poses a significant detection challenge [9].
Conclusion
The main goal of this systematic literature review is to give new researchers a starting point for their object-tracking studies. After detailed filtering, 50 research articles relevant to object detection, classification, and tracking are analyzed, and the conclusion is that deep CNNs are widely used for object-tracking mechanisms. Object detection and classification play a major role in tracking objects in satellite images. The analysis considers factors such as the methods used, the publishing journal, the publication year, and the metrics used; the potential challenges associated with object tracking are also discussed. The impact of the papers is also analyzed based on their citation counts. Interestingly, the majority of research papers in this domain were published in 2019, marking a pivotal year for advancements in object tracking mechanisms based on deep learning. These findings underscore the central role of the Stanford University Dataset, CNN, and precision in the evolution of this field, offering valuable insights into the trajectory of research in this area. In the future, optimization-based techniques will also be included in the review of object tracking mechanisms.
