Adaptive human-in-the-loop multi-target recognition improved by learning

Abstract

Machine learning algorithms have been designed to address the challenge of multi-target recognition in dynamic and complex environments. However, sufficient high-quality samples are not always available for training an accurate multi-target recognition classifier. In this article, we propose a generic human-in-the-loop multi-target recognition framework that has four collaborative autonomy levels, and it allows adaptive autonomy level adjustment based on the recognition task complexity as well as the human operator’s performance. The human operator can intervene to relabel the collected data and guarantee the recognition accuracy when the trained classifier is not good enough. Meanwhile, the relabeled data are used for online learning which further improves the performance of the classifier. Experiments have been carried out to validate the proposed approach.

Keywords

Multi-target recognition human-in-the-loop surveillance online learning autonomy level adjustment

Introduction

Multi-target recognition is an essential problem in intelligent systems.^1,2 It is useful for a variety of relevant tasks including surveillance, event analysis, and decision-making.^3
–5 In the field of computer vision, machine learning algorithms such as SVM⁶ and deep concurrent neural networks^7,8 have been developed to train multi-target recognition classifiers. However, these state-of-the-art algorithms require a large amount of high-quality labeled data, which make the training process very expensive and time-consuming in real-world tasks, for example, in the search and rescue scenario using unmanned aerial vehicles (UAVs) with vision sensors.⁹ In addition, the trained classifiers can hardly recognize all the targets in dynamic and complex environments,¹⁰ especially when the targets are high speed,¹¹ partly occluded,¹² or varying in scales^13,14 (see Figure 1).

Figure 1.

Challenging multi-target recognition cases where the targets are: (a) high speed, (b) occluded, and (c) varying in scales.

Human–automation collaboration is a promising way to solve tasks that are too difficult to be done by human or automation alone. By evaluating the capabilities and limitations of both the human operator and the automation, tasks have been reallocated to the human and the automation.^15,16 For example, a tele-operated UAV with a low-quality camera can navigate in cluttered indoor environments, where the obstacle recognition task is assigned to the human operator and the path planning task to the UAV.¹⁶ In a multi-target surveillance task, the targets are classified as threats and sent to the human operator for confirmation using an object-oriented task model.¹⁵ Considering shared situation awareness, six autonomy levels have been proposed for human–automation teamwork.¹⁷ However, it remains a challenge to adaptively switch between the autonomy levels based on the dynamic evaluation of both the human operator and the automation.¹⁴

In the literature, human-in-the-loop recognition systems have been designed to address the problems of fine-grained visual recognition,¹⁸ semi-supervised clustering,¹⁹ and attribute-based image classification.²⁰ Human annotation is a popular way to provide the machine learning algorithms with high-quality training data.²¹ Human annotators are required to draw bounding boxes on a subset of images to generate training data for active learning.^22,23 In order to reduce the workload of human annotators, bounding boxes are produced automatically by the learning algorithm for human verification. Human efforts have been further amplified to leverage the deep learning algorithm through a partially automated labeling scheme.²⁴ These methods have taken into account the human as an additional source of training data provider. However, the human operators behave differently in different task settings, and the provided data can be unreliable and harmful for the learning algorithm.²⁵ Therefore, the human performance has to be considered from the perspective of human factors,²⁶ separately from the evaluation of machine learning algorithms.

Human factor researchers have used task models to capture the descriptive and normative features of the human–automation systems.^15,27 The performance of the human–automation systems can be guaranteed by integrating automation, human–device interfaces, human task behavior, human cognition, and environmental conditions.^28,29 An enhanced operator function model (EOFM)³⁰ has been proposed to design and analyze human–automation interaction using formal methods.³¹ But constructing the state representation and exploring the state space remain a challenge. Based on a human–robot collaboration model to evaluate the task,^32,33 different human–robot collaboration levels are designed for complex unstructured environment. However, this work has mainly stay on the simulation and has not considered promoting the automatic algorithms by the interaction.

In this article, we propose an adaptive human-in-the-loop multi-target recognition framework that has four collaboration levels considering both the automation and the human factors, as shown in Figure 2. We aim to enable the fast recognition of specific targets such as cars and pedestrians.^4,34 Different from the previous research, we treat the recognition algorithms and human operators as two kinds of independent detectors, which has taken inspiration from the signal detection theory.³⁵ We have integrated the human operators into the task loop to improve the multi-target recognition classifiers through online learning with high-quality training data. The proposed human–automation collaboration model has four autonomy levels that can be adaptively switched in dynamic task environments based on both the workload of the operators and the performance of the classifiers. Our model is generic enough for any multi-target recognition problem by choosing a target type and a multi-target recognition algorithm.

Figure 2.

The overview of the improved human-in-the-loop multi-target recognition framework.

The remainder of this article is organized as follows. “Methods” section introduces the human-in-the-loop multi-target recognition framework and its components in detail, followed by experiments and discussion in “Experiments” section. Finally, conclusion is discussed in the last section.

Methods

Overview

We design four collaboration levels³⁶ to collaborate with the HOG-SVM classifier.³⁷ The collaboration levels are mainly different in the confidence to the operator and the classifier. In the real-time surveillance task, we dynamically switch the collaboration levels depending on the evaluation of the condition of the human operator, the classifier adaption, and the environment, which are quantified by a human–classifier collaboration model.^32,33 The experts score the pretrained classifier offline by testing it with the true or the possible task environment images and label them before our task, based on which we evaluate the adaption of the pretrained classifier in the real task online. We use an eye tracker to monitor the human operator’s abnormal behaviors. The saliency method calculates the number of interesting targets. During the interaction that consists of four collaboration levels, the classifier is trained. The difference between this process and active learning is that we aim to get correct labels for a specific set of scene images rather than learning a generalizable classifier. In other words, we overfit a recognition model to guarantee the precision in a given task without considering generalizability outside the task. It is acceptable at least in a short period of time during the surveillance task, and when the scene changes, we can switch the collaboration level and retrain the classifier.

Four collaboration level

Under the condition of current technology, the UAV systems are not able to be completely autonomous. However, human operators are subject to fatigue, low precision, and obviously different opinions on even the same object.³⁸ By human–classifier collaboration in deep detail, we can make the whole system more efficient. Based on different dependencies on the operator and the classifier, we design an interface with four collaboration levels and choose the SVM classifiers. The collaboration levels are based on four degree of autonomy, which are defined according to Sheridan et al.’s³⁹ scale of “action selection and automation of decision.” Simply, they can be described as the operator recognizes the objects independently, the operator recognizes the objects with the assistance of the classifier, the classifier recognizes the objects with the supervision of the operator, and the classifier recognizes the objects independently.

H: The operator selects the targets using the bounding box compatible with level 1 on Sheridan’s scale. As shown in Figure 3, the orange bounding boxes are drawn by the operator to select the targets.

HR: The operator acknowledges the classifier’s correct detections by clicking, ignores the false detections, and selects the targets missed by the classifier using bounding boxes compatible with levels 3 and 4 on Sheridan’s scale. As shown in Figure 4(a), the green bounding boxes with a question mark are the recommendation given by the classifier. There are missed targets, false detection, and correct detection. The operator identifies them by clicking and draws bounding boxes on the missed ones, as shown in Figure 4(b). The recommended bounding boxes without the operator’s verification are the false ones.

RH: The operator vetoes the classifier’s false detections by clicking (otherwise means acceptance) and selects the targets missed by the classifier using bounding boxes compatible with levels 5–7 on Sheridan’s scale. As shown in Figure 5(a), the blue bounding boxes are the targets selected by the classifier. There are missed targets, false detection, and correct detection, and the operator cancels false detections by clicking and draws bounding boxes on the missed, as shown in Figure 5(b). The selected bounding boxes without the operator’s cancel mean acceptance.

R: The targets are automatically selected by the classifier completely compatible with level 10 on Sheridan’s scale. As shown in Figure 6, the red bounding boxes are drawn by the classifier.

Figure 3.

H collaboration level.

Figure 4.

HR collaboration level. (a) The initial result produced by the classifier and (b) the result after the operator’s participation.

Figure 5.

RH collaboration level. (a) The initial result produced by the classifier and (b) the result after the operator’s participation.

Figure 6.

R collaboration level.

Human–classifier collaboration model

The objective function in the model³² is defined as

V_{I S} = V_{H s} - V_{M s} - V_{F A s} + V_{C R s} - V_{T s}

V_Hs is the gain of hit, V_FAs is the penalty of the false alarm, V_Ms is the penalty of the miss, V_CRs is the gain of the correct rejection, and V_Ts is the cost, including the cost of time and operation. The values above have the same dimension.

Based on the signal detection theory, the object function for each collaboration level V_IS (H), V_IS (HR), V_IS (RH), and V_IS (R) can be transferred into function with variables: human sensitivity $d_{h}^{'}$ , classifier sensitivity $d_{r}^{'}$ , and targets number N × P_S . And we can switch our collaboration level depending on the object function and the evaluation at real time.

Online evaluation

The model is well adapted to a fuzzy evaluation of the condition, the chosen collaboration level is a tendency to promote the system’s efficiency, and each collaboration level is able to accomplish the task alone in the whole process, which lowers the requirements for the evaluation methods. We try to evaluate the three variables online and dynamically switch our collaboration level to adapt to complicated changing task environment.

Evaluating the classifier sensitivity: The adaption of the pretrained classifier

In a multi-object recognition task, collected small samples are usually insufficient to train an accurate classifier at the early stage of the task. We have to use a pretrained classifier based on another big data set (with the same type of targets). The results are usually not satisfactory for the complicated environment and huge scale changing.

Figure 7(a) is the result tested in PETS09-S2L1 by an HOG-SVM classifier trained by 1200 negative samples and 2400 positive samples in the INRIA Person data set. Figure 7(b) is the result tested in ADL-Rundle-6 data set by the HOG-SVM classifier trained by enough samples in INRIA Person data set.

Figure 7.

The results of a pretrained classifier.

Small samples in the task environment can provide us with the information of the adaption of our pretrained classifier. So, we can choose the most possible or actual task scene images to pretest our classifier and we can label them. In the real task environment, we make use of this information.

As shown in Figure 8, an experienced operator scores the classifier by watching the marking results and uses proper Canny detectors⁴⁰ to label them, building a connection between the score and the scene. The score is transferred into a specific value of $d^{'}_{r}$ from 0 to 3,⁴¹ and we can evaluate the classifier online when faced with real similar task scenes.

Figure 8.

The process of evaluating the classifier sensitivity.

Table 1 shows the results of the expert’s score on the HOG-SVM pretrained in the INRIA Person data set. The testing eight data sets consist of six data sets from MOT Challenge⁴² data set and two data sets collected by ourselves. The second column is the canny labels. It indicates that our method is able to distinguish different scene images with specific score.

Table 1.

The score and labels for different task scene images.

Data set	Canny/mean	The expert score
ADL-Rundlee-6	18–20	3
PETS09-S2L1a TUD-Stadtmitte ETH-Sunny day PETS09-S2L1b AVG-Town Centre The canteen The playground	>37 34–37 27–35 22–23 12–14 >78 38–50	2 4 3 3 3 1 1

Evaluating the operator’s sensitivity: Monitor human operator’s abnormal event by an eye tracker

It is difficult to accurately evaluate the state of the operator. However, we can find the abnormal events by monitoring the operator’s eye fixation, for example, the operator is fatigue or absent. We can score the human operator’s state online.

Figure 9(a) is the Tobii eye tracker we used, and Figure 9(b) is the eye closing time test and the four abnormal points illustrate a closing eye event. It will make $d^{'}_{h}$ very small at a short time, and our collaboration level will switch to RH or R which gives more confidence to the classifier.

Figure 9.

The Tobii eye tracker and eye closing time detection (four abnormal points).

Estimating the number of targets: Calculate the number of interesting areas based on saliency

The number of the possible targets is also a great factor that affects the choosing collaboration level. The saliency detection method^43,44 is able to generally find the targets of interest which provides us with important information. Compared with the fast Fourier transform method, image signature method⁴⁵ uses discrete cosine transform to transform the image into frequency domain to detect salient regions and speedup. The results of the prediction tested on AVG-Town Centre data set are shown in Figure 10, meeting our demand.

Figure 10.

The results of saliency detection. The circles in (a) represent the possible targets, and the red point in (b) is the number of interesting areas detected by the SIG method at real time, which can be used to predict the number of targets. SIG: image signature.

Online learning using the data collected from human–classifier interaction

During the period of interaction, we collect positive samples, negative samples, and hard examples (the wrong classifier results corrected by the operator) to retrain our classifier online, which improves the classifier, especially when the scenes are generally stable. The scale of the targets is an important factor that affects the process of feature extraction. And the information of the bounding box made by the human operator or the classifier’s correct bounding box provides useful scale criterion information.

We try to collect precise positive samples and especially negative samples as many as possible. Our experiments have proved that the lack of negative samples will bring lots of unpalatable false guides. Our strategies are shown in Figure 11 as follows:

Positive samples: The green (classifier recognition correctly) and yellow bounding box (the missed targets selected by the operator using bounding boxes) in the image. The bounding box restrains the scale of the positive samples.

Hard examples: The green bounding box with red circle (the classifier’s false detection canceled by the operator.

Negative samples: The red bounding boxes (the similar size bounding box between all the positive samples and the hard examples). We automatically collect negative samples by traversing the area between the positive samples and the hard examples with the same scale of the positive samples in the image.

Figure 11.

The collecting of samples in HR and RH levels.

We chose 10 images from PETS09-S2L1 data set and 10 images from our own playground data set. The pedestrians in the two data set are in different scale. We interact with 20 images from the two data sets to train the SVM classifier and tested it with 20 images, respectively. The result meets the requirement in stable period in the task. The results are shown in Tables 2 and 3.

Table 2.

The test results in the PETS09-S2L1 data set.

	HOG-SVM	Our methods
Total targets	60	60
True positives False positives False negatives Precise Recall	29 1 31 0.97 0.48	33 0 27 1 0.55

Table 3.

The test results in the playground data set.

	HOG-SVM	Our methods
Total targets	84	84
True positives False positives False negatives Precise Recall	26 8 58 0.76 0.31	72 7 12 0.91 0.86

Note. The bold values show our method are better in precise and recall than HOG-SVM method.

In the PETS09-S2L1 data set, our method performs a little better than the pretrained SVM. As the target scale changes, for example, in the playground data set, our method keeps the good performance which is much better than the pretrained SVM with the same parameter setting. In other words, our method shows a good adaption to different scenes.

Experiments

We have conducted two experiments with 16 participants in total. The first experiment with five participants aims to validate the effectiveness of switching method. Using the eye tracker, the second experiment with 11 participants aims to validate that the learning during the interaction obviously promotes the ability of the classifier and meanwhile the whole system.

Experiment 1

Data set and process

We choose pedestrians’ recognition as our task to validate the proposed method because the samples of the pedestrian with a wide variety are easily available. The images are from six classic pedestrian detection data sets, such as ADL-Rundle-6, ETH-Sunnyday, PETS09-S2L1, PETS09-S2L2, TUD-Sadtmitte, and AVG-TownCentre, included in the MOTChallenge data set. In addition, we collect samples in the playground and canteen to make another two data sets, which increases the diversity in obscuration and shelter, as shown in Figure 12.

Figure 12.

Images from used data sets. Six data sets are from the MOTChallenge data set and two data sets are prepared by ourselves.

Before starting the experiment, each operator experiences the four collaboration levels for total 1 min and the images are non-experimental images but from the same data set. Our 40 experimental images are various with complicated background and many are shot from a height to meet the simulation demand of the UAV’s surveillance task situation. The interval time between each image is 7 s and the operator tries to recognize all the targets in the H, HR, RH, R, and our switching method. The total time is about 5 min for each operator and the interval for each operator between different levels is 20 min to remove their memory on the images.

Results and discussion

The HOG-SVM classifier is trained on the INRIA Person data set and we test the classifier with 40 our experimental images. The results are shown in Table 4.

Table 4.

The results of the HOG-SVM classifier tested in our data set.

Operator	Accurate rate	Mistake rate	Miss rate
R	0.64	0.10	0.36

There are totally 371 targets and we count the number of the correct recognition, the false recognition, and the missing targets of each operator by an expert with the same standard. And the results are shown in Table 5. Table 5(a) shows the number of correct recognition, Table 5(b) shows the number of the false recognition, Table 5(c) shows the number of the missed targets, and Table 5(d) is the comparison in the metric of precision and recall; 1–5 are five operators and H, HR, RH, and R are different collaboration levels. Switching is our proposed method.

Table 5.

The results of the targets recognition.

	H	HR	RH	R	Switching
(a) The number of correct recognition
1	251	279	360	238	338
2	330	313	334	238	322
3	320	329	357	238	338
4	235	278	305	238	305
5	281	288	297	238	277
(b) The number of false recognition
1	8	7	19	38	11
2	10	9	19	38	15
3	3	5	18	38	14
4	6	20	6	38	6
5	6	4	12	38	9
(c) The number of missed targets
1	120	92	11	133	33
2	41	58	37	133	49
3	51	42	14	133	26
4	136	92	65	133	66
5	90	83	74	133	94
(d) Collaboration level	Precision	Recall
H	0.977	0.764
HR	0.970	0.802
RH	0.958	0.891
R	0.862	0.642
Switching	0.967	0.852

Note. The bold values show our switching method and RH collaboration level are much better in correct recognition and missed targets. In all, both of them perform much better in recall with a little lower precise.

We count the average number over five operators, and the results are shown in Figure 13. Compared with the H and R levels, we can see that human operator is much better than the classifier not only in the Precision but the Recall. Particularly, we set the limit that the interval between each image is 7 s, which increases the missed targets of the operator and decreases his recall. However, with the assistance of the classifier (collaboration of HR and RH), we can see obvious increase in the hit number and the decrease in the missed number in a limited time, with a little more false recognition.

Figure 13.

The average results. (a) to (c) The average hit numbers, false recognition numbers, and miss numbers among the five operators, respectively. The x-coordinate is the five different collaboration levels and switching is our new method. The y-coordinate is the numbers counted.

In addition to R, RH performs the best in the hit number and the missed number but the worst in false recognition. H performs the best in the false recognition but the worst in the hit number and the missed number. HR is a great compromise and R is able to obviously decrease the operator’s workload, and in some specific scene that the classifier is confident, it does perform well. We see the necessity of switching.

In the hit number and miss number, RH and switching method are highly better and switching method performs better than HOR in false rate. In terms of the recall, HOR and switching method are highly better. Our switching method increases the hit rate by 30% and decreases the miss rate by 7%, with the miss rate by 21% compared with the single SVM classifier recognition. It increases the hit rate by 9% and decreases the miss rate by 10% with a cost of 1% increase in false rate compared with a single operator. Among all the metrics, our switching method is a good compromise and reflects great adaption.

Experiment 2

Data set and process

We choose 10 images from PETS09-S2L1 data set and 10 images from our own playground data set. There are 11 participants (i.e. 4 females and 7 males) recruited for this experiment. The participants’ age ranges from 24 to 30. There are three teams of experiments, E1, E2, and E3, with an eye tracker recording the points of fixation. The participants are asked to “recognize all the pedestrian targets with or without the aiding of the classifier” and they use the spacebar to control if one image is finished.

E1: The participant acts in the H collaboration level by clicking the targets without the aiding of the classifier to recognize all the targets in 20 images.

E2: The participant acts in the RH collaboration level with a pretrained HOG-SVM classifier to recognize all the targets in 20 images.

E3: The participant acts in the RH collaboration level with HOG-SVM classifier trained by data from human–classifier interaction to recognize all the targets in 20 images.

There are six participants conducting the experiments in the order of E1, E2, and E3 and five conducting in the order of E3, E2, and E1. The interval between each experiment for the operator is 20 min. The images showed in the experiment are random.

Results and discussion

The average operation time of E1, E2, and E3 is 138.18, 122.55, and 99.73 s, respectively, as shown in Figure 14. The operation with classifier’s aiding performed better than single operation. And our online learning method performs the best for its classifier provided more effective aiding than E2, for example, the less false recognition and more correct hit.

Figure 14.

The average operation time. The x-coordinate reflects the three different operations and the y-coordinate reflects the average time among the 11 operators.

For more detailed analysis, we set interesting areas around all the pedestrians’ targets, as shown in Figure 15, to analyze two metrics: first fixation time and fixation time.

Figure 15.

The interesting areas setting and the recording of the fixation. The colorful rectangles are interesting area set and the purple circles with numbers are the fixation points recorded by the Tobii X120 eye tracker in chronological sequence. The size of the circles reflects the fixation time.

First fixation time is the time when the fixation first enters into the interesting areas. The mean of first fixation time reflects the average time when the operator first fixes on every interesting area in one test image. The minimum of the first fixation time reflects the time when the operator first fixes on any interesting area in one test image.

Fixation time is the operator’s fixation duration in each interesting area.

We count the number of times that one collaboration level defeating the others in the 20 test images. For two participants’ eye tracker data are not complete, we analyze nine participants’ data. What’s more, we increase both operations’ defeating times if they had the same metric value, so the total times may not be exactly 20 for some operators. The results are shown in Figures 16 and 17, and the average number of times among the nine operators is shown in Tables 6 and 7.

Figure 16.

The number of times that one collaboration level defeating the others in the minimum of the fist fixation in interesting areas. The x-coordinate is the operator 1–9 and the y-coordinate is the times defeating the others among 20 experimental images.

Figure 17.

The number of times that one collaboration level defeating the others in the mean of the fist fixation in interesting areas. The x-coordinate is the operator 1–9 and the y-coordinate is the times defeating the others among 20 experimental images.

Table 6.

The average number of times that one collaboration level defeating the others in the minimum of the fist fixation in interesting areas.

Operator	E1	E2	E3
Average number of times	7.1	6.4	8.3

Table 7.

The number of times that one collaboration level defeating the others in the mean of the fist fixation in interesting areas.

Operator	E1	E2	E3
Average times	3.2	7.3	9.4

In terms of the first fixation time, on average times, E3 performs the best. As for the minimum of the first fixation in interesting areas, which reflects the time that the operator uses to find the first target. E1 performs better than E2 (7.1 times > 6.4 times), and the advantage of E3 is not obvious (E3 = 8.3 times > E1 = 7.1 times) because both E2 and E3 have false alarms, but E2 provides more, which gives a false guide. What’s more, the images with a lot of bounding boxes are more confusing. But the goal of E1 is more simple and straight: finding the pedestrian targets at the very beginning. However, calculated in the mean of the first fixation in all the interesting areas, E2 and E3 have made great progress compared with E1 (E3 = 9.4 > E2 = 7.3 > E1 = 3.2).

The average fixation duration time for each operator is shown in Table 8. The average fixation time for E1, E2, and E3 is 0.38, 0.31, and 0.27 s, respectively. Our online learning method performs faster.

Table 8.

The fixation time in each interesting area (seconds) among 20 images.

Operator	E1	E2	E3
1	0.44	0.36	0.31
2	0.26	0.24	0.28
3	0.37	0.28	0.25
4	0.19	0.28	0.29
5	0.37	0.32	0.26
6	0.35	0.20	0.24
7	0.48	0.39	0.29
8	0.55	0.37	0.32
9	0.43	0.32	0.23
Average	0.38	0.31	0.27

After the experiments, we calculate the fatigue degree based on a questionnaire. The questionnaire consists of four options: very easy, easy, tired, and very tired, to evaluate the three operation manners: E1, E2, and E3. When the participants finish all the experiments, they are asked to choose an evaluation option for each experiment immediately. We count the total choice of the participants, as shown in Table 9. As a whole, we can see the obvious decrease in the workload with the aiding of our online learning method.

Table 9.

The number of choices on each option among 11 participants.

	Very easy	Easy	Tired	Very tired
E1	3	4	4	1
E2	1	7	3	0
E3	2	8	1	0

Note. The bold value shows the obvious decrease in the workload with the aiding of our online learning method.

Conclusion

We have proposed a human–classifier collaboration framework to recognize multi-target in real time. The framework is easy to build, having improved the performance of the whole system and reduced the workload of the operator. Particularly, our framework improves the current state-of-the-art object detectors by learning with necessary human supervision, providing a good solution to the problem of insufficient high-quality samples in dynamic and complex tasks. In other words,

We have designed an adaptive human-in-the-loop multi-target recognition framework which is generic for any given tasks.

The proposed framework performs better than previous methods in the condition of small samples.

Online learning can further improve the classifier with samples collected from human–classifier interaction.

Whenever the classifier makes a mistake, this “hard example” can be used to improve the classifier through learning.

Two experiments with total 16 participants show the practical effectiveness. The experimental results demonstrate that our switching method increases the hit rate by 30% and decreases the false miss rate by 7%, the miss rate by 21% compared with single SVM classifier recognition. Compared with a single operator, the hit rate is increased by 9% and the miss rate is decreased by 10% with a cost of 1% increase in false rate. In addition, with the help of an eye tracker, we find that online learning based on the interactive collaboration is able to obviously promote the ability of the classifier and the system meanwhile.

There are still several interesting research directions worth investigating. For switching the collaboration levels, the existence of R is rational because it can remarkably reduce the operator’s work load. However, we cannot trust the automation completely without the precise evaluation of the task. So if we don’t have a perfect classifier of evaluation, we may directly replace the R with RH, which is a mode that gives much trust to the classifier under the supervision of an operator and to transfer the number of our collaboration levels from four to three. In addition, how to promote the real-time performance of the learning method and reduce the operator’s workload in the process is still a problem. Finally, other state-of-the-art recognition, collaboration, evaluation, and learning methods can be tested in the proposed framework.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by National Natural Science Foundation (NNSF) of China (grant no. 61403410, no.61773393, no. 61573371 and no. 61503403).

References

Richler

Wilmer

Gauthier

. General object recognition is specific: evidence from novel and familiar objects. Cognition 2017; 166: 42–55.

Rozantsev

Lepetit

Fua

. Detecting flying objects using a single moving camera. IEEE Trans Pattern Anal Mach Int 2017; 39(5): 879–892.

Burkert

Butenuth

. Complex event detection in pedestrian groups from UAVs. ISPRS Ann Photogramm Remote Sens Spatial Inf Sci I 2012; 3: 335–340.

Guo

Sprague

. Replication of human operators situation assessment and decision making for simulated area reconnaissance in war games. J Defense Model Simulat 2015; 2012(2012): 285–286.

Ahsan

Sun

Hays

. Complex event recognition from images with few training examples. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017, pp. 669–678. IEEE.

Zhu

Chen

Zhao

. Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN. Sensors 2017; 17(7): 1694.

Krizhevsky

Sutskever

Hinton

. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012, pp. 1097–1105.

Wang

. Spatiotemporal recurrent convolutional networks for traffic prediction in transportation networks. Sensors 2017; 17(7): 1501.

Cooper

Goodrich

. Towards combining UAV and sensor operator roles in UAV-enabled visual search. In: Proceedings of the 3rd ACM/IEEE international conference on Human robot interaction, 2008, pp. 351–358. ACM.

10.

Russakovsky

Fei-Fei

. Best of both worlds: human-machine collaboration for object annotation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2121–2131.

11.

Yang

Zhao

Zhou

. Sparsity-driven SAR imaging for highly maneuvering ground target by the combination of time-frequency analysis and parametric Bayesian learning. IEEE J Select Topic Appl Earth Observ Remote Sens 2016; PP(99): 1–13.

12.

Qiu

Wen

Fan

. Occluded object detection in high-resolution remote sensing images using partial configuration object model. IEEE J Select Topic Appl Earth Observ and Remote Sens 2017; PP(99): 1–17.

13.

Wang

Guo

Huang

. Knowledge guided disambiguation for large-scale scene classification with multi-resolution CNNs. IEEE Trans Image Proc Pub IEEE Signal Proc Soc 2017; 26(4): 2055–2068.

14.

Tkach

Bechar

Edan

. Switching between collaboration levels in a human–robot target recognition system. IEEE Trans Syst Man Cybern C 2011; 41(6): 955–967.

15.

Greef

Arciszewski

HFR

Neerincx

. Adaptive automation based on an object-oriented task model: implementation and evaluation in a realistic C2 environment. J Cogn Eng Decis Mak 2010; 4(2): 152–173.

16.

Johnson

. Coactive design: designing support for interdependence in joint activity. Elect Eng Math Comput Sci 2014; 3(1):43–69.

17.

Scientist of U.S.A.F.O.O.T.C. Autonomous Horizons., the US Air Force, June 2015; 4(1): 9–15.

18.

Wah

Van Horn

Branson

. Similarity comparisons for interactive fine-grained categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 859–866.

19.

Lad

Parikh

. Interactively guiding semi-supervised clustering via attribute-based explanations. In: European conference on computer vision, Springer, Cham, 2014, pp. 333–349.

20.

Biswas

Parikh

. Simultaneous active learning of classifiers & attributes via relative feedback. In: Computer vision and pattern recognition (CVPR), 2013 IEEE conference, 2013, pp. 644–651. IEEE.

21.

Siddiquie

Gupta

. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In: Computer vision and pattern recognition (CVPR), 2010 IEEE conference, 2010, pp. 2979–2986. IEEE.

22.

Vijayanarasimhan

Grauman

. Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision 2014; 108(1–2): 97–114.

23.

Yao

Gall

Leistner

. Interactive object detection In: Computer vision and pattern recognition (CVPR), 2012 IEEE conference, 2012, pp. 3242–3249. IEEE.

24.

Song

FYYZS

Xiao

ASJ

. Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

25.

Huber

Wellig

. Human factors of target detection tasks within heavily cluttered video scenes. In: Target and background signatures. International Society for Optics and Photonics, 2015; 9653: 96530R.

26.

van Maanen

Lindenberg

Neerincx

. Integrating human factors and artificial intelligence in the development of human-machine cooperation. In: Proc. of the 2005 international conference on artificial intelligence (ICAI’05), 2005.

27.

Bolton

Siminiceanu

Bass

. A systematic approach to model checking human–automation interaction using task analytic models. IEEE Trans Syst Man Cybern A Syst Humans 2011;41(5):961–976.

28.

Hussmann

Meixner

Zuehlke

. Model-driven development of advanced user interfaces. Stud Comput Int 2011; 340: 4429–4432.

29.

Degani

Gellatly

Heymann

. HMI aspects of automotive climate control systems. In: 2011 IEEE international conference on Systems, man, and cybernetics (SMC), 2011, pp. 1795–1800.

30.

Bolton

Bass

. Enhanced Operator Function Model (EOFM): A task analytic modeling formalism for including human behavior in the verification of complex systems. In: The handbook of formal methods in human-computer interaction. Springer, Cham, 2017, pp. 343–377.

31.

Wing

. A specifier’s introduction to formal methods. Computer 1990; 23(9): 8–22.

32.

Bechar

Meyer

Edan

. An objective function to evaluate performance of human–robot collaboration in target recognition tasks. IEEE Trans Syst Man Cybern C 2007; 39(6): 611–620.

33.

Oren

Bechar

Edan

. Performance analysis of a human–robot collaborative target recognition system. Robotica 2012; 30(5): 813–826.

34.

Piciarelli

Micheloni

Foresti

. Trajectory-based anomalous event detection. IEEE Trans Circ SystVideo Technol 2008; 18(11): 1544–1554.

35.

Nevin

. Signal detection theory and operant behavior: a review of David M. Green and John A. Swets’ signal detection theory and psychophysics. Journal of the Experimental Analysis of Behavior 1969; 12(3): 475–480.

36.

Bechar

Edan

. Human–robot collaboration for improved target recognition of agricultural robots. Indus Robot 2003; 30(5): 432–436.

37.

Dalal

Triggs

Schmid

. Human detection using oriented histograms of flow and appearance. In: European conference on computer vision, Springer, Berlin, Heidelberg, 2006, pp. 428–441.

38.

Murphy

Pratt

Burke

. Crew roles and operational protocols for rotary-wing micro-UAVs in close urban environments. In: Proceedings of the 3rd ACM/IEEE international conference on Human robot interaction, 2008, pp. 73–80. ACM.

39.

Sheridan

Verplank

. Human and computer control of undersea teleoperators. Massachusetts Inst of Tech Cambridge Man-Machine Systems Lab, 1978.

40.

Canny

. A computational approach to edge detection. Readings in Computer Vision, 1987, pp. 184–203.

41.

Wickens

Hollands

Banbury

. Engineering psychology & human performance. Psychology Press, 2015.

42.

Leal-Taixé

Milan

Reid

. Motchallenge 2015: towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.

43.

Itti

Koch

Niebur

. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Int 1998; 20(11): 1254–1259.

44.

Zhang

Dai

Porikli

. Deep edge-aware saliency detection. arXiv preprint arXiv:1708.04366, 2017.

45.

Levine

. Saliency detection based on frequency and spatial domain analysis. Neuroscience 2005; 8(8): 975–977.