Abstract
Camera calibration error, vision latency, and nonlinear dynamics present a major challenge for designing the control scheme of a visual servoing system. Although many approaches to visual servoing have been proposed, surprisingly, only a few of them take system dynamics into account in the control design. In addition, the depth information of feature points is essential in the image-based visual servoing architecture. To cope with these problems, this article proposes a Kalman filter-based depth and velocity estimator and a modified image-based dynamic visual servoing architecture that takes system dynamics into consideration in its control design. In particular, the Kalman filter is exploited to deal with the problems caused by vision latency and image noise so as to facilitate the estimation of the joint velocity of the robot using image information only. Moreover, in the modified architecture, the computed torque control scheme is used to compensate for system dynamics and the Kalman filter is used to provide accurate depth information of the feature points. Results of visual servoing experiments conducted on a two-degree-of-freedom planar robot verify the effectiveness of the proposed approach.
Keywords
Introduction
As the computing power of CPUs continues to increase and computer technology keeps improving, visual servoing has enjoyed huge success in many applications since the debut of the renowned tutorial paper by Hutchinson et al. in 1996.1 In general, there are two basic visual servoing architectures—image-based visual servoing (IBVS) and position-based visual servoing (PBVS).1–5 Although visual servoing systems have many attractive features, their performance is hindered by issues such as camera calibration error, nonlinear dynamics, and vision latency. Although many approaches to visual servoing have been proposed,6–12 only a few of them take system dynamics into account in the control design.6,7,11 For a robotic system with highly nonlinear dynamics, control performance will not be satisfactory unless the nonlinear dynamics of the system are carefully dealt with. In the work by Corke and Good,6,7 the dynamics of a visual servoing system are investigated and the idea of feedforward control is exploited to cope with the vision latency problem. To ameliorate the poor dynamic response caused by the low sampling rate of visual servoing applications, some researchers have exploited an acceleration command computed directly from image information.13,14 The image-based dynamic visual servoing (IBDVS)13 architecture is a modified version of the classical IBVS architecture. In the IBDVS architecture, the velocity loop of the robot controller adopts the computed torque control (CTC) scheme,15 whereas the classical IBVS architecture adopts a conventional feedback-type velocity loop. Since the CTC scheme contains a feedforward compensation term, it is not surprising that the IBDVS architecture yields better control performance than the classical IBVS architecture. A similar idea for IBDVS was also proposed by Keshmiri et al.14 However, the IBDVS architecture only provides the desired joint acceleration command to the CTC scheme; the desired joint angle command and the desired joint velocity command are completely ignored.
In addition, the depth values of feature points are essential in calculating the image Jacobian when implementing the IBVS architecture. One of the easiest methods for estimating the depth values of feature points is to use a binocular camera and the concept of disparity16 and/or epipolar constraints.17 However, this kind of approach lacks robustness and computational efficiency since two image planes are involved in the calculation. In addition to these disparity/epipolar constraint-based approaches, the nonlinear observer-based approach and the virtual visual servoing approach18 can also be employed to estimate the depth values of feature points.19,20 Generally, these two approaches provide good depth estimation results as long as image measurements are accurate and their noise levels are very low. However, in practice, image noise cannot be ignored; as such, the accuracy of depth estimation achieved by these approaches may not be consistent.
It is well known that the Kalman filter21–24 is capable of dealing with dynamic systems subject to noise and of providing good predictions of system states. Consequently, to alleviate the effects of image noise and vision latency encountered in the depth estimation process when implementing the image Jacobian, this article proposes a Kalman filter-based depth and velocity estimator that exploits the concepts of virtual visual servoing and the Kalman filter. Furthermore, as mentioned previously, when implementing the CTC scheme in the original IBDVS architecture, only the desired acceleration command is used, which is not the usual way to implement the CTC scheme. Therefore, in this article, the desired joint velocity command and the desired joint angle command are used in addition to the desired joint acceleration command when implementing the CTC scheme. The resulting modified image-based dynamic visual servoing architecture is called MIBDVS in this article. Several experiments have been conducted on a two-degree-of-freedom (2-DOF) planar manipulator to assess the performance of the proposed Kalman filter-based depth and velocity estimator and the proposed MIBDVS architecture.
According to the above literature review and analysis, the main contributions of this article are summarized in the following. By employing the Kalman filter to cope with image noise, the proposed Kalman filter-based depth and velocity estimator outperforms the one that does not use the Kalman filter. In addition, the proposed Kalman filter-based approach can be employed to estimate the joint velocity of the robot using image information only. By exploiting the desired joint angle command, the desired joint velocity command, and the desired joint acceleration command in the implementation of the CTC scheme, the proposed MIBDVS architecture exhibits better tracking performance than the classical IBVS architecture.
The remainder of the article is organized as follows. The second section briefly reviews the camera model and the IBVS architecture. The third section proposes the Kalman filter-based depth and velocity estimator that can be used to estimate object depth as well as joint velocity. The fourth section introduces the proposed modified image-based dynamic visual servoing architecture. Experimental results and conclusions are given in the fifth and sixth sections, respectively.
Brief review on camera model and classical visual servoing architectures
Brief review on camera model and camera parameters
Perspective projection (i.e. the pin-hole model)25 is adopted in this article. In order not to obtain an inverted image, a virtual image plane located in front of the optical center (i.e. between the optical center and the scene) is used.
The values of intrinsic camera parameters can be obtained by performing camera calibration. 26,27
Brief review on classical IBVS architectures
The eye-to-hand camera configuration3 is adopted in this article.
Based on the type of feature used, classical visual servoing architectures can generally be divided into two categories—PBVS and IBVS. This article focuses on IBVS. Figure 1 shows the control block diagram of a classical IBVS architecture. In Figure 1,

Control block diagram of a classical IBVS architecture. IBVS: image-based visual servoing.
The relationship between the time derivative of the image feature point and the velocity screw of the camera frame is described by
where Le is the so-called image Jacobian matrix (i.e. the interaction matrix) and
If the goal is to exponentially converge the image feature error, then a proportional-type controller can be used; that is
Suppose that the desired image feature vector is constant; that is,
From equations (3) and (4), one will have
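Although the equation images are not reproduced in this text, the standard IBVS relations that equations (3) to (5) follow can be sketched as below (a reconstruction in the common tutorial notation; s denotes the image feature vector, s* its constant desired value, e = s − s* the feature error, vc the velocity screw of the camera frame, and λ > 0 a control gain):

```latex
\dot{\mathbf{s}} = \mathbf{L}_e \mathbf{v}_c, \qquad
\mathbf{v}_c = -\lambda \widehat{\mathbf{L}}_e^{+} \mathbf{e}
\quad\Longrightarrow\quad
\dot{\mathbf{e}} = \dot{\mathbf{s}} = -\lambda \mathbf{L}_e \widehat{\mathbf{L}}_e^{+} \mathbf{e}
```

When the estimated interaction matrix is close to the true one, the product of the matrix and its estimated pseudoinverse is close to the identity and the feature error decays approximately exponentially.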
The derivation of the image Jacobian matrix is elaborated in the following. A three-dimensional (3D) feature point
Differentiating equation (6) with respect to time will yield
Suppose that this 3D point undergoes a rigid body motion. One will have
where
Developing equation (8) will yield
Substituting equation (9) into equation (7) and rearranging terms will result in
Equation (10) can be further expressed as
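For a single point feature, the interaction matrix obtained from this derivation takes the well-known form below (quoted from the standard tutorial literature rather than transcribed from equation (11); x and y are normalized image coordinates and Z is the depth of the point):

```latex
\mathbf{L}_e =
\begin{bmatrix}
-\frac{1}{Z} & 0 & \frac{x}{Z} & xy & -(1+x^{2}) & y\\[2pt]
0 & -\frac{1}{Z} & \frac{y}{Z} & 1+y^{2} & -xy & x
\end{bmatrix}
```

Note that the depth Z appears in the first three columns, which is why accurate depth estimation is essential to the IBVS architecture.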
Depth and velocity estimation based on Kalman filter and virtual visual servoing
The image Jacobian matrix described by equation (10) consists of five parameters—
The idea of virtual visual servoing proposed by Marchand and Chaumette18 was originally used in augmented reality applications. Since the virtual image must appear at the correct position in the real scene, the relationship between the camera frame and the real object is crucial; that is, the calibration accuracy of the extrinsic camera parameters is very important. The concept of virtual visual servoing13 is illustrated in Figure 2 and will be elaborated in the next subsection.

Concept of virtual visual servoing.
Pose and velocity estimation based on virtual visual servoing
In Figure 2,
The image feature error e vir between m and m* is defined as
If the goal is to exponentially converge e vir, one can let
Substituting equation (13) into equation (14) will yield
Substituting equation (12) into equation (15) will yield
From equation (16), one will have
As shown in Figure 2, the rigid transformation
Note that
As illustrated in Figure 2, after the time duration ti − t0 had passed, the original image point m moved to the new image point
One interesting application of virtual visual servoing is that it can be used to estimate the velocity of the actual object point Pj. The idea is to integrate the virtual velocity screw
To improve the depth estimation accuracy, the acceleration information of the virtual object point Po in the camera frame can be taken into consideration. 13 Detailed derivations are provided in the following.
Suppose that the virtual object point Po undergoes a rigid body motion, 29 one will have
To obtain the acceleration term, one can differentiate equation (21) with respect to time to get
Suppose that the sampling time Δt is very small. The velocity information of the virtual object point Po in the camera frame at time instant t0 + Δt can be approximated as
Substituting equations (21) and (22) into equation (23) will yield
Equation (24) can be rewritten as
After some manipulations, equation (25) can be further expressed as
Equation (26) can be expressed in matrix form as
Equation (27) describes the relationship between the velocity
With the consideration of the acceleration term, equations (18) and (19) can be rewritten as
Depth and velocity estimation based on Kalman filter and virtual visual servoing
Considering the fact that the captured image often contains noise and there are limitations on computational resources and camera sampling rate, this article proposes a depth and velocity estimator that combines the Kalman filter with the virtual visual servoing technique so as to reduce noise effects and also improve estimation accuracy. Figure 3 shows the schematic diagram of the proposed depth and velocity estimator.

Schematic diagram of the proposed depth and velocity estimator based on Kalman filter and virtual visual servoing.
The discrete-time state equation and output equation of a typical dynamic system can be expressed as
where X(k) is the state vector, U(k) is the input vector, and Y(k) is the output vector; ξ(k) is the process noise vector and η(k) is the measurement noise vector; and Ad, Bd, and Cd are constant matrices of proper dimensions. In this article, the process noise vector ξ(k) is assumed to be a zero vector.
The position and velocity of the actual object point Pj in the camera frame are defined as the state variables X(k) in equation (33). In addition, the acceleration of the actual object point Pj in the camera frame is defined as the input U(k) (equation (33)) to the system
Ad and Bd in equation (31) are described by the following equation
That is
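Given that the state vector collects the position and velocity of point Pj and the input is its acceleration, Ad and Bd presumably take the standard zero-order-hold discretization of a double integrator (a reconstruction from the stated state/input definitions; Δt is the sampling time and I the 3 × 3 identity matrix):

```latex
A_d = \begin{bmatrix} I & \Delta t\, I \\ 0 & I \end{bmatrix}, \qquad
B_d = \begin{bmatrix} \tfrac{1}{2}\Delta t^{2}\, I \\ \Delta t\, I \end{bmatrix}
```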
In the following, we will determine the transformation matrix Cd between the system states
The Kalman filter-based depth and velocity estimator is implemented using equations (33)–(37)
where K(k) is the Kalman filter gain matrix and Σ(k) is the covariance matrix for the state estimate
In this article, the covariance matrix R for the measurement noise η(k) is determined in a trial-and-error manner, whereas the covariance matrix Q for the process noise ξ(k) is set to a null matrix in equation (37). The proposed depth and velocity estimator that combines the Kalman filter with the virtual visual servoing technique is easy to implement. It is used in the proposed MIBDVS architecture, investigated in the next section, to estimate the parameter values of the interaction matrix. It is worth noting that the virtual visual servoing technique exploits the idea of IBVS and therefore inherits its drawbacks as well. For instance, if the straight line that passes through the real object point and the virtual object point is parallel to the optical axis, then their corresponding image points on the image plane will coincide. In this case, it is impossible to exploit the error between these two image points to estimate the position/velocity of the real object point. Nevertheless, the user can choose the initial position of the virtual object point to avoid such a case.
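As a minimal illustration of how such an estimator can be implemented, the following sketch runs position/velocity estimation with a discrete Kalman filter whose process-noise covariance Q is null, as in this article. The model is one-dimensional for clarity, and all numerical values (noise levels, frame rate, acceleration) are illustrative assumptions, not the experimental settings:

```python
import numpy as np

def make_model(dt):
    """Discretized double-integrator: state = [position; velocity]."""
    Ad = np.array([[1.0, dt],
                   [0.0, 1.0]])
    Bd = np.array([[0.5 * dt**2],
                   [dt]])
    Cd = np.array([[1.0, 0.0]])  # only position is measured (from the image)
    return Ad, Bd, Cd

def kalman_step(x, P, u, y, Ad, Bd, Cd, Q, R):
    """One predict/update cycle of the discrete Kalman filter."""
    # Prediction (this article sets the process-noise covariance Q to zero)
    x_pred = Ad @ x + Bd @ u
    P_pred = Ad @ P @ Ad.T + Q
    # Update with the (noisy) measurement y
    S = Cd @ P_pred @ Cd.T + R
    K = P_pred @ Cd.T @ np.linalg.inv(S)          # Kalman filter gain
    x_new = x_pred + K @ (y - Cd @ x_pred)
    P_new = (np.eye(len(x)) - K @ Cd) @ P_pred    # covariance of the estimate
    return x_new, P_new

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dt = 1.0 / 60.0                               # 60 Hz camera frame rate
    Ad, Bd, Cd = make_model(dt)
    Q = np.zeros((2, 2))                          # null process-noise covariance
    R = np.array([[0.05**2]])                     # measurement noise, trial and error
    x_est = np.array([[0.5], [0.0]])              # deliberately wrong initial guess
    P = np.eye(2)
    x_true = np.zeros((2, 1))
    u = np.array([[0.5]])                         # constant true acceleration
    for _ in range(300):
        x_true = Ad @ x_true + Bd @ u             # simulate the true motion
        y = Cd @ x_true + rng.normal(0.0, 0.05, (1, 1))
        x_est, P = kalman_step(x_est, P, u, y, Ad, Bd, Cd, Q, R)
    print(abs(x_est[1, 0] - x_true[1, 0]))        # velocity estimation error
```

In the actual estimator, the measurement would be the image-derived position of point Pj in the camera frame and the input the acceleration term provided by virtual visual servoing.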
Dynamic visual servoing
Dynamic model of a 2-DOF planar robot manipulator and CTC
The dynamic model of a 2-DOF planar robot manipulator can be described by
where τ is the 2 × 1 torque vector; M(q) and
Unlike most classical visual servoing schemes which only use a proportional-type feedback control law, both IBDVS and the proposed MIBDVS exploit the idea of CTC. 15,30,31 In general, the CTC law τ ctc can be expressed as
where
Suppose that the system identification results are perfect; that is,
Letting τ in equation (38) be equal to τ ctc described by equation (39) will yield
Since the inertia matrix M(q) is a nonsingular square matrix, multiplying the inverse matrix of M(q) on both sides of equation (40) will lead to
One interesting observation is that the CTC method can yield satisfactory performance if the dynamic model obtained through system identification is accurate. However, if the identified dynamic model is not accurate, then the CTC method may result in poor control performance. Figure 4 shows a typical block diagram of CTC.

Typical block diagram of CTC. CTC: computed torque control.
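To make the structure of the CTC law concrete, the following sketch implements it for a 2-DOF planar arm. The point-mass dynamic model and all numerical parameters are illustrative assumptions rather than the identified model of the experimental robot, and the friction vector is omitted for brevity:

```python
import numpy as np

def M_mat(q, m1=1.0, m2=1.0, l1=0.3, l2=0.25):
    """Inertia matrix of a 2-DOF planar arm with point-mass links
    (masses/lengths are assumed values, not identification results)."""
    c2 = np.cos(q[1])
    m11 = (m1 + m2) * l1**2 + m2 * l2**2 + 2.0 * m2 * l1 * l2 * c2
    m12 = m2 * l2**2 + m2 * l1 * l2 * c2
    return np.array([[m11, m12],
                     [m12, m2 * l2**2]])

def C_mat(q, dq, m2=1.0, l1=0.3, l2=0.25):
    """Coriolis/centrifugal matrix for the same model."""
    h = -m2 * l1 * l2 * np.sin(q[1])
    return np.array([[h * dq[1], h * (dq[0] + dq[1])],
                     [-h * dq[0], 0.0]])

def ctc_torque(q, dq, q_d, dq_d, ddq_d, Kp, Kv):
    """Computed torque control: cancel the modeled dynamics and impose
    PD error dynamics on the joint position/velocity errors."""
    v = ddq_d + Kv @ (dq_d - dq) + Kp @ (q_d - q)   # equivalent acceleration
    return M_mat(q) @ v + C_mat(q, dq) @ dq          # tau = M v + C dq

if __name__ == "__main__":
    q = np.array([0.2, 0.5]); dq = np.array([0.1, -0.1])
    q_d = np.array([0.3, 0.4]); dq_d = np.zeros(2); ddq_d = np.zeros(2)
    Kp = 100.0 * np.eye(2); Kv = 20.0 * np.eye(2)
    tau = ctc_torque(q, dq, q_d, dq_d, ddq_d, Kp, Kv)
    # With a perfect model, the plant acceleration equals the commanded v
    ddq = np.linalg.solve(M_mat(q), tau - C_mat(q, dq) @ dq)
    print(ddq)
```

With a perfect model, substituting the torque back into the plant dynamics yields linear second-order tracking-error dynamics shaped by Kp and Kv, which mirrors the property derived above; with an inaccurate identified model, this cancellation is only partial.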
IBDVS and the proposed MIBDVS
Figure 5 illustrates the control block diagram of IBDVS. The IBDVS incorporates a depth and velocity estimator, a second-order visual loop controller, and a robot control loop that uses the position feedback provided by the encoder. The IBDVS architecture is similar to the classical IBVS architecture. Both the IBDVS architecture and the classical IBVS architecture use the image feature command for the visual loop. The difference is that in the IBDVS architecture, the velocity loop of the robot control architecture adopts the CTC scheme rather than the conventional feedback controller. However, as shown in Figure 5, the IBDVS architecture only provides the desired joint acceleration command

Control block diagram of IBDVS. IBDVS: image-based dynamic visual servoing.

Control block diagram of the proposed MIBDVS. MIBDVS: modified image-based dynamic visual servoing.
In Figure 6, the depth and velocity estimator estimates the parameter values essential in the calculation of interaction matrix

Derivation of position command used in the CTC scheme. CTC: computed torque control.
Controller design of MIBDVS
The controller design of the MIBDVS architecture in Figure 6 will be explicated in the following. The task function E is defined by equation (42), where
Suppose that the goal is to converge image feature error to behave as a second-order system. As a result, one will have equation (43), where
Substituting equation (42) into equation (43) will yield
The velocity command
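Although the intervening equations are not reproduced here, the second-order error dynamics that equation (43) imposes on the task function presumably take the form below (a reconstruction; Kp and Kd denote positive-definite gain matrices):

```latex
\ddot{\mathbf{E}} + K_d \dot{\mathbf{E}} + K_p \mathbf{E} = \mathbf{0}
```

Choosing Kp and Kd appropriately makes the image feature error converge like a well-damped second-order system, from which the desired joint acceleration command for the CTC scheme can be extracted.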
Image feature command generation and interpolation
In the experiment, the image feature command is generated through the so-called teach by showing method. During the “teach by showing” stage, the user holds and moves a fiducial marker to the goal position while the camera records the entire moving trajectory of the fiducial marker. In the “execution” stage, the recorded moving trajectory is adopted as the image feature command for the visual servoing scheme and the selective compliance assembly robot arm (SCARA) robot is controlled to repeat (i.e. move along) the recorded moving trajectory. Note that in this article, the recorded moving trajectory is represented by a Pythagorean-hodograph (PH) curve.34,35
Experimental setup and results
Figure 8 shows the experimental system, which consists of a 2-DOF SCARA robot (shown in Figure 9), two eye-to-hand cameras (mounted on the ceiling, as shown in Figure 10), a personal computer, and an intelligent motion control platform-2 card by the Industrial Technology Research Institute, Zhudong Township. Note that the two eye-to-hand cameras are used in the hand–eye calibration process36 (for later use in the joint velocity estimation experiment). When performing visual servoing, only the left eye-to-hand camera (denoted as “L” in Figure 10) is used. The two joints of the planar robot are actuated by two AC servomotors, and the motor drives are set to torque mode throughout the experiments. In particular, the “L” eye-to-hand camera, which is equipped with a lens of 16 mm focal length, has a maximum resolution of 1280 × 1024 pixels and a 60 Hz frame rate. In addition, the distance (measured by a ruler) between the “L” eye-to-hand camera and the 2-DOF SCARA robot is around 135 cm.

Experimental system.

2-DOF SCARA robot. DOF: degree of freedom; SCARA: selective compliance assembly robot arm.

Eye-to-hand camera mounted on the ceiling.
Experimental results of Kalman filter-based joint velocity estimation
In this experiment, the SCARA robot is controlled to perform a contour following motion. Three different approaches—the depth and velocity estimator without incorporating the Kalman filter, the proposed Kalman filter-based depth and velocity estimator, and the least-squares fit (LSF) method37—are used to estimate the joint velocity of the robot. In particular, the LSF method uses the encoder data of the servomotor installed at each joint to estimate the joint velocity, whereas the other two approaches use only the image information obtained by the camera. Since the resolution of the encoder data is much higher than that of the image data provided by the camera, the estimation accuracy of the LSF method is expected to be better than that of the other two approaches. Therefore, the estimation results of the LSF method are used as a reference to assess the estimation accuracy of both the proposed Kalman filter-based depth and velocity estimator and the estimator without the Kalman filter. Note that in this experiment, the object feature point is on the tip of the second link (i.e. the end-effector). Both image-based estimators can estimate the velocity of the end-effector in the camera frame using image information only. By exploiting the results of hand–eye calibration and the inverse robot Jacobian, one can convert the velocity of the end-effector in the camera frame into the joint velocity of the robot.
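For reference, the LSF idea used as the encoder-based baseline can be sketched as below: the joint velocity is taken as the slope of a least-squares line fitted to a window of equally spaced encoder position samples. The window length and sampling rate are assumptions, not the values used in the experiment:

```python
import numpy as np

def lsf_velocity(positions, dt):
    """Estimate velocity as the slope of the least-squares line fitted to a
    window of equally spaced position samples (sampling time dt)."""
    positions = np.asarray(positions, dtype=float)
    t = np.arange(len(positions)) * dt
    slope, _intercept = np.polyfit(t, positions, 1)  # best-fit line p = a*t + b
    return slope

if __name__ == "__main__":
    dt = 0.001                                   # 1 kHz encoder sampling (assumed)
    samples = [0.0, 0.002, 0.004, 0.006, 0.008]  # ramp at 2 rad/s
    print(lsf_velocity(samples, dt))
```

Compared with one-step finite differencing, fitting over a window suppresses encoder quantization noise at the cost of a small estimation delay.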
According to the joint velocity estimation results shown in Figures 11 and 12, the estimation performance of the proposed Kalman filter-based depth and velocity estimator is clearly better than that of the depth and velocity estimator without incorporating the Kalman filter.

Velocity estimation result of the first joint. (a) depth and velocity estimator without incorporating the Kalman filter and (b) proposed Kalman filter-based depth and velocity estimator.

Velocity estimation result of the second joint: (a) depth and velocity estimator without incorporating the Kalman filter and (b) proposed Kalman filter-based depth and velocity estimator.
Experimental results of Kalman filter-based depth estimation
In this experiment, the SCARA robot is controlled to perform a contour following motion. Two different approaches—the proposed Kalman filter-based depth and velocity estimator and the depth and velocity estimator without incorporating the Kalman filter—are tested. Note that in this experiment, the depth of the object feature point is estimated using image information only. In addition, the ground truth of the object depth, measured by a ruler, is around 135 cm. Results of the depth estimation experiment are shown in Figure 13. Clearly, the proposed Kalman filter-based depth and velocity estimator exhibits better depth estimation accuracy than the estimator without the Kalman filter.

Depth estimation result: (a) depth and velocity estimator without incorporating the Kalman filter and (b) proposed Kalman filter-based depth and velocity estimator.
Comparison of tracking performance between IBVS and MIBDVS
In this experiment, the SCARA robot is controlled to perform a contour following motion. Both the classical IBVS and the proposed MIBDVS are tested. Figure 14 shows the desired contour. Figure 15 shows the image command after interpolation, whereas Figure 16 shows the image velocity command. Tracking results on the image plane are shown in Figure 17, whereas Figure 18 shows the tracking errors of the image features. In addition, the performance comparison between the classical IBVS and the proposed MIBDVS is summarized in Table 1, where “RMS” represents the root-mean-square value and “MAX” is the maximum value. Based on Table 1, clearly, both the RMS values and the MAX values of tracking error on the u-axis and v-axis for the case of the proposed MIBDVS are smaller than those for the case of the classical IBVS. In addition to tracking error, contour error—an important indicator of contour following accuracy—is also compared. Again, both the RMS values and the MAX values of contour error for the case of the proposed MIBDVS are smaller than those for the case of the classical IBVS. Experimental results indicate that the proposed MIBDVS structure outperforms the classical IBVS structure in both tracking performance and contour following accuracy.

Desired contour; red line: the recorded moving trajectory of the fiducial marker during the “teach by showing” stage and blue line: the desired contour, which is a PH curve used to represent (i.e. fit) the recorded moving trajectory.

Image command after interpolation.

Image velocity command.

Tracking results on the image plane: (a) IBVS and (b) MIBDVS. IBVS: image-based visual servoing; MIBDVS: modified image-based dynamic visual servoing.

Tracking error of image feature: (a) IBVS and (b) MIBDVS. IBVS: image-based visual servoing; MIBDVS: modified image-based dynamic visual servoing.
Performance comparison between IBVS and MIBDVS.
IBVS: image-based visual servoing; MIBDVS: modified image-based dynamic visual servoing; RMS: root-mean-square value; MAX: maximum value.
Conclusions
This article exploits the concepts of virtual visual servoing and the Kalman filter to develop a method for estimating the depth values essential in calculating the image Jacobian matrix used in IBVS architectures. In particular, the Kalman filter is employed to cope with image noise so as to improve the accuracy of depth estimation. In addition, the proposed Kalman filter-based approach is also employed to estimate the joint velocity of the robot using image information only. Moreover, to achieve better visual servoing performance, this article proposes the MIBDVS architecture, which exploits the desired joint angle command, the desired joint velocity command, and the desired joint acceleration command in the implementation of the CTC scheme. Several experiments conducted on a 2-DOF planar manipulator are used to evaluate the performance of the proposed Kalman filter-based depth and velocity estimator and the proposed MIBDVS architecture. Experimental results indicate that the two proposed approaches outperform their counterparts based on the classical IBVS architecture.
In this article, the inertia matrix, Coriolis matrix, and friction vector, which are essential in the implementation of the CTC scheme, are obtained through system identification. However, the accuracy of the identification results of these matrices/vectors greatly affects the effectiveness of the CTC scheme as well as that of the proposed MIBDVS architecture. Improving identification accuracy is one possible future direction. In addition, the sampling rate for the inner servo control loop is often more than 10 times that for the outer vision loop. This results in a major challenge for the control design of MIBDVS. How to ease this difficulty so as to facilitate the control design of MIBDVS is another possible research direction.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project is supported by the Ministry of Science and Technology, Taiwan, under MOST 105-2221-E-006-105-MY3.
