Abstract
This paper proposes an uncalibrated adaptive visual servoing (VS) control framework based on dual-camera fusion to address key technical challenges in robotic visual servoing systems, including real-time state estimation, multi-space coordination, and dynamic target tracking. By combining the complementary advantages of “eye-in-hand” and “eye-to-hand” camera configurations, an adaptive switching mechanism is designed to achieve coordinated control between image space and Cartesian space, addressing convergence problems of traditional methods during target occlusion or field-of-view loss. Key features of the framework include: an uncalibrated control method based on the image Jacobian matrix; adaptive parameter adjustment based on Kalman filtering (KF); and a dual-camera fusion switching strategy. Experiments show the method achieves a positioning accuracy of 1.197 mm and an orientation accuracy of 0.149° in a representative static positioning task; demonstrates effective performance in scenarios involving out-of-view target acquisition and occlusion recovery; and reduces tracking errors by 13%–28% while shortening convergence time by 5%–32% in dynamic tracking tasks. This framework provides a practical technical approach for visual servoing systems in complex environments, showing potential for broad industrial applications.
Introduction
Visual servoing (VS) has emerged as a fundamental technology in automation and intelligent manufacturing, offering precise positioning and tracking capabilities for complex robotic tasks. VS systems guide robot motion through image feature feedback, directing feature points toward desired positions. 1 The non-contact measurement and real-time feedback characteristics of VS have enabled its application in industrial assembly, agricultural harvesting, and dynamic target tracking. However, with Industry 4.0’s demands for enhanced system adaptability, VS technology continues to face challenges in dynamic environment adaptation, real-time state estimation, target occlusion handling, and multi-space coordination. These limitations restrict its deployment in unstructured environments and necessitate more sophisticated control strategies.
Traditional VS technology relies on two primary configuration paradigms. The “eye-in-hand” configuration acquires local image features by mounting cameras on robot end-effectors. Li et al. 2 developed a hybrid solution combining PBVS with binocular cameras and monocular uncalibrated visual servoing for underwater environments, addressing system uncertainties at the cost of a more complex switching mechanism. Ahmad et al. 3 employed deep learning for watermelon flower size and orientation estimation, achieving positioning errors of 1.028 cm and improving agricultural automation efficiency, though adaptability to lighting variations and occlusion remains limited. Drummond and Cipolla 4 investigated Lie algebras for affine transformations, constructing a framework for handling continuous disturbances and providing theoretical foundations, though its practical applicability to complex three-dimensional targets requires further validation. Chang 5 proposed combining “look-then-move” and “visual tracking” approaches for smartphone assembly, addressing field-of-view limitations, although the efficiency of long-distance initial positioning leaves room for improvement. Zhang et al. 6 designed “rotational perspective moment” features to improve multi-rotor aircraft control precision, Cui et al. 7 applied the “eye-in-hand” configuration to flexible refueling boom vibration control, and Li et al. 8 implemented a cherry tomato harvesting system using RGB-D cameras, achieving a 96.25% success rate. Zheng et al. 9 and Yang et al. 10 applied the “eye-in-hand” configuration to quadrotor positioning and fixed-wing UAV tracking tasks, respectively. However, this configuration remains susceptible to failure during target occlusion or rapid movement, exposing its operational limitations.
In contrast, the “eye-to-hand” configuration provides a global perspective through fixed cameras. He et al. 11 developed a 2.5D visual servoing method for textureless part grasping, though with limited support for fine local operations; such fixed-camera methods often exhibit reduced control precision because they lack local feature feedback. Rastegarpanah et al. 12 optimized hybrid visual servoing trajectory efficiency through 3D feature estimation, while Rastegarpanah et al. 13 introduced adaptive gain mechanisms to enhance system performance, though multi-camera collaborative data fusion requires further development. Both configurations possess distinct characteristics, yet a single perspective struggles to balance global and local requirements, providing the motivation for this paper’s dual-camera fusion architecture.
Recent developments in VS technology have addressed dynamic target tracking. Gao et al. 14 developed neural network adaptive controllers for handling underwater uncertainty, though their dependence on training data affects real-time performance. Li et al. 15,16 proposed uncalibrated tracking methods for dynamic feature points and fruits, respectively, handling unknown motion parameters while lacking effective recovery mechanisms for out-of-view targets. Hao et al. 17 combined DETR and BiLSTM for trajectory prediction in object detection, introducing Kolmogorov–Arnold Networks (KANs) to improve model efficiency and reduce complexity. Wang et al. 18 enhanced moving-target prediction precision by combining hand-eye vision with the end-effector pose, though stability during velocity variations requires improvement. While these studies advanced dynamic adaptability, continuous tracking capability for out-of-view targets remains insufficient, motivating the global camera guidance approach introduced in this paper.
In industrial applications, VS technology has demonstrated progress. Liu et al. 19 integrated vision with laser sensors, reducing rivet positioning time to 1–5 s for high-efficiency requirements. Chen et al. 20 developed collision-free IBVS path planning achieving 100% collision avoidance through systematic design, while Li et al. 21 enhanced depth estimation through RGB-D technology, improving real-time performance and accuracy. However, these methods often require precise calibration or specific hardware, with limited flexibility under environmental variations, establishing the context for this paper’s uncalibrated strategy.
Machine learning has advanced adaptive VS methods. Gu et al. 1 and Shi et al. 22 implemented adaptive servo gain adjustment through Q-learning, improving convergence speed, though target loss scenarios remain inadequately addressed. Zhang et al. 23 developed RARLC controllers that reduced errors within three cycles, demonstrating rapid learning capabilities. Hernandez-Barragan et al. 24 employed damped least squares to mitigate singularities, enhancing redundant robot manipulability, while Zhong et al. 25 and Chang et al. 26 combined Kalman filtering to address uncalibrated control and depth estimation challenges, respectively. Xie et al. 27 developed data-driven IBVS for 7-DOF manipulators, though its dependence on extensive data affects real-time performance. These methods enhanced adaptability but remain insufficient for extreme occlusion and target loss scenarios.
To address single configuration limitations, hybrid VS methods have gained attention. Liu and Dong 28 applied polar coordinate RMPC to IBVS, optimizing 6-DOF manipulator trajectories; Yang et al. 29 proposed three-stage motion planning for flexible joint manipulator applications; Brown et al. 30 reduced PBVS errors through ALS methods. Rotithor et al. 31 improved insertion task control through DMP and IBVS switching. Chaber et al. 32 simplified MPC models to reduce computational requirements, Chen et al. 33 combined fuzzy neural networks to address SMC chattering issues, while Li et al. 34 optimized image moment features for ultra-redundant manipulator tracking. Although these methods performed effectively in specific scenarios, global and local information fusion remains limited, particularly regarding out-of-view target handling and multi-camera collaboration. To our knowledge, no research has fully exploited the complementary characteristics of “eye-in-hand” and “eye-to-hand” cameras to design VS systems that balance global monitoring and local precise control.
To address these challenges, this paper proposes a dual-camera fusion control architecture that enhances system environmental adaptability by integrating the complementary advantages of “eye-in-hand” and “eye-to-hand” configurations. Compared to single-perspective approaches,20,34 this architecture mitigates target loss risks in occlusion or limited field-of-view scenarios through dynamic coordination of multimodal visual information, leveraging global camera guidance for control restoration and addressing limitations in dynamic tracking methods.15,17 We develop a Kalman filter-based adaptive regulation mechanism that optimizes dynamic tracking stability through global velocity estimation, advancing beyond traditional local error adjustments.1,22 Furthermore, through motion driving strategies based directly on image features, this architecture circumvents traditional calibration process complexity,5,21 optimizing system initialization and enhancing operational flexibility. Experimental results demonstrate that the proposed method achieves improved performance in static positioning precision and dynamic tracking compared to traditional methods, providing a systematic solution for VS technology in unstructured environments.
The contributions of this paper are reflected in the following three aspects:
A hybrid visual servoing architecture that synergistically fuses two control schemes: Position-Based Visual Servoing (PBVS) for long-range guidance using a global camera, and Image-Based Visual Servoing (IBVS) for high-precision alignment using a local “eye-in-hand” camera. This dual-mode framework addresses complex scenarios including target occlusion and out-of-view acquisition.
An adaptive control mechanism featuring global-local synergy that utilizes a Kalman filter with global camera observations to estimate target velocity. This velocity information proactively tunes local controller gains, enhancing performance in dynamic tracking tasks.
A hybrid switching strategy that combines global guidance with local alignment. This strategy, based on error thresholds and hysteresis mechanisms, ensures smooth and reliable transitions between PBVS and IBVS modes while providing fault tolerance for local vision failure.
Dual-camera fusion visual servoing system
To address the precision and reliability demands of industrial target localization and tracking, this paper presents a hybrid visual servoing control framework based on “eye-in-hand” and “eye-to-hand” dual-camera configurations. The overall control flow, depicted in Figure 1, fuses the long-range guidance of Position-Based Visual Servoing (PBVS) with the close-range precision of Image-Based Visual Servoing (IBVS).

Hybrid visual servoing framework with dual-camera integration.
The physical implementation of this framework, shown in Figure 2, integrates two complementary camera configurations: a global camera fixed above the workspace for long-range guidance and occlusion handling, and a local camera mounted on the end-effector for high-precision final alignment. To transform raw camera images into control-relevant inputs, the system employs a feature extraction and processing pipeline detailed in Figure 3. This pipeline consists of two parallel channels: the RGB channel locates feature points using a gradient-based corner detection algorithm with sub-pixel refinement, while the depth channel extracts corresponding depth values. The criterion for corner detection is given by:
For features captured by the global camera, the target pose in the robot base frame,
where
These spatial pose relationships form the mathematical foundation for the PBVS control component.
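To make the pipeline of Figure 3 concrete, the following sketch illustrates one possible implementation of its two parallel channels using OpenCV: Shi–Tomasi gradient-based corner detection with sub-pixel refinement on the RGB channel, and a per-corner depth lookup on the aligned depth channel. The function choices, parameter values, and depth scale are illustrative assumptions rather than the exact criterion used in this work.

```python
import cv2
import numpy as np

def extract_features(rgb, depth, max_corners=12, depth_scale=0.001):
    """Illustrative two-channel feature extraction (assumed implementation).

    rgb   : HxWx3 color image from the camera
    depth : HxW aligned depth image in raw sensor units (e.g. millimetres)
    Returns sub-pixel corner coordinates and their metric depths.
    """
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)

    # RGB channel: gradient-based corner detection (Shi-Tomasi response)
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=10)

    # Sub-pixel refinement around each detected corner
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.01)
    corners = cv2.cornerSubPix(gray, np.float32(corners), (5, 5), (-1, -1), criteria)
    corners = corners.reshape(-1, 2)

    # Depth channel: read the depth value at each refined corner location
    depths = np.array([depth[int(round(v)), int(round(u))] * depth_scale
                       for u, v in corners])
    return corners, depths
```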

Dual-camera configuration simulation.

Feature extraction and processing pipeline.
This integration of a dual-camera configuration and a dedicated information processing pipeline establishes a complete hybrid control framework. The subsequent sections will detail its uncalibrated control method, state estimation, and adaptive optimization strategy.
Uncalibrated visual servoing control method
The core objective of the proposed control method is to minimize the image feature error vector
where
Here,
For the complete system of
In this expression,
Based on this kinematic relationship, a classical Image-Based Visual Servoing (IBVS) control law is formulated to compute the camera velocity command
where
where
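Because the control-law equations are not reproduced here, the following NumPy sketch illustrates the standard form of such an IBVS law: the pixel errors of the feature points are stacked, the corresponding per-point image Jacobians form a composite interaction matrix, and the camera velocity command follows from a damped pseudo-inverse. The gain and damping values are assumptions, and the interaction matrix is built from nominal intrinsics and measured depths purely for illustration; in the uncalibrated setting of this paper the Jacobian is estimated online rather than derived from calibration parameters.

```python
import numpy as np

def interaction_matrix(u, v, Z, fx, fy):
    """Standard 2x6 image Jacobian of a point feature in pixel coordinates.

    (u, v) are pixel coordinates relative to the principal point, Z is the
    feature depth, and fx/fy are focal lengths in pixels (nominal values).
    """
    x, y = u / fx, v / fy                       # normalized image coordinates
    return np.array([
        [-fx / Z, 0.0,      fx * x / Z, fx * x * y,       -fx * (1 + x * x), fx * y],
        [0.0,     -fy / Z,  fy * y / Z, fy * (1 + y * y), -fy * x * y,       -fy * x],
    ])

def ibvs_command(features, desired, depths, fx, fy, lam=0.5, damping=1e-3):
    """Camera twist command v = -lam * J^+ e for N stacked point features."""
    e = (features - desired).reshape(-1)        # stacked 2N pixel error vector
    J = np.vstack([interaction_matrix(u, v, Z, fx, fy)
                   for (u, v), Z in zip(features, depths)])
    # Damped least-squares pseudo-inverse for robustness near singularities
    J_pinv = np.linalg.inv(J.T @ J + damping * np.eye(6)) @ J.T
    return -lam * J_pinv @ e                    # [vx vy vz wx wy wz]
```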
This IBVS controller is integrated into a hybrid switching strategy with a PBVS controller to leverage the strengths of both configurations. To ensure stable transitions and explicitly prevent mode chattering—a key concern in switching control—a hysteresis mechanism is implemented:
The thresholds,
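As an illustration of this hysteresis logic, the sketch below switches from PBVS to IBVS only once the image error falls below an entry threshold while the local features are visible, and switches back only after the error exceeds a larger exit threshold or the features are lost. The threshold values are placeholders, not the ones tuned in the experiments.

```python
class HybridModeSwitcher:
    """Hysteresis-based PBVS/IBVS mode selection (illustrative thresholds)."""

    def __init__(self, enter_ibvs_px=80.0, exit_ibvs_px=150.0):
        # exit threshold > entry threshold creates the hysteresis band
        self.enter_ibvs_px = enter_ibvs_px
        self.exit_ibvs_px = exit_ibvs_px
        self.mode = "PBVS"

    def update(self, image_error_px, local_features_visible):
        if self.mode == "PBVS":
            # Switch to local image-based control only when the target is
            # visible to the local camera and the error is small enough.
            if local_features_visible and image_error_px < self.enter_ibvs_px:
                self.mode = "IBVS"
        else:  # currently IBVS
            # Fall back to global guidance on occlusion or large error; the
            # wider exit threshold prevents mode chattering.
            if (not local_features_visible) or image_error_px > self.exit_ibvs_px:
                self.mode = "PBVS"
        return self.mode
```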
Furthermore, to enhance the smoothness of the robot’s motion, particularly during mode transitions, a rate-limiting module is implemented at the low-level control interface. This module independently constrains each component of the 6D command acceleration vector, which is composed of a linear acceleration
where
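A minimal sketch of such a rate limiter is shown below: each of the six components of the velocity command is allowed to change by at most a_max·dt per control cycle, which bounds the commanded acceleration independently per axis. The acceleration limits are illustrative; the period corresponds to the 25 Hz servoing rate used in this work.

```python
import numpy as np

def rate_limit(v_cmd, v_prev, dt=0.04, a_max_lin=0.5, a_max_ang=1.0):
    """Per-component acceleration limiting of a 6D velocity command.

    v_cmd, v_prev : commanded and previously issued twists [vx vy vz wx wy wz]
    dt            : control period in seconds (0.04 s at 25 Hz)
    a_max_lin     : max linear acceleration  (m/s^2, illustrative)
    a_max_ang     : max angular acceleration (rad/s^2, illustrative)
    """
    max_step = np.array([a_max_lin] * 3 + [a_max_ang] * 3) * dt
    dv = np.clip(np.asarray(v_cmd) - np.asarray(v_prev), -max_step, max_step)
    return np.asarray(v_prev) + dv
```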
State estimation and adaptive optimization
To enhance the system’s performance in dynamic environments, this study proposes a state-aware adaptive PID control strategy. The core of this strategy is its ability to dynamically optimize control gains based on the target’s real-time motion characteristics. To acquire these crucial dynamic states—specifically the target’s 3D position and velocity—a classic Kalman filter is employed as the state estimator. This section will first detail the implementation of the state estimator and then elaborate on the novel adaptive tuning law that relies on its output.
Target state estimation using a Kalman filter
A Kalman filter based on a six-dimensional state space is employed to simultaneously estimate the target’s position and velocity. The state vector
where
where
The process noise covariance matrix
where
This corresponds to the position measurement noise from the ZED camera. The filter is implemented using standard recursive prediction and update steps, with the measurement matrix
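A minimal sketch of the constant-velocity Kalman filter described above is given below, assuming the state ordering [x, y, z, vx, vy, vz], a transition matrix that integrates position with velocity over one servoing period, and a position-only measurement model. The noise magnitudes q and r are placeholders for the tuned process and ZED measurement covariances, which are not reproduced here.

```python
import numpy as np

class TargetKalmanFilter:
    """6-state constant-velocity filter for target position and velocity."""

    def __init__(self, dt=0.04, q=1e-3, r=1e-4):
        self.x = np.zeros(6)                                 # [x y z vx vy vz]
        self.P = np.eye(6)
        self.F = np.eye(6)                                   # constant-velocity model
        self.F[:3, 3:] = dt * np.eye(3)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])    # position-only measurement
        self.Q = q * np.eye(6)                               # process noise (placeholder)
        self.R = r * np.eye(3)                               # ZED position noise (placeholder)

    def step(self, z):
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the measured 3D target position z
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3], self.x[3:]                        # position, velocity
```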
To enhance practical performance, a position differencing method aids velocity estimation. Here, the 3D position measurement vector at the previous time step,
Furthermore, an anomaly detection mechanism resets the filter if the estimated velocity magnitude exceeds 5 m/s, improving robustness.
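The two practical safeguards described above can be sketched as follows, building on the TargetKalmanFilter listing given earlier: a finite-difference velocity computed from consecutive position measurements is blended with the filter estimate, and the filter is reset whenever the estimated speed exceeds the 5 m/s plausibility bound. The blending weight is an assumption.

```python
import numpy as np

def refine_velocity(kf, z, z_prev, dt=0.04, blend=0.5, v_max=5.0):
    """Differencing-aided velocity estimate with anomaly reset (illustrative)."""
    kf.step(z)

    # Blend the filter velocity with a finite-difference estimate from two
    # consecutive position measurements (blend weight is an assumption)
    v_diff = (z - z_prev) / dt
    vel = blend * kf.x[3:] + (1.0 - blend) * v_diff

    # Anomaly detection: reset the filter on implausible speeds (> 5 m/s)
    if np.linalg.norm(vel) > v_max:
        kf.x = np.concatenate([z, np.zeros(3)])
        kf.P = np.eye(6)
        vel = np.zeros(3)
    return kf.x[:3].copy(), vel
```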
The standard constant-velocity model adopted in this study is both controllable and observable; this theoretically guarantees convergence of the filter’s estimation error and establishes a foundation for long-term stable estimation.
Adaptive PID parameter tuning
The proposed adaptive tuning strategy directly leverages the target’s velocity vector
A nonlinear gain tuning law is introduced for the eight dual-axis PID controllers. The design of this law is formulated as:
where the base gain
The lower bound prevents the control gain from being nullified, while the upper bound acts as a safety barrier against excessive gains.
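Since the tuning law itself is not reproduced above, the sketch below shows one plausible nonlinear form consistent with its description: each base gain is scaled with the estimated target speed and then saturated between a lower and an upper bound, as discussed. The scaling coefficient and bound ratios are assumptions.

```python
import numpy as np

def adapt_gain(k_base, target_speed, alpha=2.0, k_min_ratio=0.5, k_max_ratio=3.0):
    """Velocity-dependent gain scaling with saturation (illustrative form).

    k_base       : nominal gain of one PID controller axis
    target_speed : |v| of the target estimated by the Kalman filter (m/s)
    alpha        : sensitivity of the gain to target speed (assumed)
    """
    k = k_base * (1.0 + alpha * target_speed)   # raise gain for faster targets
    # Saturation: lower bound keeps the gain from being nullified,
    # upper bound acts as a safety barrier against excessive gains.
    return float(np.clip(k, k_min_ratio * k_base, k_max_ratio * k_base))
```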
To theoretically prove the reliability of this adaptive controller, its stability is analyzed using Lyapunov’s second method. A Lyapunov candidate function
Here,
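Although the candidate function and its derivative are not reproduced here, the standard form of such a Lyapunov argument for an IBVS feature-error system, which we assume the analysis follows, is:

$$
V(\mathbf{e}) = \tfrac{1}{2}\,\mathbf{e}^{\top}\mathbf{e} > 0, \qquad
\dot{V} = \mathbf{e}^{\top}\dot{\mathbf{e}} = -\lambda\,\mathbf{e}^{\top}\mathbf{J}\,\hat{\mathbf{J}}^{+}\mathbf{e} < 0
\quad \text{whenever} \quad \mathbf{J}\,\hat{\mathbf{J}}^{+} \succ 0,
$$

where $\mathbf{J}$ denotes the true image Jacobian and $\hat{\mathbf{J}}^{+}$ the pseudo-inverse of its estimate. Positive definiteness of $\mathbf{J}\,\hat{\mathbf{J}}^{+}$ is the classical sufficient condition for local asymptotic convergence of the feature error, and the saturation bounds on the adaptive gains keep the effective gain strictly positive and bounded, consistent with this condition.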
Experimental results and analysis
Hardware platform configuration
The experimental platform for this study consists of a 7-degree-of-freedom Franka Emika Panda collaborative robot and a dual-camera system, as shown in Figure 4. The vision system employs a hybrid “eye-in-hand” and “eye-to-hand” configuration: an Intel RealSense D415 camera is mounted on the end-effector as a local camera, and a ZED 2 stereo camera is fixed above the workspace as a global camera. A conveyor belt system is integrated to simulate dynamic industrial scenarios. The entire system utilizes a dual-host architecture for vision processing and robot control, communicating via Ethernet at a servoing frequency of 25 Hz.

Schematic diagram of the experimental platform.
Performance evaluation metrics
To quantitatively evaluate the proposed method’s performance, a suite of core metrics covering both image and Cartesian space was established. Experiments were conducted using a 3 × 4 checkerboard as the target, with its feature points and the end-effector pose providing the respective error benchmarks for the image and Cartesian spaces.
In the image space, the feature point error (
In Cartesian space, the end-effector pose is evaluated from two dimensions: accuracy and stability. Pose accuracy is quantified by positional and orientation errors: the positional error (
while the orientation error (
On the other hand, the system’s steady-state stability is measured by the positional standard deviation (
Furthermore, dynamic performance is assessed by the convergence time and steady-state tracking error.
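For reproducibility, the sketch below shows how these metrics can be computed from logged data: the RMS pixel error over the tracked feature points, the Euclidean positional error and rotation-angle orientation error of the end-effector, the steady-state positional standard deviation, and the convergence time against an error threshold. The variable conventions, window length, and threshold are illustrative rather than the exact definitions used in the experiments.

```python
import numpy as np

def feature_error_px(features, desired):
    """RMS pixel distance between current and desired feature points."""
    return float(np.sqrt(np.mean(np.sum((features - desired) ** 2, axis=1))))

def positional_error_mm(p, p_des):
    """Euclidean end-effector position error in millimetres."""
    return float(np.linalg.norm(p - p_des) * 1000.0)

def orientation_error_deg(R, R_des):
    """Rotation angle of R_des^T R, i.e. the residual orientation error."""
    cos_theta = np.clip((np.trace(R_des.T @ R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def steady_state_std_mm(positions, last_n=100):
    """Positional standard deviation over the final steady-state samples."""
    p = np.asarray(positions[-last_n:])
    return float(np.mean(np.std(p, axis=0)) * 1000.0)

def convergence_time_ms(errors_px, dt=0.04, threshold_px=25.0):
    """First time after which the pixel error stays below the threshold."""
    below = np.asarray(errors_px) < threshold_px
    for i in range(len(below)):
        if below[i:].all():
            return i * dt * 1000.0
    return float("nan")
```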
Static target positioning accuracy validation
To rigorously evaluate the static positioning performance of the proposed dual-camera fusion hybrid control method, this section analyzes a representative positioning task. In this task, the robot end-effector is required to move from an initial standby region to a precise target pose within the forward workspace. This process simulates a common industrial workflow and is designed to test the method’s accuracy and stability throughout the entire motion. For performance comparison, the proposed method was tested against a conventional Position-Based Visual Servoing (PBVS) method under identical task conditions.
The complete process and results of this positioning task are presented in Figures 5 and 6. In the 3D spatial trajectory plot (Figure 5), the trajectory corresponding to the proposed method exhibits a smooth path and converges precisely to the desired target point. In contrast, the trajectory of the conventional PBVS method shows a clear spatial deviation from the target at the end of the task. This performance difference is further confirmed by the positional error convergence comparison plot (Figure 6), which shows that the positioning error (

3D trajectories for static positioning: proposed method versus conventional PBVS.

Positional error (
The final quantitative metrics of the experiment confirm the aforementioned visual observations. The proposed method achieves a final positional error of 1.197 mm and an orientation error of 0.149°, demonstrating higher positioning accuracy compared to the 37.85 mm and 2.163° achieved by the PBVS method. Regarding steady-state stability, the positional standard deviation of the proposed method is 0.322 mm, which is comparable to the 0.309 mm of the PBVS method. This indicates that good stability is maintained while accuracy is improved.
The observed performance difference stems from the novel control framework proposed in this study. The core advantage of this framework lies in the combination of dual-camera synergy and adaptive control. In the final positioning phase, the system switches to image-based closed-loop control. This approach not only reduces the reliance on precise hand-eye calibration, but its adaptive gain strategy also optimizes the error convergence process, jointly ensuring the final positioning accuracy. In contrast, the performance bottleneck of conventional PBVS lies in its complete reliance on 3D reconstruction. Its accuracy is inevitably limited by factors such as camera depth uncertainty and eye-to-hand calibration error, and its fixed linear control law is ill-suited for the final fine-tuning task required to achieve millimeter-level precision.
Sensitivity analysis to depth measurement noise
To investigate the system’s performance under sensor inaccuracies, this section evaluates the impact of depth measurement noise on both positioning accuracy (
The results are presented in Figure 7. The plots demonstrate a non-linear relationship between noise level and performance degradation. Notably, the system exhibits a degree of tolerance to minor noise: as the standard deviation of the depth noise (

Analysis of the impact of depth measurement noise on positioning accuracy (
Experiments in challenging scenarios
To evaluate the performance of the proposed framework under non-ideal conditions, this section presents experiments in two typical challenging scenarios: out-of-view target acquisition and temporary local camera occlusion. These experiments are designed to validate the system’s stability and recovery capabilities when subjected to such disturbances.
Out-of-view target acquisition
This experiment validates the system’s capability to acquire a target initially located outside the local camera’s field of view. The test involves the robot moving from a fixed initial pose to four different target points within the workspace. During this process, the system first uses the global camera for long-range guidance (PBVS mode) and then automatically switches to the local camera (IBVS mode) to perform precise positioning once the target enters its view. As illustrated in Figure 8, the trajectories for all test cases converged successfully, with final positioning errors remaining at the millimeter level, consistent with the static positioning results. These outcomes demonstrate that the hybrid switching strategy enables reliable mode transitions. The transition from PBVS to IBVS is smooth, with no significant switching-induced jitter apparent in the trajectories, reflecting the strategy’s stability and its effectiveness in handling large initial deviations by combining the advantages of both methods.

Cartesian end-effector trajectories for four target positions.
Occlusion recovery experiment
To test the system’s recovery capability upon temporary loss of local visual information, this experiment applied a temporary occlusion to the local camera as the system approached a steady state. During the occlusion, the system automatically switches to PBVS mode to maintain pose stability using information from the global camera, thus preventing task interruption. After the occlusion is removed, it switches back to IBVS mode to resume precise positioning. The experiment was conducted under two conditions, with the target at the center and the edge of the field of view.
As shown in Figure 9, the system successfully recovers and converges quickly in both conditions. The final image error stabilizes at ∼0.5 pixels, with the 3D positional error in the 2–3 mm range. This successful recovery is attributed to the PBVS control during the occlusion phase, which prevents excessive end-effector drift and keeps the initial error for recovery within a threshold acceptable for IBVS. The entire recovery process exhibits no significant velocity spikes or oscillations, demonstrating that the framework’s switching mechanism provides an effective fault-tolerance and recovery strategy, which is important for enhancing system reliability in complex environments.

End-effector 6-DOF velocity and feature error convergence post-occlusion: (a) center of field of view; (b) edge of field of view.
Dynamic scene experiments
This section evaluates the dynamic tracking performance of the proposed framework under various target poses and velocities. To examine the effectiveness of the adaptive control strategy, the proposed method is compared against a non-adaptive baseline. This baseline, referred to as “Ours w/o Comp.” in the results, utilizes the same dual-camera switching framework as the proposed method but employs a fixed-gain controller (
Tracking performance across different poses
To investigate the system’s tracking capability for dynamic targets across diverse spatial poses, experiments were conducted with a conveyor belt speed fixed at 2.3 cm/s under four pose conditions: translation, rotation, tilt, and a combination of rotation and tilt. The proposed method integrates a Kalman filter with an adaptive PID controller. Performance was assessed using steady-state tracking error (
Dynamic tracking performance comparison across pose variations.
The results in Table 1 show that the proposed method achieved improved performance across all conditions. Compared to the baseline, in the translation scenario, the steady-state error decreased to 21.636 pixels and convergence time shortened to 3400 ms, representing reductions of 15% and 32%, respectively. For the rotation case, the error was 21.627 pixels, lower than the baseline’s 27.188 pixels. Under the tilt condition, the error was 21.423 pixels, compared to the baseline’s 24.611 pixels. In the complex rotation-plus-tilt scenario, the proposed method’s error of 20.865 pixels also reflected an improvement over the baseline’s 25.252 pixels.
This performance improvement stems from the dynamic compensation mechanism. The Kalman filter estimates target velocity, and the adaptive PID gain tuning allows the controller to better match the target’s dynamics, thereby enhancing tracking performance. Figure 10 illustrates this process for each of the four conditions. It can be observed that the error stabilizes quickly even in the complex rotation-plus-tilt scenario, demonstrating the effectiveness of the adaptive strategy. Although the baseline method exhibited a slightly lower variance in some non-rotational conditions (e.g. 0.074 vs 0.131 pixel² in translation), the proposed method showed a lower variance in the rotation case (0.164 vs 0.515 pixel²). Overall, the adaptive approach achieves a favorable balance between tracking accuracy, convergence speed, and stability.

Dynamic tracking performance comparison at 2.3 cm/s conveyor speed. The figure includes four conditions: (a) translation, (b) rotation, (c) tilt, and (d) rotation combined with tilt. Each condition comprises three subfigures from top to bottom: (i) tracking process photographs, (ii) end-effector pose diagrams, and (iii) error variation comparison curves.
Tracking performance across different speeds
This subsection further examines tracking performance across conveyor speeds from 2 to 8 cm/s, covering the low-to-high velocity range. To exclude interference from pose variations, all experiments used a uniform, basic horizontal pose to ensure consistency. Table 2 details the comparative results.
Dynamic tracking performance comparison across speed variations.
The results show that while the tracking error for both methods increases with speed, the proposed method maintains a lower error across all tested velocities. This performance difference becomes more apparent at higher speeds; for instance, at 8 cm/s, the error for the proposed method was 71.33 pixels, a 28% reduction compared to the baseline’s 99.45 pixels. This is attributed to the adaptive gain mechanism’s response to varying speeds: at low speeds, the gains adjust moderately to maintain stability, while at higher speeds, the controller increases the gains to boost dynamic response. The bar charts in Figure 11 visualize these performance metrics. These results confirm the adaptive strategy’s ability to balance error suppression with stability across different dynamic conditions.

Dynamic tracking performance comparison across different speeds.
Tracking adaptability under speed step disturbances
This section assesses the system’s tracking adaptability under abrupt speed changes. To emphasize the effects of velocity transients, experiments were conducted with a horizontal target pose and two step disturbances introduced at approximately the 10th servoing cycle: (1) an acceleration from 2 to 8 cm/s and (2) a deceleration from 5 to 0 cm/s.
Figure 12 illustrates the resulting error variations. In the acceleration case (a), the proposed method demonstrates a faster response post-disturbance, with its steady-state error converging to a mean of 78.324 pixels, compared to the baseline’s slower stabilization at 102.920 pixels. In the deceleration case (b), where the target comes to a complete stop, the proposed method reaches the convergence threshold (

Tracking error comparison under speed disturbances: (a) speed increases from 2 to 8 cm/s; (b) speed decreases from 5 to 0 cm/s.
Conclusion
This paper proposed an uncalibrated adaptive visual servoing framework based on dual-camera fusion. By leveraging the complementary strengths of “eye-in-hand” and “eye-to-hand” configurations through a hybrid switching strategy, the proposed method addresses challenges such as target occlusion and limited fields of view, while avoiding the need for complex hand-eye calibration. A key feature is the adaptive control strategy that fuses global velocity estimates with local image error for dynamic gain compensation, which was shown to improve the tracking of moving targets. Experimental results demonstrated the framework’s capabilities, showing millimeter-level accuracy in static positioning and reduced tracking errors in dynamic scenarios compared to a non-adaptive baseline.
However, this work also has several limitations that open avenues for future research. First, the current system relies on structured features; future work will integrate deep learning-based techniques for feature extraction to enhance adaptability in unstructured scenes. Second, the sensitivity analysis revealed that positioning accuracy degrades under significant depth measurement noise. To mitigate this, future research could explore the fusion of additional sensor modalities or develop control laws that are inherently less sensitive to depth parameters.
Footnotes
Handling Editor: Hang Su
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
