Real-time object tracking system based on field-programmable gate array and convolution neural network

Abstract

Vision-based object tracking has many applications in robotics, such as surveillance, navigation, and motion capture. However, existing object tracking systems still suffer from the high computational cost of their image processing algorithms. This cost prevents current systems from being used in robotic applications with limited payload and power, for example, micro air vehicles. In these applications, computers based on a central processing unit or graphics processing unit are poor choices because of their weight and power consumption. To address the problem, this article proposes a real-time object tracking system based on a field-programmable gate array, a convolution neural network, and visual servo technology. The time-consuming image processing algorithms, namely distortion correction, color space conversion, Sobel edge detection, Harris corner detection, and the convolution neural network, were redesigned using the programmable gates of the field-programmable gate array. Based on this field-programmable gate array-based image processing, an image-based visual servo controller was designed to drive a two degree of freedom manipulator to track the target in real time. Finally, experiments on the proposed system were performed to illustrate its effectiveness.

Keywords

Visual tracking FPGA convolution neural network visual servoing robot vision

Introduction

Object tracking is a fundamental topic in robotics and computer vision, with applications in surveillance, navigation, motion capture, and so on. Generally, visual object tracking involves processing images to extract color, edge, and other feature information, which is time-consuming. For example, the Sobel edge detector and Harris corner detector have been shown to be robust to illumination changes and random noise, but they still rely on software processing on personal computers because of their high computational cost. Personal computers, however, are heavy and power-hungry, which prevents object tracking from being applied in real-time applications with limited computation resources. Therefore, a solution for real-time image processing on lightweight, low-power hardware is in high demand.

Recently, more and more image processing algorithms have been implemented on graphics processing unit (GPU)- or field-programmable gate array (FPGA)-based platforms because of their parallel processing ability. Acharya et al. implemented the scale-invariant feature transform (SIFT) feature detector on a GPU and achieved more than 55 frames per second (fps) for a video graphics array (VGA) resolution image.¹ Heymann et al. reported an efficient GPU-based SIFT feature detector.² Lu et al.³ used parallel Hough transforms to detect straight-line features on an FPGA. Hernandez-Lopez et al.⁴ proposed an FPGA-based image interest point detection architecture, which implemented SUSAN and Harris corner detection on a flexible FPGA device; its feature extraction rate can reach 161 fps at 30 megapixel resolution. Isakova et al.⁵ implemented a stereo vision algorithm on an FPGA and solved the stereo calibration and stereo matching problems using only the FPGA device, which demonstrated the FPGA’s capacity for image data computation. Shimizu and Hirai⁶ combined a complementary metal–oxide–semiconductor (CMOS) sensor and an FPGA to build a real-time, robust target tracking system, using four double data rate (DDR) RAM to store the high frame rate (up to 1000 fps) images. Possa et al.⁷ proposed edge and corner detectors with low memory requirements on FPGA, and summarized the advantages of FPGA-based image processing by comparing the efficiency and power consumption of GPU- and FPGA-based architectures.

Chang et al.⁸ proposed a high-speed Harris corner detector on an FPGA device, with high efficiency and a high frame rate (540 fps). Kryjak et al.⁹ realized a high-resolution image processing platform based on FPGA; their background generation and target detection algorithms were efficient and robust. Anderson et al.¹⁰ proposed an FPGA-based vision system for an autonomous mobile robot, including object detection, tracking, and path planning. Dillinger et al.¹¹ used an FPGA- and digital signal processor (DSP)-based image processing system to detect falling objects. The Zernike moments method is useful for object recognition in binary images, but moment computation on a PC is too slow for real-time applications. Liu et al.¹² proposed an FPGA-based Zernike moments calculation method to detect the target in laser images in real time. Santos and Ferreira¹³ proposed an FPGA- and fuzzy logic-based position tracking system and showed that a high-level logic controller can be embedded into an FPGA device. Okumura et al.¹⁴ proposed a real-time image mosaicing system with a high frame rate (up to 500 fps) on an FPGA device. Tippetts et al.¹⁵ proposed an FPGA-based vision system, applied it to a small unmanned vehicle, and demonstrated real-time target tracking. These approaches show that FPGAs can run complicated image processing algorithms and high-level logic calculations. Compared to PC-based image processing with its long time delay, pipeline-based image processing on FPGA takes less time. The small size and low weight of FPGA-based image processing systems are also valuable for embedded vision applications. However, the widely used image processing algorithms were designed for CPUs, and implementing them on an FPGA device is challenging and time-consuming work.

Unlike the FPGA/DSP or FPGA/Advanced RISC Machine (ARM) hybrid systems,¹¹,¹⁶ the proposed system places the FPGA and the ARM core together inside a single ZYNQ device. Between the two processing cores, an Advanced Extensible Interface (AXI) bus transfers data with high bandwidth and at high frequency. Using this AXI bus, we can implement both the image processing algorithm and the robot control algorithm on a single chip.

The main contributions of this article are twofold. First, a new FPGA-based vision system is proposed for object tracking, in which several critical real-time image processing modules, namely image undistortion, color space conversion, edge and corner detection, and the convolution neural network (CNN), are implemented using the programmable gates of the FPGA. Second, a visual servoing-based controller is designed to drive a two degree of freedom (DOF) manipulator to track a fast-moving target, and the whole visual servoing scheme is implemented on a ZYNQ system. It is worth noting that all the image processing and visual servoing are implemented on the same chip, which makes the system lightweight and low in power consumption.

The rest of this article is organized as follows: In section “Image processing on FPGA,” the image processing modules including the CNN module implemented on FPGA device are introduced, followed by the image-based visual servo controller in section “Visual servoing-based object tracking.” In section “Experiments,” experiments are demonstrated to illustrate the system performance. Finally, conclusions and discussions are given in section “Conclusion.”

Image processing on FPGA

The proposed real-time object tracking system is mainly composed of a Xilinx ZYNQ-7000 system-on-a-chip (SOC) core board, a 2-DOF manipulator, and a digital CMOS sensor with a 120 degree field of view. Figure 1 shows the software framework of the proposed on-chip object tracking system. As shown in Figure 1, sequential images are transferred from the CMOS module to the frame buffer in the FPGA core and then processed in parallel by the image processing modules in the FPGA. The information extracted by these modules is used as the input to the CNN. The classification result of the CNN is used by the visual servo controller to realize real-time object tracking; the visual servo control algorithm is implemented in the ARM core of the same SOC. Using the AXI bus, the feedback information can be transferred to the ARM core without any external electrical connection. In this section, the design details of the image processing modules are introduced, including image undistortion, the color space convertor, the Sobel edge detector, and the Harris corner detector.

Figure 1.

Structure of the ZYNQ-based tracking system.

Image undistortion

Due to the lens or CMOS sensor imperfection, images captured are always distorted and generally need a correction procedure before being forwarded to further processing. The distortion can be classified into three types, that is, the radial distortion, tangential distortion, and optical center shift. Generally, considering radial distortion only can satisfy most practical applications, and therefore the well-known distortion model given by Zhang¹⁷ is applied, that is

\left\{ \begin{matrix} u_{d} = (1 + k_{1} r^{2} + k_{2} r^{4}) u_{n} \\ v_{d} = (1 + k_{1} r^{2} + k_{2} r^{4}) v_{n} \end{matrix} \right.

where $r^{2} = x_{c}^{2} + y_{c}^{2}$ and $(x_{c}, y_{c})$ denotes the undistorted pixel location. k₁ and k₂ are the radial distortion parameters. (u_n, v_n) is the normalized (pinhole) image projection. (u_d, v_d) is the distorted normalized image projection. The transformation matrix between the distorted and undistorted image coordinates can be obtained as follows

\left\{ \begin{array}{l} u_{c} = u_{d} f_{x} + α f_{x} v_{d} + c_{x} \\ v_{c} = v_{d} f_{y} + c_{y} \end{array} \right. , \quad (u_{d}, v_{d}) \Leftrightarrow (u_{c}, v_{c})

where (f_x, f_y) denotes the focal length of the camera, and α represents the angle between the x and y CMOS axes. (c_x, c_y) denotes the image’s center shift. These parameters can be calibrated offline through the MATLAB calibration toolbox¹⁸ and are then stored as constants in the FPGA SOC. The distorted and undistorted images do not correspond to each other pixel to pixel due to the nonlinear distortion transformation. In the article, the bilinear interpolation method is used to calculate the pixel value of the undistorted image from the neighboring pixels, as follows

f (u_{c}, v_{c}) = [\begin{matrix} 1 - x & x \end{matrix}] [\begin{matrix} f_{a} & f_{b} \\ f_{c} & f_{d} \end{matrix}] [\begin{matrix} 1 - y \\ y \end{matrix}]

where $f (u_{c}, v_{c})$ represents the illumination value of the undistorted pixel and (x, y) is the corrected location calculated by the undistortion module. $f_{a}, f_{b}, f_{c}, f_{d}$ represent the illumination values of the surrounding pixels. Figure 2 illustrates the designed undistortion module on FPGA. The color value of the original image $f_{o} (x_{o}, y_{o})$ is transferred to the DDR memory for buffering. Meanwhile, the pixel coordinates of the original image (u_o, v_o) are transferred to the rectification module. The floating-point multiplication in equations (2) and (3) is implemented by the DSP module in the FPGA. After the calculation in the rectification module, the output (x_r, y_r) is the rectified location corresponding to (x_o, y_o). In this way, the module effectively builds a rectification lookup table.
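The undistortion steps above can be sketched in software as follows. This is a minimal Python illustration, not the FPGA design itself: the function names, the use of normalized coordinates for r², the neighbor labeling of f_a to f_d, and the zero default for the skew α are all assumptions made for the example.

```python
import numpy as np

def distorted_pixel(u, v, fx, fy, cx, cy, k1, k2, alpha=0.0):
    """Map an undistorted pixel (u, v) to its location in the distorted frame."""
    # Normalized pinhole coordinates (skew is ignored here, an assumption).
    u_n = (u - cx) / fx
    v_n = (v - cy) / fy
    r2 = u_n * u_n + v_n * v_n
    scale = 1.0 + k1 * r2 + k2 * r2 * r2      # two-parameter radial model
    u_d, v_d = scale * u_n, scale * v_n
    # Back to pixel coordinates with the intrinsic parameters.
    return u_d * fx + alpha * fx * v_d + cx, v_d * fy + cy

def bilinear(img, x_pos, y_pos):
    """Interpolate img at a fractional column x_pos and row y_pos."""
    h, w = img.shape
    x0 = int(np.clip(np.floor(x_pos), 0, w - 1))
    y0 = int(np.clip(np.floor(y_pos), 0, h - 1))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    x = float(np.clip(x_pos - x0, 0.0, 1.0))   # fractional offsets
    y = float(np.clip(y_pos - y0, 0.0, 1.0))
    # Assumed neighbor labeling: rows of the 2x2 matrix follow x, columns follow y.
    f = np.array([[img[y0, x0], img[y1, x0]],
                  [img[y0, x1], img[y1, x1]]], dtype=float)
    return np.array([1.0 - x, x]) @ f @ np.array([1.0 - y, y])

def build_lookup_table(h, w, **calib):
    """Precompute, for every output pixel, where to sample the distorted image."""
    lut = np.zeros((h, w, 2))
    for v in range(h):
        for u in range(w):
            lut[v, u] = distorted_pixel(u, v, **calib)
    return lut

def undistort(img, lut):
    """Resample the distorted image through the lookup table."""
    h, w = img.shape
    out = np.zeros((h, w))
    for v in range(h):
        for u in range(w):
            out[v, u] = bilinear(img, lut[v, u, 0], lut[v, u, 1])
    return out
```

Because the calibration parameters are constants, the lookup table only needs to be built once; each frame then costs one table read and one four-pixel interpolation per output pixel, which is the property that makes the scheme attractive for a pipelined hardware implementation.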

Figure 2.

(a) Undistortion module structure and (b) logic schematic of undistortion module.

Figure 3 illustrates an undistortion example. The original frame and the undistorted frames by our system and MATLAB are shown in Figure 3(a) to (c), respectively. The edge of the chessboard undistorted by MATLAB was smoother than that of our system. The reason is that the float multiplier of our designed system has two decimal places of accuracy for better real-time performance and lower computation cost. Fortunately, the lower accuracy will not affect the final performance according to our experimental results.

Figure 3.

Calibration result: (a) original image, (b) corrected image by FPGA, and (c) corrected image by MATLAB toolbox. FPGA: field-programmable gate array.

Color space conversion

After undistortion, the image is ready for the further processing algorithms, that is, the color convertor, the edge detector, and the corner detector. It is worth noting that a switching architecture designed in the FPGA selects which of these algorithms is applied. The hue, saturation, and value (HSV) color space¹⁹ has been shown to be more robust than red, green, blue (RGB) in color tracking applications. Here, the RGB-to-HSV convertor in the FPGA is introduced in detail. HSV is a nonlinear transform of the RGB color space. The conversion from RGB to HSV is given as

H = \left\{ \begin{array}{ll} 0, & R = G = B \\ \frac{60^{\circ} (G - B)}{\max (R, G, B) - \min (R, G, B)} + 360^{\circ}, & R > B > G \\ \frac{60^{\circ} (B - R)}{\max (R, G, B) - \min (R, G, B)} + 120^{\circ}, & \max (G, B, R) = G \\ \frac{60^{\circ} (R - G)}{\max (R, G, B) - \min (R, G, B)} + 240^{\circ}, & \max (G, B, R) = B \end{array} \right.

S = \left\{ \begin{array}{ll} 0, & \max (R, G, B) = 0 \\ \frac{\max (R, G, B) - \min (R, G, B)}{\max (R, G, B)}, & \text{otherwise} \end{array} \right.

V = \max (R, G, B)

According to equations (4) to (6), the logic of the HSV color space convertor is designed as shown in Figure 4. The module takes the RGB channels of each pixel as input and outputs the corresponding HSV channels. The differences between the R, G, and B values are calculated by subtractors in the FPGA, and the comparisons among the three color values are implemented by comparators. With these two processes, the R, G, B values are converted to the H, S, V color space. Because the pixels of the whole image can be processed in parallel by separate color convertor modules, the computation is much faster than in PC-based systems.
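As a software reference for the convertor logic, the per-pixel conversion can be sketched in Python as below. The function name is hypothetical, and the case where R is the maximum is handled with the standard modulo-360 form (an assumption), which gives the same result as adding 360° when G < B and also covers G ≥ B.

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit R, G, B to (H in degrees, S in [0, 1], V in [0, 255])."""
    c_max = max(r, g, b)            # comparator stage
    c_min = min(r, g, b)
    delta = c_max - c_min           # subtractor stage
    if delta == 0:
        h = 0.0                     # grey pixel: hue is undefined, set to 0
    elif c_max == r:
        h = (60.0 * (g - b) / delta) % 360.0
    elif c_max == g:
        h = 60.0 * (b - r) / delta + 120.0
    else:
        h = 60.0 * (r - g) / delta + 240.0
    s = 0.0 if c_max == 0 else delta / c_max
    return h, s, c_max
```

Only comparisons, subtractions, one division, and a constant scale are needed per pixel, which matches the comparator and subtractor structure described above.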

Figure 4.

The logic of color convertor.

Features extraction for edge and corner

For many vision applications, such as object tracking, visual navigation, and stereo vision, edge and corner features must be extracted from images. In this article, we propose FPGA logic designs for the widely used Sobel edge detector and Harris corner detector. Because these features are simple but very effective for object tracking, the Sobel edges and Harris corners extracted by the designed FPGA logic modules are used as the feedback information for visual servo control.

Edge detector

The edge feature can represent the contour of a target. Here, the Sobel operator¹⁴ algorithm was selected to collect the edge information. The Sobel edge detection is divided into two steps:

Calculate the horizontal and vertical Sobel gradients G_x and G_y by convolving the image with the gradient kernels in the x and y directions, respectively, given as

G_{y} = ω_{y} \otimes [\begin{matrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{matrix}] and G_{x} = ω_{x} \otimes [\begin{matrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{matrix}]

where ω_x and ω_y denote the gradient kernels on x and y directions, respectively, that is

ω_{y} = [\begin{matrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ - 1 & - 2 & - 1 \end{matrix}] and ω_{x} = [\begin{matrix} - 1 & 0 & 1 \\ - 2 & 0 & 2 \\ - 1 & 0 & 1 \end{matrix}]

Calculate the arithmetic square root G_s of the values G_x and G_y as follows

\begin{array}{l} G_{x} (p_{22}) = - p_{11} - 2 p_{21} - p_{31} + p_{13} + 2 p_{23} + p_{33} \\ G_{y} (p_{22}) = p_{11} - p_{31} + 2 p_{12} - 2 p_{32} + p_{13} - p_{33} \\ G_{s} (p_{22}) = \sqrt{G_{x}^{2} + G_{y}^{2}} > T \end{array}

where G_x and G_y denote the gradients in the x and y directions, respectively. With the predefined threshold T, the binary image of the Sobel edge detector can be obtained. Figure 5 illustrates the FPGA logic of the Sobel edge detection. In the FPGA, the calculation in equation (9) consists of comparators, subtractors, and bit shifters. The predefined threshold T is applied to the gradient result to remove edges with small gradients. Figure 6 illustrates a detection example of the designed Sobel module. Most edges in the result image are sharp and continuous.
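The two Sobel steps can be illustrated with a short Python sketch. It is a software reference under stated assumptions rather than the FPGA logic: the function name is hypothetical, border pixels are simply left at zero, and the squared magnitude is compared with T² so that no square root is required (equivalent to the comparison in equation (9) for T ≥ 0).

```python
import numpy as np

OMEGA_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
OMEGA_Y = np.array([[ 1,  2,  1],
                    [ 0,  0,  0],
                    [-1, -2, -1]])

def sobel_edges(img, threshold):
    """Return a binary edge map: 1 where the gradient magnitude exceeds threshold."""
    img = img.astype(np.int32)               # avoid uint8 overflow
    h, w = img.shape
    edges = np.zeros((h, w), dtype=np.uint8)
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = img[r - 1:r + 2, c - 1:c + 2]   # p11 .. p33
            gx = int(np.sum(OMEGA_X * window))
            gy = int(np.sum(OMEGA_Y * window))
            # Compare squared values: same decision as sqrt(gx^2 + gy^2) > T.
            if gx * gx + gy * gy > threshold * threshold:
                edges[r, c] = 1
    return edges
```

Since every kernel weight is 0, ±1, or ±2, the multiplications reduce to sign changes and one-bit shifts, which is consistent with the subtractor and bit-shifter structure mentioned above.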

Figure 5.

Schematic of the Sobel edge and Harris corner detector.

Figure 6.

(a) Original image and (b) Sobel edge image.

Harris corner detector

Based on the operation of Sobel detector, the Harris corner detection²⁰ is further implemented. To detect Harris corners, the Harris matrix is first calculated as follows

H (x, y) = [\begin{matrix} h_{x} (x, y) & h_{x y} (x, y) \\ h_{x y} (x, y) & h_{y} (x, y) \end{matrix}] = ω \otimes [\begin{matrix} G_{x}^{2} (x, y) & G_{x} G_{y} (x, y) \\ G_{x} G_{y} (x, y) & G_{y}^{2} (x, y) \end{matrix}]

where ω represents a 3 × 3 Gaussian convolution kernel and G_x and G_y are the vertical or horizontal Sobel operators calculated in the previous section. Then, the Harris corner evaluation value V is calculated as

V = \det (H) - k * {trace}^{2} (H) > T_{Harris}

where k is a parameter in the range of 0.04–0.06, det(H) denotes the determinant of matrix H, and trace(H) denotes the trace of matrix H. The determinant and trace are calculated as

\begin{array}{l} \det (H) = h_{x} h_{y} - h_{x y}^{2} \\ trace (H) = h_{x} + h_{y} \end{array}

If the corner value V is larger than a predefined threshold T_Harris, a Harris corner exists at that pixel location. To further remove noisy corners, non-maximum suppression is used, which compares the corner value of the pixel with those of its eight surrounding pixels. If the value is not the largest among them, the corner is removed. Figure 5 illustrates the FPGA logic of the designed Harris corner detector, containing six submodules as follows:

Calculate the vertical and horizontal Sobel operator G_x and G_y. This convolution process is implemented on FPGA device by multiplying the pixel value with the corresponding value in the gradient kernels on the x and y directions.

Calculate the values of $G_{x}^{2}$ , G_xG_y, and $G_{y}^{2}$ using the multiplier units on FPGA.

Store the gradient values $G_{x}^{2}$ , G_xG_y, and $G_{y}^{2}$ into a 3 × 3 line buffer module, and calculate the Harris corner matrix values $h_{x} (x, y)$ , $h_{x y} (x, y)$ , and $h_{y} (x, y)$ using three convolution units with $G_{x}^{2}$ , G_xG_y, and $G_{y}^{2}$ , and the Gaussian kernel ω.

Calculate the determinant det(H) and the trace trace(H) of the Harris corner matrix by using the summator and multiplier and then calculate the Harris corner evaluation value V.

Compare the Harris corner response value V with the threshold value T_Harris. If the value is smaller than the threshold, set the pixel value to zero.

Finally, compare the Harris corner response value V with the values of the eight surrounding pixels. If the value of this pixel is smaller than any of the surrounding values, treat it as a noise pixel and set its value to zero.
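The six submodules can be mirrored in a compact Python sketch. This is an illustrative software reference, not the hardware design: the function names, the particular 3 × 3 Gaussian weights, the default threshold, and the zero-valued borders are assumptions made for the example.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)
GAUSS = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0  # assumed ω

def filter3x3(img, kernel):
    """Slide a 3x3 kernel over img; border pixels are left at zero."""
    h, w = img.shape
    out = np.zeros((h, w))
    for dr in range(3):
        for dc in range(3):
            out[1:h - 1, 1:w - 1] += kernel[dr, dc] * img[dr:h - 2 + dr, dc:w - 2 + dc]
    return out

def harris_corners(img, k=0.04, threshold=1e6):
    """Return a list of (row, col) corner locations."""
    img = img.astype(float)
    gx = filter3x3(img, SOBEL_X)                    # 1. Sobel gradients
    gy = filter3x3(img, SOBEL_Y)
    gxx, gxy, gyy = gx * gx, gx * gy, gy * gy       # 2. gradient products
    hx = filter3x3(gxx, GAUSS)                      # 3. Gaussian smoothing
    hxy = filter3x3(gxy, GAUSS)
    hy = filter3x3(gyy, GAUSS)
    det = hx * hy - hxy * hxy                       # 4. response value V
    trace = hx + hy
    v = det - k * trace * trace
    v[v <= threshold] = 0.0                         # 5. threshold
    corners = []
    h, w = v.shape
    for r in range(1, h - 1):                       # 6. non-maximum suppression
        for c in range(1, w - 1):
            if v[r, c] > 0 and v[r, c] >= v[r - 1:r + 2, c - 1:c + 2].max():
                corners.append((r, c))
    return corners
```

On a synthetic image containing a single bright rectangle corner, this sketch responds only near that corner, because straight edges give a rank-one gradient matrix with a zero determinant and therefore a negative response.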

Figure 7 shows an example of Harris corner detection with the proposed FPGA-based module. The number of detected Harris corners can be controlled by setting different thresholds. Figure 7(a) and (c) are the original images of two indoor environments, and Figure 7(b) and (d) are the detection results of the Harris detector. Most of the strong corners were detected, and the accuracy and quantity are good enough for most applications.

Figure 7.

(a) and (c) The original images captured in two indoor environments; (b) and (d) the results of Harris corner extraction.

Hardware-based CNN implementation

The CNN is regarded as a state-of-the-art image classification and recognition algorithm,²¹ and the feasibility of a hardware-based implementation has been demonstrated by Coric et al.²² In this section, we introduce the implementation of the CNN algorithm for object detection. The structure of the CNN is shown in Figure 8. The implementation covers both the forward and backward parts: the backward part is implemented on the ARM core, and the forward part, constructed from convolution and pooling layers, is implemented on the FPGA. The implementation of the forward part of the CNN on the FPGA device contains three steps:

Step 1. Construct the discrete convolution layer

C^{n} (i, j) = f_{c} (\sum_{d \in (i, j)} A_{d}^{n} u^{n} (i, j) + τ^{n})

where $C^{n} (i, j)$ denotes the convolution result at (i, j) location, $u^{n} (i, j)$ is the pixel value of the input image, $A_{d}^{n}$ represents a convolution parameter, τⁿ is the bias parameter, and f_c denotes the nonlinear function. The convolution result is controlled by the pixel value and the surrounding pixels value. Figure 9(a) and (b) illustrates the window-based convolution schematic on the FPGA core. The original image is first processed by the line shift buffer module in Figure 9(a) to produce the pixel window, and then the feature map can be obtained by the convolution module in Figure 9(b).

Figure 8.

The CNN structure. The C represents the convolution layers, and S denotes the pooling layers. CNN: convolution neural network.

Figure 9.

FPGA-based CNN module. (a) The line buffer-based 3 × 3 pixel window generator and (b) the schematic of window-based convolution operator. FPGA: field-programmable gate array; CNN: convolution neural network.

Step 2. Construct the pooling layer

u^{n} (i, j) = f_{p} (C^{n - 1} (i, j) + τ^{n})

Step 3. In industrial practice, CNN training is usually performed off-line. In this article, the network was trained on a PC, and the trained parameters were stored on the ARM device. When the device powers up, the trained parameters are loaded into the FPGA-based CNN module via the AXI bus provided by the ZYNQ device to set the convolution parameters.
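The forward path of Steps 1 and 2 can be sketched in Python as follows. The sketch assumes a ReLU for the nonlinear function f_c and 2 × 2 max pooling for f_p, neither of which is specified in the text; all function names are hypothetical. The window generator keeps only three image rows at a time, imitating the line buffer of Figure 9(a) at row granularity.

```python
from collections import deque
import numpy as np

def window_stream(img):
    """Yield (row, col, 3x3 window) using a three-row line buffer."""
    lines = deque(maxlen=3)                 # holds the last three rows only
    for r, row in enumerate(img):
        lines.append(row)
        if len(lines) == 3:
            block = np.stack(list(lines))   # rows r-2, r-1, r
            for c in range(img.shape[1] - 2):
                # Window centred on pixel (r - 1, c + 1).
                yield r - 1, c + 1, block[:, c:c + 3]

def conv_layer(u, A, tau):
    """Step 1: C(i, j) = f_c(sum(A * window) + tau), with f_c assumed to be ReLU."""
    h, w = u.shape
    out = np.zeros((h - 2, w - 2))
    for row, col, window in window_stream(u):
        out[row - 1, col - 1] = max(0.0, float(np.sum(A * window)) + tau)
    return out

def pool_layer(c, tau=0.0):
    """Step 2: add the bias, then take the maximum of each 2x2 block (assumed f_p)."""
    h2, w2 = c.shape[0] // 2, c.shape[1] // 2
    blocks = c[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return (blocks + tau).max(axis=(1, 3))
```

In this sketch the kernel A and the biases are plain arguments, which corresponds to Step 3: the parameters are produced by off-line training and only loaded into the forward path at start-up.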

Visual servoing-based object tracking

As shown in Figure 10, a CMOS camera is mounted on a 2-DOF pan-tilt cradle head in our object tracking system to realize the automatic tracking of a fast-moving target. Visual servo control technology has attracted increasing attention in robotics due to its high efficiency and accuracy.²³⁻²⁶ In this section, a visual servo controller is presented to drive the 2-DOF manipulator to track the preselected target object. The image-based visual servo control method is applied in the article, due to its advantages.²⁷ The color blob or Harris corners extracted by the FPGA-based image processing modules are used to detect the target’s location. As shown in Figure 10(a), denoting $q = {(θ_{1}, θ_{2})}^{T}$ as the cradle head configuration in the space of pan and tilt angle θ₁ and θ₂, we can obtain the kinematics model of the cradle head as follows

B_{{\dot{X}}_{C}} = [\begin{matrix} \dot{x} \\ \dot{y} \\ \dot{z} \\ {\dot{θ}}_{1} \\ {\dot{θ}}_{2} \end{matrix}] = \underbrace{[\begin{matrix} \frac{L_{2} C_{2} C_{1}}{δ θ_{1}} & \frac{- L_{2} S_{2} C_{1}}{δ θ_{2}} \\ \frac{- L_{2} C_{2} S_{1}}{δ θ_{1}} & \frac{- L_{2} S_{2} C_{2}}{δ θ_{2}} \\ 0 & \frac{L_{2} C_{2}}{δ θ_{2}} \\ 1 & 0 \\ 0 & 1 \end{matrix}]}_{J_{robot} (θ)} [\begin{matrix} {\dot{θ}}_{1} \\ {\dot{θ}}_{2} \end{matrix}] = J (q) \dot{q}

where $B_{{\dot{X}}_{C}}$ denotes the end effector velocity with respect to the robot base and J(q) is the Jacobian matrix. The kinematics model of the pan-tilt head is given in equation (15), where (x, y, z) denotes the camera position with respect to the base frame. C_i and S_i represent cos(θ_i) and sin(θ_i), respectively, and L₁ and L₂ represent the lengths of the links. Equation (15) shows that the pan-tilt head can be treated as a 2-DOF robot, in which the two rotation angles of the camera can be controlled. Based on the pinhole camera model, the relationship between the feature points’ velocity in the camera frame and in the image frame is established and denoted by

[\begin{matrix} \dot{u} \\ \dot{v} \end{matrix}] = J_{image} [\begin{matrix} V \\ ω \end{matrix}] = [\begin{matrix} \underbrace{\begin{matrix} \frac{f}{z} & 0 & - \frac{u}{z} \\ 0 & \frac{f}{z} & - \frac{v}{z} \end{matrix}}_{J_{T}} & \underbrace{\begin{matrix} - \frac{u v}{f} & \frac{f^{2} + u^{2}}{f} & - v \\ - \frac{f^{2} + u^{2}}{f} & \frac{u v}{f} & u \end{matrix}}_{J_{w}} \end{matrix}] [\begin{matrix} \dot{x} \\ \dot{y} \\ \dot{z} \\ {\dot{θ}}_{1} \\ {\dot{θ}}_{2} \\ 0 \end{matrix}]

where V and ω denote the camera’s translational and angular velocities, respectively, with respect to the base frame, $(\dot{u}, \dot{v})$ denote the feature points’ velocity on the image frame, and f denotes the focal length of the camera. Substituting equation (15) into equation (16), we obtain equation (17) which denotes the relationship between the angular velocity of pan-tilt head and the change of the feature points on the image plane

\dot{θ} = [\begin{matrix} {\dot{θ}}_{1} \\ {\dot{θ}}_{2} \end{matrix}] = {[\begin{matrix} \frac{L_{2} C_{2} C_{1} f}{z} + \frac{- u v}{f} & \frac{- L_{2} S_{2} S_{1} f + u L_{2} C_{2}}{z} + \frac{f^{2} + u^{2}}{f} \\ \frac{- L_{2} C_{2} S_{1} f}{z} + \frac{- (f^{2} + u^{2})}{f} & \frac{- L_{2} S_{2} C_{1} f + L_{2} C_{2} v}{z} + \frac{u v}{f} \end{matrix}]}^{- 1} [\begin{matrix} \dot{u} \\ \dot{v} \end{matrix}] = J_{system}^{- 1} (z, f, u, v, θ) Δ ξ

where the relationship between the velocity ${(\dot{u}, \dot{v})}^{T}$ of the feature points and the angular velocity of the joints ${({\dot{θ}}_{1}, {\dot{θ}}_{2})}^{T}$ is established. $J_{system}^{- 1}$ represents the inverse of the system Jacobian matrix J_system. The system Jacobian matrix depends on the depth information z, which is hard to measure. In this article, the Jacobian matrix is therefore estimated with the Broyden updating²⁸ method as

J_{t + 1} = J_{t} + \frac{(Δ ξ_{t} - J_{t} Δ θ) Δ θ^{T}}{Δ θ^{T} Δ θ}

where J_t represents the previously estimated Jacobian matrix, and $Δ ξ_{t} = {(u (t) - u_{d}, v (t) - v_{d})}^{T}$ denotes the tracking error at the tth sampling instant. Based on equation (17) and the estimated Jacobian matrix in equation (18), we design a proportional derivative (PD) controller to control the movement of the pan-tilt head as follows

U (k) = - K_{D} \dot{θ} (k - 1) - K_{P} J^{- 1} (k - 1) Δ ξ (k - 1)

where U(k) is the control input of the pan-tilt system at the kth sampling time instant, K_D is a constant matrix of velocity gain, and K_P is an image gain matrix, $J_{t}^{- 1}$ is the inverse matrix of J_t.
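Equations (18) and (19) can be sketched in Python as below. This is a minimal illustration, not the controller running on the ARM core: the function names are hypothetical, the argument d_xi is taken to be the change in image features observed over the last step (the usual reading of the Broyden update, assumed here), and the guard against a near-zero joint step is an added safeguard that is not described in the text.

```python
import numpy as np

def broyden_update(J, d_xi, d_theta, eps=1e-9):
    """Rank-one update of the estimated Jacobian, as in equation (18)."""
    denom = float(d_theta @ d_theta)
    if denom < eps:
        return J                    # joints did not move: keep the old estimate
    return J + np.outer(d_xi - J @ d_theta, d_theta) / denom

def pd_control(J, theta_dot_prev, xi_err_prev, K_P, K_D):
    """Control input of equation (19): U = -K_D * theta_dot - K_P * J^-1 * error."""
    # Solve J x = error instead of forming the inverse explicitly.
    return -K_D @ theta_dot_prev - K_P @ np.linalg.solve(J, xi_err_prev)
```

After each update the new estimate reproduces the most recent observation exactly, that is, J_{t+1} Δθ equals the observed feature change, so the controller can operate without measuring the depth z.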

Figure 10.

(a) Kinematics model of the pan-tilt head and (b) our pan-tilt head tracking system.

Experiments

Two experiments were performed to demonstrate the effectiveness of the proposed object tracking system. The first illustrated the real-time CNN-based visual servo tracking of a ping-pong ball using an HSV color blob. The second illustrated the tracking of a small chessboard using Harris corners. Note that the whole object tracking system integrates all the algorithms on the same Xilinx ZYNQ-7000 SOC core board, and no extra computers are needed to implement the object tracking. For convenience, we use a VGA display screen, directly connected to the SOC board, to show the processed image sequences.

Color-based object tracking experiment

Figure 11 shows snapshots of the color-based tracking of a ping-pong ball. Images are captured from the CMOS camera sequentially and then transferred into the FPGA through an 8-bit digital parallel port realized in the FPGA device. The CNN parameters were pretrained and precached in the SOC. The visual servo controller parameters were selected based on prior experiments.

Figure 11.

HSV color space-based object tracking. (a) Position A; (b) position B; (c) position C; and (d) position D.

As shown in Figure 11, the target (ping-pong ball) was detected using the color space conversion module, and its position is labeled on the image with a green blob. The RGB image was converted to the HSV space, and the hue information was used to separate the target from the background. The tracked target was robustly kept at the center of the screen even when the target scale changed. The proposed tracking system can keep tracking the target at 500 fps (resolution: 640 × 480), which is the maximum frame rate of the CMOS image sensor; this means that the tracking rate could be even higher with a CMOS camera of higher frame rate.

Static object tracking experiment based on corner feature

In this experiment, the target to be tracked was fixed while our tracking system was moving. The experiment represents a range of practical applications, for example, target detection from a fast-moving air or ground vehicle. Here, the corner features were used to verify the effectiveness of the FPGA-based corner extraction module. Figure 12 illustrates a frame sequence of the static target tracking experiment. A chessboard target was fixed to the wall, and the 2-DOF pan-tilt head was required to keep the target at the image center. When the Harris corners were detected, the object location was recognized by matching the corner cluster and then sent to the ARM core in the ZYNQ. The visual servo controller then drove the robot to bring the target to the image center.

Figure 12.

Harris corner-based static object tracking.

Figure 12(d) and (h) shows the trajectory of the target, moving from the upper-left corner to the center. The experimental result demonstrates the robustness and accuracy of our vision system. The target’s location (indicated by black lines) automatically and quickly (within 2 s) converged to the image center (indicated by red lines). The steady-state error of the tracking (smaller than 2 pixels) depends not only on the tracking algorithm but also on the control accuracy of the servo motors.

Table 1 compares the computation time of the Harris and Sobel implementations on a PC (CPU: i7 4500u, memory: 8 GB DDR3) and on the FPGA. At every tested resolution, the FPGA-based implementation is roughly 13 to 24 times faster than the CPU; for example, at 640 × 480 the Harris detector takes 0.98 ms on the FPGA versus 17.5 ms on the CPU. The FPGA resource consumption is shown in Table 2, including the resources used by the color convertor, Sobel edge detector, and Harris corner extractor. About 65% of the block RAM is used for data buffering, about 56% of the logic cells are occupied, and 27% of the DSP modules are used for floating-point multiplication. Nearly half of the logic resources remain free, which leaves room to extend the proposed approach.

Table 1.

Implementation timing comparison.

Resolution	CPU (Harris; ms)	FPGA (Harris; ms)	CPU (Sobel; ms)	FPGA (Sobel; ms)
640 × 480	17.5	0.98	28.3	1.18
1280 × 480	68.2	4.3	124.7	5.3
1920 × 1080	272.3	13.6	477.9	21.1
1920 × 1920	613.5	46.7	924.1	40.8

FPGA: field-programmable gate array.
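The speedup implied by Table 1 can be recomputed from the reported timings with a few lines of Python. The dictionary simply transcribes the table; the variable and function names are hypothetical.

```python
# Timings from Table 1 in milliseconds: (CPU, FPGA) per algorithm and resolution.
TIMINGS_MS = {
    "640 x 480":   {"harris": (17.5, 0.98),  "sobel": (28.3, 1.18)},
    "1280 x 480":  {"harris": (68.2, 4.3),   "sobel": (124.7, 5.3)},
    "1920 x 1080": {"harris": (272.3, 13.6), "sobel": (477.9, 21.1)},
    "1920 x 1920": {"harris": (613.5, 46.7), "sobel": (924.1, 40.8)},
}

def speedups(table):
    """Return the CPU-to-FPGA time ratio for every table entry."""
    return {
        res: {alg: cpu / fpga for alg, (cpu, fpga) in row.items()}
        for res, row in table.items()
    }

if __name__ == "__main__":
    for res, row in speedups(TIMINGS_MS).items():
        print(res, {alg: round(ratio, 1) for alg, ratio in row.items()})
```

The ratios are derived only from the numbers in the table and say nothing about power consumption, which the text discusses separately.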

Table 2.

Implementation resource consumption of FPGA.

Resource	Used	Total	Percentage (%)
Logic cells	48k	85k	56
Block memory	3.2 MB	4.9 MB	65
DSP modules	60	220	27
LUTs	31,000	53,200	58
DMA channels	2	4	50
Registers	32,000	106,400	30
AXI ports	1	4	25

FPGA: field-programmable gate array; DSP: digital signal processor; LUT: lookup table.

Conclusion

This article presents a real-time object tracking system based on FPGA and CNN. The image processing algorithms, including image undistortion, color space conversion, Sobel edge detection, Harris corner detection, and the CNN, were redesigned and implemented with the programmable gates in the FPGA core of the ZYNQ SOC. Further, the visual servo controller was designed and implemented in the ARM core of the ZYNQ SOC, driving a 2-DOF pan-tilt cradle head to realize the object tracking. The image processing, CNN, and visual servo control are implemented in the same ZYNQ SOC without any external electrical connection. Finally, experiments were performed to illustrate the effectiveness of the proposed tracking system. The proposed real-time visual tracking system can be applied to robotic applications, such as victim detection and tracking in rescue tasks, and to other real-time tracking applications. Our future work includes improving the robustness of the object recognition and extending the system to mobile robot navigation and object tracking.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The article was supported in part by Hong Kong RGC via grant 14204814, Shenzhen Peacock Plan Team grant (KQTD20140630150243062), Shenzhen Fundamental Research grant (JCYJ20140417172417120) and (JCYJ20140417172417145), Shenzhen Key Laboratory grant (ZDSYS20140508161825065), Shenzhen Science and Innovation Committee grant (JCYJ20140417172417145), and Guangdong Science and Technology Foundation grant (2014A010103007).
