Abstract
In this paper, a new object tracking system is proposed to improve the object manipulation capabilities of service robots. The goal is to continuously track the state of the visualized environment in order to send visual information in real time to the path planning and decision modules of the robot; that is, to adapt the movement of the robotic system according to the state variations appearing in the imaged scene. The tracking approach is based on a probabilistic collaborative tracking framework developed around a 2D patch-based tracking system and a 2D-3D point features tracker. The real-time visual information is composed of RGB-D data streams acquired from state-of-the-art structured light sensors. For performance evaluation, the accuracy of the developed tracker is compared to a traditional marker-based tracking system which delivers 3D information with respect to the position of the marker.
1. Introduction
One of the main tasks of service robotics systems operating in human environments, such as the FRIEND and PR2 mobile manipulation platforms shown in Figure 1, is the grasping and manipulation of objects of interest.

Mobile manipulation platforms: (a) FRIEND, (b) PR2
The main goal of tracking rigid bodies in robotics is to determine their pose with respect to a common reference coordinate system, so that the robot's path planning and decision modules can reason consistently about the objects' locations.

Tracking during object grasping and manipulation. Snapshots of 2D features (a-d) and 3D point cloud (e-f) tracking.
In this work, we tackle the challenge of estimating and tracking the pose of a single rigid object of interest during robotic grasping and manipulation.
The contributions of the paper may be summarized as follows:
development of a collaborative robotic tracking framework for fusing information from multiple trackers;
implementation of a 2D-patch-based particle filtering algorithm for tracking a reference object of interest;
occlusion detector based on ray casting using an RGB-D camera system;
usage of object tracking for improving mobile manipulation scenarios.
As will be shown in the performance evaluation section, it is relatively difficult to compare the results of the proposed tracker, which by its nature processes 3D information, to existing 2D systems. Although there are a large number of image-based trackers which accurately deliver visual information in the 2D image domain, the tracker presented in this paper is intended to be used in service robotics applications where 3D information regarding the objects of interest is imperative.
1.1. Related Work
Although object tracking is a well-developed research area in the computer vision community, its application both in mobile manipulation and robotics in general [5] is quite restricted. Recently, Krainin et al. [6] applied the concept of object tracking during manipulation for building online 3D models of objects of interest. As they remark, the online learning and tracking of new objects is an imperative task for successful robotic manipulation scenarios. Also, tracking approaches have been proposed by Wang et al. [7] and Krainin et al. [6] for the purpose of hand tracking and modelling.
Teichman and Thrun [8] proposed a semi-supervised, boosting classification approach to the problem of track classification in dense 3D range data. The method uses a series of 2D-3D features, such as spin images, to discriminate between the classes of the tracked objects.
A novel paradigm for training a binary classifier in the context of tracking has been proposed in [9, 10, 11] for the problem of pure 2D image tracking.
Boosting is a machine learning technique used in a variety of computer vision applications such as image segmentation, text and object recognition, natural language processing, medical diagnostics, etc. In this paper, a boosting approach similar to the one described by Grabner and Bischof [12] has been used to track the object of interest in the 2D image domain.
One of the first applications of boosting in the area of computer vision was conducted by Viola and Jones [14] for the purpose of fast object detection. In that work, boosting, in its AdaBoost form, was used both to select a small set of discriminative Haar-like features and to train a cascade of increasingly complex classifiers.
In a number of recent papers, such as that by Choi et al. [15], the outputs of multiple trackers are fused together using a weighting scheme to improve the performance of the overall tracking procedure. The same concept has been successfully applied by Yang et al. [16], who suggest a collaborative tracking framework, to the problem of tracking the motion of the human heart.
Tracking has also been heavily investigated in relation to camera pose estimation and the dense 3D reconstruction of human environments. The DTAM (Dense Tracking and Mapping) approach, for example, simultaneously estimates the camera pose and a dense surface model of the scene from a single moving RGB camera.
2. A collaborative tracking framework
The collaborative nature of the tracking framework is defined within the classical Bayesian tracking approach. Let $x_t$ denote the state of the tracked object (e.g., its 3D pose) at time $t$, and let $z_{1:t} = \{z_1, \ldots, z_t\}$,
where $z_t$ is the measurement acquired at time $t$ and $z_{1:t}$ is the sequence of all measurements up to time $t$.
The tracking challenge can be defined as the estimation of the posterior probability $p(x_t \mid z_{1:t})$ of the object's state, given all measurements.
Sequential Bayesian tracking follows a Markov modelling approach, where the prediction step is defined as:
$$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, \mathrm{d}x_{t-1}.$$
The update step is defined as:
$$p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1}).$$
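In practice, this prediction-update recursion is commonly approximated with a particle filter. The following minimal 1D sketch is illustrative only: the random-walk motion model, Gaussian measurement likelihood, noise levels, and particle count are assumptions, not the paper's actual implementation.

```python
import math
import random

def predict(particles, motion_std=0.5):
    # Prediction step: propagate each particle through the motion model
    # p(x_t | x_{t-1}); a simple random-walk model is assumed here.
    return [x + random.gauss(0.0, motion_std) for x in particles]

def update(particles, z, meas_std=1.0):
    # Update step: weight each particle by the measurement likelihood
    # p(z_t | x_t), then resample to approximate p(x_t | z_{1:t}).
    weights = [math.exp(-0.5 * ((z - x) / meas_std) ** 2) for x in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    return random.choices(particles, weights=weights, k=len(particles))

random.seed(42)
particles = [random.uniform(-10.0, 10.0) for _ in range(500)]
for z in [1.0, 1.2, 0.9, 1.1]:   # synthetic noisy measurements
    particles = predict(particles)
    particles = update(particles, z)
estimate = sum(particles) / len(particles)  # posterior mean
```

After a few measurements near 1.0, the posterior mean concentrates around the measured position.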
In order to improve the tracking capability, we propose a set of collaborative trackers that can take advantage of each other's data. The tracking framework is built around a set of two trackers:
a 2D patch-based tracker, which follows a region of interest (ROI) around the object in the image plane;
a 2D-3D point features tracker, which establishes point correspondences between consecutive frames and estimates the 3D pose of the object.
Both trackers contribute to the final state estimate as a weighted combination of their individual posteriors:
$$p(x_t \mid z_{1:t}) = \omega_{1,t}\, p_1(x_t \mid z_{1:t}) + \omega_{2,t}\, p_2(x_t \mid z_{1:t}),$$
where $p_1$ and $p_2$ are the posteriors of the patch-based and the point features tracker, respectively. As suggested by Yang et al. [16], the fusion weights $\omega_{k,t}$, with $\omega_{1,t} + \omega_{2,t} = 1$, are chosen proportionally to the measurement likelihood of each tracker, so that the more reliable tracker dominates the combined estimate.
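As a concrete illustration, a confidence-weighted fusion of two 3D position estimates can be sketched as follows. The fixed weights and pose values are hypothetical; in the framework above the weights would come from the trackers' measurement likelihoods.

```python
def fuse_estimates(poses, weights):
    """Weighted fusion of position estimates from multiple trackers.

    `poses` is a list of (x, y, z) estimates and `weights` the per-tracker
    confidences (e.g., patch-tracker score and LK inlier ratio). The
    weighting rule is an illustrative assumption, not the paper's exact one.
    """
    total = sum(weights)
    return tuple(
        sum(w * p[i] for p, w in zip(poses, weights)) / total
        for i in range(3)
    )

patch_pose = (0.40, 0.10, 0.80)   # estimate from the 2D patch tracker
lk_pose = (0.42, 0.12, 0.78)      # estimate from the 2D-3D point tracker
fused = fuse_estimates([patch_pose, lk_pose], weights=[0.3, 0.7])
```

With the point tracker weighted higher, the fused estimate lands closer to its reading.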
3. 2D-3D Object Tracking
The block diagram of the proposed tracking loop for improving the mobile manipulation capabilities of service robots is shown in Figure 3. The first step in the diagram is to acquire a stream of RGB-D data, that is, data of images with corresponding depth information provided by structured light sensors (e.g., MS Kinect®, Asus®) or stereo cameras (e.g., Point Grey's Bumblebee®).

Block diagram of the proposed 2D-3D object tracking approach for robotic grasping and manipulation (best viewed in colour)
The reference features are provided by an object recognition module. In this paper, a clustering object detection system is used, which segments objects of interest on flat surfaces such as tables [18]. The detected 3D object cluster is then projected onto the 2D image, and an initial tracking patch is calculated in the form of an ROI. As will be explained, the initial patch is tracked using an on-line boosting method. In parallel with the patch-based tracker, an optical flow system determines the 2D-3D correspondences between consecutive frames. Finally, an occlusion detector evaluates possible object occlusions.
3.1. 2D Patch-Based Tracking
A patch is considered an ROI in the 2D image that is being tracked by a specific tracker. This patch provides a search area for the second tracker, which establishes the 2D-3D point correspondences and estimates the transformation between consecutive poses of the object of interest. For tracking the initial patch, an on-line boosting learning technique has been implemented, as will be further explained below.
The boosting approach is a general framework used to improve the accuracy of a given machine learning algorithm. This is achieved by combining, through a weighted voting scheme, a set of so-called weak classifiers into a single strong classifier:
$$H(x) = \operatorname{sign}\left( \sum_{n=1}^{N} \alpha_n h_n(x) \right),$$
where $h_n(x) \in \{-1, +1\}$ is the output of the $n$-th weak classifier and $\alpha_n$ its voting weight.
The weights $\alpha_n$ are computed from the weighted training error $e_n$ of each weak classifier:
$$\alpha_n = \frac{1}{2} \ln \frac{1 - e_n}{e_n},$$
where a lower error yields a higher confidence in the corresponding weak classifier.
The basic boosting classification algorithm works as follows. Given a set of training samples:
$$\{(x_1, y_1), \ldots, (x_m, y_m)\},$$
where $x_i$ is a feature vector and $y_i \in \{-1, +1\}$ its label, a distribution of sample weights is maintained and updated after each training round, so that misclassified samples receive a higher weight.
A weak learner $h_n$ is any classifier that performs only slightly better than random guessing on the weighted training set; at each round, boosting selects the weak learner with the lowest weighted error and adds it to the strong classifier.
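The training loop described above can be written compactly as follows. This is a generic AdaBoost sketch on a toy 1D problem with threshold stumps as weak learners, not the paper's on-line Haar-feature implementation.

```python
import math

def train_adaboost(samples, labels, weak_learners, rounds):
    """Generic AdaBoost training: iteratively select the weak learner with
    the lowest weighted error and add it to the strong classifier."""
    n = len(samples)
    dist = [1.0 / n] * n                      # sample weight distribution
    strong = []                               # list of (alpha_n, h_n) pairs
    for _ in range(rounds):
        best_h, best_err = None, float("inf")
        for h in weak_learners:
            err = sum(d for d, x, y in zip(dist, samples, labels) if h(x) != y)
            if err < best_err:
                best_h, best_err = h, err
        best_err = min(max(best_err, 1e-9), 1.0 - 1e-9)  # numerical safety
        alpha = 0.5 * math.log((1.0 - best_err) / best_err)
        strong.append((alpha, best_h))
        # re-weight samples: increase the weight of misclassified ones
        dist = [d * math.exp(-alpha * y * best_h(x))
                for d, x, y in zip(dist, samples, labels)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return strong

def classify(strong, x):
    # strong classifier: sign of the weighted vote of the weak learners
    return 1 if sum(a * h(x) for a, h in strong) >= 0 else -1

# toy 1D problem: positive iff x > 2 or x < -2 (no single stump solves it)
samples = [-3, -1, 0, 1, 3, 4]
labels = [1, -1, -1, -1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in range(-4, 5)]
stumps += [lambda x, t=t: 1 if x < t else -1 for t in range(-4, 5)]
strong = train_adaboost(samples, labels, stumps, rounds=5)
```

After five rounds, the weighted vote separates the two classes even though no individual stump can.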
In the on-line boosting case dealt with in this paper, the initial training set is composed only of the first few RGB-D frames. From these frames, an initial pool of Haar-like features is built. The classifier is then trained on-line by re-detecting the tracked features in each new frame.
At the 2D image level, the performance of the on-line boosting tracker has been improved using a classical particle filtering scheme, which maintains multiple hypotheses of the patch location and evaluates each of them with the boosted classifier.
The boosting tracking architecture provides a 2D ROI in the RGB image, which can be used to track the 3D pose of an object of interest. As will be further detailed below, this pose is obtained from a set of 2D-3D point correspondences established within the ROI.
3.2. Optical Flow for 3D Object Pose Estimation
The 3D pose of the reference point cloud is calculated based on the 2D ROI obtained from the boosting method and the available 3D point cloud data. 2D point features within the ROI are tracked using the LK point tracking method [22], as shown in Figure 4. The 2D points used in LK tracking are extracted with the Harris corner detector, followed by correspondence matching using a traditional cross-correlation similarity measure. In order to account for larger motions of the object, a pyramidal implementation of the LK tracker is used.

LK tracking displayed as blue lines between two consecutive feature points. The two trackers are shown as red (patch-based tracking) and yellow (LK tracking) rectangles (best viewed in colour).
For each tracked 2D point in the image, a corresponding 3D point in the point cloud data is available; that is, there exists a direct 2D to 3D correspondence between the points obtained from the optical flow and the 3D points available in the point cloud delivered by the RGB-D system. Keeping this in mind, the consecutive 3D pose of the reference point features can be estimated by computing the rigid transformation that best aligns the corresponding 3D point sets between consecutive frames.
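The rigid alignment between two corresponding 3D point sets can be computed in closed form, for example with the SVD-based Kabsch method. The choice of solver and the synthetic point sets below are illustrative assumptions; the paper does not prescribe a specific algorithm.

```python
import numpy as np

def rigid_transform_3d(src, dst):
    """Least-squares rigid transform (R, t) mapping src -> dst.

    src, dst: (N, 3) arrays of corresponding 3D points, e.g., the
    back-projected LK features in two consecutive RGB-D frames.
    """
    src_c = src - src.mean(axis=0)            # centre both point sets
    dst_c = dst - dst.mean(axis=0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)  # cross-covariance SVD
    d = np.sign(np.linalg.det(vt.T @ u.T))     # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst.mean(axis=0) - r @ src.mean(axis=0)
    return r, t

# synthetic check: rotate a point set 30 degrees about z and translate it
theta = np.deg2rad(30.0)
r_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.05, 0.3])
src = np.random.default_rng(0).uniform(-1.0, 1.0, size=(20, 3))
dst = src @ r_true.T + t_true
r_est, t_est = rigid_transform_3d(src, dst)
```

With noise-free correspondences, the estimated rotation and translation match the true motion to numerical precision.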
3.3. Occlusion Detection
One of the main components of the proposed 2D-3D tracker is the occlusion detection system. Its goal is to restart the tracker in case certain obstacles occlude the object of interest. In order to detect occlusions in real time, a ray-casting approach has been used: rays are cast from the camera centre towards the reference point features, and a feature is marked as occluded if its ray intersects a foreign surface in the point cloud before reaching the feature itself.

Occlusion detection. Orange points: reprojection of the tracked point features. Blue points: occlusion points coming from the intersection with the robotic arm (best viewed in colour).
The experiments considered two types of occlusion: objects occluded by other objects and the occlusion of an object of interest by the manipulator arm. In the second case, the major occlusion is produced by the gripper of the robotic arm. Since the occlusion detection system is based on ray-casting, it can be stated that the detection is invariant to the shape of the tracked object or of the objects present in the scene.
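Since the RGB-D data is organized, the ray-casting test can be reduced to a depth comparison along each feature's viewing ray. The sketch below illustrates this idea; the pinhole intrinsics, the dict-based depth map, and the thresholds are all illustrative assumptions rather than the paper's implementation.

```python
def detect_occlusion(tracked_points, depth_map, fx, fy, cx, cy,
                     depth_tolerance=0.02, max_occluded_ratio=0.5):
    """Declare the object occluded when the sensor reports surfaces that
    are clearly closer than the tracked 3D features along their rays.

    tracked_points: (x, y, z) features in the camera frame, in metres.
    depth_map: dict mapping pixel (u, v) to measured depth in metres
    (a stand-in for the organized point cloud of the RGB-D sensor).
    """
    occluded = 0
    for x, y, z in tracked_points:
        u = int(round(fx * x / z + cx))   # pinhole reprojection
        v = int(round(fy * y / z + cy))
        measured = depth_map.get((u, v))
        # a surface well in front of the feature blocks its viewing ray
        if measured is not None and measured < z - depth_tolerance:
            occluded += 1
    return occluded / max(len(tracked_points), 1) > max_occluded_ratio

# two features on the object, 0.8 m from the camera (hypothetical values)
features = [(0.0, 0.0, 0.8), (0.05, 0.0, 0.8)]
fx = fy = 525.0
cx, cy = 320.0, 240.0
gripper_in_front = {(320, 240): 0.5, (353, 240): 0.5}  # gripper at 0.5 m
clear_view = {(320, 240): 0.8, (353, 240): 0.8}        # object visible
```

Because only depths along the viewing rays are compared, the test is independent of the shapes of the occluder and the tracked object.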
4. Visual Pipeline and Experimental Results
The processing pipeline for the proposed tracking system is implemented within the Robot Operating System (ROS) framework.

Object tracking loop implementation within the ROS operating system
The evaluation of the overall visual tracking system is performed with respect to the real 3D poses of the objects of interest. Although many 2D image-based object tracking methods exist, the literature is relatively scarce on 3D object trackers, which are a mandatory requirement for service robots operating in uncontrolled real-world conditions. One of the most reliable 3D object tracking systems relying only on 2D image information is the ARToolKit library [23], against which the collaborative tracker presented in this paper has been evaluated.
The real 3D positions and orientations of the objects of interest were manually determined using the following setup. On the imaged scene, a visual marker, considered to be the ground truth information, was installed in such a way that the poses of the objects could be easily measured with respect to the marker. The 3D pose of the marker was detected using the ARToolKit library, which provides subpixel accuracy estimation of the marker's location with an average error of approx. 5 mm [23]. By calculating the marker's 3D pose, a ground truth reference value for camera position and orientation estimation could be obtained using the inverse of the marker's pose matrix. Further, the positions of the object's features were calculated using the proposed system. Both results, i.e., the 2D and 3D poses, were compared to the ground truth data provided by the ARToolKit marker. The 2D image marker position was calculated using its reprojection in the 2D image plane. As can be seen from Figure 7, the 2D and 3D position errors are within a tolerable range. The statistical analysis of the results is summarized in Table 1. It is important to emphasize here that the ARToolKit tracker was considered to be a reference value, or ground truth, against which the proposed approach was measured. In many situations, the classical marker-based approach fails to deliver proper pose estimation because of different lighting and surface reflection phenomena.
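The error statistics summarized in Table 1 amount to computing, per frame, the Euclidean distance between the estimated object position and the marker-derived ground truth, and then summarizing those distances. A sketch with synthetic values (the poses below are illustrative, not the measured data):

```python
import statistics

def position_error_stats(estimates, ground_truth):
    """Per-frame Euclidean position errors against the marker-based ground
    truth, summarized as mean, standard deviation, and maximum."""
    errors = [
        sum((a - b) ** 2 for a, b in zip(est, gt)) ** 0.5
        for est, gt in zip(estimates, ground_truth)
    ]
    return {
        "mean": statistics.mean(errors),
        "stdev": statistics.stdev(errors),
        "max": max(errors),
    }

# synthetic example: a static object at (0.40, 0.10, 0.80) m
gt = [(0.40, 0.10, 0.80)] * 4
est = [(0.403, 0.100, 0.800), (0.400, 0.105, 0.800),
       (0.400, 0.100, 0.796), (0.402, 0.099, 0.801)]
stats = position_error_stats(est, gt)
```

With millimetre-level deviations, the summary errors stay well below the 0.01 m tolerance required for the manipulation experiments.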

Variations in the calculated 2D and 3D poses with respect to the reference ARToolKit marker (best viewed in colour)
Statistical results of errors between the proposed and the ARToolKit marker based 3D tracking system
We consider a processing cycle to begin with RGB-D data acquisition and to end with occlusion detection. In the experimental setup, a calibrated MS Kinect® structured light sensor was used to acquire a sequence of 300 indoor images. The average computational time of a processing cycle is just over 400 ms, which is low enough to consider the proposed system real-time for the targeted case of object manipulation in domestic environments. The implemented tracking architecture has been tested on a typical portable computer running a 64-bit UNIX operating system on an Intel® i3 dual-core CPU with a 2.40 GHz clock speed per core.
Throughout the object handling routine, the required tracking accuracy depends strictly on the manipulative task and the configuration of the robot. In particular, the wider the opening angle of the gripper and the higher the number of degrees of freedom of the manipulator arm, the lower the required tracking accuracy. In the presented experiments, a tracking error lower than 0.01 m was required.
5. Conclusions
In this paper, a 2D-3D object features tracking system has been proposed. It has been shown to stabilize mobile manipulation and to accurately track rigid household objects during robotic handling. The proposed approach is built around a collaborative tracking framework which fuses information from multiple available trackers. As future work, the authors will consider extending the collaborative framework with new state-of-the-art trackers in order to improve the accuracy of the proposed system. Furthermore, accelerating the system with state-of-the-art parallel processing hardware, such as FPGAs and GPUs, would significantly decrease the processing time.
6. Acknowledgements
Sorin Grigorescu was supported by the Sectoral Operational Programme, Human Resources Development (SOP HRD), financed by the European Social Fund and the Romanian Government under project no. POSDRU/89/1.5/S/59323. Claudiu Pozna was financed by the Széchenyi István University. The authors would like to thank Prof. Michael Beetz and Dejan Pangercic for their support and constructive ideas.
