Abstract
We present a combined machine learning and computer vision approach for robots to localize objects. It allows our iCub humanoid to quickly learn to provide accurate 3D position estimates (in the centimetre range) of objects seen.
Biologically inspired approaches, such as Artificial Neural Networks (ANN) and Genetic Programming (GP), are trained to provide these position estimates using the two cameras and the joint encoder readings. No camera calibration or explicit knowledge of the robot's kinematic model is needed.
We find that ANN and GP are not only faster and of lower complexity than traditional techniques, but also learn without the need for extensive calibration procedures. In addition, the approach localizes objects robustly when they are placed at arbitrary positions in the robot's workspace, even
1. Introduction
Today the majority of robots are still applied in industrial settings, where they are mainly used as programmable machines to solve automation tasks with pre-defined, pre-programmed actions in very structured environments. In recent years, however, the field has been moving towards extending the use of robotic systems into areas where they can co-exist with and help humans [1]. Proposed applications range from household tasks and hospital assistance to elderly care and grocery shopping. A main hurdle is that the world humans live in is an inherently ‘unstructured’ and dynamic environment.
A robot needs to be able to perceive and understand its surroundings, as the state of its workplace and the objects in it can no longer be known a priori. The robot has to rely on its sensory feedback to build a model of its surroundings. Although perception for robotic systems has been investigated for a long time, e.g., [2–4], it remains a difficult issue to solve in robotic systems [5]. A spatial understanding, i.e., to identify and localize objects autonomously and robustly with respect to itself, is crucial for motion planning, obstacle avoidance and finally object manipulation. We aim to provide our humanoid robot with the means to estimate the positions of objects relative to itself. We do so by using a machine learning technique providing predictions of the objects' positions.
2. Problem Statement and Previous Work
Localizing objects in 3D given two images from cameras in different locations is widely known in the computer vision literature as the ‘stereo vision’ problem. Prevalent approaches to this issue are based on analytical projective geometry and photogrammetry. Cameras photographing the same scene from two different locations provide different 2D projections of the 3D environment. If the ‘intrinsic parameters’ that specify each camera's projection from 3D to 2D are known, as well as the ‘fundamental matrix’, which encodes the epipolar geometry induced by the rigid-body transformation between the left camera's reference frame (
While traditional stereo vision approaches, based on projective geometry, have been proven effective under carefully controlled experimental circumstances, they are not ideally suited to most robotics applications. Intrinsic camera parameters and the fundamental matrix may be unknown or time varying, and this requires the frequent repetition of lengthy calibration procedures, wherein known, structured objects are viewed by the stereo vision system, and the required parameters are estimated by numerical algorithms.
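For concreteness, the triangulation step at the heart of such calibrated pipelines can be sketched as follows. This is a minimal linear (DLT-style) sketch, assuming the 3×4 projection matrices produced by a prior calibration are available; it illustrates the classical baseline, not our learned approach:

```python
import numpy as np

def triangulate_dlt(P_left, P_right, x_left, x_right):
    """Linear (DLT) triangulation: recover a 3D point from its two 2D
    projections.  P_left, P_right are 3x4 camera projection matrices
    (intrinsics and extrinsics folded together); x_left, x_right are
    (u, v) pixel coordinates of the same scene point."""
    A = np.vstack([
        x_left[0]  * P_left[2]  - P_left[0],
        x_left[1]  * P_left[2]  - P_left[1],
        x_right[0] * P_right[2] - P_right[0],
        x_right[1] * P_right[2] - P_right[1],
    ])
    # The homogeneous 3D point is the right singular vector of A with the
    # smallest singular value (the approximate null vector of A).
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

The accuracy of this estimate degrades directly with errors in the calibrated matrices, which is exactly the sensitivity the learned approach below avoids.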
Assuming a solution to the standard stereo vision problem, applying it to a real physical robot to facilitate object manipulation remains a challenge. In many robotics applications, it is somewhat inconvenient to express the environment with respect to a camera. For example, from a planning and control standpoint, the most logical choice of coordinate system is

The localization problem, illustrated using the iCub kinematic model: images from cameras located at
Our experimental apparatus is the iCub humanoid robot [7], an open-system robotic platform developed within the EU funded RobotCub project. Our configuration consists of a 41 degree-of-freedom (DOF) upper-body mounted on a pedestal (see Figure 2). The iCub is an excellent experimental platform for cognitive and sensorimotor development, and embodied Artificial Intelligence (AI) [8]. In addition, in our setup, it is particularly well-suited for object manipulation research. The hands of the iCub allow for manipulation of objects of daily living, e.g., cups, mugs, tea boxes, etc. These objects are the right size for manipulation by the robot.

The iCub humanoid robot as used in our experiments.
To our knowledge, no machine learning techniques have so far been investigated to facilitate visual and spatial perception on the iCub. Several different localization systems have previously been developed for the platform. One method currently implemented for performing stereo vision is based on a log-polar image representation. This is a biologically inspired approach that mimics the retina of the eye. Camera images are projected by a log-polar transform before typical stereo vision depth estimation algorithms are used to analyse this view. A full review of this technique for robotics applications can be found in [9]. The currently available implementation only supports a static iCub head, putting the object position in the
The ‘Cartesian controller’ module, available in the iCub software repositories, also provides basic 3D position estimation functionality [10]. This module works well on the simulated iCub; however, its performance on the hardware platform is weak. One reason for this is its need for an accurate robot model and camera parameters, which necessitates a thorough configuration before using this module on the hardware. For example, our iCub cameras were not mounted precisely as described in the CAD model, and could not be fixed mechanically due to design choices. This leads to large estimation errors when using the module.
One of the few techniques for 3D object localization, which aims for robustness to camera motion, is described in [11]. The approach computes the time-varying fundamental matrix, facilitating the estimation of the relative position of two objects (e.g., the hand and an object to grasp). However, it does require a precise kinematic model of the robot.
The precision of all of these approaches depends upon a very accurate kinematic model (or estimation thereof) of the robot. Currently, however, no module estimating the kinematics of the iCub exists; this is partly because the openly available construction models, together with thorough calibration procedures that should be applied regularly, are assumed to make such estimation unnecessary.
For other robotic platforms, machine learning has been used to estimate the kinematic model, for example, Bongard
The most accurate localization tool currently available for the iCub is the ‘stereo Vision’ module. It provides accuracy in the range of a few centimetres, but with high variance depending on where the object is ‘seen’ in the camera frame. Unlike the log-polar approach presented above, this module for 3D localization works with the entire iCub kinematic model, providing a position estimate in the
Approaches trying to develop hand-eye coordination have previously been investigated. For example, Hager
3. Our Approach: Learning Spatial Perception
In this paper we investigate a novel approach to spatial perception. Our method combines the two calibration tasks, determining the camera parameters and the precise kinematics, into one machine learning problem, removing the need for prior calibration. The robot is also able to learn localization for cases with camera or kinematic irregularities, such as at the limits of joints or at the edges of the camera frame. We apply two biologically inspired machine learning approaches: Artificial Neural Networks (ANN) and Genetic Programming (GP). These approaches are well-known approximation and prediction methods for datasets where samples of inputs and the correlated outputs are available.
The stereo vision problem in a humanoid depends, apart from the location of the object in both eyes, also on the nine degrees of freedom that control the eye position with respect to
3.1 Vision System: The icVision Framework
Before the robot can start learning positions, it first needs to detect and track objects. A pair of cameras, mounted in the head, provides the iCub's vision. The human-like design does not actively emit any measurement signal, but rather relies solely on ambient light images. Wide-angle lenses broaden the iCub's field of view, however they also add distortion, making traditional calibration somewhat challenging. Existing software to perceive, detect and track objects in the camera images was reused and extended to build and update a world model.
Attached to it are

The

A more complicated example of a learned
Due to the modularity of the
The 3D location estimation works as follows (see Figure 5 for reference):

The detection and localization of a specific object in the iCub's camera images is shown in this example. The position of the object is estimated in the
1. At first the camera images are acquired from the hardware via YARP. The images are converted into grey-scale, split into RGB/HSV channels and then distributed to all active
2. Each filter then processes the input images, using mainly OpenCV functions.
3. The output of this step is a binary image, showing the segmented-out object to be detected.
4. A blob detection algorithm is run on these binary images to find the (centre) location of the detected object in the image frame.
5. The position of the object in both camera images is sent to the 3D localization module.
6. The robot's pose (the joint encoder values) is read.
7. Using both the object locations in the camera frames and the robot's pose, a 3D location estimate is generated.
8. As the last step, the localized object is placed in the world model (see Results section).
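The steps above can be condensed into a short sketch. All method names (`grab_stereo_images`, `read_joint_encoders`, and so on) are placeholders of our own, not the actual icVision API; only the blob-centring step is concrete:

```python
import numpy as np

def blob_centre(mask):
    """Centre of mass of a binary segmentation mask (step 4).
    Returns (u, v) pixel coordinates, or None if nothing was segmented."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())

def localize_once(robot, obj_filter, estimator, world_model):
    """One pass of the localization loop (placeholder interfaces)."""
    left, right = robot.grab_stereo_images()             # 1. images via YARP
    mask_l, mask_r = obj_filter(left), obj_filter(right) # 2-3. binary segmentation
    c_l, c_r = blob_centre(mask_l), blob_centre(mask_r)  # 4. blob centres
    if c_l is None or c_r is None:
        return None                                      # object not visible in both views
    pose = robot.read_joint_encoders()                   # 5-6. joint encoder values
    xyz = estimator([*c_l, *c_r, *pose])                 # 7. learned 3D estimate
    world_model["object"] = xyz                          # 8. update world model
    return xyz
```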
3.2 Artificial Neural Network (ANN)
The first machine learning approach uses a feed-forward, multi-layered Artificial Neural Network (ANN) trained using error back-propagation [23]. This approach requires a pre-processing step, in which the input vectors in the dataset are scaled using the limits given in Table 1 to obtain values in the range [−1, +1]. The limits are based on the image size (the first four values) and the joint limits (i.e., the range of motion of the stochastic controller) of the robot (the encoder values). The output of the ANN lies in the same limited range and needs to be scaled back as well.
The limits used to scale the input values (features) and a typical entry from the dataset collected.
For training the network, the (scaled) dataset was first randomly split into a training (
The network uses bias terms and is fully connected. In our case the ANN's output layer is a single neuron representing the estimated position along one single axis.
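The scaling and forward pass can be sketched as follows. The limit values here are illustrative stand-ins, not the actual entries of Table 1, and the network is a generic one-hidden-layer net of the kind described, not the trained weights from the paper:

```python
import numpy as np

def scale(x, lo, hi):
    """Min-max scale a feature vector into [-1, +1] per feature.
    In the paper, lo/hi come from Table 1: image size for the four
    pixel coordinates, joint limits for the encoder values."""
    x, lo, hi = map(np.asarray, (x, lo, hi))
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def unscale(y, lo, hi):
    """Invert the scaling, e.g. to recover a position in metres."""
    return (np.asarray(y) + 1.0) / 2.0 * (hi - lo) + lo

def mlp_forward(x, W1, b1, W2, b2):
    """Fully connected net with one tanh hidden layer and a single
    output neuron, one network per coordinate axis as in the paper."""
    h = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ h + b2)   # output also in [-1, 1]; unscale afterwards
```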
3.3 Genetic Programming (GP)
Genetic Programming (GP) is a search technique, most commonly used for symbolic regression and classification tasks. It is inspired by concepts from Darwinian evolution [24]. Herein we use GP to find expressions mapping 13 inputs (the same as for the ANN) to three outputs.
The basic algorithm works as follows: a population is initialized randomly. Each individual represents a tree, encoding a mathematical expression. The nodes encode a function, with the leaf nodes either being an available input or a constant value. For a given set of input values, the output of the expression can be found by recursing from the root node through to the terminal nodes. The individuals are then tested to calculate their ‘fitness’ (in our case the sum of the mean error). The lower this error, the better the individual is at performing the mapping. In the next step a new population is generated out of the old, by taking pairs of the best performing individuals and performing functions analogous to recombination and mutation. The process of test and generate is repeated until a solution is found or a certain number of individuals have been evaluated. [25] provides a comprehensive introduction to genetic programming.
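The tree evaluation and fitness computation described above can be illustrated with a toy encoding (a nested-tuple representation of our own; Eureqa's internal encoding differs):

```python
import math
import operator

# Internal nodes hold a function name; leaves are either an input
# variable 'x0'..'x12' (the 13 features) or a numeric constant.
FUNCS = {'add': operator.add, 'sub': operator.sub,
         'mul': operator.mul, 'sin': math.sin}

def evaluate(node, inputs):
    """Recursively evaluate an expression tree for one input vector,
    from the root down to the terminal (leaf) nodes."""
    if isinstance(node, tuple):
        name, *children = node
        return FUNCS[name](*(evaluate(c, inputs) for c in children))
    if isinstance(node, str):          # leaf: input variable 'x<i>'
        return inputs[int(node[1:])]
    return node                        # leaf: numeric constant

def fitness(tree, dataset):
    """Mean absolute error over (inputs, target) pairs; lower is better,
    so selection favours individuals with smaller error."""
    return sum(abs(evaluate(tree, x) - y) for x, y in dataset) / len(dataset)
```

For example, the tree `('add', ('mul', 'x0', 2.0), 1.0)` encodes the expression 2·x0 + 1; selection, recombination and mutation then search over such trees.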
We use ‘Eureqa’ [26], a freely available software suite, previously shown to be particularly capable of handling data from real-world experiments [27]. Compact, human-readable expressions are generated using the above-mentioned techniques. The input values do not have to be scaled in this approach and can remain in their original form. As with the neural network regression, the data was split into a training (
5. Experiments and Results
As described above, the supervised learning approaches need a dataset to train, containing the state of the robot, the location of the object in the two camera frames and the expected outputs (i.e., the
To learn about more than one state and hence get the ability to generalize what it learns, the robot needs to experience various configurations and object locations. For each hand-measured position of the object the iCub was moved into a number of randomly selected poses. At each pose an entry was added to the dataset.
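A single dataset entry as described above can be sketched as a simple record. The field names are our own illustration; the sizes follow the paper (4 pixel values + 9 encoder values = 13 input features):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    """One dataset entry: the object's pixel position in both camera
    frames, the robot's joint encoder readings, and the hand-measured
    3D position used as the supervised learning target."""
    left_uv: Tuple[float, float]             # object centre, left image
    right_uv: Tuple[float, float]            # object centre, right image
    encoders: List[float]                    # 9 joint encoder values
    target_xyz: Tuple[float, float, float]   # measured object position

    def features(self) -> List[float]:
        """The 13-element input vector fed to the ANN or GP."""
        return [*self.left_uv, *self.right_uv, *self.encoders]
```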
5.1 Estimating Object Locations on the Table
The first experiment performed was to estimate positions of objects located on a table in front of the robot. In this scenario the height (Z component) of the position is fixed, which simplifies the learning problem. To generate the dataset, a red object was placed at a known position to mark the reference points (RPs), while the iCub moved into different poses. After collecting data for a number of robot poses, the object was moved to another position and the process repeated. Figure 6 shows the positions in which the red object was placed (grid with

The distribution of the reference points collected on the table. The position of the iCub with respect to the table is indicated, and the offsets between the robot's reference frame and the table are specified.
The ANNs were then trained using PyBrain [28] with a learning rate of

The estimation errors for one specific axis using ANNs with a varying number of hidden neurons. The error bars show min. and max. errors in 10 trials.
Table 2 compares the position prediction errors of the two techniques and shows that the neural network outperforms the GP method during learning. The errors reported are the best found during 10 learning runs. Figure 8 shows the actual and predicted locations for each test case. The top image shows the ANN approach, with tightly clustered estimations around the position, while the GP approach performs worse for most data points.
The estimation accuracy (in

The difference between the estimated object positions (blue dots) and the actual object positions (red blocks). The top plot depicts the ANN results, while the bottom shows the GP.
Both approaches performed similarly well on both the training and test set, suggesting that the learned methods are estimating correctly and are able to generalize to the data collected (6×6 grid).
The GP method, while converging faster than the neural network, performs with an average accuracy of
An interesting observation from inspecting these is that only one of the camera images is used (features
During training, both the ANN and GP approaches provide sufficient accuracy for object manipulation. The trained ANNs and the GP formulae were then tested on the real iCub hardware providing position estimates of objects in real-time. A YARP module was implemented to provide the state vector to both the trained neural network and the GP evolved formulae to compare their predictions. The experiments were performed using a cup, detected in the camera images using a specifically trained
To validate the prediction errors of these learned methods, locations (on the table) and robot poses that were not in the original dataset were used. It was found that the GP out-performed (average error of
As another validation step, the performance of the two methods in terms of relative error was compared. Here the target object was moved in small increments around a central point. The measured locations are
The relative estimation errors (in cm) when predicting the positions of objects not in the training set. A fixed robot pose was used.
Figure 9 shows a plot with the estimation of these specific measurements for the relative error. To allow for a comprehensive comparison, the current state-of-the-art iCub localization module (‘stereo Vision’) is also plotted.

Comparing the location errors of GP and ANN on the iCub hardware. The black filled boxes show the hand-measured RPs, whereas the circles represent the learning approaches (empty circles: ANN, filled circles: GP). The iCub ‘stereo Vision’ module is shown in green.
5.2 Estimating Object Locations in 3D Cartesian Space
The first experiment showed that the machine learning techniques are able to estimate the position of objects detected on the table. In this second experiment, we want to overcome the limitation that the objects are placed on the table, and at the same time automate the lengthy process of collecting the dataset by hand. For this experiment a new dataset, which also includes varying heights, was collected to learn from. A high-precision robotic arm, in our case a five-DOF Katana arm by Neuronics [29], is used to position a red object in front of the humanoid (see Figure 10). Using the robotic arm, providing a high precision placement of the object (in

The iCub and the Katana robot working in a shared workspace to transfer spatial understanding from the precise industrial robot arm to our humanoid robot.
To collect the dataset, both robots are moved to randomly selected poses allowing for a random sampling of the configuration space. Once the robots reach their poses, as for the first experiment, camera images and the encoder positions of the iCub are stored.
The output value, to complete the entry in the dataset, is the 3D position information provided by the kinematics of the Katana arm, which is provided in the Katana's reference frame,
Another problem arises because in this setup multiple robots share the same workspace and are controlled independently. The challenge is to prevent collisions between the two robots, which would likely damage either or both of them. An existing software framework called MoBeE
By computing forward kinematics and maintaining a geometric representation of the 3D robot/world system, MoBeE provides a safety system for our two interacting robots. If a pending collision is detected, the controllers are disconnected from the real hardware and a reflex controller takes over. Once the two robots have returned to a safe pose, the suspended controllers are reactivated. Figure 11 shows the view of the world model with both robot models loaded and visualized.

The iCub and the Katana kinematic models loaded into
In this experiment only the ANN approach was investigated due to its better performance compared to the GP approach in the first experiment. Again separate ANNs (this time three), using 10 hidden neurons and a single output neuron, predict the location along each single axis. The learning rate and momentum are again
Figure 12 visualizes the prediction error for all samples and axes in the dataset. A few outliers can be seen, but generally the prediction is accurate. The localization was tested by reaching for the red block held by the Katana manipulator (Figure 2) and reaching for a cup on the table, as shown in Figure 13.

The prediction errors per axis, after training the ANNs using the full 3D localization dataset.

The iCub estimates the position of the cup placed on the table in front of the robot. Note: the cup is placed directly under the arm, but due to the camera parameters and the different perspectives of the two images this is hard to see.
6. Conclusion
We tackle the problem of localizing objects in 3D space from vision on a humanoid robot. We propose a machine learning approach that does not require prior camera calibration or a precise kinematic model. The performance of this biologically inspired machine learning technique is investigated. An Artificial Neural Network (ANN) and a Genetic Programming (GP) approach are compared using a dataset collected on the iCub humanoid robot.
Both techniques provide a learned model which implicitly contains the camera parameters, usually delivered by a time-consuming stereo camera calibration, and the robot model, otherwise provided by lengthy determination of the precise kinematics. The learned, ‘lightweight’ models can easily be incorporated into embedded systems and other robotic platforms.
We present a method requiring a pre-collected dataset. Two experiments were conducted to see how this data can be collected and how well it can be learned.
The first experiment, predicting locations of objects on the table, showed that both methods are sufficient for real-world reaching scenarios, with the ANNs out-performing the GP method.
In the second experiment, the humanoid shared a workspace with a Katana industrial robot arm. Due to the ability to allow for safe interaction, the iCub was able to learn estimating positions in full 3D Cartesian space from the Katana. The accuracy of the trained ANN localization is sufficient to allow the iCub to reach for objects, as shown in Figure 13.
Our iCub was able to learn to estimate the locations of objects, placed at arbitrary positions in its workspace, in 3D Cartesian coordinates. In addition, the approach localizes robustly, even
Footnotes
7. Acknowledgments
The authors would like to thank: Leo Pape (IDSIA) & Ugo Pattacini (IIT) for the Cartesian controller and stereo camera calibration of the iCub; Davide Migliore & Alexandre Bernardino (IST) for helping with the vision review. This research is supported by the European Union project IM-CLeVeR, contract FP7-IST-IP-231722.
4. YARP is a middleware that allows easy, distributed access to the sensors and actuators of the iCub humanoid robot, as well as to other software modules.
