Fast Matching of Binary Descriptors for Large-Scale Applications in Robot Vision

Abstract

The introduction of computationally efficient binary feature descriptors has raised new opportunities for real-world robot vision applications. However, brute force feature matching of binary descriptors is only practical for smaller datasets. In the literature, there has therefore been an increasing interest in representing and matching binary descriptors more efficiently. In this article, we follow this trend and present a method for efficiently and dynamically quantizing binary descriptors through a summarized frequency count into compact representations (called fsum) for improved feature matching of binary point-features. With the motivation that real-world robot applications must adapt to a changing environment, we further present an overview of the field of algorithms, which concerns the efficient matching of binary descriptors and which are able to incorporate changes over time, such as clustered search trees and bag-of-features improved by vocabulary adaptation. The focus for this article is on evaluation, particularly large scale evaluation, compared to alternatives that exist within the field. Throughout this evaluation it is shown that the fsum approach is both efficient in terms of computational cost and memory requirements, while retaining adequate retrieval accuracy. It is further shown that the presented algorithm is equally suited to binary descriptors of arbitrary type and that the algorithm is therefore a valid option for several types of vision applications.

Keywords

Binary Descriptors Efficient Feature Matching Real-world Robotic Vision Applications

1. Introduction

An important issue in computer vision is the use of compact and computationally efficient binary-valued visual feature descriptors. This is especially true for large-scale robotic vision applications, for several reasons. First, a compact representation can enable efficient real-time matching of a large variety of objects and scenes found in real-world environments. Second, an efficient representation can relieve internal memory and processor to complete other tasks (particularly useful in robotics). In the literature, there has been an increasing interest in quantizing visual descriptors more efficiently, as well as improving the overall computation of training and the matching of visual descriptors. However, an improvement in computational efficiency is often made at the cost of a reduced retrieval accuracy (and vice versa) and a general trend is therefore to use computationally efficient approximated solutions supported by additional learning (either supervised or unsupervised) via a demanding off-line training phase. Yet while off-line training is acceptable for creating a reference space of objects and scenes for recognition tasks, a real time system such as an operating robot platform must also maintain a dynamic representation of recognized objects and/or scenes [1]. For this purpose, the representation of features must not only be compact but also dynamic, so that the algorithm can adapt and incorporate new features without repeating elaborate off-line training.

The work presented in this paper introduces a summative approach known as $f s u m$ , which represents and matches binary visual feature descriptors while addressing the above challenges. The main aim of the approach is to summarize binary descriptor sets into an array that represents the frequency count of the byte values within a binary string of a descriptor set. We evaluate two versions of this approach, where a one-dimensional array is used called $f s u m_{1 d}$ and a two-dimensional array is used called $f s u m_{2 d}$ . The paper presents an evaluation of our suggested $f s u m$ approach using traditional Oxford datasets. An evaluation of the matching efficiency and accuracy are presented and compared to the state-of-art. The key contribution of this work is to provides object recognition accuracy that is on par with the state-of-the-art, but with a faster matching time and lower computational cost.

The paper is organized as follows: an overview of the state-of-art in adaptable solutions for training and matching binary descriptors is presented in section 2; our suggested $f s u m$ approach is presented in section 3; the performed evaluation is presented in section 4; finally, we end this paper with conclusions in section 5.

2. Related Work

Feature matching of vector-based local visual features such as SIFT [2] and SURF [3, 4] have successfully been used for over a decade. However, vector-based visual features are computationally costly and can therefore become a bottleneck when used for vision applications (especially when used for real-time applications and together with larger datasets). As an alternative to vector-based visual features, a number of more computationally efficient binary-valued visual features, such as BRIEF [5], ORB [6], BRISK [7] and FREAK [8], have recently been proposed.

Common for all local visual features (both vector-based and binary) is that feature point descriptors are computed for image patches around distinct image keypoints and feature descriptors are therefore often coupled with a keypoint detector. Calonder et al. suggest the use of the CenSurE [9] or FAST [10] detector for detecting the keypoints over which intensity difference tests over randomly selected pixel pairs are computed in order to represent image patches for the BRIEF descriptor [5]. However, a shortcoming of the BRIEF descriptor is that it is sensitive to in-plane rotations and scaling. As an improvement, Rublee et al. suggest the ORB detector (Oriented FAST and Rotated BRIEF) [6], which extended FAST by intensity centroids [11] for orientation alongside introducing a greedy search algorithm for selecting the most uncorrelated BRIEF binary tests (those of high variance) for improving rotation invariance. Inspired by the AGAST extension to FAST [12], Leutenegger et al. suggest a scale-space detector for identifying keypoints across scale dimensions using a saliency criterion for their binary robust invariant scalable keypoints. (BRISK) [7]. A binary string for each keypoint is calculated over a rotation invariant sample pattern consisting of uniformly appropriately-scaled concentric patches around each keypoint. Similar to BRISK, [8] presented their human retinal inspired fast retina keypoint (FREAK) [8], in which a cascade of binary strings is computed as the intensities over the retinal sampling patterns of a keypoint patch. It is worth noting that, even though feature descriptors re often coupled with suggested keypoint detectors, there is no strict bound between descriptors and detectors.

In reality, feature descriptors and keypoint detectors can be combined arbitrarily, and hence customized for specific applications. Regardless of the combination of keypoint detector and descriptor extractor, the resulting feature descriptor set, c, is a set of individual binary strings of dimensionality l bytes, e.g., 32 or 64 bytes, which have been extracted for n keypoints detected in the same image. Hence and for forthcoming sections, a training dataset of descriptor sets extracted from a collection of m images are denoted by:

C = {c_{1}, c_{2},… c_{m}}

(1)

The main benefit of binary descriptors (compared to vector-based descriptors) is the support of fast brute-force matching (or linear search) by calculating the Hamming distance between features, which for binary strings is the number of bits set to 1 in the result of XOR between two strings. This can today be computed extremely efficiently with the use of the POPCNT instruction in x86_64 architectures. Here, the Hamming distance between a binary string i of a query descriptor set x and a binary string j of trained descriptor set $c \in C$ is denoted by:

d^{H a m m i n g} (x_{i}, c_{j}) = \sum_{k = 0}^{l - 1} [x_{i, k} \neq c_{j, k}]

(2)

However, a brute-force matching is only practical for smaller datasets. In the literature (and specifically in the community of large-scale feature matching), the use of hashing techniques has been considered best practice for improving the matching of binary feature descriptors. Together with the ORB algorithm [6], locality sensitive hashing (LSH) [13] has been suggested as a nearest neighbour search strategy, where features are stored in different buckets over several hash tables. Salakhutdinov & Hinton presented the concept of semantic hashing in [14], where features are mapped to much smaller binary codes. However, semantic hashing is largely practical only for searching for nearest neighbours that differ by only a couple of bits. Another prominent approach in the use of hashing techniques is multi-index hashing, presented in [15] and which showed results on datasets with binary strings of lengths up to 128 bits. In general, however, it is also the length of the binary strings and the amount of memory required to store all the buckets for a hashing approach that often serves as the limitation of a hashing approach. To date, no efficient method for handling binary codes of lengths of 32 or 64 bytes (which are the lengths of the binary strings used in this paper) has been presented. For a complete review of binary hash codes for large-scale image search, see [16].

In summary, there are currently two prominent approaches for efficient matching of visual features that are directly (or with minimal alteration) applicable to binary feature descriptors and hence, comparable to our approach: 1) clustered search trees; 2) bag-of-features, with improvement of vectors of aggregated local descriptors (VLAD) and vocabulary adaptation. The prior is computationally efficient with respect to the training of descriptor sets. The latter approach is a summarized approach and hence, a memory efficient approach. Contrarily, the prior approach is associated with a high cost of memory requirements, while the latter approach requires extensive training in order to reduce memory requirements. The novelty of our $f s u m$ approach is complementary to both clustered search trees and bag-of-features, with improvement of VLAD and thus represents an approach that is both computational and memory efficient.

3. A Summative Approach for Fast Training and Matching of Binary Descriptors

The principal idea behind the proposed approach is to maintain a byte frequency count for each byte and for each binary string of a descriptor set. Specifically, the approach transforms a binary descriptor set into a histogram that maintains the byte value distribution of the binary strings, which constitute a descriptor set [17]. The inspiration for this approach stems from two core questions: 1) is it possible to reduce the total search space prior to matching individual descriptor strings?; 2) is it possible to reduce the memory requirements for representing a binary descriptor set?

3.1. Summarize binary descriptors by frequency count

The procedure for creating a $f s u m$ is performed in two steps:

Step 1. Given a binary descriptor set $c \in C$ , which consists of n binary strings of l length, the principal objective with the suggested $f s u m_{2 d}$ approach is to create a two-dimensional array that maintains a frequency count and which represents how often a particular value occurs for each byte and for each binary string. To exemplify, consider that we are given a binary descriptor set c of dimension $n \times l$ , where n is the number of binary strings for descriptor set c and l is the length in bytes of each binary string, e.g., 32 or 64. The resulting two-dimensional array a will then have a dimension of $p \times l$ , where p is the maximum value of a byte and l is the number of bytes for each binary string, e.g., 32 or 64.

The frequency count is a weighted frequency count depending on the size of n so that the larger the n, the less significance each individual byte has on frequency occurrence. Thus, frequency occurrence is multiplied by $1 / (l * n)$ for each cell, $a_{(i, j)}$ , so that each cell value is given in the interval $[0.0,1.0]$ . This process is then repeated for each binary descriptor set c in a dataset C. Hence, A denotes the total dataset of summarized two-dimensional arrays. Algorithm 1 provides the pseudocode for summarizing a binary descriptor set c into a corresponding summarized two-dimensional representation a (which will at a later stage be referred to as a $f s u m_{2 d}$ representation).

Algorithm 1 Summarize 2D Representations - f-sum_2d

1: function Summarize-2d(c, l) 2: n ← size-of (c) 3: f ← 1/(l ^* n) 4: a(p, l) ← create 2-dimensional array 5: for i ← 0, n do 6: for j ← 0, l do 7: k ← C_i,j ⊳ j:th byte of c_i 8: a_k,j ← increase by f 9: end for 10: end for 11: return a 12: end function

Step 2. Given A, the next step is to further reduce the dimensionality by compressing each summarized two-dimensional array a into a compact one-dimensional array b. The major step here is removing the l dimension by calculating the mean value for each row a_i. Additionally, based on the intuition that extreme peaks (or the absence thereof) provide valuable distinctness in the sense of a frequency count, the min and the max value for each row a_i are in this instance also calculated. All calculated row mean, min and max values are then merged into a one-dimensional array b, which will then have a length of $3 p$ . Similar to the former step, this process is then repeated for each two-dimensional array a in a dataset A. Hence, B denotes the total dataset of summarized one-dimensional arrays. Algorithm 2 provides the pseudocode for further summarizing a two-dimensional array a into a corresponding summarized one-dimensional representation b (which will at a later stage be referred to as a $f s u m_{1 d}$ representation).

Algorithm 2 Summarize 1D Representations – f-sum_1d

1: function Summarize-1d(a, l) 2: b(3 × p) ← create one-dimensional array 3: for i ← 0, p do 4:

b_{i} \leftarrow \frac{1}{l} \sum_{j = 0}^{l} a_{i, j}

5: b_p+i ← min(a_i,0…I) 6: b_2p+i ← max(a_i,0…I) 7: end for 8: return b 9: end function

Figure 1 graphically depicts this process of summarizing a binary descriptor set c, firstly, into a two-dimensional array a and subsequently, a one-dimensional array b.

Figure 1.

Example of translating a binary descriptor set (c) from a two-dimensional array (a) to a one-dimensional array (b). The left most table is the binary feature set of dimension $n \times l$ . The middle table depicts the two-dimensional array of dimension $n \times l$ , which provides a frequency count of the various byte values. The right most table depicts the one-dimensional array of length $3 p$ , which further summarize the frequency count by reducing the l dimensionality. Dark cells indicates a lower frequency count, while lighter cells indicate a higher frequency count.

3.2. Matching of summarized representations

Matching is done by measuring the similarities between an unknown descriptor set x and each training descriptor set $c \in C$ . Matching an unknown binary descriptor set x starts by creating a new two-dimensional array $\hat{a}$ consisting of the frequency count for the unknown descriptor set, depicted in Algorithm 1. Once the two-dimensional array $\hat{a}$ is created, this array is matched against the summarized training array $a \in A$ and where similarity r is measured by the L₁ distance according to:

r = \sum_{i = 0}^{p} \sum_{j = 0}^{l} | {\hat{a}}_{i, j} - a_{i, j} |

(3)

Subsequently, a new one-dimensional array $\hat{b}$ is created for the summarized two-dimensional array $\hat{a}$ , depicted in Algorithm 2. This array is then matched against the summarized training array $b \in B$ and where, similar to the previous step, the similarity r is measured by the L₁ distance according to:

r = \sum_{i = 0}^{p} | {\hat{b}}_{i} - b_{i} |

(4)

The winning matching binary descriptor set c will be the one with the highest similarity r (or lowest L₁ distance) for a match against the corresponding arrays a and b. Moreover, the resulting similarity score r is also given for the entire descriptor set c (in opposition to traditional feature matching, where a similarity score is given for a individual descriptor string of descriptor set c).

3.3. Maintaining summarized representations

For a feature matching algorithm to be beneficial for real-world robot vision applications, the algorithm must be dynamic and adapt to changes as new data are collected. Based on the suggested summary of binary descriptor sets (presented in section 3.1), it is obvious that sets of binary descriptors are trained (summarized) independently of one another. The same is true for matching an unknown binary descriptor set, x, against a summarized representation. As a result, adding a binary descriptor set c and extending the total dataset C to $m + 1$ is achieved by summarizing the descriptor (through Algorithm 1 and Algorithm 2), and by including the summarized results in collections A and B, respectively. Moreover, removing a binary descriptor set c is achieved by simply removing corresponding summarized representations $a \in A$ and $b \in B$ .

4. Evaluation

The proposed algorithm was implemented on top of the OpenCV library¹. The oriented FAST keypoint detector [6] was used as a keypoint detector. Since our approach is based on a frequency count, we have further adopted Harris corner measures [18] for ordering the FAST keypoints, as suggested by [6]; the $n = 2000$ strongest keypoints for each image were selected and used for feature-point descriptor extraction in this evaluation. Furthermore, the following types of feature descriptors of dimensionality l bytes were considered (based on their support in OpenCV):

ORB descriptors (of 32 bytes).

BRISK descriptors (of 64 bytes).

FREAK descriptors (of 64 bytes).

The BRIEF descriptor is also supported in the current version of OpenCV. However, since the ORB (oriented FAST and rotated BRIEF) algorithm is based on the BRIEF algorithm and because BRIEF is known for having a weak rotational invariance, it was not considered in this evaluation.

4.1. Notes on datasets for evaluation

The datasets used in the literature for evaluating algorithms for feature matching are commonly based on images collected from the web, e.g., Flickr, Google search, Bing search, etc. These datasets consist therefore of the top ranked results of searches for famous landmarks or buildings, or normalized representations of groups of objects, e.g., cars, faces, etc. While such datasets are suitable for learning about general representations of concepts for searching k-nearest neighbours for a 1:N result, an algorithm for real-world robot applications is more concerned with searching for the uttermost unambiguous 1:1 matching result, e.g., recognizing one specific object among other objects. Another fundamental difference between images collected from the web and real-world sensory data is that web images have, for the most part, been taken under good conditions and different viewpoints of the same object are available (e.g., of the same landmark), while real-world sensory data is affected by many types of distortions (e.g., motion blur, different light conditions, etc.).

For evaluation of the work presented in this paper, the Oxford Affine Covariant Features dataset was used (for short referred to as oxford throughout the rest of this paper)². This dataset was introduced by [19] for the performance evaluation of local descriptors and has been commonly used for the evaluation of keypoint-based feature matching. This dataset contains eight distinct training samples (named bark1, bikes1, boat1, graf1, leuven1, trees1, ubc1 and wall1) of different real-world scenes. For each training image there exists five corresponding query images that are distorted to some extent. This dataset covers the following types of distortions: zoom and rotation changes (boat and bark); viewpoint changes (graf and wall) various degrees of image blur (bikes and trees); different degree of JPEG compression (ubc); changes in light conditions (leuven). Examples of used training images (left image colum), together with corresponding distorted query images (middle and right image column), are illustrated in Figure 2.

Figure 2.

Examples of images used for evaluation of this work. Left image column: training images of different real-world scenes, middle and right column: corresponding distorted query images.

For evaluation at a larger scale, the Oxford Flickr 100k dataset has also been used (in short referred to as flickr100k throughout the rest of this paper)³. This dataset has been used together with the Oxford Buildings 5k dataset for large-scale object retrieval, as presented in [20]. This dataset was collected by crawling Flickr's 145 most popular tags and consists of 100,146 high resolution images. For the purpose of large-scale evaluation in the following subsections, this dataset was used so that efficiency and accuracy were evaluated while concurrently introducing a growing set of “distracting” feature descriptors to the prior oxford dataset.

A complete summary of all features available for this evaluation can be seen in Table 1, where all descriptor sets were extracted and stored off-line in advance. Therefore, keypoints in the order of 2,500 were detected and used for feature-point descriptor extraction, and each of the extracted descriptor sets were subsequently down-sampled to a desired maximum of $n = 2,000$ binary strings.

Table 1.

Summary of features used for evaluation. For each type of feature descriptor, both the total number of extracted binary strings and the average number of binary strings for each descriptor set is listed.

	oxford	flickr100k	Total
Images	$8 + 40$	100,146	100,194
ORB (tot.)	94,455	188,360,909	188,455,364
ORB (avg.)	1,967	1,880	1,880
BRISK (tot.)	94,149	186,836,372	186,930,521
BRISK (avg.)	1,961	1,865	1,865
FREAK (tot.)	86,493	174,804,410	174,890,903
FREAK (avg.)	1,801	1,745	1,745

4.2. Efficiency and accuracy compared to the state-of-art

The overall purpose of the evaluation presented in this section was to compare our summative approach to the current state-of-art processes in the efficient matching and dynamic learning of binary feature descriptors. The following algorithms were used for comparison:

Clustered search trees (described further in 6.1).

Bags-of-features improved by VLAD and vocabulary adaptation (described further in 6.2).

The parameters used for the clustered search trees were $k = 16$ cluster centres and a maximum leaf size threshold of $t = 150$ in all cases (parameter values that have proven to be a good combination [21]). The only parameter that affects the bag-of-features approach is the number of cluster centres k used for training the vocabulary. In this evaluation, both vocabularies of $k = 128$ and $k = 512$ were considered and the results are consequently also reported separately.

In this evaluation, both the oxford and flickr100k datasets were used so that scalability could be evaluated by extending the training dataset of descriptor sets (from the oxford dataset) with a growing number of descriptor sets (from the flickr100k dataset). The evaluation presented in this paper was carried out on an Ubuntu server with an Intel Core i7-3770 CPU of 3.40GHz and 8GB RAM.

4.2.1. Computational efficiency

Efficiency was in this case measured as the computational cost for both training (computational time for building a model for a training dataset C) and matching (computational time for matching an unknown descriptor set x against the model built for a training dataset C). Rather than focus on particular types of binary features (e.g,. ORB descriptors), the efficiency of training and matching for arbitrary sets of binary strings of lengths $l = 32$ and $l = 64$ (bytes) was measured in this case. The computational cost was measured with respect to growing training dataset C (where C was extracted as a subset of the flickr100k dataset) and where only descriptor sets c with $n = 2000$ binary strings were used for consistency. All results can be seen in Figure 3 and Figure 4.

Figure 3.

Training (building) efficiency measured as computational cost with respect to growing training dataset C. Above: computational time for training different models with 32 bytes binary strings, below: computational time for training different models with 64 bytes binary strings.

Figure 4.

Matching efficiency measured as the computational cost with respect to growing training dataset C. Above: computational time for matching descriptor set x against different models trained with 32 bytes binary strings, below: computational time for matching descriptor set x against different models trained with 64 bytes binary strings.

From Figure 3 it can be seen that all methods had a training cost proportional to the size of the training dataset C, both for binary descriptors of length $l = 32$ and $l = 64$ . Nonetheless, the training time for both suggested that $f s u m_{1 d}$ and $f s u m_{2 d}$ was initially between $10 - 50 t i m e s$ faster than the training time for the search tree method, and about $1,000 t i m e s$ faster than the training time for a bag-of-features method (both in the case of a vocabulary of size $k = 128$ and $k = 512$ ). For a training dataset $C_{m = 1000}$ , both $f s u m$ methods were roughly $50 - 100 t i m e s$ faster than a search tree method and about $1,000 t i m e s$ faster than the training time for a bag-of-features method. Notable here is the high computational cost for a bag-of-features approach, which was the result of computationally costly clustering in order to create the vocabulary (and which, despite the use of binary descriptors, remains a costly process [22]).

Furthermore, the results of computational time for matching, shown in Figure 4, revealed that both suggested $f s u m$ representations are extremely efficient for matching datasets $C_{m < 1,000}$ . However, it is also clear that the computational cost for matching will increase, together with the size of the training dataset, especially for suggested $f s u m_{2 d}$ representations and for the search tree method for dataset $C_{m > 10,000}$ . In Figure 4, it can also be seen that the computational time for matching the suggested $f s u m_{1 d}$ representations is comparable to the matching time of bag-of-features approaches. This was expected, since both methods follow a similar matching pattern of first compressing an unknown descriptor set x into a one-dimensional array representation, followed by a comparison between the unknown array and all stored arrays of the training dataset. However, the length for an array in this case was 128 respective 512 elements for the bag-of-features approaches and 768 elements for the suggested $f s u m_{1 d}$ approach. Also notable here is the logarithmic scales in Figure 4 and that in reality, the difference in computational cost between approaches (with the exception of the suggested $f s u m_{2 d}$ approach and for the search tree approach) is in the order of $100 m s$ for training datasets $C_{m = 50,000}$ . Hence, to exemplify and summarize, consider the following example.

Example 1 Assume that we have a training dataset of $C_{m = 5,000}$ and an unknown descriptor set x test with FREAK descriptors of $l = 64$ , and that each descriptor set has approximately $n = 2,000$ binary strings. This will in this instance result in a total time of training and matching of $1.654 s + 0.008 s$ and $0.956 s + 0.114 s$ for the suggested $f s u m_{1 d}$ and $f s u m_{2 d}$ , respectively, compared to $79.690 s + 0.0275 s$ for a search tree approach and $1093.010 s + 0.0123 s$ for a bag-of-features approach (in this case trained with a vocabulary of $k = 128$ ).

4.2.2. Matching accuracy

Matching accuracy was measured as the average precision in percentage on the oxford dataset. To evaluate accuracy at a larger scale, the evaluation was repeated while adding a growing number of “distracting” images to the training set (randomly extracted as a subset of the flickr100k dataset). Rather than measuring average precision for individual binary strings, a match in this case was measured using a “winner takes it all” approach for the entire query feature descriptor set x. Hence, matching was performed between entire feature descriptor sets (rather than between individual binary feature strings). However, for this evaluation to be comparable, several requirements were established for the search tree approach in order to classify a set of matches as “a match” for the entire descriptor set.

A common issue when working with keypoint-based feature matching is to establish a threshold for distinguish true matches from false matches. In the case of binary point-features, a Hamming distance threshold value for filtering the matches has to be established. Rublee et al. reported a threshold value of 64 for ORB features [6], while [7] suggested a distance threshold of 90 for BRISK [7]. However, no such threshold value has been reported for FREAK and this type of filtering method was therefore not considered in this evaluation. Instead, a distance ratio test was utilized in this case for filtering feature matches of the search tree approach, a method initially presented by [2], together with the initial reported work on SIFT features [2]. A ratio test filtered matches according to the distance ratio between the best match to the second best match for each query string, where only the best match was kept as the true match in cases where this ratio was less than a threshold value set to 0.8 (a threshold value commonly found in the literature [2, 23]).

Another issue when working with keypoint-based feature matching is to determine the minimum number of matches that are required for defining a true match between a full query descriptor set x and a matching training descriptor set c. [2] reported that three correct matches may be enough for a reliable true match in [2]. A recently published work on the evaluation of binary keypoint-based features further addressed this issue in [24], where it was established that a higher number of best matches results in better robustness. However, results with a low number of best matches should not be discarded and in the results given in Figure 5, we have therefore used the average precision for the minimum number of correct best matches of at least 3 matches for the clustered search trees.

Figure 5.

Matching accuracy for suggested $f s u m$ compared to clustered search trees and bag-of-features supported by VLAD and vocabulary adaptation. Top: matching accuracy in per cent with respect to growing training dataset C of ORB descriptor sets, middle: matching accuracy in per cent with respect to growing training dataset C of BRISK descriptor sets, bottom: matching accuracy in per cent with respect to growing training dataset C of FREAK descriptor sets.

The results presented in Figure 5 shows a matching accuracy for the $f s u m$ approaches that is comparable to a bag-of-features approach. It is also obvious that the matching accuracy will decrease (for all approaches) if distracting descriptors are introduced. Moreover, the results show a matching accuracy for a search tree approach that is roughly $10 % - 20 %$ higher than a $f s u m_{2 d}$ approach. However, this matching accuracy comes at the cost of demanding training and high memory requirements. This is an important trade-off between efficiency and accuracy, which is further addressed in section 4.3. To reconnect to Example 2, now consider the following example.

Example 2 Assume that we have a the same training dataset of $C_{m = 5,000}$ and an unknown descriptor set xtested by FREAK descriptors (of $l = 64$ ), and that each approach is initially untrained. This will in this case give us a matching result with an accuracy of $52.75 %$ within $< 2 s$ for the suggested $f s u m_{1 d}$ and a matching result with an accuracy of $69.50 %$ within $\sim 1 s$ for the suggested $f s u m_{2 d}$ , compared to a matching result with an accuracy of $79.72 %$ within $\sim 80 s$ for a search tree approach and a matching result with an accuracy of $53.75 %$ within $\sim 18 m i n$ for a bag-of-features approach (trained with a vocabulary of $k = 128$ ).

4.2.3. Adaptability

Adaptability determines how well different approaches are able to adapt a new descriptor set c and extend dataset C to $m + 1$ , or how well they are able to remove an existing descriptor set c from a trained model and decrease the total dataset C to $m - 1$ . Similar to the evaluation presented in section 4.2.1, the computational cost was here measured with respect to a growing training dataset C. However, in this case, each approach was first trained (built) and the computational cost was subsequently measured for adding an additional descriptor set or removing an existing descriptor set.

Moreover, Figure 3 indicates that the computational cost is proportional to length l of the binary strings of the training dataset. Therefore, only descriptors of $l = 32$ (bytes) were considered in this section.

The results in Figure 6 show that the computational cost for the suggested $f s u m$ approach is almost constant, despite the size of the training dataset. The same does not hold true for either search trees or bag-of-features, both of which have a growing computational cost (due to adding and deleting descriptor sets). However, the outstanding performance of our suggested approach can be simply explained by the fact that a set of binary feature strings are retained in coherent representations that are independent of each other and adding or removing a set of summarized binary features is therefore simplified to appending or erasing the summarized set to/from the collection used for storing the training dataset, as explained in sections 3.3.

Figure 6.

The adaptability measured as the computational time with respect to growing training dataset C. Above: computational time for adding an additional descriptor set c to an already trained dataset C, below: computational time for removing a descriptor set c from an already trained dataset C.

4.2.4. Notes on memory usage

Real-world robot vision applications are limited to the robot's internal memory. It is therefore important to quantize visual features into compact and memory efficient representations. The evaluation presented in this section is an attempt to shed some light on this particular problem.

Before proceeding with this section, we need to approximate how much additional memory each approach allocates as a result of training. This is important for the search tree approach, which includes a random and hierarchical decomposition of the training dataset; hence it is not possible to estimate the exact memory requirement. To overcome this problem, we have in this case measured the average extra memory requirement for training search trees with respect to a growing training dataset C. The extra memory requirement was measured as “extra” binary strings for training the search tree (as a result of the k-medians algorithm, described in section 6.1). The parameters used in this evaluation were $k = 16$ and $t = 150$ (the same parameter values used in previous sections).

Figure 7 shows a graph representing the resulting measurements, where an equation for a power regression curve was fitted to the measurement series (note the logarithmic scale). Furthermore, with the assumption that floating point numbers are represented by $4 b y t e s$ and based on the definitions given in sections 2 and 3.1, the total memory requirement for each approach is defined according to:

Figure 7.

Estimation of extra binary strings required (as a result of training) for the search tree approach with respect to a growing number of total binary strings in training dataset C

t o t_{f s u m_{1 d}} = (m \times 3 p) \times 4 b y t e

(5)

t o t_{f s u m_{2 d}} = (m \times p \times l) \times 4 b y t e

(6)

t o t_{S . T .} = \sum_{i = 0}^{m} (n_{c_{i}} + 0.032 n_{c_{i}}^{1.022}) \times l

(7)

t o t_{B o F} = k \times l + (m \times k) \times 4 b y t e

(8)

Now, given Eq. 5, 6, 7 and 8, consider the following examples.

Example 3 Assume that we have a training dataset of $C_{m = 1,000}$ with ORB descriptors of $l = 32$ and that each descriptor set c (for the sake of simplicity) have $n = 2,000$ binary strings. Furthermore, a vocabulary for the bag-of-features approach is training with $k = 512$ , while the search tree has been trained with the same parameter values previously used. In this case, this will result in the following memory requirements: $t o t_{}$ f sum_1d $= 2.93 M B$ , $t o t_{}$ f sum_2d $= 31.25 M B$ , $t o t_{S . T .} = 63.72 M B$ and $t o t_{B o F} = 1.97 M B$ .

Example 4 Assume that we instead have a training dataset of $C_{m = 10,000}$ with FREAK descriptors of $l = 64$ and that each descriptor set c instead have $n = 3,000$ binary strings. The same algorithm parameters (as inthe previous example) will in this case result in the following memory requirements: $t o t_{}$ f sum_1d $= 29.30 M B$ , $t o t_{}$ f sum_2d $= 625.0 M B$ , $t o t_{S . T .} = 1.87 G B$ and $t o t_{B o F} = 19.56 M B$ .

From the examples provided above, it is evident that a bag-of-features approach is best for quantizing features into compact representations (which is also the recognized strength for bag-of-features). However, the suggested $f s u m_{1 d}$ provides almost the same compact representations and without any costly and extensive training. From the examples, it is also evident that our $f s u m_{2 d}$ approach is more memory efficient than a search tree approach. However, this is true only if we have a larger number of binary strings in each descriptor set ( $n ≳ 1,000$ ). From Eq. 6, it is clear that $f s u m_{2 d}$ representations are only dependent on m and l, while Eq. 5 reveals that $f s u m_{1 d}$ representations only are dependent on m. Hence, both suggested representations are therefore always of a fixed size, which is beneficial for larger sets of features.

Furthermore, in the examples above we have only considered raw feature data. In reality, there is a number of memory related issues and requirements involved in comparing the different approaches.

A bag-of-features approach with dynamic vocabulary adaptation needs to store the original training data in order to re-cluster the vocabulary (neither the suggested $f s u m$ nor a search tree approach are required to access the original data for adapting new data). This will in the case of Example 1 above require $61.04 M B$ additional data storage for the bag-of-features approach.

A search tree approach handles binary strings individually and therefore requires an identifier associated with each individual binary string (both a $f s u m$ and a bag-of-features approach retains a coherent representation of each training descriptor set c after training). Assuming we have $4 b y t e s$ identifiers; this will require $114.44 M B$ additional data for the search tree approach in the case of Example 2 above (compared to $39.06 k B$ each for the suggested $f s u m$ and the bag-of-features approach).

A search tree approach must be trained and maintained within the working memory (both of the proposed $f s u m$ representations and the vocabulary of a bag-of-features approach can be stored off-line after training).

4.3. Efficiency and accuracy in real-world scenarios

As stated in section 3.1, the motivation behind our approach is to reduce the total search space prior to matching individual descriptor strings. This section presents the experimental results of a combined approach in which the $f s u m$ algorithm was used to reduce the search space in order to process individual descriptor strings (in real-time), through the use of a brute-force search or clustered search tree search.

4.3.1. Ranked results

This section presents the detailed rank of the results in terms of how far the true match is from the best given match. A true match in this context refers to one of the training samples (bark1, bikes1, boat1, graf1, leuven1, trees1, ubc1 or wall1) in the oxford dataset. Table 2 shows the average ranked results for each type of feature descriptor and for each of our suggested representations. All results represent the average result for a training dataset of $C_{m = 1000}$ with randomly selected samples.

Table 2.

Ranked results for each descriptor type and for each query descriptor set (used in this evaluation) after matching against the suggested fsum_1d and fsum_2d representations

20.6cmRanked results	$f s u m_{1 d}$			$f s u m_{2 d}$
ORB	BRISK	FREAK	ORB	BRISK	FREAK
bark2:	3	1	92	2	2	141
bark3:	3	1	48	10	1	121
bark4:	29	3	350	15	2	489
bark5:	2	10	269	2	1	565
bark6:	54	62	556	28	20	712
bikes2:	1	1	1	1	1	1
bikes3:	1	1	1	1	1	1
bikes4:	8	1	1	1	1	1
bikes5:	69	16	1	1	1	1
bikes6:	42	8	6	1	1	1
boat2:	1	1	1	1	1	1
boat3:	1	1	1	1	1	1
boat4:	24	19	6	1	1	1
boat5:	145	84	54	7	7	9
boat6:	19	131	45	36	58	24
graf2:	1	1	1	1	1	1
graf3:	2	4	1	1	1	1
graf4:	34	32	17	10	2	2
graf5:	96	59	1	17	9	3
graf6:	46	88	5	35	21	12
leuven2:	1	1	1	1	1	1
leuven3:	1	8	1	1	1	1
leuven4:	1	20	1	1	1	1
leuven5:	1	41	1	1	1	1
leuven6:	3	82	2	1	1	1
trees2:	2	1	1	1	1	1
trees3:	2	1	1	1	1	1
trees4:	3	1	1	1	1	1
trees5:	57	2	1	12	1	1
trees6:	163	2	4	12	19	1
ubc2:	1	1	1	1	1	1
ubc3:	1	1	1	1	1	1
ubc4:	1	1	1	1	1	1
ubc5:	1	1	1	1	1	1
ubc6:	1	1	1	1	1	1
wall2:	1	1	1	1	1	1
wall3:	1	1	1	1	1	1
wall4:	2	1	1	1	1	1
wall5:	7	1	21	1	1	3
wall6:	21	1	7	1	1	5

Table 2 shows that (with the exception of FREAK descriptors) the true match is within the top $6 %$ of the ranked result for the $f s u m_{2 d}$ approach and within the top $17 %$ of the ranked result for the $f s u m_{2 d}$ approach. Furthermore, it can be seen that the $f s u m_{1 d}$ approach will suffer from zoom, rotation and/or viewpoint changes in all cases; additionally, different lightning conditions will have an affect in the case of BRISK descriptors. The suggested $f s u m_{2 d}$ approach, on the other hand, suffers from differences in zoom, combined with rotation (especial for FREAK descriptors). For this reason, we here also stress the fact that our suggested approach is entirely based on existing methods for representing binary descriptors from keypoint features and we will therefore not go into detail about how well different types of descriptors perform within different types of distorted conditions.

4.3.2. Combined matching performance and accuracy

The procedure for combining both of the $f s u m$ approaches in order to achieve matching at a large scale is as follows: 1) match an unknown descriptor set x against the maintained $f s u m_{1 d}$ representations; 2) rank the result and select the t top ranked candidates for a further match against the maintained $f s u m_{2 d}$ representations.

Depending on the application, we further suggest either using the top ranked result of the $f s u m_{2 d}$ match as the best match (in case matching speed is favoured over matching accuracy), or conducting a further search for matching individual descriptor strings among the t top ranked candidates of the $f s u m_{2 d}$ match (in case matching accuracy is favoured over matching speed). In this evaluation, the top $10 %$ ranked candidates of the $f s u m_{1 d}$ match were used in all cases as input for the $f s u m_{2 d}$ match. Moreover, a clustered search tree algorithm was used for matching individual descriptor strings (for retaining consistency with previous sections), where the top 100 candidates of the $f s u m_{2 d}$ match were used as input for the search tree match.

The results shown in Figure 8 show that by taking an approach that combines $f s u m$ with a reduced clustered search tree, it is possible to achieve matching results in about $1 s$ for a dataset with a total of 100,000 sets of descriptors (or about $180 M$ descriptor strings in this case; see Table1). By combining our approach with the current state-of-art, it is therefore possible to achieve results both efficiently and with good matching accuracy. Results were not possible to achieve with an individual search tree approach in this case, due to the memory requirements for training a search tree. Moreover, for a trade-off of reduced matching accuracy, it is possible to achieve a matching result in less than $0.3 s$ for a dataset with a total of 100,000 sets of descriptors. Notable here is also that the plot of the matching accuracy when using a combined approach was similar to the plot of the matching accuracy presented in Figure 5 for both our suggested $f s u m$ and for the applied search trees (despite the fact that the training data for the search tree is a highly reduced search space).

Figure 8.

Results for a combined approach of using: 1) $f s u m_{1 d}$ + $f s u m_{2 d}$ ; 2) $f s u m_{1 d}$ + $f s u m_{2 d}$ + clustered search trees (a combination of both $f s u m_{1 d}$ and $f s u m_{2 d}$ is here simply referred to as $f s u m$ ). Top row: combined matching accuracy, bottom row: matching efficiency measured as the combined computational cost

5. Conclusions

In this paper we have presented a method that summarizes binary visual descriptor sets into compact representations or an $f s u m$ , with the purpose of achieving efficient feature matching in larger scale vision applications. It has been shown that the $f s u m$ representations are computationally efficient. It has also been shown that the suggested representations are well-suited for real-world applications, e.g., robotic vision applications, where a feature matching algorithm must be adaptable. Furthermore, it has been shown that the suggested $f s u m$ method can be used in combination with other algorithms where the $f s u m$ are used in order to reduce the search space for other algorithms. This is particularly useful for algorithms that can improve matching accuracy, but that would otherwise not be able to execute a feature match in a larger scale, due to limitations in system resources.

Another aspect that has been addressed in this paper is the issue of memory requirements for training and maintaining a model for the matching of binary descriptors in the computer's working memory. Since the suggested $f s u m$ maintains coherent representations for descriptor sets, it is noted that trained representations can also be stored off-line and accessed on demand. This aspect will be the subject of future research. The simplicity in the proposed matching strategy of using the L₁ distance for measuring similarity raises the question of whether it is possible to transform the proposed approach into a separate database solution for representing and matching binary descriptor sets.

Finally, in the context of the performance of feature matching, we would like to point out that the evaluation presented in this paper was carried out using an experimental set-up that employed a single process (for comparison and consistency). However, since descriptor sets are handled individually in the suggested $f s u m$ approach, this approach is also extremely suitable for parallel processing. Using a multi-threaded architecture and running several parallel matching processes can in this case improve performance.

Footnotes

6. Acknowledgements

This work was supported by the Swedish Research Council under the grant number: 2011-6104, Long-Term and Large-Scale Perceptual Anchoring.

1

Current stable v. 2.4.10, online:

2

Online:

3

Online:

Appendix

References

Persson

Andreas

, Samer Al Moubayed, and Amy Loutfi. Fluent human-robot dialogues about grounded objects in home environments. Cognitive Computation, 6(4):914–927, 2014.

David

. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

Bay

Herbert

Tuytelaars

Tinne

Van Gool

Luc J.

. Surf: Speeded up robust features. In ECCV (1), pages 404–417, 2006.

Bay

Herbert

Ess

Andreas

Tuytelaars

Tinne

Van Gool

Luc

. Speeded-up robust features (surf). Computer Vision and Image Understanding, 110(3): 346–359, 2008. Similarity Matching in Computer Vision and Multimedia.

Calonder

Michael

Lepetit

Vincent

Strecha

Christoph

Fua

Pascal

. Brief: Binary robust independent elementary features. In Proc. of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pages 778–792, Berlin, Heidelberg, 2010. Springer-Verlag.

Rublee

Ethan

Rabaud

Vincent

Konolige

Kurt

Gary

. Bradski. Orb: An efficient alternative to sift or surf. In Proc. of the 2011 International Conference on Computer Vision (ICCV), ICCV ′11, pages 2564–2571, Washington, DC, USA, 2011. IEEE Computer Society.

Leutenegger

Stefan

Chli

Margarita

Siegwart

Roland

. Brisk: Binary robust invariant scalable keypoints. In Proc. of the 2011 International Conference on Computer Vision (ICCV), ICCV ′11, pages 2548–2555, Washington, DC, USA, 2011. IEEE Computer Society.

Alahi

Alexandre

Ortiz

Raphael

Vandergheynst

Pierre

. Freak: Fast retina keypoint. In Proc. of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ′12, pages 510–517, Washington, DC, USA, 2012. IEEE Computer Society.

Agrawal

Motilal

Konolige

Kurt

Blas

Morten Rufus

. Censure: Center surround extremas for realtime feature detection and matching. In Forsyth

David A.

Torr

Philip H. S.

Zisserman

Andrew

, editors, ECCV (4), volume 5305 of Lecture Notes in Computer Science, pages 102–115. Springer, 2008.

10.

Rosten

Edward

Drummond

Tom

. Machine learning for high-speed corner detection. In Proc. of the 9th European Conference on Computer Vision - Volume Part I, ECCV ′06, pages 430–443, Berlin, Heidelberg, 2006. Springer-Verlag.

11.

Rosin

Paul L.

. Measuring corner properties. Comput. Vis. Image Underst., 73(2):291–307, Feb 1999.

12.

Mair

Elmar

Hager

Gregory D.

Burschka

Darius

Suppa

Michael

Hirzinger

Gerhard

. Adaptive and generic corner detection based on the accelerated segment test. In Proc. of the 11th European Conference on Computer Vision: Part II, ECCV'10, pages 183–196, Berlin, Heidelberg, 2010. Springer-Verlag.

13.

Gionis

Aristides

Indyk

Piotr

Motwani

Rajeev

. Similarity search in high dimensions via hashing. In Atkinson

Malcolm P.

Orlowska

Maria E.

Valduriez

Patrick

Zdonik

Stanley B.

Brodie

Michael L.

, editors, Proc. of 25th International Conference on Very Large Data Bases, VLDB ′99, pages 518–529, Edinburgh, Scotland, UK, 1999. Morgan Kaufmann.

14.

Salakhutdinov

Ruslan

Hinton

Geoffrey

. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009. Special Section on Graphical Models and Information Retrieval.

15.

Norouzi

Mohammad

Punjani

Ali

David

J. Fleet

. Fast search in hamming space with multi-index hashing. In Proc. of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ′12, pages 3108–3115. IEEE Computer Society, 2012.

16.

Grauman

Kristen

Fergus

Rob

. Learning binary hash codes for large-scale image search. In Cipolla

Roberto

Battiato

Sebastiano

Farinella

Giovanni Maria

, editors, Machine Learning for Computer Vision, volume 411 of Studies in Computational Intelligence, pages 49–87. Springer Berlin Heidelberg, 2013.

17.

Persson

Andreas

Loutfi

Amy

. A hash table approach for large scale perceptual anchoring. In Proc. of IEEE Int. Conference on Systems, Man and Cybernetics (SMC ′13), pages 3060–3066, Manchester, UK, 2013.

18.

Harris

Chris

Stephens

Mike

. A combined corner and edge detector. In In Proc. of Fourth Alvey Vision Conference, pages 147–151, 1988.

19.

Mikolajczyk

Krystian

Schmid

Cordelia

. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27:1615–1630, 2005.

20.

Philbin

Chum

Isard

Sivic

Zisserman

. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.

21.

Muja

Marius

Lowe

David G.

. Fast matching of binary features. In Proc. of the 2012 Ninth Conference on Computer and Robot Vision, CRV ′12, pages 404–410, Washington, Dpresented C, USA, 2012. IEEE Computer Society.

22.

Grana

Costantino

Borghesani

Daniele

Manfredi

Marco

Cucchiara

Rita

. A fast approach for integrating orb descriptors in the bag of words model. In Proc. of IS&T/SPIE Electronic Imaging: Multimedia Content Access: Algorithms and Systems, volume 8667, pages 091–098, San Francisco, California, US, Feb 2013.

23.

Heinly

Jared

Dunn

Enrique

Frahm

Jan-Michael

. Comparative evaluation of binary features. In Fitzgibbon

Andrew

Lazebnik

Svetlana

Perona

Pietro

Sato

Yoichi

Schmid

Cordelia

, editors, Computer Vision – ECCV 2012, Lecture Notes in Computer Science, pages 759–773. Springer Berlin Heidelberg, 2012.

24.

Bekele

Dagmawi

Teutsch

Michael

Schuchert

Tobias

. Evaluation of binary keypoint descriptors. In Proceedings of the 20th IEEE International Conference on Image Processing (ICIP), Melbourne, Australia, September 2013.

25.

Brin

Sergey

. Near neighbor search in large metric spaces. In In Proceedings of the 21th International Conference on Very Large Data Bases, pages 574–584, 1995.

26.

Csurka

Gabriella

Dance

Christopher R.

Fan

Lixin

Willamowski

Jutta

Bray

Cédric

. Visual categorization with bags of keypoints. In In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, 2004.

27.

Jégou

Hervé

Douze

Matthijs

Schmid

Cordelia

Pérez

Patrick

. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3304–3311, Jun 2010.

28.

Jégou

Hervé

Perronnin

Florent

Douze

Matthijs

Sánchez

Jorge

Pérez

Patrick

Schmid

Cordelia

. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, Sep 2012.

29.

Perronnin

Florent

Dance

Christopher R.

. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, CVPR ′07, pages 1–8. IEEE Computer Society, 2007.

30.

Arandjelović

Zisserman

. All about vlad. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1578–1585, 2013.