Abstract
With the explosive growth of surveillance data, exact match queries become much more difficult because of the data's high dimensionality and high volume. Owing to its good balance between retrieval performance and computational cost, hash learning is widely used to solve approximate nearest neighbor search problems. Dimensionality reduction plays a critical role in hash learning, as its target is to preserve as much of the original information as possible in low-dimensional vectors. However, existing dimensionality reduction methods neglect to unify diverse resources in the original space when learning a downsized subspace. In this article, we propose a numeric and semantic consistency semi-supervised hash learning method, which unifies numeric features and supervised semantic features into a low-dimensional subspace before hash encoding, and improves a multiple-table hash method with a complementary numeric local distribution structure. A consistency-based learning method, which confers semantic meaning on numeric features during dimensionality reduction, is presented. Experiments are conducted on two public datasets, the web image dataset NUS-WIDE and the text dataset DBLP. Experimental results demonstrate that the semi-supervised hash learning method, with the consistency-based information subspace, is more effective in preserving useful information for hash encoding than state-of-the-art methods and achieves high-quality retrieval performance in the multi-table context.
Keywords
Introduction
In the Internet of Things (IoT), beyond post hoc analysis, video surveillance is expected to respond to unhandled exceptions, for example, fire warning, crowd situation awareness pre-warning, and intelligent traffic management. Traditional video surveillance systems cannot satisfy real-time monitoring, so warnings have to be given manually, and the level of human surveillance cannot adapt to such a massive volume of data. Processing IoT video images is an application of the multiple attribute recognition research field. In 2016, the video surveillance market had grown to 96.2 billion while merging more deeply into the IoT. Monitoring equipment is the essential eye of IoT perception. Huge numbers of monitors form the Articulated Naturality Web and generate hundreds of millions of video records, which are redundant and costly to store. With the explosive growth of data, exact matching becomes much more difficult because of the high dimensionality and wide storage spaces involved.
Therefore, approximate nearest neighbor (ANN) search has attracted much attention over the past decades. Owing to their good balance between retrieval performance and computational cost, hash learning techniques are widely used to solve ANN problems. The core of hash learning is to map the original high-dimensional data into a binary hamming space. Its applications include, but are not limited to, image retrieval, document search, and data mining. Different from standard metric learning, hash learning is a discrete optimization problem, which is NP-hard. 1 The first efficient ANN search method in high-dimensional space was introduced by Indyk and Motwani. 2
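To make the mapping from real vectors to hamming space concrete, the following is a minimal sketch of random hyperplane hashing, the classic LSH family for angular similarity; this is our own toy illustration, not the method proposed in this article, and all function names are ours.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.sum(a != b))

def random_hyperplane_hash(X, n_bits, seed=0):
    """Map real vectors to binary codes by signing random projections.

    Each bit is the sign of a dot product with a random hyperplane, so
    nearby points tend to receive similar codes.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))  # random projection matrix
    return (X @ W >= 0).astype(np.uint8)

# Nearby points should land closer in hamming space than distant ones.
rng = np.random.default_rng(1)
x = rng.standard_normal(64)
near = x + 0.01 * rng.standard_normal(64)  # small perturbation of x
far = -x                                   # opposite direction: every sign flips
codes = random_hyperplane_hash(np.vstack([x, near, far]), n_bits=32)
print(hamming_distance(codes[0], codes[1]))  # small
print(hamming_distance(codes[0], codes[2]))  # 32: all bits differ
```

The data-independent choice of hyperplanes is exactly the weakness discussed below: the hash functions ignore the training data entirely.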
The target of hash encoding is to analyze and extract information from massive unstructured data so that the goal can be detected automatically. For instance, in real IoT scenarios, a video file consists of image frames, which are snapshots of real-world scenes, and every image contains abundant information. Therefore, to encode these frame images, a hash method should take multiple semantic properties into account. In real networks, it is intuitive that one vertex may demonstrate various latent aspects when interacting with different vertices. In other words, similarity between entities is usually non-transitive, since they could be similar due to different invisible reasons.
As shown in Figure 1, a straight line denotes that two frames have similar elements and a dotted line denotes a dissimilarity relationship. Picture Pa contains two elements, car and human; clearly, the element shared by Pa and Pb is car, while that shared by Pa and Pc is human. Even though Pb and Pc have the common neighbor Pa, they are dissimilar to each other. Nevertheless, previous hash metric research neglects these latent multiple semantic properties and projects them into a single hamming space. As a result of projecting entity-level similarity into a single binary hamming space, previous metric methods find it difficult to preserve latent attribute similarities.3,4 To solve this problem, we measure similarity at the attribute level.
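The non-transitivity in Figure 1 can be made concrete with attribute sets; the toy encoding below is ours, using the car/human elements described above.

```python
# Toy illustration of attribute-level, non-transitive similarity (Figure 1).
frames = {
    "Pa": {"car", "human"},
    "Pb": {"car"},
    "Pc": {"human"},
}

def similar(u, v):
    """Two frames are similar if they share at least one attribute."""
    return len(frames[u] & frames[v]) > 0

print(similar("Pa", "Pb"))  # True: shared attribute "car"
print(similar("Pa", "Pc"))  # True: shared attribute "human"
print(similar("Pb", "Pc"))  # False: similarity is not transitive
```

A single hamming space cannot place Pb and Pc both near Pa and far from each other, which is why attribute-level codes are needed.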

Simulation results for the network.
In previous hash methods, the learning target is to compact semantic features and the numeric neighbor distribution in full measure. This indicates that there exists a low-dimensional subspace in which all hash functions segregate the data points well. 5 Therefore, the goal of hash learning is to learn the best subspace, in which the most identifying information is contained, and then compact it into binary codes.
To simultaneously boost the discriminative power of all hash functions, researchers simplify the procedure into dimensionality reduction and quantization, with the downsized subspace connecting the original and hamming spaces. The downsized information subspace generated by dimensionality reduction is widely used in many hashing methods, such as principal component analysis (PCA)–based hashing (PCAH) 6 and iterative quantization (ITQ), 7 whose latent hypothesis is that dimensionality reduction and hash learning share the same goal. PCAH generates the projection function with the PCA method, through which the first K principal components of the training data are extracted. However, PCA treats every principal component equally and neglects the differences between them. To solve this, ITQ learns an optimal rotation to balance the variance of each dimension and reduce the quantization error. Both methods aim to preserve numeric features in a single hamming space and assume that the semantic neighbor distribution can be represented by the numeric neighbor relationship in hamming space. Nevertheless, in previous works, this latent consistency hypothesis has not been guaranteed.
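The PCAH idea described above can be sketched in a few lines: center the data, project onto the top principal directions, and take signs. This is a minimal sketch of the general scheme, not the cited implementation; balancing the variance across bits (as ITQ does by learning a rotation) is omitted.

```python
import numpy as np

def pcah_codes(X, n_bits):
    """PCA hashing sketch: project onto top principal components, then sign.

    Each principal direction is treated equally here, which is exactly
    the weakness ITQ addresses with a learned rotation.
    """
    Xc = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_bits].T                        # top-n_bits PCA projection
    return (Z >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
codes = pcah_codes(X, n_bits=8)
print(codes.shape)  # (100, 8)
```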
To solve the former two problems, we propose a novel semi-supervised hash learning method with consistency-based dimensionality reduction (SemiNTH for short). SemiNTH compresses high-dimensional data into lower dimensional space by unifying numeric local distribution structure and semantic neighborhood relationships, and then, embeds the attribute-level similar relationship into multiple separate hash tables. The method aims to preserve the values of semantic attribute similarity into the numeric distances of hash code. In similar retrieval scenario, a query entity will be mapped into low-dimensional subspace and then be encoded into a hash vector by several independent hash tables. The result set contains entities for which the hamming distances are short and with which at least one common attribute exists.
The rest of this article is organized as follows. In “Related work” section, we briefly review the related works on dimensionality reduction and hash learning. The next section explains the semi-supervised hash learning method. “Experiments” section presents experimental results that validate the effectiveness of our algorithm. Finally, the conclusion and future work are given in “Conclusions and future work” section.
Related work
With the rapid growth of large-scale networks, metric learning, that is, hash learning, has been proposed as a critical technique for big data analysis tasks.
The Locality-Sensitive Hashing (LSH) 2 method, first proposed to solve the (r, c)-nearest neighbor problem in 1998, is widely used in large-scale high-dimensional retrieval fields, for example, information search and computer vision. Although the effectiveness of the nearest neighbor query results of LSH is guaranteed by the Johnson–Lindenstrauss lemma, 8 its learning objective and the criterion for hash function selection are independent of the training data. To solve these problems, many hash learning methods have been proposed.
With respect to the learning model, methods can be classified as unsupervised, supervised, and semi-supervised. Unsupervised methods, with none but numeric features available, integrate the data distribution or structural properties into hash codes, such as Spectral Hashing, which is based on the data distribution, 9 and Graph Hashing, which is based on the potential epidemic structure of the data. 10 Shen et al. 11 propose an asymmetric binary code learning framework based on inner product fitting, which generates binary codes that reveal the inner products between the original data vectors. Supervised hashing methods, provided with relationship tags, learn hash codes through kernel learning, metric learning, or deep learning.12–14 Li et al. 15 propose a deep pairwise-supervised hashing method that simultaneously performs feature learning and hash-code learning for applications with pairwise labels. Liu et al. 16 propose a deep supervised hashing method, which employs a convolutional neural network architecture that takes pairs of images as training inputs and learns compact similarity-preserving binary codes. Shi et al. 17 propose a simple kernel-based supervised discrete hashing method via an asymmetric relaxation strategy, which reduces the accumulated quantization error between the hashing and linear functions while preserving the hashing function and the relaxed linear function simultaneously. To reduce the high time consumption of projection operators, Xu et al. 18 propose a supervised hashing method implementing a sparsity regularizer, which reduces the number of parameters of the projection matrix and helps avoid overfitting. Semi-supervised methods learn hash functions from the numeric distribution and semantic tags on a subset simultaneously, such as the semi-supervised hashing system (SSH).19,20 SSH minimizes the empirical error on labeled data while maximizing the distinguishing ability of the code bits with information theory.
The best numerical nearest neighbors are not necessarily the most semantically consistent ones, 7 but it is important to apply dimensionality reduction to capture the numeric distribution structure. In most semi-supervised methods, however, the inconsistency between numeric neighbor distribution and semantic similarity, which comes from labeled data, is ignored.
With respect to subspace dimensionality reduction, many applications have proved the effectiveness of an inner subspace, such as classification 1 and feature learning. 21 Since hash learning can be seen as a special kind of dimensionality reduction problem, the idea of an inner subspace has been adopted to improve hashing performance.7,19,22 However, these classic approaches have been demonstrated to be incomplete. 23 Because LSH selects hash functions randomly, the precision of its subspace cannot be guaranteed. PCAH focuses on minimizing the data reconstruction error and fails to optimize how efficiently the hamming space is partitioned. To achieve relatively good results, SH requires too many assumptions (i.e. orthogonality constraints on the hash functions and a uniform data distribution) that are unsuitable for real applications. Different from the above methods, ITQ places special emphasis on finding the subspace that minimizes the quantization loss in hash learning, and finally embeds the numeric local nearest neighbor distribution structure into a hamming space. Even though ITQ obtains the best performance, the multiple label constraint and preference limit its scalability. Therefore, these dimensionality-reduced subspaces fall short in narrowing the gap between the original semantic space and the final metric space.
In real social networks, different types of relationships produce various unrelated interactions. This phenomenon and the importance of non-triangular equivalence relations have been deeply studied in theory. 24 Traditional metric learning, however, treats different relations the same and projects multi-attribute data into a single hamming space without considering the non-transitivity of attribute similarity3,20 or the latent reasons why semantic connections are built.24–26 Considering attributes separately, Changpinyo et al. 3 propose similarity components analysis (SCA), a probabilistic graphical model that discovers similar attributes called "latent components." SCA defines separate component similarities and then generates a composite similarity with an OR gate. Experiments indicate that SCA achieves excellent accuracy in both multi-way classification and link prediction tasks. Multi-Component Hashing (MuCH), a similar hashing method with relatively low complexity, employs multiple hash tables to describe latent attribute similarities. Even though it obtains better behavior in both semantic description and search efficiency than unsupervised methods, the consistency of numeric and semantic similarities is not guaranteed.
Semi-supervised hash learning method with consistency-based dimensionality reduction
As shown in Figure 2, SemiNTH consists of two learning stages, that is, a consistency-based dimensionality reduction stage and a hash learning stage. In particular, the former stage learns a similarity matrix R to unify the unsupervised numeric neighbor distribution and the supervised semantic similarity relationships into a low-dimensional subspace. The unsupervised numeric features are extracted from the high-dimensional original space, and the supervised semantic features are obtained from hand labeling. After this stage, the composite features that can be used to identify different entities are extracted, and in the downsized subspace, attribute-level similar entities are mapped close to each other. Subsequently, the latter stage learns hash functions to project the composite attribute similarities into multiple hamming spaces. In a nutshell, the whole learning process corresponds to a space compression procedure followed by a transformation procedure. At the compression stage, with labeled supervision entities, we aim to confer the semantic relationships of different entities on their numerical distributions. Therefore, the numeric features that express the semantic labels will be captured from the original space. To express attribute-level similarities, the second stage transforms the downsized subspace into multiple binary spaces. An entity is encoded by several independent binary hash tables, each corresponding to a binary space. Notably, two entities nearby in one binary space may be far from each other in another space, because two entities may be similar in some of their attributes but dissimilar in the others. Thus, the generated multiple binary spaces describe the latent similarities between entities.

The framework of SemiNTH.
The learning process comprises two components: (1) a retentive neighborhood relationship matrix, which ensures that the hash codes are consistent with the hidden attribute semantics, and (2) a similarity relationship constraint component, which aims at preserving the attribute semantic similarity in the distances between hash codes. In the next section, we introduce the subspace learning process of consistency-based dimensionality reduction and the latent similarity learning process, respectively.
Notation
Given a set of N training entities, denoted as
Consistency-based dimensionality reduction with subspace learning
A high-quality dimensionality reduction method should preserve the most original semantic relationship information into binary hash codes, which is designed to correspond with the target of hash learning task.
Definition 1: Numeric feature
The numerical characteristics of the component elements of an entity. These are intuitive representations that can be extracted from the original data, for example, the size of a picture or term frequency–inverse document frequency.
The semantic similarity relationships of entities are expressed by their labels, which record whether they are similar from one or more semantic standpoints, as shown in Figure 2. Thus, for two entities, there is no guarantee of consistency between their semantic similarity relationship and the distance between their numeric features.
Definition 2: Consistency-based dimensionality reduction
The process that projects the high-dimensional numeric features of the entities into a downsized subspace, in which the distances between the low-dimensional vectors correspond to their semantic similarity relationships.
Thus, the goal of consistency-based dimensionality reduction is to learn a mapping matrix
Dimensionality reduction function with uniform feature subspace
With a reduction matrix
where the value range of standard sigmoid function is expanded to
where
Final objective
The target of the dimensionality reduction process is to preserve the most useful original semantic information in numerical distances using the fewest dimensions. The loss function is composed of two components: a consistency learning item and a regularization item. To minimize the loss due to dimensionality reduction, one natural way is to optimize a learning formula that penalizes pairs of items with the same semantics that are mapped far apart in Euclidean distance. The penalty grows monotonically with the Euclidean distance between a pair of similar items. The penalty item can therefore be described by a logistic function, chosen especially for its nice probabilistic interpretation. Thus, the minimum value of the consistency loss function (the maximum likelihood of the logistic loss) can be represented as
where
where
where
where
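Since the equations themselves are not reproduced above, the following is our own minimal sketch of a pairwise logistic consistency loss of the kind the text describes; the function and variable names are assumptions, and this is not the paper's exact formula.

```python
import numpy as np

def pairwise_logistic_loss(Y, pairs, labels):
    """Logistic loss over labeled pairs in the reduced space Y.

    For a similar pair (s = +1) the loss grows with the squared Euclidean
    distance; for a dissimilar pair (s = -1) it shrinks. A sketch of the
    consistency term, not the paper's exact objective.
    """
    total = 0.0
    for (i, j), s in zip(pairs, labels):
        d2 = np.sum((Y[i] - Y[j]) ** 2)
        total += np.log1p(np.exp(s * d2))  # log(1 + exp(s * d^2))
    return total / len(pairs)

Y = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]])
close_sim = pairwise_logistic_loss(Y, [(0, 1)], [+1])  # similar and close: small loss
far_sim = pairwise_logistic_loss(Y, [(0, 2)], [+1])    # similar but far: large loss
print(close_sim < far_sim)  # True
```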
Attribute-level hashing with latent similarity learning
Each dimension of the vector obtained in the dimensionality reduction process corresponds to a linear partition of the original space. Each partition is created according to a subset of the semantic features, and the corresponding generated dimension expresses whether two entities are similar from that particular semantic perspective. One or more partitions are combined to describe whether the entities are similar, and these partitions form a semantic attribute. The L partitions obtained from the optimization process cover all the semantic attributes, with as many differences as possible between the partitions. However, the distinguishing features of the different similar attributes are mixed together in the reduced vector.
Multi-table similarity
With the numerical and semantic similarities unified in the "Consistency-based dimensionality reduction with subspace learning" section, we propose a method to combine these two kinds of similarity information in hash encoding. In particular, multiple independent hash tables are adopted to express the latent similar attributes. The goal is to project one of the attribute similarities, which is described by a subset of semantic features, into K hash code bits in one of those hash tables. For the
where
Final objective
In the downsized subspace
Different from the former non-transitive supervised hash method MuCH, 20 our method, SemiNTH, provides unsupervised consistent numeric-semantic information that complements and strengthens the description ability of the final hash codes. The supervised semantic relationships and unsupervised neighbor distribution are denoted as matrix
where
The objective of this process is to learn a matrix W with which the hash codes satisfy the properties that
where
where
Optimization
Directly minimizing the objective loss function in equation (10) is intractable because the hash function in equation (7) is not continuous. Therefore, equation (7) is relaxed in the same way as equation (1). With the relaxation, Block Coordinate Descent (BCD) is applied to solve these two optimization problems.
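The relaxation-plus-BCD idea can be sketched as follows: replace the discontinuous sign function with a smooth surrogate (tanh here) so gradients exist, then update one coordinate block at a time. This is our own illustrative sketch under those assumptions, not the paper's exact update rule.

```python
import numpy as np

def relaxed_hash(Z, W):
    """Smooth surrogate for sign(Z @ W): tanh maps into (-1, 1)."""
    return np.tanh(Z @ W)

def bcd_step(Z, W, target, lr=0.1):
    """One block-coordinate pass: update each column of W in turn.

    Minimizes ||tanh(Z W) - target||^2 column by column with a gradient
    step, recomputing the relaxed codes before each block update.
    """
    for k in range(W.shape[1]):
        H = relaxed_hash(Z, W)
        err = H[:, k] - target[:, k]
        grad = Z.T @ (err * (1 - H[:, k] ** 2))  # chain rule through tanh
        W[:, k] -= lr * grad / len(Z)
    return W

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 4))
target = np.sign(rng.standard_normal((50, 2)))  # desired binary codes
W = rng.standard_normal((4, 2))
before = np.mean((relaxed_hash(Z, W) - target) ** 2)
for _ in range(100):
    W = bcd_step(Z, W, target)
after = np.mean((relaxed_hash(Z, W) - target) ** 2)
print(after < before)  # the relaxed objective decreases
```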
As Algorithm 1 shows, first of all, the unsupervised numeric feature matrix X, supervised semantic feature matrix R, and parameters of the final hash spaces (i.e.
The gradient of consistency-based dimensionality reduction objective function LR is
For hash learning, the gradient of objective function LS is similar to Ou et al. 20
Experiments
In our experiments, we use two real-world labeled public datasets to evaluate the proposed SemiNTH method, and the semantic similarity comes from the intersection of their labels.
Dataset and settings
NUS-WIDE is a web image dataset that contains 269,648 images crawled from Flickr and associated with 81 concepts. Each image belongs to at least one concept and is regarded as similar to images of the same concepts. The features are formed by concatenating five kinds of low-level features, including a 500-D bag of words of scale-invariant feature transform (SIFT) descriptors. In this article, we consider the top 10 concepts and 1134-D features in total. We randomly select 10,000 images as the training set and 2000 for query testing. In addition, one-tenth of the training set is randomly selected as supervision samples, with their non-transitive paired similarity visible. DBLP is a digital bibliography of the major computer science community. Similar to MuCH, 20 we select authors who have published at least three papers in 31 conferences, which belong to 11 specific fields. From 5934 authors and their 30,646 papers, 3900 authors are randomly selected as training samples, leaving 1500 for testing. Two authors, from 390 supervision samples, are labeled similar if they have ever published in the same field; otherwise they are dissimilar. The document for every author is formed by concatenating the titles and abstracts of their papers and is then represented by a 150-D LDA (Latent Dirichlet Allocation) feature.
The sample distributions differ between the two datasets. In the NUS-WIDE training set, over 93.92% of samples belong to a single concept and over 5.8% belong to two concepts. By comparison, DBLP is more dispersed in its concept number distribution: 34.5% of samples belong to two concepts and 17% belong to more. The detailed statistics are listed in Table 1.
Statistics of Datasets.
In our semi-supervised hash learning method, the relationship matrix R is generated from the categories to which the training samples belong. There are thus two sets of samples, one tagged and the other untagged. An element of R is set to 1 if the corresponding pair of tagged samples share at least one concept and to −1 otherwise; when one sample is tagged and the other is not, the element is set to 0. The scale parameter
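The construction of R described above can be sketched directly from concept sets; this toy example (names and labels ours) follows the +1 / −1 / 0 rule.

```python
import numpy as np

def build_relation_matrix(concepts, tagged):
    """Pairwise relationship matrix R from concept labels.

    R[i, j] = +1 if both samples are tagged and share a concept,
              -1 if both are tagged and share none,
               0 if either sample is untagged (unsupervised pair).
    """
    n = len(concepts)
    R = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i in tagged and j in tagged:
                R[i, j] = 1 if concepts[i] & concepts[j] else -1
    return R

concepts = [{"car"}, {"car", "human"}, {"human"}, {"sky"}]
tagged = {0, 1, 2}  # sample 3 is unlabeled
R = build_relation_matrix(concepts, tagged)
print(R[0, 1])  # 1: share "car"
print(R[0, 2])  # -1: both tagged, no common concept
print(R[0, 3])  # 0: sample 3 is untagged
```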
Baseline and evaluation metrics
We aim to optimize the hash codes for similarity maintenance ability and retrieval efficiency. Three widely used retrieval strategies are adopted in our experiments: (1) retrieving the top Nq list, (2) hamming radius equal to 2 (radius = 2), and (3) hamming radius to search, 20 that is,
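The first two retrieval strategies can be sketched directly over binary codes; the tiny 8-bit example below is ours, for illustration only.

```python
import numpy as np

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.sum(a != b))

def top_nq(query, codes, nq):
    """Strategy (1): indices of the nq codes nearest to the query."""
    d = [hamming(query, c) for c in codes]
    return list(np.argsort(d, kind="stable")[:nq])

def within_radius(query, codes, radius=2):
    """Strategy (2): indices of all codes within the given hamming radius."""
    return [i for i, c in enumerate(codes) if hamming(query, c) <= radius]

codes = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],  # distance 0 from the query
    [0, 0, 0, 0, 0, 0, 1, 1],  # distance 2
    [1, 1, 1, 1, 0, 0, 0, 0],  # distance 4
])
query = np.zeros(8, dtype=int)
print(top_nq(query, codes, nq=2))      # [0, 1]
print(within_radius(query, codes, 2))  # [0, 1]
```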
We compare our hash method (SemiNTH) with MuCH 20 and three other state-of-the-art hashing methods, that is, ITQ, 7 PCAH, 6 and LSH. 2 Finally, all results are averaged over 10 runs on a workstation with a 2.6 GHz Intel Xeon CPU and 62 GB RAM.
Results and analysis
Four sets of experiments are conducted on both datasets to evaluate the performance of SemiNTH with class labels as the ground truth. To evaluate the retrieval accuracy of multiple tables, we vary the number of hash bits and hash tables in Figure 3. The subfigures of Figure 3 depict the query performance from two points of view: the first retrieval strategy with Nq = 200 and the mean average precision, respectively.

Retrieval performance of SemiNTH on NUS-WIDE. Different hash bits of each hash table {8, 16, 24, 32} and different hash tables {1, 2, 3, 4} are tested: (a) precision of top 200 samples and (b) mean average precision at radius = 2.
In Figure 3(a), the horizontal axis represents the number of hash tables, and the vertical axis the precision of the set of top 200 samples. The four lines in each subfigure stand for the results obtained by SemiNTH with different numbers of bits in a single hash table. As shown in Figure 3(a), precision improves as the number of hash tables increases, and the 32-bit hash table mostly maintains the highest precision. Similarly, Figure 3(b), in which the horizontal axis stands for the number of hash bits and the vertical axis for the mean average precision at radius = 2, depicts the results for different numbers of hash tables in four lines and shows that precision increases when more hash tables are adopted. This indicates the existence of attribute-level similarity and verifies the performance of the multi-table hash method. Because the sample distribution in the DBLP training set is more dispersed (as shown in Table 1), the benefit of multiple tables, shown in Figure 4, is more obvious. This illustrates that multiple tables are necessary to preserve the original similarity relationships in hamming space.

Retrieval performance of SemiNTH on DBLP. Different hash bits of each hash table {8, 16, 24, 32} and different hash tables {1, 2, 3, 4} are tested: (a) precision of top 200 and (b) mean average precision at radius = 2.
In Figure 5, the horizontal axis indicates the number of bits in a single hash table, while the vertical axis indicates the precision and recall of retrieved samples when the radius equals 2, respectively. The six lines, with different legends, denote the results of six hash learning methods: ITQ, 7 LSH, 2 PCAH, 6 MuCH 20 (whose table number is fixed to 2 for its experimentally best performance), and our consistency-based semi-supervised method SemiNTH (SemiNTH2 and SemiNTH4). Here, the number 2 in SemiNTH2 represents the number of hash tables adopted in SemiNTH, where the bit number of every hash table equals that of the comparative methods. As displayed in Figure 5(a), PCAH outperforms all the other methods, and ITQ and SemiNTH have a consistent advantage over LSH and MuCH. This might be because LSH and MuCH do not rely on PCA, which helps to preserve semantic consistency at the smallest code sizes. 7 Compared to the experiments on NUS-WIDE, the significantly better performance of MuCH on DBLP is illustrated in Figure 5(b). Note that in Figure 5(b), the strongly upward trajectory of LSH with increasing bit size may be due to its theoretical convergence guarantee. The major precision gap of MuCH between NUS-WIDE and DBLP verifies our assumption that, at the entity level, numeric features may not be consistent with semantic features. The results prove that multiple tables are necessary to preserve the original similarity relationships in hamming space.

Precision at radius = 2 on two datasets: (a) precision on NUS-WIDE and (b) precision on DBLP.
Figure 6(a) and (b) show the retrieval performances on DBLP dataset of recall at radius = 2 and the values of MAP when testing by the first retrieval strategy with Nq = 200, respectively. It can be seen that the retrieval performance of SemiNTH is always superior to others and the performance of MuCH is acceptable. It is intuitive that multi-table methods obtain obviously better performance and state the effectiveness of numeric and semantic consistency method (SemiNTH).

Two different retrieval performances on DBLP: (a) recall at radius = 2; (b) mean average precision at Nq = 200.
Figure 5(a) shows that the performance of these methods is stable across different hash bits. Therefore, we take the retrieval performance of the top 500 samples with a bit number of 16 as an example to illustrate the retrieval accuracy and search efficiency of SemiNTH, as shown in Figure 7. Every line represents a method, plotted with the number of returned samples on the horizontal axis and the performance metric on the vertical axis. Figure 7(a) and (b) display that SemiNTH4 maintains the highest and most stable precision and recall. What is more, as shown in Figure 7(c), although SemiNTH4 requires three times more storage, its search efficiency is still comparable with that of the best competitor, ITQ. Therefore, SemiNTH proves more suitable for datasets associated with multiple correlated semantic labels. To further demonstrate the superiority of SemiNTH, we conduct the same experiments on DBLP, whose distribution of label numbers is more uniform.

Retrieval performance of top 500 on NUS-WIDE with 16 bits hash table: (a) precision of top 500 at bit = 16; (b) recall of top 500 at bit = 16; and (c) radius of top 500 at bit = 16.

Retrieval performance of top 500 on DBLP with 16 bits hash table: (a) precision of top 500 at bit = 16; (b) recall of top 500 at bit = 16; and (c) radius of top 500 at bit = 16.
Figures 5(b) and 6(a) reveal the outstanding performance of SemiNTH, especially versus MuCH, which certifies that utilizing both the numeric local distribution structure and the semantic similarity relationships can faithfully overcome the difficulty of nearest neighbor search. The hash retrieval efficiency of SemiNTH, which seems to be limited by storage space, can be guaranteed since its search radius decreases exponentially (see Figure 8(c)).
Finally, SemiNTH confirms our hypothesis that numeric and semantic features may not be consistent with each other. Complemented with semantically consistent numeric features, SemiNTH preserves the most discriminative information in multiple hamming spaces and achieves the best retrieval performance. This again verifies the existence of attribute-level similarity.
Conclusion and future work
In this article, we propose a numeric and semantic consistency hash learning method, SemiNTH, to unify numerical features and supervised semantics into a low-dimensional space before hash coding. Specifically, we improve a multiple-table hash method (MuCH) with a local distribution structure and conduct experiments on two public datasets. Experimental results demonstrate that SemiNTH is effective in preserving useful information for hash encoding with the consistent low-dimensional subspace. Besides, SemiNTH can obtain high-quality retrieval performance in the multi-table context, in which the similarities can be at the attribute level.
In practical applications, similarity search over heterogeneous sources is a hard problem because of its multiple similar components. Therefore, in future work, we would like to extend our method to measure multi-source similarity.
Acknowledgements
We are grateful to Mingdong Ou for providing the Matlab code of the MuCH method, and to Yong Yuan for the support of the open source project HABIR. We thank Pengfei Liu for his guidance on initial drafts of the paper.
Handling Editor: Fei Yu
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by National Key Research and Development Program of China (grant No. 2016YFB0800802 and grant No. 2017YFB0801804), Frontier Science and Technology Innovation of China (grant No. 2016QY05X1000), and Key Research and Development Program of Shandong Province (grant No. 2017CXGC0706 and No. 2016ZDJS01A04).
