A study on query terms proximity embedding for information retrieval

Abstract

Information retrieval is applied widely to models and algorithms in wireless networks for cyber-physical systems. Query terms proximity has proved to be very useful information for improving the performance of information retrieval systems. However, query terms proximity cannot retrieve documents on its own; it must be incorporated into original information retrieval models. This article proposes the concept of query term proximity embedding, a new method for incorporating query term proximity into original information retrieval models. It also proposes the term-field-convolutions frequency framework, an implementation of query term proximity embedding. Experimental results show that this framework improves retrieval performance compared with traditional proximity retrieval models.

Keywords

Cyber-physical system, natural language processing, computational linguistics, information retrieval, query terms proximity, convolutions

Introduction

More data will be stored in cyber-physical systems than in the Internet of computers. We must extract the information we need from these data, so information retrieval (IR) technologies are more important than ever.¹ The main task of IR is to find the documents that users need on the basis of their information needs. Therefore, the representation and organization of information items should make it easy to access the information in which users are interested.² Generally, users translate their information needs into queries that a standard IR system can process. In traditional IR models, including the vector space model,³ the probabilistic model,^4–6 and the language model,^7–10 the logical views of the user’s information needs are all based on query terms, and most of these IR models can be applied to wireless networks for cyber-physical systems. These traditional IR models use the occurrences of query terms in the documents to determine their weights for the user’s queries, and they all adopt the term-independence assumption. That is, they ignore the relations between query terms, including proximity, semantic relations, collocations, and so forth.

Some researchers tackle this problem by incorporating Query Terms Proximity (QTP) into traditional IR models. These “incorporations” are usually linear combinations of QTP scores and the scores of traditional IR models.^11–13 In this article, we call these proximity retrieval models QTP-attaching models: their scores are a linear or nonlinear combination of the original IR model scores and the original QTP scores. QTP-attaching models are easy to understand, and working prototypes are easy to develop. They have been studied thoroughly. With this mode of incorporation (basically always a linear combination), proximity retrieval models show significant improvements over baseline models, which indicates that QTP is an intelligible, efficient, and useful way to relax the term-independence assumption in traditional IR models.

However, our recent research has revealed shortcomings of QTP-attaching models. First, there is no adequate theory that explains why the scores should be combined according to the formulas used in QTP-attaching models. Second, the score of the traditional IR model and the QTP score are separate. The score ranges of the various traditional IR models and QTP measures differ, so we must normalize the scores before we can combine them on the same scale. Finally, in QTP-attaching models, the QTP measures are isolated values, so the performance of these models is very sensitive to the choice of a particular QTP measure.

In this article, we propose QTP-embedding models to resolve these problems. First, we propose a term-field-convolutions frequency framework that interprets the relations between query terms in a document using related concepts from physics and mathematics. Then, within this framework, we propose a method to postprocess the correlation between two terms according to their semantic similarity. Finally, we integrate these concepts into QTP-embedding models, which embed QTP into traditional IR models.

The rest of the article is organized as follows. We first introduce the main work of related researchers in section “Related work”. In section “QTP-assisting IR”, we outline the two types of QTP-assisting models: QTP-attaching and QTP-embedding. Then, we propose term-field-convolutions frequency framework as an implementation of QTP-embedding model in section “Term-field-convolutions frequency framework”. After that, we test our model with systematic experiments in section “Experiments”. Finally, we give conclusions and future research directions in section “Conclusion and future work”.

Related work

In this section, we will introduce important QTP-attaching models and some proximity retrieval models similar to QTP-embedding models proposed by other researchers.

QTP is an easily understood way to avoid the term-independence assumption in traditional IR models, and it is very effective compared with some complex methods based on deep theories. In some research works,^14–16 QTP is not an independent method but a device for resolving specific problems. Metzler and Croft¹⁴ propose a formal framework to model term dependencies through Markov random fields. They describe three types of term dependency models (full independence, sequential dependence, and full dependence) and use their linear combination to rank documents. Beigbeder and Mercier¹⁵ developed a model based on a fuzzy proximity degree of term occurrences. Petkova and Croft¹⁶ propose an expert search model based on the dependencies between the named entities and the terms that appear in a document. They propose a series of proximity kernels to describe these dependencies, for example, the triangle kernel, the Gaussian kernel, and the step function.

However, in research works of Rasolofo and Savoy,¹⁷ Tao and Zhai,¹⁸ and Cummins and O’Riordan,¹⁹ some standard QTP-attaching models have been proposed. In these models, original IR models have been regarded as the core, and the measures of QTP have been appended as the attachments. These QTP-attaching models focus on the QTP itself and discuss various types of proximity measures in detail. Rasolofo and Savoy¹⁷ propose a combination of proximity measurements and Okapi probabilistic model.²⁰ On this basis, Tao and Zhai¹⁸ propose a complete proximity retrieval model with five proximity measures. Cummins and O’Riordan¹⁹ expanded Tao’s research and proposed a more integrated proximity retrieval model with 12 proximity measures.

Other researchers have tried to make use of QTP in other ways. Robertson et al.²¹ proposed a method for weighting term frequencies based on the importance of the document field in which they occur. Vechtomova and Karamuftuoglu²² were inspired by Robertson’s idea and replaced the original term frequency with pseudo-frequency (defined in section “Pseudo-frequency”). Vechtomova’s proximity model is in effect a preliminary QTP-embedding model, and parts of the discussion in this article are based on that research.

We should further note that the proximity model is not the only focus of Vechtomova’s paper, which also proposes Lexical Bond combined with proximity. However, lexical bond on its own yields no perceptible performance improvement in Vechtomova’s experiments (it even lowers performance), and the performance of the combined proximity and lexical bond model is very close to that of the proximity model alone. In addition, lexical bond is much more expensive to compute than proximity, a cost that its performance does not justify: since lexical bond is based on natural sentences, we cannot use the information in the inverted index and must access the original documents directly. For these two reasons, we do not discuss lexical bond further in this article.

QTP-assisting IR

QTP assisting

A query is a highly condensed, refined, and simple abstract of the user’s information needs. The terms of a query consisting of two or more terms are not entirely independent. The term-independence assumption in traditional IR models is just a way to simplify the problem, and QTP is an effective and easily understood way to avoid this assumption. As an assisting strategy, QTP cannot retrieve information in an IR system on its own; it must be incorporated into a traditional IR model.

Definition 1. IR model. An IR system can be represented by a quadruple²³

System = (D, Q, F, R (d_{i}, q))

(1)

where D is a set of information resources (in most cases text documents or multimedia documents), Q is a set of users’ information needs, F is a framework for processing information resources and information needs, and $R (d_{i}, q)$ is the weighting function between elements of D and elements of Q. For a particular IR system, F and $R (d_{i}, q)$ establish its overall behavioral logic, and we call this behavioral logic the IR model on which the IR system depends.

Equation (1) gives a universal definition of all IR models. We can give a general definition of IR models which adopt the term-independence assumption.

Definition 2. Original IR model. We call an IR model original IR model if its weighting function with term-independence assumption can be represented by

R_{ogn} (Q, D) = \sum_{t \in Q \cap D} f (t, D, g (t, Q))

(2)

where $R_{ogn}$ is the weighting function for Q and D, Q is the query, D is the document collection, and t is a term in Q. g is the query term weighting function and f is the weighting function for terms t and D, which depends on parameters t, D, and $g (t, Q)$ .

Most popular IR weighting functions, such as tfidf,²⁴ $BM 25$ ,²⁵ and KL-divergence,²⁶ are consistent with equation (2). These functions share a common characteristic: the document weight of any term is unrelated to any other term. Note that $g (t, Q)$ is only a query weighting function, which is the same for every document in D. This is what we call the term-independence assumption.
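To make the shape of equation (2) concrete, the following is a minimal Python sketch that uses one common smoothed form of BM25 as the per-term function f. The parameter defaults `k1=1.2` and `b=0.75`, the function names, and the choice to treat the query weight $g (t, Q)$ as 1 are assumptions made for illustration; they are not taken from this article.

```python
import math
from collections import Counter

def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Weight of one term in one document; no other query term is involved."""
    # Smoothed idf variant (assumption: one of several forms in common use).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm

def score_original(query_terms, doc_terms, df, n_docs, avg_doc_len):
    """R_ogn(Q, D): a plain sum of per-term weights over terms shared by Q and D."""
    tf = Counter(doc_terms)
    shared = set(query_terms) & set(tf)
    return sum(
        bm25_term_weight(tf[t], df[t], n_docs, len(doc_terms), avg_doc_len)
        for t in shared
    )
```

Because the score is a sum of independent per-term weights, two documents with the same bag of words receive the same score regardless of where the query terms occur, which is exactly the information that QTP is meant to restore.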

Building on the definition of the original IR model, we call a model a QTP-assisting model if it revises the term-independence assumption of an original IR model by means of QTP.

Definition 3. QTP-assisting model. We call an IR model QTP-assisting model if its weighting function can be represented by

R_{ast} (Q, D) = AST (Q, D, M_{ogn}, QTP)

(3)

where $R_{ast}$ is the weighting function for Q and D, and AST is an assisting strategy function with parameters Q, D, original IR model $M_{ogn}$ , and QTP score QTP.

This definition of the QTP-assisting model is relatively rough. In the next two subsections, we define two types of QTP-assisting models in detail.

QTP-attaching model

Original IR models do not consider the proximity between query terms, so we must add proximity information to their weighting functions. The natural idea is to take the value of a function of the original IR model score and the original QTP score as the score of the compound model. We call such models QTP-attaching models.

Definition 4. QTP-attaching model. We call an IR model QTP-attaching model if its weighting function can be represented by

R_{ath} (Q, D) = ATH (R_{ogn}, QTP)

(4)

where $R_{ath}$ is the weighting function for QTP-attaching model; ATH is an attaching strategy function with parameters original IR model score $R_{ogn}$ and QTP score QTP.

QTP-attaching models are a subset of QTP-assisting models, and the two definitions look similar. They differ in two ways. First, Definition 3 is an open definition: all the information of the IR system is in its parameter list, whereas Definition 4 has only two parameters. Second, both are proximity retrieval models built on an original IR model, but a QTP-assisting model takes the original IR model $M_{ogn}$ as a whole, whereas a QTP-attaching model considers only its score $R_{ogn}$ .

The most common form of attaching strategy function ATH is linear combination

R_{ath} (Q, D) = λ_{1} R_{ogn} + λ_{2} QTP \quad (λ_{1} + λ_{2} = 1)

(5)

where $λ_{1}$ and $λ_{2}$ are the weights of $R_{ogn}$ and QTP.

The QTP-attaching model avoids the term-independence assumption in a convenient way and can effectively improve the performance of original IR models. However, two problems remain in applying it. First, the scores of the original IR model and the QTP scores are isolated from each other, so tuning their weights $λ_{1}$ and $λ_{2}$ in equation (5) is very important. Strictly speaking, $R_{ogn}$ and QTP in equation (4) have different dimensions (like the physical quantities “length” and “time”), so adding them is not entirely sound. Second, since the QTP score is a single measure in equation (4), choosing a QTP measure is difficult. The performance of QTP measures varies widely.^18,19 Some improve retrieval performance, and some even reduce it.
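The linear attaching strategy of equation (5) can be sketched as follows. The min-max rescaling step is an assumption added here to put the two score lists on a common range before they are mixed (the article only states that normalization is needed, not which one), and the names `min_max` and `attach_linear` are hypothetical.

```python
def min_max(values):
    """Rescale a list of scores to [0, 1] (assumed normalization, not from the article)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def attach_linear(r_ogn, qtp, lam1=0.7):
    """Equation (5) applied to a candidate list: lam1 * R_ogn + lam2 * QTP, lam1 + lam2 = 1."""
    if not 0.0 <= lam1 <= 1.0:
        raise ValueError("lam1 must lie in [0, 1]")
    lam2 = 1.0 - lam1
    r_norm, q_norm = min_max(r_ogn), min_max(qtp)
    return [lam1 * r + lam2 * q for r, q in zip(r_norm, q_norm)]
```

With `lam1=1.0` the ranking reduces to that of the original IR model, and any other value has to be tuned per collection and per QTP measure, which is the sensitivity discussed above.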

QTP-embedding model

In this subsection, we propose the QTP-embedding model to address the problems of the QTP-attaching model mentioned in the previous subsection. As the name suggests, in a QTP-embedding model, QTP measures are “embedded” in the original IR model:

Definition 5. QTP-embedding model. We call an IR model QTP-embedding model if its weighting function can be represented by

R_{emd} (Q, D) = \sum_{t \in Q \cap D} f (E_{1} (t, QTP_{1}), E_{2} (D, QTP_{2}), E_{3} (g (t, Q), QTP_{3}))

(6)

where $R_{emd}$ is the weighting function for QTP-embedding model; $E_{1}$ , $E_{2}$ , and $E_{3}$ are embedding strategy functions; $QTP_{1}$ , $QTP_{2}$ , and $QTP_{3}$ are possibly different QTP measures corresponding to $E_{1}$ , $E_{2}$ , and $E_{3}$ .

We can see that the QTP-embedding model maintains the main form of the original IR model. Unlike the QTP-attaching model, the QTP-embedding model uses QTP measures to preprocess the parameters of the original model. For an actual QTP-embedding model, two problems need to be resolved: (1) Which parameters of the original IR model can be QTP-embedded? (2) How should they be QTP-embedded, that is, how should the embedding strategy functions be determined?

For the first problem, only some parameters are appropriate for QTP embedding, even though every parameter is formally QTP-embedded in equation (6). First, QTP measures are statistics that describe the relations between query terms within documents, not within queries, so QTP-embedding-friendly parameters should be related to documents. Therefore, parameters related to queries, such as the query length |q| or the term frequency within the query $c (t, q)$ , are not appropriate for QTP embedding. Furthermore, QTP-embedding-friendly parameters should be related to the content of documents, so parameters that concern documents but not their content, such as the number of documents N, are not appropriate either.

For the second problem, we propose three constraints to restrict the choices of embedding strategy functions.

Constraint 1. Let $M_{ogn}$ be an original IR model, $R_{ogn} (p_{i})$ be the weighting function of $M_{ogn}$ , and $p_{i}$ be any parameter of $R_{ogn}$ . The embedding strategy function $E_{i} (p_{i}, QTP_{i})$ must be a monotonically increasing function of $p_{i}$ . That is, when $QTP_{i}$ is a fixed value, if $p_{i_{1}} \geq p_{i_{2}}$ , then $E_{i} (p_{i_{1}}, QTP_{i}) \geq E_{i} (p_{i_{2}}, QTP_{i})$ .

Constraint 1 means that the embedding strategy function must not alter the parameter characteristics of the original IR model $M_{ogn}$ . $E_{i} (p_{i}, QTP_{i})$ takes the place of $p_{i}$ in $R_{ogn} (p_{i})$ , so $E_{i} (p_{i}, QTP_{i})$ should change in the same direction as $p_{i}$ .

The next two constraints share a prerequisite: the higher the QTP measure, the closer the query terms. If the original QTP measures do not satisfy this prerequisite (in fact, most original distance-based QTP measures behave in the opposite way), we can normalize them first.

Constraint 2. Let the weighting function $R_{ogn} (p_{i})$ of the original IR model $M_{ogn}$ be monotonically increasing on its domain, that is, if $p_{i_{1}} \geq p_{i_{2}}$ , then $R_{ogn} (p_{i_{1}}) \geq R_{ogn} (p_{i_{2}})$ ; then, the embedding strategy function $E_{i} (p_{i}, QTP_{i})$ must also be a monotonically increasing function of $QTP_{i}$ , that is, when $p_{i}$ is a fixed value, if $QTP_{i_{1}} \geq QTP_{i_{2}}$ , then $E_{i} (p_{i}, QTP_{i_{1}}) \geq E_{i} (p_{i}, QTP_{i_{2}})$ .

Constraint 3. Let the weighting function $R_{ogn} (p_{i})$ of the original IR model $M_{ogn}$ be monotonically decreasing on its domain, that is, if $p_{i_{1}} \geq p_{i_{2}}$ , then $R_{ogn} (p_{i_{1}}) \leq R_{ogn} (p_{i_{2}})$ ; then, the embedding strategy function $E_{i} (p_{i}, QTP_{i})$ must also be a monotonically decreasing function of $QTP_{i}$ , that is, when $p_{i}$ is a fixed value, if $QTP_{i_{1}} \geq QTP_{i_{2}}$ , then $E_{i} (p_{i}, QTP_{i_{1}}) \leq E_{i} (p_{i}, QTP_{i_{2}})$ .

Constraint 2 and Constraint 3 are symmetrical. These two constraints mean that the embedding strategy function should play its due role, namely, raising the weight ( $R_{ogn} (p_{i})$ ) of documents in which the proximity measure (QTP) is higher.

Constraint 1 is the basis: it tries to keep the performance of a QTP-embedding model from falling below that of the corresponding original IR model. Constraint 2 and Constraint 3 are the essence: they try to make the performance of a QTP-embedding model higher than that of the corresponding original IR model. Note that these three constraints are only necessary conditions for an actual embedding strategy function.
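As an illustration of how Constraints 1 and 2 restrict the choice of an embedding strategy function, the sketch below checks a candidate function on a finite grid. The two candidates `boost` and `damp` are hypothetical examples invented here, and a grid check can only expose violations; it does not prove that a function is monotonic everywhere.

```python
def is_nondecreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))

def satisfies_constraints_1_and_2(embed, p_grid, qtp_grid):
    """Check on ascending grids that embed(p, qtp) is non-decreasing in p
    (Constraint 1) and non-decreasing in qtp (Constraint 2)."""
    in_p = all(
        is_nondecreasing([embed(p, q) for p in p_grid]) for q in qtp_grid
    )
    in_qtp = all(
        is_nondecreasing([embed(p, q) for q in qtp_grid]) for p in p_grid
    )
    return in_p and in_qtp

def boost(p, qtp):
    """Hypothetical candidate: scale the parameter up when terms are closer."""
    return p * (1.0 + qtp)

def damp(p, qtp):
    """Hypothetical candidate that violates Constraint 2."""
    return p / (1.0 + qtp)
```

For a parameter on which the original weighting function increases, such as the term frequency, `boost` passes the check while `damp` fails it, because `damp` lowers the parameter as proximity grows.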

Term-field-convolutions frequency framework

In this section, we introduce pseudo-frequency proposed by Vechtomova and Karamuftuoglu²² first and then outline term-field-convolutions frequency framework as an actual implementation of QTP-embedding model.

Pseudo-frequency

As mentioned in section “QTP-embedding model”, only a few parameters are appropriate to QTP-embed. The term frequency within documents $c (t, d)$ is almost the most suitable one among them. Vechtomova and Karamuftuoglu²² proposed pseudo-frequency, which modified the term frequency calculation in Okapi BM25 using term proximity. Its definition is given below.

Definition 6. Pseudo-frequency. Instead of counting the actual frequency of term’s occurrence in the document, the pseudo-frequency is calculated according to next two equations²²

c (t_{i}) = \begin{cases} 1 + \frac{1}{span (t_{i}, q)^{p}} & \text{if } q \neq t_{i}; q \in Q \\ 1 & \text{otherwise} \end{cases}

(7)

pf_{t} = \sum_{i = 1}^{N} c (t_{i})

(8)

where $c (t_{i})$ is the contribution of the $i$th instance of the query term t to the pseudo-frequency; $span (t_{i}, q)$ is the distance between the $i$th instance of the query term t and the nearest occurrence of any other term q that belongs to the same query Q and is not the same term as t; p is a tuning constant; N is the number of occurrences of t.

An IR model based on pseudo-frequency is clearly a typical QTP-embedding model, and the embedding strategy function given by equations (7) and (8) meets the constraints in section “QTP-embedding model” (after QTP measure normalization). However, the pseudo-frequency function is somewhat arbitrary. Although it is generally accepted that “the shorter the distance, the higher the correlation between term pairs,” it is difficult to determine the particular distance-correlation function. We discuss this problem further in the next subsection.
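Equations (7) and (8) can be sketched in a few lines of Python. The sketch assumes that the span is the absolute difference between token offsets and that the document is represented as a mapping from each query term to its offsets; both the representation and the name `pseudo_frequency` are assumptions made for illustration.

```python
def pseudo_frequency(term, positions, p=0.75):
    """Pseudo-frequency of `term` in one document, equations (7) and (8).

    positions maps each query term to the token offsets where it occurs.
    """
    # Offsets of every other query term present in the document.
    others = [
        pos
        for other, offsets in positions.items()
        if other != term
        for pos in offsets
    ]
    total = 0.0
    for pos in positions.get(term, []):
        if others:
            # Distance to the nearest occurrence of a different query term.
            span = min(abs(pos - o) for o in others)
            total += 1.0 + 1.0 / span ** p
        else:
            total += 1.0  # no other query term: plain count
    return total
```

When no other query term occurs in the document, the function falls back to the ordinary term frequency, and each instance gains more the closer it lies to another query term.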

Term-field-convolutions frequency

Every term within document has its context, and context is the language circumstance of terms in practical application. Context is a very useful resource in the research of IR and computational linguistics. We can draw an analogy between the context of terms and the electric field around electric charge or the magnetic field around magnetic charge, that is, there is an affected area around each term of documents and we can call them “Term Field.” Each term in document has its own term field.

Furthermore, we can draw an analogy between term frequency within document and magnetic flux or electric flux. Magnetic flux or electric flux is the integral of magnetic field strength and electric field strength, so “Term’s Flux” $c (t, d)$ is the integral of Term Field Strength.

Generally, the term field strength within document can be expressed by

E (t_{i}, p) = S (θ), \quad t_{i} \in d

(9)

where $t_{i}$ is the $i$th instance of query term t occurring in document d, E is the term field strength of $t_{i}$ at position p, and S is the term field strength distribution function. $θ$ is the (possibly empty) parameter list of S.

In order to meet the definition of term flux, we require that the integral of term field for an instance must be 1, that is

Φ_{t_{i}} = \int_{p \in d} E (t_{i}, p) dp = 1

(10)

where $Φ_{t_{i}}$ is term’s flux of $t_{i}$ , which is a constant equal to 1.

So, the term frequency in document can be expressed by

c (t, d) = Φ_{t} = \sum_{i = 1}^{N} Φ_{t_{i}}

(11)

where $c (t, d)$ is term frequency in document d and $Φ_{t}$ is term’s flux of t.

In equation (11), $c (t, d)$ is a known value, so it limits our choice for term field strength distribution function S in equation (9), and at the same time, it helps us to determine some tuning constants in S.

Fields interact with each other, whether they are magnetic or electric. So, we consider that term fields also interact. For different instances of the same term, the term fields simply overlie each other. For instances of different terms, however, there should be additional effects between them in order to avoid the term-independence assumption.

Convolution is a very important concept in mathematics, physics, and in particular, communication science. It is defined as the integral of the product of the two functions after one is reversed and shifted. The convolution of function f and function g can be expressed by

f * g (t) = \int_{- \infty}^{+ \infty} f (τ) g (t - τ) d τ

(12)

In statistics, convolution is a generalization of the weighted moving average. For this article, it is suitable for weighing how close two term fields are. So, we choose the convolution of term fields to measure the interaction between them.
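Since token positions are discrete, the integral in equation (12) becomes a sum in practice. The sketch below is a minimal pure-Python discrete convolution of two finite sequences; the function name `convolve` is chosen here for illustration, and the weighted-moving-average reading follows from choosing weights that sum to 1.

```python
def convolve(f, g):
    """Discrete counterpart of equation (12): out[n] = sum over m of f[m] * g[n - m]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, f_i in enumerate(f):
        for j, g_j in enumerate(g):
            out[i + j] += f_i * g_j
    return out
```

For example, convolving a signal with the weights `[0.5, 0.5]` replaces each interior value by the average of two neighbouring values, which is a two-point moving average.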

In addition, the interaction of term fields depends not only on their distance but also on the semantic similarity of the terms. For a fixed distance, the higher the semantic similarity of two terms, the lower the interaction between their term fields. If the semantic similarity of two terms is low, their substitutability is also low, so their proximity is valuable for evaluating the correlation between the document and the query. Conversely, if the semantic similarity of two terms is high, their substitutability is also high, so their proximity is unnecessary even in highly relevant documents.

Based on the above discussions, we can define term fields interaction formally.

Definition 7. Term fields interaction. There are interactions between term fields of every instance to increase their term field strength. For a particular instance pair, the increasing value is determined by the following equation

ψ (I_{t_{a}}, I_{u_{b}}) = W (t, u) \int_{a}^{b} E_{t} (p) E_{u} (p) dp \quad (a < b)

(13)

where $ψ$ is the term fields interaction; $I_{t_{a}}$ is an instance of term t at position a; $I_{u_{b}}$ is an instance of term u at position b; $E_{t} (p)$ is the term field strength of term t at position p and $E_{u} (p)$ is the term field strength of term u at position p; $W (t, u)$ is a weight based on the semantic similarity of term t and term u.

Note that $W (t, u)$ should meet two constraints: (1) if $sim (t_{1}, u_{1}) > sim (t_{2}, u_{2})$ , then $W (t_{1}, u_{1}) < W (t_{2}, u_{2})$ , and vice versa; (2) if $sim (t, u) = 1$ , for example, t and u are the same terms or synonyms, then $W (t, u) = 0$ .

From Figure 1, we can see that the term field of “electric” $E_{t}$ at position a and the term field of “magnetic” $E_{u}$ at position b interact with each other. The term fields interaction $ψ (I_{t_{a}}, I_{u_{b}})$ can be calculated by equation (13).

Figure 1.

Term fields interaction.

Similar to pseudo-frequency, the term-field-convolutions frequency is calculated as follows

c_{f} (I_{t_{i}}) = Φ_{t_{i}} + \sum_{I_{j} \in d} ψ (I_{t_{i}}, I_{j}) = 1 + \sum_{I_{j} \in d} ψ (I_{t_{i}}, I_{j})

(14)

cf_{t} = \sum_{i = 1}^{N} c_{f} (I_{t_{i}})

(15)

where $cf_{t}$ is the term-field-convolutions frequency; $Φ_{t_{i}}$ is the term’s flux of $t_{i}$ (see equation (10)); N is the number of instances of term t in document d; $I_{t_{i}}$ is the $i$th instance of term t; $I_{j}$ is any instance of a query term in document d. $I_{j}$ may even be an instance of term t itself, in which case $ψ (I_{t_{i}}, I_{j}) = 0$ since $W (I_{t_{i}}, I_{j}) = 0$ .
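A discrete reading of equations (13) to (15) can be sketched as follows. Several choices here are assumptions made only for illustration: the integral is replaced by a sum over the token positions from a to b, the field of an instance is an unnormalized exponential of the token distance (`exp_field`), and the document is a mapping from query terms to token offsets. None of these names or choices are claimed to match the implementation used in the experiments.

```python
import math

def exp_field(distance, alpha=0.8):
    """Assumed (unnormalized) field strength at a token distance from an instance."""
    return math.exp(-alpha * distance)

def interaction(pos_t, pos_u, weight, field=exp_field):
    """Discrete stand-in for equation (13): the weight times the summed
    product of the two fields over the positions between the instances."""
    a, b = sorted((pos_t, pos_u))
    overlap = sum(field(p - a) * field(b - p) for p in range(a, b + 1))
    return weight * overlap

def tfc_frequency(term, positions, weight_fn, field=exp_field):
    """Equations (14) and (15) for one term in one document.

    positions maps each query term to its token offsets in the document.
    """
    total = 0.0
    for pos in positions.get(term, []):
        total += 1.0  # flux of a single instance, equation (10)
        for other, offsets in positions.items():
            if other == term:
                continue  # W(t, t) = 0, so same-term pairs add nothing
            w = weight_fn(term, other)
            total += sum(interaction(pos, q, w, field) for q in offsets)
    return total
```

Like pseudo-frequency, the result reduces to the plain term frequency when the document contains no other query term, and it can then replace $c (t, d)$ in the original weighting function.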

Experiments

Experimental setup

We choose the traditional Okapi BM25 model²⁵ as the baseline of original IR models, and choose Tao and Zhai’s¹⁸ model and Cummins and O’Riordan’s¹⁹ model as the baseline of QTP-attaching models. The proximity function of Tao’s model tao and the best proximity function of Cummins’ model $p 6$ are as follows

tao () = \log (α + \exp (- min\_dist (Q, D)))

(16)

p6 () = \frac{1}{2} (((3 \cdot \log (\frac{10}{min\_dist}) + \log (prod + \frac{10}{min\_dist}) + \frac{10}{min\_dist} + \frac{prod}{sum \cdot qt}) / qt) + \frac{prod}{avg\_dist \cdot min\_dist})

(17)

The tuning constant $α$ in equation (16) is set to 0.30 as suggested by Tao. In addition, we choose Vechtomova’s model (namely, pseudo-frequency in section “Pseudo-frequency”) as the baseline for other QTP-embedding models. The tuning constant p in equation (7) is set to 0.75, which achieves the best results in Vechtomova’s experiments.
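The proximity function of equation (16) is simple enough to sketch directly. The function below takes an already computed minimum distance as input; how `min_dist` is measured (for example, in token offsets) is left open here because this article does not spell it out, and the name `tao_proximity` is hypothetical.

```python
import math

def tao_proximity(min_dist, alpha=0.30):
    """Equation (16): log(alpha + exp(-min_dist)); larger when terms are closer."""
    return math.log(alpha + math.exp(-min_dist))
```

The value decreases as the minimum distance grows and approaches `log(alpha)` for distant terms, so the measure already satisfies the prerequisite that a higher QTP value means closer terms.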

For calculating $W (t, u)$ in equation (13), we choose GlossVector²⁷ as similarity function, which is based on WordNet.²⁸ The semantic weighted function is as follows

W (t, u) = γ \times (1 - \sin (\frac{π}{2} \times gloss (t, u)))

(18)

where $gloss (t, u)$ is the GlossVector of t and u; $γ$ is a tuning constant. In this experiment, let $γ = 1.45$ .
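Equation (18) can be checked against the two constraints on $W (t, u)$ with a short sketch. It assumes that the similarity value lies in [0, 1], which the article implies by treating a similarity of 1 as identity; the similarity itself is passed in as a number, so no WordNet call is made, and the name `semantic_weight` is hypothetical.

```python
import math

def semantic_weight(similarity, gamma=1.45):
    """Equation (18): gamma * (1 - sin(pi / 2 * similarity)), similarity in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity is assumed to lie in [0, 1]")
    return gamma * (1.0 - math.sin(math.pi / 2.0 * similarity))
```

The weight falls monotonically from `gamma` at similarity 0 to 0 at similarity 1, so identical terms and synonyms contribute no interaction.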

Term field strength function S in equation (9) should also be determined. We choose the Decaying Co-occurrence Model proposed by Gao et al.²⁹ as function S in the experiments

S (t_{i}, p) = kD (x, y) = k e^{- α * (Dis (x, y) - 1)}

(19)

where $D (x, y)$ is distance factor proposed by Gao, and it equals term field strength function S; k is a normalizing constant; $t_{i}$ and position p equal terms x and y, respectively; $Dis (x, y)$ is the distance between x and y; $α$ is the decay rate, which will be focused on in the next subsection.
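To see what the decay rate does, the sketch below evaluates the distance factor of equation (19) without the normalizing constant k for the four decay rates used in the experiments. It assumes that the distance between two distinct positions is at least 1; the function name `decay_factor` and the sample distances are chosen here for illustration.

```python
import math

def decay_factor(dis, alpha):
    """D(x, y) from equation (19) without the constant k; dis >= 1 is assumed."""
    if dis < 1:
        raise ValueError("distance between two distinct positions must be >= 1")
    return math.exp(-alpha * (dis - 1))

# Compare how quickly the field fades for the decay rates in Table 1.
for alpha in (0.30, 0.50, 0.80, 1.00):
    row = [round(decay_factor(d, alpha), 3) for d in (1, 2, 5, 10)]
    print(f"alpha={alpha:.2f}: {row}")
```

A higher decay rate concentrates the field around the instance, so only very close query terms interact noticeably, whereas a lower rate lets distant terms contribute as well.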

The test collection of this experiment is WT10g data set, a popular benchmark for the development and evaluation of IR systems.³⁰ It contains 1,697,027 documents and 9,147,236 distinct terms. Text REtrieval Conference (TREC) topics 451–500 correspond to this collection, so we make 50 queries from topics 451–500.

As shown in Table 1, we use eight models ( $BM 25$ , $BM 25 + tao$ , $BM 25 + p 6$ , $BM 25 + Mova$ , $Emd_0.30$ , $Emd_0.50$ , $Emd_0.80$ , and $Emd_1.00$ ) to retrieve documents for each query. Note that the $α$ in $Emd_0.30$ , $Emd_0.50$ , $Emd_0.80$ , and $Emd_1.00$ is the decay rate in equation (19). We reserve top 1000 results for each query and calculate the MAP, $P 10$ , $P 20$ , $P 100$ , and Bpref³¹ of these eight models, respectively. Table 2 shows the basic information of evaluation measures.

Table 1.

IR models in experiments.

Model	Descriptions
$BM 25$	Okapi BM25 model
$BM 25 + tao$	BM25 attached by Tao and Zhai¹⁸
$BM 25 + p 6$	BM25 attached by p6¹⁹
$BM 25 + Mova$	BM25 embedded by pseudo-frequency²²
$Emd_0.30$	BM25 embedded by term-field-convolutions frequency ( $α = 0.30$ )
$Emd_0.50$	BM25 embedded by term-field-convolutions frequency ( $α = 0.50$ )
$Emd_0.80$	BM25 embedded by term-field-convolutions frequency ( $α = 0.80$ )
$Emd_1.00$	BM25 embedded by term-field-convolutions frequency ( $α = 1.00$ )

IR: information retrieval.

Table 2.

Evaluation measures.

Measure	Descriptions
MAP	Mean average precision
$P 10$	Precision at 10; the precision after 10 documents
$P 20$	Precision at 20; the precision after 20 documents
$P 100$	Precision at 100; the precision after 100 documents
Bpref	Binary preference, top R judged non-relevant³¹

MAP, P10, P20, P100, and Bpref are widely used evaluation measures in IR. MAP is the mean, over all queries, of the average of the precision scores obtained after each relevant document is retrieved. Bpref is a preference-based IR measure that considers whether relevant documents are ranked above irrelevant ones. Most IR systems show 10 search results per page, so P10 is the precision on page 1 (which all users see), and P20 is the precision on pages 1 and 2 (most users click to the next page at least once); some IR systems also show 20 results per page. P100 is the precision on pages 1–10 (most users do not look beyond the first 10 pages).
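The precision-based measures can be sketched as follows. The sketch uses the usual convention of dividing average precision by the total number of relevant documents for the query; document identifiers and function names are hypothetical, and Bpref is omitted.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant (P10, P20, P100)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values taken at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For instance, a ranking with relevant documents at ranks 2 and 4 has precision 1/2 at each of those ranks, so its average precision is 0.5.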

Experimental results

The experimental results are shown in Table 3. Note that there is a strong correlation between the performance of the term-field-convolutions frequency framework and the value of the decay rate $α$ . Figure 2 illustrates the experimental results further. It shows the performance improvements of all the QTP-attaching models and QTP-embedding models over the baseline model $BM 25$ for every evaluation measure. For each measure in the figure, the seven bars are $BM 25 + tao$ , $BM 25 + p 6$ , $BM 25 + Mova$ , $Emd_0.30$ , $Emd_0.50$ , $Emd_0.80$ , and $Emd_1.00$ from left to right.

Table 3.

Experimental results.

Model	MAP	P10	P20	P100	Bpref
$BM 25$	0.1838	0.3389	0.3204	0.3634	0.1885
$BM 25 + tao$	0.1869	0.3349	0.3284	0.3732	0.1913
$BM 25 + p 6$	0.1832	0.3142	0.3082	0.3859	0.1847
$BM 25 + Mova$	0.1902	0.3340	0.3349	0.3943	0.1914
$Emd_0.30$	0.1843	0.3369	0.3305	0.3730	0.1859
$Emd_0.50$	0.1884	0.3429	0.3341	0.3799	0.1910
$Emd_0.80$	0.1915	0.3409	0.3399	0.3903	0.1926
$Emd_1.00$	0.1924	0.3362	0.3358	0.3993	0.1920

Maximum values of all models to each measure are in bold.

Figure 2.

Experimental results.
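The relative improvements over the baseline can be recomputed from Table 3 with a few lines of Python. The values below are copied from Table 3; the dictionary layout and the helper names `relative_gain` and `best_model` are illustrative choices.

```python
MEASURES = ("MAP", "P10", "P20", "P100", "Bpref")

# Values copied from Table 3, in the order given by MEASURES.
TABLE_3 = {
    "BM25": (0.1838, 0.3389, 0.3204, 0.3634, 0.1885),
    "BM25+tao": (0.1869, 0.3349, 0.3284, 0.3732, 0.1913),
    "BM25+p6": (0.1832, 0.3142, 0.3082, 0.3859, 0.1847),
    "BM25+Mova": (0.1902, 0.3340, 0.3349, 0.3943, 0.1914),
    "Emd_0.30": (0.1843, 0.3369, 0.3305, 0.3730, 0.1859),
    "Emd_0.50": (0.1884, 0.3429, 0.3341, 0.3799, 0.1910),
    "Emd_0.80": (0.1915, 0.3409, 0.3399, 0.3903, 0.1926),
    "Emd_1.00": (0.1924, 0.3362, 0.3358, 0.3993, 0.1920),
}

def relative_gain(model, measure):
    """Improvement of `model` over BM25 on `measure`, as a fraction of the BM25 value."""
    i = MEASURES.index(measure)
    base = TABLE_3["BM25"][i]
    return (TABLE_3[model][i] - base) / base

def best_model(measure):
    """Model with the highest value for `measure` in Table 3."""
    i = MEASURES.index(measure)
    return max(TABLE_3, key=lambda model: TABLE_3[model][i])

for measure in MEASURES:
    top = best_model(measure)
    print(f"{measure}: {top} ({relative_gain(top, measure):+.1%} over BM25)")
```

Running the loop lists, for each measure, the best model in Table 3 and its gain over $BM 25$ , which is the quantity plotted in Figure 2.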

The conclusions of experimental results are threefold:

(1) Within the term-field-convolutions frequency framework, the performance of $Emd_0.30$ and $Emd_0.50$ (which we call the low decay rate models) is lower than the performance of $Emd_0.80$ and $Emd_1.00$ (which we call the high decay rate models).

(2) The performance of $BM 25 + Mova$ lies between that of the low decay rate models and that of the high decay rate models.

(3) The performance of the QTP-attaching models $BM 25 + tao$ and $BM 25 + p 6$ is close to that of the low decay rate models. In the experiments, $BM 25 + p 6$ has the lowest performance on every evaluation measure except $P 100$ .

Discussions

In the experiments, the term-field-convolutions frequency framework with the decay rate 0.8 achieves the best overall performance. Among the four decay rates, the performance of $Emd_0.30$ is generally the lowest, $Emd_0.50$ is higher, $Emd_0.80$ is the highest, and $Emd_1.00$ is slightly lower than $Emd_0.80$ on most measures. These results are very close to the related experimental results of Gao et al.²⁹ and partly support the choice of term field strength function.

Overall, with an appropriate decay rate, the term-field-convolutions frequency framework performs better than all the baseline models. Even when the decay rate is not appropriate enough, its performance is not clearly lower than that of the traditional QTP-attaching models, although it is then lower than that of the other QTP-embedding model $BM 25 + Mova$ .

We attribute this to the fact that the “proximity” in the term-field-convolutions frequency framework is a “higher order proximity.” Traditional proximity is a direct relationship between one term and another, whereas a term-field convolution is an indirect relationship between one term field and another. Two terms in a document relate to each other not only directly but also through the terms between them. The term-field-convolutions frequency framework takes this information into account, whereas the other models simply ignore it.

In addition, the performance of $BM 25 + p 6$ is rather low on most evaluation measures except $P 100$ . This is probably related to the different sizes of the test collections. In Cummins and O’Riordan’s¹⁹ experiments, the average size of the test collections is only 136,272, whereas the size of WT10g is 1,697,027. $p 6$ is perhaps better suited to small test collections.

Equation (13) is written in integral form to be consistent with the form of the convolution (equation (12)). In practice, the convolution between term fields is discrete and the sentences in most documents are short, so the QTP-embedding model does not require much computation time. Furthermore, some static data, such as the semantic similarity between terms, can be calculated in advance and stored in a hash table. In our experiments, there is no significant difference in computation time between the QTP-attaching models and the QTP-embedding models.
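The cost argument can be illustrated with a direct discrete convolution of two fields sampled over the token positions of one sentence, together with a lookup table built ahead of time. This is only a sketch under stated assumptions: the strength function `exp(-alpha * distance)` is a hypothetical stand-in and is not equation (13), the similarity function is supplied by the caller, and the names `term_field`, `discrete_convolution`, `precompute_table`, and `lookup` are illustrative.

```python
import math
from itertools import combinations_with_replacement


def term_field(positions, length, alpha):
    """Sample a term's field at every token position of one sentence.

    HYPOTHETICAL strength function: exp(-alpha * distance to the nearest
    occurrence). The article's own function is not reproduced here.
    """
    if not positions:
        return [0.0] * length
    return [
        max(math.exp(-alpha * abs(i - p)) for p in positions)
        for i in range(length)
    ]


def discrete_convolution(f, g):
    """Full discrete convolution: out[n] = sum over m of f[m] * g[n - m]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out


def precompute_table(terms, similarity):
    """Evaluate a costly static function once per unordered term pair.

    `similarity` is any caller-supplied function of two terms; the result
    is an ordinary dict, which is a hash table.
    """
    return {
        (a, b): similarity(a, b)
        for a, b in combinations_with_replacement(sorted(set(terms)), 2)
    }


def lookup(table, a, b):
    """Constant-time retrieval, independent of argument order."""
    return table[(a, b) if a <= b else (b, a)]


if __name__ == "__main__":
    # A 10-token sentence with one query term at position 2, another at 6.
    field_a = term_field([2], 10, alpha=0.8)
    field_b = term_field([6], 10, alpha=0.8)
    print(discrete_convolution(field_a, field_b))
```

For a sentence of n tokens the double loop performs n × n multiply-add steps, so a 20-token sentence needs only 400 of them per pair of fields, which is consistent with the low overhead reported above.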

Conclusion and future work

In this article, we propose the concept of QTP embedding and then propose the term-field-convolutions frequency framework as an implementation of it. This framework draws an analogy between the context of a term and an electric or magnetic field, regards the term frequency as the flux of the term field, and uses the convolutions of term fields to account for QTP. Theoretical analysis and experimental results show that an IR model based on the term-field-convolutions frequency framework outperforms the baseline models, including its original IR model, some QTP-attaching models, and a previous QTP-embedding model.

Looking forward, several improvements can be pursued. One of the most important is a deeper study of term field strength functions. The term field strength function used in this article was proposed in previous research,²⁹ where it was derived from experience, and it requires further theoretical analysis and extension. This leaves many new tasks for future work.

Footnotes

Academic Editor: Wei Yu

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (No. 61202181, No. 61671371), China Postdoctoral Science Foundation (No. 2012M512006), and the Fundamental Research Funds for the Central Universities (No. xjj2013097).

References

1. Song, Zhang, Duan. TOLA: topic-oriented learning assistant based on cyber-physical system and big data. Future Gener Comp Sy. Epub ahead of print 9 June 2016. DOI: http://dx.doi.org/10.1016/j.future.2016.05.040.

2. Yates, Neto BR. Modern information retrieval. Boston, MA: Addison-Wesley, 1999.

3. Salton, Wong, Yang CS. A vector space model for information retrieval. Commun ACM 1975; 18(11): 613–620.

4. Fuhr. Probabilistic models in information retrieval. Comput J 1992; 35(3): 243–255.

5. Jones, Walker, Robertson SE. A probabilistic model of information retrieval: development and comparative experiments. Inf Process Manag 2000; 36(6): 779–808.

6. Géry, Largeron. BM25t: a BM25 extension for focused information retrieval. Knowl Inf Syst 2012; 32(1): 217–241.

7. Croft, Lafferty. Language modeling for information retrieval. Dordrecht: Kluwer Academic Publishers, 2003.

8. Ponte, Croft WB. A language modeling approach to information retrieval. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’98), New York, 24–28 August 1998, pp.275–281. New York: ACM.

9. Wang, Gao. Multi-style language model for web scale information retrieval. In: Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, Geneva, 19–23 July 2010, pp.467–474. New York: ACM.

10. Kurland, Krikon. The opposite of smoothing: a language model approach to ranking query-specific document clusters. J Artif Intell Res 2011; 41(2): 367–395.

11. Dumais, Cutrell, Cadiz. Stuff I’ve seen: a system for personal information retrieval and re-use. ACM SIGIR Forum 2015; 49(2): 28–35.

12. Dahab, Kamel, Alnofaie. Further investigations for documents information retrieval based on DWT. In: Proceedings of the international conference on advanced intelligent systems and informatics, Cairo, Egypt, 24–26 October 2016, pp.3–11. Berlin: Springer.

13. Ehsan, Shakery. Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information. Inf Process Manag 2016; 52(6): 1004–1017.

14. Metzler, Croft. A Markov random field model for term dependencies. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’05), Salvador, Brazil, 15–19 August 2005, pp.472–479. New York: ACM.

15. Beigbeder, Mercier. An information retrieval model using the fuzzy proximity degree of term occurrences. In: Proceedings of the 2005 ACM symposium on applied computing (SAC ’05), Santa Fe, NM, 13–17 March 2005, pp.1018–1022. New York: ACM.

16. Petkova, Croft. Proximity-based document representation for named entity retrieval. In: Proceedings of the sixteenth ACM conference on information and knowledge management (CIKM ’07), Lisbon, 6–10 November 2007, pp.731–740. New York: ACM.

17. Rasolofo, Savoy. Term proximity scoring for keyword-based retrieval systems. In: Proceedings 25th European conference on IR research (ECIR ’2003), Pisa, 14–16 April 2003, pp.207–218. Berlin: Springer.

18. Tao, Zhai. An exploration of proximity measures in information retrieval. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’07), Amsterdam, 23–27 July 2007, pp.295–302. New York: ACM.

19. Cummins, O’Riordan. Learning in a pairwise term-term proximity framework for information retrieval. In: Proceedings of the 32nd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2009), Boston, MA, USA, 19–23 July 2009, pp.251–258. New York: ACM.

20. Robertson, Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’94), Dublin, 3–6 July 1994, pp.232–241. New York: Springer.

21. Robertson, Zaragoza, Taylor. Simple BM25 extension to multiple weighted fields. In: Proceedings of the 13th ACM CIKM conference (CIKM 2004), Washington, DC, 8–13 November 2004, pp.42–49. New York: ACM.

22. Vechtomova, Karamuftuoglu. Lexical cohesion and term proximity in document ranking. Inf Process Manag 2008; 44(4): 1485–1502.

23. Korfhage. Information storage and retrieval. Hoboken, NJ: John Wiley & Sons, 1997.

24. Singhal, Salton, Mitra. Document length normalization. Inf Process Manag 1996; 32(5): 619–633.

25. Robertson, Walker, Jones. Okapi at TREC-3. In: Proceedings of the third text retrieval conference (TREC-3), Gaithersburg, MD, 2–4 November 1995.

26. Lafferty, Zhai. Document language models, query models, and risk minimization for information retrieval. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, New Orleans, LA, 9–12 September 2001, pp.111–119. New York: ACM.

27. Pedersen, Patwardhan, Michelizzi. Wordnet: similarity - measuring the relatedness of concepts. In: Proceedings of the nineteenth national conference on artificial intelligence (AAAI), San Jose, CA, 25–29 July 2004.

28. Fellbaum. WordNet: an electronic database (Language, speech, and communication series). Cambridge, MA: MIT Press, 1998.

29. Gao, Zhou, Nie. Resolving query translation ambiguity using a decaying co-occurrence model and syntactic dependence relations. In: Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’02), Tampere, 11–15 August 2002, pp.183–190. New York: ACM.

30. http://ir.dcs.gla.ac.uk/test_collections/, 2016.

31. Buckley, Voorhees. Retrieval evaluation with incomplete information. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, Sheffield, 25–29 July 2004, pp.25–32. New York: ACM.