
Abstract

Retrieving mathematical expressions from scientific documents is a challenging task as mathematical expressions or formulae are quite different from the traditional text. Mathematical expressions are highly symbolic and complex. Moreover, the structure of a mathematical formula conveys a semantic meaning which cannot be overlooked. This paper proposes a scientific document retrieval system based on mathematical formula query. The paper explores the concept of Structure Encoded String (SES), which has been employed for mathematical expressions to capture the relations among the formula structures. A pattern based trie indexing scheme has been proposed for faster retrieval. The Jaro-Winkler Similarity has been adopted for matching and ranking. Experiments are conducted, results are reported using standard evaluation measures and compared with similar existing systems.

Keywords

Information retrieval, trie indexing, normalization, mathematical expression retrieval

1. Introduction

Mathematical formulae and expressions are predominantly used in STEM (Science, Technology, Engineering, and Mathematics) research and education. The key distinction between a text retrieval system and a mathematical information retrieval (MIR) system lies in the nature of the data that these systems handle. A text retrieval system serves the information need of the user (usually documents) on the basis of terms and keywords [1,2]. An MIR system, on the other hand, deals with mathematical entities, symbols, expressions, formulae and complex structures. Consequently, traditional text retrieval systems fail at searching mathematical expressions efficiently because they are not designed to handle text with complex structures [3,4].

Scientific documents that inherently contain mathematical content present the following challenges, related to their layout and semantics, in the context of information retrieval [5–9]:

• Non-linear Structure: Mathematical expressions and formulae are not linear in nature. These non-linear structures carry an inherent semantic meaning of the expressions [5]. For instance, $\int_{0}^{1}x\,dx$.
• Notational Ambiguity: Mathematical notation and symbol sets are used to formally define and convey the meaning of a particular concept or scientific/natural phenomenon. The same fundamental meaning may be written in several different ways. This happens due to the lack of notational consistency and the limited symbol set [6,8,9]. For example, $\frac{df(x)}{dt}$ and $\frac{d}{dt}f(x)$.
• Normalization: Equivalent mathematical expressions can be written in various ways using different sets of variables, symbols or constants. Normalization reduces the mismatch among such semantically similar expressions. For example, the formulae a ² + b ² and 𝛼² + 𝛽² are semantically equivalent, as they retain the same structure while using different variables or symbols [5–7].
• Encoding Schemes: The encoding schemes used by textual statements and mathematical expressions differ significantly. Several encoding schemes are available for mathematical expressions, such as MathML [10], LATEX [11], and OpenMath [12], to name a few. LATEX is the de facto standard for encoding mathematics in scientific documents among academic communities, whereas MathML, a W3C standard, is gaining widespread popularity, although not all browsers yet support the complete feature set of MathML.

In the case of mathematical retrieval systems apart from the aforementioned challenges, indexing of mathematical content and similarity measures of mathematical formulae also remain open challenges. Indexing mathematical content to achieve efficient outcomes involves the process of variable ordering, term unification, and normalization which can also be utilized for query processing and query refinement [13,14]. As compared to text retrieval systems, mathematical information retrieval (MIR) system has been explored only for the last two decades. Unlike text, the structure of Mathematical Expression (ME)/formula is complex in nature and poses a vital challenge for indexing. The majority of math-aware search engines handle this aspect by using either of the two approaches of indexing namely text-based or tree/substitution tree based [5,15–17].

In text-based approach, a mathematical formula markup is converted into a plain text string. Then the traditional text retrieval schemes are implemented using existing information retrieval tools like Lucene framework or its counterparts. This conversion itself presents another set of challenges like retaining structure information of MEs into the string, resolving notational ambiguity and normalization of the string [8,18].

In EgoMath2 (designed by Jozef Misutka as an extended version of Egothor by Leo Galambos, MFF UK Prague), mathematical formulae are stored using reverse Polish notation. It applies an augmentation algorithm to the input, using both transformation and generalization rules along with an ordering algorithm [8,18,19]. For instance, x ² is represented in simple text as the three terms x, ^, 2, but is stored internally in postfix notation. Another math-aware system, Mathematical Indexer and Searcher (MIaS), was developed by Petr Sojka and Martin Liska [15,20]. Here, the textual and mathematical segments of a scientific document were handled separately, with various pre-processing tasks applied before indexing. The Lucene framework was used to index the textual content of the document. For the mathematical content, multiple representations of each input formula were created and indexed. These representations were derived from the input formula in several stages covering exact matching, sub-formula matching, and formula modification. Each index term was assigned a weight depending on the amount of modification done (the weight is inversely proportional to the degree of modification). Some other math-aware systems, such as [9,21,22], used a similar text-based indexing approach.

In the tree-based approach, mostly symbol layout tree (SLT) or operator tree (OPT) was employed to index formulae. Attributes of tree structures like sub-expression or path were extracted as index terms. Indexing all sub-structures of a formula can lead to high recall but suffers from index size growth. To address the issue of indexing the sub-structures of semantic formula, a substitution tree indexing technique was proposed by Kohlhase et al. [23] and later modified by Sojka et al. [15]. Originally substitution tree indexing was proposed by Graf [13] for theorem provers. Similarly, it was also used by Schellenberg et al. [14] for indexing layout presentation of formulae. Also, WikiMirs [8,18] is based on the notion of explicit or implicit operands of LATEX markup. The input was extracted from the Wikipedia dataset and represented as Presentation Tree after pre-processing. The process of term normalization was employed to generate generalized terms. Moreover, it used a modified similarity score based on term frequency-inverse document frequency (tf-idf) scheme and utilized an inverted index. The modified similarity score was used to evaluate the distance of the matched terms on different levels of the presentation tree.

Although existing MIR systems have addressed most of these challenges, their results on the standard evaluation measures are still not convincing. To be applied to document retrieval systems, these techniques still need to be re-examined or combined with other methods to improve the overall quality of the system [5,16].

To this end, this paper aims to construct a text-based MIR system, AlongMath, that estimates the relevance of documents in the corpus to a mathematical query and ranks the documents based on scoring. The contributions of this paper are as follows:

• The representation of the mathematical content (and query) of a document in linear form using Structure Encoded String (SES) to achieve term generalization.
• The use of a pattern mapping table to realize normalization among equivalent mathematical expressions.
• A pattern-based trie (PB-Trie) indexing for faster retrieval and improved precision.
• A performance comparison of the proposed system AlongMath with two popular MIR systems, namely Math Indexer and Searcher (MIaS) and WikiMirs.

The remainder of this paper is organized as follows. In Section 2, the proposed system AlongMath and its modules are discussed. Section 3 elaborates on the pattern-based indexing scheme, and Section 4 describes online retrieval. Subsequently, the experimental environment, evaluation measures and comparison, along with data and results, are discussed in Section 5. Section 6 concludes the paper.
2. AlongMath: System description

2.1. Overview

The complete workflow of the proposed system AlongMath is depicted in Fig. 1. The system consists of 5 modules: MathExtract Parser, Structure Encoded String (SES) Generator, Pattern Generator, Indexer, and Ranker. The system retrieves relevant documents from the index database depending on the mathematical query issued by the user. To achieve effectiveness, the framework is divided into two segments: the offline phase used for indexing and the online phase for retrieving purposes as depicted in Fig. 1.

Fig. 1.

Workflow of the proposed system: AlongMath.

Presentation MathML (P-MML) is the primary format supported by AlongMath. AlongMath is built on the Wikipedia corpus, which is publicly available from the NTCIR (NII Testbeds and Community for Information access Research) Project-12 MathIR Task [24]. In the offline phase, the MathExtract parser extracts the mathematical formulae, expressions, and entities enclosed in <math>...</math> markup. Next, the Structure Encoded String (SES) is generated for the symbols of each mathematical formula by the SES Generator module. By taking into account the class of operators, operands, and special mathematical symbols, string patterns are generated using a mapping table. The generated string patterns maintain the order of the original expressions and are indexed using a pattern-based trie (PB-Trie). In online retrieval, when a user searches for a formula (in TEX), the query is transformed into an internal MathML format from which the SES is generated. Thereafter, a pattern is generated and looked up in the PB-Trie to check whether the pattern exists. Finally, the ranker module calculates the similarity scores of the relevant formulae and presents the list of documents sorted by their similarity to the user query.

2.2. MathExtract parser

Fig. 2.

A mathematical formula a ² + b ² = c ² and its corresponding P-MML markup.

MathML [10], an XML application for encoding the syntax and semantics of mathematical expressions, has two major forms: presentation and content. Presentation MathML (P-MML) markup, as illustrated in Fig. 2, comprises 30 elements accepting around 50 attributes. Most of these elements relate to the syntax or layout of the representation and fall into the following four broad categories:

• Token elements: <mi>, <mo>, <mn> etc.
• Layout elements: <mrow>, <mfrac>, <mfenced> etc.
• Script elements: <msub>, <msup>, <munder> etc.
• Tables and matrices: <mtable>, <mtr>, <mtd> etc.

During the pre-processing stage, the following elements were ignored and removed:

• The elements <mtext>, <mspace> and <ms>.
• The elements concerned with appearance, binding actions and styling, such as <mstyle>, <merror>, <mpadded>, <mphantom>, <mmultiscripts>, <mlabeledtr> and <menclose>.

These elements were eliminated because they do not contribute to the meaning or semantics of the mathematical content. They concern the alignment, justification and appearance of symbols, mathematical expressions and formulae.

2.3. Structure encoded string generator

This module generates the Structure Encoded String (SES), a linear representation of a Mathematical Expression (ME). In the context of mathematical content, the term SES was proposed by Kumar et al. [25], who designed an automated performance evaluation of ME recognition. The authors identified six surrounding positions, viz. top-left (TL), above (A), top-right (TR), bottom-left (BL), below (B) and bottom-right (BR), which can be spatially associated with an ME symbol. The base of the expression is denoted by the symbol M, which occupies the central position. The positions TL, A and TR form the top region, which is treated as a single sub-expression and is called the northern region (N). Similarly, the positions BL, B and BR form the bottom region, which is called the southern region (S). This concept is shown in Fig. 3.

Fig. 3.

Possible spatial regions around mathematical symbol M. Other symbols are TL (top-left), A (above), TR (top-right) for northern region represented as N and BL (bottom-left), B (below), BR (bottom-right) for southern region represented as S.

The work presented in [25] is based on LATEX input. In this paper, a similar notion has been adopted and extended to presentation MathML (P-MML) documents in order to generate the SES. The module is significant because it achieves term generalisation, which reduces the mismatch among mathematical fragments.

2.3.1. A running example of SES generation

To generate the SES, the Presentation MathML (P-MML) markup is first scanned from left to right, i.e. from <math> to </math>. The structural information of the ME is preserved by two special pairs of structure symbols, Ns and Ne, and Ss and Se. Here, Ns and Ne represent North start and North end respectively. Similarly, Ss and Se delimit southern region sub-expressions.

Fig. 4.

SES encoding of northern region with Ns and Ne and southern region with Ss and Se.

Considering Fig. 4, the Structure Encoded String (SES) for $a_{2}^{i}$ will be represented as <a, Ss, 2, Se, Ns, i, Ne>. Here, due to the inherent structure of P-MML, the subscripts are handled before superscripts. It is observed that the symbol a represents the base mathematical symbol (M). Also, the superscript i represents the northern region encoded inside Ns and Ne, and the subscript 2 represents the southern region encoded inside Ss and Se.

Similarly, the SES for a ² + b ² = c ² will be <a, Ns, 2, Ne, +, b, Ns, 2, Ne, =, c, Ns, 2, Ne>. Therefore, this approach aids in converting the non-linear mathematical expression into Structure Encoded String, thereby making expression linear while preserving the structural information.
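The scan described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ses` and the helper `local` are hypothetical, only token elements (<mi>, <mn>, <mo>), <msub>, <msup> and <msubsup> are handled, and every other element is treated as a plain container whose children are read from left to right.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/1998/Math/MathML}'."""
    return tag.rsplit("}", 1)[-1]

def ses(node):
    """Return the SES token list for a P-MML element (hypothetical helper)."""
    name = local(node.tag)
    kids = list(node)
    if name in ("mi", "mn", "mo"):            # token elements carry the symbol itself
        return [(node.text or "").strip()]
    if name == "msup":                        # base, then northern region
        base, sup = kids
        return ses(base) + ["Ns"] + ses(sup) + ["Ne"]
    if name == "msub":                        # base, then southern region
        base, sub = kids
        return ses(base) + ["Ss"] + ses(sub) + ["Se"]
    if name == "msubsup":                     # P-MML lists the subscript before the superscript
        base, sub, sup = kids
        return ses(base) + ["Ss"] + ses(sub) + ["Se", "Ns"] + ses(sup) + ["Ne"]
    tokens = []                               # <math>, <mrow>, ...: concatenate the children
    for kid in kids:
        tokens += ses(kid)
    return tokens

pythagoras = ("<math><msup><mi>a</mi><mn>2</mn></msup><mo>+</mo>"
              "<msup><mi>b</mi><mn>2</mn></msup><mo>=</mo>"
              "<msup><mi>c</mi><mn>2</mn></msup></math>")
print(ses(ET.fromstring(pythagoras)))
```

Handling <msubsup> by emitting the southern region first reproduces the ordering noted for $a_{2}^{i}$; fractions, under/over scripts and tables would need further cases that this sketch leaves out.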

2.4. Pattern generation

After encoding a mathematical expression into a sequence of SES, the patterns for operands and operators are generated.

Table 1
Mapping table

Type	List	Mapping term
Arithmetic	+, −, ∗, ÷, …	OP1
Calculus	∫, ∬, ∂, …	OP2
Statistics	μ, Σ, 𝛱, …	OP3
Measurements	‰, °, ′′, …	OP4
Letter Like	℘, $\Im$ , $\Re$ , …	OP5
Set-Logic	∀, ∈, ⊂, …	OP6
Geometric	⊥, ∥, ⊣, …	OP7
Equivalence	≡, ∼, ≅, …	OP8
Arrow	←, →, ↓, …	OP9
Greek	𝛼, 𝛽, 𝛾, …	V
Latin	A, B, C, …	V
Digit	0, 1, 2, ...	D

Considering the two MEs i.e. x ² + y ² = 1 and a ² + b ² = 1, it can be observed that they are semantically equivalent but they disagree in the context of variables used. If x, y, a and b can be transformed to an equivalent group/common term, then both the MEs can be considered as equivalent match. Retrieval depends on match type and the match type is based on the user requirements. This module is closely related to unification and normalization process. Matching can be of any type like exact, instantiation or generalisation [16,25,26].

Table 2

Pattern generation

Mathematical formula	Structure encoded string	Pattern
a ² + b ² = c ²	<a, Ns, 2, Ne, +, b, Ns, 2, Ne, =, c, Ns, 2, Ne>	V,NS,D,NE,OP1,V,NS,D,NE,OP1,V,NS,D,NE
$a_{2}^{i+1}$	<a, Ss, 2, Se, Ns, i, +, 1, Ne>	V,SS,D,SE,NS,V,OP1,D,NE
$\displaystyle \sum _{n=1}^{\infty }\frac{1}{n^{2}}=\frac{{\pi}^{2}}{6}$	<Σ, BS, n, =, 1, BE, TS, ∞, TE, ∕, 1, @, n, NS, 2, NE, =, ∕, 𝜋, NS, 2, NE, @, 6>	OP3,BS,V,OP1,D,BE,TS,OP1,TE,OP1,D,@,V,NS,D,NE,OP1,OP1,V,NS,D,NE,@,D
$\displaystyle \int _{-1}^{1}\frac{dx}{x}$	<∫, BS, −1, BE, TS, +1, TE, ∕, d, x, @, x>	OP2,BS,M,BE,TS,M,TE,OP1,V,V,@,V

This module employs a mapping table consisting of 12 groups of operators and symbols created by The Pennsylvania State University¹. A common mapping term is assigned to each group, and the complete mapping table is shown in Table 1. Based on Table 1, patterns are generated for MEs. Examples of pattern generation for some MEs are shown in Table 2.
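The mapping step can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the system's actual code: the names `OPERATOR_GROUPS` and `to_pattern` are hypothetical, only a few of the groups of Table 1 are listed, "=" and "∕" are placed in OP1 because the patterns in Table 2 map them that way, and tokens that match no rule are passed through unchanged.

```python
# Partial mapping table (assumed subset of Table 1).
OPERATOR_GROUPS = {
    "OP1": "+-−∗*÷/∕=",   # arithmetic; '=' and '∕' follow the patterns in Table 2
    "OP2": "∫∬∂",         # calculus
    "OP3": "μΣΠ",         # statistics
    "OP6": "∀∈⊂",         # set-logic
    "OP8": "≡∼≅",         # equivalence
    "OP9": "←→↓",         # arrow
}
SYMBOL_TO_TERM = {ch: term for term, chars in OPERATOR_GROUPS.items() for ch in chars}
STRUCTURE_SYMBOLS = {"Ns", "Ne", "Ss", "Se"}

def to_pattern(ses_tokens):
    """Map each SES token to its mapping term and join the terms with commas."""
    terms = []
    for tok in ses_tokens:
        if tok in STRUCTURE_SYMBOLS:
            terms.append(tok.upper())          # structure symbols are kept (upper-cased)
        elif tok in SYMBOL_TO_TERM:
            terms.append(SYMBOL_TO_TERM[tok])  # operator groups are checked before letters
        elif tok.isdigit():
            terms.append("D")
        elif tok.isalpha():
            terms.append("V")                  # Latin and Greek letters
        else:
            terms.append(tok)                  # unmapped tokens pass through unchanged
    return ",".join(terms)

abc = ["a", "Ns", "2", "Ne", "+", "b", "Ns", "2", "Ne", "=", "c", "Ns", "2", "Ne"]
xyz = ["x", "Ns", "2", "Ne", "+", "y", "Ns", "2", "Ne", "=", "z", "Ns", "2", "Ne"]
print(to_pattern(abc))
print(to_pattern(abc) == to_pattern(xyz))
```

Because both expressions collapse to the same pattern, a query written with x, y, z can reach formulae written with a, b, c, which is the normalization effect the module is meant to provide.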

Fig. 5.

A pattern based trie indexing scheme along with posting list associated with the leaf node.

3. Indexing

One of the predominant data structures for indexing text is Trie, additionally called a Digital Tree or Prefix Tree. A trie facilitates a convenient way for storing strings as it provides one node for every common prefix and the whole string is stored in additional leaf nodes. The Trie data structure eradicates the need of storing overlapping prefixes which are stored only once thereby making it a compact structure [27].

Definition A (Trie).

Formally, let Σ be some fixed alphabet.

• A trie is a tree where each node stores
  – a bit indicating the end of a string
  – an array of |Σ| pointers, one for each character
• Each node x corresponds to the string given by the path traced from the root to that node.

In vector space model, bag of words representation is used where order of terms is ignored [18,25,28]. But for the formula retrieval, the order of operators is a crucial feature. So, for effective formula retrieval, a pattern based trie (PB-Trie) indexing structure is constructed to preserve the order.

In a pattern based trie (PB-Trie), which is a prefix tree of patterns, a node is defined as a triplet < P, C, E > where P contains the pattern information, C denotes a pointer to the child node of the current node and E flags end of the information.

The leaf node marks the end of the pattern setting E to true and pointing to its posting list that contains the records (M, Loc), where M contains SES information and Loc contains the canonical path of the actual document.

Considering the ME a ² + b ² = c ², the generated SES is <a, Ns, 2, Ne, +, b, Ns, 2, Ne, =, c, Ns, 2, Ne>. For this SES, the pattern V,NS,D,NE,OP1,V,NS,D,NE,OP1,V,NS,D,NE is generated. This pattern is inserted into the PB-Trie by scanning each symbol of the pattern from left to right. The same procedure is repeated for the other MEs, and the desired PB-Trie is constructed as illustrated in Fig. 5.
3.1. Algorithm for indexing and searching in PB-Trie
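
Insertion into and lookup in a PB-Trie, as described in this section, can be sketched as follows. This is a minimal sketch under assumptions, not the authors' algorithm: the class and method names are hypothetical, the pattern term P is stored as the dictionary key that leads to a node, C is the dictionary of children, E is the end flag, and the posting list of (SES, Loc) records is attached to any node at which a pattern ends.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)   # C: pattern term -> child node
    end: bool = False                               # E: a complete pattern ends here
    postings: list = field(default_factory=list)    # records (SES, Loc)

class PBTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, pattern, ses, loc):
        """Walk the pattern term by term from left to right, creating nodes as needed."""
        node = self.root
        for term in pattern.split(","):
            node = node.children.setdefault(term, Node())
        node.end = True
        node.postings.append((ses, loc))

    def search(self, pattern):
        """Return the posting list of an exactly matching pattern, or an empty list."""
        node = self.root
        for term in pattern.split(","):
            node = node.children.get(term)
            if node is None:
                return []
        return node.postings if node.end else []

trie = PBTrie()
square_sum = "V,NS,D,NE,OP1,V,NS,D,NE,OP1,V,NS,D,NE"
trie.insert(square_sum, "a,Ns,2,Ne,+,b,Ns,2,Ne,=,c,Ns,2,Ne", "Doc1")
trie.insert(square_sum, "x,Ns,2,Ne,+,b,Ns,2,Ne,=,l,Ns,2,Ne", "Doc3")
trie.insert("V,SS,D,SE,NS,V,OP1,D,NE", "a,Ss,2,Se,Ns,i,+,1,Ne", "Doc7")
print([loc for _, loc in trie.search(square_sum)])
```

Lookup cost grows with the number of terms in the query pattern rather than with the number of indexed formulae, which is why a trie suits the fast-retrieval goal stated above. The document identifiers in the example are placeholders.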

4. Online retrieval

The goal in this phase is to efficiently retrieve relevant mathematical expressions (MEs) based on a given ME query, both in terms of speed and accuracy. The constructed PB-Trie is thus used, with the online retrieval algorithm to achieve the purpose.

4.1. Query processing and pattern generation

The proposed approach takes LATEX string as ME query and transforms it into its corresponding Presentation MathML (P-MML) using SnuggleTeX² which is an open-source Java library developed at the University of Edinburgh. This transformed P-MML serves as input to the SES Generator module from which SES of the query is obtained. The SES is then again converted to its corresponding pattern through Pattern Generation module.

4.2. Matching and retrieval

Let $\Sigma^{*}$ be the set of all possible strings over an alphabet Σ. The Jaro measure $d_{j}: \Sigma^{*}\times \Sigma^{*}\rightarrow [0,1]$ is a string similarity measure that was originally developed for name comparison in the U.S. Census. It accounts for insertions, deletions and transpositions. The algorithm computes the number of common characters c between two strings and the number of transpositions among them, where two equal characters count as common only if their positions differ by less than half the length of the longer string, rounded down [29,30].

Consider a character $s_{i}$ of string S and a character $t_{j}$ of another string T to be common characters of S and T if

$$s_{i}=t_{j}\quad \text{and}\quad |i-j| < \left\lfloor \frac{n}{2}\right\rfloor ,$$

where n is the length of the longer string.

The Jaro similarity measure for the two strings S and T is then given by

$$d_{j}=\begin{cases}0, & \text{if } c=0\\ \frac{1}{3}\left(\frac{c}{|S|}+\frac{c}{|T|}+\frac{c-t}{c}\right), & \text{otherwise,}\end{cases}$$ (1)

where t denotes the number of transpositions among the c common characters. The time and space complexities of the Jaro similarity algorithm are O(|S| + |T|) [31,32].

The Winkler modification improves the Jaro similarity measure by putting more emphasis on matching prefixes (up to four characters) if the Jaro similarity reaches a certain "boost threshold" $b_{t}$, originally set to 0.7. It is calculated as

$$d_{w}=\begin{cases}d_{j}, & \text{if } d_{j} < b_{t}\\ d_{j}+\frac{l_{p}}{10}\left(1-d_{j}\right), & \text{otherwise.}\end{cases}$$ (2)

Here, $l_{p}$ denotes the length of the common prefix.
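Equations (1) and (2) can be checked with a compact Python sketch. It is offered as an illustration under assumptions and not as the system's code: the function names and the parameters `boost_threshold`, `scale` and `max_prefix` are hypothetical, the match window follows the strict inequality given above, and the window is clamped to at least 1 so that two identical one-character strings still match.

```python
def jaro(s, t):
    """Jaro similarity of two sequences, following equation (1)."""
    if not s or not t:
        return 0.0
    # |i - j| < floor(n / 2); clamped to 1 so very short inputs can still match.
    window = max(1, max(len(s), len(t)) // 2)
    used = [False] * len(t)
    s_common = []
    for i, ch in enumerate(s):
        for j in range(max(0, i - window + 1), min(len(t), i + window)):
            if not used[j] and t[j] == ch:
                used[j] = True
                s_common.append(ch)
                break
    c = len(s_common)
    if c == 0:
        return 0.0
    t_common = [t[j] for j in range(len(t)) if used[j]]
    # A transposition is a pair of common characters that appear in swapped order.
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) // 2
    return (c / len(s) + c / len(t) + (c - transpositions) / c) / 3

def jaro_winkler(s, t, boost_threshold=0.7, scale=0.1, max_prefix=4):
    """Winkler modification, following equation (2)."""
    dj = jaro(s, t)
    if dj < boost_threshold:
        return dj
    lp = 0                                    # length of the common prefix, capped at four
    for a, b in zip(s, t):
        if a != b or lp == max_prefix:
            break
        lp += 1
    return dj + lp * scale * (1 - dj)

print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

For the classic pair MARTHA and MARHTA there are six common characters and one transposition, so $d_{j} = (1 + 1 + 5/6)/3$, and the three-character common prefix lifts the score to $d_{w} = d_{j} + 0.3(1 - d_{j})$. The functions accept any sequence, so they could be applied either to a serialized SES string or to a list of SES tokens; which of the two the system uses is not stated here.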

The pattern of a given ME query is first searched in the PB-Trie and, if a match is found, its posting list is examined. The Jaro-Winkler similarity score is then calculated for every ME in that posting list, and the resulting scores are sorted in descending order. If more than one mathematical expression obtains the same score, ties are resolved on a first-come, first-served basis. The top-k results are then shown to the user.
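The ranking step can be sketched as below. This is a hedged illustration: `rank` is a hypothetical helper that accepts any scoring function, and `positional_overlap` is a toy stand-in scorer used only to keep the example self-contained. It is not Jaro-Winkler, and its scores are unrelated to those reported in this paper.

```python
def rank(query_ses, postings, score, k=10):
    """Score every (SES, Loc) record against the query and return the top-k records."""
    scored = [(score(query_ses, ses), ses, loc) for ses, loc in postings]
    # sorted() is stable, so records with equal scores keep their posting-list order,
    # which gives the first-come, first-served tie-breaking described above.
    return sorted(scored, key=lambda record: record[0], reverse=True)[:k]

def positional_overlap(a, b):
    """Toy stand-in scorer (NOT Jaro-Winkler): share of positions holding equal tokens."""
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b))

def squares(u, v, w):
    """SES token list of u^2 + v^2 = w^2."""
    return [u, "Ns", "2", "Ne", "+", v, "Ns", "2", "Ne", "=", w, "Ns", "2", "Ne"]

query = squares("x", "y", "z")
postings = [(squares("a", "b", "c"), "Doc1"),
            (squares("α", "β", "γ"), "Doc2"),
            (squares("x", "b", "l"), "Doc3")]
for value, _, loc in rank(query, postings, positional_overlap, k=3):
    print(loc, round(value, 3))
```

With the toy scorer, Doc1 and Doc2 tie and therefore stay in posting-list order, which is the tie-breaking behaviour the sketch is meant to show.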

4.2.1. A running example

For a given LATEX query x ² + y ² = z ², the pre-processed pattern of the query string is shown in Table 3.

Table 3
Query pattern generation

Query	SES	Pattern
x ² + y ² = z ²	< x, Ns, 2, Ne, +, y, Ns, 2, Ne, =, z, Ns, 2, Ne >	V,NS,D,NE,OP1,V,NS,D,NE,OP1,V,NS,D,NE

Assume that a leaf node of the proposed trie matches with the example query which contains the following three sets of <SES, DocID> pairs:

{<a, Ns, 2, Ne, +, b, Ns, 2, Ne, =, c, Ns, 2, Ne>, Doc1},

{<𝛼, Ns, 2, Ne, +, 𝛽, Ns, 2, Ne, =, 𝛾, Ns, 2, Ne>, Doc2}, and

{<x, Ns, 2, Ne, +, b, Ns, 2, Ne, =, l, Ns, 2, Ne>, Doc3}.

As shown in Table 4, the similarity scores obtained for these documents yield the following ranking: Doc3 > Doc1 > Doc2.

Table 4

Similarity scores for the query x ² + y ² = z ²

ME (DocID)	SES	Score
a ² + b ² = c ² (Doc1)	<a,Ns,2,Ne,+,b,Ns,2,Ne, =,c,Ns,2,Ne>	0.9242919389978214
𝛼² + 𝛽² = 𝛾² (Doc2)	<𝛼,Ns,2,Ne,+,𝛽,Ns,2,Ne, =,𝛾,Ns,2,Ne>	0.8996496496496497
x ² + b ² = l ² (Doc3)	<x,Ns,2,Ne,+,b,Ns,2,Ne, =,l,Ns,2,Ne>	0.9660130718954247

5. Experimental results

The performance of the proposed system AlongMath has been evaluated using Wikipedia Corpus [24] publicly available at NTCIR-12 MathIR task³, which contains mathematical formulae written for normal users.

5.1. Corpus characteristics

The corpus contains 31,839 MathTag articles and 287,850 Text articles which contribute approximately 10% and 90% of the collection respectively. The corpus contains a total of 592,443 formulae out of which 580,068 marked formulae are distributed amongst the MathTag articles and 12,375 marked formulae in the Text articles. The size of the corpus is 5.15 GB in uncompressed format [33]. Also, the query set is taken from the same source which contains 100 queries written in LATEX format.

5.2. Evaluation measures

To evaluate the system performance, Precision and Discounted Cumulative Gain (DCG) are calculated respectively, for 100 queries.

Definition B (Precision).

“It measures the exactness of the retrieval process. If the actual set of relevant documents is denoted by I and the retrieved set of documents is denoted by O, then the precision is given by:

$$\text{Precision}=\frac{|I\cap O|}{|O|}.$$ (3)

Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank. Thus, the list of relevant documents I is cut-off at rank k. Only documents up to rank k are considered as the retrieved set of documents. This measure is called precision at k or P@k” [34,35].

Definition C (Discounted Cumulative Gain (DCG)).

“DCG measures the usefulness, or gain, of a document based on its position in the result list. DCG of the top-k retrieved results can be calculated using:

$$DCG_{k}=\sum_{i=1}^{k}\frac{2^{rel_{i}}-1}{\log_{2}(i+1)}.$$ (4)

Here, the list is named rel in which the ith element ($rel_{i}$) denotes whether the ith retrieved document is relevant to the query ($rel_{i}=1$) or not ($rel_{i}=0$).

Like precision at k, it is evaluated over some number k of top search results” [34,35].
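
Both measures can be computed from a binary relevance list, as the following Python sketch shows. The function names are hypothetical, and the relevance list in the example is made up for illustration; it is not taken from the experiments.

```python
import math

def precision_at_k(rel, k):
    """P@k following equation (3): relevant results among those retrieved up to rank k."""
    top = rel[:k]
    return sum(top) / len(top) if top else 0.0

def dcg_at_k(rel, k):
    """DCG@k following equation (4), with rel[i] in {0, 1} for the result at rank i + 1."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rel[:k], start=1))

rel = [1, 0, 1, 1, 0]   # illustrative judgements for the top five results of one query
print(precision_at_k(rel, 5))        # 0.6
print(round(dcg_at_k(rel, 5), 4))    # 1.9307
```

For this list, three of the five results are relevant, and the DCG is $1/\log_{2}2 + 1/\log_{2}4 + 1/\log_{2}5$, so relevant results near the top of the list contribute more than those further down.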

5.3. Results and discussion

To measure the relevance of the top-k retrieved formulae for each test query, each result has been evaluated and labelled manually depending on the subject. On average, searching for math formulae takes about 676 ms per query.

Fig. 6.

P@5 comparison of AlongMath with MIaS.

Fig. 7.

P@5 comparison of AlongMath with WikiMirs.

Fig. 8.

P@10 comparison of AlongMath with MIaS.

Fig. 9.

P@10 comparison of AlongMath with WikiMirs.

It can be observed from Figs 6, 7, 8 and 9 that the proposed system AlongMath performs better than MIaS and WikiMirs in terms of P@5 and P@10. For queries containing isolated symbols or wildcards, the MIaS system achieves better results than AlongMath, since MIaS also incorporates a tf-idf scheme using the Lucene framework for indexing the content. In some cases, the precision of WikiMirs is much higher than that of AlongMath because of its two-level indexing and its structural weighting scheme. It can also be observed that, in many cases, WikiMirs could not fetch even a single relevant result within the top 5 or top 10. Tables 5 and 6 compare the mean average precision and the mean reciprocal rank of AlongMath, WikiMirs and MIaS respectively, which highlights the proficiency of AlongMath over the others.

Fig. 10.

DCG@5 comparison of AlongMath with MIaS.

Fig. 11.

DCG@5 comparison of AlongMath with WikiMirs.

Fig. 12.

DCG@10 comparison of AlongMath with MIaS.

Fig. 13.

DCG@10 comparison of AlongMath with WikiMirs.

Similarly, Figs 10, 11, 12 and 13 indicate the performance comparison of AlongMath with MIaS and WikiMirs in the context of DCG@5 and DCG@10 respectively. In some of the cases of both MIaS and WikiMirs, DCG values are closer to zero due to non-retrieval of relevant results higher in the ranking. Experiments on accuracy also have been carried out in context of several different evaluation measures. Overall, AlongMath obtains more accurate results and ranks formulae better than MIaS and WikiMirs.

Table 5

Mean average precision (MAP) comparison

AlongMath	WikiMirs	MIaS
0.595	0.475	0.462

Table 6

Mean reciprocal rank (MRR) comparison

AlongMath	WikiMirs	MIaS
0.678	0.489	0.472

5.3.1. Limitations

The following limitations of the proposed system AlongMath have been observed:

AlongMath primarily aims at retrieving scientific documents based on mathematical queries. However, it does not support textual content and wildcards.

Indexing with a trie also has consequences in terms of index size. This is because each pattern is an extension of an existing SES, and these extensions make the trie very sparse. For this reason, a Patricia trie [36] could be used, as this optimization may reduce the size to approximately one third of its uncompressed version.

6. Conclusion

In this paper, a scientific document retrieval system, AlongMath, is presented that retrieves scientific documents using math queries. The system employs the concept of structure encoded string to convert a mathematical expression into a string, which achieves term generalization. AlongMath utilizes a pattern-based trie for normalization and indexing. The system achieves a MAP score of 0.595 and an MRR score of 0.678, outperforming the other systems under consideration. As part of future work, the following directions are planned:

Equipping the system to handle co-occurrence of formulae and keywords.

Introducing multi-level indexing for text and formulae.

Considering Content MathML for achieving better performance.

Exploring weighting schemes and other similarity measures.

Using machine learning algorithms such as RankSVM for ranking and re-ranking.

References

Sojka

, Exploiting semantic annotations in math information retrieval. in: Proceedings of the Fifth Workshop on Exploiting Semantic Annotations in Information Retrieval, ESAIR’12. ACM, New York, NY, USA, 2012, pp. 15–16. ISBN 978-1-4503-1717-7. doi:10.1145/2390148.2390157.

Pathak

Pakray

and Gelbukh

, A formula embedding approach to math information retrieval, Computacion y Sistemas (2018), 819–833. doi:10.13053/CyS-22-3-3015.

Larson

R.R.

Reynolds

and Gey

F.C.

, The abject failure of keyword IR for mathematics search: Berkeley at NTCIR-10 Math. in: NTCIR Kando

and Kato

(eds), National Institute of Informatics (NII), 2013. ISBN 978-4-86049-062-1.

Pathak

Pakray

Sarkar

Das

and Gelbukh

A.F.

, MathIRs: Retrieval system for scientific documents, Computación y Sistemas 21(2) (2017).

Zanibbi

and Blostein

, Recognition and retrieval of mathematical expressions, Int. J. Doc. Anal. Recognit. 15(4) (2012), 331–357. doi:10.1007/s10032-011-0174-4.

Castellanos

K.D.

, Symbolic and Visual Retrieval of Mathematical Notation using Formula Graph Symbol Pair Matching and Structural Alignment, PhD thesis, Rochester Institute of Technology, 2017.

Kristianto

G.Y.

, Retrieval and Disambiguation of Mathematical Expressions for Mathematical Information Access, PhD thesis, Information Science and Technology in Computer Science, Graduate School of the University of Tokyo, 2017.

Gao

Lin

Tang

Lin

and Baker

J.B.

, WikiMirs: A mathematical information retrieval system for wikipedia. in: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries on WikiMirs: A Mathematical Information Retrieval System for Wikipedia, JCDL’13. ACM, New York, NY, USA, 2013, pp. 11–20. ISBN 978-1-4503-2077-1. doi:10.1145/2467696.2467699.

Kristianto

G.Y.

Topic

and Aizawa

, The MCAT Math Retrieval System for NTCIR-11 Math Track, 2014. ISBN 978-1-4503-1717-7. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings11/pdf/NTCIR/Math-2/06-NTCIR11-MATH-KristiantoGY.pdf.

10.

W3C, Mathematical Markup Language, Accessed June 1, 2018. https://www.w3.org/TR/WD-math-980106/.

11.

LATEX – A document preparation system, Accessed June 15, 2018. https://www.latex-project.org/.

12.

Home

OpenMath

, Accessed June 1, 2018. https://www.openmath.org/.

13. Graf, Substitution tree indexing, in: Rewriting Techniques and Applications, Hsiang (ed.), Springer, Berlin, Heidelberg, 1995, pp. 117–131. ISBN 978-3-540-49223-8.

14. Schellenberg, Yuan and Zanibbi, Layout-based substitution tree indexing and retrieval for mathematical expressions, Proc. SPIE 8297 (2012). doi:10.1117/12.912502.

15. Sojka and Líška, Indexing and searching mathematics in digital libraries, in: Intelligent Computer Mathematics, J.H. Davenport, W.M. Farmer, Urban and Rabe (eds), Springer, Berlin, Heidelberg, 2011, pp. 228–243. ISBN 978-3-642-22673-1.

16. Guidi and Sacerdoti Coen, A survey on retrieval of mathematical knowledge, in: Intelligent Computer Mathematics, Kerber, Carette, Kaliszyk, Rabe and Sorge (eds), Springer International Publishing, Cham, 2015, pp. 296–315. ISBN 978-3-319-20615-8.

17. Davila and Zanibbi, Layout and semantics: Combining representations for mathematical formula search, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’17, ACM, New York, NY, USA, 2017, pp. 1165–1168. ISBN 978-1-4503-5022-8. doi:10.1145/3077136.3080748.

18. Wang, Gao, Wang, Tang, Liu and Yuan, WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document, in: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015, pp. 173–182. ISBN 978-1-4503-3594-2. doi:10.1145/2756406.2756918.

19. Mišutka and Galamboš, System description: EgoMath2 as a tool for mathematical searching on Wikipedia.org, in: Proceedings of the 18th Calculemus and 10th International Conference on Intelligent Computer Mathematics, 2011, pp. 307–309. ISBN 978-3-642-22672-4. http://dl.acm.org/citation.cfm?id=2032713.2032746.

20. Sojka and Líška, The Art of Mathematics Retrieval, in: Proceedings of the 11th ACM Symposium on Document Engineering, 2011, pp. 57–60. ISBN 978-1-4503-0863-2. doi:10.1145/2034691.2034703.

21. Miner and Munavalli, An approach to mathematical search through query formulation and data normalization, in: Towards Mechanized Mathematical Assistants, Kauers, Kerber, Miner and Windsteiger (eds), Springer, Berlin, Heidelberg, 2007, pp. 342–355. ISBN 978-3-540-73086-6.

22. Kohlhase and C.-C. Prodescu, MathWebSearch at NTCIR-10, in: NTCIR, Kando and Kato (eds), National Institute of Informatics (NII), 2013. ISBN 978-4-86049-062-1.

23. Kohlhase and Sucan, A search engine for mathematical formulae, in: AISC, Calmet, Ida and Wang (eds), Lecture Notes in Computer Science, Vol. 4120, Springer, 2006, pp. 241–253. ISBN 3-540-39728-0.

24. NTCIR, NII Testbeds and Community for Information Access Research, Accessed March 1, 2018. http://research.nii.ac.jp/ntcir/index-en.html.

25. Pavan Kumar, Agarwal and Bhagvati, A string matching based algorithm for performance evaluation of mathematical expression recognition, Sadhana 39(1) (2014), 63–79. doi:10.1007/s12046-013-0221-6.

26. P.P. Kumar, Agarwal and Bhagvati, A structure based approach for mathematical expression retrieval, in: Multidisciplinary Trends in Artificial Intelligence, 6th International Workshop, MIWAI 2012, Ho Chi Minh City, Vietnam, December 26–28, 2012, Sombattheera, N.K. Loi, Wankar and T.T. Quan (eds), Proceedings, Vol. 7694, Springer, 2012, pp. 23–34. doi:10.1007/978-3-642-35455-7_3.

27. Peral and Ferrández, MergedTrie: Efficient textual indexing, PLoS ONE 14 (2019). doi:10.1371/journal.pone.0217958.

28. H.-J. Lee and J.-S. Wang, Design of a mathematical expression understanding system, Pattern Recognition Letters 18 (1997), 289–298.

29. M.A. Jaro, Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association 84(406) (1989), 414–420. doi:10.1080/01621459.1989.10478785.

30. W.E. Winkler, Overview of record linkage and current research directions, Technical Report, Bureau of the Census, 2006.

31. Christen, A comparison of personal name matching: Techniques and practical issues, in: Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops, 2006, pp. 290–294. ISBN 0-7695-2702-7. doi:10.1109/ICDMW.2006.2.

32. W.E. Yancey, Evaluating string comparator performance for record linkage, Technical Report, Bureau of the Census, 2005.

33. Zanibbi, Aizawa, Kohlhase, Ounis, Topic and Davila, NTCIR-12 MathIR task overview, in: Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies, National Center of Sciences, Tokyo, Japan, June 7–10, 2016, Kando, Sakai and Sanderson (eds), National Institute of Informatics (NII), 2016. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings12/pdf/ntcir/OVERVIEW/01-NTCIR12-OVMathIR-ZanibbiR.pdf.

34. Datta, Ranking in information retrieval, Technical Report, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay, 2013.

35. C.D. Manning, Raghavan and Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. Web publication at https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html/fig:precision-recall.

36. D.R. Morrison, PATRICIA—Practical algorithm to retrieve information coded in alphanumeric, J. ACM 15(4) (1968), 514–534. doi:10.1145/321479.321481.

Scientific document retrieval using structure encoded string with trie indexing


Table 3
Query pattern generation

Query: x² + y² = z²
SES: < x, Ns, 2, Ne, +, y, Ns, 2, Ne, =, z, Ns, 2, Ne >
Pattern: V,NS,D,NE,OP1,V,NS,D,NE,OP1,V,NS,D,NE