Abstract
In the scientific field, mathematical formulae are a significant factor in communicating the ideas and fundamental principles of scientific knowledge. Nowadays, the scientific research community generates a huge number of documents that comprise both text and mathematical formulae. For the retrieval of textual information, numerous retrieval systems exist that generate excellent results. Nevertheless, these textual information retrieval systems are insufficient to handle the structure and scripting styles of mathematical formulae. The recent past has seen research that aims to retrieve both text and mathematical formulae, but the modest results indicate considerable scope for improvement. In this article, we have implemented a formula-embedding approach, which encodes formulae into fixed-dimensional embedding vectors. For encoding formulae, we have used a universal sentence encoder–based sentence-embedding model, which relies on the transformer architecture and the deep averaging network. The proposed model takes a LaTeX formula as input and produces a fixed-dimensional embedding representation as output. To achieve more promising results, the transformer model follows stacked self-attention, point-wise fully connected layers and positional encoding for both the encoder and decoder. The obtained results have been compared with state-of-the-art existing approaches, and the comparison study revealed that the proposed approach offers better retrieval accuracy in terms of precision.
1. Introduction
Mathematical information retrieval (MIR) is a well-known, fast-growing research field in the domain of Natural Language Processing (NLP), and it reflects the significant demand for enhancement in mathematical knowledge management [1]. The prime goal of an MIR system is to retrieve scientific documents/formulae that are relevant to a queried formula [2]. Existing search engines handle plain text, images and audio efficiently, but are insufficient to handle mathematical notation due to its scripting style, scientific symbols and mathematical tags [3]. To enable the search and access of such information, a retrieval system for mathematical information is required. A large amount of manual effort has gone into mathematical knowledge management systems to support searchable tools [4]. As this area develops rapidly, manual management is no longer sufficient, and MIR technology is required to achieve an efficient search.
The recent advancement in the education domain mostly utilises digital resources, which include the digital classroom [5], game-based learning [6] and knowledge sharing platforms such as Quora and Math Stack Exchange (MSE). For students, the web is a primary source for searching information relevant to their studies. The searched information may contain text and/or mathematical formulae. Besides this, students are bound to search for relevant material (articles, documents) owing to their limited background knowledge. Digitally available scientific documents represent textual information as strings of characters, while mathematical notations are represented in structured markup such as LaTeX or MathML.
Ambiguity is the most common problem in NLP tasks [16]. Similarly, mathematical language processing also suffers from ambiguity problems. In mathematics, some formulae have the same representation but hold different meanings; for instance, P(x) may mean the probability of 'x' or 'P' multiplied by 'x'. Sometimes, mathematical notations possess alternative representations; for example, a permutation of k events selected from n distinct events has several representations: P(n, k), nPk, P^n_k and n!/(n-k)!.
RQ1: How can we apply neural network–based representation strategies to 'mathematical language' (for instance, word embedding)?
RQ2: How can we use neural-network-based representation technologies to promote MIR efficiency?
RQ3: How can we use the joint embedding model?
RQ4: How can mathematical search assessment be based on a representative task?
In this article, we have implemented a mathematical formula-embedding approach, which encodes a formula into an embedding vector. We have used a universal sentence encoder (USE)-based sentence-embedding model for encoding a formula, which relies on the transformer architecture and the deep averaging network (DAN). The proposed embedding model takes a LaTeX formula as input and produces a fixed-dimensional embedding representation as output. The transformer model follows a stack of self-attention and point-wise fully connected layers for both the encoder and decoder to achieve more refined results. The performance of the proposed approach has been tested using the MSE corpus of ARQMath 2020, and the obtained results have been compared with existing state-of-the-art MIR approaches. The experimental results show that the proposed approach makes a remarkable contribution to the field of MIR and holds the potential to brighten the future of MIR-based applications.
The following sections present the prior work related to the MIR domain, a detailed account of the dataset, a detailed description of the methodology, the experimental results, and the conclusions with further research directions.
2. Related work
Numerous forms have been proposed to represent mathematical formulae in a unified and precise structure, such as vector-based [15], tree-based [17] and neural network–based approaches [10]. Prior MIR research has resolved many challenges, such as mapping the syntactic and semantic forms of mathematical expressions [18], similarity estimation of semantically similar formulae [19] and association of mathematical expressions with their context [20]. The promising signs of continuous progress and increasingly sophisticated techniques have shown remarkable growth in the field of MIR. At NTCIR-10, MIaS [21] was one of the best-performing MIR systems; it used several preprocessing operations to handle mathematical data. At the NTCIR-11 Math-2 Task [22], MIaS was strengthened with a query expansion technique and improved canonicalisation of both the Presentation and Content MathML formats of formulae, and it was attested that the Content MathML format of a formula is less ambiguous than the Presentation MathML format [23]. Moreover, team IFISB_QUALIBETA [24] combined features extracted from formulae and their context, including the category of the formula, the sets of identifiers, constants and operators, and the noun-context and verb-context. The features extracted from formulae and context were indexed using the Elasticsearch engine. This system aims to capture both the semantic meaning and the syntactic structure of formulae and to combine them to find relevant results. The MATHWEBSEARCH (MWS) system of team KWARC [25] is a web application and a complete system for crawling, indexing and searching documents that contain formulae and text. It provides low-latency answers to full-text queries consisting of keywords and formulae. MWS comprises a custom math search engine, which uses a compressed formula representation (using substitutions) to build an in-memory index, and a text engine based on Apache Solr/Elasticsearch. MWS front-ends convert formula schemata (with query variables) into Content MathML expressions, which the MWS formula indexer answers by unification, and combine the results with keyword results from a text search engine. The variable typing approach [26] assigns a mathematical type (a technical term in mathematics) to the corresponding mathematical symbol, and it is defined on four basic assumptions: first, typing is performed at the sentence level (a type is assigned to a variable that occurs in the same sentence); second, variables and types in the sentence are known a priori; third, edges in the same sentence are independent of each other; fourth, edges in different sentences are independent of each other. Furthermore, the performance of the variable typing approach was tested against two baseline systems, namely the nearest-type baseline and the SVM proposed by Kristianto et al. [27,28], and three newly proposed approaches, namely an extended version of the SVM baseline, a convolutional neural network and a bidirectional long short-term memory (LSTM) network [29,30]. Among these approaches, the bidirectional LSTM achieved remarkable results. The SciMath system [31] used the Presentation MathML format of a formula and translated it to a string via a Structure Encoded String (SES). This SES string is then transformed into a bit vector using a mapping table and indexed using a B-Tree indexing scheme. The prime contribution of this system is to preserve the structural meaning of formulae.
In MIR, the system's efficiency and user satisfaction mostly depend on the symbol entities and the structural format of the formula rather than its semantics. The Maximum Subtree Similarity (MSS) approach [32] performed formula retrieval with best query match, unification and wildcard support. However, deploying MSS on massive data is costly and time-consuming. To address this, the Tangent-3 system [33] retrieved formulae using an inverted index over pairs of symbols and ranked them using the Dice coefficient; the final retrieval results were re-ranked using MSS, and the top-k relevant results with respect to the queried formula were returned. The formula2vec approach [11] analysed the distinct traits of natural and mathematical language and learned distributed representations for mathematical symbols. The experimental results show that formula2vec combined with a language model achieved remarkable results compared with the individual models. The unsupervised equation embeddings (EqEmb) approach [10] learned distributed representations of mathematical formulae; each formula is treated as a single word, and a strong semantic interpretation is obtained from the words that occur in a potentially larger window around it. The natural premise selection approach [8] gathered supporting interpretations and hypotheses that are useful to produce implicit mathematical evidence for a specific statement.
The extraction of mathematical knowledge is beneficial to various tasks, from the retrieval of mathematical knowledge to making scientific articles accessible to visually impaired readers. For instance, a rule-based strategy [34] extracted the identifiers occurring in formulae.
Structured and sufficient data is a key factor for any well-performing machine learning or NLP application. To evaluate the performance of an information retrieval system, the collection of training and test data should be adequate. Stathopoulos et al. [40] prepared a real-world, research-level test collection for mathematical information needs together with relevance judgements. This collection was obtained from the MathOverflow website and consists of 160 test queries derived from 120 MathOverflow discussion threads. In general, mathematical formulae are diverse in terms of syntax and semantics. To optimise the retrieval of semantically and syntactically similar formulae, the HFS-BERT (Hesitation Fuzzy Sets–Bidirectional Encoder Representations from Transformers) approach [41] considers formulae in their context. The HFS determines the membership degree of the symbols of a mathematical formula, whereas BERT computes the formula-context similarity. As the final retrieval result, scientific documents are ranked and retrieved based on their context similarity.
The variable-size formula-embedding approach [42] was one of the participants in the ARQMath-2020 [43] formula search task. In this approach, the formula (in Presentation MathML format) is represented as a vector whose size depends on the number of entities present in the formula, with each entity associated with its position in a Bit Position Information Table (BPIT). At ARQMath-2020 [43], DPRL was one of the best-performing research teams; it introduced the Tangent Combined FastText (Tangent-CFT) system [44]. The Tangent-CFT system uses both SLT and OPT representations of formulae to capture a formula's appearance and syntax. Tangent+CFT is an extension of the Tangent-CFT embedding model in which each formula has two vector representations: a formula vector of size 300 obtained by Tangent-CFT, and a text vector of size 100 (the fastText default) obtained by treating the formula as a word. Moreover, team MIRMU [45] participated with two different approaches, i.e. Formula2Vec and the Soft Cosine Measure (SCM). The Formula2Vec system infers document and formula embeddings using the Doc2Vec DBOW model, whereas the SCM represents the formula using TF-IDF with unsupervised word embeddings. Most existing MIR systems have successfully retrieved syntactically similar formulae but failed to retrieve semantically similar formulae. To address this obstacle, the formula embedding and generalisation approach [46] transformed formulae into 202-bit vectors. In addition, the context of the formula is taken into consideration, which also highlights the significance of the dissimilarity factor in the computation of similarities between formulae. To identify the strengths and frailties of formula representation techniques, the learning-to-rank model [47] used SVM-rank over similarity scores as features and showed that combining features from the different similarities achieves state-of-the-art performance.
3. Dataset description
The organisers of the ARQMath 2020 task [43] have provided the MSE corpus, which contains knowledge-sharing posts and mathematics question-answer threads. MSE is a knowledge-sharing question-answer platform on which users can search for needed mathematical information and answer and share knowledge for math-based questions using both text and mathematical notation. The MSE corpus of ARQMath 2020 contains 28,320,920 formulae derived from the question, answer and comment posts of MSE. To capture both the appearance of a formula and the syntactic hierarchy of its operators and arguments, the organisers provided the formulae in the Presentation and Content MathML formats. In addition to these, the organisers provided the formulae in LaTeX format.
Math stack exchange corpus of ARQMath 2020.
4. Methodology
4.1. Preprocessing
Formulae in the corpus are preprocessed before embedding; in particular, each LaTeX formula is converted to lower case so that formulae that differ only in letter case receive the same representation and remain symmetric with the lower-cased queries described in Section 5.2.
4.2. System architecture
Word embedding is one of the most common text representations. It captures the meaning of words and the semantic and syntactic correlations and similarities between them. Word embeddings describe a word as a low-dimensional vector, and to combine them an appropriate composition function is required; a composition function is a mathematical framework that combines multiple word vectors into a single vector. Prior research has shown that embeddings of longer input strings or sentences achieve excellent performance on semantic textual similarity (STS) tasks [48]. Motivated by this, we perform formula embedding, which encodes formulae into embedding vectors. For encoding a formula, we use the USE model [49], which is based on two state-of-the-art sentence-encoding frameworks, that is, the transformer architecture [50] and the DAN [51]. The proposed embedding approach takes a LaTeX formula as input and produces a fixed-dimensional embedding vector as output. Both models (transformer architecture and DAN) are implemented in TensorFlow and have different design objectives: the transformer architecture aims at high precision at the expense of higher model complexity and heavier resource consumption, while the DAN aims at efficient inference with slightly lower precision.
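As a minimal illustration of this pipeline, the sketch below loads a pretrained USE module from TensorFlow Hub and embeds two LaTeX formula strings into fixed-dimensional vectors. The module handle and the example formulae are illustrative assumptions rather than the exact checkpoint and data used in this work.

```python
# Minimal sketch: embedding LaTeX formula strings with a pretrained USE module.
# The TF-Hub handle below is the publicly released USE model and is shown for
# illustration only; it is not necessarily the checkpoint used in this work.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

formulas = [
    r"\int_0^1 x^2 \, dx",                       # example LaTeX formulae (illustrative)
    r"\sum_{k=1}^{n} k = \frac{n(n+1)}{2}",
]

vectors = embed(formulas).numpy()                # fixed-dimensional embeddings
print(vectors.shape)                             # e.g. (2, 512)
```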
4.2.1. Transformer
The encoder–decoder-based transformer architecture transforms one sequence into another. The encoder and decoder each consist of several modules that can be stacked on top of one another; the number of stacked modules is denoted by Nx and is set to 6. The encoder maps the symbol representations of the input sequence (x1,…,xn) to a sequence of continuous representations z = (z1,…,zn). Then z is passed to the decoder, which renders the output sequence of symbols (y1,…,ym) one element at a time. The model is auto-regressive at every step: the previously generated tokens are consumed as additional input when generating the next one. The transformer model uses a stack of self-attention and point-wise fully connected layers for both the encoder and the decoder. The overall framework of the transformer architecture is shown in Figure 1.

The transformer model architecture [50].
4.2.2. Encoder
The encoder module follows a stack of identical layers (N = 6), where each layer consists of two sub-layers, that is, a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. In addition, a residual connection is applied around each sub-layer, followed by layer normalisation. The outcome of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. All the sub-layers of the model, as well as the embedding layers, produce outputs of dimension d_model = 512 [50].

The transformer’s encoder [50].
4.2.3. Decoder
The decoder module follows the same stack of identical layers (N = 6). In addition to the two sub-layers of each encoder layer, the decoder adds a third sub-layer, whose prime task is to perform multi-head attention over the encoder's output. As in the encoder, a residual connection is applied around each sub-layer, followed by layer normalisation. The self-attention sub-layer in the decoder stack is also masked to prevent each position from attending to subsequent positions. The visual presentation of the decoder is shown in Figure 3.

The transformer’s decoder [50].
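A minimal sketch of the look-ahead mask that implements this masking is shown below; the helper name and the NumPy realisation are illustrative, not the exact implementation used in this work.

```python
# Sketch: the look-ahead mask that keeps decoder position i from attending to
# positions j > i. Entries equal to 1 mark scores to be hidden (set to -inf)
# before the softmax in the decoder self-attention.
import numpy as np

def look_ahead_mask(seq_len: int) -> np.ndarray:
    # Upper-triangular ones above the diagonal mark the "future" positions.
    return np.triu(np.ones((seq_len, seq_len)), k=1)

print(look_ahead_mask(4))
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```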
4.2.4. Multi-head attention
The main element of the transformer architecture is the multi-head self-attention framework, which is pictorially represented in Figure 4. The transformer treats the encoded representation of the input as a set of key-value pairs (K, V) of dimension n (the input sequence length), where the keys and values are the encoder hidden states. On the decoder side, the previous output is compressed into a query Q of dimension m, and this query together with the key-value pairs is used to generate the next output. For this, the transformer architecture uses scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with the corresponding key, scaled and passed through a softmax, that is, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

Multi-head scaled dot-product attention mechanism [50].
Instead of computing attention only once, the multi-head attention mechanism runs the scaled dot-product attention several times in parallel. The individual attention outputs are concatenated and linearly projected into the intended dimension. According to Vaswani et al. [50], multi-head attention enables the model to jointly gather knowledge from different representation sub-spaces at different positions:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

and where the projections are parameter matrices W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v) and W^O ∈ R^(h·d_v × d_model). In our work, we have used h = 8 parallel attention layers, or heads. For each of these, we use d_k = d_v = d_model / h = 64.
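The following NumPy sketch illustrates scaled dot-product attention and its multi-head combination with the dimensions stated above (d_model = 512, h = 8, d_k = d_v = 64). The random weight initialisation and function names are illustrative assumptions, not the trained parameters of the proposed model.

```python
# Sketch of scaled dot-product attention and multi-head attention in NumPy.
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h                                  # 64

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # query-key similarities
    if mask is not None:
        scores = np.where(mask == 1, -1e9, scores)        # hide masked positions
    return softmax(scores) @ V                            # weighted sum of values

def multi_head_attention(Q, K, V):
    rng = np.random.default_rng(0)
    W_O = rng.normal(size=(h * d_v, d_model))             # output projection
    heads = []
    for _ in range(h):
        W_Q = rng.normal(size=(d_model, d_k))
        W_K = rng.normal(size=(d_model, d_k))
        W_V = rng.normal(size=(d_model, d_v))
        heads.append(scaled_dot_product_attention(Q @ W_Q, K @ W_K, V @ W_V))
    return np.concatenate(heads, axis=-1) @ W_O           # Concat(head_1..head_h) W_O

x = np.random.default_rng(1).normal(size=(10, d_model))   # 10 token embeddings
print(multi_head_attention(x, x, x).shape)                # (10, 512), self-attention
```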
4.2.5. Position-wise feed-forward networks
Besides the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically and consists of two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2. The input and output dimension is d_model = 512, and the inner layer has dimensionality d_ff = 2048 [50].
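A compact sketch of this position-wise feed-forward network with the stated dimensions is given below; the random weights are purely illustrative.

```python
# Sketch of the position-wise FFN: FFN(x) = max(0, xW1 + b1) W2 + b2,
# applied identically to every position, with d_model = 512 and d_ff = 2048.
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def position_wise_ffn(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2     # ReLU between two linear maps

x = rng.normal(size=(10, d_model))                  # 10 positions
print(position_wise_ffn(x).shape)                   # (10, 512)
```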
4.2.6. Embeddings and softmax
The input and output tokens are converted into vectors of dimension d_model using learned embeddings, and the decoder output is transformed into predicted next-token probabilities by a learned linear transformation followed by a softmax function [50].
4.2.7. Positional encoding
By default, the transformer model contains no recurrence and no convolution with which to preserve and utilise the order of the tokens (words/symbols). Because every token of the formula flows through the transformer's encoder/decoder stack simultaneously, the model cannot sense the position or order of tokens. One feasible way to provide a sense of token order is to add to each token a piece of information about its position in the formula; this process is referred to as positional encoding. The positional encoding satisfies two essential criteria: first, it is not a single-value parameter but a d-dimensional vector that contains information about the token's location in a formula; second, this encoding is not integrated into the model itself, but is instead used to equip each symbol with information about its position in the formula.
In our proposed model, we add this positional encoding to the input embeddings at the bottom of the encoder and decoder stacks; it has the same dimension d_model as the embeddings, so that the two can be summed. Sine and cosine functions of different frequencies are used:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),

where pos is the position and i is the dimension, such that every dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000·2π [50].
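The sinusoidal encoding defined above can be computed as in the following sketch; the function name and the NumPy implementation are illustrative.

```python
# Sketch of the sinusoidal positional encoding:
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); this matrix is added to the input embeddings
```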
4.2.8. Deep averaging network
The DAN is a straightforward model: it simply averages the word embeddings of the input tokens and bigrams and then passes the average through a deep feed-forward neural network to generate the final embedding for a larger sentence. The basic assumption of feed-forward deep neural networks is that each layer learns a more abstract representation of the input than the previous one [52]. To apply this principle to the neural bag-of-words (NBOW) model [53], each layer is required to make increasingly small but meaningful refinements of the word-embedding average. To be more concrete, consider a formula consisting of symbol tokens whose embedding vectors are v_1, …, v_n. The average z = (1/n) Σ_{i=1}^{n} v_i of the symbol vectors is then passed through one or more non-linear hidden layers, h_j = f(W_j · h_{j-1} + b_j) with h_0 = z, and the output of the final layer is taken as the formula embedding.
This model is still unordered, but its depth allows it to capture small input variations better than the regular NBOW model. The overall process described above constitutes the DAN, and we make use of it: the input embeddings of the symbols and bigrams are averaged and then processed through a feed-forward deep neural network to generate the formula embeddings. The visual delineation of the DAN model is shown in Figure 5.
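The following sketch illustrates the DAN computation described above: token embeddings are averaged and the average is refined by a small feed-forward stack. The dimensions, depth and tanh activation are illustrative choices rather than the exact USE-DAN configuration.

```python
# Sketch of a deep averaging network over formula symbol embeddings.
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hidden = 300, 300
W1, b1 = rng.normal(size=(d_emb, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_hidden)), np.zeros(d_hidden)

def dan_embedding(token_vectors: np.ndarray) -> np.ndarray:
    z = token_vectors.mean(axis=0)          # average of symbol/bigram embeddings
    h1 = np.tanh(z @ W1 + b1)               # each layer refines the average
    h2 = np.tanh(h1 @ W2 + b2)
    return h2                               # final formula embedding

tokens = rng.normal(size=(12, d_emb))       # embeddings of 12 formula symbols
print(dan_embedding(tokens).shape)          # (300,)
```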

The deep averaging network architecture.
4.3. Similarity
Cosine similarity is used to compute the similarity between documents, and based on it the documents are ranked with respect to the user-entered query. Statistically, it calculates the cosine of the angle between two vectors in a multidimensional space [54].
In this work, the vector containing the embedding of each corpus formula is compared with the embedding of the user-entered query formula. The mathematical form of cosine similarity is given below:

cos(A, B) = (A · B) / (‖A‖ ‖B‖),

where A and B are the embedding vectors of the query formula and a corpus formula, respectively.
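A short sketch of ranking corpus formulae by cosine similarity to a query embedding is shown below; the vector dimensionality and the random vectors are placeholders for the actual formula embeddings.

```python
# Sketch: rank corpus formulae by cosine similarity to the query embedding.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query_vec = rng.normal(size=512)               # embedding of the query formula
corpus_vecs = rng.normal(size=(1000, 512))     # embeddings of corpus formulae

scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
top_k = np.argsort(scores)[::-1][:10]          # indices of the 10 best matches
print(top_k)
```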
5. Experimental design and results
5.1. Experimental environment
To validate our claims, the experiments were run on a standalone Ubuntu 18.04 desktop. The configuration of the experimental environment is given in Table 2. During the experiments, we carefully validated our approach and avoided any kind of noise.
Experimental environment.
CPU: central processing unit; RAM: random access memory; HDD: hard disk drive.
5.2. Queryset description
The queryset contains 45 mathematical formulae extracted from question posts of the year 2019 of MSE [43]. The queryset contains both simple and complex mathematical formulae, and each query is coupled with a query ID. To retrieve relevant and more refined formulae, queries are converted to lower case, maintaining symmetry with the trained data. The organisers of the ARQMath 2020 task provided the queries (topics) in an XML file with a predefined format, as shown in Figure 6. Each query in the XML file is enclosed in <topic> and </topic> tags and has a unique query number, that is, B.x, where 'x' represents the query number. Formula_Id gives the ID of the formula, Latex gives the LaTeX representation of the formula, Title gives the question title of the post, Question gives the question body from which the formula is selected, and Tags gives the comma-separated tags of the question. The 45 queries used are shown in Table 8.
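The sketch below illustrates how such a topic file could be parsed into query records using the fields described above; the tag names, attribute name and file name are assumptions based on this description and may differ from the released ARQMath XML.

```python
# Sketch of reading the topic file described above. Tag/attribute names and the
# file name are assumptions drawn from the textual description, not the exact
# ARQMath schema.
import xml.etree.ElementTree as ET

def load_queries(path: str):
    queries = []
    root = ET.parse(path).getroot()
    for topic in root.iter("topic"):
        queries.append({
            "query_id": topic.attrib.get("number"),              # e.g. "B.27"
            "formula_id": topic.findtext("Formula_Id"),
            "latex": (topic.findtext("Latex") or "").lower(),    # lower-cased, as in preprocessing
            "title": topic.findtext("Title"),
        })
    return queries

# queries = load_queries("Topics.xml")   # hypothetical file name
```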

Query representation.
5.3. Gold dataset description
The performance validation of the proposed approach is done against the gold dataset released by ARQMath 2020 [43], and the structure of the gold dataset adheres to the Text REtrieval Conference (TREC) qrel format [55]. The gold dataset contains a set of formulae that have been judged as relevant (3, 2 or 1) or irrelevant (0) by human assessors. The gold dataset has four attributes: (1) Query_ID, (2) Iteration, (3) Formula_ID and (4) Relevance. Query_ID identifies the specific query in the queryset. Iteration is an immaterial attribute, which is ignored by the TREC tool. Formula_ID specifies the unique identity number of the formula, and the Relevance attribute specifies the human judgement as 3, 2 or 1 (relevant) or 0 (irrelevant). The format of the gold dataset is shown in Table 3. In the gold dataset, the relevance decisions are relatively biased towards non-relevant formulae: of the 12,116 decisions in the gold dataset, 7891 (65.13%) are 0 (irrelevant), 718 (5.93%) are 1 (partially relevant), 553 (4.56%) are 2 (nearly similar) and 2954 (24.38%) are 3 (relevant).
Format of gold dataset.
5.4. Format of result set
The result set contains the search results, that is, the formulae retrieved by the proposed approach. The result set follows the TREC format [55] and consists of six distinct fields, namely Query_ID, Formula_ID, Post_ID, Rank, Similarity and RunID, to denote the retrieved search results for each query present in the queryset. Out of these six fields, three, namely Query_ID, Formula_ID and Similarity, are examined by the assessment tool, whereas the remaining fields are discarded. The similarity score is a floating-point value ranging from 0.0 to 1.0 and takes distinct values for each formula that contains the exact query term and/or a sub-query term. The rank field is implicitly defined and is therefore ruled out. For each queried formula, the proposed approach retrieves the top 1000 formulae. The format of the result set is shown in Table 4.
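The following sketch shows how result rows in this six-field format could be written; the field separator, run identifier and example values are illustrative assumptions.

```python
# Sketch: writing the six-field result rows described above
# (Query_ID, Formula_ID, Post_ID, Rank, Similarity, RunID).
def write_result_set(path, results, run_id="USE_formula_embedding"):
    # results: iterable of (query_id, formula_id, post_id, rank, similarity)
    with open(path, "w", encoding="utf-8") as out:
        for query_id, formula_id, post_id, rank, similarity in results:
            out.write(f"{query_id}\t{formula_id}\t{post_id}\t{rank}\t{similarity:.4f}\t{run_id}\n")

# Example with hypothetical identifiers and score.
write_result_set("results.tsv", [("B.27", "f123", "p456", 1, 0.9731)])
```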
Format of result set.
5.5. Evaluation parameters
The proficiency of the proposed approach is measured in terms of
where
where Q is the total number of queries in the queryset,
Besides
where Q is the number of queries.
5.6. Results analysis
For evaluating the performance of the proposed approach, the trec_eval tool [55] is used, which compares the gold dataset with the result set obtained from the proposed system. The proposed approach delivers the best precision results compared with the FormulaEmbedding, Tangent+CFT, Formula2Vec and SCM systems, as shown in Table 5 and Figure 7, respectively. The obtained results outperform existing state-of-the-art formula retrieval approaches and successfully handle the retrieval of exact-match formulae, sub-formulae and parent formulae. In any search engine or retrieval system, the response time or retrieval time is an essential parameter: a retrieval system with very good retrieval accuracy but a high retrieval time is still not considered a well-performing retrieval system. Descriptive and qualitative studies reveal that search sessions for mathematical information are usually longer and less efficient than general search sessions for textual information [57]. A well-performing retrieval system should deliver good retrieval accuracy with minimum retrieval time. Accordingly, the minimum, maximum and average retrieval times of the proposed approach are shown in Table 6. Moreover, the aim of the test queries is to verify the properties of math-aware search engines, such as retrieval of sub-formulae, parent formulae, similar formulae and nearly similar formulae, retrieval time and formula representation ability, which are briefly explained as follows:
Performance comparison of the team NLP_NITS system, team DPRL system, team MIRMU system and our proposed system.
SCM: soft cosine measure.

Result comparison.
The minimum, maximum and average run time in seconds of team DPRL system, team MIRMU system and our proposed system.
SCM: soft cosine measure.
Of the four research questions on applying neural network–based language representation to mathematical embedding raised in the 'Introduction' section, the proposed formula-embedding model has successfully addressed the first, second and third. The fourth research question has been partially addressed and will be tackled completely in our future work.
In any retrieval system, the response time or retrieval time is a significant factor in its performance. The proposed approach takes an average of 62 s to retrieve the relevant searches, as highlighted in Table 6. This average retrieval time is considerable, and in future work we will try to minimise the retrieval time of our proposed approach.
The proposed approach provides coherent semantic representations for equations, fits the data better than existing embedding approaches and successfully infers meaningful semantic relationships between equations.
The retrieval efficiency of math-aware search engines depends not only on the retrieval of the exact-match formula but also on how sub-formulae and parent formulae are handled. Usually, a sub-formula is a part of a queried formula; for example, the results obtained for query B.27, shown in Table 7, illustrate how the USE-based formula-embedding model effectively handles sub-formula retrieval.
Moreover, a parent formula is a formula that contains the queried formula; for example, the results obtained for query B.15 show the evaluation of the parent-formula search for the queried formula.
A nearly similar formula is a formula in the dataset that has a meaning similar to that of the queried formula; for example, the results obtained for query formulae B.7 and B.30 depict the retrieval accuracy of the nearly similar search.
The results obtained for query B.44 show that the preprocessing module of the proposed approach leads to more relevant searches. Among the retrieved formulae, the first two are identical to the queried formula except for their letter case. This behaviour indicates that the preprocessing module enhances the retrieval of syntactically similar formulae.
As with information expressed in natural language, we infer that the proposed model can also account for the order of operators and operands, in addition to its ability to capture the long-range alignment of a formula.
The proposed formula-embedding model has successfully disentangled the spatial structure of the formulae. For example,
Neural network–based representational models have shown remarkable performance in distributed word representations and semantic similarities between words. The obtained results confirm that a natural language representational model can represent formulae in a distributed vector space and is capable of preserving their syntax and semantics. Therefore, the research questions RQ1 and RQ2 highlighted in the 'Introduction' section are successfully addressed in terms of their applicability to mathematical language.
As highlighted by research question RQ3 in the 'Introduction' section, the proposed formula-embedding model successfully combines the strengths of two different embedding approaches to produce more accurate results.
The noticeable difference in the obtained results indicates that preserving the positions of symbols through positional encoding is an influential factor in the retrieval of mathematical formulae.
The DAN model achieves comparable, though slightly lower, retrieval accuracy than the transformer encoder and is trained in less time.
Retrieved search results.
Query set.
Therefore, we have sufficient evidence to conclude that the proposed model performs better for the retrieval of mathematical formulae and sets a benchmark for other systems on the MSE ARQMath 2020 data.
6. Conclusion and future scope
The progress and wide availability of data creation, storage and processing have encouraged the development of large data repositories and have therefore enabled information retrieval technology to grow towards effective and efficient search engines. In this study, we have explored the USE model, which follows the joint embedding approach of the transformer architecture and the DAN for the retrieval of mathematical information. The proposed model takes a LaTeX formula as input and produces a fixed-dimensional embedding representation as output. To accomplish more effective results, the transformer model follows the multi-head self-attention mechanism, position-wise feed-forward networks and the positional encoding framework for both the encoder and decoder. On its own, the DAN model takes less training time at the cost of slightly lower accuracy than the transformer model. The experimental results show that the proposed model produces satisfactory outcomes and retrieves more accurate search results than the existing state-of-the-art formula retrieval systems (FormulaEmbedding, Tangent+CFT, Formula2Vec and SCM).
The proposed approach currently does not handle the text content that accompanies mathematical formulae. In future work, the text content will be mapped to the formulae, which may eliminate some irrelevant search results. In addition, the assessment of mathematical search based on a representative task will be explored to address the defined research question (RQ4). The efficacy of other embedding methods will also be explored in future studies.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
