Abstract
This article presents an approach to recognising formula entailment, that is, finding entailment relationships between pairs of mathematical formulae. Because current formula-similarity-detection approaches fail to capture such broader relationships between formulae, recognising formula entailment becomes important. To this end, a long short-term memory (LSTM) neural network with symbol-by-symbol attention is implemented for recognising formula entailment. However, since no relevant training and validation corpora are available, the first step is to create a sufficiently large symbol-level MATHENTAIL data set in an automated fashion. Symbol pairs in the MATHENTAIL data set are assigned ‘entailment’ or ‘neutral’ labels depending on the degree of similarity between the corresponding symbol embeddings. An improved symbol-to-vector (isymbol2vec) method extracts mathematical symbols (in LaTeX) from a Wikipedia corpus of scientific documents and generates their embeddings using the Continuous Bag of Words (CBOW) architecture. Finally, the LSTM network, trained and validated on the MATHENTAIL data set, predicts entailment for test formula pairs with a reasonable accuracy of 62.2%.
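The similarity-based labelling step described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the similarity measure (cosine), the threshold value, and the toy embeddings are all assumptions, since the abstract does not specify them.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def label_symbol_pair(emb_a, emb_b, threshold=0.5):
    # Hypothetical criterion: the article states only that labels depend on
    # the degree of similarity between symbol embeddings, not the exact cutoff.
    return "entailment" if cosine_similarity(emb_a, emb_b) >= threshold else "neutral"

# Toy 2-dimensional embeddings for two symbol pairs (illustrative only;
# real isymbol2vec/CBOW embeddings would be higher-dimensional).
pairs = [([1.0, 0.0], [0.9, 0.1]),   # near-parallel vectors
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal vectors
print([label_symbol_pair(a, b) for a, b in pairs])  # → ['entailment', 'neutral']
```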
