Abstract
The constant increase in the production of scientific literature is making it very difficult for experts to keep up to date with the state-of-the-art knowledge in their fields. The use of Natural Language Processing (NLP) is becoming a necessary aid to tackle this challenge. In the NLP field, the task of measuring semantic similarity between two sentences plays a vital role. It is a cornerstone for tasks like Q&A, Information Retrieval, Automatic Summarization, etc., and it is a crucial element in the ultimate goal of computers being able to decode what is conveyed in human language expression.
Measuring Semantic Similarity (SS) in short texts poses specific challenges. Because there are fewer words to compare, the meaning contribution of each word is more relevant, and it is important to take into account the contribution of syntax to the composed meaning. In addition, the highly specific and specialized vocabulary of Microbial Transcriptional Regulation implies a lack of massive training resources. Our approach has been to use an ensemble of similarity metrics, including string, distributional, and knowledge-based metrics, and to combine the results of such analyses. We have trained and tested these methods on a similarity corpus developed in-house.
The task proved very challenging, and the ensemble strategy turned out to be a good approach. Although there is still much room for improvement in the precision of our methods with respect to the human evaluation, we have managed to improve them, reaching a strong correlation with human judgments.
