Abstract
Expressive but complex kernel functions, such as Sequence or Tree kernels, are usually underemployed in NLP tasks due to their significant computational cost in both the learning and classification stages. Recently, the Nyström methodology for data embedding has been proposed as a viable solution to these scalability problems: it improves the scalability of learning over highly structured data by mapping instances into low-dimensional, compact linear representations of kernel spaces. In this paper, a stratification of the model corresponding to the embedding space is proposed as a further, highly flexible optimization. Nyström embedding spaces of increasing size are combined in an efficient ensemble strategy: upper layers, providing higher-dimensional representations, are invoked on input instances only when the adoption of smaller (i.e., less expressive) embeddings yields uncertain outcomes. Experimental results using different models of such uncertainty show that state-of-the-art accuracy on three semantic inference tasks can be obtained even when an order of magnitude fewer kernel computations are carried out.
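To make the underlying embedding step concrete, the following is a minimal sketch of a Nyström projection in NumPy. It assumes an RBF kernel and landmarks taken as the first rows of the data purely for illustration; the paper targets structured kernels (Sequence and Tree kernels) and its own landmark-selection and stratified-ensemble machinery, which are not reproduced here.

```python
import numpy as np

def nystrom_embedding(X, landmarks, kernel, eps=1e-10):
    """Map each row of X into an m-dimensional linear space that
    approximates the kernel space induced by `kernel` (m = #landmarks)."""
    # K_nm[i, j] = kernel(X[i], landmarks[j])
    K_nm = np.array([[kernel(x, l) for l in landmarks] for x in X])
    # K_mm: kernel Gram matrix among the m sampled landmarks
    K_mm = np.array([[kernel(a, b) for b in landmarks] for a in landmarks])
    # Embedding = K_nm @ K_mm^{-1/2}, computed via eigendecomposition
    w, V = np.linalg.eigh(K_mm)
    w = np.maximum(w, eps)  # guard against tiny negative eigenvalues
    K_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return K_nm @ K_inv_sqrt

# Illustrative usage: 100 points in R^3, 5 landmarks -> 5-dim embedding
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
Z = nystrom_embedding(X, X[:5], rbf)  # shape (100, 5)
```

Inner products in the embedded space approximate the original kernel evaluations (exactly so between landmarks), which is what lets a linear classifier over `Z` stand in for an expensive kernel machine.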