Probabilistic finite-state morphological segmenter for Wixarika (Huichol) language

Abstract

In this work, we present a morphological segmenter for the Mexican indigenous language Wixarika. Segmentation is fundamental for languages with rich morphology, a common trait of Native American languages, to improve downstream tasks such as machine translation, dialogue systems, and summarization. Beyond the agglutinative nature of the language, the scarcity of resources and the lack of an orthographic standard across dialects add to the challenge. Our proposal is based on a probabilistic finite-state approach that exploits regular agglutinative patterns and requires little linguistic knowledge. We show that our approach outperforms unsupervised and semi-supervised methods in a low-resource context. The dataset used in this work has been openly released for future work by the community.

Keywords

Morphology, low resources, finite-state transducer, Wixarika, endangered languages
