Enhancing HMM-based POS tagger for Mizo language

Abstract

Part-of-speech (POS) tagging is the process of assigning each word its appropriate part of speech. Achieving good tagger performance requires a substantial amount of well-organized corpus data and significant research on the target language. Mizo is an under-resourced language that has received little attention in computational linguistics, and the limited availability of corpora and relevant literature complicates the task of assigning POS labels to Mizo text. This paper explores two methods for improving the Hidden Markov Model (HMM)-based POS tagger for the Mizo language. The proposed taggers are compared with the baseline HMM tagger and N-gram taggers on the designed Mizo corpus of 72,077 manually tagged tokens. The experimental results show that both proposed taggers improve on the HMM-based Mizo POS tagger, achieving 81.52% and 84.29% accuracy, respectively. In addition, a detailed performance analysis of the proposed hybrid tagger yields a weighted average precision, recall, and F1-score of 83.09%, 77.88%, and 79.64%, respectively.
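For readers unfamiliar with the weighted-average metrics reported in the abstract, the following is a minimal sketch of how support-weighted precision, recall, and F1-score are commonly computed for a tagger. It is not the authors' evaluation code: the function name `weighted_prf` and the toy tags in the example are assumptions made for illustration, and each per-tag score is weighted by that tag's share of the gold-standard tokens.

```python
from collections import Counter


def weighted_prf(gold, pred):
    """Support-weighted average precision, recall and F1 over POS tags.

    `gold` and `pred` are equal-length sequences of tag strings.
    Hypothetical helper for illustration only.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    support = Counter(gold)    # gold-standard occurrences per tag
    predicted = Counter(pred)  # how often the tagger emitted each tag
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    total = len(gold)

    precision = recall = f1 = 0.0
    for tag, n in support.items():
        # Per-tag scores; precision is 0 if the tag was never predicted.
        p = correct[tag] / predicted[tag] if predicted[tag] else 0.0
        r = correct[tag] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        # Weight each tag by its share of the gold tokens.
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


if __name__ == "__main__":
    # Toy example with made-up tags, not Mizo corpus data.
    gold_tags = ["NN", "NN", "VB", "JJ"]
    pred_tags = ["NN", "VB", "VB", "JJ"]
    print(weighted_prf(gold_tags, pred_tags))
```

Weighting by support lets frequent tags dominate the average, which is why these figures can differ noticeably from an unweighted (macro) average on a corpus with a skewed tag distribution.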

Keywords

Hybrid POS tagger, rule-based POS tagger, N-gram tagger, Mizo POS tagger, Hidden Markov Model
