Abstract
User-generated text in social networks is often not written in the standard form of the language. Such noisy text increases dispersion in datasets and leads to inconsistent data, so normalizing it is a crucial preprocessing step for common Natural Language Processing tools. In this paper we survey the state of the art in machine-translation approaches to text normalization under low-resource conditions. We also propose an auxiliary task, novel to text normalization, for the sequence-to-sequence (seq2seq) neural architecture, which improves the base seq2seq model by up to 5%. This performance gain closes the gap between statistical and neural machine translation approaches for low-resource text normalization.
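The abstract mentions training the seq2seq model with an auxiliary task, i.e. multi-task learning. A minimal sketch of the usual loss combination is shown below; the function and parameter names (`combined_loss`, `aux_weight`) are hypothetical, and the paper's actual auxiliary task and weighting scheme are not described in the abstract.

```python
# Minimal multi-task learning sketch (plain Python, hypothetical names):
# the main normalization loss and an auxiliary-task loss are combined
# into a single training objective via a weighted sum.

def combined_loss(normalization_loss: float,
                  auxiliary_loss: float,
                  aux_weight: float = 0.5) -> float:
    """Weighted sum of the main seq2seq loss and the auxiliary loss.

    aux_weight is a hypothetical hyperparameter controlling how much
    the auxiliary task influences the shared encoder's parameters.
    """
    return normalization_loss + aux_weight * auxiliary_loss
```

In this setup both tasks typically share the encoder, so gradients from the auxiliary loss act as a regularizer on the shared representation, which is one common explanation for gains in low-resource settings.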
