Abstract
While natural language processing (NLP) has made significant strides in high-resource languages (HRLs), many languages remain underrepresented in training data. Cross-lingual transfer learning (CLT) has been proposed as a way to extend text classifiers trained on HRLs to low-resource languages (LRLs), mostly through zero- and few-shot techniques. Nevertheless, conventional multilingual language models often underperform on such languages, and coupling classification with multilingual language modeling does not yield optimal results for all language combinations. We propose MT-TC, a new hybrid framework that revisits and refines the traditional translate-test approach: MT-TC combines a text classifier trained on the high-resource language with a neural machine translation (NMT) model. Because the translations are “soft” and differentiable, gradients can be backpropagated end-to-end through the architecture to jointly fine-tune both components. We evaluate MT-TC on the Taxi1500 dataset across 20 typologically diverse languages. The proposed method reaches a macro F1-score of 0.88, outperforming baseline multilingual models by up to 12.5% in low-resource settings. These results show that MT-TC performs strongly when little annotated data is available and offers a scalable solution for real-world multilingual NLP applications.
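To make the “soft” differentiable translation idea concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: the NMT decoder's per-step vocabulary distribution is used to form an expected embedding that is fed to the HRL classifier, so gradients flow through the translation step. The names `nmt_decoder`, `classifier`, `tgt_embedding`, and the `inputs_embeds` keyword are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class MTTC(nn.Module):
    """Sketch of a soft translate-test pipeline (assumed design).

    The NMT component outputs vocabulary logits per target position;
    instead of discrete decoding, we take the probability-weighted
    mixture of target embeddings, keeping the pipeline differentiable
    end-to-end for joint fine-tuning of both components.
    """

    def __init__(self, nmt_decoder, classifier, tgt_embedding, temperature=1.0):
        super().__init__()
        self.nmt_decoder = nmt_decoder      # source tokens -> (batch, tgt_len, vocab) logits
        self.classifier = classifier        # HRL classifier accepting embedded inputs
        self.tgt_embedding = tgt_embedding  # nn.Embedding shared with the classifier
        self.temperature = temperature      # sharpness of the soft translation

    def forward(self, src_tokens):
        vocab_logits = self.nmt_decoder(src_tokens)
        # Soft translation: expected embedding under the decoder's
        # distribution, so backpropagation reaches the NMT parameters.
        probs = F.softmax(vocab_logits / self.temperature, dim=-1)
        soft_embeds = probs @ self.tgt_embedding.weight  # (batch, tgt_len, dim)
        return self.classifier(inputs_embeds=soft_embeds)
```

A lower `temperature` pushes the mixture toward discrete (hard) translations, which is one plausible way such a system could anneal from soft training toward standard translate-test inference.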
