Abstract
While natural language processing (NLP) has made significant strides in high-resource languages (HRLs), many languages remain underrepresented in training data. Cross-lingual transfer learning (CLT) has been proposed as a way to extend text classifiers trained on HRLs to low-resource languages (LRLs), mostly through zero- and few-shot techniques. Nevertheless, conventional multilingual language models often underperform on such languages, and coupling classification with multilingual language modeling does not yield optimal results for all language combinations. We propose MT-TC, a new hybrid framework that revisits and refines the traditional translate-test approach: MT-TC combines a text classifier trained on the high-resource language with a neural machine translation (NMT) model. Because the translations are “soft” and differentiable, gradients can be backpropagated end-to-end through the architecture to jointly fine-tune both components. We evaluate MT-TC on the Taxi1500 dataset across 20 typologically diverse languages. The proposed method reaches a macro F1-score of 0.88, outperforming baseline multilingual models by up to 12.5% in low-resource settings. These results show that MT-TC performs strongly when little annotated data is available and offers a scalable solution for real-world multilingual NLP applications.
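To make the “soft” differentiable translation idea concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: the NMT decoder's per-step vocabulary distribution is used to form an expected embedding that is fed to the HRL classifier, so gradients flow through the translation step. The names `nmt_decoder`, `classifier`, `tgt_embedding`, and the `inputs_embeds` keyword are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class MTTC(nn.Module):
    """Sketch of a soft translate-test pipeline (assumed design).

    The NMT component outputs vocabulary logits per target position;
    instead of discrete decoding, we take the probability-weighted
    mixture of target embeddings, keeping the pipeline differentiable
    end-to-end for joint fine-tuning of both components.
    """

    def __init__(self, nmt_decoder, classifier, tgt_embedding, temperature=1.0):
        super().__init__()
        self.nmt_decoder = nmt_decoder      # source tokens -> (batch, tgt_len, vocab) logits
        self.classifier = classifier        # HRL classifier accepting embedded inputs
        self.tgt_embedding = tgt_embedding  # nn.Embedding shared with the classifier
        self.temperature = temperature      # sharpness of the soft translation

    def forward(self, src_tokens):
        vocab_logits = self.nmt_decoder(src_tokens)
        # Soft translation: expected embedding under the decoder's
        # distribution, so backpropagation reaches the NMT parameters.
        probs = F.softmax(vocab_logits / self.temperature, dim=-1)
        soft_embeds = probs @ self.tgt_embedding.weight  # (batch, tgt_len, dim)
        return self.classifier(inputs_embeds=soft_embeds)
```

A lower `temperature` pushes the mixture toward discrete (hard) translations, which is one plausible way such a system could anneal from soft training toward standard translate-test inference.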
