Detecting errors in machine translation using residuals and metrics of automatic evaluation

Abstract

Errors and residuals are closely related measures of deviation. An error is the deviation of an observed value (PEMT output) from the expected value (MT output), whereas the residual of an observed value is the difference between the observed and the predicted value of quality. We propose an exploratory data analysis technique as an instrument for evaluating and improving machine translation (MT) systems. The main contribution is a rigorous statistical method, residual analysis, that is novel to MT evaluation research; it identifies differences between MT output and post-edited machine translation (PEMT) output with respect to a human translation (the reference). Residual analysis of automatic metrics can help to discover significant differences between MT and PEMT and to identify questionable issues concerning the single reference. In this study, we demonstrate the use of residuals in MT evaluation. Using residual analysis, we identified sentences in which the scores of automatic metrics differed significantly between MT output and PEMT output for translation from Slovak into English.
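The idea of flagging sentences by their residuals can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the predicted value is taken to be the mean PEMT-minus-MT score difference over all sentences, a sentence is flagged when its residual exceeds two sample standard deviations, and the function name `flag_residuals` and the example scores are hypothetical.

```python
from statistics import mean, stdev


def flag_residuals(mt_scores, pemt_scores, k=2.0):
    """Return indices of sentences whose residual is unusually large.

    Assumption (not from the abstract): the residual of a sentence is its
    PEMT-minus-MT metric difference minus the mean difference, and it is
    flagged when its absolute value exceeds k sample standard deviations.
    """
    if len(mt_scores) != len(pemt_scores):
        raise ValueError("score lists must have the same length")
    if len(mt_scores) < 2:
        raise ValueError("at least two sentences are needed")

    # Observed per-sentence difference in the automatic metric score.
    diffs = [p - m for m, p in zip(mt_scores, pemt_scores)]
    predicted = mean(diffs)   # predicted value under the stated assumption
    spread = stdev(diffs)     # sample standard deviation of the differences
    if spread == 0:
        return []             # all differences identical, nothing stands out

    # Residual = observed - predicted; keep the sentences beyond k * spread.
    return [i for i, d in enumerate(diffs) if abs(d - predicted) > k * spread]


# Hypothetical sentence-level metric scores for ten sentences.
mt = [0.50] * 9 + [0.20]
pemt = [0.55] * 9 + [0.90]
print(flag_residuals(mt, pemt))
```

With these made-up scores only the last sentence is flagged, because post-editing changed its score far more than it changed the others. A regression-based prediction could replace the mean without altering the rest of the sketch.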

Keywords

Machine translation evaluation, residuals, analytical language, inflectional language, MT errors
