Automatic evaluation of Machine Translation (MT) output refers to the evaluation of translated content using automated metrics such as BLEU, NIST, METEOR, TER, CharacTER, and so on. Automated metrics emerged to address the need for objective, consistent, quick, and affordable assessment of MT output, as opposed to human evaluation, where translators or linguists are asked to evaluate segments manually.
Most automated metrics use a segment-level, similarity-based method: they compare the output of an MT system to a human-generated “reference” translation and compute how close the machine-translated sentence is to that reference. It is assumed that the smaller the difference, the better the quality. The unit of comparison can be a single word, but the metrics also use n-grams to compute precision scores. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, and similar.
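To make the notion of an n-gram concrete, here is a minimal Python sketch. The helper name `ngrams` and the sample sentence are illustrative assumptions, not part of any particular metric's implementation.

```python
def ngrams(items, n):
    """Return every contiguous n-gram of a sequence as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "the cat is on the mat"

# Word-level bigrams: the items are whitespace-separated words.
word_bigrams = ngrams(sentence.split(), 2)

# Character-level trigrams: the items are single letters.
char_trigrams = ngrams(list("mat"), 3)

print(word_bigrams)
print(char_trigrams)
```

The same function works for words or characters because an n-gram is defined over any sequence of items.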
The BLEU (Bilingual Evaluation Understudy) score was first proposed in the 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation” (Kishore Papineni et al.), and it is still the most widely used metric for MT evaluation. Its popularity rests on a presumed high correlation with human rankings of MT output, a correlation that has often been called into question. It is a segment-level algorithm that judges translations on a per-word basis.
Because BLEU looks at a specific set of source sentences and translations selected for the test, it should not be considered a measurement of overall translation quality. BLEU measures MT adequacy by looking at word precision and MT fluency by calculating n-gram precisions, returning a translation score on a scale from 0 to 1 (or, alternatively, from 0 to 100). BLEU’s n-gram matching requires exact word matches, meaning that if the MT output uses different vocabulary or phrasing than the reference translation, the score will be lower, even when the translation itself is acceptable.
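The calculation can be sketched roughly as follows. This is a simplified, single-reference, sentence-level approximation of BLEU, assuming whitespace tokenization and no smoothing; the function names `ngram_counts` and `simple_bleu` are hypothetical, and production tools differ in tokenization, smoothing, and corpus-level aggregation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU on a 0-1 scale for one hypothesis and one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs
        # in the reference, and only exact matches count.
        matched = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, equally weighted.
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat"
score_rug = simple_bleu("the cat is on the rug", reference)
score_sat = simple_bleu("the cat sat on the mat", reference)
print(round(score_rug, 3), score_sat)
```

In this toy example, replacing "is" with "sat" in the middle of the sentence breaks every 4-gram, so the unsmoothed score drops to zero even though only one word differs. That is the exact-match strictness described above.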
NIST gets its name from the US National Institute of Standards and Technology. It is a metric based on BLEU with some additions. One of the differences is in the n-gram precision calculation. While BLEU calculates n-gram precision by giving equal weight to each n-gram, NIST also calculates how informative a particular n-gram is. More weight is given to n-grams that are considered less likely to occur (rarer).
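The weighting idea can be illustrated with a small sketch. It assumes the commonly cited information weight, log2 of the count of an n-gram's (n-1)-word prefix divided by the count of the full n-gram, estimated over the reference translations; the function names and the two toy references are illustrative assumptions, and the full NIST score involves further details not shown here.

```python
import math
from collections import Counter

def corpus_ngram_counts(sentences, n):
    """Count n-grams over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def info_weights(references, n):
    """Information weight per n-gram: rarer n-grams get larger weights."""
    counts = corpus_ngram_counts(references, n)
    if n == 1:
        # For unigrams the "prefix" count is the total number of words.
        total = sum(counts.values())
        return {gram: math.log2(total / c) for gram, c in counts.items()}
    prefix_counts = corpus_ngram_counts(references, n - 1)
    return {gram: math.log2(prefix_counts[gram[:-1]] / c) for gram, c in counts.items()}

references = [
    "the cat is on the mat".split(),
    "the dog is in the garden".split(),
]
weights = info_weights(references, 1)
print(weights[("the",)], weights[("cat",)])
```

Here the frequent word "the" receives a much smaller weight than the rare word "cat", so matching "cat" contributes more to the score than matching "the".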
METEOR (“METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments”, S. Banerjee and A. Lavie) was initially released in 2004. Whereas BLEU uses only precision-based features, METEOR considers recall in addition to precision, a property that several metrics have confirmed to be critical for high correlation with human judgments. METEOR also allows multiple reference translations and addresses the problem of variability with flexible word matching, so that morphological variants and synonyms count as legitimate matches. Moreover, METEOR parameters can be tuned separately for different languages to optimize correlation with human judgments.
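The role of recall can be illustrated with a short sketch. It assumes exact unigram matching only and the recall-weighted harmonic mean described in the original METEOR paper (recall weighted nine times more heavily than precision); stemming, synonym matching, the fragmentation penalty, and language-specific tuning are deliberately left out, and the function names are hypothetical.

```python
from collections import Counter

def unigram_precision_recall(hypothesis, reference):
    """Exact-match unigram precision and recall for whitespace-tokenized text."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())
    precision = matches / len(hyp) if hyp else 0.0
    recall = matches / len(ref) if ref else 0.0
    return precision, recall

def recall_weighted_fmean(precision, recall, recall_weight=9.0):
    """Harmonic mean of precision and recall that favors recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (1 + recall_weight) * precision * recall / (recall + recall_weight * precision)

# A very short hypothesis: every word is correct, but most of the
# reference is missing.
p, r = unigram_precision_recall("the cat", "the cat is on the mat")
fmean = recall_weighted_fmean(p, r)
print(p, r, fmean)
```

The two-word hypothesis has perfect precision but covers only a third of the reference, and the recall-weighted mean reflects that gap instead of rewarding the short output.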
Translation Error Rate (TER) is a word-based automatic metric that measures the number of edit operations needed to transform the machine-translated output into a human-translated reference (CharacTER applies the same idea at the character level). It has been proposed as an alternative to the BLEU score for evaluating the quality of MT, but it is more commonly applied as a way of calculating edit distance, used to assess the post-editing effort.
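As a rough illustration of the edit-distance idea, the sketch below computes a word-level Levenshtein distance (insertions, deletions, substitutions) and divides it by the reference length. This is only an approximation: TER as usually defined also counts shifts of word sequences as single edits, which this sketch does not attempt, and the function names are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def approx_ter(hypothesis, reference):
    """Edits per reference word; lower is better, 0.0 means identical."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return word_edit_distance(hyp, ref) / len(ref)

rate = approx_ter("the cat sat on the mat", "the cat is on the mat")
print(rate)
```

One substitution against a six-word reference gives a rate of one sixth. Unlike BLEU, a lower value indicates a better translation, which is why the number maps naturally onto post-editing effort.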
Automated MT quality metrics are very useful to developers and researchers of MT technology, because MT system development requires frequent system evaluations. The metrics are fast and easy to run, require minimal human labor, don’t need bilingual speakers, and can be used repeatedly during system development.
MT metrics’ primary purpose is to assess the quality of MT models, and not translation as such. Therefore, despite their usefulness when it comes to MT system development and comparison, they are not suitable for a translation production scenario. Here are some evident limitations:
Milica is a marketing professional with over 10 years in the field. As TAUS Head of Product Marketing she manages the positioning and commercialization of TAUS data services and products, as well as the development of taus.net. Before joining TAUS in 2017, she worked in various roles at Booking.com, including localization management, project management, and content marketing. Milica holds two MAs in Dutch Language and Literature, from the University of Belgrade and Leiden University. She is passionate about continuously inventing new ways to teach languages.