Automated MT Evaluation Metrics

Automatic evaluation of Machine Translation (MT) output means scoring translated content with automated metrics such as BLEU, NIST, METEOR, TER, and CharacTER. Automated metrics emerged to meet the need for objective, consistent, quick, and affordable assessment of MT output, in contrast to human evaluation, where translators or linguists assess segments manually.

How Do Automated Metrics Work?

Most automated metrics use a similarity-based method at the segment level: they compare the output of an MT system to a human-generated “reference” translation and compute how close the machine-translated sentence is to that reference. The assumption is that the smaller the difference, the better the quality. The unit of comparison can be a single word, but the metrics also use n-grams to compute precision scores. An n-gram is a contiguous sequence of n items from a given sample of text or speech; the items can be phonemes, syllables, letters, or words.
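As a minimal sketch of the n-gram idea, the following Python snippet extracts word-level and character-level n-grams from a short string. The helper name ngrams and the example sentence are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "the cat sat on the mat"

# Word-level bigrams: the items are whitespace-separated words.
word_bigrams = ngrams(sentence.split(), 2)

# Character-level trigrams: the items are single characters (spaces included).
char_trigrams = ngrams(list("cat sat"), 3)

print(word_bigrams)
print(char_trigrams)
```

A sequence shorter than n simply yields no n-grams, which is why very short segments contribute little to higher-order precision scores.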

The Most Common Automated Metrics

BLEU

The BLEU (Bilingual Evaluation Understudy) score was first proposed in the 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation” (Kishore Papineni et al.). It is still the most widely used metric for MT evaluation because of its presumed high correlation with human rankings of MT output, although that correlation has often been called into question. It compares each translated segment with its reference on a per-word basis, and the matches are typically aggregated over the whole test set.

As it looks at a specific set of source sentences and translations selected for the test, it should not be considered a measurement of overall translation quality. BLEU measures MT adequacy by looking at word precision and MT fluency by calculating n-gram precisions, returning a score on a scale from 0 to 1 (or, alternatively, 0 to 100). BLEU’s n-gram matching requires exact word matches, so if the MT output uses different vocabulary or phrasing from the reference translation, the score will be lower.
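The calculation can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: it assumes one reference per segment, whitespace tokenization, n-grams up to length 4, and no smoothing, and the function names are hypothetical. It combines clipped n-gram precisions through a geometric mean and applies a brevity penalty to output that is shorter than the reference.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count each contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """Simplified BLEU on a 0-1 scale: one reference per candidate, no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c, r = cand.split(), ref.split()
        cand_len += len(c)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            c_counts = ngram_counts(c, n)
            r_counts = ngram_counts(r, n)
            # Clip each candidate n-gram count by its count in the reference.
            matched[n - 1] += sum(min(k, r_counts[g]) for g, k in c_counts.items())
            total[n - 1] += sum(c_counts.values())
    if cand_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # The brevity penalty only applies when the output is not longer than the reference.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)

reference = ["the quick brown fox jumps over the lazy dog"]
exact = bleu(["the quick brown fox jumps over the lazy dog"], reference)
synonym = bleu(["the fast brown fox jumps over the lazy dog"], reference)
print(exact, synonym)
```

In this example, replacing "quick" with the synonym "fast" breaks every n-gram that contains the word, so the second score is noticeably lower than the first even though the meaning is preserved.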

NIST

NIST gets its name from the US National Institute of Standards and Technology. It is a metric based on BLEU with some additions. One of the differences is in the n-gram precision calculation. While BLEU assigns equal weight to each n-gram, NIST also calculates how informative a particular n-gram is: more weight is given to n-grams that are less likely to occur (rarer ones).
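The weighting idea can be illustrated with a short Python sketch. It assumes the commonly cited information weight, log2(count of the n-gram's first n-1 words / count of the full n-gram), computed over the reference translations, with the total word count used as the numerator for single words. The function name and the toy references are hypothetical, and the rest of the NIST score (averaging and length penalty) is omitted.

```python
import math
from collections import Counter

def information_weights(reference_sentences, max_n=2):
    """Weight each reference n-gram by how unexpected it is given its prefix."""
    counts = Counter()
    total_words = 0
    for sentence in reference_sentences:
        tokens = sentence.split()
        total_words += len(tokens)
        for n in range(1, max_n + 1):
            counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    weights = {}
    for gram, count in counts.items():
        # For single words the "prefix count" is the total number of reference words.
        prefix_count = total_words if len(gram) == 1 else counts[gram[:-1]]
        weights[gram] = math.log2(prefix_count / count)
    return weights

references = ["the cat sat on the mat", "the dog sat on the rug"]
weights = information_weights(references)
print(weights[("the",)], weights[("cat",)])
```

Here the frequent word "the" receives a smaller weight than the rare word "cat", so matching "cat" contributes more to the score than matching "the".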

METEOR

METEOR, introduced in “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments” (S. Banerjee, A. Lavie), was initially released in 2004. Whereas BLEU uses only precision-based features, METEOR also takes recall into account, which several metrics have shown to be critical for high correlation with human judgments. METEOR also allows multiple reference translations and addresses the problem of variability through flexible word matching, so that morphological variants and synonyms count as legitimate matches. Moreover, METEOR parameters can be tuned separately for different languages to optimize correlation with human judgments.
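The role of recall and flexible matching can be sketched in Python as below. This is a deliberately reduced illustration rather than the METEOR algorithm itself: it assumes greedy one-to-one word matching, a hand-written synonym table (hypothetical), and a recall-weighted harmonic mean of the form 10PR / (R + 9P), and it leaves out stemming, the word-order penalty, and language-specific tuning.

```python
def recall_weighted_fmean(candidate, reference, synonyms=None, recall_weight=9.0):
    """Harmonic mean of unigram precision and recall, weighted toward recall."""
    synonyms = synonyms or {}
    cand = candidate.split()
    remaining = reference.split()  # reference words not yet matched
    ref_len = len(remaining)
    matches = 0
    for token in cand:
        # A candidate word may match itself or any listed synonym (assumed table).
        options = {token} | set(synonyms.get(token, ()))
        for i, ref_token in enumerate(remaining):
            if ref_token in options:
                del remaining[i]  # each reference word can be matched only once
                matches += 1
                break
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / ref_len
    return (1 + recall_weight) * precision * recall / (recall + recall_weight * precision)

reference = "the cat sat on the mat"
truncated = recall_weighted_fmean("the cat sat", reference)
exact_only = recall_weighted_fmean("the feline sat on the mat", reference)
with_synonyms = recall_weighted_fmean(
    "the feline sat on the mat", reference, synonyms={"feline": ["cat"]}
)
print(truncated, exact_only, with_synonyms)
```

The truncated output has perfect precision but covers only half of the reference, so the recall-weighted score is low. The synonym table lets "feline" match "cat", which raises the score for a translation that exact matching would penalize.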

TER

Translation Error Rate (TER) is a word-based automatic metric that measures the number of edit operations needed to transform the machine-translated output into a human-translated reference (CharacTER is its character-level counterpart). It has been proposed as an alternative to the BLEU score for evaluating the quality of MT, but it is more commonly applied as a way of calculating edit distance to assess post-editing effort.
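A rough Python sketch of the edit-rate idea follows. It assumes whitespace tokenization and counts only word insertions, deletions, and substitutions, normalized by the reference length. Full TER additionally allows shifts of word sequences, which this simplified version omits, and the function name is hypothetical.

```python
def edit_rate(hypothesis, reference):
    """Word-level edit distance divided by reference length (no shift operation)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # previous[j] holds the edits needed to turn the hypothesis words seen so far
    # into the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,                      # delete a hypothesis word
                current[j - 1] + 1,                   # insert a reference word
                previous[j - 1] + substitution_cost,  # keep or substitute
            ))
        previous = current
    return previous[-1] / len(ref)

rate = edit_rate("the cat sat on mat", "the cat is on the mat")
print(rate)
```

In the example, one substitution ("sat" to "is") and one insertion ("the") are needed against a six-word reference. Unlike BLEU, lower values are better, and a value of 0 means the output already matches the reference.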

Main Advantages

Automated MT quality metrics are very useful to developers and researchers of MT technology, because MT system development requires frequent evaluation. They are fast and easy to run, require minimal human labor, do not need bilingual speakers, and can be used repeatedly during system development.

Disadvantages

The primary purpose of MT metrics is to assess the quality of MT models, not of translation as such. Therefore, despite their usefulness for MT system development and comparison, they are not suitable for a translation production scenario. Here are some evident limitations:

A reference translation (or, preferably, multiple translations) of the segment is required. This is not practical in a live translation production scenario.
The reference translation is assumed to be the gold standard, but that is hard to validate. Most source sentences have multiple translations that could be considered the gold standard.
The automatically generated quality scores can give a certain level of confidence in the quality of an MT system, but they do not mean much more for production-based translation activities. A BLEU score of 38 might mean a good translation for one sentence and a poor one for another. Furthermore, it doesn’t tell translators how much time it will take them to post-edit the segment or how much they should be paid for editing it.

Whether you should use automated metrics in your MT program depends on your use case. If you do, you will need to train the metrics on similar data and prepare reference translations for each of the sentences you want to score. Find out where you can get parallel language data for MT training or explore the TAUS Data Library with domain-specific, high-quality data sets.