What is the single most important element that is present in every translation or localization workflow? There is one process that cannot be eliminated from any type of translation or localization task, and that is evaluation. This process becomes even more important when it comes to machine translation (MT). How should output produced by artificial intelligence be evaluated? Should humans take over the task, or should we let robots decide? Or does something in between need to be invented? This blog offers some insights into the evaluation side of the long-running debate about humans vs. machines. Whether you are piloting MT, customizing it or implementing it in production, you need to decide how you are going to evaluate its output. Three principal quality indicators emerge for the case of machine-translated texts:
For human evaluation, several strategies can be employed: error identification, quality scales and ranking.
Error identification is the most exhaustive of all approaches, as it identifies and locates all errors present in the text. It is seen as the most objective approach thanks to the pre-established set of errors, which often include grammar, punctuation, terminology and style. However, it is also the most time-consuming and requires the most highly trained/qualified evaluators.
On the other hand, scales offer a more global view of quality. Instead of identifying specific errors, evaluators assess the overall quality of each sentence according to a particular attribute.
Thirdly, ranking aims to speed up human evaluation and to reduce the cognitive effort involved. It is particularly useful when comparing translations (and systems) but it doesn’t provide any information on the actual quality of each sentence.
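To make the difference between scales and ranking a little more concrete, here is a minimal Python sketch of how the two kinds of judgments could be aggregated. Everything in it is an assumption for illustration: the function names, the system labels "A", "B" and "C", and the 1-to-5 scale are hypothetical and do not come from any particular evaluation tool.

```python
from collections import defaultdict
from statistics import mean


def mean_scale_scores(ratings):
    """Average the scale scores per system.

    ratings: list of (system, score) pairs, e.g. a hypothetical 1-5 scale
    for one attribute, one pair per judged sentence.
    """
    by_system = defaultdict(list)
    for system, score in ratings:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


def pairwise_wins(rankings):
    """Count how many competitors each system beats across all rankings.

    rankings: list of lists, each ordering the systems from best to worst
    for one source sentence.
    """
    wins = defaultdict(int)
    for ordering in rankings:
        for position, system in enumerate(ordering):
            # A system beats every system placed after it in this ordering.
            wins[system] += len(ordering) - position - 1
    return dict(wins)


# Illustrative, made-up judgments.
scale_ratings = [("A", 4), ("A", 2), ("B", 5)]
sentence_rankings = [["A", "B", "C"], ["B", "A", "C"]]

print(mean_scale_scores(scale_ratings))
print(pairwise_wins(sentence_rankings))
```

Note that the win counts only say which system was preferred more often. As with ranking in general, they say nothing about whether any of the translations was actually good.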
Apart from these, the industry devises its own indirect MT assessment methods, such as usability tests. These methods are more geared towards measuring the usefulness of a document and user satisfaction with it. In essence, they do not necessarily target the linguistic quality of a text but rather evaluate its global acceptance by the target audience. Customer feedback received by the support service or collected through like/dislike ratings on web pages can also be indicative of the quality or usefulness of a piece of text.
Much as the different evaluation strategies try to overcome frailties, human evaluation is criticized for being subjective, inconsistent, time-consuming and expensive.
Obviously, each evaluator has an individual type of expertise depending on their training, experience, familiarity with MT and personal opinions about MT, all of which play a vital role in their quality judgment. Yet humans are still the most reliable source of meaningful, informative evaluations. Users are also human, after all.
There is a long list of automatic evaluation metrics to choose from: BLEU, NIST, METEOR, GTM, TER, Levenshtein edit-distance, confidence estimation and so on.
The main working principle of automatic metrics is to calculate how similar a machine-translated sentence is to a human reference translation or to previously translated data. It is assumed that the smaller the difference, the better the quality.
For example, BLEU, NIST, GTM and the like try to calculate this similarity by counting how many words are shared by the MT output and the reference, and they reward long sequences of shared words. TER and the original edit-distance are more task-oriented and try to capture the work that is needed to bring MT output up to human translation standards. They seek to correlate with post-editing activity. They measure the minimum number of additions, deletions, substitutions and re-orderings needed to transform the MT output into the reference human translation.
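The two ideas above can be sketched in a few lines of Python. This is a simplified illustration, not an implementation of BLEU or TER themselves: it assumes whitespace tokenization and a single reference, it leaves out BLEU's brevity penalty and the combination of several n-gram orders, and the edit distance covers additions, deletions and substitutions only, not the re-orderings that TER also counts. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams that also occur in the reference.

    Counts are clipped, so a repeated n-gram is credited at most as often
    as it appears in the reference.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the hypothesis into the reference (Levenshtein)."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # delete hyp_word
                               current[j - 1] + 1,       # insert ref_word
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]


def edit_rate(hypothesis, reference):
    """Edit distance normalized by the reference length in words."""
    return word_edit_distance(hypothesis, reference) / max(len(reference.split()), 1)


mt_output = "the cat sat on mat"
human_reference = "the cat sat on the mat"

print(clipped_ngram_precision(mt_output, human_reference, 1))  # 1.0
print(clipped_ngram_precision(mt_output, human_reference, 2))  # 0.75
print(word_edit_distance(mt_output, human_reference))          # 1
```

In this made-up example every word of the MT output appears in the reference, yet one of the four word pairs does not, and a single insertion is enough to reach the reference. Longer shared sequences and fewer edits both point to a closer match.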
Lastly, confidence estimation works by obtaining data from the MT system and learning, from previous data, features that relate the source to the translation.
Automatic metrics emerged to address the need for objective, consistent, quick and cheap evaluations. The ideal metric has been described by the developers of METEOR, Banerjee and Lavie, as fully automatic, low-cost, tunable, consistent and meaningful.
No matter how many times you run the same automatic metric on the same data, scores will be consistent, which is an advantage over the subjective nature of human evaluation. The question with automatic metrics is how to calculate a quality attribute. What does quality mean in the language of computers? The algorithm used is objective, yes, but what is it calculating?
The solution so far has been to come up with an algorithm, whatever it does, that correlates with human responses. Automatic metrics are certainly quicker than human evaluation. Hundreds of sentences can be scored at the click of a mouse. Nevertheless, they also take a lot of preparation. Metrics need to be trained on similar data and/or require reference translations for each of the sentences you want to score. That in itself can be costly and time-consuming.
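As a rough illustration of what "correlates with human responses" can mean in practice, the sketch below computes a Pearson correlation between metric scores and human scores for the same sentences. Pearson is only one possible choice of correlation measure, the function name is hypothetical, and the scores are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return covariance / math.sqrt(var_x * var_y)


# Invented scores for five sentences: one automatic metric, one human scale.
metric_scores = [0.21, 0.35, 0.50, 0.62, 0.80]
human_scores = [2, 2, 3, 4, 5]

print(pearson(metric_scores, human_scores))
```

A value close to 1 would suggest that the metric orders sentences much as the human evaluators do. A value near 0 would suggest that it is measuring something else.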
There are different implementations of automatic evaluation algorithms, different approaches to calculating the final scores, and different definitions of and penalties for error categories. However, translation has evolved towards transcreation and is highly focused on localization, which inevitably adds context and culture as vital variables to the equation. MT plays its part by speeding up the process, but detailed evaluation is a must to secure the quality of the end product. The ultimate solution would require something in between robots and humans; let’s call them ‘social robots’. Social robots that are objective yet sensible enough to take variable contexts into account may be the long-sought answer.
The diversification of content types and the rapid adoption of translation technologies (including machine translation) drive the need for more dynamic and reliable methods of quality evaluation. TAUS Quality Dashboard provides the objective and neutral platform needed to evaluate translation quality.
Şölen is the Head of Digital Marketing at TAUS where she leads digital growth strategies with a focus on generating compelling results via search engine optimization, effective inbound content and social media with over seven years of experience in related fields. She holds BAs in Translation Studies and Brand Communication from Istanbul University in addition to an MA in European Studies: Identity and Integration from the University of Amsterdam. After gaining experience as a transcreator for marketing content, she worked in business development for a mobile app and content marketing before joining TAUS in 2017. She believes in keeping up with modern digital trends and the power of engaging content. She also writes regularly for the TAUS Blog/Reports and manages several social media accounts she created on topics of personal interest with over 100K followers.