Beyond Numbers: Understanding Quality Estimation Scores


Learn about the distinctions between quality estimation and quality evaluation, the foundations of TAUS QE scores, and how to categorize and customize scores for optimal application.

The goal of quality estimation (QE) is to measure the quality of a (machine) translation without having access to a reference translation. In this blog post, we explain how the QE score is created and how to interpret it.

What is the difference between quality evaluation and quality estimation?

The two terms sound similar and are often used interchangeably, but they refer to two fundamentally different processes, particularly in the context of machine translation (MT):

  • Evaluation is a post-translation process that compares the MT output against a human reference translation. It is usually run periodically to assess MT performance over time. Some of the most common MT evaluation metrics are BLEU, chrF, COMET and TER.
  • Estimation, on the other hand, takes place at the time of translation. It is meant to predict the quality of the MT output without human intervention and to streamline the content workflow, for example, to indicate whether the source content is suitable for MT, or identify when raw MT can be safely used without any post-editing.

What is the QE score based on?

The TAUS QE score is mostly based on semantic similarity. To calculate it, we use sentence embedding vectors that represent the meaning of each segment and measure how similar the source and target segments are. To achieve maximum accuracy and language coverage, TAUS uses embeddings from multiple language models.
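To make the idea of comparing embedding vectors concrete, here is a minimal sketch that uses cosine similarity, a common way to compare two vectors. It is an illustration only: the article does not say which similarity measure or models the TAUS score uses, and the vectors below are made-up toy numbers rather than real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional "embeddings" (hypothetical values, not real model output).
source_vec = [0.2, 0.7, 0.1, 0.4]
close_target_vec = [0.25, 0.65, 0.05, 0.45]  # points in nearly the same direction
distant_target_vec = [0.9, 0.1, 0.6, 0.0]    # points in a very different direction

print(cosine_similarity(source_vec, close_target_vec))
print(cosine_similarity(source_vec, distant_target_vec))
```

In a real setup, the vectors would come from a multilingual sentence embedding model, so that a source segment and a faithful translation land close together even though they are in different languages.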

The trained model provides a QE score for each segment. The scores range from 0 to 1, and can be interpreted as follows:

  • 0.95 - 1.00: Best
  • 0.90 - 0.95: Good
  • 0.85 - 0.90: Acceptable
  • < 0.85: Bad
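The bands above can be applied in code with a small lookup. This is a sketch under one assumption: the listed ranges share their boundary values (0.95 and 0.90 each appear in two bands), so here each boundary is assigned to the higher band. The function name `qe_band` is hypothetical.

```python
def qe_band(score):
    """Map a QE score in [0, 1] to the bands listed above.

    Assumption: a boundary value belongs to the higher band,
    e.g. exactly 0.95 counts as "Best".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("QE score must be between 0 and 1")
    if score >= 0.95:
        return "Best"
    if score >= 0.90:
        return "Good"
    if score >= 0.85:
        return "Acceptable"
    return "Bad"


for score in (0.97, 0.92, 0.87, 0.60):
    print(score, qe_band(score))
```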

Are the QE scores similar to the translation memory (TM) matches?

While the concept of a QE score bears some resemblance to a TM match score, especially in terms of the application, the underlying logic and interpretations of these scores diverge significantly:

  • Translation memory retrieves previously translated text strings and indicates their similarity to the strings in a new text: it measures the textual similarity between two source sentences. If the TM is known to be of good quality, the full matches can be reused without any further editing. 
  • In contrast, the QE score measures the similarity in meaning between a source text string and its translation (the target), and doesn’t require a previously translated reference.
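The difference is easiest to see in what gets compared. The sketch below imitates a TM fuzzy match by comparing two source sentences as text, using Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen for illustration: actual TM tools use their own matching algorithms, and the function name and example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def tm_match_score(new_source, tm_source):
    """Character-level similarity between two SOURCE sentences (0 to 1).

    A stand-in for a TM fuzzy match; it looks only at the surface text
    and never sees a translation.
    """
    return SequenceMatcher(None, new_source, tm_source).ratio()


tm_source = "Click the Save button to store your changes."
new_source = "Click the Save button to keep your changes."

print(round(tm_match_score(new_source, tm_source), 2))
```

A QE score, by contrast, would take a source sentence and its translation as input, two strings in different languages that share almost no characters, which is why it has to compare meaning rather than surface text.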

How reliable is the QE score?

As the word “estimation” suggests, the QE score is an approximation. This means that the value a QE model provides has to be interpreted in the context in which it is used. Generic models, for which vast multilingual training data is available, try to learn the intrinsic mathematical representations of sentences in various languages. They then assign a score based on the similarity between two sentences, signifying how equivalent they are in meaning. When applying this in a post-editing workflow, human reviewers need to know how the score ranges correlate with human judgment. That correlation can then guide interpretation, for example whether the threshold for “good” should be 0.85 (85%) or 0.90 (90%).
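One simple way to check how a threshold lines up with human judgment is to take a reviewed sample and see what share of the segments above each candidate threshold reviewers actually accepted. The sketch below does this with invented data; the sample values and the function name `precision_at_threshold` are hypothetical, and a real calibration would need a much larger sample.

```python
def precision_at_threshold(samples, threshold):
    """Share of segments scoring >= threshold that reviewers accepted.

    Returns None when no segment reaches the threshold.
    """
    passed = [accepted for score, accepted in samples if score >= threshold]
    if not passed:
        return None
    return sum(passed) / len(passed)


# Hypothetical review sample: (QE score, reviewer accepted the raw MT?)
samples = [
    (0.97, True), (0.93, True), (0.91, False), (0.89, True),
    (0.88, False), (0.86, False), (0.80, False), (0.96, True),
]

for threshold in (0.85, 0.90, 0.95):
    print(threshold, precision_at_threshold(samples, threshold))
```

If too many segments above a threshold still get rejected by reviewers, that threshold is too low to treat raw MT as safe for this content, and a higher cut-off or a customized model is worth considering.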

Model customization makes it possible to tailor the score to specific requirements and scenarios, which allows more adaptability and makes the score ranges more dependable. One example is MotionPoint, which set out to reduce its post-editing effort for a specific customer.

What are the options for QE score categorization?

TAUS can create custom models that are fine-tuned to a specific domain and language pair. The training data should be labeled, but the type and values of the labels can vary per use case. Labels can be discrete, such as "poor", "below average", "average", "good", "excellent", or 1, 2, 3, 4, or they can be continuous. While it is possible to train a single model to work for many language pairs or topics/domains, we have found that the best results are obtained by training custom models that are both topic-/domain- and language-pair specific, e.g., French-German for the Health domain.





Dace is a product and operations management professional with 15+ years of experience in the localization industry. Over the past 7 years, she has taken on various roles at TAUS, ranging from account management to product and operations management. Since 2020, she has been a member of the Executive Team and leads the strategic planning and business operations of a team of 20+ employees. She holds a Bachelor’s degree in Translation and Interpreting and a Master’s degree in Social and Cultural Anthropology.
