Case Study

Unlocking Efficiency: Leveraging DeMT™ Estimate API to Optimize MT Workflows

MotionPoint, a global technology solutions company, partnered with TAUS to determine whether Machine Translation Quality Estimation (MTQE) models could be used to remove the human post-editing (PE) step from certain machine translation (MT) workflows.

TAUS built a custom English-to-Spanish MTQE model in the telecommunications domain with preset quality thresholds (0.95–1.00 Best; 0.90–0.95 Good; 0.85–0.90 Acceptable; < 0.85 Bad). MotionPoint tested the model by comparing its scoring against human judgment and against translation quality scoring by OpenAI's Large Language Model (LLM) GPT4.

The pilot demonstrated that a combination of the metrics can be used to reliably identify the top 15–28% of segments as safe to skip the post-editing step, contributing to more streamlined, efficient translation workflows.

The Client


MotionPoint is a managed website translation platform that delivers concierge-level, end-to-end translation to meet the needs of brands across different languages and markets. Far more than the world’s most effective website translation service, MotionPoint combines intelligent applications, big data, and expert services to localize, translate, and optimize websites for strategic markets.
The Challenge

While MT usage is common practice, its full potential remains hard to establish, especially when it comes to identifying whether the raw MT is good enough to skip human review. The aim of this pilot was to explore MTQE technology and determine whether a correlation can be established between human scoring and one or more automatic MTQE methods.

MotionPoint set out to explore the following questions:

1. To what extent can correlation be observed between human quality evaluation and automatic MTQE scoring?

2. Does one MTQE method produce a better correlation than another, or can improved correlation be observed via some combination of two scoring metrics?

The Solution

MotionPoint partnered with TAUS to run a pilot project using the TAUS Estimate API.

MotionPoint provided the training and test datasets representing the content to be translated: a bilingual dataset of 150,000 segments, a glossary, and a style guide. The test set consisted of 175 source segments.

Custom Model Creation

TAUS built a custom MTQE model for the EN-ES language pair in the telecommunications domain. 

The TAUS MTQE score is TAUS's own proprietary metric, with scores ranging from 0 to 1. It measures the fluency and adequacy of MT output, with higher scores indicating better translation quality. It is based on a BERT-based LLM, fine-tuned with TAUS data for various languages.

The ability to customize the MTQE model is one of the core features of the DeMT Estimate offering. It is achieved through an offline process in which our NLP engineers collaborate with the customer to fine-tune the model based on the customer's unique needs and requirements. This ensures that the resulting model accurately reflects the specific domain and use case, and provides the most relevant and reliable quality estimations. The customization process involves the following steps:

- Data analysis and preprocessing 

- Synthetic data generation to augment the original dataset

- Model training

The trained model provides an MTQE score for each segment. The scores range from 0 to 1 and can be interpreted as follows:

0.95 - 1.00: Best

0.90 - 0.95: Good

0.85 - 0.90: Acceptable

< 0.85: Bad
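As a minimal illustration, the banding above can be expressed as a simple lookup. The threshold values come from the table; the function name is illustrative:

```python
def mtqe_band(score: float) -> str:
    """Map a TAUS MTQE score (0-1) to the pilot's quality band."""
    if score >= 0.95:
        return "Best"
    if score >= 0.90:
        return "Good"
    if score >= 0.85:
        return "Acceptable"
    return "Bad"
```

For example, a segment scored at 0.97 falls in the "Best" band, while one at 0.82 falls in the "Bad" band and would certainly be routed to post-editing.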

Human Review

MotionPoint applied the following two scoring methods for human analysis of the test set:

1 - The linguist was asked to answer a simple ‘yes/no’ question as to whether each segment requires post-editing.

2 - A modified MQM scoring methodology was used to mark the errors in each segment according to a set of error categories and severity levels.

This resulted in each segment receiving a score in the range 0–100, with 100 being a 'perfect' translation.

GPT4 Scoring

The third scoring method MotionPoint explored was prompting OpenAI's GPT4 model to perform a translation quality evaluation task. GPT4 was chosen over GPT3.5 because initial testing showed a significant improvement in its ability to produce credible quality evaluations.

The final result set for comparison was made up of 175 segments, along with 250 segments that underwent human scoring.

GPT4 was prompted to act as a reviewer of translation quality and to provide only a JSON-structured response enumerating each error. It was instructed to apply the MQM scoring framework and rate each error's severity as critical, major, or minor. The penalties were then deducted from 100 to produce a score for each segment.
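The deduction step can be sketched as follows. The minor=1 and major=5 penalty weights match the scores reported in the pilot (99 for one minor error, 95 for one major error); the critical=25 weight and the JSON schema are assumptions based on common MQM practice, as neither is spelled out in the case study:

```python
import json

# Penalty weights: minor=1 and major=5 match the pilot's reported scores;
# critical=25 is an assumed value in line with standard MQM scoring.
PENALTIES = {"minor": 1, "major": 5, "critical": 25}

def score_segment(gpt4_json: str) -> int:
    """Deduct MQM penalties from 100 for each error GPT4 enumerated."""
    errors = json.loads(gpt4_json)["errors"]
    total_penalty = sum(PENALTIES[e["severity"]] for e in errors)
    return max(0, 100 - total_penalty)

# Hypothetical response shape (the actual schema is not shown in the study):
response = '{"errors": [{"severity": "minor", "category": "terminology"}]}'
score_segment(response)  # 99
```

A segment with no enumerated errors keeps the full score of 100, matching the "high number of scores at 100" noted in the results.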

The Results
Although each scoring method produced a high number of scores of 100, the human review scores were more widely distributed across the lower range than either the GPT4 or TAUS MTQE scores. The TAUS MTQE scores were also more varied across segments, whereas the majority of GPT4 scores were 100, 99 (one minor error), or 95 (one major error). In rare cases, GPT4 scored more than one error in a segment.

Scoring Variance

The variances increased for both methods as translation quality decreased, with the human reviewers penalizing more heavily than either MTQE tool. GPT4 appeared to do a better job at identifying the best translations: the variances (green line, human scores > 89) are much lower than for TAUS MTQE, apart from 3 anomalous results. The TAUS MTQE model seemed to do a better job at identifying the worst translations: the variances (orange line, human scores > 25) are much lower than for GPT4.

Post-Editing Yes/No Data

The methodology of this evaluation type had to be calibrated several times to ensure that ‘no’ was applied to ‘acceptable’ translations, and ‘yes’ to ‘unacceptable’ translations. The factors that were emphasized were:

1 - Ensure that context is not taken into account by the linguist (since the MT and MTQE engines do not have access to document-level context).

2 - Ensure that translations with only minor errors can be classed as 'no'.

The following graph shows how the final ‘yes/no’ data correlated with the TAUS MTQE model:
MotionPoint observed some correlation, with more 'yes' responses correlating with higher TAUS MTQE model scores. They also observed that a high percentage of GPT4 'perfect' scores correlated with segments that human reviewers said did NOT require PE. That said, a high percentage of the segments that received a score of 1 from GPT4 were also scored as YES for the need of PE. Looking at the scores in the 0.95–0.99 range, a high percentage of these segments, according to the human reviewers, DID need PE.


The following chart visualizes the PE yes/no data, with the GPT4 scores divided into two groups (high/low) and the TAUS MTQE scores divided into thirds (high, mid, low).
'GPT4 high' means that GPT4 scored 1.
It can be observed that in all the red quadrants (i.e. wherever GPT4 scores low), PE is definitely required. In the green quadrant, where both the TAUS MTQE score and the GPT4 score are high, PE can be skipped. The 14 segments in this quadrant that had been scored as requiring PE were rechecked by linguists, who determined that they were acceptable for publication without PE.
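The combined decision rule described above can be sketched as follows. The equality test for a perfect GPT4 score follows from the case study; the 0.95 cut-off for the MTQE "high" third is an illustrative assumption, since the study divides scores into thirds without stating the boundaries:

```python
def pe_required(gpt4_score: float, mtqe_score: float,
                mtqe_high_cutoff: float = 0.95) -> bool:
    """Combined rule from the pilot: skip post-editing only when GPT4
    gives a perfect score AND the TAUS MTQE score is in the high third.
    The default cut-off of 0.95 is an assumed boundary."""
    gpt4_high = gpt4_score == 1.0
    mtqe_high = mtqe_score >= mtqe_high_cutoff
    return not (gpt4_high and mtqe_high)
```

Under this rule, a segment only lands in the "skip PE" green quadrant when both signals agree that it is top-quality; disagreement between the two metrics always routes the segment to a human post-editor.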
Let's connect

Talk to our NLP experts to find out how a customized Quality Estimation model can minimize the time and cost of your post-editing efforts.
