MotionPoint, a global technology solutions company, partnered with TAUS to determine whether Machine Translation Quality Estimation (MTQE) models could be used to remove the human post-editing (PE) from certain machine translation (MT) workflows.
TAUS built a custom English-to-Spanish MTQE model for the telecommunications domain with preset quality thresholds (0.95 - 1.00 Best; 0.90 - 0.95 Good; 0.85 - 0.90 Acceptable; < 0.85 Bad). MotionPoint tested the model by comparing its scores against human judgment and against translation quality scores produced by OpenAI's large language model (LLM) GPT-4.
The pilot demonstrated that a combination of the metrics can accurately identify the top 15 - 28% of segments that can reliably skip the post-editing step, contributing to more streamlined, efficient translation workflows.
While MT usage is common practice, its full potential remains hard to establish, especially when it comes to identifying whether raw MT is good enough to skip human review. The aim of this pilot was to explore MTQE technology and determine whether a correlation can be established between human scoring and one or more automatic MTQE methods.
MotionPoint set out to explore the following questions:
1. To what extent can correlation be observed between human quality evaluation and automatic MTQE scoring?
2. Does one MTQE method produce a better correlation than another method, or can improved correlation be observed via some combination of two scoring metrics?
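As an illustration of what measuring this correlation can look like, here is a minimal sketch that computes a Pearson correlation coefficient between per-segment human scores and automatic scores. The source does not say which correlation statistic was used in the pilot, so Pearson is an assumption, and the function name and the sample values are hypothetical, not pilot data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only (not results from the pilot):
human_scores = [100, 95, 80, 60]       # human scores on a 0-100 scale
mtqe_scores = [0.97, 0.93, 0.88, 0.70]  # automatic scores on a 0-1 scale
print(round(pearson(human_scores, mtqe_scores), 3))
```

The same function can be run once per automatic method to compare which one tracks human judgment more closely.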
MotionPoint partnered with TAUS to run a pilot project using the TAUS Estimate API.
MotionPoint provided the training and test datasets representing the content to be translated, consisting of a bilingual dataset of 150,000 segments, a glossary and a style guide. The test set consisted of 175 source segments.
Custom Model Creation
TAUS built a custom MTQE model for the EN-ES language pair in the telecommunications domain.
The TAUS MTQE score is a proprietary TAUS metric ranging from 0 to 1. It measures the fluency and adequacy of MT output, with higher scores indicating better translation quality. It is produced by a BERT-based LLM fine-tuned with TAUS data for various languages.
The ability to customize the MTQE model is one of the core features of the DeMT Estimate offering. Customization is an offline process in which our NLP engineers collaborate with the customer to fine-tune the model to the customer's unique needs and requirements. This ensures that the resulting model accurately reflects the specific domain and use case, and provides the most relevant and reliable quality estimations. The customization process involves the following steps:
- Data analysis and preprocessing
- Synthetic data generation to augment the original dataset
- Model training
The trained model provides an MTQE score for each segment. The scores range from 0 to 1, and can be interpreted as follows:
0.95 - 1.00: Best
0.90 - 0.95: Good
0.85 - 0.90: Acceptable
< 0.85: Bad
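The bands above can be applied to a score with a small helper like the following sketch. The function name is hypothetical, and because the listed ranges share their endpoints (0.90 and 0.95 each appear in two bands), assigning a boundary value to the higher band is an assumption rather than something the source states.

```python
def quality_band(score: float) -> str:
    """Map an MTQE score in [0, 1] to the band labels listed in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("MTQE scores are expected to lie between 0 and 1")
    # Assumption: a score exactly on a boundary falls into the higher band.
    if score >= 0.95:
        return "Best"
    if score >= 0.90:
        return "Good"
    if score >= 0.85:
        return "Acceptable"
    return "Bad"


for s in (0.97, 0.92, 0.86, 0.40):
    print(s, quality_band(s))
```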
MotionPoint applied the following two scoring methods for human analysis of the test set:
1 - The linguist was asked to answer a simple ‘yes/no’ question as to whether each segment requires post-editing.
2 - A modified MQM scoring methodology was used to mark the errors in each segment according to the following categories/severity levels:
This gave each segment a score from 0 to 100, with 100 being a ‘perfect’ translation.
The third scoring method that MotionPoint explored was to prompt OpenAI's GPT models to perform a translation quality evaluation task. GPT-4 was chosen for experimentation, since initial testing showed a significant improvement over GPT-3.5 in the model's ability to produce credible quality evaluations.
175 segments made up the final result set for comparison, along with 250 segments that underwent human scoring.
GPT-4 was prompted to act as a reviewer of translation quality and to return only a JSON-structured response enumerating each error. It was instructed to use the MQM scoring framework and to assign each error a severity level of critical, major or minor. The penalties were then deducted from 100 to produce a score for each segment.
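The deduction step can be sketched as follows. The source gives neither the JSON layout of the model's response nor the penalty assigned to each severity level, so the "errors" structure, the weights (minor 1, major 5, critical 10) and the floor at 0 are all assumptions made for illustration.

```python
import json

# Hypothetical penalty weights; the values used in the pilot are not stated.
PENALTIES = {"minor": 1, "major": 5, "critical": 10}


def mqm_score(response_text: str) -> int:
    """Turn a JSON error list into a 0-100 segment score."""
    # Assumed response shape: {"errors": [{"category": ..., "severity": ...}]}
    errors = json.loads(response_text)["errors"]
    total_penalty = sum(PENALTIES[error["severity"]] for error in errors)
    # Assumption: the score is not allowed to drop below 0.
    return max(0, 100 - total_penalty)


example_response = json.dumps({
    "errors": [
        {"category": "terminology", "severity": "major"},
        {"category": "punctuation", "severity": "minor"},
    ]
})
print(mqm_score(example_response))  # 100 - (5 + 1) = 94
```

A response with an empty error list yields 100, matching the ‘perfect’ end of the scale described for the human MQM scoring.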
Post-Editing Yes/No Data
1 - Ensure that context is not being taken into account by the linguist (since the MT and MTQE engines do not have access to document level context),
2 - Ensure that translations with minor errors can be classed as ‘no’.