Case Study

Customization of Amazon Translate Active Custom Translation with TAUS Data

Polyglot Technology LLC independently evaluated the quality of machine translation output from Amazon Translate customized with TAUS Data compared with non-customized Amazon Translate. Customization with TAUS Data improved the BLEU score on the test sets for every language pair: by more than 6 BLEU points on average and by at least 2 BLEU points.
The Client

Polyglot Technology LLC

Helps customers succeed with machine translation
Evaluate machine translation quality, enable customers to make best use of translation data, and advise on how to best integrate technology, people and processes
The Case
Online machine translation engines provide easy access to high-quality machine translations. They are optimized for the kinds of content that users of online platforms frequently translate, such as news articles and social media posts.
Businesses often want to translate text with a different style and a specific topic. For enterprise use, online machine translation engines offer customization via sets of pre-existing translations that reflect the desired style and topic. This data is often called “parallel data”.
TAUS makes such customization data available via the TAUS Data Marketplace and TAUS Matching Data platforms, and now also via AWS Marketplace.

TAUS asked Achim Ruopp, owner of Polyglot Technology LLC, to independently evaluate the quality of machine translation from Amazon Translate customized with TAUS Data (using Amazon Translate Active Custom Translation) compared with non-customized Amazon Translate.

The Approach
Human evaluation is the best way to judge whether a machine translation is good. We can ask speakers of the source and target languages, or, better, professional translators, to judge whether a machine translation is an adequate and fluent translation of the original text. Alternatively, we can ask how close the machine translation is to a human reference translation. Human evaluation, however, is slow and hard to scale across language pairs and domains.

The BLEU Score Automatic Metric

Automatic metrics that also use human reference translations have been developed to calculate a numeric score for machine translation quality. For close to 20 years the predominant automatic metric has been BLEU, which measures the similarity of machine translations to human reference translations on a scale from 0 to 1 (or 0 to 100 when expressed as percentages). More details on BLEU and how to interpret it can be found in the section “Interpreting BLEU Scores”.

TAUS Test Set

TAUS selects the machine translation customization data by querying its large repository of high-quality translation data with a domain-specific text. The resulting customization dataset is then split at random into a larger training set for Amazon Translate Active Custom Translation and a smaller test set of 2,000 sentences, which was provided to Polyglot Technology for evaluation with the BLEU score.
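
The random split described above can be sketched as follows. This is a minimal illustration, not the procedure TAUS actually used: the function name `split_train_test`, the fixed seed, and the placeholder sentence pairs are all assumptions made for the example.

```python
import random

def split_train_test(pairs, test_size=2000, seed=0):
    """Randomly split sentence pairs into a training set and a held-out test set."""
    if test_size >= len(pairs):
        raise ValueError("test_size must be smaller than the number of pairs")
    shuffled = list(pairs)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    test = shuffled[:test_size]             # smaller held-out test set
    train = shuffled[test_size:]            # larger training set for customization
    return train, test

# Hypothetical placeholder data: 10,000 (source, target) sentence pairs.
pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(10_000)]
train, test = split_train_test(pairs)
print(len(train), len(test))
```

Keeping the test sentences out of the training set is what makes the resulting BLEU scores a measure of generalization rather than memorization.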

The Results

Summary Evaluation Results

The source language is always English and the target languages are various European languages – in total we evaluated 8 language pairs for the E-Commerce domain, 18 language pairs for the Medical/Pharma domain and 4 language pairs for the Financial domain.

The customization of Amazon Translate with TAUS Data improved the BLEU score for all language pairs (as evaluated with the test sets):

- by more than 6 BLEU points, or 15.3% on average

- by 2 BLEU points at a minimum
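
As a rough illustration of how such summary figures can be derived from per-language-pair scores, the sketch below computes the average and minimum gain in BLEU points and the average relative gain. The scores are made-up placeholders, not results from this evaluation, and averaging the per-pair relative gains is only one possible way to arrive at a percentage; the report does not state which method was used.

```python
def bleu_improvements(scores):
    """scores maps language pair -> (baseline BLEU, customized BLEU) on the 0-100 scale."""
    deltas = {pair: custom - base for pair, (base, custom) in scores.items()}
    relative = {pair: 100.0 * (custom - base) / base
                for pair, (base, custom) in scores.items()}
    return {
        "mean_points": sum(deltas.values()) / len(deltas),           # average absolute gain
        "min_points": min(deltas.values()),                          # smallest absolute gain
        "mean_relative_pct": sum(relative.values()) / len(relative), # average relative gain
    }

# Hypothetical scores for illustration only (not taken from the evaluation).
example_scores = {
    "en-de": (40.0, 46.0),
    "en-fr": (50.0, 52.0),
    "en-es": (30.0, 40.0),
}
summary = bleu_improvements(example_scores)
print(summary)
```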

These are significant improvements that demonstrate the superiority of Amazon Translate Active Custom Translation customized with TAUS Data over non-customized Amazon Translate for the E-Commerce, Medical/Pharma and Financial domains.
The evaluated datasets are also used as a part of the TAUS Data-Enhanced Machine Translation (DEMT) service that offers an end-to-end solution to those who wish to produce customized MT output for their specific domains, without the hassle of going through the actual MT training process.
See figures 1-3 for BLEU scores on the TAUS test data.

For an even more detailed evaluation including analysis of most improved translations, please see the individual evaluation reports for each language pair and domain available on request from sales@taus.net.

Figure 1: BLEU Scores for the TAUS Test Sets for the E-Commerce Domain

Figure 2: BLEU Scores for the TAUS Test Sets for the Medical/Pharma Domain
Figure 3: BLEU Scores for the TAUS Test Sets for the Financial Domain

TAUS Matching Data

When employing machine translation for a specific use case, it is advisable to evaluate the systems with usage-scenario specific source text and its human reference translation. Maybe you already have data from a previous, similar project, or your translation vendor can help you create the test data. Polyglot Technology can assist in implementing a robust evaluation program.
When you go through the effort of compiling use-case-specific data, it is likely worth considering personalized training data from the TAUS Matching Data service. This requires gathering use-case-specific source text that is independent of the test data, a so-called “query set”. The query set can then be used to create highly specific training data using TAUS Matching Data. In many cases this can improve machine translation quality even more than pre-packaged domain training data.
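
One simple way to keep a query set independent of the test data is to drop any query sentence that also appears among the test set's source sentences. The sketch below assumes exact matching after lowercasing and whitespace normalization, which is only one possible notion of independence; the helper names `normalize` and `remove_test_overlap` are hypothetical and not part of any TAUS tool.

```python
def normalize(sentence):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())

def remove_test_overlap(query_sentences, test_sources):
    """Return the query sentences that do not occur in the test set's source side."""
    test_keys = {normalize(s) for s in test_sources}
    return [s for s in query_sentences if normalize(s) not in test_keys]

# Hypothetical example sentences.
test_sources = ["Take one tablet daily.", "Store below 25 degrees."]
query_sentences = [
    "take one  tablet daily.",          # duplicate of a test sentence after normalization
    "Do not exceed the stated dose.",
]
query_set = remove_test_overlap(query_sentences, test_sources)
print(query_set)
```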

Interpreting BLEU Scores

The paragraphs in this section are adapted from Google AutoML Translate's documentation page on evaluation, which is licensed under the Creative Commons Attribution 4.0 License.

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations. A value of 0 means that the machine-translated output has no overlap with the reference translation (low quality) while a value of 1 means there is perfect overlap with the reference translations (high quality).
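
To make the idea of overlap concrete, here is a simplified sketch of a corpus-level BLEU computation. It assumes one reference per sentence, whitespace tokenization, n-grams up to length 4, and no smoothing, so its numbers will not match those of standard evaluation tools and it is not the implementation used in this evaluation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-1 scale), one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any n-gram order without a match gives 0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

reference = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], reference))        # identical output
print(corpus_bleu(["the cat sat on the mat today"], reference))  # partial overlap
print(corpus_bleu(["completely different words here"], reference))  # no overlap
```

Multiplying the result by 100 gives the percentage-style scores used in the results above.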

It has been shown that BLEU scores correlate well with human judgment of translation quality. Note that even human translators do not achieve a perfect score of 1.0 (for the reason that a source sentence can have several valid, equally appropriate translations).


Trying to compare BLEU scores across different corpora and languages is strongly discouraged. Even comparing BLEU scores for the same corpus but with different numbers of reference translations can be highly misleading.
However, as a rough guideline, the following interpretation of BLEU scores (expressed as percentages rather than decimals) might be helpful.
The following color gradient can be used as a general scale interpretation of the BLEU score: