Case Study

Improving Adaptive MT Outputs by an Average of 22% in BLEU Scores Across Five Languages

TAUS provided language data for Pangeanic, a leading European NLP and translation services company, to train its machine translation models for the COVID-19 pandemic and healthcare domain.
The Client

Pangeanic

A global leader in natural language processing (NLP)
Combines AI with human ingenuity to extract value from data in a scalable way
The Challenge
Finding high-quality data for MT training has always been a challenge on the path to generating high-performing MT output. The challenge increases when the language pairs are rare or when training data in a lesser-known domain is needed.
The global COVID-19 pandemic brought a previously niche domain into the spotlight. To enable fast, accurate, and automated translation of vital information on this topic, training datasets covering the pandemic, COVID-19, viral illnesses, and healthcare were required.
Machine translation systems perform well at automated translation when fed with the right training data. However, the high-quality datasets needed to build specialized automatic tools for this new topic in the healthcare domain were hard to find.
The Solution
TAUS’ expertise in domain-specific training data collection and creation was instrumental in Pangeanic’s decision to partner with us.
"TAUS stands out because of their capabilities in the language data space, but we were also impressed by their expertise in fine-tuning the datasets to match the exact domain requirements. Using the datasets provided by TAUS, we’ve run experiments in English to Spanish, German, Polish, Russian, and Chinese language pairs for the pandemic and healthcare domain"

Mercedes García-Martínez, Chief Research Scientist at Pangeanic

The Data

Data Collection

TAUS provided Pangeanic with a total of 1.8 million words of MT training data for the English to Spanish, German, Polish, Russian, and Chinese language pairs.

Data Selection

The translation units (TUs) in the pandemic and healthcare dataset provided by TAUS were sorted by their relevance to the domain. The most relevant TUs were placed at the top of the file, which runs from strictly coronavirus-related TUs to more general TUs in the pandemic and healthcare domain. This ordering allowed the customer to filter the data according to how narrowly or broadly they wanted their MT engines to cover the domain. Pangeanic made use of this ordering: after performing automatic cleaning, they carried out manual checks to choose the TUs most closely related to coronavirus for their training purposes.
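Because the file is ordered by relevance, narrowing or broadening the training set amounts to taking a shorter or longer prefix of it. The following is a minimal Python sketch of that idea, not the TAUS or Pangeanic tooling. It assumes the TUs are already loaded as (source, target) pairs in the file's order; the `select_top_tus` function, its `fraction` parameter, and the sample segments are hypothetical.

```python
from typing import List, Tuple

TranslationUnit = Tuple[str, str]  # (source segment, target segment)


def select_top_tus(tus: List[TranslationUnit], fraction: float) -> List[TranslationUnit]:
    """Keep the leading share of a relevance-sorted TU list.

    Assumes `tus` is ordered most-relevant first, as in the delivered file.
    `fraction` (hypothetical parameter) is the share of the file to keep.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the range (0, 1]")
    if not tus:
        return []
    cutoff = max(1, int(len(tus) * fraction))  # always keep at least one TU
    return tus[:cutoff]


# Illustrative placeholder segments, not taken from the TAUS dataset.
example_tus = [
    ("coronavirus test", "prueba de coronavirus"),
    ("viral infection", "infección viral"),
    ("hospital ward", "sala de hospital"),
    ("health insurance", "seguro de salud"),
]

narrow_set = select_top_tus(example_tus, 0.5)  # strictly domain-specific engine
broad_set = select_top_tus(example_tus, 1.0)   # broader healthcare coverage
```

A smaller fraction yields a more narrowly specialized training set; manual checks, as Pangeanic performed, would still follow this step.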
The Results
Using the data provided by TAUS, Pangeanic built COVID-19 domain-specific neural machine translation (NMT) models for the five language pairs on the user-friendly Pangeanic ECO customer portal, where users can adapt models using three levels of training settings.
The three levels of aggressivity are a proprietary Pangeanic technology that flexibly trains deep learning algorithms. Users can choose to simply add data and retrain the engine in the usual way, as other ML companies do (conservative), to prioritize the new data (normal), or to change learning rates very deeply (aggressive). In the “aggressive mode” of Deep Adaptive Machine Translation, engines learn from the incoming material at much faster rates than with traditional “addition” or “prioritization”, which results in higher parity rates.
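The internals of the three levels are proprietary and are not described in this case study. Purely to illustrate the distinction drawn above, the sketch below models each level with two hypothetical knobs: how heavily new data is weighted in the training mix, and how much the base learning rate is scaled. Every name and number here is an assumption made for illustration, not Pangeanic's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class AdaptationLevel(Enum):
    CONSERVATIVE = "conservative"  # add the data and retrain as usual
    NORMAL = "normal"              # prioritize the new data
    AGGRESSIVE = "aggressive"      # change learning rates strongly


@dataclass(frozen=True)
class AdaptationSettings:
    new_data_weight: float      # hypothetical: sampling weight of new data
    learning_rate_scale: float  # hypothetical: multiplier on the base rate


# Placeholder values chosen only to show the ordering of the three levels.
SETTINGS = {
    AdaptationLevel.CONSERVATIVE: AdaptationSettings(1.0, 1.0),
    AdaptationLevel.NORMAL: AdaptationSettings(3.0, 1.0),
    AdaptationLevel.AGGRESSIVE: AdaptationSettings(3.0, 5.0),
}


def effective_learning_rate(base_lr: float, level: AdaptationLevel) -> float:
    """Return the learning rate a given level would use in this toy model."""
    return base_lr * SETTINGS[level].learning_rate_scale
```

In this toy model only the aggressive level touches the learning rate, mirroring the description that it makes engines learn from incoming material faster than addition or prioritization alone.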
22% average increase in BLEU scores
7% average improvement in TER scores
8.6% average increase in ChrF scores
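The case study does not state how these percentages were computed. One common convention is the relative change against the baseline score, with the sign flipped for TER because it is an error rate where lower is better. The sketch below shows that convention only; the `relative_improvement` function and the sample scores are hypothetical placeholders, not Pangeanic's measurements.

```python
def relative_improvement(baseline: float, adapted: float,
                         higher_is_better: bool = True) -> float:
    """Relative improvement of `adapted` over `baseline`, in percent.

    For metrics where lower is better (such as TER), a drop in the score
    is reported as a positive improvement.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    change = (adapted - baseline) * 100.0 / baseline
    return change if higher_is_better else -change


# Placeholder scores for illustration only.
bleu_gain = relative_improvement(40.0, 50.0)                         # BLEU rose
ter_gain = relative_improvement(60.0, 45.0, higher_is_better=False)  # TER fell
```

Under this convention both calls return a positive number, so an improvement in TER and an increase in BLEU can be reported on the same scale.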

Language Specific Results

The highest BLEU score improvement was recorded for the English > Russian language pair at 50%, followed by English > Chinese at 26%, English > German at 20%, English > Spanish at 9%, and English > Polish at 8%.
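As a quick consistency check, averaging the five per-language figures quoted above gives a value close to the headline average of 22%. The small gap is plausible if the headline figure was derived from unrounded scores, which is an assumption here rather than something the case study states.

```python
from statistics import mean

# BLEU improvements (%) as quoted in the text, English > target language.
bleu_improvement = {
    "Russian": 50,
    "Chinese": 26,
    "German": 20,
    "Spanish": 9,
    "Polish": 8,
}

average_improvement = mean(list(bleu_improvement.values()))
best_language = max(bleu_improvement, key=bleu_improvement.get)

print(f"Average: {average_improvement:.1f}%")  # prints "Average: 22.6%"
print(f"Largest gain: English > {best_language}")
```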

Quality Analysis

A quality analysis was also carried out by comparing translation examples from the base model and the aggressive model. COVID-19-specific words were examined to check how the model had adapted. The analysis found that in all cases the adaptive model provides more accurate translations and handles different linguistic challenges better after training with the datasets provided by TAUS. Here are some examples of the quality analysis on the translations: