Case Study

Domain-specific Training Data Generation for SYSTRAN

When the pandemic hit in 2020, TAUS created a starter kit in several languages for training high-quality translation models customized for the pandemic domain. SYSTRAN, a leading AI-based translation technology company, partnered with TAUS and used these datasets to produce twelve translation models for English to/from French, Spanish, German, Italian, Chinese and Russian. The models are available on SYSTRAN Marketplace, where NMT models are offered to a network of language experts to train models in any language pair and domain.
The Client


A pioneer and global leader in translation solutions
Focus on bringing the capabilities of Neural translation to corporate users
The Challenge
Domain-specific machine translation models require a substantial amount of high-quality training data to produce improved, specialized output. A high-quality dataset is one that has gone through a detailed cleaning process and contains all the material the model needs to learn the nuances of the domain.
Because the pandemic, the related healthcare issues and the coronavirus itself had not been widely discussed before, finding clean, high-quality parallel training data in this domain was a great challenge.
The Solution
To overcome the challenge of sourcing these domain-specific training datasets, SYSTRAN partnered with TAUS. TAUS’ expertise in providing domain-specific data through its Matching Data clustered search technology was instrumental in SYSTRAN’s decision to partner with us.

High-performance Search Technology

TAUS owns the largest industry-shared language data repository (part of the TAUS Data Cloud legacy, started in 2008). With TAUS Matching Data high-performance clustered search technology, we transformed vast quantities of parallel language data from different sources and owners into unique, domain-specific corpora tuned to SYSTRAN requirements.

Corpus Optimization and Cleaning

After relevant segments from the TAUS data repository were clustered to match the SYSTRAN data requirements, the TAUS data experts performed further cleaning and manual quality checks to make the data fit for NMT model training.
"There are substantial volumes of information being produced and circulated about the virus, symptoms, new treatments, vaccines and data from all parts of the globe. However, machine translating this content accurately requires specific datasets in medical and scientific domains to build state-of-the-art translation models. To be able to create these models, we partnered with TAUS as it stands out as a longtime player in the language data business with a great deal of expertise."

JP Barraza, CIO at SYSTRAN
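
The case study does not describe the individual checks behind the corpus cleaning step. As a rough illustration only, the Python sketch below shows the kind of generic filters commonly applied to parallel data before NMT training: whitespace normalization, removal of empty or untranslated pairs, a length-ratio check, and deduplication. The function name `clean_parallel_corpus` and the `max_length_ratio` threshold are hypothetical and are not taken from the TAUS or SYSTRAN pipelines.

```python
from typing import Iterable, List, Tuple


def clean_parallel_corpus(
    pairs: Iterable[Tuple[str, str]],
    max_length_ratio: float = 3.0,  # assumed threshold, for illustration only
) -> List[Tuple[str, str]]:
    """Apply simple, generic filters to (source, target) segment pairs."""
    seen = set()
    cleaned = []
    for source, target in pairs:
        # Collapse runs of whitespace and trim both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())

        # Drop pairs where either side is empty.
        if not source or not target:
            continue

        # Drop pairs where the target is just a copy of the source.
        if source == target:
            continue

        # Drop pairs whose character lengths differ implausibly.
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        if longer / shorter > max_length_ratio:
            continue

        # Drop exact duplicates, keeping the first occurrence.
        key = (source, target)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned
```

A character-based length ratio is a crude signal and would need a different threshold for pairs such as English and Chinese. Automatic filters like these complement manual quality checks rather than replace them.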

The Results
3,430,496 Translation units
20 Epochs (training cycles)
18% Average increase in BLEU scores across 12 language pairs
6.37 BLEU points average increase in EN>XX
7.21 BLEU points average increase in XX>EN
Using the data provided by TAUS, SYSTRAN launched 12 NMT models for English to/from French, Spanish, German, Italian, Chinese and Russian on SYSTRAN Marketplace.
After training with the TAUS Corona datasets, the SYSTRAN engines improved by 18% on average across all twelve language pairs compared to the SYSTRAN baseline engines. The accompanying table lists the scores for each language pair, alongside the scores for Google Translate.

The training datasets mentioned above are available in the TAUS Data Library.
