Case Study
Google, Naver Labs and University of Catalonia Use Cases for TAUS Corona Dataset

Powering Automated Translation in Times of Crisis

In an effort to help battle the corona crisis from a language and information access perspective, TAUS coordinated an industry collaboration effort to gather translation memories covering this domain. The result was six datasets containing a total of 3,403,681 segments in the following language pairs: English-French, English-German, English-Spanish, English-Italian, English-Russian, and English-Chinese.
The Client


Naver Labs

University of Catalonia

The Challenge

1. Information Availability

Understanding the social rules and regulations along with medical advice has been a key aspect of the global fight against the spread of the virus. It has therefore been vital that information such as medical data, expert findings, and guidelines were both readily available and accurate in the languages that people best understand.

2. Sufficient Language Data

Machine translation (MT) is an important technology in the event of a crisis. When integrated into the rapid-response communication plans, MT increases not only the speed with which the information is passed on, but also the language coverage. The one precondition is that there must be enough data available on the topic at hand.

3. In-Domain Data

Translating substantial volumes of content accurately and quickly requires specific data, including medical and scientific terminology, for both human and machine translation. Yet COVID-19 was a brand-new domain and few had access to translation data in relevant domains, which created a big challenge.
The Solution
TAUS made a call for industry collaboration to gather as much translation data as possible in relevant domains such as medical, virology, epidemic, and healthcare, in as many language pairs as possible. On top of that, TAUS applied its Matching Data technology to the TAUS Data Cloud and ParaCrawl data pools to create custom corpora in the given domains. ModelFront helped filter the corpora further by removing misaligned or bad translations. SYSTRAN then contributed to the initiative by producing Corona Crisis Translation Models in 12 language pairs, based on quality parallel data provided by TAUS.
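
To give a rough idea of what filtering out misaligned or bad translations can involve, here is a minimal Python sketch. It is not the method used by ModelFront or TAUS, which the case study does not describe; it only applies a few common heuristics (empty segments, untranslated copies, implausible length ratios, exact duplicates). The function name `filter_pairs` and the thresholds `max_ratio` and `min_chars` are hypothetical choices made for illustration.

```python
def filter_pairs(pairs, max_ratio=2.5, min_chars=3):
    """Keep (source, target) segment pairs that pass simple sanity checks.

    Illustrative heuristics only; thresholds are assumptions, not values
    taken from the case study.
    """
    kept = []
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop empty or near-empty segments.
        if len(src) < min_chars or len(tgt) < min_chars:
            continue
        # Drop pairs where the "translation" is a copy of the source.
        if src == tgt:
            continue
        # Drop pairs whose lengths differ too much to be plausible translations.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        # Drop exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


sample = [
    ("Wash your hands often.", "Lavez-vous souvent les mains."),
    ("Stay home.", "Restez chez vous et ne sortez que pour les achats essentiels de la semaine."),
    ("", "Bonjour"),
    ("COVID-19", "COVID-19"),
    ("Wash your hands often.", "Lavez-vous souvent les mains."),
]
clean = filter_pairs(sample)
```

In this sample only the first pair survives: the second is rejected by the length ratio, the third is empty on the source side, the fourth is an untranslated copy, and the fifth is a duplicate. Production filtering typically relies on trained quality models rather than rules like these.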
Use Cases & Results

Rapid Domain Adaptation for Machine Translation with Monolingual Data for Google

Orhan Firat from Google Research explains that one challenge of machine translation is adapting quickly to unseen domains in the face of surging events like COVID-19, when timely and accurate translation of in-domain information into multiple languages is critical but little parallel data is available. The team used these corpora in research on rapid domain adaptation for such settings.

Training of Neural Machine Translation Systems by the Open University of Catalonia

Antoni Oliver González, Director of the Translation and Technologies Postgraduate Degree Course, explains that they have been using these corpora along with other available medical corpora and glossaries to train neural machine translation systems. These systems are used to translate abstracts of scientific papers about COVID-19.

Development of a Multilingual Neural Machine Translation Model for Biomedical Data by Naver Labs Europe

Vassilina Nikoulina from the Naver Labs Europe Natural Language Processing Group explains that they have used these corpora in a multilingual, multi-domain neural machine translation model that is specialized for biomedical data and enables translation into English from five languages (French, German, Italian, Spanish, and Korean). The TAUS Corona Crisis Corpora were used in combination with other corpora.
