Powering Language Data

TAUS has now transitioned from a think tank to a language data network. Today we offer a comprehensive range of data services and software for our many users. Working on the launch of the TAUS Program for 2020 we have chosen Powering Language Data as our theme for the new season.

It appears that massive-scale machine translation is possible, if only we had access to large volumes of data in all languages and domains. Isn’t it time to bridge this data gap and unleash the power of all our language data?

Human Genome Project

Data constitute the amazing, almost unreasonable power behind many technology breakthroughs of the recent past. The most striking example is the Human Genome Project.

In 2003 a large international project led by institutions in the USA and Europe completed the sequencing of the human genome. It took thirteen years and cost 2.7 billion dollars to undertake this mega-data-project of documenting all three billion chemical units in the human genetic instruction set. To say that it was worth it is a gross understatement. The full mapping of our DNA represents a huge milestone in human history: it opens the way to curing diseases, extending our lives and even thinking about reproducing life (putting aside the moral implications).

Massive-scale Machine Translation

In 2009, three years after the launch of Google Translate, three Google researchers (Alon Halevy, Peter Norvig, and Fernando Pereira) wrote an article entitled The Unreasonable Effectiveness of Data. They reported on the consistent gains in BLEU points with every extra billion words added to the training data, regardless of whether the data contained ‘noise’ such as translation errors.

Now, ten years later, a Google research team has published a new article: Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. They report on their efforts to build a universal Neural Machine Translation system that translates between every pair of languages. What’s needed to make this magic happen is a combination of algorithms and data, and the data set-up takes a prominent position in the article. For their experiments, the Google team used more than 25 billion parallel sentence pairs in 103 languages that they crawled from the ‘wild’ web. The challenges they report include the wide imbalance in data volumes across languages and domains, the inevitable dataset noise (bad quality) that comes from relying on ‘raw’ web-crawled data, topic and style discrepancies, and differing degrees of linguistic similarity. Despite all these data issues, they describe the results as very encouraging. What if they could have run their experiments on a good quality multilingual corpus evenly spread over the 103 languages and the domains covered in their tests? We can only wonder.
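To make the imbalance problem concrete, here is a minimal sketch of temperature-based sampling, a technique widely used in multilingual MT training to stop high-resource languages from dominating. The function name and the corpus sizes are illustrative assumptions; they are not taken from the Google article.

```python
def sampling_probabilities(sizes, temperature=5.0):
    """Return per-language sampling probabilities.

    sizes: dict mapping a language code to its number of sentence pairs.
    temperature: 1.0 reproduces the raw data proportions; higher values
    flatten the distribution toward uniform. The default is illustrative.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    # Raise each language's share of the data to the power 1/T ...
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    # ... then renormalize so the probabilities sum to 1.
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Made-up corpus sizes for one high-resource and one low-resource language.
corpus_sizes = {"fr": 1_000_000, "hi": 10_000}
proportional = sampling_probabilities(corpus_sizes, temperature=1.0)
flattened = sampling_probabilities(corpus_sizes, temperature=5.0)
```

With a temperature of 1 the low-resource language is sampled in proportion to its tiny share of the data; raising the temperature gives it a larger share of the training batches, at the cost of repeating its sentences more often.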

Desilofication of Data

The same question should be asked every time a translation is required: what if we had access to ever more good quality language data, neatly tuned to our domain? The answer, as we all know, is that fully automatic translation would jump in quality and performance. The data thirst of modern MT systems is unreasonable and almost insatiable. The big tech companies don’t need to be convinced. But what about the ‘insiders’ in the translation and global content industry: the tens of thousands of language service providers and their customers? Are they sufficiently aware of the data revolution that is set to disrupt their business in the next few years?

It’s not that there is a lack of data. Every company ‘sits’ on a mountain of language data in translation memories and content management systems. The problem is that the data are locked up in legacy formats and templates that make them hard to access and of limited use in modern machine translation scenarios. Over the past few decades companies have stored and organized language data under their own project and product labels, typically without the hygiene of cleaning the data and controlling versions and terminology.
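As an illustration of what basic data hygiene can look like, here is a minimal sketch that cleans a list of translation memory segments. The function names, the rules and the length-ratio threshold are all assumptions chosen for the example, not a description of any TAUS tool.

```python
import unicodedata


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_segments(pairs, max_length_ratio=3.0):
    """Return cleaned, de-duplicated (source, target) pairs.

    max_length_ratio is an illustrative threshold, not a recommended value.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue  # drop segments with an empty side
        if source == target:
            continue  # likely untranslated (may also drop names kept as-is)
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        if longer / shorter > max_length_ratio:
            continue  # suspicious length mismatch, possibly misaligned
        key = (source.casefold(), target.casefold())
        if key in seen:
            continue  # duplicate once case and whitespace are ignored
        seen.add(key)
        cleaned.append((source, target))
    return cleaned


# Hypothetical sample segments.
raw = [
    ("Hello  world", "Bonjour le monde"),
    ("Hello world", "Bonjour le monde"),
    ("", "Vide"),
    ("OK", "OK"),
    ("Hi", "Ceci est une phrase beaucoup trop longue"),
]
result = clean_segments(raw)
```

Real pipelines would add language identification, terminology checks and version control on top of simple filters like these.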

What is lacking is a sense of urgency. Every stakeholder in the translation and global content industry should know by now that in order not to be left behind they need to start working on the desilofication and transformation of their language data.

Data is in our DNA

In December 2017 TAUS published the Nunc Est Tempus book, providing a blueprint for a complete redesign of the translation business. Core to this book was the creation of a feedback loop for data that provide the metrics and intelligence that empower translation automation. But we realized that to make things happen, we needed to do more. So at the end of 2018, we issued a call for action, a ‘manifesto’, under the title Fixing the Translation Ecosystem. We identified three areas where action was required. One: raise education and awareness and upgrade skill levels (the “Knowledge Gap”). Two: align and standardize our metrics so we can measure our output consistently (the “Operational Gap”). Three: transform our legacy data and facilitate access to data (the “Data Gap”). Of these three, the Data Gap is the most critical, as evidenced once again in the Massive Machine Translation story from Google.

At TAUS, data is in our DNA. We were the first back in 2008, in the early years of MT adoption, to come out with an industry-shared data repository to help our users improve the performance of their MT engines. Data is also core to the TAUS DQF platform that helps our users track translation production and quality.

Walking the Walk

We don’t just talk the talk, we also walk the walk. Our theme for the new season, Powering Language Data, covers both managing existing data better and reaching out to build effective large-scale data resources for languages that are “new” in machine translation usage terms.

In January of this year we launched the Matching Data Service. Using a unique clustered search technique, we customize corpora based on a reference dataset provided by the customer. In Q4 we will make an API available that allows users to integrate their translation workflows with the TAUS Matching Data Service.
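The general idea of selecting data that matches a reference dataset can be sketched as follows. This is not the TAUS clustered search technique, which is not described here; it is a deliberately simple stand-in that ranks candidate segments by bag-of-words cosine similarity to the reference set. All names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_matching(candidates, reference, top_k=2):
    """Return up to top_k candidates most similar to the reference set."""
    # Build one word-frequency profile for the whole reference dataset.
    profile = Counter()
    for sentence in reference:
        profile.update(tokenize(sentence))
    scored = [(cosine(Counter(tokenize(s)), profile), s) for s in candidates]
    # Sort by score only; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    # Candidates with no vocabulary overlap are never selected.
    return [s for score, s in scored[:top_k] if score > 0]


reference = [
    "the patient received a dose of the drug",
    "drug dosage for the patient",
]
candidates = [
    "the patient was given the drug",
    "stock markets fell sharply today",
    "dosage instructions for each patient",
]
selected = select_matching(candidates, reference, top_k=2)
```

In this toy run the two medical sentences are selected and the financial one is left out, because it shares no vocabulary with the reference set.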

Another critical new product and service is the Human Language Project Platform, the TAUS version of a human intelligence micro-task platform. This can be used by customers to build corpora for low-resource languages and domains, and for other language data-related tasks. The first deliverables will be a set of TAUS Colloquial training corpora in Indic languages such as Hindi, Assamese and Tamil.

Join TAUS 2020 Launch Webinar on October 9

If you want to find out more about how TAUS can help power your language data, listen to the recording of the TAUS 2020 Launch Webinar.


Jaap van der Meer founded TAUS in 2004. He is a language industry pioneer and visionary, who started his first translation company, INK, in The Netherlands in 1980. Jaap is a regular speaker at conferences and author of many articles about technologies, translation and globalization trends.
