Powering Language Data
2 Sep 2019
TAUS has now transitioned from a think tank to a language data network. Today we offer a comprehensive range of data services and software for our many users. Working on the launch of the TAUS Program for 2020 we have chosen Powering Language Data as our theme for the new season.

It appears that massive-scale machine translation is possible, if only we have access to large volumes of data in all languages and domains. Isn’t it time to bridge this gap and unleash the power of all our language data?

Human Genome Project

Data constitute the amazing, almost unreasonable power behind many technology breakthroughs of the recent past. The most striking example is the Human Genome Project.

In 2003, a large international project led by institutions in the USA and Europe resulted in the complete sequencing of the human genome. It took thirteen years and cost 2.7 billion dollars to undertake this mega-data-project of documenting all three billion chemical units in the human genetic instruction set. To say that it was worth it is a gross understatement. The full mapping of our DNA represents a huge milestone in human history: it has opened the way to curing diseases, extending our lives and even thinking about reproducing life (putting aside the moral implications).

Massive-scale Machine Translation

In 2009, three years after the launch of Google Translate, three Google researchers (Alon Halevy, Peter Norvig, and Fernando Pereira) wrote an article entitled The Unreasonable Effectiveness of Data. They reported on the consistent gains in BLEU points with every extra billion words added to the training data, regardless of whether the data contained ‘noise’ such as translation errors.

Now, ten years later, a Google research team has published a new article: Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. They report on their efforts to build a universal Neural Machine Translation system that translates between every pair of languages. What’s needed to make this magic happen is a combination of algorithms and data. The data setup takes a prominent position in this article. For their experiments, the Google team used more than 25 billion parallel training examples in 103 languages that they crawled from the ‘wild’ web. The challenges they report on include the wide imbalance in data volumes across languages and domains, the inevitable dataset noise (bad quality) that comes from relying on ‘raw’ web-crawled data, topic-style discrepancies, and differing degrees of linguistic similarity. Despite all these data issues, they describe the results as very encouraging. What if they could have run their experiments on a good-quality multilingual corpus evenly spread over the 103 languages and the domains covered in their tests? We can only wonder.

Desilofication of Data

The same question should be asked every time a translation is required: what if we had access to ever more good-quality language data, neatly tuned to our domain? The answer, as we all know, is that fully automatic translation would jump in quality and performance. The data thirst of modern MT systems is unreasonable and almost insatiable. The big tech companies don’t need to be convinced. But what about the ‘insiders’ in the translation and global content industry: the tens of thousands of language service providers and their customers? Are they sufficiently aware of the data revolution that is set to disrupt their business in the next few years?

It’s not that there is a lack of data. Every company ‘sits’ on a mountain of language data in translation memories and content management systems. The problem is that the data are locked up in legacy formats and templates that make them hard to access and hard to use in modern machine translation scenarios. Over the past few decades, companies have stored and organized language data under their own project and product labels, typically without applying the hygiene of cleaning the data and controlling versions and terminology.

What is lacking is a sense of urgency. Every stakeholder in the translation and global content industry should know by now that in order not to be left behind they need to start working on the desilofication and transformation of their language data.

Data is in our DNA

In December 2017 TAUS published the Nunc Est Tempus book, providing a blueprint for a complete redesign of the translation business. Core to this book was the creation of a feedback loop for data that provide the metrics and intelligence that empower translation automation. But we realized that to make things happen, we needed to do more. So at the end of 2018, we issued a call for action, a ‘manifesto’, under the title Fixing the Translation Ecosystem. We identified three areas where action was required. One: raise education and awareness and upgrade skill levels (the “Knowledge Gap”). Two: align and standardize our metrics so we can measure our output consistently (the “Operational Gap”). Three: transform our legacy data and facilitate access to data (the “Data Gap”). Of these three, the Data action is the most critical, as evidenced once again in the massive-scale machine translation story from Google.

At TAUS, data is in our DNA. We were the first back in 2008, in the early years of MT adoption, to come out with an industry-shared data repository to help our users improve the performance of their MT engines. Data is also core to the TAUS DQF platform that helps our users track translation production and quality.

Walking the Walk

TAUS has now transitioned from a think tank to a language data network. We don’t just talk the talk, we also walk the walk. Today we offer a comprehensive range of data services and software for our many users. Working on the launch of the TAUS Program for 2020 we have chosen Powering Language Data as our theme for the new season. This covers both managing existing data better, and also reaching out to build effective large-scale data resources for “new” languages (in machine translation usage terms).

In January of this year we launched the Matching Data Service. Using a unique clustered search technique, we customize corpora based on a reference dataset provided by the customer. In Q4 we will make an API available that allows users to integrate their translation workflows with the TAUS Matching Data Service.

Another critical new product and service is the Human Language Project Platform, the TAUS version of a human intelligence micro-task platform. This can be used by customers to build corpora for low-resource languages and domains, and for other language data-related tasks. The first deliverables will be a set of TAUS Colloquial training corpora in Indic languages such as Hindi, Assamese and Tamil.

Join TAUS 2020 Launch Webinar on October 9

If you want to find out more about how TAUS can help power your language data, listen to the recording of the TAUS 2020 Launch Webinar.


Jaap van der Meer founded TAUS in 2004. He is a language industry pioneer and visionary, who started his first translation company, INK, in The Netherlands in 1980. Jaap is a regular speaker at conferences and author of many articles about technologies, translation and globalization trends.
