2 November 2017, Mountain View, CA (USA) hosted by Google

Capitalizing on Language Data




Data are in great demand in every industry these days, and the translation industry is no exception. Translation memory data have been the translator’s best resource for ensuring consistency and enhancing productivity since the early nineties. The use of translation memory data, however, always remained ‘private’. Using translation memories outside your own company, practice or product not only raised issues of data ownership and confidentiality; it was also clear that no real benefits could be obtained. Translation memory technology was in essence an invention by and for translators.

This changed with the breakthrough of Statistical Machine Translation (SMT), first demonstrated by IBM in the early nineties. SMT became really popular with the arrival of the open-source Moses SMT engine and the opening up of the translation market to machine translation around 2010. Google Translate launched in 2006, quickly became popular, and was soon joined by Microsoft’s Bing Translator and the Yandex and Baidu translators. Together, these events caused a ‘hunt’ for translation data.

The latest breakthrough in machine translation technology, the arrival of neural networks, changes the perspective on data. Very large quantities of data, often harvested from the web, belong to the past. Neural MT thrives on smaller volumes of high-quality language data. New rules apply in this new reality. We are evolving from the private use of data to a level playing field where everyone can benefit: a non-zero-sum game.

Objective of the Data Summit

The TAUS Data Summit will bring owners and producers of language data together with the ‘power users’ and MT developers to learn from each other and to find common ground and ways to collaborate. The objective of the one-day event is to define an open market for language data in which all parties can benefit.


Topics for Discussion

The agenda contains the following items:

  1. Introductory presentations from a selection of the participating companies, outlining use cases and opportunities and challenges in managing and sharing language data.
  2. Pricing language data. The non-existence of a language data market means that we have no experience with setting prices for data. How do we set prices? Should prices differ depending on language, domain, type, quality, format?
  3. Quality of language data. How do we determine and define the quality of language data? Are there tools to automate quality assessment? Is peer review working?
  4. Type of data. What type of data do we need? Do we need to extend the Data Market to also aggregate and exchange speech data?
  5. Business rules for sharing and trading of data. Exchange of ideas for the business rules. What’s in it for the translator? And for the language service provider? And for the translation customer?
  6. Clarifying copyright on language data. How does existing copyright legislation apply to the Data Market? How do our Terms of Use relate to daily practices and behaviors of sharing?
  7. Features and APIs of the Data Market. The new Data Market opens up many opportunities for features and APIs that can be developed. A blue-sky brainstorming session will help us identify them.

For background information on these topics we suggest you download and read the Data Market White Paper that was published in June 2017.


Who should participate?

The TAUS Data Summit welcomes everyone who is concerned with the importance and value of language data, both producers and users. The producers include everyone in the translation supply chain:

  • Translation buyers that may have their in-house production resources or outsource to translation vendors;
  • Language service providers with internal translation resources and external freelance resources;
  • Independent language professionals who provide translation, transcreation, editing, post-editing and consultancy services (ask TAUS for special freelance registration fees);
  • Power users, which in our definition are the companies that have internal translation and language technology resources;
  • Dedicated MT and language technology development companies. 

Human Language Project

TAUS and its members created a data-sharing platform as early as 2008. In 2012 the TAUS Data Cloud was revamped as the Human Language Project, pursuing the vision that ‘democratic’ access to all human language data would ultimately help us to break down language barriers. Between 2012 and 2016 the TAUS Data Cloud grew significantly in volume, and it currently holds 73 billion words in 2,300 language pairs. This forms the foundation for a further modernization of the TAUS Data Cloud. TAUS members would like to pursue a strategy that transforms the data platform into a Data Market, one that helps both users and producers of data to capitalize on their data and creates a level playing field.

Focus and format of the Summit

The focus of the summit is on the challenges and opportunities of managing and sharing language data in the translation industry. It will be a highly interactive meeting, with short presentations of use cases and experiences, followed by discussions with the participants. The discussions will be moderated by TAUS. The aim of the meeting is to deliver concrete and useful results. We plan to draft a roadmap of features and business rules for the new TAUS Data Market based on the discussions.

Data Market Virtuous Cycle

With the TAUS Data Market we want to create a virtuous cycle that benefits both providers and users of language data. The opportunity to earn royalties on language data encourages data providers to participate in the Data Market, which increases data coverage for more languages and domains. The resulting improvement in MT quality increases MT usage and generates more demand for data, which in turn attracts even more data providers.


Participants are requested to prepare a short presentation of at most three slides, giving an overview of their organization and the challenges and opportunities they face with regard to language data, machine translation and quality management. We also suggest you read the Data Market White Paper that was published in June 2017.

Program Committee

The program committee for the TAUS Data Summit consists of: 

  • Mengmeng Niu, Program Manager Google Translate, Google
  • Jose Sanchez, Manager - MT Language Specialists, eBay
  • Nick Lambson, Automation Driven Localization Engineer, MediaLocate
  • ....



The TAUS Data Summit will be hosted by Google at its Mountain View campus. The address is:

1600 Amphitheatre Parkway
Mountain View, CA 94043
United States