
Overview

Data are in great demand in every industry these days, and the translation industry is no exception. Translation memory data have been the translator’s best resource for ensuring consistency and enhancing productivity since the early nineties. Yet the use of translation memory data has always remained ‘private’: using translation memories outside your own company, practice or product not only raised issues of data ownership and confidentiality, it also offered no real benefit. Translation memory technology was, in essence, an invention from and for translators.
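
To make the core idea concrete, the sketch below shows how a translation memory lookup typically works: a new source segment is compared against stored source segments, and the closest ‘fuzzy’ matches are offered to the translator for reuse. The sample segments, the similarity measure (difflib’s ratio) and the match threshold are illustrative assumptions, not a description of any particular CAT tool.

```python
from difflib import SequenceMatcher

# A toy translation memory: (source segment, target segment) pairs.
# The segments below are invented purely for illustration.
translation_memory = [
    ("Press the power button to start the device.",
     "Appuyez sur le bouton d'alimentation pour démarrer l'appareil."),
    ("The warranty covers parts and labour for two years.",
     "La garantie couvre les pièces et la main-d'œuvre pendant deux ans."),
]

def fuzzy_matches(new_segment, memory, threshold=0.75):
    """Return stored pairs whose source side is similar to new_segment."""
    hits = []
    for source, target in memory:
        score = SequenceMatcher(None, new_segment.lower(), source.lower()).ratio()
        if score >= threshold:
            hits.append((score, source, target))
    return sorted(hits, reverse=True)

# A translator typing a near-repetition of an earlier segment gets it back
# as a high-scoring fuzzy match, ready to edit instead of retranslate.
for score, source, target in fuzzy_matches(
        "Press the power button to start the printer.", translation_memory):
    print(f"{score:.0%} match: {source} -> {target}")
```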

This changed with the breakthrough of Statistical Machine Translation, first demonstrated by IBM in the early nineties, which became widely adopted with the arrival of the open-source Moses SMT engine and the opening of the translation market to machine translation around 2010. Google Translate launched in 2006, quickly became popular, and was joined in quick succession by Microsoft’s Bing Translator and the Yandex and Baidu translators. Together, these events set off a ‘hunt’ for translation data.

The internet companies simply started to crawl the web and align text from translated websites. The billions of words they harvested served their purpose very well. These findings are well described in the paper The Unreasonable Effectiveness of Data, published by Google scientists Alon Halevy, Peter Norvig and Fernando Pereira in 2009.
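
As a rough illustration of the harvesting step, the sketch below pairs up sentences from an English page and its French translation using a simple length-based heuristic, a much-simplified cousin of Gale-Church alignment. The sample sentences and the length-ratio threshold are assumptions for illustration; production pipelines use far more robust crawling, boilerplate removal and alignment methods.

```python
# Minimal sketch: align sentences from two already-extracted page texts
# by checking that their lengths are comparable. Example sentences are invented.
english = [
    "Our company was founded in 1998.",
    "We ship to more than forty countries.",
    "Contact our support team for help.",
]
french = [
    "Notre société a été fondée en 1998.",
    "Nous livrons dans plus de quarante pays.",
    "Contactez notre équipe d'assistance pour obtenir de l'aide.",
]

def align_by_length(source_sentences, target_sentences, max_ratio=1.8):
    """Walk both lists in parallel and keep pairs whose lengths are comparable."""
    bitext = []
    for src, tgt in zip(source_sentences, target_sentences):
        ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
        if ratio <= max_ratio:  # lengths are similar enough to trust the pair
            bitext.append((src, tgt))
    return bitext

for src, tgt in align_by_length(english, french):
    print(f"{src}  |||  {tgt}")
```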

New commercial MT companies entered the market, working with the translation data provided by their customers. However, they often found that these data were in short supply and fell back on similar techniques of crawling, curating and ‘manufacturing’ data.

The recent breakthrough in machine translation technology - the arrival of neural networks - changes the perspective on data. The need for very large quantities of data belongs to the past: Neural MT thrives on smaller volumes of high-quality translation data. The first English-French Google Translate engine was trained on a bilingual corpus of 100 billion words. With Neural MT, a bitext of 10 million words can work wonders, and for specific domains even smaller volumes will suffice.
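
To illustrate what working with ‘smaller volumes of high-quality data’ can look like in practice, the sketch below fine-tunes an off-the-shelf neural MT model (a Marian English-French model from the Hugging Face transformers library, assuming a recent version) on a handful of in-domain segment pairs. The sample pairs, learning rate and epoch count are assumptions for illustration; a real domain adaptation run would use far more data and a held-out test set.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

# Hypothetical in-domain bitext; in practice this would be the customer's
# translation memory exported as aligned segment pairs.
pairs = [
    ("Replace the filter cartridge every three months.",
     "Remplacez la cartouche filtrante tous les trois mois."),
    ("Do not operate the pump without coolant.",
     "Ne faites pas fonctionner la pompe sans liquide de refroidissement."),
]

model_name = "Helsinki-NLP/opus-mt-en-fr"  # pre-trained generic English-French model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for epoch in range(3):  # a few passes over the tiny in-domain corpus
    for source, target in pairs:
        batch = tokenizer(source, text_target=target,
                          return_tensors="pt", truncation=True)
        loss = model(**batch).loss  # cross-entropy against the reference translation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: last loss {loss.item():.3f}")

# After fine-tuning, translate a new in-domain sentence.
model.eval()
inputs = tokenizer("Check the coolant level daily.", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```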

This white paper gives an overview of the issues, the players and the benefits associated with the sharing of translation data. It also offers a way to capitalize on translation data that benefits all stakeholders and creates a level playing field.

Authors: Andrew Joscelyne, Jaap van der Meer, Achim Ruopp and Anna Samiotou
