The datafication of translation started with ‘The Unreasonable Effectiveness of Data’, the article written by the Google scientists Alon Halevy, Peter Norvig and Fernando Pereira in 2009, or perhaps even earlier, when the TAUS Data Cloud was launched in 2008. Translation learns from data. In those early days there was indeed no better data than ‘more data’: the English-French Google machine translation engine was trained on a corpus of 100 billion words. Now, with the new generation of Neural MT, sheer volume of data belongs to the past. The pursuit of high-quality in-domain translation data will challenge the protectionists and create opportunities for pirates.
Either way, data have become an obsession in the translation industry. And it does not stop with translation memory data. We need speech data too. And we want the edits and annotations on human as well as machine translations, plus attributes for content type, industry sector, the translators’ locations, the process applied and the technology used. And why not correlate it all with weather reports, the translators’ social graphs and their eye-movement tracking? There is always something we can learn from new data.
The internet giants had a competitive edge in translation data, but they spoiled it by polluting their own fishing grounds with machine translations. Now the hunt is on for new data marketplaces. The European Commission is investing in the Connecting Europe Facility. But watch out also for the greenfield translation data ventures in China, or, perhaps closer to home, the TAUS Data Cloud.