Data entered the field of machine translation in the late 1980s and early 1990s, when researchers at IBM’s Thomas J. Watson Research Center reported successes with their statistical approach to machine translation.
Until then, machine translation worked more or less the way human translators do, with grammars, dictionaries and transfer rules as the main tools. These syntactic, rule-based Machine Translation (MT) engines appealed much more to the imagination of linguistically trained translators, while the new, purely data-driven MT engines with their probabilistic models made translation technology feel like an alien threat to many translators: not only because the quality of the output improved as more data were fed into the engines, but also because translators could not reproduce, or even conceive of, what really happened inside these machines.
The Google researchers who also adopted the statistical approach to MT published an article under the apt title “The Unreasonable Effectiveness of Data”. Sometimes even the statisticians themselves wondered why metrics went up or down, but one thing seemed consistently true: the more data, the better.
Around that same time, in 2008, TAUS and its members founded the Data Cloud: a cloud-based repository where anyone can upload their translation memories and earn credits to download data from other users. The objective of this data-sharing platform was to give more companies and organizations access to the good-quality translation data needed to train and improve their MT engines.
Now, as of September 2016, the TAUS Data Cloud contains more than 70 billion words in 2,300 language pairs. The hunger for data seems unstoppable. The European Commission launched an ambitious program under the name Connecting Europe Facility Automatic Translation (CEF.AT), aimed at collecting Translation Memory (TM) data between all 24 languages of the European Union (23 once Brexit is completed). The transition to the next generation of MT - Neural or Deep Learning MT - will not stop the hunt for data: Neural MT runs on data as much as Statistical MT (SMT) does.
Data is harvested in different ways. Besides sharing platforms such as the TAUS Data Cloud and CEF.AT, large companies with the proper tools and means have been scraping data from translated websites. And of course, as in many other private and business environments, users share data, consciously or unconsciously, when they use online services. In the translation industry this happens, for instance, when translators post-edit MT output on a cloud-based translation platform. The company that owns the platform, or the MT developer that provides the API to facilitate the MT service, receives the translation data and may use it to further improve the performance of the MT engines. However, these practices do raise concerns around copyright.
So much for using data for machine translation. As in other businesses and industries, data are marching into the translation sector to teach machines to make decisions and gradually take over human tasks, including on the management side. A good example of such a decision is choosing the best translator for a job. In a classic translation agency this decision is typically taken by a project manager who knows both customer and translator, which projects translators are working on, and whether translators are available. The project manager also manages the steps to assign a job.
For some new translation technology platforms, this is an example of a decision that is highly automated, not just by using a database in the backend, but also by constantly feeding new data into the platform and learning from it. Data useful for teaching machines to make good decisions here range from quality scores on previous jobs, response times, on-time deliveries, throughput and productivity to the specialties and preferences of the translator, and even the weather that day in the translator’s location.
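To make the idea concrete, here is a minimal sketch, in Python, of how such signals could be combined into a single ranking score for assigning a job. The field names, weights and domain boost are invented for illustration; a real platform would learn these values from its data rather than hard-code them.

```python
from dataclasses import dataclass

@dataclass
class TranslatorProfile:
    name: str
    quality_score: float       # average review score on past jobs, 0..1
    on_time_rate: float        # fraction of on-time deliveries, 0..1
    avg_response_hours: float  # average time to accept a job offer
    specialties: set           # domains the translator works in

# Hypothetical weights; a real platform would tune these from outcomes.
WEIGHTS = {"quality": 0.5, "on_time": 0.3, "responsiveness": 0.2}

def score(profile: TranslatorProfile, domain: str) -> float:
    """Combine historical signals into one ranking score for a job."""
    responsiveness = 1.0 / (1.0 + profile.avg_response_hours)
    base = (WEIGHTS["quality"] * profile.quality_score
            + WEIGHTS["on_time"] * profile.on_time_rate
            + WEIGHTS["responsiveness"] * responsiveness)
    # Boost translators who specialize in the job's domain.
    return base * (1.2 if domain in profile.specialties else 1.0)

def best_translator(profiles, domain):
    """Pick the highest-scoring candidate for the given domain."""
    return max(profiles, key=lambda p: score(p, domain))
```

The point of the sketch is that the "who gets this job" decision reduces to a function over logged data, which is exactly what makes it automatable.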
For example, TM-Town, a Japan-based start-up, is working on graphs, or profiles, of translators. Based on analyses of translations and translation memories uploaded by the translator, TM-Town can determine how good a translator is in a particular domain. Translated, in Rome, developed T-Rank, a fully automatic job-allocation tool. Straker Software and Gengo are two other examples of platforms that use data-driven algorithms to match projects with human resources.
The potential and future of the datafication of translation became apparent at the TAUS Executive Forum in June 2013 in Dublin. There, Aiman Copty, Vice President of Oracle’s Worldwide Product Translation Group, presented the business intelligence dashboard developed by Oracle’s engineers. This dashboard tracks the return on investment of the translation of each individual string of text in the user interface of Oracle software, for instance by tracking how often that string is read by a user of the software.
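As an illustration only (the record fields and numbers below are invented, not Oracle’s actual schema or data), a dashboard like this boils down to computing a return-on-investment proxy per string, such as how often a translated string is actually read per dollar spent translating it:

```python
# Hypothetical per-string records: translation cost and view counts.
strings = [
    {"id": "ui.save_button", "cost_usd": 1.50, "views": 120000},
    {"id": "ui.legal_note",  "cost_usd": 4.00, "views": 35},
]

def views_per_dollar(record):
    """A crude ROI proxy: views of the translated string per dollar
    of translation cost."""
    return record["views"] / record["cost_usd"]

# Rank strings by ROI; low scorers are candidates for cheaper
# treatment (e.g. raw MT) in future releases.
ranked = sorted(strings, key=views_per_dollar, reverse=True)
```

Even such a crude metric lets a buyer decide which strings deserve human translation and which can be left to cheaper processes.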
There is an abundance of data points in the translation process. Most translation technologies keep logs of the milliseconds spent on the translation of each segment, the number of keystrokes, the number of edits, the quality score, the time of day, the language pair, the content type, the technology used, and much more.
The questions that need answering now are: how do we aggregate these data, and how do we make sense of them in order to optimize and automate processes? The datafication trend in translation leads to dashboards, benchmarking and machine learning. More and more providers and buyers of translation will have dashboards where they visualize data to report on projects, benchmark their translation resources and technologies, and decipher trends. More advanced translation platforms use the data to develop algorithms and code them into their products to automate tasks such as finding resources, matching content types with the right tools and processes, and predicting quality. Translation companies will be looking for data specialists who can help mine the data and develop the algorithms that automate and optimize management processes.
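A minimal sketch of that aggregation step, assuming hypothetical per-segment log records like those a CAT tool might emit (the field names are invented for the example):

```python
from statistics import mean

# Hypothetical segment-level log records.
logs = [
    {"segment": 1, "ms": 5400, "keystrokes": 62, "edits": 3, "lang_pair": "en-de"},
    {"segment": 2, "ms": 2100, "keystrokes": 18, "edits": 1, "lang_pair": "en-de"},
    {"segment": 3, "ms": 8700, "keystrokes": 95, "edits": 5, "lang_pair": "en-fr"},
]

def aggregate(records, key="lang_pair"):
    """Group segment-level logs and compute per-group dashboard metrics."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return {
        group: {
            "segments": len(rs),
            "avg_seconds": round(mean(r["ms"] for r in rs) / 1000, 2),
            "avg_edits": round(mean(r["edits"] for r in rs), 2),
        }
        for group, rs in groups.items()
    }
```

The same grouping can be keyed on content type, vendor or tool, which is what turns raw logs into the benchmarking views a dashboard displays.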
No doubt data will become increasingly important in the translation process, but what if different data tell us different things? Language service providers and technology providers typically differentiate themselves in many ways to highlight their value to their customers: they use different terminology, metrics, definitions, quality levels, matching algorithms and segmentations. Therefore, for data to make sense and for machines to learn and be useful across the industry, or at least across different vendors, it is important for the industry to work together towards harmonization. Industry leaders at the TAUS Forum in Dublin in June 2016 raised the question: should we work together to learn more from our collective data?
An interesting initiative in this respect has already started with the creation of the TAUS DQF Enterprise User Group, in which companies like Microsoft, Lionbridge, eBay, Cisco, Welocalize, Intel, Oracle and AlphaCRC work together to agree on metrics for quality evaluation, quality levels, nomenclature, categories and counting. The objective of this collaboration is to produce comparable data sets that enable industry benchmarking. This will lead to business intelligence, which in turn will help us develop metrics and algorithms that can be used to teach machines to work for us across platforms and vendors.
The discussion about datafication in translation, and how we as an industry can work together to make sense of the data that we collect, will continue at the TAUS Annual Conference in Portland on October 24-25. If you would like to share your insights and experience with data in your business, please take a moment to respond to our survey. We will publish a report in November about the outcomes of the Portland debates and the survey. All respondents to the survey will receive a copy of the report.