The translation industry has entered a new era of datafication. The trend is unstoppable, and it means that people are paying much closer attention to their past work and legacy data. By continuously feeding machines with data and interacting with them through post-editing, manual data adjustment and annotation, people are producing smarter machine-operated systems. These include Machine Translation (MT) engines, terminology management platforms, translation memory tools, translation management software, and multilingual annotation systems.
Translation-related data can be used in many ways, and some of their applications go beyond the translation industry itself. At the TAUS Roundtable in Washington D.C. and the annual conference in San Jose, we saw the following areas in which translation data are widely used.
1. Translation-related data to support machine translation
Parallel data, monolingual data and post-editing data help train machine translation engines. Parallel data and target-language data are essential for the machine to learn alignment and translation rules and to produce natural-sounding output in the target language. Except for SYSTRAN's rule-based MT engine, most current MT engines are based on statistical analysis of data. Domain customization is becoming more and more important in the MT world.
From a developer’s perspective, it makes much more sense to take a top-down approach, i.e., to start from a very general topic to get some results first, and then move down to more specific domains and fine-tune those results. In-domain data help improve the quality of machine translation. As Chris Wendt, Microsoft’s Group Program Manager for Machine Translation (Microsoft Research), presented at the TAUS conference, Microsoft Translator customers achieve an average increase of 10 BLEU points through domain customization. That is a huge gain for machine translation, and it clearly shows the power of customization.
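To make the notion of "BLEU points" concrete, here is a minimal sketch of a simplified corpus-level BLEU score on a 0-100 scale. It assumes whitespace tokenization, a single reference per segment and no smoothing; the function names and the sample sentences are illustrative assumptions, not data from the talk. Real evaluations normally rely on a standard tool such as sacreBLEU rather than hand-written code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per segment."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical example: output of a generic engine vs. a customized engine.
references = [
    "the patient should take the tablet twice a day",
    "store the medicine in a cool dry place",
]
generic_output = [
    "the patient must take the pill two times a day",
    "keep the medicine in a cool dry place",
]
customized_output = [
    "the patient should take the tablet two times a day",
    "store the medicine in a cool dry place",
]

baseline = corpus_bleu(generic_output, references)
customized = corpus_bleu(customized_output, references)
print(f"generic: {baseline:.1f}  customized: {customized:.1f}  "
      f"gain: {customized - baseline:.1f} BLEU points")
```

The gain printed at the end is the kind of difference the "10 BLEU points" figure refers to: the same test set scored before and after customization.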
Domain-customized dictionaries, bilingual parallel data and monolingual target-language data can produce different, though not necessarily better, translation results. Some MT engines, such as TILDE Translator, allow users to upload, share and use domain-specific translation memory to customize machine translation results. In other words, users can customize the machine translation engine by feeding it different data. This trend also reveals a different relationship between people and machines. People used to have almost no control over the machine translation process beyond inputting or uploading the source texts. The new build-your-own-MT model gives more power to human users, who are starting to realize the importance of collecting data such as translation memories, termbases and monolingual data, and to adjust what they feed the machine so as to control machine translation results to some degree.
2. Translation-related data to support business decision-making
Translation-related data can also be used for business intelligence and can help marketers solve business problems. Businesses collect, use and analyze data in order to make more informed decisions. At the TAUS conference, the panel discussion moderated by Sergio Pelino (Google) presented opinions and experiences in this regard from the buyer side, Salvo Giammarresi (PayPal) and Andrea Siciliano (Google); from the supplier side, Doug Knoll (Welocalize); and from Henry Wang of UTH International, a company that purchases data as a commodity.
The panelists agreed that translation-related data are useful in business management and help provide insight for strategic business decisions. Analyzing data and examples can help prevent fraud, recommend items to shoppers, predict emergency room waiting times, and improve localization performance. Language service providers can also use the data to select the right translators at the right time and under the right conditions. The data can also help marketing people anticipate customer expectations. In some cases, translation-related data can help users segment content and classify language combinations. For example, data often do not work evenly across languages: some data work well in some languages but not in harder languages, in other words, tier 2 languages.
Here again, domain is an important parameter. In-domain translation-related data could be more useful in the aforementioned areas because these data are more targeted and relevant to the purposes of data analytics.
Anything that generates value is likely to have a cost. What, then, is the cost of data? What data are valuable to a company? What is the pricing strategy for this special soft product? This area is wide open and needs more discussion.
3. Translation-related data to support predictions
Translation-related data can help people make predictions. As Smith Yewell, founder and CEO of Welocalize, presented at the TAUS Roundtable in Washington D.C., most of the quality and on-time delivery problems in the localization industry are discovered after the fact, and too few localization programs are linked to measurable business outcomes. Our industry needs to introduce predictive analytics engines, just as many other industries have done.
There are many areas where data can help us make predictions. For example, we can analyze a translator's translation patterns and predict quality issues related to that person. At the TAUS Insider Innovation Excellence Awards, Olga Beregovaya from Welocalize demonstrated StyleScorer, a new tool they developed that can help automate stylistic analysis and stylistic prediction. In addition, as mentioned earlier, translation-related data can be used to predict items that shoppers may like and recommend them, and to help marketing people anticipate customer expectations. Uber-like integrated translation management systems will allow us to predict translation needs, translators’ availability, the right domains for each translator, an individual translator’s rate, and translation quality issues. These are all promising areas for us to continue to explore.
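As a rough illustration of predicting quality issues from a translator's track record, the sketch below uses the simplest possible baseline: the predicted error rate for a translator in a given domain is the mean of that translator's past error rates in that domain. The data, field layout and function names are all hypothetical, and a production predictive analytics engine would use far richer signals than this.

```python
from statistics import mean

# Hypothetical QA history: (translator id, domain, errors per 1,000 words).
history = [
    ("T1", "legal", 4.0),
    ("T1", "legal", 6.0),
    ("T1", "medical", 12.0),
    ("T2", "medical", 3.0),
    ("T2", "medical", 5.0),
]


def predicted_error_rate(history, translator, domain):
    """Naive baseline: mean of past error rates for this translator and domain.

    Returns None when there is no history to base a prediction on.
    """
    past = [rate for t, d, rate in history if t == translator and d == domain]
    return mean(past) if past else None


def pick_translator(history, candidates, domain):
    """Pick the candidate with the lowest predicted error rate in the domain."""
    scored = [(predicted_error_rate(history, t, domain), t) for t in candidates]
    scored = [(rate, t) for rate, t in scored if rate is not None]
    return min(scored)[1] if scored else None


for translator in ("T1", "T2"):
    print(translator, "medical:", predicted_error_rate(history, translator, "medical"))
print("best for medical:", pick_translator(history, ["T1", "T2"], "medical"))
```

Even a baseline this simple shows why domain matters: the same translator can have very different predicted quality in different domains, which is the kind of signal a system could use to match translators to jobs.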
4. Closing Remarks: Innovation through collaboration
The language industry is undergoing a huge revolution, in which data collection and data analytics play a very important role. Translation-related data have been widely applied in many areas beyond the translation industry itself. Many disciplines are involved in this process, including translation, corpus linguistics, computational linguistics, business management and neurology. Interdisciplinary collaboration is inevitable, and it will help realize smarter data applications in the translation and localization market.
As an educator and researcher, I have seen quite a few applications of theoretical frameworks and research findings in the industry, and I see great potential for more. On the one hand, there are many more research findings and theories that the industry has not fully utilized, for example, discourse analysis theories, comparable translation-driven corpora, and textual and contextual anticipation. On the other hand, academia should listen to the market and tailor its research to market needs. Together, industry and academia can promote innovation and push the profession forward.
Note: This article is mainly based on input from the following events:
Datafication of translation and Let Google and Microsoft run with it, covering:
- Washington: Chris Wendt’s session
- Washington: Smith Yewell’s session
- AC: Sergio Pelino’s session
- AC: Olga Beregovaya’s session