This year’s meeting of the Association for Computational Linguistics (ACL) took place in South Korea. The long and tiresome flight from Amsterdam was repaid in full by the exotic beauty of Jeju Island and the outstanding ideas and experiments that were presented.
The list of companies sponsoring and supporting this academic conference is testament to its quality. This list includes Baidu, Google, Microsoft Research, and IBM Research, among others.
I was happy to learn about the great work going on around the world, particularly research that is questioning existing MT paradigms. Here’s a summary of a few of the talks that stood out.
Taking advantage of many evaluation metrics
MT optimization involves using automated metrics to help identify quality issues and then tuning (or optimizing) the engine based on this guidance. One of the pitfalls of optimization is over-reliance on one or two metrics. You should be aware that no automated metric (e.g. BLEU, TER) can be considered a robust indicator of absolute quality. For an overview of best practices for MT evaluation in the industry, please take a look at our recent reports on this topic.
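To make the metric discussion concrete, here is a toy sketch of modified n-gram precision, one ingredient of BLEU. It is deliberately simplified (single reference, one n-gram order, no brevity penalty), so it should be read as an illustration of how such metrics count surface overlap, not as a real implementation of BLEU.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    """Toy modified n-gram precision (one ingredient of BLEU).

    Counts hypothesis n-grams, clips each count by its count in the
    reference, and divides by the total number of hypothesis n-grams.
    Real BLEU combines several n-gram orders over a whole corpus and
    adds a brevity penalty; this sketch is for illustration only.
    """
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return clipped / total

print(ngram_precision("the cat sat on the mat",
                      "the cat is on the mat", n=1))  # -> 0.8333333333333334 (5 of 6 unigrams match)
```

The clipping step is what stops a hypothesis from scoring well by repeating a common reference word, and it is also a good example of why such counting metrics say nothing about adequacy or fluency on their own.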
One of the papers presented on the first day, by five Japanese researchers (K. Duh of NAIST, and K. Sudoh, X. Wu, H. Tsukada and M. Nagata, all of NTT), outlined a novel “multi-objective optimization method that avoids over-fitting to a single metric”.
In contrast to previous work, these researchers attempted to create a framework for optimization using many automated metrics. The aim is to incorporate all available knowledge about different aspects of translation quality that could be used for optimization. Their proposed MT tuning technique, which is based on a simple Pareto algorithm, seems to work, although it was not tested against human evaluation.
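The core idea of Pareto-based tuning can be illustrated with a small sketch. This is not the authors' actual algorithm; it just shows what it means for a tuning candidate to be Pareto-optimal across several metrics, using invented candidate names and scores (TER-style metrics, where lower is better, are inverted beforehand so that higher is always better).

```python
def pareto_front(candidates):
    """Return the candidates not dominated by any other candidate.

    Each candidate is (name, scores); scores is a tuple where higher is
    better for every metric. A candidate is dominated if some other
    candidate is >= on all metrics and strictly > on at least one.
    """
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        (name, scores)
        for name, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates if other != scores)
    ]

# Hypothetical tuning candidates scored on (BLEU, 1 - TER): higher is better on both.
candidates = [
    ("weights_a", (31.2, 0.58)),
    ("weights_b", (30.1, 0.61)),
    ("weights_c", (29.5, 0.57)),  # dominated by both a and b -> dropped
]
print(pareto_front(candidates))  # keeps weights_a and weights_b
```

Optimizing toward the whole front, rather than toward one scalar score, is what lets this style of tuning avoid over-fitting to any single metric.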
As with almost all research presented at ACL, there is a long way to go before potential use by industry. But the cumulative and selective nature of this approach is a good evolution.
User-defined domain-specific engines
Parallel in-domain data is an expensive and often essential resource when training an SMT system for a specific business purpose. A few studies show an alternate route, where a generic MT engine can be adapted to a restricted topic with the help of relatively cheap monolingual resources. One of these algorithms was presented at ACL by a group of Chinese researchers affiliated with Xiamen University, the Chinese Academy of Sciences and Baidu Inc. (Jinsong Su, Hua Wu, Haifeng Wang, Yidong Chen, Xiaodong Shi, Juailin Dong and Qun Liu).
Their idea is intuitive: a particular translation of an ambiguous word is more likely to occur in certain topic-specific contexts, so analyzing the results of topical analysis can help constrain the translation candidates and produce better results.
An example provided by the authors shows how the translation of the word “bank” can be disambiguated: in sentences about economics it is translated as a financial institution, while in sentences about geography it is translated as the sloping ground along the edge of a river, lake or sea.
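The “bank” example can be sketched as topic-conditioned translation selection. The topic labels, candidate translations and probabilities below are invented for illustration; the paper's actual model learns such preferences from monolingual in-domain data rather than from a hand-written table.

```python
# P(translation | source word, topic): hypothetical numbers for illustration.
lexicon = {
    ("bank", "economics"): {"financial institution": 0.9, "river bank": 0.1},
    ("bank", "geography"): {"financial institution": 0.2, "river bank": 0.8},
}

def translate(word, topic):
    """Pick the most probable translation candidate for the given topic."""
    candidates = lexicon[(word, topic)]
    return max(candidates, key=candidates.get)

print(translate("bank", "economics"))  # -> financial institution
print(translate("bank", "geography"))  # -> river bank
```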
The initial results of their technique showed a maximal gain of 1.2 BLEU points in experiments on a one-million-segment parallel Chinese-English out-of-domain corpus and two monolingual in-domain corpora (four million sentences each for Chinese and English).
Based on this example, if a Baidu user indicated that the topic of the content to be translated related to economics, then the ‘general purpose’ Baidu translation engine would behave like a domain specific engine and provide a better result.
On-the-fly domain-specific engines
Another paper describing a domain adaptation technique was presented by a group of researchers from the University of Maryland led by the legendary Philip Resnik, with co-authors Vladimir Eidelman and Jordan Boyd-Graber.
The concept of domain adaptation based on lexical weighting was initially proposed by D. Chiang (University of Southern California) in 2011. This group of researchers has extended his work: they dynamically bias an SMT system towards specific topic domains to provide more relevant translations.
The application of the dynamic adaptation algorithm described in their paper improves a full-scale system trained on 1.6 million segments by 3 TER points, a gain large enough to be noticeable. In other words, they adapted their system on the fly, without human intervention, to make it domain specific.
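The difference from the previous paper is that no human supplies the domain label: the system infers it from the input itself. A minimal sketch of that idea, with invented keywords and probabilities, is to estimate a topic distribution from the document and mix topic-conditional translation probabilities under it. The real work uses proper topic models and lexical-weighting features, not keyword counts.

```python
# Hypothetical topic keywords and topic-conditional probabilities,
# invented purely to illustrate the on-the-fly biasing idea.
topic_keywords = {
    "economics": {"loan", "interest", "deposit"},
    "geography": {"river", "shore", "water"},
}

lexicon = {
    ("bank", "economics"): {"financial institution": 0.9, "river bank": 0.1},
    ("bank", "geography"): {"financial institution": 0.2, "river bank": 0.8},
}

def topic_distribution(document_tokens):
    """Crude topic inference: normalized keyword counts per topic."""
    counts = {t: sum(tok in kw for tok in document_tokens)
              for t, kw in topic_keywords.items()}
    total = sum(counts.values()) or 1
    return {t: c / total for t, c in counts.items()}

def adapted_prob(word, translation, document_tokens, lexicon):
    """Mix P(translation | word, topic) over the inferred topic weights."""
    dist = topic_distribution(document_tokens)
    return sum(dist[t] * lexicon[(word, t)].get(translation, 0.0) for t in dist)

doc = "the river flooded the bank and the water rose".split()
print(adapted_prob("bank", "river bank", doc, lexicon))  # -> 0.8
```

Because the topic weights come from the input document alone, the same engine behaves like a geography engine here and like an economics engine on a finance text, with no human intervention.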
Google or Microsoft would no doubt love to implement such a feature if continued research proves the approach to be robust.
Deconstructing to the character level
Character-based translation has always seemed an exotic MT paradigm, applicable mainly to out-of-vocabulary words that the MT system has never encountered. But now it has a new life thanks to Graham Neubig (Nara Institute of Science and Technology (NAIST), Japan) and co-authors Taro Watanabe (NAIST), Shinsuke Mori and Tatsuya Kawahara (both of Kyoto University).
In this paper, the authors describe how they achieved ‘decent’ results without using words or phrases as basic elements of a translation system. Instead, MT is treated as a problem of string transformation.
Initially, the character-based approach to MT was inspired by the uncertainty of word segmentation when processing Asian languages: ambiguous segmentation affects translation performance and can contribute significantly to translation errors. Until now, character-based translation had only proved efficient for translating between closely related languages, like Spanish and Catalan, or Norwegian and Swedish.
In this paper, the authors support the hypothesis that character-based MT can also improve output quality when translating between languages that belong to different families.
Word alignment algorithms, a basic component of any SMT system, have received a stimulus in recent years. Research in the field of many-to-many alignment, i.e. aligning multi-word sequences on both sides of a translation, opened the way for new character-based algorithms to support SMT systems.
The potential benefits of bringing character information into the translation process are difficult to overestimate. It would allow us to model the internal structure of the tokens an MT system operates on, to handle many problems related to long-distance dependencies, and to correctly translate rarely seen (or even unseen) words. Put simply, it could yield significant improvements in translating between the most distant languages.
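The unseen-word benefit is easy to demonstrate. In the sketch below, with a tiny invented "training" vocabulary, a novel compound is out-of-vocabulary at the word level, yet every one of its character bigrams has been seen before, so a character-level model still has usable statistics for it.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a word (bigrams by default)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

# Tiny hypothetical training vocabulary, chosen for illustration.
train_words = {"rain", "inbox", "bow"}
train_bigrams = set().union(*(char_ngrams(w) for w in train_words))

unseen = "rainbow"
print(unseen in train_words)                 # -> False: OOV at the word level
print(char_ngrams(unseen) <= train_bigrams)  # -> True: every bigram was seen
```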
Their approach was tested on bi-directional English-Finnish and English-Japanese systems and showed results similar to a regular word-based SMT system. This promising approach can, and hopefully will, be further improved to become practically helpful in handling morphologically rich languages and other significant challenges in SMT.
This brief article covers just a small fraction of the insightful papers presented at ACL and other major NLP conferences every year.