Making MT Invisible: Amazon’s Daniel Marcu on the Ultimate Challenge

Daniel Marcu is an Applied Science Director at Amazon, as well as being a leading academic Machine Translation researcher. He was a founder at Language Weaver, eventually acquired by SDL, and a Director of Strategic Initiatives at the Information Sciences Institute, University of Southern California.

How do you evaluate the relative importance of data and algorithms in the machine translation mix?

My view has changed over the years. Fifteen years ago I would have said that data was more important. MT adopters didn’t do enough work on data, and the academic community was not focused on that specific problem either. Today, I would say that neither is more important than providing easy-to-use solutions that make MT invisible. For about five years now, MT quality for European languages such as French and Spanish has been pretty good, and with neural MT, the quality has become even higher. So the real question is not so much “is MT good enough?” but “how easy is it for people to use MT?”

MT is already used on a massive scale in e-commerce, enabling companies such as Amazon to make product content available in more markets than human translation alone could manage. But this is just the beginning: there are still many other areas where MT could play a role but doesn’t yet.

What are Amazon’s MT use cases?

In general, MT tends to be useful in content localization setups such as retail and inventory, which helps Amazon replicate the US shopping experience across the globe.

But inside Amazon, there are multiple business units that all depend on a global business ecosystem and need to translate documents and data streams and engage better with customers via multilingual content communication and customer support. Amazon Translate also provides services to anyone with an AWS account and this covers a very broad range of applications. Translation is also used in the Alexa ecosystem, and all these capabilities are evolving across the board.

In the translation landscape, there are now tech giants competing against both LSPs and specialized pureplays such as Systran. Is there a specific role for everyone, or will the competition heat up?

One of the benefits of working for Amazon is that you don’t worry about such things; you just try to make your customers happy! Generally speaking, though, I think that the winning applications and solutions will address a broad range of market segments, which may be served better by different providers.

One example: the MT systems we currently provide through Amazon Translate are pretty awesome in terms of quality and latency. If you are a developer and want to use the service for anything from analytics to localization, you are getting an excellent service.

But here’s a counter-example: I went trekking in Chile but don’t speak Spanish, so I tried using a phone translation app. The translation quality was high, but I simply could not use the app to interact with people who were not in tune with the latest technology. All the parts in the chain (speech recognition, MT, and speech synthesis) worked very well, but it was impossible to have a real conversation using that app with people who did not speak any English. Again and again, it was impossible to establish a protocol for how we should use the app, so I resorted to using my hands and bad Spanish!

How do you handle quality evaluation at Amazon?

There is a considerable lack of standards and deep understanding in the field of translation evaluation. At Amazon, we evaluate in two ways. First, for day-to-day activities, improvements in BLEU scores help validate whether a given idea might prove to be useful; this works reasonably well as long as one operates inside one R&D framework. It is still a useful way to evaluate candidate improvements rapidly, but the results are only indicative.
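For readers unfamiliar with BLEU, the kind of comparison described above can be sketched roughly as follows. This is a simplified illustration, not Amazon's evaluation code. It assumes whitespace tokenization, a single reference per sentence, and no smoothing. The function names (`ngrams`, `corpus_bleu`) and the example sentences are made up for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Hypothetical data: one reference and the output of two system variants.
references = ["the cat sat on the mat"]
baseline_output = ["the cat sat on a mat"]
candidate_output = ["the cat sat on the mat"]

baseline_score = corpus_bleu(baseline_output, references)
candidate_score = corpus_bleu(candidate_output, references)
print(f"baseline BLEU:  {baseline_score:.3f}")
print(f"candidate BLEU: {candidate_score:.3f}")
```

A higher score for the candidate only suggests that the change might help. Scores computed with different tokenization, test sets, or references are not comparable, which is one reason such results are only indicative outside a single R&D setup.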

Second, we also do human evaluations, both using a general MT quality scale and looking in depth at the errors of our systems. We then develop algorithms to target these errors and try to eliminate them. This process works very well for internal evaluations.

For evaluating entire solutions, e.g. MT on a retail website, you try to assess whether customers accessing information in their own language helps them make more informed buying decisions. In customer support cases, where a support agent interacts with a customer speaking a different language via MT, you measure how effective these interactions are.

However, there are significant issues at stake. Trying to provide evaluation assessments is a very tricky business and globally the field is still trying to figure out how to provide meaningful feedback to different markets. We naturally rely on automatic metrics and we know that BLEU scores correlate with quality within a given R&D setup. But when you compare different systems, and even within a single system, you know that automatic evaluations are not always going to be meaningful.

This is not saying that MT is no good – as with all new technologies, people extrapolate from small examples and make sweeping statements about the adequacy of their solutions. The trickiest part of quality evaluation is the fact that what matters is not your BLEU score or metrics, or even human evaluation in an offline environment; what fundamentally matters is the impact of the technology in a specific situation, as my example of speech-to-speech translation in Chile shows. The speech tech and the MT both performed very well if you measure the error rate. But I still could not use the application. My point is that we need to be responsible about how we characterize these applications for the media and speak as an industry about their success rates.

Does the same go for claims of human-machine parity in translation?

This is proof of how far the technology has come. Ten years ago it was already possible to create a test environment where you could claim machines and humans were at the same level of translation accuracy. In the automobile industry, for example, which is highly structured and has been around for 100 years, the content creation and content translation processes are highly controlled. Due to the formulaic language used in certain documents, it was possible to build a statistical MT system that compared very favorably in quality with a human translator. Today, with neural MT, one can show, for example, that MT systems translate news streams as well as humans when one evaluates translation quality by looking at a set of independent sentences produced by a select set of news providers; however, that claim does not hold when one evaluates translations of full documents or considers unconventional news providers.

The field has come a very long way, but I don’t think we are yet at the point where we can claim that MT is as good as humans. In other words, parity statements are typically valid only under very specific experimental conditions that are often forgotten when reported by the media, and we are not even close to making statements about human-machine parity in general, across all languages. Once again, this issue is a moving target and we need to be careful what we claim.

Which challenges do you expect to be addressed in the near future?

We need higher quality systems that translate with high accuracy across the board. And we also need systems that are more robust – one of the drawbacks of neural MT is that small changes in the source can lead to big changes in the translation. We need to build MT systems that are more aware of context. The same source phrase may be translated in three different, valid ways, depending on the context. And we need to build solutions in which MT is invisible.

Remember that neural techniques have been around for some 30 years, but initially the community did not see any meaningful impact from these techniques, as the systems needed large quantities of data for training and much higher compute power. The fundamental change has been in the ways words are represented: we used to look at words as unidimensional data points and computer systems were very poor at understanding relations between words. The NLP research community should have realized sooner that in moving from a unidimensional to a multidimensional representation of words, i.e., by replacing words with vectors, one can reason better than with words as data points.
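The contrast between words as isolated data points and words as vectors can be illustrated with a small sketch. It rests on two assumptions that do not come from the interview: isolated symbols are modeled as one-hot vectors, and the dense vectors are hand-picked toy values, not learned embeddings. Cosine similarity is used here as one common way to compare vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = ["cat", "dog", "car"]

# Words as isolated symbols (assumption: modeled as one-hot vectors).
# Every pair of distinct words looks equally unrelated.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Words as dense vectors. These values are invented for illustration;
# real systems learn them from large amounts of text.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

for name, table in (("one-hot", one_hot), ("dense", dense)):
    cat_dog = cosine(table["cat"], table["dog"])
    cat_car = cosine(table["cat"], table["car"])
    print(f"{name}: cat~dog={cat_dog:.2f}, cat~car={cat_car:.2f}")
```

With the one-hot representation no relation between words is visible, while the dense toy vectors place "cat" closer to "dog" than to "car", which is the kind of relation a system can then reason with.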

We are still figuring out how to learn increasingly useful vector-based representations and how to reason with them in support of many applications. This is certainly going to keep us busy for a few more years...
