This article originally appeared in TAUS Review #3 in April 2015
Google Translate is the world’s best-known free tool for machine translation. It is made possible by Google’s huge trove of data, and the statistical techniques that match n-grams in one language with plausible n-grams in another. For an outsider to the translation industry like me, Google Translate seemed to represent a great leap forward in translation quality when it was first introduced. Since then, however, its quality improvements seem more incremental, when they are visible at all. How did Google Translate get so good? And how can it avoid plateauing in quality, and get better still?
One of the bright sides of being a journalist is that when you have questions like this, you can just call the people who know the most and ask them. Google’s press team responded to my email with an offer to talk to Macduff Hughes, the engineering director for Google Translate.
First, where did Google get all of its data? It crawls and saves text from about a trillion web pages. But how does it know what is human-translated text to run its statistical learning algorithms on? I had thought that perhaps humans cull and code the texts to be fed into the engine.
But Hughes explained that the search engine simply looks for pages that look like they might be translations of one another. Perhaps they share a domain, except that one URL ends in /en and the other ends in /fr. Perhaps they have proper names or identical numbers in the same positions. The software does not weight a pairing as more or less likely to be a translation; it is an either-or binary decision, in or out.
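The signals Hughes describes can be sketched as a simple in-or-out classifier. The URL patterns, language codes, and number-matching rule below are illustrative guesses at such a heuristic, not Google's actual pipeline:

```python
import re
from urllib.parse import urlparse

# Illustrative language codes; a real system would use a full list.
LANG_CODES = {"en", "fr", "de", "es", "ru"}

def looks_like_translation_pair(url_a, url_b, text_a, text_b):
    """Binary decision: do these two pages look like translations
    of one another? In or out, no probability weighting."""
    pa, pb = urlparse(url_a), urlparse(url_b)
    # Signal 1: same domain, paths identical except for a language code
    # (e.g. /en/about vs /fr/about).
    if pa.netloc == pb.netloc:
        seg_a = [s for s in pa.path.split("/") if s]
        seg_b = [s for s in pb.path.split("/") if s]
        diff = [(x, y) for x, y in zip(seg_a, seg_b) if x != y]
        if len(seg_a) == len(seg_b) and len(diff) == 1:
            x, y = diff[0]
            if x in LANG_CODES and y in LANG_CODES:
                return True
    # Signal 2: identical numbers in the same positions in both texts.
    nums_a = re.findall(r"\d+", text_a)
    nums_b = re.findall(r"\d+", text_b)
    if nums_a and nums_a == nums_b:
        return True
    return False
```

A real crawler would combine many more signals, but the all-or-nothing shape of the decision is the point here.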
How did it get so good? The initial leap in quality came from sheer mass. A 2009 paper by three Google researchers responded to the “physics envy” that students of human phenomena feel. A classic 1960 paper had been titled “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”, extolling the power of formulae like f=ma. Linguistics has no such formula. But the Google researchers retorted by calling their 2009 paper “The Unreasonable Effectiveness of Data.”
The Google philosophy is that a simple approach applied to a huge trove of data beats a clever approach applied to limited data. With so much data, errors will, it is hoped, cancel each other out in the enormous aggregate.
In addition to all that unmarked, untagged messy data, Google does get some specialty data from professional translators: the European Patent Office shares data with Google, for example, though Hughes says that this EPO data (despite its high quality) does not currently have any special weight in the public-facing Google Translate. He notes, sensibly enough, that many people use Google Translate for slangy or spoken-language purposes, for which giving too much weight to the kind of language in a patent application would be less than ideal.
But even Google has limits on what enormous amounts of data can do. There are thousands of potential language pairings across the several dozen languages Google Translate offers. But for the vast majority of those pairings (Finnish-Zulu, say), there is little or no training text available, even on a trillion web pages. So the user hoping to translate Finnish to Zulu on Google Translate will be going through a “bridging” language, almost certainly English.
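The "thousands of pairings" figure falls out of simple combinatorics. Assuming roughly 90 supported languages (a plausible 2015-era figure, not stated in the article), the count of unordered pairings is:

```python
# With n mutually translatable languages there are n*(n-1)/2
# unordered pairings. n = 90 is an assumed, illustrative figure.
n = 90
pairs = n * (n - 1) // 2
print(pairs)  # -> 4005
```

Only a small fraction of those thousands of pairs have enough parallel text to train on directly, which is why most traffic routes through a bridge.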
This of course magnifies the possibilities for error. Asya Pereltsvaig, who teaches linguistics at Stanford, caught Google Translate translating a Russian nursery rhyme with “two happy geese” into French and getting deux oies gay—two homosexual geese. The culprit was, of course, the double-meaning of “gay” in English, the bridging language between Russian and French.
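The geese mishap can be reduced to a toy word-for-word pivot. The vocabulary tables below are illustrative stand-ins, not real Google Translate data; the point is that the Russian sense "merry" is lost the moment a single English bridge word is chosen:

```python
# Toy pivot translation: Russian -> English -> French, word by word.
ru_to_en = {"два": "two", "весёлых": "gay", "гуся": "geese"}  # весёлых = "merry"
en_to_fr = {"two": "deux", "gay": "gay", "geese": "oies"}     # fr "gay" = homosexual

def pivot(words, first_leg, second_leg):
    # Each source word passes through one English gloss; any ambiguity
    # in the bridge word is baked into the final output.
    return [second_leg[first_leg[w]] for w in words]

print(pivot(["два", "весёлых", "гуся"], ru_to_en, en_to_fr))
# -> ['deux', 'gay', 'oies']
```

Real systems pivot over phrases with probabilities rather than single words, but the ambiguity of the bridge language leaks through in just this way.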
This leads to another problem. Pereltsvaig has translated this phrase with Google Translate, however badly. The dud translation now lives on the web, where it will be crawled by Google—and could be fed back into Google Translate. What if the service is, to put it crudely, consuming its own waste?
Hughes acknowledges the problem frankly. Google has tried electronically “watermarking” its translations so the crawler will recognize them and try to avoid feeding mistakes back into the system as input. And then there are web pages that simply have the same text in—suspiciously—all of the languages Google Translate offers. The system can guess that these were translated by Google and avoid feeding them back into the system.
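The article does not say how the watermarking works, but the concept can be illustrated with a minimal sketch: tag machine output with an invisible signature so a crawler can later recognize and skip it. The zero-width-character scheme here is purely hypothetical:

```python
# Conceptual sketch of text "watermarking", NOT Google's actual scheme:
# append zero-width Unicode characters as an invisible signature.
ZW_MARK = "\u200b\u200c\u200b"  # arbitrary zero-width signature

def watermark(translation: str) -> str:
    """Tag a machine translation before publishing it."""
    return translation + ZW_MARK

def is_machine_translated(text: str) -> bool:
    """Crawler-side check: skip text carrying the signature."""
    return text.endswith(ZW_MARK)
```

The fragility is obvious: the signature disappears as soon as anyone copies the text through a tool that strips unusual characters, which is why the redundant "same text in every language" check is useful as a backstop.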
Would more data help an organization that already has so much? Would ten trillion pages be noticeably better than one trillion? Hughes is again frank: for the most common language pairings, “we have reached about the limit where more data is helpful.”
His efforts have now turned to making Google Translate smarter, experimenting with rule-based improvements to see if they improve quality. In other words, if Google Translate’s first great leap forward came from huge data and computing power, its next leap forward, for big languages at least, will rely more on clever software engineering. For example, automatic parsing can improve word order in translations.
And he mentions neural networks as a particularly exciting avenue for research; this, after all, has been particularly helpful in Google’s speech recognition.
But there is another avenue: the great software company is asking good old-fashioned human users to chip in their expertise. If you are a frequent user of Google Translate, you will probably have noticed the “Help Improve Google Translate” link at the bottom of the page. These user-driven efforts pack a particularly heavy punch for languages where data is sparse and users are keen volunteers.
A titan of data like Google is smart enough to know the limits of data. Hughes hopes that some (undiscussed) radical breakthroughs might yet lead to a sudden leap forward in Google Translate’s quality. But even absent that, he hopes cycles of data gathering and incremental innovation will gradually move the needle on quality. And the wisdom of crowds—Google’s users—could move it further still.