Not so coincidentally, an article with this telling title landed in my inbox on the very day we announced release 1 of the TDA language data exchange portal. "The unreasonable effectiveness of data" is an article written by three researchers at Google* and published by the IEEE Computer Society in their March/April journal. Why is it, the authors ask, that physics can be explained so neatly with simple mathematical formulas, while economics fails to model human behavior and grammar is suffocated by hundreds of rules and just as many exceptions?
Why is it that language can never be explained with the elegance of physics equations?
In the past ten years, linguistic theories have evolved from Chomsky's grammatical declaration of language to a statistical approach that relies on learning patterns and taxonomies of families of constructions. The Google researchers represent the latest school of thought: forget trying to come up with elegant theories and embrace the unreasonable effectiveness of data. In support of their hypothesis they refer to the trillion-word corpus of the English language that Google made available to the research community in 2006. Even though this corpus contains a variety of errors and incomplete sentences, it is much more valuable than the carefully corrected, grammatically annotated one-million-word corpus the authors were trained on in their student days, simply because it is a million times larger.
Google demonstrates the value of lots of data for the development of machine translation engines. The Google Translate service performs surprisingly well, according to many users. The TAUS Data Association (TDA) was established by its members to leverage the value of the translation memories they have accumulated over the years. By sharing translated data in an industry-owned repository, categorized by industry, members will boost translator productivity and improve the quality of domain-specific machine translation engines. The world's 150,000 professional translators produce around 300 million words of good-quality parallel text every day, or 75 billion words a year (a figure that implies roughly 250 working days). This is the material that TDA will store and use to help the world communicate better.
I recommend the Google authors' article to anyone who wants to dig deeper into the why and how of TDA. We share the vision but execute differently. TDA works with trusted translations only and classifies data by owner, industry, domain and content type, to give members who pool data within their industry a significant boost in machine translation quality. This classification will also support cross-lingual, industry-specific taxonomies and search capabilities. TDA respects IP rights and is owned by its members, who maintain the data and watch over its quality.
* Thanks to Alon Halevy, Peter Norvig, and Fernando Pereira for a great article.