That is the short and easy answer an SMT developer will give when you ask: "What can we do to improve the quality of the machine translation engine?"
But things are not always that easy in the world of Statistical Machine Translation (SMT). Even insiders are sometimes puzzled by the effects of training data on SMT engines. It's time to bring some clarity to this obscure and complex area of the translation industry. According to the current TAUS market survey, more than 50% of the respondents expect to be using machine translation in their translation operations within the next two years. More than 50% of the respondents also expect to share language data with industry partners in order to build data sets large enough to train MT engines more effectively.
In a few weeks, the TAUS Data Association will launch release 1 of its language data exchange portal. The repository already contains around half a billion words in 45 language pairs. Access to large volumes of data will soon no longer be the showstopper. Access to SMT engines is not a problem either: in addition to commercial systems such as Language Weaver and Asia Online, there are many research systems, and there is the open source Moses SMT engine. No, the key is really how to bring the two together, in other words how to get the best results out of the combination of the data and the engine.
TAUS asked Tom Hoar, a veteran of the world of speech technology and an expert in SMT training, to write a concise guide to cleaning and preparing data for the training of SMT engines. This TAUS Technical Guide is intended for anyone faced with preparing translation training data for statistical machine translation. It examines data preparation processes, which are the catalysts that enable data and algorithms to work in unison. It explores how to define an organization's training data strategy to match the overall system design, identifies potential data sources, introduces the challenges of merging multiple corpora to create large data sets, and explores several methods for turning translation memories into SMT training data.
At the same time, Asia Online finished its report on an SMT pilot project that it offered to run for some of the TAUS members who decided to share their translation memories in order to assess the potential benefits.
Jumping to one conclusion: yes, data sharing is beneficial, but it is equally (if not more) important to have ‘clean’ data. And wouldn't it be great if we all used the same terminology when we refer to the same thing? These are things we are learning and putting into practice. In this latest research, TAUS is getting technical and practical for our members who want to get their hands ‘dirty’.
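To give a flavor of what ‘clean’ data can mean in practice, here is a minimal Python sketch of a few common-sense checks on segment pairs taken from merged translation memories. It is not taken from the Technical Guide or from any particular SMT toolkit: the function name `clean_segment_pairs`, the `max_length_ratio` parameter and its default of 3.0 are all hypothetical, chosen only for illustration.

```python
def clean_segment_pairs(pairs, max_length_ratio=3.0):
    """Return a cleaned list of (source, target) segment pairs.

    Illustrative only: the checks and the threshold are assumptions,
    not the method described in the TAUS Technical Guide.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        # Collapse runs of whitespace and trim both ends.
        source = " ".join(source.split())
        target = " ".join(target.split())

        # Drop pairs where either side is empty.
        if not source or not target:
            continue

        # Drop pairs whose word counts differ too much, since they
        # are often misaligned segments.
        src_len = len(source.split())
        tgt_len = len(target.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue

        # Drop exact duplicates, which are common after merging corpora.
        key = (source, target)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


if __name__ == "__main__":
    sample = [
        ("Hello  world", "Bonjour le monde"),
        ("Hello world", "Bonjour le monde"),  # duplicate after normalization
        ("", "Texte sans source"),            # empty source side
        ("Yes", "Ceci est une phrase beaucoup trop longue"),  # length mismatch
    ]
    for src, tgt in clean_segment_pairs(sample):
        print(src, "|||", tgt)
```

Real pipelines typically add further steps, such as tokenization, tag stripping and terminology normalization, and the right thresholds depend on the language pair.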
TAUS MEMBER REPORT
Technical Guide to SMT Training Data
Author: Tom Hoar