This tutorial will equip you with the know-how to make informed decisions about implementing the Moses open source MT solution. It covers the full cycle, from data preparation through engine tuning to integration with your localization workflow.
The intended audiences are:
- Localization engineers at language service providers
- Software engineers at translation buyers
- Software developers new to machine translation
- Students of machine translation
Although the tutorial focuses on Moses, some modules, such as those on the principles of SMT, data, and MT evaluation, are useful even if you buy a vendor solution and choose not to implement Moses yourself.
To follow the whole tutorial you need a basic knowledge of Linux/Unix administration. You must be able to operate command line tools.
After completing this tutorial you will:
- Have a basic understanding of the principles of machine translation
- Have an overview of available open source tools for MT
- Understand the capabilities and benefits of the Moses open-source MT system
- Understand the options to obtain and deploy Moses
- Understand how a Moses MT system can be trained and optimized using available training data
- Understand how a trained Moses MT system can be integrated into an existing localization workflow and estimate the effort required
- Be able to compare Moses to other available MT systems
- Understand how to get support for Moses and how to contribute to the OSS project
This tutorial is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
Principles of Machine Translation
This presentation provides a brief overview of the history of machine translation and the approaches that were developed during that history. It then focuses on statistical machine translation including its different flavors, the process of training an SMT system with training data and the decoding process to perform translations.
Presentation by Barry Haddow, University of Edinburgh, UK | Duration: 14:29
Data Types and Sources
Training data is the essential ingredient for statistical MT systems. This presentation describes parallel and monolingual data, where to obtain it, and how to combine and select data to achieve the highest quality MT output.
Presentation by Maxim Khalilov, TAUS Labs | Duration: 08:20
Data Conversion and Corpus Preparation
This presentation and screencast describe the required training data format for the Moses SMT system and show how to convert data into this format. They also show how to align text from translated documents and how to convert TMX files to source more data for SMT training.
Demonstration by Achim Ruopp, TAUS Labs | Duration: 11:27
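As a rough illustration of the TMX conversion step described above, the Python sketch below pulls source and target segments out of a TMX document and returns them as two parallel lists with one segment per entry. This mirrors the sentence-aligned, one-sentence-per-line layout used for SMT training corpora. The function name `tmx_to_parallel`, the sample document and the exact-match handling of language codes are assumptions made for this example; they are not taken from the screencast.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_parallel(tmx_text, src_lang, tgt_lang):
    """Return (source_lines, target_lines) extracted from a TMX string."""
    root = ET.fromstring(tmx_text)
    src_lines, tgt_lines = [], []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Look for xml:lang first, then fall back to a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Join all text inside <seg> and collapse whitespace so that
                # each segment stays on a single line.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        src = segs.get(src_lang.lower())
        tgt = segs.get(tgt_lang.lower())
        # Keep only translation units that have both languages.
        if src and tgt:
            src_lines.append(src)
            tgt_lines.append(tgt)
    return src_lines, tgt_lines


# Hypothetical sample document, reduced to the elements the sketch reads.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Untranslated segment.</seg></tuv>
    </tu>
  </body>
</tmx>"""

source_lines, target_lines = tmx_to_parallel(SAMPLE_TMX, "en", "fr")
```

Writing the two lists to a pair of text files (for example `corpus.en` and `corpus.fr`, hypothetical names) gives a corpus in which line N of one file is the translation of line N of the other.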
Data Cleaning and Tokenization
Once data is converted into the right format, it needs to be tokenized and cleaned before it can be used to train an SMT system. This presentation explains tokenization and word segmentation for East Asian languages and outlines cleaning options for SMT training data, used by many MT vendors.
The presentation provides guidance on which data cleaning to apply and how to apply it to obtain the best quality MT system. For some languages it is beneficial to add linguistic information to the SMT system. This is also described.
Presentation and demonstration by Achim Ruopp, TAUS Labs | Duration: 10:30 and 09:14
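To make the cleaning step more concrete, here is a minimal Python sketch of one common kind of filter for tokenized parallel data: it drops sentence pairs that are empty, too long, or badly mismatched in length. The function name `clean_parallel` and the threshold values are illustrative assumptions, not settings recommended in the presentation, and whitespace splitting stands in for a real tokenizer.

```python
def clean_parallel(pairs, max_len=80, max_ratio=9.0):
    """Keep (source, target) pairs that pass simple length checks.

    The thresholds are illustrative defaults, not values taken from
    the presentation.
    """
    kept = []
    for src, tgt in pairs:
        # Assumes the text is already tokenized, so tokens are space-separated.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is empty.
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop pairs where either side is longer than max_len tokens.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ by more than max_ratio,
        # since these are often misaligned.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


# Hypothetical sample data for the sketch.
sample_pairs = [
    ("hello world", "bonjour le monde"),
    ("", "vide"),
    ("a", "b c d e f g h i j k"),
    ("ok", "d'accord"),
]
cleaned = clean_parallel(sample_pairs)
```

Filtering both sides together matters: removing a line from only one file would shift every later sentence out of alignment.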
This presentation gives an overview of the Moses machine translation system and its associated components, and explains how to obtain and run the system. It also describes the history of Moses and the larger open-source Moses ecosystem, including the development process, support and opportunities to contribute.
Presentation by Barry Haddow, University of Edinburgh, UK | Duration: 20:54
Training a Moses MT System
This screencast shows how to train a small Moses SMT system with the training data prepared in earlier screencasts, how to tune the trained system using a tuning set and finally how to perform translations with the trained system.
Demonstration by Achim Ruopp, TAUS Labs | Duration: 10:34
Bulk Translation and MT System Optimization
This screencast uses the Moses SMT system trained earlier to bulk translate a set of test data, for which the BLEU score is calculated based on the available reference translations. In the second part of the screencast, the trained Moses system is optimized to reduce memory use and increase translation speed.
Demonstration by Achim Ruopp, TAUS Labs | Duration: 06:35
This presentation contrasts automated and human evaluation of machine translation output. It explains how automated evaluation is useful in the development of MT systems and then describes the automated metrics BLEU, TER, GTM and Meteor.
Presentation by Maxim Khalilov, TAUS Labs | Duration: 12:10
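For readers who want to see what an automated metric computes, the sketch below implements a simplified corpus-level BLEU in Python: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes one reference per sentence, whitespace-tokenized input and no smoothing, so its scores will not necessarily match those of established scoring tools. The function names are chosen for this example only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clipping: an n-gram is credited at most as many times
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score when the output is shorter
    # than the reference.
    if hyp_len > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Hypothetical one-sentence test set.
score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
```

Because the counts are accumulated over the whole test set before the precisions are computed, this is a corpus-level score rather than an average of sentence-level scores.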
This presentation describes different strategies for human evaluation of MT output, how to use them in error analysis to improve MT systems, and how to apply them in an industry setting to achieve the desired project goals.
Presentation by Maxim Khalilov, TAUS Labs | Duration: 08:51
Document Translation and Integration Scenarios
Translation of complex document formats is common in the language industry. This presentation explains how the Okapi Framework and the Moses for Localization open source project can be used to translate these file formats using machine translation. We also address how to translate web pages with Moses and how to integrate Moses MT systems into content management or translation workflows using available web APIs.
Presentation by Achim Ruopp, TAUS Labs | Duration: 09:55
Document Translation and Web API Demo
Previous demos showed how to translate single sentences and collections of sentences. This demo shows how to translate complex document formats using a combination of the Okapi Framework, Moses for Localization and Moses. The second half demonstrates the use of two web APIs that are available for Moses: the Moses Server XML-RPC API and the Moses for Localization REST API.
Demonstration by Achim Ruopp, TAUS Labs | Duration: 05:22
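As a hedged sketch of what a client for the Moses Server XML-RPC API can look like, the Python code below uses the standard library's `xmlrpc.client` module. It assumes the server exposes a `translate` method that accepts a struct with a `text` field and returns a struct with a `text` field, and that it listens at `http://localhost:8080/RPC2`. Both are assumptions to check against your own server setup; the helper names `make_proxy` and `translate` are hypothetical.

```python
import xmlrpc.client


def make_proxy(url="http://localhost:8080/RPC2"):
    """Create an XML-RPC proxy; no connection is opened until a call is made.

    The default URL is an assumption; adjust host and port to your server.
    """
    return xmlrpc.client.ServerProxy(url)


def translate(proxy, text):
    """Send one sentence to the server and return the translated text.

    Assumes a "translate" method that takes {"text": ...} and returns
    a struct containing a "text" field.
    """
    result = proxy.translate({"text": text})
    return result["text"]
```

With a server running, a call such as `translate(make_proxy(), "this is a test")` would return the translation as a string. The input should normally be preprocessed (for example tokenized) in the same way as the data the engine was trained on.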
The presentations found via the links below were given at TAUS Open Source MT Showcases, which are supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
Santa Clara, October 2013
- Is the Translation Industry Ready, Jaap van der Meer, TAUS
- The Open Source MT System Moses and Its Use in the Industry, Achim Ruopp, TAUS
- Creating Competitive Advantage with Rapid Customization & Deployment of Moses, Tony O’Dowd, KantanMT
- Microsoft Translator, Chris Wendt, Microsoft
- The WeMT Program, Olga Beregovaya, Welocalize
London, June 2013
- The Evolving MT Landscape, Rahzeb Choudhury, TAUS
- Moses Past, Present and Future, Hieu Hoang, University of Edinburgh
- Moses and Other Open Resources, Maxim Khalilov, TAUS Labs
- The True Value of MT to Global Business, Udi Hershkovich, Safaba Translation Solutions
- Moses in the Mix. A Technology Agnostic Approach to a Winning MT Strategy, Lori Thicke, LexWorks
- I Used to Be a Translator. Now I Run MT, Manuel Herranz, Pangeanic
- The Dynamic Quality Framework Tools, Rahzeb Choudhury and Maxim Khalilov, TAUS
Singapore, April 2013
- Introduction and Overview, Rahzeb Choudhury, TAUS
- Moses and Other Resources, Rahzeb Choudhury, TAUS
- Strategies for Building Competitive Advantage and Revenue from Machine Translation, Dion Wiggins, Asia Online
- MT for Southeast Asian Languages, Ai Ti Aw, Institute for Infocomm
- Hunnect’s Use Case, Sándor Sojnóczky, Hunnect
- Google Translator Toolkit, Patcharin Areewong, Google
- A Small LSP’s Guide to Commercialized Open Source SMT, Tom Hoar, Precision Translation Tools
- TAUS DQF, Rahzeb Choudhury, TAUS
Seattle, October 2012
- The Landscape, Rahzeb Choudhury, TAUS
- Moses Tutorial and Other Open Resources, Achim Ruopp, TAUS Labs
- Two Practical Use Cases at AVB Translations, Joel Sigling, AVB Translations
- Full Service Enterprise-Specific MT for Global Enterprises, Alon Lavie, Safaba Translation Solutions
- Language Processing Techniques for Statistical Machine Translation, Diego Bartolome, tauyou
- The Simple Install – Streamlining Moses Setup for Industry Scale Users, Jeff Rueppel, Adobe
- TAUS Dynamic Quality Framework, Rahzeb Choudhury, TAUS
Paris, June 2012
- Moses inside Symantec, Fred Hollowood, Symantec
- An MT journey: MT in use at Sybase, a SAP company, Kerstin Bier, Sybase
- Bologna Translation Service: Making education accessible across Europe, Luc Meertens, CrossLang
- Moses: The Trusted Translations Experience, Gustavo Lucardi, Trusted Translations
- The ups and downs of implementing an MT environment for English - Hungarian, Sándor Sojnóczky, Hunnect
- Pangea MT putting open standards to work, Manuel Herranz, Pangeanic
Beijing, April 2012
- How we use Moses to develop our multi-lingual Machine Translation systems, Chengqing ZONG, Institute of Automation, Chinese Academy of Sciences
- High quality self-serve MT in SmartMATE, Jie Jiang, Applied Language Solutions
- Moses tool set. A set of tools based on Adobe technology to simplify your usage of Moses, Yu Gong, Adobe
Monaco, March 2012
- Friendly Machine Translation, Diego Bartolome, tauyou
- Moses on the Cloud. Do-it-yourself Machine Translation, Andrejs Vasiljevs, Tilde
- High quality self-serve MT in SmartMATE, Jie Jiang, Applied Language Solutions
- "Moses, Moses: Let my people go." Moses MT engine feasibility study, Serge Gladkoff, Logrus
- A Moses engine for legal translation, Joel Sigling, AVB Translations
- Moses from the point of View of an LSP: The Trusted Translations experience, Gustavo Lucardi, Trusted Translations
Taking the MT decision: selection, build-out and hosting
Showcasing the industry's innovations
Want to ride the machine translation tidal wave?
Will there be a thousand Moses MT systems?
Machine translation and Asian languages
Moses: Commodity creates opportunity
Moses takes TAUS to Beijing (through MosesCore)
Six Moses machine translation use cases (through MosesCore)
Moses Showcases at Localization World in Paris (through MosesCore)
Dr. Maxim Khalilov was responsible for Research and Development at TAUS from 2011 until 2014. He specializes in building statistical machine translation engines and has authored 30 publications in scientific journals and conference proceedings.
Maxim earned his PhD in Signal Theory and Communication from the Polytechnic University of Catalonia in Barcelona. During his PhD training he studied at the Center for Language Technologies at Macquarie University in Australia. His PhD thesis was entitled "New statistical and syntactic models for machine translation" and focused on different aspects of statistical machine translation (SMT) technology: language model optimization, word reordering for phrase- and N-gram-based SMT, and introducing syntactic knowledge into statistical translation models.
From 2009 to 2011, he worked as a post-doctoral researcher in the Language and Computation group at the University of Amsterdam, where he did research on the integration of machine learning algorithms and syntax into statistical translation systems.
Achim Ruopp specializes in translation automation, internationalization and multi-lingual natural language processing. He believes that machine translation is not just for the big guys and academia, but that everybody can build MT systems for their languages and use cases. He works on making the tools and knowledge for do-it-yourself MT available as widely as possible.
Achim has over a decade of experience in the localization industry, working at Microsoft enabling developer tools for international markets. In 2007 he started Digital Silk Road to advise customers on statistical machine translation and currently develops content and products for TAUS Labs. He is a frequent presenter at internationalization and machine translation conferences and workshops.
Achim holds an MA in computational linguistics from the University of Washington and a diploma in computer science from the Technical University Munich.
Barry Haddow is a post-doctoral researcher in machine translation, working in the statistical MT group at the University of Edinburgh. His research interests include discriminative training and feature engineering for machine translation systems, and domain adaptation in machine translation. He has also been one of the main contributors to the Moses translation system in his time at Edinburgh, both in terms of software and user support. Barry is currently the coordinator of the EU-funded MosesCore project, as well as working on the EU-funded Accept project, for translation of user-generated text. He has previously worked on EuroMatrix and EuroMatrixPlus, as well as the academic-industrial information extraction project, TXM. He received his PhD in Mathematical Physics in 1994 from the University of Aberdeen, and before coming to Edinburgh worked as a research fellow at Trinity College Dublin, and for several years as an IT consultant in a Dublin-based company.
This tutorial is part of the MosesCore project, supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
For more information about the MosesCore initiative, please check this website.