Getting Started

This tutorial will equip you with the know-how to make informed decisions about how to implement the Moses open source MT solution. It covers the full cycle from data preparation to tuning your engine to integration with your localization workflow.

The intended audiences are:

  • Localization engineers at language service providers
  • Software engineers at translation buyers
  • Software developers new to machine translation
  • Students of machine translation

Although the tutorial focuses on Moses, some modules, such as those on the principles of SMT, data, and MT evaluation, are useful even if you buy a vendor solution and choose not to implement Moses yourself.

Prerequisites

To follow the whole tutorial you need a basic knowledge of Linux/Unix administration. You must be able to operate command line tools.

Takeaways

After completing this tutorial you will:

  • Have a basic understanding of the principles of machine translation
  • Have an overview of available open source tools for MT
  • Understand the capabilities and benefits of the Moses open-source MT system
  • Understand the options to obtain and deploy Moses
  • Understand how a Moses MT system can be trained and optimized using available training data
  • Understand how a trained Moses MT system can be integrated into an existing localization workflow and estimate the effort required
  • Be able to compare Moses to other available MT systems
  • Understand how to get support for Moses and how to contribute to the OSS project

This tutorial is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.

Principles of MT

Principles of Machine Translation
This presentation provides a brief overview of the history of machine translation and the approaches that were developed during that history. It then focuses on statistical machine translation including its different flavors, the process of training an SMT system with training data and the decoding process to perform translations.

Presentation by Barry Haddow, University of Edinburgh, UK | Duration: 14:29


Data

Data Types and Sources
Training data is the essential ingredient for statistical MT systems. This presentation describes parallel and monolingual data, where to obtain it, and how to combine and select data to achieve the highest quality MT output.

Presentation by Maxim Khalilov, TAUS Labs | Duration: 08:20


Data Conversion and Corpus Preparation
This presentation and screencast describe the training data format required by the Moses SMT system and show how to convert data into this format. They also show how to align text from translated documents and how to convert TMX files to source more data for SMT training.

Demonstration by Achim Ruopp, TAUS Labs | Duration: 11:27
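
As a rough illustration of the TMX conversion step, the sketch below extracts sentence pairs from a TMX string into two parallel lists, one segment per line and line-aligned across languages. It is a minimal sketch, not the procedure shown in the screencast: the function name tmx_to_parallel, the sample data and the exact-match handling of language codes are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_parallel(tmx_text, src_lang, tgt_lang):
    """Return (source_lines, target_lines) extracted from a TMX string.

    Hypothetical helper: language codes are matched exactly (case-insensitive),
    so "en" does not match "en-US". Units missing either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    src_lines, tgt_lines = [], []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Flatten inline markup and collapse whitespace so that
                # every segment fits on exactly one line.
                segments[lang] = " ".join("".join(seg.itertext()).split())
        src = segments.get(src_lang.lower())
        tgt = segments.get(tgt_lang.lower())
        if src and tgt:
            src_lines.append(src)
            tgt_lines.append(tgt)
    return src_lines, tgt_lines


# Illustrative sample data, not taken from the tutorial.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="de"><seg>Hallo Welt.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>This unit has no translation.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>How are
      you?</seg></tuv>
      <tuv xml:lang="de"><seg>Wie geht es dir?</seg></tuv>
    </tu>
  </body>
</tmx>"""

source_lines, target_lines = tmx_to_parallel(SAMPLE_TMX, "en", "de")
```

Writing source_lines and target_lines to two UTF-8 text files, one per language, would then give a line-aligned corpus of the kind the module describes.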


Data Cleaning and Tokenization
Once data is converted into the right format, it needs to be tokenized and cleaned before it can be used to train an SMT system. This presentation explains tokenization and word segmentation for East Asian languages and outlines the cleaning options for SMT training data that are used by many MT vendors.

The presentation provides guidance on which data cleaning to apply and how to apply it to obtain the best quality MT system. For some languages it is beneficial to add linguistic information to the SMT system. This is also described.

Presentation and demonstration by Achim Ruopp, TAUS Labs | Duration: 10:30 and 09:14
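
To make the idea of corpus cleaning concrete, here is a minimal sketch of one common kind of filter: dropping sentence pairs that are empty, very long, or badly mismatched in length. The function name clean_parallel and the threshold values are assumptions chosen for illustration; they are not the settings recommended in the presentation.

```python
def clean_parallel(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Filter (source, target) sentence pairs by token count and length ratio.

    Hypothetical helper with illustrative thresholds. Tokens are assumed to be
    whitespace-separated, i.e. the text has already been tokenized.
    """
    kept = []
    for src, tgt in pairs:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        n_src, n_tgt = len(src_tokens), len(tgt_tokens)
        # Drop pairs where either side is too short (including empty lines).
        if n_src < min_len or n_tgt < min_len:
            continue
        # Drop pairs where either side is too long.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ too much; these are often misaligned.
        shorter = max(1, min(n_src, n_tgt))
        if max(n_src, n_tgt) / shorter > max_ratio:
            continue
        # Normalize whitespace on the pairs that survive.
        kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


# Illustrative input: one good pair, one empty source, one mismatched pair,
# and one pair with stray whitespace.
example_pairs = [
    ("hello world", "hallo welt"),
    ("", "leer"),
    ("a", "b c d e f g h i j k l"),
    ("  extra   spaces ", "zusätzliche  leerzeichen"),
]
cleaned = clean_parallel(example_pairs)
```

Filtering both sides together, rather than each file separately, keeps the two languages line-aligned after cleaning.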


Training Systems

Moses Introduction
This presentation gives an overview of the Moses machine translation system and its associated components, and explains what is required to obtain and run the system. It also describes the history of Moses and the larger open-source Moses ecosystem, including the development process, support and opportunities to contribute.

Presentation by Barry Haddow, University of Edinburgh, UK | Duration: 20:54


Training a Moses MT System
This screencast shows how to train a small Moses SMT system with the training data prepared in earlier screencasts, how to tune the trained system using a tuning set and finally how to perform translations with the trained system.

Demonstration by Achim Ruopp, TAUS Labs | Duration: 10:34


Bulk Translation and MT System Optimization
This screencast uses the Moses SMT system trained earlier to bulk translate a set of test data for which the BLEU score is calculated based on the available reference translations. In the second part of the screencast, the trained Moses system is optimized for lower memory use and translation speed.

Demonstration by Achim Ruopp, TAUS Labs | Duration: 06:35


Evaluating Output

Automatic Metrics
This presentation provides a contrast between automated evaluation and human evaluation of machine translation output. We explain how automated evaluation is useful in the development of MT systems and then go on to describe the automated metrics BLEU, TER, GTM and Meteor.

Presentation by Maxim Khalilov, TAUS Labs | Duration: 12:10
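
For readers who want to see what a metric such as BLEU computes, the sketch below implements a simplified corpus-level BLEU: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference per sentence, whitespace tokenization and no smoothing, so its scores will not match those of standard evaluation tools; the function names are made up for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU in the range 0.0 to 1.0.

    Assumptions: one reference per hypothesis, whitespace tokens, no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
perfect = corpus_bleu(["the cat sat on the mat"], reference)
partial = corpus_bleu(["the cat sat on a mat"], reference)
too_short = corpus_bleu(["the cat sat on"], reference)
```

In this sketch an exact match scores 1.0, a hypothesis with one wrong word scores lower, and a truncated hypothesis is penalized by the brevity penalty even though every n-gram it contains is correct.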


Human Evaluation
This presentation describes different strategies of human evaluation for MT output, how to use them for error analysis for the improvement of MT systems and how to apply them in an industry setting to achieve the desired project goals.

Presentation by Maxim Khalilov, TAUS Labs | Duration: 08:51


Integration

Document Translation and Integration scenarios
Translation of complex document formats is common in the language industry. This presentation explains how the Okapi Framework and the Moses for Localization open source project can be used to translate these file formats using machine translation. We also address how to translate web pages with Moses and how to integrate Moses MT systems into content management or translation workflows using available web APIs.

Presentation by Achim Ruopp, TAUS Labs | Duration: 09:55


Document Translation and Web API demo
Previous demos showed how to translate single sentences and collections of sentences. This demo shows how to translate complex document formats using a combination of the Okapi Framework, Moses for Localization and Moses. The second half demonstrates the use of two web APIs that are available for Moses: the Moses Server XML-RPC API and the Moses for Localization REST API.

Demonstration by Achim Ruopp, TAUS Labs | Duration: 05:22
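
As an illustration of the XML-RPC route, the following sketch sends one sentence to a running Moses server using Python's standard xmlrpc.client module. The URL, the port, the translate method and the "text" key in the request and response are assumptions based on how the Moses server's sample clients are commonly written; check them against your own server setup. The helper names are made up for this example.

```python
import xmlrpc.client


def connect(url="http://localhost:8080/RPC2"):
    """Create a proxy for a Moses server (URL and port are assumptions)."""
    return xmlrpc.client.ServerProxy(url)


def translate(proxy, text):
    """Send one sentence to the server and return the translated text.

    Assumes the server exposes a `translate` method that takes a struct with
    a "text" field and returns a struct with a "text" field. The input should
    be preprocessed (e.g. tokenized) the same way as the training data.
    """
    response = proxy.translate({"text": text})
    return response["text"]


# Usage against a live server (not executed here):
#   proxy = connect()
#   print(translate(proxy, "das ist ein kleines haus"))
```

Passing the proxy in as an argument keeps the function easy to test with a stand-in object and makes it simple to point at a different server.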


Use Cases

The presentations found via the links below were given at TAUS Open Source MT Showcases, which are supported by the European Commission Grant Number 288487 under the 7th Framework Programme.

Santa Clara, October 2013

London, June 2013

Singapore, April 2013

Seattle, October 2012

Paris, June 2012

Beijing, April 2012

Monaco, March 2012

Useful Resources

Moses Resources

Website: http://www.statmt.org/moses/
Source: https://github.com/moses-smt/mosesdecoder
Twitter: @MrMoses_Esq  


Articles

Taking the MT decision: selection, build-out and hosting
Showcasing the industry's innovations
Want to ride the machine translation tidal wave?
Will there be a thousand Moses MT systems?
Machine translation and Asian languages
Moses: Commodity creates opportunity
Moses takes TAUS to Beijing (through MosesCore)
Six Moses machine translation use cases (through MosesCore)
Moses Showcases at Localization World in Paris (through MosesCore)


Reports

Manager's Guide to Implementing Open Source SMT
How to Implement Open Source MT Solutions
Moses Users: Experiences and future requirements
Moses Users: Changing priorities (through MosesCore)


Technologies

 

TAUS Directories: Find tools to improve your engine performance

Bios

Maxim Khalilov

Dr. Maxim Khalilov was responsible for Research and Development at TAUS from 2011 until 2014. He specializes in building statistical machine translation engines and has authored 30 publications in scientific journals and conference proceedings.

Maxim earned his PhD in Signal Theory and Communication from the Polytechnic University of Catalonia in Barcelona. During his PhD training he studied at the Center for Language Technologies at Macquarie University in Australia. His PhD thesis was entitled "New statistical and syntactic models for machine translation" and focused on different aspects of statistical machine translation (SMT) technology: language model optimization, word reordering for phrase- and N-gram-based SMT, and introducing syntactic knowledge into statistical translation models.

From 2009 to 2011, he worked as a post-doctoral researcher in the Language and Computation group at the University of Amsterdam, where he did research on the integration of machine learning algorithms and syntax into statistical translation systems.


Achim Ruopp

Achim Ruopp specializes in translation automation, internationalization and multilingual natural language processing. He believes that machine translation is not just for the big guys and academia, but that everybody can build MT systems for their languages and use cases. He works on making the tools and knowledge for do-it-yourself MT available as widely as possible.

Achim has over a decade of experience in the localization industry, working at Microsoft enabling developer tools for international markets. In 2007 he started Digital Silk Road to advise customers on statistical machine translation and currently develops content and products for TAUS Labs. He is a frequent presenter at internationalization and machine translation conferences and workshops.

Achim holds an MA in computational linguistics from the University of Washington and a diploma in computer science from the Technical University Munich.


Barry Haddow

Barry Haddow is a post-doctoral researcher in machine translation, working in the statistical MT group at the University of Edinburgh. His research interests include discriminative training and feature engineering for machine translation systems, and domain adaptation in machine translation. He has also been one of the main contributors to the Moses translation system in his time at Edinburgh, both in terms of software and user support. Barry is currently the coordinator of the EU-funded MosesCore project, and also works on the EU-funded Accept project for the translation of user-generated text. He has previously worked on EuroMatrix and EuroMatrixPlus, as well as the academic-industrial information extraction project TXM. He received his PhD in Mathematical Physics in 1994 from the University of Aberdeen, and before coming to Edinburgh worked as a research fellow at Trinity College Dublin and for several years as an IT consultant in a Dublin-based company.


This tutorial is part of the MosesCore project supported by the European Commission Grant Number 288487 under the 7th Framework Programme.

For more information about the MosesCore initiative, please check this website.