DeMT™ Build
Measured Improvement. Success-Based Pricing.
Are you trying to improve the performance of your MT engines for specific languages and domains? TAUS DeMT™ Build is an end-to-end service that covers domain definition, data collection, data cleaning, and evaluation and benchmarking. We guarantee an improvement in BLEU scores and base our price on the percentage of improvement.
How it works
Domain Definition

The first step is to define and frame the domain for which training and customization are required. Together with the customer, we align on a domain definition that matches the requirements of the users. The domain is defined by one main category and up to five subcategories.
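
As an illustration only, the "one main category and up to five subcategories" rule can be written down as a small data structure. This is a minimal sketch, not the TAUS format: the class name `DomainDefinition`, its field names, and the example categories are all hypothetical.

```python
from dataclasses import dataclass

# Limit taken from the description above: at most five subcategories.
MAX_SUBCATEGORIES = 5


@dataclass(frozen=True)
class DomainDefinition:
    """Hypothetical container for a domain: one main category plus subcategories."""

    main_category: str
    subcategories: tuple = ()

    def __post_init__(self):
        # A domain needs exactly one non-empty main category.
        if not self.main_category.strip():
            raise ValueError("main_category must not be empty")
        # Reject definitions that exceed the five-subcategory limit.
        if len(self.subcategories) > MAX_SUBCATEGORIES:
            raise ValueError(
                f"at most {MAX_SUBCATEGORIES} subcategories are allowed, "
                f"got {len(self.subcategories)}"
            )


# Example with made-up category names.
domain = DomainDefinition("Healthcare", ("Cardiology", "Pharmacy"))
```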

Data Collection

In the next step, we help you access vast amounts of multilingual data. Through custom web crawling, based on your defined domain and language pair, and other innovative techniques, we start a process to create, evaluate and validate a custom training dataset.

Data Cleaning

The training dataset undergoes a thorough cleaning process managed by the TAUS Data team. Cleaning steps include deduplication, translation check (using advanced methods such as embeddings), language check, markup removal, alignment, tokenization, and punctuation fixes.
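
To make two of these steps concrete, here is a minimal sketch of markup removal and deduplication on sentence pairs, using only the Python standard library. It is not the TAUS pipeline: the function names `strip_markup` and `clean_pairs` are hypothetical, the regex-based tag removal is a simplification, and the embedding-based translation check, language check, alignment, and tokenization are left out because they need external models.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # simplistic matcher for HTML-like tags
WS_RE = re.compile(r"\s+")        # any run of whitespace


def strip_markup(text: str) -> str:
    """Remove HTML-like tags, unescape entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return WS_RE.sub(" ", html.unescape(no_tags)).strip()


def clean_pairs(pairs):
    """Return cleaned (source, target) pairs without empties or exact duplicates."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = strip_markup(src), strip_markup(tgt)
        # Drop pairs where either side is empty after cleaning.
        if not src or not tgt:
            continue
        # Deduplicate case-insensitively on the cleaned pair.
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned
```

Deduplicating after markup removal means that pairs differing only in tags or spacing collapse into a single entry, and the first occurrence is the one that is kept.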

Evaluation & Benchmarking

In the next phase, both public MT engines and the customer’s baseline engines are customized with the DeMT™ training dataset. The output of the engines before and after customization is compared and carefully evaluated by the independent partner company Polyglot Technology, which generates a report showing BLEU and COMET scores.
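
One way to read "percentage of improvement" is as the relative change between the baseline score and the score after customization. The sketch below assumes that reading; the exact formula used for reporting and pricing is not stated here, and the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, customized: float) -> float:
    """Relative change from baseline to customized score, in percent.

    Assumption: improvement is measured relative to the baseline score.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (customized - baseline) / baseline * 100.0


# Illustrative numbers only, not measured results:
# a baseline BLEU of 40.0 and a customized BLEU of 50.0.
gain = relative_improvement(40.0, 50.0)
```

With these illustrative inputs the function returns 25.0, that is, a 25% relative improvement over the baseline.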

DeMT™ Evaluation Report, June 2022