Defining and framing the domain for which training and customization are required is the first step. Together with the customer, we agree on a domain definition that matches the users' requirements. The domain is defined by one main category and up to five subcategories.
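As an illustration only, the "one main category, up to five subcategories" rule could be captured in a small data structure. The class name `DomainDefinition`, its fields, and the example categories below are hypothetical and are not part of any TAUS tooling.

```python
from dataclasses import dataclass

# Limit taken from the text: one main category and up to five subcategories.
MAX_SUBCATEGORIES = 5


@dataclass(frozen=True)
class DomainDefinition:
    """Hypothetical container for an agreed domain definition."""

    main_category: str
    subcategories: tuple = ()

    def __post_init__(self):
        # Exactly one main category is required.
        if not self.main_category.strip():
            raise ValueError("a main category is required")
        # Subcategories are optional but capped at five.
        if len(self.subcategories) > MAX_SUBCATEGORIES:
            raise ValueError(
                f"at most {MAX_SUBCATEGORIES} subcategories are allowed, "
                f"got {len(self.subcategories)}"
            )


# Made-up example values, for illustration only.
example_domain = DomainDefinition("Healthcare", ("Cardiology", "Oncology"))
```

Validating the definition at construction time means a domain with six or more subcategories is rejected before any data collection starts.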
In the next step, we help you access large volumes of multilingual data. Using custom web crawling based on your defined domain and language pair, along with other innovative techniques, we create, evaluate and validate a custom training dataset.
The training dataset then undergoes a thorough cleaning process managed by the TAUS Data team. Cleaning steps include deduplication, a translation check (using advanced methods such as embeddings), a language check, markup removal, alignment, tokenization and punctuation fixes.
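To give a feel for what some of these cleaning steps involve, here is a minimal sketch in Python that applies markup removal, a simple punctuation fix, and deduplication to a list of source/target sentence pairs. It is not the TAUS pipeline: all function names and the `max_length_ratio` threshold are assumptions made for illustration, the length-ratio filter is only a crude stand-in for the embedding-based translation check, and the language check, alignment and tokenization steps are left out. The punctuation rule is also a simplification and would need to be language-specific in practice.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def strip_markup(text):
    """Remove HTML/XML-style tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", html.unescape(no_tags)).strip()


def fix_punctuation(text):
    """Normalize Unicode and drop stray spaces before commas and periods."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+([,.])", r"\1", text)


def clean_pairs(pairs, max_length_ratio=3.0):
    """Return cleaned, deduplicated (source, target) pairs.

    max_length_ratio is an assumed threshold: pairs whose lengths differ
    by more than this factor are treated as likely mistranslations.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source = fix_punctuation(strip_markup(source))
        target = fix_punctuation(strip_markup(target))
        # Drop pairs with an empty side.
        if not source or not target:
            continue
        # Crude stand-in for a translation check: compare lengths.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue
        # Drop segments that were copied rather than translated.
        if source.casefold() == target.casefold():
            continue
        # Deduplicate on the cleaned, case-folded pair.
        key = (source.casefold(), target.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((source, target))
    return cleaned


sample_pairs = [
    ("<b>Hello</b> world .", "Hallo Welt ."),
    ("Hello world.", "Hallo Welt."),           # duplicate once cleaned
    ("Hello", "Hello"),                        # untranslated copy
    ("Hi", "Dies ist ein sehr langer Satz"),   # implausible length ratio
    ("", "leer"),                              # empty source
]
cleaned_pairs = clean_pairs(sample_pairs)
```

Deduplication runs after markup and punctuation cleanup so that pairs differing only in tags or spacing collapse into a single entry.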
In the next phase, both public MT engines and the customer’s baseline engines are customized with the DeMT™ training dataset. The output of the engines before and after customization is compared and carefully evaluated by the independent partner company Polyglot Technology, which generates a report showing BLEU and COMET scores.