Yamagata provided TAUS with in-domain datasets and translation memories, along with around 200 annotated translation segments (good and bad examples) for both language pairs. The custom model creation involved the following steps:
- Cleaning of the provided datasets to identify high-quality translation segments.
- Generation of additional synthetic data: TAUS created paraphrases (similar sentences that should receive scores close to the original examples) and perturbations (sentences with specific parts altered, yielding examples that should receive low scores). The team then scored all examples, interpolating scores for the paraphrases and perturbations, so that every segment had a score for training.
- Experiments with different portions of the training dataset to fine-tune the model, aiming for the lowest possible error rate on the test set.
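The score-interpolation step above can be sketched as follows. This is a minimal illustration assuming a 0–1 quality scale; the function name, penalty value, and score cap are hypothetical assumptions, not TAUS's actual pipeline:

```python
# Hypothetical sketch of assigning training scores to synthetic QE
# examples. Penalty and cap values are illustrative assumptions.

def score_synthetic(example_type: str, source_score: float) -> float:
    """Interpolate a quality score for a synthetic training example.

    Paraphrases inherit a score close to the original segment's score;
    perturbations are pushed toward the low end of the scale.
    """
    if example_type == "original":
        return source_score
    if example_type == "paraphrase":
        # Keep paraphrases near the original, with a small penalty.
        return max(0.0, source_score - 0.05)
    if example_type == "perturbation":
        # Perturbed outputs should score low regardless of the original.
        return min(source_score, 0.2)
    raise ValueError(f"unknown example type: {example_type}")
```

With this scheme, a perturbation of a high-quality segment still lands in the low-score region, giving the model clear negative examples to learn from.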
Yamagata required the MTQE model to provide a binary classification of translations as good (no post-editing required) or bad (light post-editing required). The customized MTQE model is fine-tuned with distinct thresholds for the two language pairs in order to minimize the classification error rate. As a result, a score of 0.75 or above is considered 'Good' for DE>EN, whereas a score of 0.85 or above is required for FR>EN.
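The resulting decision rule can be sketched in a few lines. The threshold values come from the text above; the scale assumption (scores in 0–1) and the code itself are illustrative, not the deployed implementation:

```python
# Per-language-pair 'Good' thresholds as described in the text.
GOOD_THRESHOLDS = {
    "DE>EN": 0.75,
    "FR>EN": 0.85,
}

def classify(score: float, lang_pair: str) -> str:
    """Return 'Good' (no post-editing) or 'Bad' (light post-editing)."""
    threshold = GOOD_THRESHOLDS[lang_pair]
    return "Good" if score >= threshold else "Bad"
```

For example, a score of 0.80 is 'Good' for DE>EN but 'Bad' for FR>EN, reflecting the stricter threshold tuned for that pair.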
TAUS Estimate API as the Ultimate Risk Management Solution for a Global Technology Corporation
Based on examples of texts from one of the largest technology companies in the world, TAUS generated a large dataset and customized a quality prediction model. The accuracy rate achieved was 85%.
Domain-Specific Training Data Generation for SYSTRAN
After the training with TAUS datasets in the pandemic domain, the SYSTRAN engines improved on average by 18% across all twelve language pairs compared to the baseline engines.
Customization of Amazon Active Custom Translate with TAUS Data
Customizing Amazon Translate with TAUS Data consistently improved the BLEU score measured on the test sets: by more than 6 BLEU points on average, and by at least 2 BLEU points in every case.
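For context, BLEU scores machine output by n-gram overlap with reference translations. The following is a deliberately simplified single-sentence sketch with crude smoothing, for illustration only; real evaluations such as the one above would use a standard implementation like sacrebleu:

```python
# Simplified sentence-level BLEU for illustration. Real evaluation
# should use a standard tool (e.g. sacrebleu) with corpus statistics.
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of 1..max_n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = max(1, sum(hyp_ng.values()))
        # Crude smoothing so a zero precision does not zero the whole score.
        precisions.append(overlap / total if overlap else 1e-9)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = math.exp(min(0.0, 1 - len(ref) / len(hyp))) if hyp else 0.0
    return 100 * bp * geo_mean
```

A perfect match scores 100, and a one-point BLEU gain is conventionally treated as a meaningful improvement, which puts the 6-point average gain above into perspective.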