Finding high-quality data for MT training has always been a challenge on the path to generating high-performing MT output. The challenge increases when the language pairs are rare or when training data in a lesser-known domain is needed.
Machine translation systems do a great job of providing automated translation when fed with the right training data. However, it was a challenge to find the high-quality datasets needed to build specialized automatic tools for COVID-19, a new topic in the healthcare domain.
TAUS provided Pangeanic with a total of 1.8 million words of MT training data covering the English to Spanish, German, Polish, Russian, and Chinese language pairs.
Using the data provided by TAUS, Pangeanic built COVID-19 domain-specific neural machine translation (NMT) models for the five language pairs on Pangeanic ECO, a user-friendly customer portal where users can adapt models using three levels of training settings.
The highest BLEU score improvement was recorded in the English > Russian language pair at 50%, followed by English > Chinese at 26%, English > German at 20%, English > Spanish at 9%, and English > Polish at 8%.
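For readers who want to reproduce this kind of comparison on their own engines, here is a minimal sketch of how a percentage improvement in BLEU could be computed. It assumes the figures are relative gains over a baseline model, which the text above does not confirm, and the function name and the scores in the example are hypothetical placeholders rather than values from the Pangeanic evaluation.

```python
def relative_improvement(baseline: float, adapted: float) -> float:
    """Return the percentage change from a baseline score to an adapted score.

    Assumption: "improvement" means relative gain over the baseline,
    i.e. (adapted - baseline) / baseline, expressed as a percentage.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (adapted - baseline) / baseline * 100.0


# Hypothetical BLEU scores, for illustration only (not from the case study).
baseline_bleu = 20.0
adapted_bleu = 30.0

gain = relative_improvement(baseline_bleu, adapted_bleu)
print(f"Relative BLEU improvement: {gain:.0f}%")
```

Under the other common reading, an absolute gain, the figure would simply be the difference between the two scores in BLEU points, so it is worth checking which convention a report uses before comparing numbers.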
Şölen is the Head of Digital Marketing at TAUS, where she leads digital growth strategies with a focus on generating compelling results via search engine optimization, effective inbound content, and social media. She has over seven years of experience in related fields. She holds BAs in Translation Studies and Brand Communication from Istanbul University in addition to an MA in European Studies: Identity and Integration from the University of Amsterdam. After gaining experience as a transcreator for marketing content, she worked in business development for a mobile app and in content marketing before joining TAUS in 2017. She believes in keeping up with modern digital trends and the power of engaging content. She also writes regularly for the TAUS Blog/Reports and manages several social media accounts she created on topics of personal interest with over 100K followers.
TAUS provided 172,980 segments of training data in the French-German language pair, in a very specific area of the broader legal domain, to Custom MT, one of the latest and leading MT services companies delivering affordable machine translation engine training, evaluation, and integration.
Online machine translation engines provide easy access to high-quality machine translations. They are optimized for content like news articles and social media posts that users of online platforms frequently translate.
Data annotation is the categorization and labeling of data to be used in the training of AI applications. Training datasets must be carefully categorized and annotated for each specific use case. High-quality, human-powered data annotation allows companies to build and improve AI implementations, which results in enhanced customer experience solutions such as product recommendations, relevant search engine results, computer vision, speech recognition, chatbots, and more.