KantanMT created an English-German machine translation engine, which was first trained on open-source OPUS technical data (mainly software and instructional content, e.g. GNOME, KDE, PHP).
KantanMT then evaluated the engine on a test set of 500 segments. The engine selected this test set at random from the training data while the first iteration of the engine was being trained. No cleaning or editing was performed on the test set. The evaluation yielded the following results (also shown in image I below):
- BLEU = 30%
- F-Measure = 48%
- TER = 65%
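The random selection of the 500-segment test set described above can be illustrated with a minimal sketch. This is not KantanMT's actual procedure: the function name `hold_out_test_set`, the fixed seed, the toy corpus, and the choice to remove the sampled segments from the training portion are all assumptions made for illustration.

```python
import random


def hold_out_test_set(segment_pairs, test_size=500, seed=0):
    """Randomly pick `test_size` segment pairs as a test set (hypothetical helper).

    Assumption: the sampled pairs are excluded from the training portion.
    The source text does not say whether KantanMT does this.
    """
    if test_size > len(segment_pairs):
        raise ValueError("test_size is larger than the corpus")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test_indices = set(rng.sample(range(len(segment_pairs)), test_size))
    train = [p for i, p in enumerate(segment_pairs) if i not in test_indices]
    test = [p for i, p in enumerate(segment_pairs) if i in test_indices]
    return train, test


# Toy corpus of (source, target) pairs standing in for the real training data.
corpus = [(f"source segment {i}", f"target segment {i}") for i in range(2000)]
train_pairs, test_pairs = hold_out_test_set(corpus)
print(len(train_pairs), len(test_pairs))
```

Using a seeded generator keeps the test set identical across runs, which matters when scores from two training iterations are compared on the same segments.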
KantanMT then re-trained the engine after adding TAUS Data Cloud technical data from EMC in the IT domain to the training set. Adding these data improved all three scores considerably, indicating higher-quality output from the engine. The improved results were as follows (also shown in image II below):
- BLEU = 42%
- F-Measure = 61%
- TER = 53%
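To make the size of the change concrete, the two sets of scores can be compared directly. The sketch below is illustrative only: it uses the figures reported in the text, and the names `baseline`, `retrained`, and `compare` are hypothetical. It treats TER as an error rate, where a lower value is better, while BLEU and F-Measure are higher-is-better.

```python
# Scores reported in the text, in percent.
baseline = {"BLEU": 30, "F-Measure": 48, "TER": 65}   # OPUS data only
retrained = {"BLEU": 42, "F-Measure": 61, "TER": 53}  # OPUS + TAUS Data Cloud data

# TER is an error rate, so a decrease counts as an improvement.
LOWER_IS_BETTER = {"TER"}


def compare(before, after):
    """Return {metric: (absolute change in points, relative change in %, improved?)}."""
    rows = {}
    for metric, old in before.items():
        new = after[metric]
        delta = new - old
        relative = 100.0 * delta / old
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        rows[metric] = (delta, relative, improved)
    return rows


changes = compare(baseline, retrained)
for metric, (delta, relative, improved) in changes.items():
    print(f"{metric}: {delta:+d} points ({relative:+.1f}% relative), improved={improved}")
```

With the reported figures, BLEU rises by 12 points, F-Measure rises by 13 points, and TER falls by 12 points, so all three metrics move in the direction that indicates better output.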
In conclusion, adding TAUS Data Cloud data relevant to the translation project increased the engine's BLEU score from 30% to 42% and therefore improved the quality and performance of the MT output.
Note: A new engine was created, rather than showing an existing commercial KantanMT engine, because KantanMT's clients usually use a mixture of data sets from the KantanMT Library to bulk up word count or boost scores, which would make it impossible to see clearly how much the TAUS Data Cloud data contributed to quality (the table below shows an example of TAUS Data Cloud data included in the KantanMT Library™).
TAUS Data Cloud data usage within the KantanMT Platform
The table below provides an overview of the proportion of TAUS Data Cloud data used within the KantanMT Library™ for building MT engines.
Percentage of TAUS Data Cloud data (downloaded & located in KantanLibrary™)
|Total build jobs|100|
|Build jobs with Library|48|
|Build jobs with TAUS Data Cloud data (in the Library)|27|
|Top-5 Language Pairs|% Usage over TAUS Library Data|
*This data has been extracted from 2016 usage records
**Build job: MT engine created
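The proportions implied by the table can be worked out as follows. This is a small illustrative calculation that assumes the three figures are counts per 100 build jobs and that the 27 jobs using TAUS Data Cloud data are a subset of the 48 jobs using the Library. The variable and function names are hypothetical.

```python
# Figures from the table, assumed to be counts per 100 build jobs.
total_jobs = 100
library_jobs = 48   # build jobs that use the Library
taus_jobs = 27      # build jobs that use TAUS Data Cloud data from the Library


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


library_of_all = share(library_jobs, total_jobs)
taus_of_all = share(taus_jobs, total_jobs)
taus_of_library = share(taus_jobs, library_jobs)

print(f"Library jobs as share of all jobs:   {library_of_all:.2f}%")
print(f"TAUS jobs as share of all jobs:      {taus_of_all:.2f}%")
print(f"TAUS jobs as share of Library jobs:  {taus_of_library:.2f}%")
```

Under these assumptions, TAUS Data Cloud data appears in 56.25% of the build jobs that use the Library at all.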
The table above shows that KantanMT and KantanMT clients make heavy use of TAUS Data Cloud data from the KantanMT Library when building MT engines: it is used in 27 of the 100 build jobs, that is, in more than half of the 48 build jobs that draw on the Library. The reasons are that the Data Cloud contains:
- High-quality data: the data is by and large human-translated and uploaded by content producers. A number of basic cleaning filters are also applied to data before it is uploaded to the Data Cloud. For a detailed description see: How is data upload quality monitored in Data Cloud?
- Large volumes of data: the Data Cloud is a huge industry-shared repository with broad language-pair coverage and a lot of domain-specific data. For total figures see What data does the Data Cloud contain? For figures per translation direction and per other selected features, see Discover & Download. Note: if you do not have credentials yet, you can sign up to the Data Cloud Free Tier.