Best Practices for TAUS Data Cloud

The availability of language and translation data sets, whether general or domain/origin-specific, in high-demand, business-oriented language pairs as well as in smaller and usually less-resourced languages is crucial for the language and localization industry. State-of-the-art translation technology is data-driven and learns from data.

In its nearly 10 years of existence, the TAUS Data Cloud has proven to be a valuable resource, giving access to large industry-shared translation data sets in multiple languages, industry domains and content types for different use cases.

So what is the best way to put the data from the Data Cloud to work in your projects? And what are some of the Data Cloud use cases?

Discovering and Identifying Relevant Data

Discovering data relevant to your projects has become more straightforward thanks to improvements to the Discover & Download UI in the latest Data Cloud releases (current version: release 2.2.0). You can now select the source and target languages you're looking for and receive a detailed list of all the data sets available in this translation direction, together with their associated metadata. For each data set, the metadata gives the industry domain, content type, data owner/origin, segment and source word counts, and whether the data is direct or matrix. The list is ordered by most recent upload, but you can sort it by other attributes as well. If you wish to receive more specific results, you can narrow your search further by selecting from the available values for the different attributes, for example "Industrial Electronics" as industry domain and/or "Sales and Marketing Material" as content type.
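The same narrowing logic can be applied to your own inventory of downloaded data sets. The following is a minimal sketch in Python, not the Data Cloud API: the `DataSetInfo` record, its field names, the `find_data_sets` helper and the catalog entries are all hypothetical and only mirror the metadata attributes described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetInfo:
    # Hypothetical record; field names are assumptions, not a TAUS schema.
    name: str
    source_lang: str
    target_lang: str
    domain: str
    content_type: str
    owner: str
    segments: int
    source_words: int


def find_data_sets(catalog, source_lang, target_lang,
                   domain=None, content_type=None, sort_key="source_words"):
    """Return data sets for one translation direction, optionally narrowed
    by domain and content type, sorted by sort_key (largest first)."""
    matches = [
        d for d in catalog
        if d.source_lang == source_lang
        and d.target_lang == target_lang
        and (domain is None or d.domain == domain)
        and (content_type is None or d.content_type == content_type)
    ]
    return sorted(matches, key=lambda d: getattr(d, sort_key), reverse=True)


# Made-up entries for illustration only.
CATALOG = [
    DataSetInfo("set-a", "en", "de", "Industrial Electronics",
                "Sales and Marketing Material", "owner-1", 1200, 15000),
    DataSetInfo("set-b", "en", "de", "Industrial Electronics",
                "Instructions for Use", "owner-2", 5000, 62000),
    DataSetInfo("set-c", "en", "fr", "Industrial Electronics",
                "Sales and Marketing Material", "owner-1", 800, 9000),
    DataSetInfo("set-d", "en", "de", "Legal Services",
                "Sales and Marketing Material", "owner-3", 300, 4000),
]

if __name__ == "__main__":
    for d in find_data_sets(CATALOG, "en", "de",
                            domain="Industrial Electronics"):
        print(d.name, d.content_type, d.source_words)
```

Leaving `domain` and `content_type` unset returns everything in the chosen translation direction, which corresponds to the unfiltered list; each extra attribute narrows it further.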

Once you identify the relevant data sets for your specific purposes from the list, you can view random samples of each data set. This “look-inside” option allows you to browse through the parallel data sets to make sure that they meet your expectations regarding data relevance, quality and style, before you decide to download them.

Data Cloud Use Cases

The primary use of Data Cloud generic and domain/origin-specific data is machine translation (MT), more specifically the training of MT engines and the evaluation and improvement of MT performance. Some use cases can be seen in the overview of the Data Cloud "Capitalizing on Translation Data" (slides 13-16). While the volume of available project data is an important factor (the more data, the better), in many cases the relevance of the data to a project and the quality of the data prove to be more important when it comes to new-generation translation technologies. The trade-off between quantity, relevance and quality should therefore always be considered.

Another way to use the Data Cloud is to create derivatives from its monolingual or multilingual data through curation-related activities such as processing and managing data. Examples are (further) annotation or cleaning of data so that it fits a specific project better, or extraction of useful, project-specific data. A prime example of the latter is the extraction of industry-shared terminology to create term banks or glossaries for different purposes.
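To make the idea of cleaning concrete, here is a minimal sketch in Python of a few common clean-up steps for parallel segments: whitespace normalization, removal of empty or untranslated pairs, a length-ratio check and de-duplication. The `clean_segments` function, its `max_length_ratio` threshold and the sample pairs are illustrative assumptions, not a TAUS procedure; real projects usually tune or extend these rules.

```python
def clean_segments(pairs, max_length_ratio=3.0):
    """Return cleaned (source, target) pairs, preserving input order.

    The rules and the default threshold are assumptions for illustration.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        # Collapse runs of whitespace and trim both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())
        # Drop pairs where either side is empty.
        if not source or not target:
            continue
        # Drop pairs whose target merely copies the source. This may also
        # remove legitimate segments such as product names.
        if source == target:
            continue
        # Drop pairs whose character lengths differ too much, a common
        # sign of misalignment.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue
        # Drop exact duplicates (after normalization).
        key = (source, target)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


if __name__ == "__main__":
    sample = [
        ("Press the  power button.", "Drücken Sie die Ein/Aus-Taste."),
        ("Press the power button.", "Drücken Sie die Ein/Aus-Taste."),
        ("", "Leer"),
        ("OK", "OK"),
    ]
    for source, target in clean_segments(sample):
        print(source, "|", target)
```

A stricter `max_length_ratio` trades quantity for quality, which is the same trade-off that applies when selecting MT training data.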

You can also use Data Cloud resources to fill existing translation memory (TM) repositories for purposes such as boosting post-editing (PE) productivity, enhancing autosuggest results, and retrieving more translation references through a search tool.

Furthermore, you can leverage Data Cloud resources for natural language processing (NLP) tasks such as cross-lingual information retrieval, text classification, language modeling, image captioning, question answering, speech recognition and document summarization. And finally, you can use the translation data for projects based on translation or comparative language studies, that is, to explore two different languages within an interdisciplinary and linguistic context on the basis of a parallel, domain-specific Data Cloud data set.

The above-mentioned list of use cases of the Data Cloud is by no means exhaustive, as the volume and richness of Data Cloud resources can facilitate many other use cases and purposes.

Future Prospects

There are ongoing efforts to collect more data and make it available through the Data Cloud. These efforts include data business development activities within the TAUS Data Cloud roadmap and large-scale web crawling activities within collaborative projects funded by the EU, such as the 3-year ModernMT project (completed by the end of 2017) and the 18-month ParaCrawl project within the Connecting Europe Facility (CEF) program (started in September 2017). Both projects have translation technology experts on board from research/academic institutions and the industry.

TAUS is investigating the best possible business model to tackle the issues surrounding availability and accessibility of translation data sets. Is it the current TAUS Data Cloud reciprocal model or a future TAUS Data Market transaction model that can provide the best solution for driving market adoption?

We welcome you to follow some of our efforts, news and achievements by reading relevant TAUS reports such as the Translation Data Landscape Report and the Data Market White Paper.
