Data Cloud

The TAUS Data Cloud is a neutral and secure repository platform for sharing, pooling and leveraging translation data.

It contains tens of billions of words in multiple translation directions, spanning 17 well-defined industry domains and 9 content types, and drawn from multiple data sources.

It is a super cloud for the global translation community, supporting translation automation efforts, improving translation quality and fueling business innovation.

on 10/08/2015

The TAUS Data Cloud offers the following data features:

Search: users can search for translations of strings (words, phrases, segments) in different translation directions and industry domains.

Upload: users can upload translation data in TMX format.

Discover & Download: users can discover, browse, view samples and download data sets (in TMX format) with the selected features.

Users can also provide Feedback on data quality, keep track of their Account History and re-access the files they have uploaded or downloaded.

Access to features depends on the user's subscription.

 

on 10/08/2015

The Data Cloud contains translation data of tens of billions of words in multiple translation directions across 17 well-defined industry domains and 9 content types. 

More specifically, the Data Cloud translation data sets are:

  • Industry-shared TMs, contributed by TAUS members from the industry sector i.e. providers, buyers, LSPs
  • Individual TMs contributed by TAUS members i.e. freelance translators, professionals in language technology
  • Public data made available to the community as such by large international and intergovernmental organisations
    • Public data owners include: European Union institutions, United Nations, etc.
  • Parallel data sets generated, processed, curated and/or improved by companies or academic institutions from
    • Existing available public parallel data sets
    • Multilingual public sector websites
    • Multilingual websites that allow free use and reuse of their content

Acknowledgements: Special thanks to all organisations and individuals that create, generate, process, curate and make data available to the language and machine translation community for use in their projects.

 

on 10/08/2015

The primary use case of the translation data in the Data Cloud is training machine translation (MT) engines. It can also be valuable for buyers and providers of translation technology and translation services to improve the quality of translations and automation efforts. It can further be of significant help to students of translation and translation technology.

on 10/08/2015

The Data Cloud currently supports 62 language-locales (i.e. regional language variants). Languages are represented by ISO 639-1 two-letter language codes. Regions are represented by ISO 3166-1 alpha-2 two-letter country codes (or, in a few cases, by non-standard codes such as XL for Latin America). A language-locale is written as a language code, followed by a hyphen ("-"), followed by a region code, for example fr-CA for Canadian French.

The current list of supported languages is the following:

af-ZA Afrikaans

ar-AE Arabic (U.A.E.)

ar-AR Arabic

ar-EG Arabic (Egypt)

ar-SA Arabic (Saudi Arabia)

be-BY Belarusian

bg-BG Bulgarian

cs-CZ Czech

cy-GB Welsh

da-DK Danish

de-DE German (Germany)

el-GR Greek

en-AU English (Australia)

en-CA English (Canada)

en-GB English (United Kingdom)

en-US English (United States)

en-ZA English (South Africa)

es-EM Spanish (International)

es-ES Spanish (Spain)

es-MX Spanish (Mexico)

es-XL Spanish (Latin America)

et-EE Estonian

eu-ES Basque

fa-IR Farsi

fi-FI Finnish

fr-BE French (Belgium)

fr-CA French (Canada)

fr-FR French (France)

he-IL Hebrew (Israel)

hr-HR Croatian

ht-HT Haitian

hu-HU Hungarian

id-ID Indonesian

is-IS Icelandic

it-IT Italian (Italy)

ja-JP Japanese

ko-KR Korean

lt-LT Lithuanian

lv-LV Latvian

mk-MK Macedonian

ms-MY Malay

mt-MT Maltese

nb-NO Norwegian (Bokmål)

nl-BE Dutch (Belgium)

nl-NL Dutch (Netherlands)

nn-NO Norwegian (Nynorsk)

no-NO Norwegian

pl-PL Polish

pt-BR Portuguese (Brazil)

pt-PT Portuguese (Portugal)

ro-RO Romanian

ru-RU Russian

sk-SK Slovak

sl-SI Slovene

sv-SE Swedish

th-TH Thai

tr-TR Turkish

uk-UA Ukrainian

vi-VN Vietnamese

zh-CN Chinese (PRC)

zh-HK Chinese (Hong Kong)

zh-TW Chinese (Taiwan)
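
The language-locale format described above can be checked mechanically. The following is a minimal sketch, assuming the codes always follow the pattern of two lowercase letters, a hyphen and two uppercase letters; the function names and the small SUPPORTED subset are illustrative only and are not part of the Data Cloud.

```python
import re

# Pattern assumed from the description above: a two-letter language code,
# a hyphen, and a two-letter region code (standard or non-standard, e.g. XL).
LOCALE_RE = re.compile(r"([a-z]{2})-([A-Z]{2})")

# Illustrative subset of the supported list, not the full 62 entries.
SUPPORTED = {"en-US", "en-GB", "fr-CA", "fr-FR", "es-XL", "de-DE"}


def parse_locale(code):
    """Split a language-locale such as 'fr-CA' into (language, region)."""
    match = LOCALE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a language-locale: {code!r}")
    return match.group(1), match.group(2)


def is_supported(code):
    """Return True if the code is in the (illustrative) supported subset."""
    return code in SUPPORTED
```

For example, parse_locale("fr-CA") returns the pair ("fr", "CA"), while a bare language name such as "French" raises a ValueError.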

on 05/30/2016

Currently there are 17 industry domains available in the Data Cloud:

  • Automotive Manufacturing
  • Consumer Electronics
  • Computer Software
  • Computer Hardware
  • Industrial Manufacturing
  • Telecommunications
  • Professional and Business Services
  • Stores and Retail Distribution
  • Industrial Electronics
  • Legal Services
  • Energy, Water and Utilities
  • Financials
  • Medical Equipment and Supplies
  • Healthcare
  • Pharmaceuticals and Biotechnology
  • Chemicals
  • Leisure, Tourism, and Arts
  • plus the Undefined Sector
on 10/08/2015

Currently there are 9 content types available in the Data Cloud:

  • Instructions for Use
  • Sales and Marketing Material
  • Policies, Process and Procedures
  • Software Strings and Documentation
  • News Announcements, Reports and Research
  • Patents
  • Standards, Statutes and Regulations
  • Financial Documentation
  • Support Content
  • plus an Undefined Content Type
on 10/08/2015

The Data Cloud supports the industry standard TMX 1.4b format.
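
For readers unfamiliar with the format, here is a minimal sketch of how a small bilingual TMX document could be produced with Python's standard library. The function name build_tmx and the header values (tool name, version, segment type and so on) are illustrative placeholders, not values required by the Data Cloud; version="1.4" is used on the assumption that this is how TMX 1.4b files identify themselves.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(src_lang, tgt_lang, pairs):
    """Return UTF-8 bytes of a bilingual TMX document for (source, target) pairs."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Header values below are placeholders chosen for illustration.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-tool",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "unknown",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        # One translation unit variant per language, source first.
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="utf-8", xml_declaration=True)
```

Calling build_tmx("en-US", "fr-FR", [("Hello", "Bonjour")]) yields a document with one translation unit holding two variants.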

on 10/08/2015

The pooling ratio depends on the subscription level: it is 1:1, 2:1, 3:1 or 5:1. That is, if you upload 1M source words, you earn credits to download 1M, 2M, 3M or 5M source words respectively, in any translation direction(s) and across industry domains.

The basic subscription and academic membership offer no upload option. Basic subscribers and academic members, as well as members and subscribers who have no data or not enough data to upload, can buy credits to download data.

Uploads and downloads can be in any translation direction and industry domain and do not have to correspond.
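
The pooling arithmetic can be sketched as follows. This is only an illustration of the ratios stated above; the function name credits_earned is hypothetical, and which ratio applies to which subscription level is not specified here.

```python
# Download-to-upload ratios mentioned above (5 means 5:1).
POOLING_RATIOS = (1, 2, 3, 5)


def credits_earned(uploaded_source_words, ratio):
    """Credits (one credit = one source word) earned for an upload."""
    if ratio not in POOLING_RATIOS:
        raise ValueError(f"unknown pooling ratio: {ratio}")
    return uploaded_source_words * ratio


# Uploading 1M source words at each ratio.
for ratio in POOLING_RATIOS:
    print(f"{ratio}:1 ->", credits_earned(1_000_000, ratio))
```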

on 10/08/2015

The process to upload translation data in the Data Cloud is the following: 

1. Prepare the file(s) to be uploaded: files should be bilingual TMX version 1.4b, in UTF-8 encoding and zipped.

2. Log in with your TAUS credentials, follow the Data Cloud link and click on Upload in the side menu.

3. Assign the attributes relevant to your data set.

5. Click on the Choose File button to select your file from its location.

6. Click on the Upload button. The system pre-processes your file and performs some quality checks prior to importing it to the database.

NOTE: The credits you earn with each upload are automatically credited into your account, allowing you to download the data of your selection.

7. Go to the Account History and click on the Uploads tab to see all uploads made by you and your organisation. By clicking on the "+" on the left side of every upload, you can view the metadata of the data set.

8. To save the original file you uploaded to the Data Cloud to your device, click on the "Download original" button on the right side of each data set.

NOTE: You can repeat this action as many times as you wish. All previous uploads from your organisation are stored in the cloud and you can view them and save them through your Account History to your device.
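
Step 1 (a zipped TMX file in UTF-8 encoding) can be prepared with a short script. The following is a minimal sketch using only the Python standard library; the function name zip_tmx is illustrative, and the script checks only that the file decodes as UTF-8, not that it is valid TMX 1.4b.

```python
import zipfile
from pathlib import Path


def zip_tmx(tmx_path, zip_path=None):
    """Zip a single TMX file after checking that it decodes as UTF-8."""
    tmx_path = Path(tmx_path)
    # Raises UnicodeDecodeError early if the file is not valid UTF-8.
    tmx_path.read_bytes().decode("utf-8")
    # By default, write the archive next to the TMX file.
    zip_path = Path(zip_path) if zip_path else tmx_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store the file under its bare name, without directory components.
        archive.write(tmx_path, arcname=tmx_path.name)
    return zip_path
```

For example, zip_tmx("en-US_fr-FR.tmx") would write en-US_fr-FR.zip to the same directory (the file name here is made up).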

on 02/02/2016

The process to download (pool) data sets from the Data Cloud is the following:

1. Log in with your TAUS credentials, follow the Data Cloud and Services link and click on Download.

2. Select the dataset you are interested in by selecting the translation direction, industry, etc. on the left-side column.

3. Every time you select a feature you see the volume of the available words and of the already downloaded words as well as the required credits for this download, your current credits and your balance if you perform this download. You can also browse the data sets available in the Data Cloud based on your selection, together with their metadata and view a sample of each data set.

4. If you decide to download the data sets with the selected features, click on the Export button. A window will pop up asking you to confirm and also showing the segment count of your selection.

5. Click on "OK" if you want to proceed with the export.

6. Go to the Account History and click on the Downloads button to see all the downloads of your organisation. By clicking on the "+" on the left side of every download, you can view the metadata of the data set.

7. To save a data set you downloaded from the Data Cloud to your device, click on the Download button on the right side of each data set.

You can repeat action 7 as many times as you wish. All (previous) downloads from your organisation are saved in the cloud; you can view them and re-download them through your Account History to your device for free. The downloaded data cannot be distributed to others according to the TAUS Terms of Use.

Currently, partial downloads (i.e. downloading part of a dataset) are not possible.

 

on 02/02/2016

Partial downloads are currently not supported by the Data Cloud. This means that you need as many credits as there are (source) words in the data set(s) you select for download. For example, you need at least 1M available credits in your account in order to download data set(s) containing 1M words.

 

 

on 10/14/2016

The Account History keeps track of all your organisation's uploads and downloads since it joined TAUS as a member or subscriber. While viewing the list of uploads or downloads, you can click on the "+" on the left side of every upload/download to see the metadata of the data set. By clicking on the Download Original or Download button on the right side of each uploaded or downloaded file respectively, you can save it to your device at any time. You can repeat this action as many times as you wish.

on 02/02/2016

You can use your acquired (earned or purchased) credits for one or multiple downloads and in one or more translation directions, industry domains and other features.

You can use your credits whenever you choose. Currently there is no time limit on using your credits. A credit is equivalent to a source word.

Example: 

1. You have earned credits to download 1M (source) words. So your account is credited with 1M credits.

2. You download a data set in en_US-fr_FR and the features of your choice of 230K words. 

3. You now have 770K remaining credits in your account.

4. You then download another data set in en_US-ja_JP and the features of your choice of 70K words.

5. You now have 700K remaining credits in your account.

6. You can use your credits to download any data sets as long as the number of your credits is equal to or higher than the number of credits required for your selection.
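
The example above amounts to simple bookkeeping, sketched below. The function name download is hypothetical; it only models the rule that a download needs at least as many credits as the selection has source words.

```python
def download(balance, required_credits):
    """Return the remaining balance after a download, or fail if credits are short."""
    if required_credits > balance:
        raise ValueError("not enough credits for this selection")
    return balance - required_credits


balance = 1_000_000                   # step 1: 1M credits earned
balance = download(balance, 230_000)  # step 2: en_US-fr_FR, 230K words
after_first = balance                 # step 3: 770K credits remain
balance = download(balance, 70_000)   # step 4: en_US-ja_JP, 70K words
print(after_first, balance)           # step 5: 700K credits remain
```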

on 02/04/2016

When a file gets uploaded, a number of processing steps are followed (such as indexing) before it's available for search.

When a file gets uploaded, it can be immediately downloaded.

on 02/11/2016

This happens because the language code in your file differs from the language codes used in the Data Cloud, so the system asks you to disambiguate by choosing from the options in a pop-up menu.

For example: if your file has "English" as source language and "French" as target language, the system will ask you to further specify the "language locale" of both the source and the target, that is, to select the English variant from Australia, Canada, South Africa, United Kingdom and United States, and the French variant from Belgium, Canada and France.

Once you have made the selection, press the button Set. The upload will then start.

on 02/09/2016

The TAUS Data Cloud API documentation describes how to integrate the Data Cloud services into your own technology and leverage the TAUS Data Cloud from it.

In order to use the Data API you need a basic subscription, which is free.

If you would like to register your application on the Data Cloud, please write to data@taus.net to request an app key. An app key identifies your application to the Data Cloud and helps TAUS diagnose any problems you may have. Instructions on how to pass the app key with the API calls can be found here - APP Keys

on 02/24/2016

Each Data Cloud session lasts 15 minutes. After that you need to log in again with your credentials.

If you are performing an Upload or a Download when the session ends, the process will continue in the back-end.

on 10/05/2016

Direct translation data is uni-directional and Matrix is bi-directional. More specifically:

Directly uploaded data is regarded as uni-directional in the Data Cloud. The data sets are uploaded with a specific source and target language and this metadata is preserved in the Data Cloud. For example, when an English (US) to Italian data set is uploaded, the segments are stored as EN (US) -> IT and not the other way round. Therefore this data set can be downloaded only in this translation direction. In the Search UI you may get reversed translations in the search results; these are tagged as such (with a blue arrow).

Matrix data, i.e. translation data generated through a pivot language, is regarded as bi-directional. However, once an organisation has downloaded one direction of a matrix data set, the other direction is no longer available for download by the same organisation, since this would be the same data set.

on 04/05/2016

Since the release of the free TAUS Academic Membership Program in December 2014, TAUS Data Cloud offers industry-shared translation memory data to academic staff and students of all universities in the world, in order for them to experiment with the development of MT engines, research, or develop new ideas.

More specifically, TAUS academic members are able to download up to 40 billion words in multiple language directions and industry domains from the data uploaded up to January 2012 in the Data Cloud.

on 08/23/2016

These are the basic criteria for data uploads to the Data Cloud:

  • Copyright-owned data (e.g. do not upload on-behalf-of other party's data or publicly available data)
  • Correct source & target language, as specified in the Upload UI
  • Correct metadata for the data set in terms of features (i.e. industry domain). If the level of granularity is not available, please use a broader category.
  • Correctly aligned data - please make sure that the data sets do not include:
    • Misaligned segments (i.e. the target is not a translation of the source because segments were linked in the wrong position during the alignment process)
    • Omissions (i.e. missing parts in source or target)



Data quality of the Data Cloud uploads is currently monitored as described in https://www.taus.net/faq/187-data-cloud/61-does-the-data-cloud-monitor-data-quality. More automatic data quality filters will be included in future Data Cloud releases.

Users may experience removal of credits earned by uploading data that does not meet the above basic quality criteria.

on 03/14/2017

When a file is sent for upload to the Data Cloud, the system performs a number of quality checks which include:

On file level:

An error message appears if one of the above is not valid. When the issues are fixed by the user, the file can be sent for upload again.

On TMX syntax:

  • The source language declared in each translation unit <tu> must be the same as the source language declared for the file (detected from the srclang attribute of <header> and of <tu>)
  • Each translation unit variant <tuv> must declare its xml:lang attribute
  • There must be only one source translation unit variant <tuv> with the declared source language

An error message appears if one of the above is not valid. When the issues are fixed by the user, the file can be sent for upload again.
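
The three TMX syntax rules above can be approximated locally before uploading. The following is a minimal sketch using Python's standard XML parser; it is not the Data Cloud's own validator, the function name check_tmx_syntax is illustrative, and it assumes that language codes are compared exactly (case-sensitively) and that a <tu> without its own srclang inherits the header value.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_tmx_syntax(tmx_bytes):
    """Return a list of human-readable problems found in a TMX document."""
    root = ET.fromstring(tmx_bytes)
    header = root.find("header")
    file_src = header.get("srclang") if header is not None else None
    if file_src is None:
        return ["header does not declare srclang"]
    errors = []
    for index, tu in enumerate(root.iter("tu"), start=1):
        # Rule 1: a <tu>-level srclang must agree with the header.
        if tu.get("srclang", file_src) != file_src:
            errors.append(f"tu {index}: srclang differs from header")
        langs = [tuv.get(XML_LANG) for tuv in tu.findall("tuv")]
        # Rule 2: every <tuv> must declare xml:lang.
        if None in langs:
            errors.append(f"tu {index}: tuv without xml:lang")
        # Rule 3: exactly one <tuv> in the declared source language.
        if langs.count(file_src) != 1:
            errors.append(f"tu {index}: expected exactly one source tuv")
    return errors
```

An empty list means none of the three problems was detected; it does not guarantee that the upload will be accepted.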

On segment level:

  • Individual bilingual segments are filtered out if:
    • Source and/or target content is empty
    • Source and target content is identical
    • They already exist in the specific translation direction in the Data Cloud
    • Source and/or target length exceeds 4096 characters 

The filtered out segments can be found in the error report in the Account History UI/Uploads, next to each uploaded file.
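
The segment-level filters listed above can be sketched as a simple pass over (source, target) pairs. This is an illustration only: the function name filter_segments is hypothetical, the existing set stands in for segments already present in the Data Cloud for that translation direction, and treating whitespace-only content as empty is an assumption.

```python
MAX_LENGTH = 4096  # maximum characters per source or target segment


def filter_segments(pairs, existing=frozenset()):
    """Split (source, target) pairs into kept pairs and rejected pairs with a reason."""
    kept, rejected = [], []
    for source, target in pairs:
        if not source.strip() or not target.strip():
            reason = "empty"  # assumption: whitespace-only counts as empty
        elif source == target:
            reason = "identical"
        elif (source, target) in existing:
            reason = "duplicate"
        elif len(source) > MAX_LENGTH or len(target) > MAX_LENGTH:
            reason = "too long"
        else:
            reason = None
        if reason is None:
            kept.append((source, target))
        else:
            rejected.append((source, target, reason))
    return kept, rejected
```

The rejected list plays the role of the error report mentioned above: each entry records the segment pair and why it was filtered out.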

Normalization also takes place for example in terms of conversion of XML entities to Unicode characters, removal of tags in brackets and of characters of several Unicode categories, etc.

More automatic data quality filters will be added in future Data Cloud releases.



on 09/08/2016

Matrix is a function in the Data Cloud that increases the available translation directions and data volume by generating new translation links on the fly through a common pivot language, under the conditions that no equivalent directly linked segments exist and that the segments share exactly the same features in the Data Cloud.

Example:

For bilingual segments from

  • de-DE to fr-FR and
  • de-DE to es-ES

where the de-DE segments are identical and share the same features with their translations in both target languages (i.e. industry, content type and data owner/provider), and when no direct translations fr-FR to/from es-ES exist in the Data Cloud, such translations are generated by the Matrix function through the pivot language de-DE.

With any new uploads, Matrix translations are re-generated accordingly.

 

Matrix also increases the number of results by acting as a fall back search, if direct parallel data is not available.
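
The pivoting idea can be sketched as a join on the pivot segment and its features. This is a simplified illustration under stated assumptions, not the actual Matrix implementation: the function name pivot and the tuple layout are hypothetical, and a direct translation is assumed to block pivoting in either direction.

```python
from collections import defaultdict


def pivot(pivot_to_a, pivot_to_b, direct_pairs=frozenset()):
    """Link A and B segments that translate the same pivot segment.

    pivot_to_a, pivot_to_b: lists of (pivot_text, features, translation),
    where features is a hashable tuple such as
    (industry, content type, data owner).
    direct_pairs: set of (a_text, b_text) pairs that already exist directly.
    """
    by_key = defaultdict(list)
    for pivot_text, features, a_text in pivot_to_a:
        by_key[(pivot_text, features)].append(a_text)

    generated = []
    for pivot_text, features, b_text in pivot_to_b:
        # Only identical pivot segments with identical features are joined.
        for a_text in by_key.get((pivot_text, features), []):
            already_direct = ((a_text, b_text) in direct_pairs
                              or (b_text, a_text) in direct_pairs)
            if not already_direct:
                generated.append((a_text, b_text, features))
    return generated
```

With de-DE as the pivot, passing the de-DE to fr-FR segments as pivot_to_a and the de-DE to es-ES segments as pivot_to_b yields fr-FR/es-ES pairs only where the German segment and all features match.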



on 09/29/2016
No. You may upload only directly translated data, not pivoted translations.

Pivoted or matrix data is translation data that is generated via a common pivot language - the TAUS platform produces such pivoted data with the process described in FAQ https://www.taus.net/faq/187-data-cloud/62-what-is-the-matrix-function

As the Data Cloud generates pivoted or matrix data automatically, please only upload directly translated data.

Users that upload pivoted translation data may experience removal of credits earned by the upload of such data.
on 03/14/2017
You can only upload translation data for which you have copyright ownership.

Users that upload non-owned translation data may experience removal of credits earned by the upload of such data.
on 03/14/2017