Data Cloud

The TAUS Data Cloud is a neutral and secure repository platform for sharing, pooling and leveraging translation data.

It contains tens of billions of words in multiple translation directions, spanning 17 well-defined industry domains and 9 content types, and drawn from multiple data sources.

It is a super cloud for the global translation community, supporting translation automation efforts, improving translation quality and fueling business innovation.

on 10/08/2015

The TAUS Data Cloud offers the following data features:

Search: users can search for translations of strings (words, phrases, segments) in different translation directions and industry domains.

Upload: users can upload translation data in TMX format. For more information, see FAQ: Which are the steps to upload data sets to the Data Cloud?

Discover & Download: users can discover, browse, view samples and download data sets (in TMX format) with the selected features. For more information see Which are the steps to discover data sets in the Data Cloud? & Which are the steps to download data sets from the Data Cloud?

Users can also provide Feedback on data quality, keep track of their organization's uploads and downloads through the Account History, and re-access the files they have uploaded or downloaded. Through My Data, they can search, activate or archive the segments they have personally uploaded.

Access to features depends on the membership plan.

 

on 10/08/2015

The Data Cloud contains translation data of tens of billions of words in multiple translation directions across 17 well-defined industry domains and 9 content types. 

More specifically, the Data Cloud translation data sets are:

  • Industry-shared TMs contributed by TAUS members from the industry sector, i.e. providers, buyers and LSPs
  • Individual TMs contributed by TAUS members, i.e. freelance translators and professionals in language technology
  • Public data made available to the community as such by large international and intergovernmental organisations
    • Public data owners include: European Union institutions, United Nations, etc.
  • Parallel data sets generated, processed, curated and/or improved by companies or academic institutions from
    • Existing available public parallel data sets
    • Multilingual public sector websites
    • Multilingual websites that allow free use and reuse of their content

Acknowledgements: Special thanks to all organisations and individuals that create, generate, process, curate and make data available to the language and machine translation community for use in their projects.

 

on 10/08/2015

There are different use scenarios for the rich Data Cloud resources that cover multiple translation directions in a diversity of domains:

  • Training of MT engines, especially when users have no or not enough training data for their projects.
  • Evaluation and improvement of MT output quality, especially with data relevant to users' projects. See KantanMT's use case as an example.
  • Terminology extraction from the industry-shared TMs contributed by TAUS members, that is, automatically extracting terms relevant to your projects from the downloaded Data Cloud data sets with your own terminology extraction tools and methods, open-source tools or commercial tools (see for example Term Extraction Tools).
  • Research purposes in the translation technology field. See also relevant FAQ: Which are the Data Cloud benefits for TAUS academic members.
  • Etc.
on 10/08/2015

The Data Cloud currently supports 62 language-locales (i.e. regional languages). Languages are represented by ISO 639-1 two-letter language codes. Regions are represented by ISO 3166-1 alpha-2 two-letter country codes (or, in a few cases, by non-standard codes such as XL for Latin America). A language-locale is written as a language, followed by a dash ("-"), followed by a region, for example fr-CA for Canadian French.

The current list of supported languages is the following:

af-ZA Afrikaans

ar-AE Arabic (U.A.E.)

ar-AR Arabic

ar-EG Arabic (Egypt)

ar-SA Arabic (Saudi Arabia)

be-BY Belarusian

bg-BG Bulgarian

cs-CZ Czech

cy-GB Welsh

da-DK Danish

de-DE German (Germany)

el-GR Greek

en-AU English (Australia)

en-CA English (Canada)

en-GB English (United Kingdom)

en-US English (United States)

en-ZA English (South Africa)

es-EM Spanish (International)

es-ES Spanish (Spain)

es-MX Spanish (Mexico)

es-XL Spanish (Latin America)

et-EE Estonian

eu-ES Basque

fa-IR Farsi

fi-FI Finnish

fr-BE French (Belgium)

fr-CA French (Canada)

fr-FR French (France)

he-IL Hebrew (Israel)

hr-HR Croatian

ht-HT Haitian

hu-HU Hungarian

id-ID Indonesian

is-IS Icelandic

it-IT Italian (Italy)

ja-JP Japanese

ko-KR Korean

lt-LT Lithuanian

lv-LV Latvian

mk-MK Macedonian

ms-MY Malay

mt-MT Maltese

nb-NO Norwegian (Bokmål)

nl-BE Dutch (Belgium)

nl-NL Dutch (Netherlands)

nn-NO Norwegian (Nynorsk)

no-NO Norwegian

pl-PL Polish

pt-BR Portuguese (Brazil)

pt-PT Portuguese (Portugal)

ro-RO Romanian

ru-RU Russian

sk-SK Slovak

sl-SI Slovene

sv-SE Swedish

th-TH Thai

tr-TR Turkish

uk-UA Ukrainian

vi-VN Vietnamese

zh-CN Chinese (PRC)

zh-HK Chinese (Hong Kong)

zh-TW Chinese (Taiwan)
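The language-locale format described above can be checked mechanically. The following is a minimal sketch, not part of the Data Cloud itself: the function names and the SUPPORTED_SAMPLE set are illustrative assumptions, and the set holds only a few of the 62 codes listed above.

```python
import re

# Two lowercase letters, a dash, two uppercase letters (e.g. fr-CA).
LOCALE_RE = re.compile(r"([a-z]{2})-([A-Z]{2})")

# Hypothetical sample: a few entries copied from the list above.
# A real check would hold all 62 supported codes.
SUPPORTED_SAMPLE = {"en-US", "en-GB", "fr-CA", "fr-FR", "es-XL", "de-DE"}


def parse_locale(code: str) -> tuple[str, str]:
    """Split a language-locale into (language, region) or raise ValueError."""
    match = LOCALE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a language-locale: {code!r}")
    return match.group(1), match.group(2)


def is_supported(code: str) -> bool:
    """True if the code is well-formed and present in the sample set."""
    try:
        parse_locale(code)
    except ValueError:
        return False
    return code in SUPPORTED_SAMPLE
```

Note that a purely pattern-based check accepts the non-standard region codes (such as XL) as long as they appear in the supported set.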

on 05/30/2016

Currently there are 17 industry domains available in the Data Cloud:

  • Automotive Manufacturing
  • Consumer Electronics
  • Computer Software
  • Computer Hardware
  • Industrial Manufacturing
  • Telecommunications
  • Professional and Business Services
  • Stores and Retail Distribution
  • Industrial Electronics
  • Legal Services
  • Energy, Water and Utilities
  • Financials
  • Medical Equipment and Supplies
  • Healthcare
  • Pharmaceuticals and Biotechnology
  • Chemicals
  • Leisure, Tourism, and Arts
  • plus the Undefined Sector
on 10/08/2015

Currently there are 9 content types available in the Data Cloud:

  • Instructions for Use
  • Sales and Marketing Material
  • Policies, Process and Procedures
  • Software Strings and Documentation
  • News Announcements, Reports and Research
  • Patents
  • Standards, Statutes and Regulations
  • Financial Documentation
  • Support Content
  • plus an Undefined Content Type
on 10/08/2015

The Data Cloud supports the industry standard TMX 1.4b format.

on 10/08/2015

The pooling ratio depends on the membership plan, i.e. the pooling ratio can be 1:1 or 1:x. This means that when users upload 1M source words, they earn credits to download 1M or xM source words respectively, in any translation direction(s) and across industry domains.

Uploads and downloads can be in any translation direction and industry domain and do not have to correspond.
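As a rough illustration of the pooling ratio, the arithmetic can be sketched as follows. The function name is hypothetical, and the sketch assumes that x in a 1:x plan is a whole number, which the text above does not state.

```python
def credits_earned(uploaded_source_words: int, pooling_ratio: int = 1) -> int:
    """Credits earned for an upload under a 1:pooling_ratio plan.

    Assumption: the ratio multiplier is a positive whole number.
    """
    if uploaded_source_words < 0:
        raise ValueError("uploaded_source_words must not be negative")
    if pooling_ratio < 1:
        raise ValueError("pooling_ratio must be at least 1")
    return uploaded_source_words * pooling_ratio
```

For example, credits_earned(1_000_000) corresponds to the 1:1 case, and credits_earned(1_000_000, 3) to a hypothetical 1:3 plan.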

on 10/08/2015

The process to upload translation data sets in the Data Cloud is the following: 

  • Prepare the file(s) to be uploaded: files should be bilingual TMX version 1.4b, in UTF-8 encoding and zipped.
  • Log in with your TAUS credentials, go to the Data Cloud and click on Upload in the side menu.
  • Assign the relevant attributes to your data set.
  • Click on Choose File button to select your file from its location.
  • Click on the Upload button. The system pre-processes your file and performs some quality checks prior to importing it to the database.
  • The credits you earn with each upload are automatically credited into your account, allowing you to download the data of your selection.
  • Go to the Account History and click on the Uploads tab to see all your and your organisation's uploads. By clicking on the "+" on the left side of every upload, you can view the metadata of the data set.
  • To save the original file you uploaded to the Data Cloud to your device, click on the "Download original" button on the right side of each data set.
    • You can repeat this action as many times as you wish.
    • All previous uploads from your organisation are stored in the cloud and you can view them and save them through your Account History to your device.
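The file preparation in the first step above (UTF-8 encoding, zipped) can be done with any tool. As a minimal sketch using only the Python standard library, with a hypothetical helper name:

```python
import zipfile
from pathlib import Path


def zip_tmx_for_upload(tmx_path: str) -> Path:
    """Check that a TMX file decodes as UTF-8, then write a .zip next to it."""
    source = Path(tmx_path)
    # Raises UnicodeDecodeError if the file content is not valid UTF-8.
    source.read_bytes().decode("utf-8")
    archive = source.with_suffix(".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store the file under its bare name, without directory components.
        zf.write(source, arcname=source.name)
    return archive
```

This sketch only covers encoding and zipping; it does not verify that the file is bilingual or valid TMX 1.4b.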
on 02/02/2016

The process to discover translation data sets relevant to your projects in the Data Cloud is the following:

  • Log in with your TAUS credentials to access the Discover & Download page
  • Make your attribute selection on the left-side column of:
    • translation direction (obligatory)
    • industry domain, content type and data owner (optional)
  • Every time you select an attribute, you get a list of the data sets available in the Data Cloud for your current selection, together with all their associated metadata.
  • You can view the volume of:
    • the available words
    • the already downloaded words
    • the required credits for this download
    • your current credits
    • your balance if you perform the specific download
  • You can browse the list of data sets available for your current selection to find the ones relevant to your projects.
  • You can view random samples of 100 segments each of the directly uploaded data sets for a first-hand assessment of the data quality.


on 08/30/2017

The process to download translation data sets from the Data Cloud is the following:

  • Log in with your TAUS credentials to access the Discover & Download page
  • Make your attribute selection on the left-side column of:
    • Translation direction (obligatory)
    • Industry domain, content type and data owner (optional)
  • Every time you select an attribute you can view the volume of:
    • The available words
    • The already downloaded words
    • The required credits for this download
    • Your current credits
    • Your balance if you perform the specific download
  • You can narrow your selection, and therefore possibly the number of retrieved data sets, by selecting values for all attributes.
  • If you decide to download the retrieved (on the basis of your selection) data sets, click on the Export button.
    • A window will pop up asking you to confirm the export
    • Click on "OK" if you want to proceed with the export.
  • Go to the Account History and click on the Downloads button to see all the downloads of your organisation.
    • By clicking on the "+" on the left side of every download, you can view the metadata of the data set.
    • To save a data set you downloaded from the Data Cloud to your device, click on the Download button on the right side of each data set.
    • You can repeat this action as many times as you wish for free.

The downloaded data cannot be distributed to others according to the TAUS Terms of Use.

Currently, partial downloads (i.e. downloading part of a dataset) are not possible. 

on 02/02/2016

Partial downloads are currently not supported by the Data Cloud. This means that you need at least as many credits as there are (source) words in the data set(s) you select for download. For example, you need at least 1M available credits in your account in order to download data set(s) containing 1M words.

 

 

on 10/14/2016

The Account History keeps track of all your organisation's uploads and downloads since it joined TAUS as a member or subscriber.

While viewing the list of uploads or downloads, you can click on "+" on the left side of every upload/download, to see the metadata of the dataset.

By clicking on the Download Original or Download button on the right side of each uploaded or downloaded file respectively, you can save the file to your device at any time. You can repeat this action as many times as you wish.

on 02/02/2016

Through My Data you can search in all the data sets that you as a user of your organization have uploaded. You can also activate or archive segments (results of your search).

You cannot search in the data sets that other users of your organization have uploaded. This is mainly because of the archive and activate function, that is, you can only archive or activate data that you as a user have previously uploaded. 

on 09/19/2017

There are 3 ways to acquire credits to download data sets from the Data Cloud:

  • Bonus credits
    • You are awarded credits by becoming a TAUS member or by renewing your TAUS membership. The amount of credits depends on the TAUS membership plan. 
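  • Earn credits
    • You earn credits by uploading your translation data sets. The amount of credits depends on the pooling ratio of your membership plan.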
  • Purchase credits
    • If you are a TAUS member but have no or not enough data to share
    • If you are not a TAUS member and therefore have no access to the Upload page.

Sharing your translation data sets can be beneficial for both you and the community because:

  • You earn credits to download valuable industry-shared translation data sets.
  • Your data contributions may trigger more data contributions and downloads, and therefore fuel translation automation efforts.
on 08/31/2017

You can use your acquired (earned or purchased) credits for one or multiple downloads and in one or more translation directions, industry domains and other features.

You can use your credits whenever you choose to. Currently there is no time limit of when to use your credits. A credit is equivalent to a source word.

Example: 

1. You have earned credits to download 1M (source) words. So your account is credited with 1M credits.

2. You download a data set in en_US-fr_FR and the features of your choice of 230K words. 

3. You now have 770K remaining credits in your account.

4. You then download another data set in en_US-ja_JP, with the features of your choice, of 70K words.

5. You now have 700K remaining credits in your account.

6. You can use your credits to download any data sets as long as the number of your credits is equal to or higher than the number of credits required for your selection.
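The example above can be replayed as a small bookkeeping sketch. The CreditAccount class is purely illustrative and not part of any TAUS API; it only encodes the rule that one credit equals one source word and that a download needs the full number of credits.

```python
class CreditAccount:
    """Illustrative credit balance: one credit equals one source word."""

    def __init__(self, credits: int) -> None:
        self.credits = credits

    def download(self, source_words: int) -> int:
        """Deduct the credits for a data set and return the new balance."""
        if source_words > self.credits:
            # Partial downloads are not possible, so the full amount is needed.
            raise ValueError("not enough credits for this download")
        self.credits -= source_words
        return self.credits


account = CreditAccount(1_000_000)        # step 1: 1M credits
after_first = account.download(230_000)   # step 2: en_US-fr_FR, 230K words
after_second = account.download(70_000)   # step 4: en_US-ja_JP, 70K words
```

After these calls, after_first holds the balance from step 3 and after_second the balance from step 5.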

on 02/04/2016

When a file gets uploaded, a number of processing steps are followed (such as indexing) before it's available for search.

When a file gets uploaded, it can be immediately downloaded.

on 02/11/2016

This happens because the language code in your file is different from the language code in the Data Cloud, so the system asks for disambiguation through a pop-up menu with the available options.

For example, if your file has "English" as source language and "French" as target language, the system will ask you to further specify the "language locale" of both the source and the target, that is, to select the English variant from Australia, Canada, South Africa, United Kingdom and United States, and the French variant from Belgium, Canada and France.

Once you have made the selection, press the button Set. The upload will then start.
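The options offered in the pop-up correspond to the supported language-locales that share the language of your file. A minimal sketch of that lookup follows; the function name is hypothetical and the set below contains only the English and French codes from the supported-languages list.

```python
# English and French language-locales taken from the supported-languages list.
SUPPORTED_EN_FR = {
    "en-AU", "en-CA", "en-GB", "en-US", "en-ZA",
    "fr-BE", "fr-CA", "fr-FR",
}


def candidate_locales(language: str, supported=SUPPORTED_EN_FR) -> list[str]:
    """Return the supported language-locales for a bare two-letter language."""
    wanted = language.lower()
    return sorted(code for code in supported if code.split("-")[0] == wanted)
```

Declaring the full language-locale (for example en-US rather than en) in the TMX file should make this disambiguation step unnecessary, assuming the declared code is one of the supported ones.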

on 02/09/2016

You can find documentation in TAUS Data Cloud API on how to integrate the Data Cloud services into your own technology and leverage the TAUS Data Cloud from it. 

To use the Data API you need a basic subscription, which is free.

To register your application on Data Cloud, write to data@taus.net to request an app key. An app key identifies your application to Data Cloud and helps TAUS diagnose any problems you may have. Instructions on how to pass the app key with the API calls can be found under APP Keys.

on 02/24/2016

Each Data Cloud session lasts 15 minutes. After that you need to log in again with your credentials.

If you are performing an Upload or a Download, the process will continue in the back-end.

on 10/05/2016

Direct translation data is uni-directional and Matrix is bi-directional. More specifically:

Directly uploaded data is regarded as uni-directional in the Data Cloud. The data sets are uploaded with a specific source and target language, and this metadata is preserved in the Data Cloud. For example, when an English (US) to Italian data set is uploaded, the segments are stored as EN (US) -> IT and not the other way round. Therefore only this translation direction can be downloaded for this data set. In the Search UI you may get reversed translations in the search results; these are tagged as such (with a blue arrow).

Matrix data, i.e. translation data generated through a pivot language, is regarded as bi-directional. However, once an organisation has downloaded one direction of a matrix data set, the other direction is no longer available for download by the same organisation, since it would be the same data set.

on 04/05/2016

Since the release of the free TAUS Academic Membership Program in December 2014, TAUS Data Cloud offers industry-shared translation memory data to academic staff and students of all universities in the world, in order for them to experiment with the development of MT engines, research, or develop new ideas.

More specifically, TAUS academic members are able to download up to 40 billion words in multiple language directions and industry domains from the data uploaded up to January 2012 in the Data Cloud.

on 08/23/2016

These are the basic criteria for data uploads to the Data Cloud:

  • Upload copyright-owned data
    • Do not upload data on behalf of another party, and do not upload publicly available data
  • Make sure that the files to be uploaded do not include:
  • Make sure that the files to be uploaded are bilingual and have the correct TMX format (1.4b) and encoding (UTF-8)
  • Select the correct metadata for the files to be uploaded from the Upload page. If the level of granularity is not available, please use an existing broader category
Data quality of the Data Cloud uploads is currently monitored as described in the FAQ: How is data quality monitored in the Data Cloud?. More automatic data quality filters may be included in future Data Cloud releases.

Users may experience removal of credits earned by uploading data that does not meet the above basic quality criteria.
on 03/14/2017

When a file is sent for upload to the Data Cloud, the system performs a number of quality checks, which include:

On file level:

  • The exact same file must not have been uploaded to the Data Cloud before
  • The file must be in a valid TMX 1.4b format plus: 
    • A zipped file (.zip)
    • UTF-8 encoded
    • Bilingual
  • Language and country identifiers must be specified as RFC 3066 language identifiers, see https://www.ietf.org/rfc/rfc3066.txt
  • Language identifiers must be readable in TMX file
  • Language and country must be specified and be one of the supported Data Cloud languages, see relevant FAQ: Which languages are currently supported by the Data Cloud?

An error message appears if one of the above is not valid. When the issues are fixed by the user, the file can be sent for upload again.

On TMX syntax:

  • The source language declared in each translation unit <tu> must be the same as the source language declared for the file (detected from the srclang attribute of <header> and of <tu>)
  • Each translation unit variant <tuv> must declare its xml:lang attribute
  • There must be only one source translation unit variant <tuv> with the declared source language

An error message appears if one of the above is not valid. When the issues are fixed by the user, the file can be sent for upload again.
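The three TMX syntax rules above can be approximated with the Python standard library. This is a sketch under stated assumptions, not the Data Cloud's actual validator: the function name is hypothetical, language codes are compared case-sensitively for simplicity, and a <tu> without its own srclang is assumed to inherit the header value.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_tmx_syntax(tmx_text: str) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    root = ET.fromstring(tmx_text)
    header = root.find("header")
    file_srclang = header.get("srclang") if header is not None else None
    if file_srclang is None:
        return ["header does not declare srclang"]

    problems = []
    for index, tu in enumerate(root.iter("tu"), start=1):
        # Assumption: a <tu> without srclang inherits the header value.
        if tu.get("srclang", file_srclang) != file_srclang:
            problems.append(f"tu {index}: srclang differs from header")
        langs = [tuv.get(XML_LANG) for tuv in tu.findall("tuv")]
        if None in langs:
            problems.append(f"tu {index}: tuv without xml:lang")
        if langs.count(file_srclang) != 1:
            problems.append(f"tu {index}: expected exactly one source tuv")
    return problems
```

The function expects the TMX content as a string without an XML encoding declaration; when reading from disk, ET.parse on the file path would be the more natural entry point.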

On segment level:

  • Individual bilingual segments are filtered out if:
    • Source and/or target content is empty
    • Source and target content is identical
    • They already exist in the specific translation direction in the Data Cloud
    • Source and/or target length exceeds 4096 characters 

The filtered out segments can be found in the error report in Account History (Uploads button), next to each uploaded file.
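The segment-level filters above can be sketched as follows. The function and reason labels are illustrative. Two points are assumptions not stated in the text: whitespace-only content is treated as empty, and a pair repeated within the same file is treated like a pair that already exists in the Data Cloud.

```python
MAX_LENGTH = 4096  # maximum source or target length in characters


def filter_segments(pairs, existing=frozenset()):
    """Split (source, target) pairs into kept pairs and rejected pairs.

    `existing` stands in for the pairs already stored for this
    translation direction. Rejected entries carry a reason label.
    """
    kept, rejected = [], []
    seen = set(existing)
    for source, target in pairs:
        if not source.strip() or not target.strip():
            reason = "empty"
        elif source == target:
            reason = "identical"
        elif (source, target) in seen:
            reason = "duplicate"
        elif len(source) > MAX_LENGTH or len(target) > MAX_LENGTH:
            reason = "too long"
        else:
            reason = None

        if reason is None:
            kept.append((source, target))
            seen.add((source, target))
        else:
            rejected.append((source, target, reason))
    return kept, rejected
```

The rejected list plays the role of the error report mentioned above.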

Normalization includes:

  • Conversion of XML entities to Unicode characters
  • Cleaning of the 5 predefined XML/TMX entity references: &lt; &gt; &quot; &apos; &amp;
  • Cleaning of all HTML predefined character entities
  • Cleaning of XML/TMX content markup and inline elements (like <b>, </b>, <i>, </i>, <ph>, </ph>)

More automatic data quality filters may be added in future Data Cloud releases.
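To give an idea of what such normalization can look like, here is a minimal sketch using the Python standard library. It is not the Data Cloud's implementation: the function name is hypothetical, and it assumes that inline tags are simply removed while any text between them is kept.

```python
import html
import re

# Matches an opening or closing inline tag such as <b>, </b> or <ph>.
TAG_RE = re.compile(r"<[^>]+>")


def normalize_segment(text: str) -> str:
    """Strip inline tags, then convert entity references to characters."""
    # Tags are removed first so that escaped markup such as &lt;b&gt;
    # survives as literal text instead of being stripped as a tag.
    without_tags = TAG_RE.sub("", text)
    return html.unescape(without_tags)
```

html.unescape handles the five predefined XML entity references as well as the named HTML character entities.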



on 09/08/2016

Matrix is a function in the Data Cloud that increases the available translation directions and data volume by generating new translation links on the fly through a common pivot language, under the conditions that no equivalent directly linked segments exist and that the segments share exactly the same features in the Data Cloud.

Example:

For bilingual segments from

  • de-DE to fr-FR and
  • de-DE to es-ES

where the de-DE segments are identical and share the same features (industry, content type and data owner/provider) with their translations in both target languages, and when no direct translations between fr-FR and es-ES exist in the Data Cloud, such translations are generated by the Matrix function through the pivot language de-DE.

With any new uploads, Matrix translations are re-generated accordingly.

Matrix also increases the number of results by acting as a fall back search, if direct parallel data is not available.
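The pivoting idea can be sketched in a few lines. This is an illustration of the principle only, not the Matrix implementation: the function name and data layout are hypothetical, features are modelled as a tuple of (industry, content type, data owner), and "no direct translation exists" is interpreted as the generated pair not being present in a set of direct pairs.

```python
from collections import defaultdict


def pivot(pivot_to_a, pivot_to_b, direct_a_b=frozenset()):
    """Link A and B segments that share a pivot segment and features.

    pivot_to_a / pivot_to_b: iterables of (pivot_text, features, text).
    direct_a_b: set of (a_text, b_text) pairs that already exist directly.
    Returns a list of generated (a_text, b_text, features) tuples.
    """
    b_by_key = defaultdict(list)
    for pivot_text, features, b_text in pivot_to_b:
        b_by_key[(pivot_text, features)].append(b_text)

    generated = []
    for pivot_text, features, a_text in pivot_to_a:
        # Only identical pivot segments with identical features are linked.
        for b_text in b_by_key.get((pivot_text, features), []):
            if (a_text, b_text) not in direct_a_b:
                generated.append((a_text, b_text, features))
    return generated
```

With de-DE as pivot, pivot_to_a would hold the de-DE to fr-FR segments and pivot_to_b the de-DE to es-ES segments, yielding fr-FR/es-ES pairs.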



on 09/29/2016
No. You may upload only directly translated data, not pivoted translations.

Pivoted or matrix data is translation data that is generated via a common pivot language - the TAUS platform produces such pivoted data with the process described in FAQ https://www.taus.net/faq/187-data-cloud/62-what-is-the-matrix-function

As the Data Cloud generates pivoted or matrix data automatically,  please only upload directly translated data.

Users that upload pivoted translation data may experience removal of credits earned by the upload of such data.
on 03/14/2017
You can only upload translation data for which you have copyright ownership.

Users that upload non-owned translation data may experience removal of credits earned by the upload of such data.
on 03/14/2017