How to Get Your First Data into Data Marketplace
26 Nov 2020
5 minute read
Here is a step-by-step guide to publishing your language data on the TAUS Data Marketplace.

In November 2020, TAUS launched the Data Marketplace, an open data market where translators, language service providers, technology developers and enterprises come together to sell and buy language data for machine translation and other machine learning applications.

We’ve realized that there is not necessarily a lack of language data out there, but rather that the data is locked in silos or sits unused. Have you created or collected language data over time that’s been waiting to be put to use? Now you can easily monetize it, and here is how.

Step 1: Confirm that You Have Rights to Sell Data

Whether you are the customer (who paid for the translations), the LSP (the agency in the middle) or the translator (who produced the translation), be aware of the rights that you have when it comes to sharing and selling the translation memories or any other type of language data. We lay this out for you in the Data Marketplace FAQs. In general, the problems around ownership of translation data exist mostly in a theoretical setting. There are no records of legal cases around the ‘unlawful’ use of translation memories. If a problem arises, a ‘notice and take-down’ will settle it. 

For more insights and general guidance, read our article on Language Data Ownership and Copyright in Translation or download a free copy of the Who Owns My Language Data White Paper. If you are still in doubt, you may want to check in with your customers before publishing the data in the Data Marketplace.

Step 2: Make Your Data MT-Ready

Data Marketplace prepares your data for sale by cleaning it. Navigate to the Sell Data section of the Data Marketplace and simply upload a TMX file (more file formats will be available soon) by dragging and dropping it or by clicking on the browse button and selecting a file from your computer. This initiates the data analysis process.
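Before uploading, it can help to check locally that a TMX file parses and contains the language pair you expect. The sketch below is only an illustration and is not part of the Data Marketplace: the function name `read_tmx_pairs` and the sample content are made up for this example, and it assumes a standard TMX layout of `tu`, `tuv` and `seg` elements.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny made-up TMX document used only for this example.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1.0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="nl"><seg>Hallo wereld.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="nl"><seg>Goedemorgen.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) text pairs for the given language codes.

    Hypothetical helper: language codes are compared exactly, in lowercase.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup elements.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


pairs = read_tmx_pairs(SAMPLE_TMX, "en", "nl")
print(f"{len(pairs)} translation units found for en-nl")
```

For a file on disk, the same function can be fed the file's contents, for example by reading it with `open(path, encoding="utf-8").read()`.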

The result shows the counts and quality of the uploaded data in three categories: High-quality - unique segments that passed the filtering; Replica - segments that already exist in the Data Marketplace; and Low-quality - segments that haven’t passed the filtering. You’ll also see a sample of the segments that were marked as low-quality and won’t be included in the published file.
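To make the three categories concrete, here is a minimal sketch of how segments could be sorted into such buckets. The rules are placeholder assumptions chosen for illustration (empty or untranslated segments count as low-quality, and repeats within the same file count as replicas); they are not the actual filters used by the Data Marketplace, and `classify_segments` is a hypothetical name.

```python
def classify_segments(pairs, existing):
    """Sort (source, target) pairs into three illustrative buckets.

    `existing` is a set of lowercased (source, target) pairs standing in for
    data already on the platform. All rules here are assumptions.
    """
    result = {"high_quality": [], "replica": [], "low_quality": []}
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        key = (src.lower(), tgt.lower())
        if not src or not tgt or src == tgt:
            # Assumed rule: empty or untranslated segments fail the filter.
            result["low_quality"].append((src, tgt))
        elif key in existing or key in seen:
            # Already known, either on the platform or earlier in this file.
            result["replica"].append((src, tgt))
        else:
            seen.add(key)
            result["high_quality"].append((src, tgt))
    return result


existing = {("hello world.", "hallo wereld.")}
uploaded = [
    ("Hello world.", "Hallo wereld."),
    ("Good morning.", "Goedemorgen."),
    ("OK", "OK"),
    ("Thanks", ""),
]
report = classify_segments(uploaded, existing)
for bucket, items in report.items():
    print(bucket, len(items))
```

With these sample inputs, one pair lands in each of the high-quality and replica buckets and two in the low-quality bucket, which mirrors the kind of counts the analysis step reports.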

Step 3: Define Metadata and Price

If you’re happy with the result of the analysis, you can move on to the Publishing step, where you add metadata such as domain and content type to your dataset, and a description to make your dataset stand out from the rest.


Step 4: Create an Account and Publish

In order to publish a dataset, you need to have an account with the Data Marketplace. That way you can edit your published datasets and we can make sure that you receive your payment when someone purchases the data.

Created an account? The very last thing you need to do is click on the Publish button.

Your dataset will now appear in the Data Marketplace, both under the Data Sellers section and when a potential buyer selects that language pair under the Explore Data tab.

The Data Marketplace has set out to become a vibrant and secure global language data trading platform. You are in the right place if you want to share your language datasets with the wider community of AI and ML service providers. The Data Marketplace team is on call to support you at every step, from listing your datasets to getting paid. Have any questions or concerns? See complete publishing guidelines for sellers here, or reach us at dms@taus.net at any point in time.


Milica is a marketing professional with over 10 years in the field. As TAUS Head of Product Marketing she manages the positioning and commercialization of TAUS data services and products, as well as the development of taus.net. Before joining TAUS in 2017, she worked in various roles at Booking.com, including localization management, project management, and content marketing. Milica holds two MAs in Dutch Language and Literature, from the University of Belgrade and Leiden University. She is passionate about continuously inventing new ways to teach languages.
