When to Community-Source Your Training Data for ML

Ideal use cases for community-sourcing training data for ML, and common misconceptions about this data acquisition model.

The amount of content that is being produced worldwide and needs translation has been surging for years now. For the vast majority of players in the language industry, the COVID-19 pandemic didn’t slow things down but rather accelerated them. According to Nimdzi, the language services industry reached USD 55B in 2020 and is on a growth path expected to hit a whopping USD 73.6B by 2025. This will only be possible if the industry keeps showing the same resilience and adaptability, embracing new technologies and digital transformation.

Undoubtedly, the most important technological advances in the language industry have happened in the fields of Natural Language Processing (NLP) and Machine Learning (ML), all geared towards enabling automation of translations at scale. Both technologies require tremendous amounts of high-quality training data. The ways to get the data have also evolved. Beyond gathering data internally, drawing on public sources, or hiring professional translators to produce it, companies can now acquire off-the-shelf datasets, leverage data marketplaces, or use third-party platforms to community-source the data.

In this article, we will look at the important aspects of community-sourcing training data for ML and share the ideal use cases for this data acquisition model.

Common Misconceptions 

The community-sourcing approach is often seen as the easier and cheaper route, one that might not guarantee high-quality outcomes. This is not necessarily true, especially when it comes to language data intended for ML applications. Let’s address those misconceptions first.

Misconception #1: Community-Sourcing Data is Free

While companies like Facebook, TED, and CSOFT have leveraged their networks and communities for voluntary translations, a practice popularly known as crowdsourcing, what we refer to as community-sourcing language data is not a mechanism to obtain free translations. We believe that anyone eager to contribute to technological evolution and greater digital representation of their language is entitled to compensation. TAUS’ global community of Human Language Project (HLP) contributors is therefore paid for their work. We ensure that the compensation for the performed work is fair, that the payment terms are transparent, and that the rates match reasonable wage levels in the contributors’ respective countries, while keeping the prices of the generated datasets competitive for our clients.

Misconception #2: Any Organization Can Community-Source Data

So you basically send some words to the people in your network and they translate them, right? Well, it’s not quite that easy. The first thing you need is a user-friendly, robust platform with advanced admin features, integrated NLP capabilities, and a transaction system able to facilitate worldwide payments. Think of software development, setup, and server costs. Then you need to establish communities in relevant countries and onboard the workforce. Finding and recruiting qualified workers in remote parts of the world requires a great deal of coordination and experienced project ambassadors on the ground. Finally, and this is probably the hardest part, you want to keep the community engaged with varying tasks, continuous work, and timely payments. That is definitely more complex than just sending out some words for translation. TAUS went through this process when it set up a data annotation community for a global e-commerce corporation.

Misconception #3: Community-Sourcing Can’t Deliver Quality Data

This one is by far the most critical assumption when it comes to community-sourcing. Everyone needs great volumes of language data fast, but nobody wants poor-quality data. We couldn’t agree more. This is exactly where community recruitment and management play a key role. Even though this approach doesn’t necessarily involve professional translators, communities consist of native speakers and bilingual people who are well informed about the purpose of the project and aware that they only get compensated for work that’s up to a set standard. Finally, adding automated and manual moderation steps to the process further helps with ensuring quality. 
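To make the idea of an automated moderation step more concrete, here is a minimal sketch in Python. It is not a description of TAUS' actual pipeline: the `Submission` class, the `moderation_flags` function, the three checks, and the ratio thresholds are all hypothetical choices made for illustration. The sketch flags submissions that are empty, that merely copy the source text, or whose length is out of proportion to the source, so that only those go to manual review.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """One contributed translation (hypothetical structure)."""
    source: str  # segment shown to the contributor
    target: str  # translation submitted by the contributor


def moderation_flags(sub: Submission,
                     min_ratio: float = 0.4,
                     max_ratio: float = 2.5) -> list[str]:
    """Return the reasons, if any, to send a submission to manual review.

    The thresholds are illustrative assumptions and would need tuning
    per language pair.
    """
    source = sub.source.strip()
    target = sub.target.strip()

    # Nothing was submitted: no point running the other checks.
    if not target:
        return ["empty_target"]

    flags = []

    # The contributor pasted the source back instead of translating it.
    if target.lower() == source.lower():
        flags.append("copied_source")

    # Translations far shorter or longer than the source are suspicious.
    ratio = len(target) / max(len(source), 1)
    if ratio < min_ratio or ratio > max_ratio:
        flags.append("length_ratio")

    return flags
```

A submission that returns an empty list would pass the automated step. Anything flagged would be queued for a human moderator rather than rejected outright, since a short or identical target can be legitimate, for example with brand names.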

Even though community-sourcing might not be the preferred solution for every data collection scenario, if done right, it provides some clear benefits such as access to a large distributed workforce, significantly shorter turnaround times, and potentially more control of the process. Since the projects are split into micro-tasks, the creation, cleaning, and validation of large amounts of data can be done faster and more efficiently. 
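The micro-task idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the way any particular platform works: the function name `split_into_micro_tasks` and the default batch size of 20 segments are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def split_into_micro_tasks(segments: Iterable[str],
                           task_size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `task_size` segments.

    Each batch is small enough for one contributor to finish in a
    single sitting, so many contributors can work in parallel.
    """
    if task_size < 1:
        raise ValueError("task_size must be at least 1")

    iterator = iter(segments)
    # islice pulls the next `task_size` items; an empty batch ends the loop.
    while batch := list(islice(iterator, task_size)):
        yield batch
```

For example, 45 segments with the default size would produce three tasks of 20, 20, and 5 segments, which could then be handed to three different contributors at the same time.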

Community-sourcing can be particularly interesting as a way of acquiring data when you:

Want to Add a New Language With No Available Data

Even the largest enterprises have been in this position, and they still are when it comes to long-tail languages. Kickstarting translations in a new language with little to no available language data is always hard, especially if you want to be able to deploy MT to some extent from the very start. While you’re building your pages from scratch and making sure that the linguistic particularities and market requirements (such as a different alphabet, text alignment, number formatting, or specific content) are accounted for, the communities can be creating the data assets that will allow you to jumpstart things in production when the time comes.

Have to Train a Specific ML Model

ML models are inherently problem-specific and require tailored data of all sorts - domain-specific data, labeled data, or pre-processed data. By making community tasks specific to your use case, multiple data tailoring steps can be carried out at the same time, ensuring that the data that you get is ready for training your ML model without any additional adjustments. 

Need High Volumes of Training Data, Fast

Think of cases such as releasing a new line of products or adding customer support for a new industry vertical, even if it’s just for the top languages. This challenge of scaling within a limited time frame can be largely alleviated by obtaining as much additional data as you can in the specific domain and training your systems before those new pages or features gain enough traction/priority to justify the engagement of the localization teams. The advantage of using communities to get this data is again the fact that task distribution allows for greater volumes, and that the workforce doesn’t need much lead time.


Despite the bad reputation it has recently gained from being used synonymously with terms like translation crowdsourcing or user-generated translation, community-sourcing as a paid and managed activity offers great opportunities for an array of data-related tasks, such as data creation, labeling, and more. Choosing the right community-sourcing partner with an established platform, engaged communities, and a track record in providing end-to-end data solutions is essential. With a highly specialized NLP team and a global community of thousands of data contributors, TAUS can help generate, collect, prepare, and annotate text, image, or audio datasets for your ML systems.



Milica is a marketing professional with over 10 years in the field. As TAUS Head of Product Marketing she manages the positioning and commercialization of TAUS data services and products, as well as the development of taus.net. Before joining TAUS in 2017, she worked in various roles at Booking.com, including localization management, project management, and content marketing. Milica holds two MAs in Dutch Language and Literature, from the University of Belgrade and Leiden University. She is passionate about continuously inventing new ways to teach languages.
