Why Low-resource Language Data Matters


As machine translation for low-resource languages becomes more popular, the need for low-resource language data becomes critical. Here's why.

You’ve probably heard before that there are around 7,000 spoken languages in the world, 4,000 of them with an established writing system. You might also know that only 23 languages account for more than half of the world’s population (see the infographic by Alberto Lucas). But did you know that a quarter of those 23, including Bengali, Tamil, Telugu, Urdu, Marathi, and Lahnda, are not even in the top 40 languages on the internet when it comes to the availability of online content?

A simple explanation is that internet penetration in regions like Asia-Pacific happened more slowly, so these languages “came online” later. The languages above are also what we call low-resource languages: languages that lack the linguistic resources needed to train automated translation systems. That’s all logical, and it might seem like we could wait another few years for these languages to catch up. However, here are three reasons why these and other long-tail languages should be in focus today, and why there should be an industry-wide effort to increase the available data pool.

Digital Evolution is Breaking Ground

The past couple of years saw immense growth in new internet users, with over 350 million people coming online in 2019, or nearly one million daily. India, Asia-Pacific, and China were the growth frontrunners. Even with average annual digital growth of 9%, the majority of languages are still digitally under-represented, with the biggest online automated translation platforms supporting no more than 103 languages. The ability of businesses to reach these new online users will depend on whether content is available in a language they can understand. The race to the next billion users is on, and the only way to match the speed of digitalization is automated translation.

Languages are Disappearing Fast

Preserving linguistic diversity is an existential question. Language shapes our identity and ensures that we can communicate, integrate socially, and develop. The loss of a language is a loss of cultural heritage and knowledge. Still, the number of disappearing languages is shocking. According to UNESCO’s Atlas of the World’s Languages in Danger, 230 languages went extinct between 1950 and 2010, and 40% of the world’s languages are considered endangered with fewer than 1,000 speakers left. The UN General Assembly proclaimed 2019 the International Year of Indigenous Languages in order to draw more attention to this problem.

Economy is the Driver

If the need for digital multilingualism or for preserving linguistic diversity does not convince you yet, there is the economic side of things as well. One look at the International Monetary Fund overview of the emerging regions based on GDP and you will again see Asia and Pacific (including Southeast Asia) at the very top of the list, followed by Africa and the Caribbean. What does this have to do with low-resource languages, you might wonder? Well, if your organization is doing business in any of these regions, localizing into Bengali, Tamil, Telugu, Nepali, Sinhala, Lao, and others might soon be a necessary step to ensure your future success.

TAUS has been creating corpora in low-resource languages to address digital inequality and support the machine learning efforts in some of these languages. Check TAUS Data Services and Data Library for more information.


Milica is a marketing professional with over 10 years in the field. As TAUS Head of Product Marketing she manages the positioning and commercialization of TAUS data services and products, as well as the development of taus.net. Before joining TAUS in 2017, she worked in various roles at Booking.com, including localization management, project management, and content marketing. Milica holds two MAs in Dutch Language and Literature, from the University of Belgrade and Leiden University. She is passionate about continuously inventing new ways to teach languages.
