Ethical Responsibility towards Underrepresented Languages
5 Jan 2021
See how a linguist specializing in low-resource Middle Eastern languages mobilizes his native community to generate language data and, by offering it on the TAUS Data Marketplace, increases the digital representation of these languages.

"The Web does not just connect machines, it connects people," said Tim Berners-Lee, the inventor of the World Wide Web. Whether online or offline, language is just as important to building human connections: it forms the basis of how users identify with each other and the boundaries within which communities come together for common interests.

Making information equally available online in as many languages as possible is closely tied to the development of Machine Learning technologies and Machine Translation practices. Despite these efforts, not all languages gain the same traction or show the same results. This is shown by research from academics Mark Graham and Matthew Zook, who compared Google searches made in the West Bank in Hebrew, Arabic and English. They discovered a striking imbalance between linguistic groups: searches in Arabic in areas under Palestinian control usually return only 5% to 15% of the number of results that the same search term brings up in Hebrew. English searches also bring back between four and five times more results than Arabic ones.

Inequality in information and representation across languages online can affect how we understand places and even how we act in them. In a case study of the West Bank, searching locally for “restaurant” in Hebrew, Arabic and English brings back different results for each language. Google can send Arabic speakers to one part of the city and Hebrew speakers to another when they are searching for the same thing. This pre-selection of information risks reinforcing social segregation in the city and shapes how people interact with each other and the world around them.

These studies make it clear that the universe of information on the internet looks very different from one language to another and what it often comes down to is the availability of data in a specific language. To help low-resource languages to become available online, TAUS started the Human Language Project (HLP) where communities of underrepresented languages generate datasets to be used in Machine Learning projects. 

Aren Yıldırım is one of the ambassadors for the HLP Middle Eastern community. He speaks Kurdish, Kurmanji, Sorani, Zazaki and Hewrami in addition to Turkish, Persian and Arabic. He is the author of a Sorani-Turkish dictionary and several course books on Kurdish for Foreigners. He has been a well-respected linguist for 12 years and has translated 6 books from Persian and Turkish into Kurdish and Sorani.

Aren highlights that, as a translator of several low-resource languages spoken by people in conflict and oppression zones, he understands the value and necessity of data production to support these communities and languages. “It is vital for me and the communities I represent to see that these languages that are not classified as official languages in countries they are spoken in, and are even banned in daily usage, are being valued by international companies and platforms,” says Aren. He adds, “The fact that the language data produced in these languages will be used in digital platforms will certainly expedite the connection and communication between oppressed communities and other cultures.”

So far he has mobilized a large community of Kurmanji (Arabic and Latin script), Sorani and Turkish speakers to translate one million words in each of these languages on the HLP Platform. These datasets are available for sale in the TAUS Data Library. Recently Aren and his community translated 200K words in the English - Kurmanji/Sorani language pair in the media and news domain for use in AI and ML projects. These are available for purchase on the Data Marketplace.

“I believe that making communication possible through machine learning applications, and thus faster, forms a new economic ecosystem of its own. It seems without a doubt that other linguists like me would see the value in this new business stream and understand how it helps us stay aligned with the contemporary business and technology trends,” says Aren. 

Data will continue to be at the heart of the translation and language business going forward. The content production rate increases along with the speed of technology, and AI has to keep up with it through diverse solutions. Based on this, Aren suggests that “we must be aware that the wave of transformation has already started and it is the way that linguists, especially in low-resource languages, can monetize their efforts by mobilizing the communities of native speakers around them to produce datasets. This is a financial benefit for their communities but, more than that, ethical responsibility to their culture and languages.”

More datasets by Aren Yıldırım can be found on the TAUS Data Marketplace.


Şölen is the Head of Digital Marketing at TAUS where she leads digital growth strategies with a focus on generating compelling results via search engine optimization, effective inbound content and social media with over seven years of experience in related fields. She holds BAs in Translation Studies and Brand Communication from Istanbul University in addition to an MA in European Studies: Identity and Integration from the University of Amsterdam. After gaining experience as a transcreator for marketing content, she worked in business development for a mobile app and content marketing before joining TAUS in 2017. She believes in keeping up with modern digital trends and the power of engaging content. She also writes regularly for the TAUS Blog/Reports and manages several social media accounts she created on topics of personal interest with over 100K followers.
