Ethical Responsibility towards Underrepresented Languages

"The Web does not just connect machines, it connects people," said Tim Berners-Lee, the inventor of the World Wide Web. Whether online or offline, language is just as important to building human connections: it forms the basis of how users identify with each other and the boundaries within which communities come together for common interests.

Making information equally available online in as many languages as possible is closely tied to the development of Machine Learning technologies and Machine Translation practices. Despite these efforts, not all languages gain the same traction or show the same results. Research by academics Mark Graham and Matthew Zook illustrates this: they compared Google searches made in the West Bank in Hebrew, Arabic and English and discovered a striking imbalance between linguistic groups. Searches in Arabic in areas under Palestinian control usually return only 5% to 15% of the number of results that the same search term brings up in Hebrew. English searches also return between four and five times more results than Arabic searches.

Inequality in information and representation across languages online can affect how we understand places and even how we act in them. In a case study of the West Bank, searching locally for “restaurant” in Hebrew, Arabic and English brings back different results for each language. Google can send Arabic speakers to one part of the city and Hebrew speakers to another when they are searching for the same thing. This pre-selection of the information people receive risks reinforcing social segregation in the city and shapes how people interact with each other and the world around them.

These studies make it clear that the universe of information on the internet looks very different from one language to another, and the difference often comes down to the availability of data in a specific language. To help low-resource languages become available online, TAUS started the Human Language Project (HLP), where communities of underrepresented languages generate datasets to be used in Machine Learning projects.

Aren Yıldırım is one of the ambassadors for the HLP Middle Eastern community. He speaks Kurdish, Kurmanji, Sorani, Zazaki and Hewrami in addition to Turkish, Persian and Arabic. He is the author of a Sorani-Turkish dictionary and several course books on Kurdish for Foreigners. He has been a well-respected linguist for 12 years and has translated 6 books from Persian and Turkish into Kurdish and Sorani.

Aren highlights that, as a translator of several low-resource languages spoken by people in conflict and oppression zones, he understands the value and necessity of data production to support these communities and languages. “It is vital for me and the communities I represent to see that these languages that are not classified as official languages in countries they are spoken in, and are even banned in daily usage, are being valued by international companies and platforms,” says Aren. He adds, “The fact that the language data produced in these languages will be used in digital platforms will certainly expedite the connection and communication between oppressed communities and other cultures.”

So far he has mobilized a large community of Kurmanji (Arabic and Latin script), Sorani and Turkish speakers to translate one million words in each of these languages on the HLP Platform. These datasets are available for sale in the TAUS Data Library. Recently Aren and his community translated 200K words in the English-Kurmanji/Sorani language pair in the media and news domain for use in AI and ML projects. These are available for purchase on the TAUS Data Marketplace.

“I believe that making communication possible through machine learning applications, and thus faster, forms a new economic ecosystem of its own. It seems without a doubt that other linguists like me would see the value in this new business stream and understand how it helps us stay aligned with the contemporary business and technology trends,” says Aren.

Data will continue to be at the heart of the translation and language business going forward. The rate of content production increases along with the speed of technology, and AI has to keep up with it through diverse solutions. Based on this, Aren suggests that “we must be aware that the wave of transformation has already started and it is the way that linguists, especially in low-resource languages, can monetize their efforts by mobilizing the communities of native speakers around them to produce datasets. This is a financial benefit for their communities but, more than that, ethical responsibility to their culture and languages.”

More datasets by Aren Yıldırım can be found on the TAUS Data Marketplace.

