From Studying Medicine to Selling MT Training Datasets


See how a medical doctor who set out as a student to break the linguistic monoculture in academia ended up creating a dataset of medical glossaries and translation memories that brought about a 90% BLEU score improvement for the English-Arabic language pair.

Globalization lies at the core of the contemporary age. Ideally, then, academic research would capitalize on contributions from researchers everywhere and make those contributions accessible to students and academics all around the world. Yet language barriers remain a considerable stumbling block to the global circulation of academic findings. English is the dominant language in the academic world, which means that researchers are under pressure to publish their findings in English, and students are expected to understand and digest those findings in English. Together, these pressures create an academic monoculture.

Muhammad Alkhateeb is an orthopedic surgeon who trained in Orthopedics and Traumatology at the Damascus Al Mujtahid University Hospital in Syria. He started translating over ten years ago as a medical student. Now he is a surgeon, and he is still helping the medical community through English-Arabic translations. His story shows how a struggle to break the academic monoculture turned into a new occupation, one in which he can monetize the translation data he has collected and generated over the years.

“Most medical sciences textbooks are written in English. However, our faculty’s teaching language was Arabic, so I started translating these English resources as a student. I spent many hours studying medical terminology to be able to do these translations efficiently,” explains Muhammad. He adds: “Over the years, I gathered 125,000 medical terms in English to Arabic and had to learn how to use CAT tools and machine translation.” Along the way, he discovered that English-Arabic machine translation output in the medical domain was not accurate. He continued to build his medical glossaries and collected half a million translation memory segments in the hope that they would one day be useful for improving MT systems.

While searching online for a platform where he could contribute to that purpose, he came across the TAUS Data Marketplace. “I wanted to share my datasets there to make it easier for MT developers to access accurate data in the English-Arabic language pair so that other medical students or doctors who teach medicine can access more accurate translations of the available academic findings,” says Muhammad. Selling his datasets there also earns him an income.

On the TAUS Data Marketplace, he currently offers 400K segments in the medical domain with 5 to 25 words per segment in the English-Arabic language pair. “This data was used in three different test projects to improve MT performance and a 90% BLEU score improvement was observed in the MT engines trained in the medical domain,” explains Muhammad.

In domains such as medicine, sentences may be structured simply, but terminology plays a significant role. When an MT engine is trained with datasets and terminology bases hand-crafted by a medical professional turned translator such as Muhammad Alkhateeb, a significant improvement can be expected.



Şölen is the Head of Digital Marketing at TAUS where she leads digital growth strategies with a focus on generating compelling results via search engine optimization, effective inbound content and social media with over seven years of experience in related fields. She holds BAs in Translation Studies and Brand Communication from Istanbul University in addition to an MA in European Studies: Identity and Integration from the University of Amsterdam. After gaining experience as a transcreator for marketing content, she worked in business development for a mobile app and content marketing before joining TAUS in 2017. She believes in keeping up with modern digital trends and the power of engaging content. She also writes regularly for the TAUS Blog/Reports and manages several social media accounts she created on topics of personal interest with over 100K followers.
