From Studying Medicine to Selling MT Training Datasets

Globalization lies at the core of the contemporary age. Ideally, academic research would draw on contributions from researchers across the globe and make those contributions accessible to students and academics everywhere. Yet language barriers remain a considerable stumbling block to the global circulation of academic findings. English is the dominant language of the academic world, so researchers everywhere are under pressure to publish in English, and students are expected to read and digest these findings in English. The result is an academic monoculture.

Muhammad Alkhateeb is an orthopedic surgeon who trained in Orthopedics and Traumatology at the Damascus Al Mujtahid University Hospital in Syria. He started translating more than ten years ago as a medical student. Now a surgeon, he still supports the medical community through English-Arabic translation. His story shows how an effort to break the academic monoculture turned into a new occupation, one in which he can monetize the translation data he has collected and generated over the years.

“Most medical sciences textbooks are written in English. However, our faculty’s teaching language was Arabic, so I started translating these English resources as a student. I spent many hours studying medical terminology to be able to do these translations efficiently,” explains Muhammad. “Over the years, I gathered 125,000 English-Arabic medical terms and had to learn how to use CAT tools and machine translation,” he adds. At the same time, he discovered that English-Arabic machine translation output in the medical domain was not accurate. He continued to build his medical glossaries and collected half a million translation memory segments in the hope that they would one day help improve MT systems.

While searching online for a platform where he could put his data to that use, he came across the TAUS Data Marketplace. “I wanted to share my datasets there to make it easier for MT developers to access accurate data in the English-Arabic language pair so that other medical students or doctors who teach medicine can access more accurate translations of the available academic findings,” says Muhammad. Sharing the datasets there also brings him a monetary benefit.

On the TAUS Data Marketplace, he currently offers 400K segments in the medical domain with 5 to 25 words per segment in the English-Arabic language pair. “This data was used in three different test projects to improve MT performance and a 90% BLEU score improvement was observed in the MT engines trained in the medical domain,” explains Muhammad.

In domains such as medicine, sentences may be structured simply, but terminology plays a significant role. Training an MT engine with datasets and terminology bases hand-crafted by a medical professional turned translator, such as Muhammad Alkhateeb, can therefore be expected to improve its output significantly.
