Data-Enhanced Machine Translation


The next logical translation solution: Data-Enhanced Machine Translation (DEMT)

This is the third article in my series on Translation Economics of the 2020s. In the first article, published in Multilingual, I sketched the evolution of the translation industry, driven by technological breakthroughs, from an economic perspective. In the second article, Reconfiguring the Translation Ecosystem, I laid out the emerging new business models and ended with the observation that new, smarter models still need to be invented. This is where I will now pick up the thread and introduce you to the next logical translation solution. I call it Data-Enhanced Machine Translation.

Current shortcomings

Machine Translation technology has brought the promise of a world without language barriers. The need for translation is astronomical: the total output from MT platforms is tens of thousands of times bigger than the translation production of all professional translators in the world combined. And yet, given the current state and use of the technology, we are still looking at a half-baked solution. What’s wrong? Well, the quality of course, and more specifically the coverage of domains and languages. Imagine where we would be if we could mobilize the technology to handle all languages and domains equally well.

Translation out of the wall

Let’s zoom out here for a moment and put ourselves in the shoes of an average netizen somewhere in the world. The 4.6 billion people who live and work online on our planet look at translation as a utility: translation out of the wall, not much different from electricity or the internet. It’s always there, and if it isn’t, well, that’s most inconvenient of course. The quality may also not always be great, but it will work better next time, right? Remember the poor quality of Skype calls in the early days or the trouble we had making connections to the internet by dialing in through our computers. We accepted all of these inconveniences because we anticipated that the technology would catch up and improve.

So do millions of small, medium and large businesses around the world. Translation is a utility they purchase from their cloud provider and pay for through monthly billing, along with hosting costs for their websites and other IT services. If the translation ‘signal’ is not available for a particular country, then so be it. They have to live with the disconnect until the ‘coverage’ is available. The alternative of a full human translation is not an option, because it may take a week, a month or more before the translation is finished, and that simply does not match the fast pace of global business these days. Not to mention the cost of human translation: probably 500 times higher than what the company pays for using the MT service from the cloud provider.

The compromises of the translation sector

So what can we do to bridge the gap: boost the quality and expand the coverage of the translation ‘signal’ for the billions of end-users and millions of business users? The sobering truth is that the biggest shortcoming is not so much in the technology but in the way we go about it. We need an innovation ecosystem. 

But the best we are getting at the moment is a translation industry making compromises: the new MT technology is molded into existing business models and processes. “If you can’t beat ’em, eat ’em” seems to be the mantra of some of the most forward-thinking translation platforms, judging from the sheer number of MT engines they claim to integrate with. What they achieve is a productivity gain resulting in a price reduction for the customer: ten to twenty percent faster and cheaper every year. At the core, though, nothing is changing. Except that in this race to the bottom they are dragging along the professional translators and giving them ever more dumb post-editing work to do.

The compromises made in the translation industry come, on the one hand, from a long tradition of craftsmanship and, on the other, from the principle that a translation is a service that is always commissioned and paid for by a customer. The post-editing model, or ‘human-in-the-loop’ as it is more neatly referred to, is in a way a half-baked solution, a compromise. These compromises stop us from seeing the bigger picture and bigger opportunities. What if these traditions and principles no longer count under the new economics?

Rethinking everything

As we have seen at other ruptures in the history of economies, it is thinking outside the box that will help us reap the full benefits of a technology breakthrough. Rethinking everything, a tabula rasa kind of approach to the definitions of markets and products, is the key to innovation.

The world is our market

The translation industry, I believe, has the mission to help the world communicate better. The world is our market. Operators in the translation industry must see beyond the inner circle of customers that they are serving today and help every business, small and large, become truly world-ready. The challenges of practically real-time quality translation and utility-based pricing are waiting to be solved by innovative thinkers from the core language industries. Others won’t do it or can’t do it. The tech companies bring the translation utility only so far. They can’t close the quality gap, or don’t want to, because they can’t excel in everyone’s field of expertise and style preferences. The reward for solving the problems is a market that is thousands of times bigger, at least in volumes of output, than the translation industry today.

The data is the product

To deliver on our mission to help the world communicate better and support every business to become truly world-ready, we need to really think of translation as a product, rather than a creative service. There is no other way we can scale up and deliver real-time. And as we can see all around us: translation is already a utility, a feature or a product on the internet. The central problem is that it’s just not good enough. To make it better, the professionals in the language industry need to look under the hood, analyze the product and figure out how it works. They will then discover that MT is not a magic black box, and realize that the difference between good and bad or not-so-good MT lies in the data that we put into the engine.

Think of MT as tasteless instant food: the product of a recipe (the algorithms) and ingredients (the data). To produce really good tasty food we need to get the best ingredients. Yes, we can perhaps tweak the recipe a bit here and there, but the biggest gain is in the quality of the ingredients. Putting it this way, we can say that the data, and not the translation process, is in fact the product. There is a growing awareness in the MT and AI industries overall that too much emphasis is being put on the models and that the more important data work is undervalued.

Becoming practical

So now that we have redefined the market and agreed that it is so much bigger than what we consider our market today, and now that we have redefined translation as essentially a data product, the question remains: how do we turn this into a viable business, and how do we scale?

Best use of human capital

There is a huge need for translation and this is continuing to grow. Yet the population of professional translators is not infinite. MT technology, as we have seen, helps to increase productivity. But is post-editing MT the best use of our human capital? The core competencies of the translation industry are of course the skills of the people and their deep knowledge of vernaculars. If we want to optimize the utilization of this human capital, we are much better off investing in data products that we can sell multiple times. 

Knowing how to assemble high-quality data in their domain is the professional competitive advantage of the people working in the translation industry. For example, an English-to-French professional translator with subject matter expertise in fishery and maritime law can alone meet the needs of a hundred different lawyers who may now be engaging with customers on both sides of the Channel over fishing rights after Brexit. Or think of Mohamed Alkhateeb, the Syrian medical doctor and translator who uploaded his English-to-Arabic medical translation memories to the TAUS Data Marketplace, helped to boost the quality of Systran’s medical engine for the Arabic markets and created a good income for himself.


The final piece of the puzzle is how to bring demand and supply together in practice. How can specialized translators and language service providers operating in a niche market find as many customers as possible for their translation or data products?

In my previous article, Reconfiguring the Translation Ecosystem, I introduced marketplaces and collaborative platforms as the new sharing models that would best support innovation in the translation ecosystem. As translation is now becoming another form of AI, we see that the big AI marketplaces are gradually starting to include data for translation. We are also seeing the emergence of more specialized aggregation platforms such as aiXplain, Systran’s Marketplace, Hugging Face, and of course the TAUS Data Marketplace. All of these sharing platforms will push the one-to-many business model and help the translation industry to scale up and eventually build bridges to the millions of new business customers waiting for the translation ‘signal’ and better quality.

Data-Enhanced Machine Translation

The next logical translation solution, therefore, is Data-Enhanced Machine Translation. This is the premium quality level of real-time translation, perhaps not as good as human translation quality or transcreation, but good enough for 90% of all use cases. The crucial fact is that it’s a product, a feature. Supply is driven by demand. An Azerbaijani translator specialized in VAT regulations may be the sole provider of translation data in this niche, whereas French e-commerce specialists must share customers with many others. Dashboards on collaborative platforms will help operators in the translation industry allocate their resources to where they can make the biggest difference. This innovation wave may make the translation industry more transparent and create a more level playing field for all providers.

Data sellers: go to TAUS Data Marketplace to check the value of your data.

Translation buyers: build your own domain-specific dataset with the Matching Data feature on the TAUS Data Marketplace (Coming Soon).

See our blog article on data cascades and check out Andrew Ng’s plea for data-centric AI.


Jaap van der Meer founded TAUS in 2004. He is a language industry pioneer and visionary, who started his first translation company, INK, in The Netherlands in 1980. Jaap is a regular speaker at conferences and author of many articles about technologies, translation and globalization trends.
