Synthetic Data Generation for Neural Machine Translation


Synthetic parallel data generation by back-translation as a solution for the problem of translating low-resource languages and texts from low-resource domains.

In recent years, neural machine translation (NMT) systems have been getting better and better, with some even claimed to reach human parity. If systems on par with human translators could really be deployed, that would fulfill the “no human in the loop” dream that the industry seems to indulge in more and more frequently.

What is powering the success of these NMT systems? On the one hand, it definitely has to do with progress in model architecture, especially the Transformer, which was first introduced for machine translation. On the other hand, high-quality training data remains a hard requirement for training a high-performing machine translation system.

Although numerous high-performing translation systems have been built, the domains and language combinations they cover barely scratch the surface. For the majority of the roughly 6,000 languages spoken in the world today, there are no machine translation systems, and no language data that could be used to build them. But even when it comes to higher-resource languages, such as English, Dutch or Japanese, there are linguistic contexts where available machine translation systems perform worse than we would like or expect. This happens, for example, when we want to translate text from a specific domain that is quite different from the data the translation models saw during training. One example of such a domain could be scientific articles about protein folding. An off-the-shelf model, such as the English-to-Dutch Amazon Translate model, will probably not perform well in this context.

We already mentioned that there is little or no language data for most of the world’s languages. The situation is not really different for most of the domains we might want to talk about. We have a lot of data in some domains, for example parliamentary proceedings (Europarl Corpus) or TED talks; however, the same cannot be said for many other domains.

The problem of translating low-resource languages and texts from low-resource domains has been tackled in various ways. In this article, we will focus on one data-centric approach to domain adaptation or low-resource language data translation, namely synthetic parallel data generation by back-translation. To begin with, we will briefly describe the terms, and then provide a few examples from our experimentation at TAUS.

Synthetic data generation refers to the automated production of data without the need to collect and label new data. There are numerous techniques of synthetic data generation, and they are highly dependent on the field and on the task we are working on (for example, image classification vs. sentiment analysis vs. machine translation). Synthetic data generation comes in two forms: data augmentation and synthetic data generation proper. Data augmentation refers to new instances being produced from already existing ones, for example by cropping a photograph. Synthetic data generation proper refers to data being produced completely artificially, for example, from a prompt using an image or text generation model. Purely synthetic data can in turn also be used for data augmentation, to augment natural data we already have at our disposal.

Computer vision and synthetic data generation

Computer vision is the field of computer science that pioneered most of the synthetic data generation techniques. This is understandable when we think of what training data might look like for a vision task. Let’s say we wanted to train a classifier that can recognize cats in photos. If we want to augment our dataset with additional examples of cats, we can just take the photos we already have and make them black-and-white, rotate them, crop them… and they will still depict cats! Or, if we want to use purely synthetic images, we can even use an image-generation model like DALL·E to generate images of cats based solely on the text prompt “cat”.

Image 1: Images of cats generated using the Craiyon (formerly DALL·E mini) model

When it comes to NLP, the use of data augmentation is not as clear-cut because of the discrete nature of language data. If we try something similar to what we described for computer vision and swap the places of some words in a sentence, we will probably end up with a sentence that is not grammatically correct, or at least one that has a different meaning from the sentence we started with. Still, there are many different ways to make synthetic data generation work for NLP, especially when it comes to tasks such as sentiment analysis. A good primer can be found here. (Note that this source, and many other sources where translation is not the main task for data augmentation and synthetic data generation, confuse back-translation with round-trip translation. Back-translation is described in more detail in the remainder of this post. Round-trip translation is similar to back-translation in that it uses translation models for synthetic data generation, but it is a different method with a different goal. In round-trip translation, a sentence in some source language is translated to some other target language and then translated back to the original source language, resulting in a synthetic paraphrase of the original sentence.)
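To make the distinction concrete, here is a minimal Python sketch of round-trip translation. The two translators are toy lookup tables standing in for real translation models, and all names (round_trip, toy_en_to_nl, toy_nl_to_en) are hypothetical, chosen only for illustration. The point is the shape of the output: both sides of each pair are in the same language, so the result is a paraphrase pair rather than a parallel translation pair.

```python
from typing import Callable, List, Tuple

# A translator is modeled as any function from a sentence to a sentence.
Translator = Callable[[str], str]


def round_trip(
    sentences: List[str], forward: Translator, backward: Translator
) -> List[Tuple[str, str]]:
    """Translate each sentence to a pivot language and back again.

    Returns (original, paraphrase) pairs, both in the original language.
    """
    pairs = []
    for sentence in sentences:
        pivot = forward(sentence)      # e.g. English -> Dutch
        paraphrase = backward(pivot)   # e.g. Dutch -> English
        pairs.append((sentence, paraphrase))
    return pairs


# Toy stand-ins for real translation models (assumption: illustration only).
EN_TO_NL = {"the film was great": "de film was geweldig"}
NL_TO_EN = {"de film was geweldig": "the movie was fantastic"}


def toy_en_to_nl(sentence: str) -> str:
    return EN_TO_NL[sentence]


def toy_nl_to_en(sentence: str) -> str:
    return NL_TO_EN[sentence]


paraphrase_pairs = round_trip(["the film was great"], toy_en_to_nl, toy_nl_to_en)
print(paraphrase_pairs)
```

A pair like this could augment, say, a sentiment analysis dataset, since the label of the original sentence is likely to carry over to its paraphrase. It is not a source → target training example for a translation model.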

Back-translation for synthetic data generation

When it comes to machine translation, most synthetic data generation techniques rely on the text generation models that are already on hand, namely the translation models themselves, since translation is just another form of text generation. The most widely used and arguably most useful technique is called back-translation. It lets us make use of monolingual data, which is usually far more abundant and easier to come by than bilingual parallel data. It works as follows: suppose our goal is to build a machine translation system that translates from some source language A to some target language B, and we have monolingual data in language B. We can use a translation model that translates from language B to language A to translate those monolingual sentences, and then pair each sentence in B with its translation in A to obtain a synthetic parallel dataset of A → B examples. To translate from B to A, we can use an already existing model or, when training models from scratch, take our original A → B data, flip the source and target sides, and train a B → A model on it.

Diagram 1: We can use a target → source translation model to translate natural monolingual sentences in the target language to the source language

This synthetic parallel dataset can then be used to train the machine translation system we are building, namely the one that translates from language A to language B.

Diagram 2: Next, we can use the resulting synthetic parallel datasets to train or adapt a machine translation model that translates from source to target
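The two steps in the diagrams can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used at TAUS: the target → source model is a toy lookup table, language A is Dutch and language B is English (matching the protein folding example in this post), and the names flip_pairs, back_translate and toy_en_to_nl are hypothetical.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]
Pair = Tuple[str, str]  # (source sentence, target sentence)


def flip_pairs(pairs: List[Pair]) -> List[Pair]:
    """Swap source and target so A -> B data can train a B -> A model."""
    return [(tgt, src) for src, tgt in pairs]


def back_translate(
    target_monolingual: List[str], target_to_source: Translator
) -> List[Pair]:
    """Build synthetic (source, target) pairs from target-side monolingual data.

    The source side is machine-generated; the target side stays natural.
    """
    synthetic = []
    for tgt in target_monolingual:
        src = target_to_source(tgt)
        if src.strip():  # skip sentences the model could not translate
            synthetic.append((src, tgt))
    return synthetic


# Toy English -> Dutch "model" (assumption: stands in for a real B -> A model).
TOY_EN_TO_NL = {
    "Protein folding is fast.": "Eiwitvouwing is snel.",
    "The protein is stable.": "Het eiwit is stabiel.",
}


def toy_en_to_nl(sentence: str) -> str:
    return TOY_EN_TO_NL.get(sentence, "")


# Natural Dutch -> English data, flipped to train an English -> Dutch model.
natural_nl_en = [("Het eiwit vouwt zich.", "The protein folds.")]
reverse_training_data = flip_pairs(natural_nl_en)

# Natural English monolingual data becomes the target side of synthetic pairs.
english_monolingual = [
    "Protein folding is fast.",
    "The protein is stable.",
    "Unknown sentence.",
]
synthetic_nl_en = back_translate(english_monolingual, toy_en_to_nl)
print(synthetic_nl_en)
```

In a real setup, toy_en_to_nl would be replaced by a call to an existing translation model or to a model trained on the flipped data, and the resulting pairs would be passed to whatever training or customization procedure the source → target system uses.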

Why does back-translation work well? 

The first important thing to note is that only one side of the parallel training dataset, the source side, is actually synthetic. The target side is natural data, namely the monolingual sentences in the target language. This means that, by using our back-translated synthetic parallel dataset as training data, we show our model many valid target sentences, which strengthens the decoder. Some source sentences will be valid and well-formed too, but some will be bad translations. Nevertheless, this should not hurt our system too much, since it is unlikely to be asked to translate similarly erroneous sentences. To go back to our protein folding example, let’s say we have some English monolingual sentences that we want to back-translate to Dutch to build a Dutch → English model that can perform well on sentences from the “protein folding domain”. The model we are using to translate from English to Dutch is not great, and it translates “protein folding” as eiwit bouwing, instead of eiwitvouwing or eiwitopvouwing, which would be correct translations. But once we start using the Dutch → English model that we trained on these erroneous translations, we do not expect this error to cause problems, because we are not likely to ask the model to translate “eiwit bouwing” in the first place, since this is not a valid term in this domain!

Another thing to note is that back-translation usually works best when synthetic parallel datasets are used to augment natural parallel data. Thus, we combine the two to arrive at a training dataset that consists of natural parallel data plus synthetic parallel data added to it in some ratio, for example, 1:1, 1:2 or 1:4. The best ratio depends on the language pair and domain and has to be found by experimenting: as the share of synthetic data grows, the model will probably start to deteriorate, and at some point its results could even be worse than those of the original model.
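As a rough illustration of how such mixtures could be prepared for a ratio sweep, here is a minimal Python sketch. It reads the ratios above as natural:synthetic, which is an assumption, and the function name mix_parallel_data and the placeholder data are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def mix_parallel_data(
    natural: List[Pair],
    synthetic: List[Pair],
    synthetic_per_natural: float,
    seed: int = 0,
) -> List[Pair]:
    """Keep all natural pairs and add a random sample of synthetic pairs.

    synthetic_per_natural=2 corresponds to a 1:2 natural:synthetic ratio.
    If there is not enough synthetic data, all of it is used.
    """
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    wanted = int(len(natural) * synthetic_per_natural)
    sampled = rng.sample(synthetic, min(wanted, len(synthetic)))
    combined = list(natural) + sampled
    rng.shuffle(combined)  # avoid a natural-then-synthetic ordering
    return combined


# Placeholder data: 10 natural pairs and 100 synthetic pairs.
natural = [(f"nat-src-{i}", f"nat-tgt-{i}") for i in range(10)]
synthetic = [(f"syn-src-{i}", f"syn-tgt-{i}") for i in range(100)]

# One training set per candidate ratio: 1:1, 1:2 and 1:4.
mixtures = {r: mix_parallel_data(natural, synthetic, r) for r in (1, 2, 4)}
for ratio, data in mixtures.items():
    print(f"1:{ratio} -> {len(data)} training pairs")
```

Each mixture would then be used to train or adapt a separate model, and the models compared on the same held-out in-domain test set to pick the ratio.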

As already mentioned, synthetic data can be used in low-resource language or low-resource domain situations. At TAUS, we recently experimented with back-translation in the context of domain adaptation, as in TAUS DeMT™, where high-quality TAUS data is used to adapt general-domain models such as Amazon Translate to perform well in domains such as financial services or e-commerce.


Image 2: Performance of Amazon Translate in the three domains: unadapted off-the-shelf model, model customized using natural data, and model customized using synthetic back-translated data

We experimented with adapting English → Dutch Amazon Translate models with data in three domains: Financial services, Pharmaceuticals and E-commerce. On average, using high-quality TAUS data in the three domains, we were able to obtain translations that were 5 BLEU points better than the ones produced by the general, off-the-shelf model. Using just the target side of the original parallel dataset, back-translated using Amazon’s off-the-shelf Dutch → English model, we were able to produce translations that were, on average, 2 BLEU points better than the unadapted baseline. Thus, back-translated data did help to adapt the translation model to the domains in question, but high-quality natural parallel data performed much better.

Image 3: Customizing Amazon Translate with data from the e-commerce domain

We have already mentioned that back-translated data works best when it is used to augment natural data. To test this on our own examples, we performed a series of experiments where different quantities of synthetic data were added to natural parallel data. Image 3 shows the results of one of these experiments. Again, using just natural data we obtain the best result, but combining natural and synthetic data performs much better than using just synthetic data.

The last experiments that we ran concerned a different data setup. In the aforementioned experiments, we always used customization with natural data as the upper bound and used portions of the natural data to generate synthetic data. Now we wanted to use all the natural data that we had for adaptation and add to it real monolingual data in the target language from the domain of interest. Image 4 shows that, by using monolingual data in the KYC domain (Know Your Customer, a domain that is part of the Financial Services / Anti-Money Laundering domains) on top of natural data, we were able to further customize the general translation model to this specific domain.

Image 4: Customizing Amazon Translate with data from the KYC domain

In conclusion, we remain confident that high-quality natural parallel data is the way to go when it comes to customizing high-performing general-domain translation models to perform better in a domain of interest. Still, when the situation is lower-resource, or high-quality parallel data is not available, using back-translation as a synthetic data generation method can help push model performance further, although, in our experiments, not as far as natural parallel data. In the future, we plan to continue experimenting with synthetic data in different low-resource scenarios and using different generation methods that build on back-translation. Stay tuned :)



Junior Machine Learning Engineer at TAUS with a background in linguistics, anthropology and text mining. Passionate about implementing state-of-the-art NLP solutions and doing the data work, while also following engineering best practices.
