Understanding BLEU Scores in Customized Machine Translation

It doesn’t happen very often nowadays, but every now and then I still find in my inbox a great example of what is becoming a relic of the past: a spam email with a cringeworthy translation. Like everyone else, I’m not fond of spam, but the ones with horrendous translations do get my attention. The word-by-word translation is like a puzzle to me: I want to know if I can ‘reverse-translate’ it to its original phrasing.

At the same time, I’m wondering what or who has produced such a translation. Automatic translation has come a long way: it has become very hard to find a free translation service that does a worse than mediocre job. I realize that this little joy of reading out loud the worst bloopers from my inbox will soon be a thing of the past.

That said, it doesn’t mean that all automatic translation is flawless from now on. Far from it. Automated translations, though perfectly understandable, often still have an out-of-place feeling about them, and reading a dense text about an unfamiliar subject becomes tiring in the long run. It simply takes more effort to get behind the text and grasp the meaning. Automatic translation has moved from incomprehensible to sometimes uneasy.

Tuning training data

The big machine translation engines, such as Google Translate, Microsoft Translator, and Amazon Translate, allow customers to adjust the output of translations towards their preferred domain or even their preferred style. Customizing machine translation in this way is the next step towards translations that are more in line with the expectations readers have in a given context. The idea is that neural machine translation offers baseline translations that are good enough for generic use, and that feeding it custom training material adds the extra quality that makes for a more knowledgeable translation.

The training material should consist of a good amount of approved translations in a given language pair. Behind the scenes, customization is done either through complete retraining of the translation model or by readjusting the parameters on the fly. Either way, the result is that the translation is more ‘your style’.

TAUS firmly believes in boosting engines this way. As a data company, we are eager to run the experiments of setting up different training datasets, and see what the impact of training on domain data can be.

There are a few steps involved in the training:

selecting domain and language pairs;
selecting the right training material;
evaluating the training results.

TAUS has a huge repository of language data, but as with any big text corpora, some combinations of language and domain are just more suitable for customization than others. Based on experience, chances of success can be estimated.

Selecting data for training is a harder job. It requires thinking about how narrow you want your domain to be, and what the quality of your data should be. As you may expect, narrowing the focus of your training material means lower applicability, but better results within that focus. Selecting the domain-relevant parts of the data is an art in itself, and one that will improve more and more with the advancement of neural models.

Regarding the quality of the training data: more is not always better. High quality and consistency in the training data outperform quantity. This is true to a larger extent than you might think. It is a variation on the Anna Karenina principle. The number of ways in which things can go wrong is so much larger than the ways in which things can go right. That makes the lower end of the quality spectrum suffer much more than average from internal inconsistencies, so it actually pays off not to be too conservative when trimming your data.

In fact, it is all about tuning and dialing. We use different metrics for the reliability of our training data. It is much like creating the best espresso with the coffee beans you’re given. Temperature, coarseness, amount: you carefully keep on dialing in all the different parameters until you hit the sweet spot.

Evaluating the quality

The idea of testing machine translation is simple: you create a training set with source sentences and their translations, but you keep a small portion aside with very reliable reference translations. Never train with this test set, because that would give away the right answers to the test. You only use it to try out the translation engine, both before and after customization, and then you compare the generated translations with the given reference translations.
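
As a rough illustration of keeping a test set apart, here is a minimal Python sketch. The function name `split_parallel_corpus`, the seed, and the toy sentence pairs are assumptions made for this example; they do not come from any particular toolkit or from our own pipeline.

```python
import random

def split_parallel_corpus(pairs, test_size, seed=13):
    """Split (source, reference) pairs into disjoint training and test sets."""
    # Drop exact duplicates so the same pair cannot end up on both sides.
    unique_pairs = list(dict.fromkeys(pairs))
    if not 0 < test_size < len(unique_pairs):
        raise ValueError("test_size must be between 1 and the number of unique pairs minus 1")
    shuffled = unique_pairs[:]
    random.Random(seed).shuffle(shuffled)
    test_set = shuffled[:test_size]
    # Remove training pairs whose source sentence also occurs in the test set.
    test_sources = {source for source, _ in test_set}
    train_set = [pair for pair in shuffled[test_size:] if pair[0] not in test_sources]
    return train_set, test_set

# Hypothetical toy corpus; a real one would hold many thousands of pairs.
pairs = [
    ("Thank you!", "Bedankt!"),
    ("Take one tablet daily.", "Neem dagelijks één tablet."),
    ("Store below 25 °C.", "Bewaren beneden 25 °C."),
    ("Thank you!", "Bedankt!"),  # exact duplicate, dropped before splitting
]
train_set, test_set = split_parallel_corpus(pairs, test_size=1)
```

The fixed seed keeps the split reproducible, so the same test set can be used before and after customization.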

For the estimation of translation quality, there are quite a few different methods and metrics available. Nothing beats the good old human review. We should know as we've worked hard on establishing a dynamic approach to evaluate quality with DQF. For our initial estimations, we used a small-scale human review of reference translations and generated translations. We wanted to know if our reference translations were good in the first place (yes, they were all quite good, and almost always better than the generated translations), and also if the customized translations changed things for the better.

Beyond that initial exploration, human evaluation has its limits. The workload of reviewing every experiment quickly becomes too large to handle at scale. That is where the need for automated evaluation comes in.

If you are familiar with the debates in natural language processing, you probably know that quite some effort is spent on establishing whether automatic evaluation of translation quality is in tune with human judgment of what makes a good translation. The presumption is that a shared human judgment is leading here. A good metric should therefore reflect human judgment by giving good translations a high score. The problem with these metrics is often that their workings are not immediately evident or intuitive: the logic of the metric does not appear to follow its purpose. That certainly applies to the most used metric for machine translation, the BLEU score.

Calculating BLEU scores

Let’s get you up to speed with calculating BLEU scores. In short, the BLEU score takes existing, perfectly good translations as reference translations, and compares the output of machine translation, the candidate translation, to these references. Eventually, this comparison is expressed as a number between 0 and 1, often reported on a scale from 0 to 100, and higher numbers indicate better scores.

A method like this must somehow account for the fact that each source segment can have several perfectly good translations. The BLEU score provides for that by allowing multiple reference translations, each of which is considered equally good. But any deviation from the reference or references lowers the score. This is where the BLEU score gets complicated. BLEU counts the words in the candidate translation, and whenever the candidate contains words that are not in a reference translation, the score suffers. This is a way of calculating the precision of the translation: too much is not good.
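
To make the word counting concrete, here is a minimal Python sketch of the single-word precision, using the ‘Thank you’ translations discussed later in this article. Two things are assumptions of this sketch: the simplistic tokenization (lowercasing and splitting on spaces, with punctuation already spaced out), and the function names. The counting itself follows the usual BLEU convention of clipping: a candidate word is credited at most as often as it occurs in any single reference.

```python
from collections import Counter

def tokenize(text):
    # Assumption: naive tokenization; real tools tokenize far more carefully.
    return text.lower().split()

def clipped_unigram_precision(candidate, references):
    """Share of candidate words that also occur in a reference, with clipping."""
    cand_counts = Counter(tokenize(candidate))
    # For each word, the highest count found in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for word, count in Counter(tokenize(reference)).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    matched = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

references = ["Bedankt , we nemen zo spoedig mogelijk contact met u op ."]
before = "Dank je wel ! We nemen zo snel mogelijk contact met je op ."
after = "Bedankt ! We nemen zo spoedig mogelijk contact met u op ."

precision_before = clipped_unigram_precision(before, references)  # 8 of 14 words match
precision_after = clipped_unigram_precision(after, references)    # 11 of 12 words match
```

Clipping is what keeps a candidate from scoring well by repeating one correct word over and over.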

From this, you might think that a series of words in random order that also happen to be in the reference translation would get a high score, but it doesn’t work that way. Not only single words are included in the calculation, but also groups of consecutive words. The algorithm leaves some leeway for variations, but typically all groups of two, three, and four consecutive words in the candidate translation are counted and compared to all groups of the same length in the reference translations. These groups of consecutive words are called n-grams, and they make sure that randomly ordering the correct words is not rewarded, because an n-gram only matches the reference when the words are in the same consecutive order.
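
The effect of word order can be sketched in a few lines of Python. The helper names `ngrams` and `clipped_precision` are made up for this illustration, and the sketch handles a single reference only. It scores a candidate that contains exactly the reference words in reversed order.

```python
from collections import Counter

def ngrams(tokens, n):
    """All groups of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate_tokens, reference_tokens, n):
    """Share of candidate n-grams that also occur in the reference."""
    cand_counts = Counter(ngrams(candidate_tokens, n))
    ref_counts = Counter(ngrams(reference_tokens, n))
    total = sum(cand_counts.values())
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total if total else 0.0

reference = "[product] wordt toegediend in overeenstemming met officiële aanbevelingen".split()
reversed_words = list(reversed(reference))

unigram_score = clipped_precision(reversed_words, reference, 1)  # every word is present
bigram_score = clipped_precision(reversed_words, reference, 2)   # no pair is in order
```

All single words match, but not a single pair of consecutive words does, so the higher-order n-grams pull the score down to zero.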

Also, a brevity penalty is applied to the score. We already saw that words in the candidate sentence that don’t appear in the reference sentences lower the score. Candidates with fewer words than the reference, on the other hand, have their maximum possible score lowered by the brevity penalty. For a good explanation of the complete calculation, you can check out this page.
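
As a sketch of how the pieces come together, the snippet below implements the brevity penalty as it is commonly defined for BLEU (1 when the candidate is at least as long as the reference, exp(1 - r/c) otherwise) and combines it with the geometric mean of the n-gram precisions. The function names are assumptions of this example, and the equal weighting of the precisions is the common default rather than the only option.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Penalty factor for candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

def combine_bleu(precisions, candidate_len, reference_len):
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    # A single zero precision makes the geometric mean, and the score, zero.
    if not precisions or min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_mean)

# A candidate half as long as the reference keeps at most exp(-1) of its score.
half_length_penalty = brevity_penalty(6, 12)
```

Because the penalty only ever lowers the score, a very short candidate cannot win by containing nothing but safe, matching words.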

BLEU score: more than the sum of its parts

BLEU score is the type of metric that works best when applied to large amounts of data. First of all, don’t expect each and every segment to be better when the BLEU score of a complete test translation is higher than that of another test translation. After all, the BLEU score is an aggregate over the whole test set, and that means that individual segments will have different scores, better or worse. Furthermore, even for two different candidate translations of the same segment, the one with the higher BLEU score is not necessarily the better translation. And finally, it is not recommended to compare larger bodies of translated text based on BLEU score if the source texts are completely different.

But on the whole, when comparing two larger bodies of candidate translations of the same source, the candidate with the higher score is generally perceived as the better translation.
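
One reason segment scores are shaky is that corpus-level BLEU is usually computed by pooling the n-gram counts of all segments, not by averaging per-segment scores. The following Python sketch shows that pooling under some simplifying assumptions: one reference per segment, pre-tokenized input, equal weights for 1-grams to 4-grams, and a score on the 0 to 100 scale. The function names are made up for this illustration.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with pooled counts, one reference per segment."""
    matched = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            ref_counts = ngram_counts(ref, n)
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    # Without any match at some n-gram order, the geometric mean is zero.
    if cand_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    penalty = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return 100 * penalty * math.exp(log_precision)

ref_1 = "bedankt , we nemen zo spoedig mogelijk contact met u op .".split()
cand_1 = "bedankt ! we nemen zo spoedig mogelijk contact met u op .".split()
ref_2 = "[product] wordt toegediend in overeenstemming met officiële aanbevelingen".split()
cand_2 = "[product] wordt gegeven volgens officiële aanbevelingen".split()

segment_2_alone = corpus_bleu([cand_2], [ref_2])              # no shared 3-grams: 0.0
both_segments = corpus_bleu([cand_1, cand_2], [ref_1, ref_2])  # pooled: above zero
```

The second segment scores zero on its own even though it is an understandable translation, while the pooled score over both segments stays meaningful. That is why the metric only becomes informative over a larger test set.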

What does raising the score look like?

Tuning training sets to boost machine translation requires a lot of trial and error. We expected a modest BLEU score impact of a few points at first, but we were quite impressed that a good training set could boost the score of a test translation by 6 points, or sometimes even around 10 points. That is actually a lot.

How much is a lot? The numbers may seem pretty impressive, but what does it do to the translations themselves? I’ll present some samples that demonstrate how translations can show great improvement. As it is my native language, I will give the samples for Dutch, but I will explain the subtleties enough for non-Dutch readers to understand.

One of the customizations involved a large set of medical translations. We trained Amazon Translate using “Active Custom Translation”, which allows for on-the-fly tuning of translations using a bilingual corpus. Some of the main topics in the training corpus were:

How and when to administer medicines;
Which effects and side-effects to expect from medical treatments;
Setting up experiments for medical research;
Reporting on life science reports.

We used a test set of 2000 segments. After customizing the translation with our training set, the total BLEU score went up by 7 points, from 44.3 to 51.3. There were 825 segments that had some sort of change, of which 600 had a higher BLEU score after customization. The segments whose BLEU score went down changed less on average than the ones whose score went up.
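
A breakdown like this can be produced with a small script. The sketch below assumes you already have, for every test segment, the translation and a segment-level score before and after customization; the function name and the toy numbers are invented for illustration and are not the data from our experiment.

```python
from statistics import mean

def summarize_segment_changes(before, after, score_before, score_after):
    """Count changed segments and split them into improved and worsened."""
    changed = [i for i, (b, a) in enumerate(zip(before, after)) if b != a]
    deltas = [score_after[i] - score_before[i] for i in changed]
    gains = [d for d in deltas if d > 0]
    losses = [d for d in deltas if d < 0]
    return {
        "changed": len(changed),
        "improved": len(gains),
        "worsened": len(losses),
        "mean_gain": mean(gains) if gains else 0.0,
        "mean_loss": mean(losses) if losses else 0.0,
    }

# Invented toy data: four segments, three of which changed after customization.
before = ["segment a", "segment b", "segment c", "segment d"]
after = ["segment a", "segment B", "segment C", "segment D"]
score_before = [50, 40, 30, 20]
score_after = [50, 60, 45, 15]

summary = summarize_segment_changes(before, after, score_before, score_after)
```

Comparing `mean_gain` with `mean_loss` shows whether the segments that got worse moved as far as the ones that got better.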

The changes to the translations came in all different forms, but some types of correction kept coming back. First of all, the training stimulated much more formal language.

The source sentence:

‘Thank you! We will contact you as soon as possible.’

changed from:

Dank je wel! We nemen zo snel mogelijk contact met je op.

to:

Bedankt! We nemen zo spoedig mogelijk contact met u op.

whereas the reference was:

Bedankt, we nemen zo spoedig mogelijk contact met u op.

Note that both ‘u’ and ‘je’ are translations for ‘you’, but ‘je’ is much more informal, and will not be used to address people in a medical setting. ‘As soon as possible’ changed from ‘zo snel mogelijk’ to ‘zo spoedig mogelijk’. Both are correct, but ‘spoedig’ has again a more formal tone that makes it more like what you expect from an organization in the medical field.

Apart from using more formal language, the customized translation also sounded more professional for the medical field. For example:

[Product] is given according to official recommendations.

had the following reference translation:

[Product] wordt toegediend in overeenstemming met officiële aanbevelingen.

The customized translation was exactly the same as the reference translation. ‘Toegediend’ is a translation of ‘administered’, and is preferred over the uncustomized:

[Product] wordt gegeven volgens officiële aanbevelingen.

which uses the more literal ‘gegeven’ for ‘given’. The same is true for the difference in tone between ‘in overeenstemming met’ and ‘volgens’.

Other changes made the translation less ambiguous. For example:

[Substance] was studied in 14 main studies involving over 10,000 patients with essential hypertension.

which, without customization, was translated as:

[Substance] werd bestudeerd in 14 hoofdonderzoeken waarbij meer dan 10.000 patiënten met essentiële hypertensie betrokken waren.

After customization, it was phrased exactly like the reference translation:

[Substance] werd onderzocht in veertien belangrijke studies waaraan meer dan 10 000 patiënten met essentiële hypertensie deelnamen.

Note here that the original sentence uses ‘studied’, in the sense of performing empirical research. The Dutch ‘bestudeerd’ can be used for that as well but is more commonly used for learning from literature, and ‘onderzocht’ has a less ambiguous meaning for scientific research. The same sort of disambiguation is in ‘betrokken’ as translation for ‘involving’: it is a good translation, and actually the most literal. However, ‘deelnamen’ (‘participating’) is better since it means more active involvement in the research. Finally, ‘hoofdonderzoeken’ is a bit strange by implying a sort of hierarchy in studies, whereas ‘belangrijke studies’ is perfectly natural in this context.

A further example is the source sentence:

Hallucinations are known as a side-effect of treatment with dopamine agonists and levodopa.

which translated, without customization, to:

Hallucinaties staan bekend als een neveneffect van behandeling met dopamineagonisten en levodopa.

After customization, it translated to:

Hallucinaties zijn bekend als bijwerking van de behandeling met dopamine-agonisten en levodopa.

Here again, the customized version shows more knowledge of the field. With ‘staan bekend als’, the uncustomized translation suggests ‘Hallucinations are known for being a side-effect of’, hinting that most people know hallucinations only as side-effects of these particular treatments, whereas ‘Hallucinaties zijn bekend als’ simply states that hallucinations are known to occur as a side-effect. It might be subtle, but it is the difference between a good-sounding statement and one that would surprise the reader for the wrong reasons.

As a final example, customization was able to correct an incomprehensible translation in a very concise way. The source was quite awkward:

[Product name] also induced an advance of the time of sleep onset and of minimum heart rate.

The uncustomized translation was:

[Product name] veroorzaakte ook een voorschot van het begin van de slaap en de minimale hartslag.

which suggested some kind of ‘advance payment’ (‘voorschot’) on the start of sleep and on the minimum heart rate. The customized translation removed that financial connotation, and got it right that the product brought forward both the onset of sleep and the time of minimum heart rate:

[Product name] vervroegt ook de tijd van inslapen en van minimale hartfrequentie.

These examples show how much a domain sets the expectations of the language to be used. When more generic translation models are used, that expectation is breached, and that makes reading and understanding the text so much harder.

The BLEU score of a translation is not the kind of metric that immediately feels familiar. It has a maximum of 100% and a minimum of 0%, but apart from that, it is difficult to decide on hard limits for good or bad quality. It’s not recommended to compare values across domains and languages, but it will indicate improvements when applied to the same test translation, as long as the test translation is large enough, and the reference translations are reliable.

When will improvements become noticeable? It is a matter of sensitivity, but improvements of more than 5 points generally make for a noticeably better translation. Not every sentence is better, but the improvement, on the whole, is real and will make for better reading overall.
