In another article, we discussed automatic machine translation (MT) evaluation metrics such as BLEU, NIST, METEOR, and TER. These metrics assign a score to a machine-translated segment by comparing it to a reference translation, which is a verified, human-generated translation of the source text. As a result, they are easy to understand and work with, but severely limited by their reliance on references, which are not always available in a translation production setting.
What is Quality Estimation?
Quality estimation (QE), on the other hand, is a method for predicting the quality of a given translation rather than assessing how similar it is to a reference segment. As defined in Machine translation evaluation versus quality estimation (Specia et al., 2010), the task “consists in estimating the quality of a system’s output for a given input, without any information about the expected output”.
While traditional MT evaluation metrics are primarily based on text comparison, quality estimation uses machine learning (ML) methods to assign quality scores to machine-translated segments. Depending on the level of textual granularity at which QE is performed, its results can be evaluated either with standard classification metrics (e.g. the F-measure) or in terms of how well the predicted scores correlate with gold-standard quality annotations.
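To make the two evaluation modes concrete, here is a minimal sketch of both: an F-measure over binary word-level labels and a Pearson correlation between predicted and gold sentence-level scores. The function names and label values are illustrative, not part of any particular framework.

```python
from statistics import mean

def f1_binary(gold, pred, positive="BAD"):
    """F-measure for binary word-level QE labels (OK/BAD)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pearson(xs, ys):
    """Pearson correlation between predicted and gold quality scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

In shared-task settings, word-level systems are typically ranked by an F-measure over the tags, while sentence-level systems are ranked by the correlation of their scores with the gold standard.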
Current approaches rely almost exclusively on deep learning architectures based on artificial neural networks, but methods that make use of rich, linguistically-informed feature sets also exist. Since QE makes it possible to estimate translation quality based on learned correspondences between source and target segments, it has become an active and exciting field of research in natural language processing (NLP).
There are several possible applications for quality estimation in the translation industry apart from the obvious choice of using the method to evaluate the performance of an MT algorithm. For instance, QE can select the best translation from the output of multiple systems, making it possible to compare different translation algorithms using a single metric. Moreover, it can be used to classify translated segments into GOOD or BAD categories so that those of insufficient quality can be marked for human post-editing.
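The post-editing use case can be sketched in a few lines: segments whose predicted quality falls below a threshold are routed to human post-editors, while the rest pass through. The threshold value and function name here are illustrative assumptions.

```python
def route_segments(segments, scores, threshold=0.7):
    """Split machine-translated segments into GOOD (publishable as-is)
    and BAD (flagged for human post-editing) based on a predicted QE score.
    The 0.7 threshold is illustrative; real values are tuned per use case."""
    good, bad = [], []
    for seg, score in zip(segments, scores):
        (good if score >= threshold else bad).append(seg)
    return good, bad
```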
Levels of Granularity
Quality estimation can be performed at various levels of text, including that of words, phrases, sentences, and even entire documents.
At the word level, QE is concerned with predicting binary labels for words based on whether they were translated correctly or not. Errors related to missing words are also taken into consideration.
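Word-level QE output is commonly represented as one OK/BAD tag per target word, plus "gap" tags between words (and at the edges) to flag positions where a word is missing. The example below is a hand-made illustration of that representation, not the output of a real system.

```python
# Illustrative word-level QE annotation for a target sentence from which
# the MT system dropped the words "on the".
target    = ["the", "cat", "sat", "mat"]
word_tags = ["OK", "OK", "OK", "BAD"]        # one tag per target word
gap_tags  = ["OK", "OK", "OK", "BAD", "OK"]  # len(target) + 1 gap positions;
                                             # BAD marks the missing words

assert len(word_tags) == len(target)
assert len(gap_tags) == len(target) + 1
```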
Phrase-level QE aims to predict the quality of translated phrases and is derived from word-level results. Incorrect word order and missing words can also be taken into account for more fine-grained results.
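Deriving phrase labels from word-level tags can be as simple as the sketch below, which marks a phrase BAD if any word inside it is BAD. This aggregation rule is an assumption chosen for illustration; real systems may use other schemes.

```python
def phrase_labels(word_tags, phrase_spans):
    """Derive phrase-level QE labels from word-level OK/BAD tags.
    phrase_spans is a list of (start, end) index pairs over the word tags;
    a phrase is BAD if it contains at least one BAD word."""
    return ["BAD" if "BAD" in word_tags[i:j] else "OK" for i, j in phrase_spans]
```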
Sentence-level QE aims to assign a score to a translated sentence based on the number of words that need to be changed to turn it into an acceptable translation. In practice, this involves predicting the translated segment’s TER (or HTER) score, which expresses the edit distance between the segment in question and a gold-standard reference; in the case of HTER, the reference is a human post-edit of the MT output itself.
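As a rough sketch of the quantity being predicted, the function below computes a simplified HTER: word-level edit distance between the MT output and its post-edit, divided by the post-edit length. Note that true TER/HTER also counts block shifts of word sequences, which this simplification omits.

```python
def hter(hypothesis, post_edit):
    """Approximate HTER: word-level Levenshtein distance (insertions,
    deletions, substitutions) between an MT output and its human post-edit,
    normalized by the post-edit length. Real TER/HTER additionally counts
    block shifts, which are omitted here for brevity."""
    h, r = hypothesis.split(), post_edit.split()
    # Classic dynamic-programming edit distance over words.
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        curr = [i]
        for j, rw in enumerate(r, 1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(r)
```

A sentence-level QE model is trained to predict this number directly from the source and target texts, without ever seeing the post-edit at inference time.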
Finally, document-level QE aims to detect errors in longer machine-translated texts and assign a quality score based on their type and severity. Accuracy, fluency, and style can be measured, along with the impact of the detected errors on the overall meaning of the translated document.
Quality Estimation Frameworks
Developing quality estimation frameworks requires considerable technical expertise and knowledge about machine learning methods. Fortunately, a number of such frameworks have been made publicly available in recent years, allowing industry stakeholders and researchers to experiment with various methods without having to build their own solutions.
QuEst, one of the first open-source QE frameworks, relies heavily on linguistic and textual features. Its feature extraction module can extract a variety of attributes from text at three levels of granularity. These are then passed to the machine learning module, which uses them to build a robust QE model.
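To give a flavour of the feature-based approach, here is an illustrative subset of simple surface features similar in spirit to a baseline feature set. The function and the specific features are our own assumptions for demonstration, not QuEst’s actual API; the real framework extracts many more attributes, including language-model and alignment-based ones.

```python
def baseline_features(source, target):
    """A handful of surface-level QE features (illustrative subset only)."""
    s, t = source.split(), target.split()
    return {
        "source_length": len(s),                              # tokens in source
        "target_length": len(t),                              # tokens in target
        "length_ratio": len(t) / max(len(s), 1),              # target/source ratio
        "avg_source_token_len": sum(map(len, s)) / max(len(s), 1),
    }
```

Feature dictionaries like this one would then be fed to the ML module, which learns to map them to quality scores.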
The QUETCH framework consists of an artificial neural network that learns QE features from raw textual input alone. However, it only supports the word-level task.
On the other hand, deepQuest was conceived for document-level QE, but it can also be used at other levels of text. The framework can generalize from word and sentence-level QE results to the document level. It provides two different neural ML architectures and it is significantly faster to train than comparable models.
Among other recent QE frameworks, OpenKiwi stands out as perhaps the most accessible and comprehensive implementation. It contains four ML architectures that can perform quality estimation at different levels of granularity, a number of pre-trained models, and various ways to adjust data and experimental settings. It can be used either as a Python package or directly from the command line.
Challenges
Despite a series of successful attempts at developing quality estimation algorithms in recent years, significant challenges remain. According to a 2020 paper on recent advances in the field, QE models trained on publicly available datasets may simply be guessing translation quality rather than estimating it. The authors (Sun et al.) attribute this to inherent flaws in current QE datasets: because of them, the resulting models ignore the semantic relationship between translated segments and their source texts, and therefore fail to make correct judgments about translation adequacy.
Given its potential applications, quality estimation remains an active field of research in the NLP community. As a method that does not require access to reference translations, it may very well become a standard evaluation tool for translation and language data providers in the future.
NLP Research Analyst at TAUS with a background in linguistics and natural language processing. My mission is to follow the latest trends in NLP and use them to enrich the TAUS data toolkit.