Translation Quality - a Hygiene Factor that Needs to be Done Right

We rarely think of linguistic quality when it is done right, because then it is not necessarily seen as a differentiator or a success factor. It is when the mark is missed that quality takes back the spotlight. This was one of the opening thoughts of the TAUS QE Summit, offered by James Douglas (Microsoft) at the Microsoft premises in Redmond.

With a group of around fifty experts for whom quality never leaves the spotlight, we looked at the quality standards that are most likely to catch on and sustain themselves long-term, and at the correlation between the metrics used to ensure quality and customers’ perception of product quality. Finally, we questioned whether the LSP model is fit for the future, in an attempt to predict the next big step for the language industry and its supply chain.

This report is meant to highlight some new and old quality-related challenges and potential solutions, as raised by the participants of the event. We formed multiple focus groups during the day to draft takeaways and industry-wide action points around the four main topics: business intelligence, user experience, risk and expectation management and DQF Roadmap planning.

Automation and Standards

In recent years, the industry focus has been on automation and datafication of workflows. Without standards and common agreements, however, we can only get so far. The caveat of an automated system is that it is not aware of how the output is trending. TAUS Director Jaap van der Meer emphasized the lack of common ways of measuring quality output in the industry. One needs to be able to augment the internal processes while preserving the customer-centric approach to product quality. With a tool like TAUS DQF, that capability is there, in real time and including the productivity results.

The short-term roadmap for DQF includes developing My DQF Toolbox – a feature that allows users to customize their reporting environment and correlate different data points:

productivity vs. edit density
productivity vs. error density

On top of that, the benchmarking feature will be expanded to allow internal benchmarking by vendor, customer, MT engine, content type, etc., explained Dace Dzeguze, DQF Product Manager at TAUS.
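To make the idea of correlating data points such as productivity and edit density concrete, here is a minimal sketch that computes a Pearson correlation coefficient. It is not how DQF calculates anything; the function name, the units and all figures are assumptions invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-project figures (invented, not taken from DQF).
productivity = [450, 520, 610, 700, 820]        # words per hour
edit_density = [0.42, 0.35, 0.30, 0.22, 0.15]   # share of the text that was edited

r = pearson(productivity, edit_density)
print(f"productivity vs. edit density: r = {r:.2f}")
```

A coefficient close to -1 would suggest that projects with fewer edits are also the faster ones, while a value near 0 would suggest the two measures move independently. Either way it describes an association, not a cause.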

Alan K. Melby tied into the conversation about common agreements, reminding us of the importance of industry standards and the outstanding work being done by the standards body ASTM International (LTAC Global). Of the quality metrics that have been most used so far (SAE J2450, LISA QA, MQM and DQF), he sees only two that will continue to gain adoption: TAUS DQF (harmonized with MQM) and DFKI MQM. The harmonized DQF-MQM error typology has now gained enough momentum to win true industry acceptance and has been included in the new ASTM standard WK46396 “New Practice for Development of Translation Quality Metrics”.

Business Intelligence: Data-informed vs Data-driven

Data and dashboards have made their way into the translation industry; now it is a matter of making sure that they are properly interpreted. Arle Lommel (CSA) led the conversation with Mark Lawyer (SDL), Scott Cahoon (Dell) and Patricia Paladini (CA Technologies) on the shift to business intelligence and process monitoring. He stressed the difference between metrics and KPIs: a metric tells you something fundamental about the organization, while KPIs tell you whether you are meeting the business objectives. He noted that even with vast amounts of data, very few organizations have effective KPIs. So, how can business intelligence help here? Where can we expect to see its benefits, when will we be able to predict quality ahead of time, and what are the best practices around data privacy?

Business intelligence (BI) will reach its full potential in the future, projecting the impact of linguistic quality on revenue, predicted Mark. Scott added that at Dell they see a lot of potential in the DQF model: today we rely a lot on humans to understand the input and output. The future is to build BI into the existing infrastructure and gain intelligence about the vendors, to understand which translations or languages should go to which vendor. At CA Technologies, they are moving away from a traditional model that looks at errors to one that looks at improvements instead, emphasized Patricia. They focus their business intelligence efforts on understanding the user experience.

In continuous localization, you can get signals ahead of time, stated Mark. For SDL, real-time quality prediction requires close real-time collaboration and metrics on both sides, client and service provider. At Dell, they approach prediction on the task level with DQF, comparing how translation is moving through their system against the agreed service level agreements. They believe that this unbiased methodology can advance the growth of the industry as a whole. At CA Technologies, the real-time data they are after should show where they overtranslate and where they fail to translate when they should.

When it comes to data sharing and privacy, translators should be motivated to share their data, and the system should be looking at averages, not at good or bad days, said Patricia. Buyers need to have a better understanding of the service that they are buying, which means access to more data from LSPs. What we are really buying from LSPs is their ability to find, manage and maintain a strong pool of translators, added Scott. It might be necessary to reevaluate the LSP model, as disintermediation is the direction in which the industry is moving. Vendors should look for other opportunities and services to offer besides trying to line up enough translators to do a certain job, Patricia concluded.

Takeaways:

Proliferation of data makes it harder to focus on what matters, while the supply chain is generally fragmented with little oversight. There is a need for more insights sharing across the industry and for having a human in the loop to do sanity checks.
Continuous localization is only possible if we move away from offline work.
From the buyers’ perspective, the lack of accountability for individual translators and of access to data from LSPs might lead to a resurgence of in-house translation teams.
There is room for LSPs to expand on and enrich their services.
There needs to be more visibility on the quality metrics and calculations behind them, so that we can understand better what they tell us about quality.
Data sharing is crucial for industry growth; we should look for ways to collect the data that still respect translators’ privacy and autonomy.
DQF has the potential to become a marketing platform for the industry and translators. The value of the DQF data is in the network effect and its exponential growth.

User Experience: The Ultimate Judge of Quality

In the always-on economy, it is really the user experience, more than anything else, that determines quality. In the opening conversation introduced by Glen Poor (Microsoft), Katka Gasova (Moravia), Vincent Gadani (Microsoft), and Andy Jones (Nikon) shared their experiences around collecting and managing user-generated feedback and determining when the quality meets customer expectations and requirements.

There are multiple aspects that one can look at to assess whether target content will be successful: translation quality, quality of the language attributes, verity of the cultural attributes and non-linguistic elements, etc. However, these are all analytical approaches that assess compliance with the language specifications/requirements, rather than capturing the individual emotional experience. What we need, explained Katka, is a more holistic approach - target content assessment at the ‘macro level’. The same approach should also be followed when choosing the appropriate linguistic quality programs.

One of the biggest challenges is that evaluation has traditionally happened on the product level and not on the language level. The issues are mostly on the functionality side, so the question to ask is whether translation degrades the user experience and whether it introduces any additional friction.

At Microsoft, they measured language quality with a user survey, using a 5-point symmetrical Likert scale and carefully crafted survey questions and translations. They collected responses from hundreds of thousands of users, which helped them calculate an NLQS, a net language quality score (similar to NPS, the net promoter score), for 50 languages.
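The report does not give the formula behind NLQS. As a rough illustration of how an NPS-style score could be derived from 5-point Likert responses, the sketch below assumes that ratings of 4 and 5 count as positive, 1 and 2 as negative, and 3 as neutral. The function name, the thresholds and the sample responses are all assumptions, not Microsoft's actual method.

```python
def net_language_quality_score(ratings):
    """NPS-style net score, in percentage points, from 5-point Likert ratings.

    Assumption: 4-5 are positive, 1-2 are negative, 3 is neutral.
    The exact NLQS formula is not given in the source.
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    positive = sum(1 for r in ratings if r >= 4)
    negative = sum(1 for r in ratings if r <= 2)
    return 100.0 * (positive - negative) / len(ratings)


# Hypothetical survey responses per language (invented for illustration).
responses = {
    "de-DE": [5, 4, 4, 3, 2, 5, 4, 1, 5, 4],
    "ja-JP": [3, 2, 4, 3, 1, 2, 5, 3, 2, 4],
}
scores = {lang: net_language_quality_score(r) for lang, r in responses.items()}
for lang, score in scores.items():
    print(f"{lang}: {score:+.1f}")
```

Like NPS, a score of this kind ranges from -100 to +100, so languages can be compared on a single number while the neutral middle of the scale is left out of the balance.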

Takeaways:

Understanding the user experience requires a holistic approach; it can’t be judged based on segments.
The question to ask when deciding on whether a quality program makes sense: does the money spent on the program give a solid confidence that the target output will contribute to the user experience and product relevance in the given market?
Running a user survey may not necessarily give you actionable feedback, but with analysis you will get valuable insights. Tips for a successful user survey: understand the survey experience end-to-end, make it language specific, and choose the right error margin.
There is a big opportunity in finding ways to connect the translator with the customer feedback. LSPs in particular lack full visibility into the customer feedback, so this is where buyers and LSPs should work together more.
One way to get into the mind of the end-user is to look at the source content and understand the writer’s intention in terms of user experience, so that this intention can inform the translation.

Risk and Expectation Management

How can industry players manage quality and pricing risks and expectations effectively in the era of MT? In this topic, introduced by Scott Cahoon (Dell), Dalibor Frivaldsky (Memsource) and JP Barazza (Systran) shared their experiences with pre- and in-production assessment and evaluation tools and methods.

Scott opened the conversation by explaining the two different models they have at Dell. One is a traditional pre-production scenario with vendors assessing and retraining engines and deciding when the NMT is ready for prime time. The second is in-production evaluation with DQF that involves turning all languages on and running all throughput data through DQF to measure the performance of the engines and see which of the languages are ready now, and which need to be tuned further.

Dalibor linked to that conversation by sharing interesting data that they have at Memsource: almost 35% of translation output is of high quality, thanks to translation memories (TMs). The usefulness of a TM is limited to cases where the same or similar text is translated. Artificial Intelligence (AI) really comes into play with non-translatables. Automatically identifying segments that need no translation is the first patented AI-based feature developed by Memsource, and it is supported in 219 language pairs. In addition, to cater for real-time assessment of MT quality, Memsource has enhanced their TMS with a machine translation quality estimation (MTQE) feature that adds a score to machine output before post-editing.

The focus at Systran is on an infinite, perpetual training model that includes quality management and measurement, explained JP. It happens as they keep feeding in new data, with automated systematic testing. Every training iteration is tested, while the test sets are curated so that they never get into the training corpora. For quality evaluation, Systran uses a simple and intuitive human review that evaluates the adequacy of the translation against the source sentence. The result is a score on a scale of 0-100%, based on the portion of sentences judged better than, equal to or worse than human translation. The perfect machine translation is the one where the final translation is exactly the same as the pre-translated content.

Takeaways:

Assessing the quality of the MT output can be done better with contextually aware quality evaluation.
A good sample size for testing when a customer is looking for validation of the customization is 200 sentences per language pair.
The value of generic MT has diminished, and it has to become a commodity. It is customizing it with your own, private data that adds value.
All systems need a good feedback loop.
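To give a feel for what a test set of 200 sentences per language pair can and cannot show, the sketch below estimates the margin of error of a measured proportion, for example the share of sentences a reviewer judges acceptable. It rests on several assumptions that do not come from the summit: the sentences are a simple random sample, the normal approximation holds, and a 95% confidence level (z = 1.96) is wanted. The function name is hypothetical.

```python
from math import sqrt


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    proportion=0.5 is the worst case (widest interval); z=1.96 corresponds
    to a 95% confidence level. Both defaults are assumptions of this sketch.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return z * sqrt(proportion * (1.0 - proportion) / sample_size)


moe_200 = margin_of_error(200)
print(f"n=200: +/- {100 * moe_200:.1f} percentage points")
```

Under these assumptions, 200 sentences give a worst-case margin of roughly 7 percentage points. That is enough to separate a clearly better engine from a clearly worse one, but not to resolve small differences between two customizations.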

DQF Roadmap Planning and Transcreation

At the QE Summit at Microsoft in Dublin on 11 April 2018, the community assigned TAUS the task of developing a best practice for transcreation. At the QE Summit in Redmond, we consulted the participants on a first draft of this best practice and opened the floor for other ideas on DQF features to be developed by TAUS.

Transcreation has no agreed-upon, teachable guidelines and is mostly done independently or internally, explained Manuela Furtado (Alpha CRC). She added that measuring quality no longer relies on source and target, but on the end-user and on likes, clicks, sales, etc. The ultimate goal is to create a new text. Although it might introduce more complexity into the already complex field of global content creation and quality evaluation, transcreation represents a new niche for human ingenuity and creativity - it is a part of the content evolution in the digital era. As an industry, we need to have a common understanding of, and agreement on, what is expected of this new translation format and what it should be measured against. TAUS is working together with a board consisting of multiple companies on formulating the Transcreation Best Practices.

Takeaways:

There is a thin line between transcreation and copywriting that needs to be well defined so that the scope and the expectations around skill sets can be set right.
For transcreation to become a full-fledged service, there needs to be a shift in focus from efficiency to creativity.
The DQF Dashboard has the potential to become a marketplace for translations that both buyers and vendors could use, with built-in metadata on resources covering quality, throughput and availability.
DQF should be further expanded to include speech data.

Highlights from the TAUS QE Summit 2018: This report is meant to highlight some new and old translation quality related challenges and potential solutions around the four main topics: business intelligence, user experience, risk and expectation management and DQF Roadmap planning.