The Data + Signals Cycle in Translation



TAUS has just published a new report on the role of language data in the AI paradigm: LD4AI. The report explores the origins and scale-up of language data moderation in translation pipelines that are driven by machine learning and supported by “humans in the loop.” One finding is that large-scale data management will expand the kinds of jobs required. In this respect, it may be useful to understand how language also acts to produce data beyond the translation moment. This will likely foster new types of work for language professionals. Let's look a little closer at why “language data” is a richer concept than you might think.

Language becomes data in two distinct ways – let’s call them HLD (Human Language Data) and DLD (Digital Language Data).

  1. Human LD is the visual or acoustic content that humans produce and consume as physical words on screens and paper or transmitted through the air as speech. All these stack up historically into a mass of recordings we naturally now call data or content. But in a networked economy, HLD also becomes a source of secondary data because of the way it signals socio-cognitive meanings. More below.
  2. Digital LD is specifically digital, ultimately consisting of vectors of numbers representing various linguistic dimensions of HLD that are used to prime a machine learning algorithm inside an AI software program. In fact it is the formal twin of HLD, used to teach a machine, rather than acting interactively as a communication medium.
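
To make the HLD/DLD distinction concrete, here is a minimal sketch of turning sentences (HLD) into vectors of numbers (DLD). It uses a simple bag-of-words count, which is an assumption chosen for brevity; real MT systems use learned embeddings rather than raw counts. The function names and sample sentences are hypothetical.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Map each distinct lowercase token to a fixed index (alphabetical order)."""
    tokens = sorted({tok for s in sentences for tok in s.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_vector(sentence, vocabulary):
    """Turn one sentence into a vector of token counts (bag of words)."""
    counts = Counter(sentence.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]

# Human language data: words as people read them (illustrative samples).
hld = ["the cat sat", "the dog sat down"]

# Digital language data: the same sentences as vectors of numbers.
vocab = build_vocabulary(hld)
dld = [to_vector(s, vocab) for s in hld]
print(dld)  # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Each position in a vector stands for one vocabulary word (cat, dog, down, sat, the), so the machine sees only numbers where a reader sees a sentence.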

When we use DLD to drive a translation process, we select a body of bilingual text and train an algorithm to find patterns in it, so that the machine can then help produce target-language text from a new source text in the same language pair.
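
As a rough illustration of "seeking patterns" in bilingual text, the sketch below learns a tiny word lexicon from sentence pairs by scoring how often source and target words co-occur (a Dice coefficient). This is a deliberately simple statistical stand-in, not how neural MT works, and the training pairs and function name are invented for the example.

```python
from collections import Counter
from itertools import product

def learn_lexicon(pairs):
    """Pick, for each source word, the target word it co-occurs with most strongly."""
    src_counts, tgt_counts, cooc = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        cooc.update(product(src_words, tgt_words))

    lexicon = {}
    for s in src_counts:
        def dice(t):
            # Dice coefficient: shared occurrences relative to total occurrences.
            return 2 * cooc[(s, t)] / (src_counts[s] + tgt_counts[t])
        lexicon[s] = max(tgt_counts, key=dice)
    return lexicon

# Hypothetical bilingual training data (English to French).
training_pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

lexicon = learn_lexicon(training_pairs)

# Apply the learned patterns to a new source text in the same language pair.
translation = " ".join(lexicon.get(w, w) for w in "a car".split())
print(translation)  # une voiture
```

The phrase "a car" never appears in the training pairs, yet the learned patterns are enough to produce a target-language version of it.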

To improve quality and machine-readability, we first clean up and enrich the source data by tagging phenomena such as untranslatables or ambiguous expressions, correcting any unwanted gender or racial references, annotating named entities, and so on. This human-moderated source data is then ready for the machine to learn from and to translate appropriately into another language. Data moderation therefore optimizes DLD for a machine learning or AI operation.
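
The kind of tagging described above might be sketched as follows. The term lists, label names, and the `annotate` function are all hypothetical; in practice a human moderator would curate the lists and review every flag rather than rely on simple word matching.

```python
import re
from dataclasses import dataclass

# Hypothetical term lists, for illustration only.
UNTRANSLATABLES = {"TAUS", "LD4AI"}
GENDERED_TERMS = {"chairman", "stewardess"}

@dataclass
class Annotation:
    label: str   # what kind of phenomenon was tagged
    start: int   # character offset where the tagged word begins
    end: int     # character offset just past the tagged word
    text: str    # the tagged word itself

def annotate(segment):
    """Tag untranslatable terms and flag gendered terms for human review."""
    annotations = []
    for match in re.finditer(r"\w+", segment):
        word = match.group()
        if word in UNTRANSLATABLES:
            label = "UNTRANSLATABLE"
        elif word.lower() in GENDERED_TERMS:
            label = "REVIEW_GENDER"
        else:
            continue
        annotations.append(Annotation(label, match.start(), match.end(), word))
    return annotations

result = annotate("The chairman presented LD4AI at TAUS.")
for a in result:
    print(a.label, a.text, a.start, a.end)
```

Storing character offsets rather than altering the text keeps the original segment intact, so the annotations can be reviewed or discarded without touching the source.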

Data as Signals 

Back in the social world of encounters between content, people and language, that same translated content will have a particular impact on each human reader. For them, language is not the mass of word embeddings and vectors familiar to neural MT engineers; it is a medium for messaging in a specific human tongue for some further purpose: informing, engaging, seducing, evaluating, making decisions, entertaining. And telling lies.

Human language, in other words, is always grounded in speech acts that create various psychological effects. And in today’s online life, readers’ and listeners’ reactions (slowing down or speed-reading a text, hesitating over an unknown word, eyeballing a certain proper name for more than two seconds, “liking” or requoting the text, and so on) all become useful new data for the publisher of that text. These reactions are not the producer’s language data in our LD4AI sense, but information about a receiver’s behavior that signals attitude and sentiment, engagement or rejection. Surveillance data, if you prefer, though the term has dark connotations.
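
One way to picture reader reactions as data is the sketch below, which records reactions as events and counts the ones that qualify as signals. The event kinds, field names, and `summarize_signals` function are assumptions made for illustration; only the two-second dwell threshold comes from the example above.

```python
from collections import Counter
from dataclasses import dataclass

# Threshold taken from the "more than two seconds" example in the text.
DWELL_THRESHOLD_SECONDS = 2.0

@dataclass
class ReaderEvent:
    kind: str             # "dwell", "like", or "requote" (hypothetical labels)
    target: str           # the word or passage the reader reacted to
    seconds: float = 0.0  # dwell time; unused for likes and requotes

def summarize_signals(events):
    """Count reactions per (target, kind), ignoring dwells at or below the threshold."""
    signals = Counter()
    for event in events:
        if event.kind == "dwell" and event.seconds <= DWELL_THRESHOLD_SECONDS:
            continue
        signals[(event.target, event.kind)] += 1
    return signals

# Invented sample events from readers of one translated text.
events = [
    ReaderEvent("dwell", "TAUS", 2.6),
    ReaderEvent("dwell", "report", 0.4),
    ReaderEvent("like", "paragraph-3"),
    ReaderEvent("dwell", "TAUS", 3.1),
]

summary = summarize_signals(events)
print(summary[("TAUS", "dwell")])  # 2
```

The short glance at "report" is dropped, while the two long looks at "TAUS" are counted as a signal the publisher could act on.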


Signals as Data

Surveillance of this type is a constant in our own conversations: we instinctively scan each other’s faces and body language to spot signs of assent, discord, doubt, collusion, or rejection. We have evolved to be alert to unusual word choices, voice tones, hesitations. When we scan a Tweet we note tell-tale signs in the humor, register, misspellings, or word choices. Not all these micro-signals are encoded clearly in the language, but they are easily inferred from the overall communicative experience. Indeed, one of the distinguishing marks in digital network life as a whole has been the automation of surveilling content for signs that produce useful data for other uses. This is especially true for our acts of speaking and writing, reading and listening. Even silence can speak volumes...

So now that content owners, marketers, communicators, and internauts globally are all able to elicit more insights from tracking the reactions of reader-users to the varied signals encoded in acts of language, they will inevitably attempt to control the game by designing forms of language communication that augment the desired signals. The aim is to optimize such audience reactions, even weaponize them. This applies not only to written text but, even more effectively, to the spoken language now spreading through all our new voice channels. This form of DLD will also expand the range of potentially translatable content.

As part of this transition to LD4AI, therefore, we are entering a virtuous circle of mutual reinforcement between data and signals. Translation suppliers are already providing language data moderation services to better inform the machines that speak, write and translate; their journey may soon include harvesting new types of speech, signed and text data derived from human reactions to their clients’ translated content as well. 



Long-time European language technology journalist, consultant, analyst and adviser.
