This article originally appeared in TAUS Review #3 in April 2015
As any professional translator knows, high-quality translation depends on understanding the context of the source material. This article introduces the concept of preemptive disambiguation, along with an example microformat that can be embedded in online documents to make them easier to translate accurately.
The basic idea behind preemptive disambiguation, or “Pre D”, is to embed information in a document in a format that is hidden from ordinary readers and does not disturb the visible layout, but is visible to machine translation systems and professional translators working on it.
Microformats are an especially attractive way to do this, as they are built on existing web standards and work with web browsers and other tools as they are. For example, let’s say that we want to embed geographical information in an article in a way that enables a program to easily extract this information and know its context. To do this we might say:
<p>The birds <span class="species" style="display:none">Passer domesticus</span> roosted at <span class="geo"><span class="latitude">45.5</span>, <span class="longitude">-122.68</span></span>.</p>
The additional markup is invisible to users, but any program parsing this page will be able to see and extract the geographical information (for example, to display the page in search results overlaid on a map). This general approach is already used to add structure and machine-readable content to regular web pages, and can easily be applied to assist in translation.
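As a rough illustration of what such a program might look like, the Python sketch below pulls the coordinates out of the example paragraph using only the standard library. The names GeoExtractor, extract_geo and SAMPLE are made up for this example, and the sketch assumes each coordinate sits in its own span holding a plain decimal number.

```python
from html.parser import HTMLParser

SAMPLE = (
    '<p>The birds <span class="species" style="display:none">Passer domesticus</span> '
    'roosted at <span class="geo"><span class="latitude">45.5</span>, '
    '<span class="longitude">-122.68</span></span>.</p>'
)


class GeoExtractor(HTMLParser):
    """Collect the numbers found inside latitude/longitude spans."""

    def __init__(self):
        super().__init__()
        self._field = None   # "latitude" or "longitude" while inside such a span
        self.coords = {}     # filled in as the spans are encountered

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in ("latitude", "longitude"):
            if tag == "span" and name in classes:
                self._field = name

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            self.coords[self._field] = float(text)

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_geo(html):
    """Return a dict with whichever of latitude/longitude the page contains."""
    parser = GeoExtractor()
    parser.feed(html)
    parser.close()
    return parser.coords


print(extract_geo(SAMPLE))
```

A search engine or mapping tool could run something like this over every page it crawls; pages without the markup simply yield an empty result.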
Translation Markup Language (TML)
TML, or something like it, is a microformat that can be embedded within documents wherever additional information is required to disambiguate the meaning of a phrase or to provide additional context, style guide hints, or glossary entries. Microformats provide a lightweight way to embed semantic information in a web document. They are already in widespread use, and do not break backward compatibility with existing browsers.
Microformats are appealing because they are simply ignored by applications that don’t understand them. In the example below, TML is used to embed a comment that explains the usage of the word “pipeline” so that it is not interpreted literally. A user viewing this in a standard web browser would not see the remarks, while a user viewing it with a TML-aware application would see the instructions.
Embedding Comments About Context Using TML
A brief comment about the context or meaning of a phrase or sentence is often all that’s needed to assist the translator in reaching a correct interpretation. TML makes it easy to embed otherwise hidden annotations that are only visible to people using translation-aware browsers, editing tools, etc.
Example of TML being used to embed comments about context for translation:
<p>The company’s sales <span class="tml">pipeline<span class="comment" style="display:none">In this context, pipeline refers to potential customers, not a literal pipeline</span></span> is nearly full.</p>
The class “synonyms” is used to attach a list of synonyms to a word or phrase. Both human and machine translators can use it: a machine translation engine, for instance, can use the synonyms to choose the correct sense of a word whose meaning might otherwise not be obvious.
Example of TML synonyms:
<p>In <span class="tml">rough<span class="synonyms" style="display:none">approximate</span></span> numbers, there were 100 people at the conference.</p>
If a phrase belongs to a translation glossary, we can use TML to explicitly reference the glossary entry, as shown in the example below:
<p>Insightly is the leading <span class="tml">CRM<span class="glossary" style="display:none">CRM: Customer Relationship Management <a href="https://companyxyz.translationglossary.com/term/crm">more information</a></span></span> service for Google Apps.</p>
Here we embed both basic information from the glossary entry and a hyperlink for more information. As with other TML tags, this is invisible to ordinary users and visible only to people using TML-aware tools.
Implications For Web Authoring Tools
Adding support for TML, both to authoring tools and translation tools, will require minimal effort thanks to the simplicity of the microformat pattern. For authoring tools, the basic goal is to encourage authors to provide additional information whenever there is uncertainty about the usage or meaning of a phrase.
Most authoring tools now include fairly sophisticated grammar checking, spellchecking, and built-in dictionaries. For these tools, it should be straightforward to add a pop-up dialog that is triggered whenever the author types a word or phrase that has multiple meanings or is otherwise ambiguous. The pop-up would ask for a list of synonyms, a glossary entry, and an optional free-form comment. If the author enters any of these, the tool would insert TML as shown in the examples above. Because the change is small, this functionality could be added to a wide variety of authoring tools if the microformat is adopted.
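To give a sense of how small the insertion step could be, here is a minimal Python sketch of the function an authoring tool might call once the author has filled in the dialog. The function name tml_span and its parameters are hypothetical, not part of any existing tool. It writes the class as lowercase "tml" and hides annotations with display:none, the CSS declaration that removes an element from the rendered page.

```python
from html import escape


def tml_span(phrase, comment=None, synonyms=None, glossary=None):
    """Wrap `phrase` in a TML span carrying whichever hidden annotations are given."""
    hidden = []
    if comment:
        hidden.append(("comment", comment))
    if synonyms:
        hidden.append(("synonyms", ", ".join(synonyms)))
    if glossary:
        hidden.append(("glossary", glossary))
    if not hidden:
        # Nothing to annotate: return the (escaped) text without any markup.
        return escape(phrase)
    inner = "".join(
        f'<span class="{cls}" style="display:none">{escape(text)}</span>'
        for cls, text in hidden
    )
    return f'<span class="tml">{escape(phrase)}{inner}</span>'


print(tml_span("rough", synonyms=["approximate"]))
```

Escaping the author's input keeps a stray angle bracket or ampersand in a comment from breaking the surrounding page.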
Implications For Translation Tools
It will likewise be easy to add TML support to translation editing tools, which simply need to look for <span class="tml">...</span> segments and then extract the hidden information within them. This is straightforward HTML/XML parsing.
More importantly, because authors will be encouraged to embed information in the document as they write it, the translator will typically have much better information to work from when composing or post-editing translations. Currently this information has to be obtained offline, typically via back-and-forth email conversations.
Machine translation engines will also be able to use the synonyms element to better guess at the intended meaning of a word or phrase, which may be especially useful for rule-based translation engines when they encounter ambiguous words or phrases.
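The parsing step might look something like the following Python sketch, which uses only the standard library. TMLExtractor and extract_tml are illustrative names, and the sketch makes some simplifying assumptions: TML spans are not nested inside one another, class names are compared case-insensitively as a precaution, and link targets inside a glossary entry are ignored (only the text is kept).

```python
from html.parser import HTMLParser

ANNOTATION_CLASSES = {"comment", "synonyms", "glossary"}

SAMPLE = (
    "<p>The company's sales "
    '<span class="tml">pipeline<span class="comment" style="display:none">'
    "In this context, pipeline refers to potential customers, not a literal pipeline"
    "</span></span> is nearly full.</p>"
)


class TMLExtractor(HTMLParser):
    """Collect the phrase and hidden annotations of each TML span."""

    def __init__(self):
        super().__init__()
        self._stack = []        # role of each open <span>: "tml", an annotation class, or None
        self.annotations = []   # one dict per TML span, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = {c.lower() for c in (dict(attrs).get("class") or "").split()}
        role = None
        if "tml" in classes:
            role = "tml"
            self.annotations.append({"phrase": ""})
        elif "tml" in self._stack:
            role = next(iter(classes & ANNOTATION_CLASSES), None)
        self._stack.append(role)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if "tml" not in self._stack:
            return  # ordinary page text, not part of any annotation
        # The innermost span with a known role decides where the text belongs.
        role = next(r for r in reversed(self._stack) if r)
        key = "phrase" if role == "tml" else role
        entry = self.annotations[-1]
        entry[key] = entry.get(key, "") + data


def extract_tml(html):
    """Return a list of dicts such as {"phrase": ..., "comment": ...}."""
    parser = TMLExtractor()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in entry.items()} for entry in parser.annotations]


for entry in extract_tml(SAMPLE):
    print(entry)
```

A translation editor could show each extracted entry as a tooltip or side note next to the phrase it annotates.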
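One way a rule-based engine might use the synonyms hint is sketched below in Python. The SENSES table is a toy, hand-written sense inventory (English to Spanish) invented for this illustration, and pick_sense is a hypothetical helper; a real engine would draw on its own lexicon and a far richer scoring scheme.

```python
# Toy sense inventory, invented for illustration only.
SENSES = {
    "rough": [
        {"gloss": "uneven surface", "synonyms": {"coarse", "uneven", "bumpy"}, "es": "áspero"},
        {"gloss": "not exact", "synonyms": {"approximate", "estimated"}, "es": "aproximado"},
    ],
}


def pick_sense(word, tml_synonyms):
    """Choose the sense whose synonyms overlap most with the author's hints.

    With no overlap at all, the first listed sense acts as the default.
    Returns None for words missing from the inventory.
    """
    candidates = SENSES.get(word.lower(), [])
    if not candidates:
        return None
    hints = {s.strip().lower() for s in tml_synonyms}
    # max() keeps the first candidate on ties, so the default sense wins when
    # the hints match nothing.
    return max(candidates, key=lambda sense: len(sense["synonyms"] & hints))


print(pick_sense("rough", ["approximate"])["es"])
```

Without the hint, the engine falls back to its default sense of “rough”; with the author's synonym, it selects the “not exact” sense instead.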
Implications for TAUS
TML, or a microformat like it, is an example of where TAUS could play a leading role. As a microformat, it doesn’t require large changes to the existing web toolchain, just small changes to the tools that need to be aware of it. With its relationships throughout the IT industry, TAUS is well positioned to bring a microformat like this to fruition.