This article originally appeared in TAUS Review #3 in April 2015

Rather than being a comprehensive catalogue of translation-related data, this article generalizes approaches to gathering and curating data, especially for Asian translation demands. Before going into details, let’s review the famous hierarchy of data, information, knowledge, understanding, and wisdom (DIKUW). We will then morph this hierarchy into a circle, a positive feedback loop, in the hope of shedding some light on the path toward bigger and smarter data.

In 1989, the American organizational theorist Russell Ackoff published the paper “From Data to Wisdom” in the Journal of Applied Systems Analysis, proposing the DIKUW hierarchy:

  • Data: symbols;
  • Information: answers to “who”, “what”, “where”, and “when” questions;
  • Knowledge: answers to “how” questions;
  • Understanding: answers to “why” questions;
  • Wisdom: evaluated understanding.

The hierarchy is usually depicted as a pyramid, sometimes with an implication of time or value: the earlier, rawer, and easier the data, the less valuable it is. It is quite intriguing to set the current trend of big data against this particular perception: data is now valued not only by its size but also by its swiftness. Thanks to the Internet, this seems attainable, if not self-contradictory. The question, however, is whether the urge for big data really contradicts DIKUW.

Perhaps whether the two concepts are compatible is more a matter of context. Once we drop the desire for a universal theory, it becomes clearer that, at the very least in the context of translation, disambiguation is no doubt a crucial part. The doubted compatibility can thus be transformed into a definition problem: when we talk about data, are you thinking what I am thinking?

In the shared field of translation research and industry, the data in demand is considerably more sophisticated than mere symbols. Unlike the common big-data stories of today, the data required for translation is not just some search results or server logs; hence the terms “corpus”, “bi-text”, and “translation memory”, along with the actions “curation”, “alignment”, and “annotation”.

In other words, in the phrase “translation-related data”, the word “data” alone denotes just a pile of raw ingredients, such as text, images, audio, or even video files, while the whole phrase implies additional information: Who is involved with the data? What is the data about? Where does the data come from? When did the data originate? Here comes a new quest: how do you acquire that information?
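Before turning to acquisition, here is a minimal sketch, in Python, of what such a record might look like once the raw ingredient is wrapped with that information; every field name and value below is invented for illustration, not any standard schema:

    # An illustrative wrapper around a raw ingredient; every field name
    # and value here is made up for this sketch, not a real standard.
    raw_ingredient = "some_japanese_english_bitext.txt"

    translation_related_data = {
        "data":  raw_ingredient,                         # the raw symbols
        "who":   "an unknown crowd of translators",      # who is involved
        "what":  "e-commerce product descriptions",      # what it is about
        "where": "scraped from a public shopping site",  # the source
        "when":  "2014-11",                              # when it originated
    }

    print(translation_related_data["where"])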

Again, thanks to the Internet and search engines, the ingredients are almost free. The catch is that there is still no free lunch. At first glance, a simple keyword search may lead us to some nice resources; for example, combining the terms from the previous paragraph with a specific language pair, a query such as “Japanese English corpus” happens to return a nice list of resources. Beyond that, however, acquiring information about translation-related data requires interdisciplinary collaboration.

Particularly for the Asian translation business, it is not hard to imagine that, besides the typical prerequisites of domain knowledge and the genre/style of the desired outcome, one also needs a deep understanding of the differences among Asian languages, and of the heterogeneity between Asian and non-Asian languages. If that sounds exaggerated and intimidating, allow me to provide an almost stupid example, beginning with a naïve question (and please bear with me if you already know the answer): where can one find data to assist Japanese-to-English place name translation for online shopping/shipping?

To me as a computational linguist, the answer was trivial and tedious: just go trawling Wikipedia, or, to stay competitive for job security, query DBpedia with SPARQL as an exercise in the “Semantic Web.” It was of course disappointing that neither the coverage nor the quality sufficed. But then it came: one of my colleagues in the sales department, worried about sounding amateurish and overstepping, suggested: how about the address data of Japan Post? Ta-da!
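(For the curious, the DBpedia route looks roughly like the sketch below, using the public SPARQL endpoint and the SPARQLWrapper package; the query is a plausible starting point rather than a production recipe, and the coverage and quality caveats above still apply.)

    # A minimal sketch of querying DBpedia for Japanese place names with
    # English labels; assumes the public endpoint and the SPARQLWrapper
    # package (pip install SPARQLWrapper).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX dbr:  <http://dbpedia.org/resource/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?ja ?en WHERE {
            ?place a dbo:Place ;
                   dbo:country dbr:Japan ;
                   rdfs:label ?ja, ?en .
            FILTER (lang(?ja) = "ja" && lang(?en) = "en")
        }
        LIMIT 100
    """)

    # Print each Japanese label next to its English counterpart.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["ja"]["value"], row["en"]["value"], sep="\t")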

Well, unlike in fairy tales, there is always much more after the happy ending. Japan Post’s address data turned out to be romanized in upper case. Normalizing the case is not a big deal, but some romanized terms proved problematic: basement, floor, ward, and several other typical address units still appear as romanized Japanese rather than English.

Luckily, it is still not too hard to search-and-replace them. The really important thing is to be aware of the situation in the first place, and then to talk to customers and reach a mutual understanding: do you want “ward” to be “ku” or “area”, or…? Why?
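As a toy illustration of that search-and-replace step, assuming upper-case romanized input in the style described above, the unit mapping below is illustrative, and whether “KU” should become “ward”, “ku”, or “area” is exactly the question to settle with the customer:

    # A toy normalizer for upper-case romanized addresses; the mapping
    # encodes one possible answer to the "ward or ku" question, which in
    # practice the customer should decide.
    import re

    UNIT_MAP = {
        "KU": "ward",         # 区
        "KAI": "floor",       # 階
        "CHIKA": "basement",  # 地下
    }

    def normalize(address: str) -> str:
        # Replace romanized unit tokens while the text is still upper
        # case, then tame the shouting capitals into title case.
        for romaji, english in UNIT_MAP.items():
            address = re.sub(rf"\b{romaji}\b", english.upper(), address)
        return address.title()

    print(normalize("3 CHOME-1-1 MARUNOUCHI, CHIYODA KU"))
    # -> 3 Chome-1-1 Marunouchi, Chiyoda Ward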

So here we are. For shipment, if the place name will eventually be presented to Japan Post, why not keep it as it is? For other potential customers, say an online photo-sharing site that wants Japanese-English bilingual geo-locations, plain English is preferable to romanized Japanese, at least to a certain level, and floor and basement will probably be useless anyway. Furthermore, for such a customer, DBpedia becomes welcome after all.

Every decision a customer approves subsequently becomes evaluated knowledge, hopefully qualifying as wisdom, even when it looks as small and silly as the story above.

Wait, isn’t this still a long, tiresome, and uncertain journey toward bigger, quicker, and smarter data? I certainly hope not. Imagine that the wisdom of why and how to prepare place name data soon feeds the next round of data acquisition and inspires more keywords for search engine queries. Even better, if one is willing to invest time and money to semi-automate this positive feedback loop, the pyramid of DIKUW becomes the circle of DIKUW for the translation industry.
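As a closing doodle, the circle could be semi-automated along these lines; every function below is a trivial stub standing in for a real crawl, curation, and review pipeline, so treat this as a shape rather than an implementation:

    # A hypothetical skeleton of the DIKUW circle: decisions approved in
    # one round seed the search queries of the next. All helpers are stubs.

    def search(queries):
        # Stand-in for search-engine or API calls.
        return [f"hits for: {q}" for q in queries]

    def curate(candidates):
        # Stand-in for alignment, annotation, and cleaning.
        return candidates

    def review_with_customer(curated):
        # Stand-in for the "ward or ku" conversations above.
        return [(item, "approved") for item in curated]

    def derive_new_queries(decisions):
        # Wisdom feeding back: approved decisions inspire new keywords.
        return [f"{item}, more like this" for item, status in decisions
                if status == "approved"]

    queries = ["Japanese English corpus", "Japan Post address data"]
    for round_no in range(3):  # a few rounds of the circle
        queries = derive_new_queries(review_with_customer(curate(search(queries))))
        print(round_no, queries[0])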

Once the engine of the circle is started, the collision between big data and DIKUW will ease, and the next post-happy-ending quest shall reveal itself.