This article originally appeared in TAUS Review #3 in April 2015
In a landmark article well over a decade ago, the US psycholinguist George Miller (also the initiator of WordNet, the well-known computational thesaurus of English) showed how a fairly banal couplet from the American poet Robert Frost’s poem Stopping by Woods on a Snowy Evening that goes:
But I have promises to keep
And miles to go before I sleep
can cumulatively generate a total of 3,616,013,016,000 possible compound meanings if each individual word’s different dictionary meanings are broken out and aligned. This way semantic madness lies – at least for a computer. Luckily the outcome of a European project called BabelNet is now making it much easier to think through the classic problem of word ambiguity for the translation industry and others. Let’s look at how we got there and what BabelNet can offer by way of a solution.
The Frost poem, in this context, can be conceived as a vast word cloud in which each of the thirteen words links to all possible combinations of the others’ senses, at a geometric mean of 9.247 meanings per word. This explosion is of course partly an artefact of the very process of using computers to handle human language. Most humans would probably never have thought of it as a problem in a pre-digital world.
If you add a little augmented reality to this cloud picture by overlaying the words with grammatical information about their parts of speech – distinguishing, for example, ‘but’ as a conjunction from ‘but’ as a verb, as in “but me no buts” – you can radically reduce the ambiguity to a geometric mean of 2.026 senses per word.
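The arithmetic behind these figures is easy to check: if every word’s senses combine freely, the number of compound readings is the per-word geometric mean raised to the number of words. A short sketch, using the figures quoted above:

```python
# Sense combinatorics for the 13-word Frost couplet.
# The per-word figures are taken from the article; the model assumes
# senses combine freely and independently across the couplet.
WORDS = 13
GEOMEAN_RAW = 9.247    # mean senses per word, no part-of-speech filtering
GEOMEAN_POS = 2.026    # mean senses per word after part-of-speech tagging

def compound_readings(geomean_senses: float, n_words: int) -> float:
    """Total compound readings if each word's senses combine independently."""
    return geomean_senses ** n_words

print(f"untagged:   {compound_readings(GEOMEAN_RAW, WORDS):.3e}")  # ~3.6 trillion
print(f"POS-tagged: {compound_readings(GEOMEAN_POS, WORDS):.0f}")  # ~9,700
```

Part-of-speech tagging alone cuts the space from trillions of readings to under ten thousand, which is why it is usually the first disambiguation step in any NLP pipeline.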
Imagine now that you want to translate the phrase into another language. You would according to the computing scenario have to choose fairly systematically between the various meanings of each (full) word to find an equivalent in the target language.
Luckily humans have special access to knowledge about contexts that computers don’t have. They also know that we (poets especially) can make words work harder for us by packing in two or three meanings at a time. Remember Lewis Carroll’s Humpty Dumpty in Through the Looking-Glass, who cunningly said, “When I make a word do a lot of work like that, I always pay it extra.”
The fact is the human brain can engender some 10¹⁰⁰ concepts (more than the total number of particles in the universe) but we only know about 10⁶ words.
So concepts are forced to share word containers in order to be practically communicable by a carbon biological system with power, memory and other constraints.
This word disambiguation conundrum inevitably frustrated the pioneers of Natural Language Processing 50 years ago. Yehoshua Bar-Hillel, one of the founders of the discipline, famously claimed that he could not see how a computer could be brought to automatically ‘understand’ the difference between the two rather dreary English expressions “the box is in the pen” and “the pen is in the box”. In other words, how could you program a machine to distinguish pen meaning ‘enclosure’ (first phrase) from pen meaning ‘writing instrument’?
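One classic, if limited, attack on this problem is the simplified Lesk algorithm: score each candidate sense by how many words its dictionary gloss shares with the surrounding sentence. A minimal sketch, with toy glosses invented for illustration rather than real dictionary entries:

```python
# Simplified Lesk: pick the sense whose gloss overlaps most with the context.
# The glosses below are toy stand-ins, invented for this illustration.
GLOSSES = {
    "enclosure": "fenced area where farm animals such as sheep are kept",
    "writing_instrument": "instrument used for writing or drawing with ink",
}

def lesk_sense(context: str) -> str:
    """Return the sense whose gloss shares the most words with the context."""
    ctx = set(context.lower().split())
    return max(GLOSSES, key=lambda sense: len(ctx & set(GLOSSES[sense].split())))

print(lesk_sense("the farmer drove the sheep into the pen"))   # enclosure
print(lesk_sense("she signed the letter with a pen full of ink"))  # writing_instrument
```

Tellingly, on Bar-Hillel’s actual sentences this approach fails: “the box is in the pen” shares no content words with either gloss. His point was exactly that world knowledge – a box fits inside a playpen, a pen fits inside a box – and not dictionary look-up is what settles the question.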
This resounding academic doubt about the linguistic capacity of computers to access knowledge to disambiguate word meanings has been cited as one of the main reasons for the US government turning off its funding tap for machine translation back in the mid-1960s.
So if WordNet can now give us information about the sets of synonyms, antonyms and hyponyms etc. associated with a lexical item in our linguistic repertoire, how far have we come in being able to identify automatically the world knowledge contexts in which a given synonym/word meaning is appropriate?
The best place to find out today is Rome, where Roberto Navigli from the city’s Sapienza University has been working on the BabelNet project for several years now.
BabelNet is an online multilingual semantic dictionary and lexical database that provides a powerful resource via an API for anyone doing natural language processing - from translation to text analytics and more. It is poised to make a major impact as a multilingual resource for the digital agenda because it has the virtues of combining two crucial components: world knowledge and lexical information. In a nutshell, a seamless merge of WordNet, Wikipedia, Wiktionary and other electronic dictionaries.
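What such a merged record might look like can be sketched with a toy data model. The field names, the ID and the example entries below are invented for illustration and are not BabelNet’s actual schema; the real resource exposes its synsets through its own API.

```python
from dataclasses import dataclass

# Toy model of a BabelNet-style synset: one concept node merging
# lexicalizations in several languages (WordNet-style synonym sets)
# with glosses (Wikipedia/Wiktionary-style definitions).
@dataclass
class Synset:
    synset_id: str                 # illustrative "bn:..."-style identifier
    lemmas: dict[str, list[str]]   # language code -> synonyms in that language
    glosses: dict[str, str]        # language code -> definition

    def translations(self, lang: str) -> list[str]:
        """All lexicalizations of this concept in the given language."""
        return self.lemmas.get(lang, [])

piano_instrument = Synset(
    synset_id="bn:00061235n",  # made-up ID for the sketch
    lemmas={"EN": ["piano", "pianoforte"], "IT": ["pianoforte"],
            "DE": ["Klavier"]},
    glosses={"EN": "a keyboard instrument with hammered strings"},
)
print(piano_instrument.translations("DE"))  # ['Klavier']
```

The key design idea is that the node is the concept, not any one word: every language’s synonyms and every encyclopedic gloss hang off the same identifier, so lexical and world knowledge travel together.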
More generally, BabelNet puts into practice the strategy that Tim Berners-Lee propounded in 2009 of linked open semantic data as the underlying architecture of the next Web. After first showing us how to link documents, Berners-Lee has long been pleading for a level of linked meanings, not texts, by exploiting various types of openly-accessible linguistic data.
But, in addition to the merging of the knowledge about ‘word’ meanings and the knowledge about ‘world’ meanings, BabelNet is – as its name suggests – also dedicated to ensuring that this linkage is multilingual. This therefore constitutes a remarkable step forward towards the Human Language Project so dear to TAUS. BabelNet began life as a project with a multimillion European Commission grant in 2011 called MultiJEDI. Navigli’s principal concern at the time was to find effective, wide-coverage and powerful ways of representing lexical-semantic knowledge in appropriate formats and then use it in NLP applications.
He soon realized that one “over-ambitious” consequence was that the project needed to move from a monolingual to a multilingual approach, building on WordNet’s linguistic principles but transposing them to a multilingual setting, whereby concepts are understood as sets of synonyms in different languages.
The next challenge was to represent WordNet multilingually and then create a huge network of multilingual lexicalizations, connected to each other and to their meanings and sense distinctions, so that lexicographic sources (words) could link to encyclopedic resources (named entities).
If the lexical knowledge issue could be solved, then lots of applications could be enabled in as many languages as possible. Since Wikipedia is constructed as sets of equivalent pages linked across languages, the idea was to interconnect all these language versions.
The trouble is, Navigli realized, you can’t just harvest translations from Wikipedia, whose cross-language links mostly cover named entities. You also need translations of abstract concepts that are not in the encyclopedia at all.
So he had the idea of applying statistical machine translation to what are called “sense-annotated” corpora – i.e. corpora whose words have been associated with explicit meanings derived from WordNet or Wikipedia. This trick helped increase the coverage of translations for abstract concepts.
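The effect of sense annotation on translation can be shown with a toy lexicon. The concept IDs and the tiny concept-to-Italian table below are invented for the example; the point is that translation routes through the concept, not the surface string, so an ambiguous word still lands on the right equivalent.

```python
# A sense-annotated fragment of the Frost couplet: each content word is
# paired with a made-up concept ID (standing in for a real synset ID).
SENTENCE = [("promises", "c:promise"), ("keep", "c:fulfil"),
            ("miles", "c:mile"), ("sleep", "c:sleep")]

# Toy concept -> Italian lexicalization table, invented for illustration.
IT_LEXICON = {"c:promise": "promesse", "c:fulfil": "mantenere",
              "c:mile": "miglia", "c:sleep": "dormire"}

def translate(annotated: list[tuple[str, str]]) -> list[str]:
    """Translate by looking up each token's concept, not its surface form."""
    return [IT_LEXICON[concept] for _word, concept in annotated]

print(translate(SENTENCE))
```

Because “keep” is annotated as the concept ‘fulfil’ rather than, say, ‘retain’, the lookup cannot pick the wrong Italian verb: the disambiguation has already happened upstream, in the annotation.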
So now BabelNet had a way of linking dictionary and encyclopedic knowledge such that when you process a sentence automatically you don’t need to worry separately about word knowledge and world knowledge. You just get the appropriate meaning: the word piano in a music context is not the same as Piano in an architectural context (the architect Renzo Piano).
The original purpose of BabelNet, once the fairly complex automatic process of linking word meanings with entity identities was complete, was to support typical NLP tasks – e.g. Word Sense Disambiguation – across multiple languages. It can handle any amount of data and enable multilingual or even language-agnostic systems that can be applied to text analytics or to searching large text databases semantically.
In the case of search, BabelNet ought to return the “best” results in context, not the most obvious or typically crowdsourced results.
Improving machine translation is another key application. Although we are aware of the “unreasonable effectiveness of data”, there are frequent cases of data scarcity, especially in languages other than English, and recourse to the semantic resources of BabelNet should be able to help out.
In the emerging area of multilingual text analytics, especially in Europe with its geographical language silos, connecting analytics across languages, even with translations, will almost inevitably confront the problem of similar ideas being expressed in different ways or with different words. In such cases, some form of synonym linking across languages will be needed to capture facts about named entities and smooth translation quality.
So who is using BabelNet these days? Navigli says that computer-assisted translation is probably the biggest use case. People can expand their translation memories by using BabelNet to deliver confidence scores for each translation, for example.
Even more importantly, because each translation is tied to identified concepts, you can provide a term in any language and the system gives you back translations, definitions, relations and even pictures for millions of concepts.
It has also been suggested (by Andrzej Zydron of XTM Intl. among others) that BabelNet can be used to create a rapid document categorizer by automatically generating a wordlist of the contents (words plus translations) which then provides a quick semantic fingerprint of the document and its domain, plus a checklist of multilingual versions.
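That suggestion can be sketched as follows, with a hand-made word-to-concept lookup standing in for BabelNet and Jaccard overlap standing in for a real similarity measure; all names and data are invented for the example.

```python
# Toy semantic fingerprint: map a document's words to concept IDs and
# compare domains by the Jaccard overlap of the resulting concept sets.
CONCEPTS = {"piano": "c:instrument", "sonata": "c:composition",
            "chord": "c:harmony", "concrete": "c:material",
            "facade": "c:building_part", "architect": "c:profession"}

def fingerprint(text: str) -> set[str]:
    """The set of concept IDs mentioned in the text (toy lookup)."""
    return {CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS}

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two fingerprints: 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

music = fingerprint("the piano sonata opens with a chord")
building = fingerprint("the architect chose a concrete facade")
print(jaccard(music, building))  # 0.0 – different domains
```

A real categorizer would of course use BabelNet’s disambiguated synsets rather than a bare word list, but the shape of the idea is the same: the concept set is the document’s domain signature, and it is language-independent by construction.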
Roberto Navigli is most of all impressed by the system’s ability to process and connect text across almost any language. And this apparently means dozens and dozens of languages today from Abkhazian to Zulu. Once you link your text string, he says, to a node in a semantic network such as WordNet and Wikipedia, you move up to a new level of insight and open up a whole new world for your process.
For the future, the key for the resource will be to keep it open for research purposes, with user companies giving funds to continue with the development. This means that the language technology industry is finally able to leverage the power of linked linguistic data and linked encyclopedic knowledge for its own purposes. Let’s hope it will prompt some interesting disruptive innovation in Europe and elsewhere.