Breaking the Publishing Ground: From Dictionaries to Linguistic Data

TAUS Data Marketplace has brought new opportunities to everyone, from individual linguists and LSPs to data and publishing companies, to leverage and monetize their content. The key to being a part of the surging trend of language data for AI is the successful conversion of available multilingual content into language data that is directly usable for AI model training.

This remains challenging for many companies with roots in the publishing business. Lexicala is a notable example: it emerged from the publishing world as a provider of quality lexicographic content for leading dictionary publishers worldwide, overcame this challenge, and joined the TAUS Data Marketplace as a data seller. The company came across the Marketplace during its market research and decided that it would be an interesting platform for its business development goals, and that it would be fairly simple to adapt its data to publish and sell as language data.

We talked to Ilan Kernerman, CEO; Raya Abu Ahmad, Content Manager; and Maayan Orner, Software Manager, from Lexicala about the journey that has led them onto this path.

Lexicala was established in Tel Aviv as K Dictionaries, which had its roots in English learner’s dictionaries. During the 1990s, K Dictionaries developed a unique collaborative network with publishing partners around its innovative customized dictionaries, and established its name as a pioneer in bilingual, pedagogical, digital, and user-oriented lexicography. These days, its most notable close partner in these domains is Cambridge University Press, whose dictionary website, the world’s most popular for learners of English, includes dozens of K Dictionaries titles.

“At the turn of the century, we expanded to multilingual lexicography and started exploring new methodologies and technologies. This has led to the creation of a systemic, ground-breaking series of monolingual datasets for selected world languages, focusing on the data structure and format, that served for developing fully bilingual language pairs and diverse multilingual combinations, and to our gradual evolution into a technology-driven content creator,” says Ilan.

Today, they combine smart automated processes for data generation and validation with expert human-curated editing to make their resources interoperable and beneficial for NMT and other NLP and AI applications, offering high-end cross-lingual lexical data under the new trade name Lexicala.

“TAUS Data Marketplace presents us with an excellent opportunity to reach more potential customers who can benefit from the added value of our parallel corpora to enhance the training of their ML models and improve the results of their NMT solutions,” adds Ilan, in line with their business strategy.

In August 2021, Lexicala uploaded its first release to TAUS Data Marketplace: 357 bilingual datasets in 20 languages, totaling 1.7 million parallel segments and 43 million tokens. The languages include Arabic, Chinese (Simplified), Danish, Dutch, English, French, German, Greek, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish, as well as Latin, which is translated to French only. The segments in these datasets all stem from manually curated examples of usage and their translation equivalents. They consist only of full sentences and feature general language, i.e. domain-independent rather than vertical vocabularies. Lexicala datasets are available for purchase on the TAUS Data Marketplace. Check the samples and start training!

“The data were created by our editors around the world based on corpus evidence and frequency for each language. They create, review, select and manually curate examples of usage as part of compiling dictionary entries for the most important lemmas, senses and multiword expressions. These usage examples are then translated by professional translators and are at the heart of the parallel corpora now available on DMP,” says Raya.

Lexicala explains that they faced several challenges in the process, such as noisy segments and mislabeled datasets. For the former, they developed an algorithm that eliminates noisy segments based on basic statistical and heuristic rules. For the latter, they developed and used another algorithm that classifies tuples of <sentence, language> as correct or incorrect, based on an existing language identification model and a feed-forward neural network. “Our custom model improved over the baseline language identification model (checking if the labeled language is the same as the identified language) significantly for the specific task, mostly for highly ambiguous and mutually intelligible language groups, such as the Nordic ones,” explains Maayan.
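To make the two cleaning steps more concrete, here is a minimal Python sketch of what rule-based noise filtering and the baseline label check could look like. It is not Lexicala's actual algorithm: the rules, the thresholds, and the names is_noisy, filter_segments and baseline_label_ok are all hypothetical, and the language identifier is passed in as a placeholder function rather than a real model. The feed-forward classifier that Lexicala built on top of the baseline is not shown.

```python
from typing import Callable, Iterable, List, Tuple

Segment = Tuple[str, str]  # (source sentence, target sentence)

# Hypothetical thresholds, chosen only for illustration.
MIN_CHARS = 3           # shorter sides are treated as fragments
MAX_LENGTH_RATIO = 3.0  # longer side / shorter side, in characters
MIN_ALPHA_SHARE = 0.5   # minimum share of alphabetic characters


def is_noisy(source: str, target: str) -> bool:
    """Return True if the pair fails any of the simple rules below."""
    source, target = source.strip(), target.strip()
    # Rule 1: one side is empty or too short to be a sentence.
    if len(source) < MIN_CHARS or len(target) < MIN_CHARS:
        return True
    # Rule 2: the target is just a copy of the source (not translated).
    if source.lower() == target.lower():
        return True
    # Rule 3: implausible length ratio. Character counts differ a lot
    # between scripts, so a real threshold would be set per language pair.
    ratio = max(len(source), len(target)) / min(len(source), len(target))
    if ratio > MAX_LENGTH_RATIO:
        return True
    # Rule 4: a side made up mostly of digits, symbols or whitespace.
    for text in (source, target):
        letters = sum(ch.isalpha() for ch in text)
        if letters / len(text) < MIN_ALPHA_SHARE:
            return True
    return False


def filter_segments(segments: Iterable[Segment]) -> List[Segment]:
    """Keep only the pairs that pass every rule."""
    return [pair for pair in segments if not is_noisy(*pair)]


def baseline_label_ok(
    sentence: str,
    labeled_language: str,
    identify: Callable[[str], str],
) -> bool:
    """Baseline check: the label is accepted only if it equals the
    language returned by `identify`, a placeholder for any language
    identification model supplied by the caller."""
    return identify(sentence) == labeled_language


pairs = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),
    ("Hello", ""),
    ("12345 67890", "12345 67890"),
    ("Yes.", "This is a very long and unrelated target sentence."),
]
clean = filter_segments(pairs)
```

In this sketch only the first pair survives the filter. The baseline check is the comparison Maayan describes; it tends to fail when the identifier confuses closely related languages, which is the gap their custom classifier was built to address.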

They hope that joining the TAUS Data Marketplace will increase their exposure in the MT training market and expand their clientele. “And vice-versa, the considerable volume and diversity of our data, and its advantages over more conventional automatically harvested parallel and comparable corpora, can help boost the appeal of the Marketplace to buyers,” says Ilan.

As for data privacy and ownership concerns, Lexicala attaches the utmost importance to this topic. “This has also been a vital topic in our discussions with TAUS, to make sure that the data we upload to DMP is both highly protected and serves customers uniquely for incorporating it into their NMT systems to upgrade inhouse processes and results without making them available as-is to others,” explains Ilan. As the CEO of a data company with roots in the publishing industry, Ilan shares that dictionary publishers have traditionally been conservative about sharing their resources with others, but that this has been changing as dictionaries become freely available online and generate revenue from ads. Lexicala is one of the early adopters, but it seems that more companies from the publishing world are about to hop on the LD4AI (Language Data for AI) bandwagon soon.

Although it’s difficult to predict the future, particularly given the fast-paced advancements in AI training, Lexicala expects the global NLP market to keep growing enormously. They point to a recent Fortune report that estimates the overall market size will rise from USD 21 billion in 2021 to USD 127 billion in 2028. They also expect growing demand, together with more specialization and customization, for data sharing and marketplaces of all kinds.
