TAUS Data Marketplace has brought new opportunities to everyone, from individual linguists and LSPs to data and publishing companies, to leverage and monetize their content. The key to being a part of the surging trend of language data for AI is the successful conversion of available multilingual content into language data that is directly usable for AI model training.
This remains challenging for many companies that have their roots in the publishing business. Lexicala is a notable example: it emerged from the publishing world as a provider of quality lexicographic content for leading dictionary publishers worldwide, overcame this challenge, and joined the TAUS Data Marketplace as a data seller. They came across the Marketplace during their market research and decided that it would be an interesting platform for their business development goals, and that it would be fairly simple to adapt their data for publication and sale as language data.
We talked to Ilan Kernerman, CEO; Raya Abu Ahmad, Content Manager; and Maayan Orner, Software Manager, from Lexicala about the journey that has led them onto this path.
Lexicala was established in Tel Aviv as K Dictionaries, which had its roots in English learner’s dictionaries. During the 1990s, K Dictionaries developed a unique collaborative network with publishing partners around its innovative customized dictionaries, and established its name as a pioneer in bilingual, pedagogical, digital, and user-oriented lexicography. Today, their most notable partner in these domains is Cambridge University Press, whose dictionary website, the world’s most popular for learners of English, includes dozens of K Dictionaries titles.
“At the turn of the century, we expanded to multilingual lexicography and started exploring new methodologies and technologies. This has led to the creation of a systemic, ground-breaking series of monolingual datasets for selected world languages, focusing on the data structure and format, that served for developing fully bilingual language pairs and diverse multilingual combinations, and to our gradual evolution into a technology-driven content creator,” says Ilan.
Today, they combine smart automated processes for data generation and validation with expert human editing to make their resources interoperable and beneficial for NMT and other NLP and AI applications, offering high-end cross-lingual lexical data under the new trade name Lexicala.
“TAUS Data Marketplace presents us with an excellent opportunity to reach more potential customers who can benefit from the added value of our parallel corpora to enhance the training of their ML models and improve the results of their NMT solutions,” adds Ilan, in line with their business strategy.
In August 2021, Lexicala uploaded to TAUS Data Marketplace its first release of 357 bilingual datasets in 20 languages, totaling 1.7 million parallel segments and 43 million tokens. The languages include Arabic, Chinese (Simplified), Danish, Dutch, English, French, German, Greek, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish, as well as Latin, which is translated into French only. The segments in their datasets all stem from manually curated examples of usage and their translation equivalents. They consist only of full sentences and feature general language, i.e. domain-independent rather than vertical vocabularies. Lexicala datasets are available for purchase on the TAUS Data Marketplace. Check the samples and start training!
“The data were created by our editors around the world based on corpus evidence and frequency for each language. They create, review, select and manually curate examples of usage as part of compiling dictionary entries for the most important lemmas, senses and multiword expressions. These usage examples are then translated by professional translators and are at the heart of the parallel corpora now available on DMP,” says Raya.
Lexicala explains that they faced several challenges in the process, such as noisy segments and mislabeled datasets. For the former, they developed an algorithm that eliminates noisy segments based on basic statistical and heuristic rules. For the latter, they developed and used another algorithm that classifies tuples of <sentence, language> as correct or incorrect, based on an existing language identification model and a feed-forward neural network. “Our custom model improved over the baseline language identification model (checking if the labeled language is the same as the identified language) significantly for the specific task, mostly for highly ambiguous and mutually intelligible language groups, such as the Nordic ones,” explains Maayan.
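To give a sense of what rule-based noise filtering on parallel segments can look like, here is a minimal Python sketch. It is not Lexicala's algorithm: the article does not disclose their rules, so the `Segment` type, the individual checks (minimum length, identical source and target, length ratio, share of alphabetic characters) and all thresholds are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One parallel segment: a source sentence and its translation."""
    source: str
    target: str


def alpha_ratio(text: str) -> float:
    """Share of non-space characters that are letters (in any script)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def is_noisy(seg: Segment,
             max_len_ratio: float = 3.0,   # hypothetical threshold
             min_alpha: float = 0.6,       # hypothetical threshold
             min_chars: int = 3) -> bool:  # hypothetical threshold
    """Return True if the segment trips any of the illustrative rules."""
    src, tgt = seg.source.strip(), seg.target.strip()

    # Rule 1: too short to be a full sentence.
    if len(src) < min_chars or len(tgt) < min_chars:
        return True

    # Rule 2: untranslated copy (source and target are identical).
    if src == tgt:
        return True

    # Rule 3: implausible character-length ratio between the two sides.
    # A real pipeline would likely tune this per language pair, since
    # scripts such as Chinese are far more compact than Latin ones.
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_len_ratio:
        return True

    # Rule 4: mostly digits, punctuation or symbols on either side.
    return alpha_ratio(src) < min_alpha or alpha_ratio(tgt) < min_alpha


def filter_segments(segments: list[Segment]) -> list[Segment]:
    """Keep only the segments that pass every rule."""
    return [s for s in segments if not is_noisy(s)]
```

Simple rules like these are cheap to run over millions of segments and easy to audit, which is why they are a common first pass before any model-based check such as the language-label classifier described above.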
They hope that joining the TAUS Data Marketplace will increase their exposure in the MT training market and expand their clientele. “And vice-versa, the considerable volume and diversity of our data, and its advantages over more conventional automatically harvested parallel and comparable corpora, can help boost the appeal of the Marketplace to buyers,” says Ilan.
As for data privacy and ownership concerns, Lexicala attaches the utmost importance to this topic. “This has also been a vital topic in our discussions with TAUS, to make sure that the data we upload to DMP is both highly protected and serves customers uniquely for incorporating it into their NMT systems to upgrade inhouse processes and results without making them available as-is to others,” explains Ilan. As the CEO of a data company with roots in the publishing industry, Ilan shares that publishers in the dictionary industry have traditionally been conservative about sharing their resources with others, but that this has been changing as dictionaries become freely available online and generate revenue from ads. Lexicala is one of the early adopters, but it seems that more companies from the publishing world are about to hop on the LD4AI (Language Data for AI) bandwagon soon.
Although it’s difficult to predict the future, particularly in the face of fast-paced advancements in the AI training sphere, Lexicala expects that the global NLP market will continue to grow enormously, as shown in a recent Fortune report estimating that the overall market size will grow from USD 21 billion in 2021 to USD 127 billion in 2028. They expect a mix of greater demand, more specialization, and more customization for data sharing and marketplaces of all kinds.
Şölen is the Head of Digital Marketing at TAUS where she leads digital growth strategies with a focus on generating compelling results via search engine optimization, effective inbound content and social media with over seven years of experience in related fields. She holds BAs in Translation Studies and Brand Communication from Istanbul University in addition to an MA in European Studies: Identity and Integration from the University of Amsterdam. After gaining experience as a transcreator for marketing content, she worked in business development for a mobile app and content marketing before joining TAUS in 2017. She believes in keeping up with modern digital trends and the power of engaging content. She also writes regularly for the TAUS Blog/Reports and manages several social media accounts she created on topics of personal interest with over 100K followers.