The total volume of data created worldwide is expected to reach 149 zettabytes by 2024. Capitalizing on data has therefore become as important as capitalizing on human, financial, or any other capital. Data as capital has gained even more importance now that data-trained systems have started to dominate every imaginable aspect of the world we live in.
Against the backdrop of this surge in the demand for data, marketplaces are emerging that specialize in all forms of data, from language data to consumer or geospatial data. Businesses that are smaller in size and reach, or that operate in a niche category, have to work harder to gain visibility and put their offering on the map. That is where marketplaces such as TAUS Data Marketplace come into play.
We talked with Larry Cady, Solutions Consultant and partner with Chilin. Chilin (HK) Ltd is a privately held technology company based in Hong Kong. Founded in 2005 as a spin-off of City University Enterprises Ltd. of Hong Kong, the company draws on more than 20 years of research initiated at the university's Language Information Sciences Research Centre. It is a successful provider of rigorously curated Chinese language and bilingual data for many kinds of research organizations and enterprises. “The challenge for language data providers is reaching potential buyers or other players in the industry to make them aware of the useful data that the provider can offer. Chilin has joined the TAUS Data Marketplace in order to solve this problem,” says Larry.
Chilin has made two datasets available on the TAUS Data Marketplace so far. The first dataset contains 12,947 segments; 475,509 en-US words; and 401,629 zh-CN characters. It is based on the CPC patent classification category A61K in the pharmaceuticals domain. The second dataset contains 10,377 segments; 379,898 en-US words; and 327,637 zh-CN words. It is based on the CPC patent classification category C12N, which contains many biotechnology filings. “By adding these Chilin parallel sentence datasets to the base Google model, the BLEU score was increased by almost 5 points within the Pharmaceutical-Biotechnology domain,” explains Larry.
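For readers less familiar with the metric Larry mentions, the following is a minimal sketch of how a corpus-level BLEU score can be computed. It is not the evaluation setup used by Chilin, which the interview does not describe. The function names `ngrams` and `corpus_bleu` are illustrative, and the sketch assumes one reference per hypothesis, whitespace-separated tokens (Chinese text would first need to be segmented), and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (illustrative, unsmoothed).

    Assumes one reference per hypothesis and whitespace-tokenized strings.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a missing n-gram order drives the score to zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

An improvement "in BLEU points" is then the difference between `corpus_bleu` for the adapted engine's output and for the baseline engine's output on the same test set. In practice, evaluations usually rely on an established tool such as sacreBLEU rather than a hand-written function, so that scores are comparable across experiments.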
According to Forrester, firms that exploit next-generation data marketplaces will gain a digital edge. TAUS Data Marketplace is here to address the supply and demand challenges for language data, making its marketing outreach available to all language data sellers, regardless of their size or the language pair, domain, or content type they specialize in.
High-quality language data for Chinese is scarce, so the Chinese output generated by MT engines is still far from perfect. Chilin’s English-Chinese datasets have been shown to improve MT output quality in previous training tests. For those looking to improve their MT engine performance in this language pair, these datasets from a specialist data provider can be game-changing.
By making these high-performing datasets available through the TAUS Data Marketplace, Chilin hopes to reach a wider audience of potential buyers and to sell more data with little extra effort. The company also plans to publish additional Chinese-English datasets in the near future. These will be extracted from its collection of over 30 million sentence pairs and classified in various other domains.
Şölen is the Head of Digital Marketing at TAUS, where she leads digital growth strategies with a focus on generating compelling results via search engine optimization, effective inbound content, and social media. She has over seven years of experience in related fields. She holds BAs in Translation Studies and Brand Communication from Istanbul University in addition to an MA in European Studies: Identity and Integration from the University of Amsterdam. After gaining experience as a transcreator for marketing content, she worked in business development for a mobile app and in content marketing before joining TAUS in 2017. She believes in keeping up with modern digital trends and in the power of engaging content. She also writes regularly for the TAUS Blog/Reports and manages several social media accounts, with over 100K followers, that she created on topics of personal interest.