Going Beyond 100 Languages with Data
The TAUS Virtual Data Summit 2020 hosted insightful conversations on language data trends, expectations and best practices with expert panels and presenters. Here is a briefing in case you missed it.

On 10-11 November 2020, TAUS held its first virtual Data Summit. On the agenda were presentations and conversations discussing the new “Language Data for AI” subsector, the impact of neural MT on data needs, collection methods and data services, massive web-crawling projects, “mindful” AI and data ownership, privacy and copyright. An international audience of 100+ people came together to learn more about language data.

Language Data for AI

Earlier this month, TAUS published the Language Data for AI (LD4AI) Report and announced this as a new industry sub-sector. To analyze how language really became data and how that megatrend is starting to overhaul the professional language industry, we need to go back a few decades, Jaap van der Meer and Andrew Joscelyne said in their opening conversation with Şölen Aslan. And that’s exactly what we’ve done in this new TAUS report, which is freely downloadable from the TAUS website: it offers context and perspective and assesses the opportunities and challenges for both buyers and new providers entering this industry. The main takeaways are:

  • Language is core to AI
  • Data-first paradigm shift
  • Acceleration of change
  • Rise of the new cultural professional
  • New markets move faster

In the conversation with Aaron Schliem (Welocalize), Adam LaMontagne (RWS Moravia) and Satish Annapureddy (Microsoft), we dove a little deeper into a few of these takeaways. When it comes to data, Aaron says that so far our industry has used data to produce content or optimize content flows, but now we use data to interact more intelligently with human beings. Exactly how we use data and what we use it for differs per stakeholder. This means that one of the takeaways (language is core to AI) will most likely mean something slightly different for everyone. It also extends beyond just language.

Ever since the rise of the machine, talk of translators and other language professionals fearing the loss of their jobs has been trending. However, as Adam also confirmed, we’re just going to see a shift beyond professional translation services and towards linguistic data tagging, labeling and other data-related services and tasks. The “rise of the new cultural professional” means there is a great opportunity in our industry for a new kind of professional. Aaron says the changes in our industry simply open the door to different kinds of talent who want to get involved in building language solutions.

In addition to the new skills, we’re also looking into “new” (low-resource) languages. Both Aaron and Adam say that the demand for these low-resource languages is mostly for machine translation and for training the engines. How do you get this data? There are initiatives like the TAUS Human Language Project and the newly launched Data Marketplace that create and sell datasets in various languages.

With all these new technologies, MT engines, platforms, etc. emerging so quickly, we need to make sure we adapt well to the digital systems and use them to our advantage. The end goal of all professionals working in the language industry, everyone agreed, is of course for all individuals to be able to communicate seamlessly with one another, regardless of the language they speak.

Massively Multilingual

To talk in more depth about the changes in the MT field when it comes to data, we were joined by Orhan Firat (Google), Paco Guzmán (Facebook), Hany Hassan Awadalla (Microsoft) and Achim Ruopp (Polyglot Technologies).

Massively Multilingual Machine Translation is a new buzzword in the MT world. Orhan explained what exactly it means for MT development in three simple bullet points:

  1. Positive transfer: learning one language helps with learning another. This boosts the quality of low-resource languages and makes it possible to use less data.
  2. Less supervision: massively multilingual systems can make better use of monolingual data.
  3. Need for less structured data: being massively multilingual allows them to use less structured data like monolingual phrases, documents etc.

The focus is shifting towards exploiting publicly available monolingual data to generate artificial data. Bilingual parallel data is expensive and hard to obtain. The main role of parallel data comes up in the evaluation and benchmarking step. Paco says they are reaching the long tail of languages and it’s important to evaluate the translations better.

Hany adds that you can now jumpstart translation for a low-resource language with minimal data. However, the result is still far from that of any high-quality language, and this gap is not easy to close.

Currently, MT covers only 100+ languages. There are more languages out there with fewer language resources, but millions of speakers. What does it take to go beyond 100+ languages? Is there a playbook for defining the minimum amount of data (monolingual or bilingual) needed to run a system?

Hany says that about 100K monolingual segments would be enough to get started, but beyond that it depends on the difficulty of the language and other similar variables. Paco adds that the answer may change for certain domains, genres and languages. Closely related languages such as Ukrainian and Belarusian are easier: if you have supervision from one of them, it will benefit the other. If the languages are farther apart in terms of language families, it will be harder.

Beyond 1,000 languages, the only data you’ll have is Bibles or other religious texts (a very specific domain). If you have at least some parallel data that is very well structured (even under 1,000 segments), it’s good enough. When you approach the tail, you face vocabulary and learning problems. Other important factors are the noise level, data quality, the objective function you are optimizing, etc. He concludes by saying that one is always better than zero when it comes to data.

There are some institutions and governments that fund research that tries to tackle low-resource languages, but we do not have enough open-source datasets, and this should be a shared responsibility. Breaking language barriers and connecting people is crucial for our business, Paco says. Orhan adds that it should be a collaborative effort. The availability of data does not always depend on a language being present on the internet: it may be that there is no online content in that language at all. We need to enable that community to create content online. If a person speaks a language that is on the verge of extinction but has no phone or internet, we should give them the means to go online and generate content. For that, we need companies and governments to work collaboratively.

When asked what role the new Data Marketplace can play in this massively multilingual world, Paco says it might help solve the benchmarking issue. If we can have datasets that are highly curated, they can become the standard. For high-quality languages, we are in the evaluation crisis phase, adds Orhan.

The future will be all about less data but higher-quality data, even if it’s monolingual, that is natural and not machine-translated, Orhan says. Hany adds that we need better data in more domains, and that speech-to-speech data is an interesting area to get to 100 languages. Paco once more emphasizes high-quality data for evaluation, along with revising processes for low-resource languages, finding what causes catastrophic mistakes and sourcing data to solve those issues.

Mindful AI

The term “Mindful AI” stands for being thoughtful and purposeful about the intentions and emotions evoked by an artificially intelligent experience. A recent AI survey by Gartner found that people believe AI will transform businesses and that it will be part of future business and consumer applications. In reality, however, AI adoption has not lived up to its potential. The failure to embrace AI seems to be a human problem rather than a technology problem. In order to allow a machine to make decisions for us, we need to trust it to be fair, safe and reliable. We need to know the quality of the data and have transparency about how it is used. Generic models must be free of bias (gender, racial, political, religious), clean and balanced, and trained on large quantities of unprejudiced and diverse data.

There are three key pillars to operationalizing AI in a mindful way:

  1. Human-centric approach to designing the AI systems - end-to-end, human-in-the-loop integration in the AI solution lifecycle, from concept discovery, data collection, to model testing, training and scaling
  2. Trustworthy - transparent about the way the models are built and how they work
  3. Responsible - ensuring that AI systems are free of bias and grounded in ethics

AI localization at Pactera EDGE is defined as the cultural and linguistic adaptation of machine learning solutions through the injection of securely obtained and meticulously processed localized AI data, Ahmer Inam and Ilia Shifrin (Pactera EDGE) explained in their presentation. The majority of consumers nowadays expect personalized services. There is not sufficient data for such customizations, which is why AI sometimes gets it wrong (for example, by inheriting historical gender bias). Moreover, all AI models naturally become obsolete: production models can degrade in accuracy by 20% every year, meaning they need to be fed new data continuously.
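
To make the 20% figure concrete, here is a minimal Python sketch of what such a decline could look like for a model that is never retrained. The presentation did not say whether the 20% is a relative or an absolute drop, so the sketch assumes a relative drop that compounds yearly; the function name and the 0.90 starting accuracy are hypothetical and chosen only for illustration.

```python
def projected_accuracy(initial: float, yearly_relative_drop: float, years: int) -> list[float]:
    """Project accuracy for year 0..years of a model that is never retrained.

    Assumption (not from the presentation): accuracy falls by a constant
    relative fraction each year, so the decline compounds.
    """
    return [initial * (1.0 - yearly_relative_drop) ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Hypothetical starting accuracy of 0.90 and the 20% yearly drop quoted above.
    for year, accuracy in enumerate(projected_accuracy(0.90, 0.20, 3)):
        print(f"year {year}: accuracy {accuracy:.3f}")
```

Under these assumptions, a model would lose roughly half of its starting accuracy within three years, which illustrates why a continuous supply of fresh data matters.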

Here’s what they recommend to get started with AI localization:

  1. Build a strong foundation of professionally localized data to build AI models
  2. Find a reliable and scalable AI data partner
  3. Designate R&D resources to drive AI solutions

The Data Marketplace

In November 2020, TAUS launched the Language Data Marketplace, a project funded by the European Commission. The main objectives of the platform are to provide high-quality data for machine translation engines and to bridge the gap for low-resource languages and domains. The TAUS team presented the available features, such as data analysis, cleaning and smart price suggestions for data sellers, and the easy exploration flow that helps buyers identify the data they need. The team also shared the roadmap of upcoming functionalities.

Some of the early adopters (data sellers) explained why the Data Marketplace was a great opportunity for them:

  • Mikhail Gilin from TransLink explained that as a large Russian LSP they like the prospect of using this new unique technology and its advanced NLP capabilities to sell the data that they’ve created internally.
  • Margarita Menyailova from EGO Translating Company recognized the need for data in order to develop new services and explained how Data Marketplace is key in shortening the supplier-customer chain.
  • Adéṣínà Ayẹni, a journalist with an ambition to bring the Yoruba language into the digital space, was excited about the opportunity Data Marketplace gives him: to address the marginalization of African languages while earning a monetary reward.

Who Owns My Language Data

Questions around privacy and ownership of language data are becoming more pressing in this time of AI and machine learning. Wouter Seinen, Partner at Baker McKenzie and leader of the IP and Commercial Practice Group in Amsterdam, shared the highlights of the White Paper on ownership, privacy and best practices around language data, jointly published by Baker McKenzie and TAUS.

Use your common sense, was Wouter’s main advice. Language as such cannot be owned, although the use of language data can fall within the realm of intellectual property (IP) rights. In the era of digitalization, reality is diverging from the law books of 20-30 years ago, as content is freely copied and shared online.

The type of data that is in question here is functional, segment-based text, and not highly creative content. Yes, there is a chance that there could be a name in the data that is subject to GDPR, or a unique set of words subject to IP rights, but in the specific niche where we are operating, these issues are more likely to be an exception than a rule.

Massive web crawling (as in projects like Paracrawl) without written consent from the content owner is an issue solely from a legal perspective. In theory, copying someone else’s text can only be done if you have permission. But then again, the Google caching program is basically copying the whole internet. Since it happens on a very large scale and the majority of people don't have an issue with it, there seems to be a shift that comes from the discrepancy between what the law prescribes and what people are doing.

Cleaning data doesn’t move the needle of ownership much from an IP perspective, similar to a translation, which is considered an infringement of the original if done without permission. In light of this and other new data scenarios, the TAUS Data Terms of Use from 2008 were updated to include the new use scenarios, and are now available as the Data Marketplace Terms of Use.



Anne-Maj van der Meer is a marketing professional with over 10 years of experience in event organization and management. She has a BA in English Language and Culture from the University of Amsterdam and a specialization in Creative Writing from Harvard University. Before her position at TAUS, she was a teacher at primary schools in regular as well as special needs education. Anne-Maj started her career at TAUS in 2009 as the first TAUS employee, where she became a jack of all trades, taking care of bookkeeping and accounting as well as creating and managing the website and customer services. For the past 5 years, she has worked as Events Director, chief content editor and designer of publications. Anne-Maj has helped organize more than 35 LocWorld conferences, where she takes care of the program for the TAUS track and hosts and moderates these sessions.
