TAUS Data Sale to Boost Multilingual LLMs


Purchase TAUS's exclusive data collection of close to 7.4 billion words covering 483 language pairs, now available at discounts of more than 95% off the original value.

TAUS is offering its data collection of close to 7.4 billion words for sale this spring at discounts of more than 95% off the original value. The sale opens on March 11 and ends on April 30, 2024. The words on offer are all non-public, unique data of human translation quality, covering 483 language pairs.


In the early days of Statistical and then Neural MT, TAUS data served a relatively small audience of a few dozen MT developers. The landscape has changed drastically since 2023. With GenAI and LLMs, there are thousands of new players interested in customizing and improving generic models. The TAUS multilingual data is highly relevant and valuable to them, especially because most LLMs have been trained almost exclusively (more than 90%) on English-language data. However, the rates TAUS has historically charged, 1,500 to 2,500 Euros per million words of training data, are now too high for the new generation of smaller-scale users, who are less focused on generic models and more on customized models. That's why the TAUS data assets are now available at steep discounts of up to 95%.


“There are shifts in the needs for data,” says Amir Kamran, solution architect at TAUS. “The LLM developers are now looking for data with a lot more context to improve the overall performance and accuracy of the language generation features. For the translation performance they tend to rely on transfer learning, which results in underperformance of the multilingual and translation features of LLMs. The TAUS data helps to improve the translation quality scores with double-digit percentage points.”


Please contact us or go to our Data For AI page to acquire the data catalog, samples and the pricing table. You can purchase the entire collection or choose specific language pairs.


Anne-Maj van der Meer is a marketing professional with over 10 years of experience in event organization and management. She has a BA in English Language and Culture from the University of Amsterdam and a specialization in Creative Writing from Harvard University. Before joining TAUS, she was a teacher at primary schools in regular as well as special needs education. Anne-Maj started her career at TAUS in 2009 as its first employee, where she became a jack of all trades, taking care of bookkeeping and accounting as well as creating and managing the website and customer services. For the past 5 years, she has worked as Events Director, chief content editor and designer of publications. Anne-Maj has helped organize more than 35 LocWorld conferences, where she takes care of the program for the TAUS track and hosts and moderates its sessions.
