TAUS Data Sale to Boost Multilingual LLMs

TAUS is offering its data collection of close to 7.4 billion words for sale this spring at discounts of more than 95% of the original value. The sale opens on March 11 and ends on April 30, 2024. The 7.4 billion words on offer are all non-public, unique, human-translation-quality data covering 483 language pairs.

In the early days of statistical and then neural MT, TAUS data served a relatively small audience of a few dozen MT developers. The landscape has changed drastically since 2023. With GenAI and LLMs, there are thousands of new players interested in customizing and improving generic models. The TAUS multilingual data is highly relevant and valuable to them, especially because most LLMs have been trained almost entirely (more than 90%) on English-language data. However, the rates TAUS has historically charged, 1,500 to 2,500 euros per million words of training data, are now too high for the new generation of smaller-scale users, who are less focused on generic models and more on customized models. That’s why the TAUS data assets are now available at steep discounts of up to 95%.

“There are shifts in the needs for data”, says Amir Kamran, solution architect at TAUS. “The LLM developers are now looking for data with a lot more context to improve the overall performance and accuracy of the language generation features. For translation performance, they tend to rely on transfer learning, which results in underperformance of the multilingual and translation features of LLMs. The TAUS data helps to improve the translation quality scores by double-digit percentage points.”

Please contact us or go to our Data For AI page to acquire the data catalog, samples and the pricing table. You can purchase the entire collection or choose specific language pairs.
