Case Study

Speech Data Collection to Increase Performance & Diversity in Voice-based AI Systems

Voice-based AI systems, such as virtual assistants, need large amounts of high-quality speech data to perform well and improve the customer experience. The quality of speech data depends closely on the diversity of accents and demographics among the people who provide it. That’s where the TAUS Human Language Project (HLP) Platform can help.

The Client

One of the world’s largest technology corporations

The Challenge

Our client is one of the world’s leading multinational technology companies. They wanted their voice and speech-to-text applications to better recognize speakers who use a local accent rather than a ‘standard’ pronunciation.

To achieve this, they needed to train their systems on voice data from speakers with highly diverse demographics in terms of accent, age, gender, and ethnicity.
This would help their technology act more responsibly and become accessible to more end users, so that users do not feel discriminated against or frustrated when their local way of speaking is not processed properly.
The Solution
After identifying the client’s unique requirements, TAUS curated a diverse team of workers who created over 1,400 hours of speech data in English (GB) across nine specific dialects, with each speaker contributing only once.
TAUS combined its in-house data expertise with the TAUS HLP Platform, a dedicated microtask platform for language data services with access to a global community of data contributors. With these, TAUS formed a community spanning the demographic segments requested by the client, created a workflow in which speech and accent experts could define and confirm dialects, and implemented additional automated and manual quality checks.
The Result
Complex recruitment based on the intersection of multiple sensitive demographic attributes: 9 dialects, 3 genders, 3 age groups, 4 ethnic groups
Requirement of a unique speaker per dataset
Need to recruit very broadly to overcome legal restrictions on targeting specific demographics. This meant recruiting a community up to 10 times bigger than the actual target group to reach the desired number of submissions.
Data collection platform tailored to project needs: speech collection + QA + transcription + speaker metadata collection
Over 1,400 hours of speech by unique speakers across 65 demographic intersections, calculated on an average of 30 minutes of speech for each of the 2,763 delivered speaker datasets.
