Case Study: Utterance Collection

Delivered 7M+ utterances to build multilingual digital assistants in 13 languages

Real World Solution

Data that powers global conversations

Utterance training is needed because customers do not all use the same scripted words or phrases when interacting with, or asking questions of, their voice assistants. That is why voice applications must be trained on spontaneous speech data. For example, “Where is the closest hospital located?”, “Find a hospital near me”, and “Is there a hospital nearby?” all express the same search intent but are phrased differently.
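As a minimal sketch of what this looks like in training data, the three example utterances can be stored as pairs of text and intent label and then grouped by intent. The label name `find_hospital` and the helper `group_by_intent` are hypothetical and used only for illustration; the case study does not describe the client's actual labeling scheme.

```python
from collections import defaultdict

# The three example utterances from the text, paired with a hypothetical
# intent label ("find_hospital" is an illustrative name, not from the source).
LABELED_UTTERANCES = [
    ("Where is the closest hospital located?", "find_hospital"),
    ("Find a hospital near me", "find_hospital"),
    ("Is there a hospital nearby?", "find_hospital"),
]


def group_by_intent(pairs):
    """Group utterance texts under the intent label they share."""
    grouped = defaultdict(list)
    for text, intent in pairs:
        grouped[intent].append(text)
    return dict(grouped)


if __name__ == "__main__":
    for intent, texts in group_by_intent(LABELED_UTTERANCES).items():
        print(intent, len(texts))
```

Differently phrased utterances land under one intent, which is the kind of variety a speech model needs to see during training.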

To execute the client’s digital assistant speech roadmap for worldwide languages, the team needed to acquire large volumes of training data for the speech recognition AI model. The client’s critical requirements were:

  • Acquire large volumes of training data (single-speaker utterance prompts between 3 and 30 seconds long) for speech recognition services in 13 global languages
  • For each language, the supplier will generate text prompts for speakers to record (unless the
    client supplies them) and transcribe the resulting audio.
  • Provide audio data and transcription of recorded utterances with corresponding JSON files
    containing the metadata for all recordings.
  • Ensure a diverse mix of speakers by age, gender, education & dialect
  • Ensure a diverse mix of recording environments as per the specifications.
  • Each audio recording shall be at least 16 kHz but preferably 44 kHz
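The measurable requirements above (utterance length and minimum sample rate) lend themselves to an automated check on each recording's JSON metadata. The sketch below is illustrative only: the case study does not publish the JSON schema, so the field names (`utterance_id`, `language`, `sample_rate_hz`, `duration_s`, `transcript`) and both helper functions are assumptions.

```python
import json
from dataclasses import dataclass

# Thresholds taken from the requirements list.
MIN_SAMPLE_RATE_HZ = 16_000
MIN_DURATION_S = 3.0
MAX_DURATION_S = 30.0


@dataclass
class RecordingMetadata:
    # Hypothetical fields; the real metadata schema is not given in the source.
    utterance_id: str
    language: str
    sample_rate_hz: int
    duration_s: float
    transcript: str


def metadata_from_json(text: str) -> RecordingMetadata:
    """Parse one metadata record from a JSON object string."""
    return RecordingMetadata(**json.loads(text))


def validate_recording(meta: RecordingMetadata) -> list[str]:
    """Return a list of requirement violations (empty if the record passes)."""
    problems = []
    if meta.sample_rate_hz < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {meta.sample_rate_hz} Hz is below 16 kHz")
    if not (MIN_DURATION_S <= meta.duration_s <= MAX_DURATION_S):
        problems.append(f"duration {meta.duration_s} s is outside 3-30 s")
    if not meta.transcript.strip():
        problems.append("transcript is empty")
    return problems


if __name__ == "__main__":
    record = metadata_from_json(
        '{"utterance_id": "u1", "language": "da-DK", '
        '"sample_rate_hz": 16000, "duration_s": 5.2, "transcript": "hej"}'
    )
    print(validate_recording(record))
```

Speaker and environment diversity are dataset-level properties, so they would need a separate aggregate check rather than a per-recording one.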

Accelerate your Conversational AI
application development by 100%

“After evaluating many vendors, the client chose Shaip because of their expertise in conversational AI projects. We were impressed with Shaip’s project execution competence, their expertise to source, transcribe and deliver the required utterances from expert linguists in 13 languages within stringent timelines and with the required quality”


With our deep understanding of conversational AI, we helped the client collect, transcribe and annotate the data with a team of expert linguists and annotators to train their AI-powered Speech Processing multilingual Voice Suite.

The scope of work for Shaip included, but was not limited to, acquiring large volumes of audio training data for speech recognition, transcribing the audio recordings for all languages on the client’s Tier 1 and Tier 2 language roadmap, and delivering corresponding JSON files containing the metadata. Shaip collected utterances of 3-30 seconds at scale while maintaining the quality required to train ML models for complex projects.

  • Audio Collected, Transcribed & Annotated: 22,250 hours
  • Languages Supported: 13 (Danish, Korean, Saudi Arabian Arabic, Dutch, Mainland & Taiwan Chinese, French Canadian, Mexican Spanish, Turkish, Hindi, Polish, Japanese, Russian)
  • No. of Utterances: 7M+
  • Timeline: 7-8 months
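As a quick sanity check, the headline figures can be compared against the 3-30 second utterance requirement. The sketch below assumes that all 22,250 hours are utterance audio and treats "7M+" as exactly 7,000,000, so the result is an upper bound on the average length rather than a reported figure.

```python
HOURS_COLLECTED = 22_250   # from the figures above
UTTERANCES = 7_000_000     # "7M+" is a lower bound, so this is an assumption


def average_utterance_seconds(hours: float, utterances: int) -> float:
    """Average audio length per utterance, in seconds."""
    return hours * 3600 / utterances


if __name__ == "__main__":
    avg = average_utterance_seconds(HOURS_COLLECTED, UTTERANCES)
    print(f"{avg:.1f} seconds per utterance")  # 11.4 seconds per utterance
```

An average of about 11.4 seconds or less sits comfortably inside the required 3-30 second range.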

While collecting audio utterances at 16 kHz, we ensured a healthy mix of speakers by age, gender, education, and dialects in diverse recording environments.


The high-quality utterance audio data from expert linguists empowered the client to accurately train their multilingual Speech Recognition model in 13 Global Tier 1 & 2 languages. With gold-standard training datasets, the client can offer intelligent and robust digital assistance to solve future real-world problems.


Tell us how we can help with your next AI initiative.