Most Trusted Speech Data Collection Services for your AI

Train your NLP models, voice assistants (VAs), TTS prototypes, and more on quality conversational data from our audio and speech data collection services.

Discover audio data pipelines without bottlenecks

Professional Audio / Voice Data Collection Services

Any subject. Any scenario.

At Shaip, our expertise lies in creating high-quality speech datasets designed for varied AI/ML requirements. We offer an expansive range of languages and record in diverse settings, making our datasets comprehensive and adaptable. Our focus is on feeding models with the highest volume of custom speech data in the least possible time. With us on board, you can expect:

  • Curated, high-quality multilingual audio / voice data to improve accuracy
  • The highest possible level of domain specificity to target diverse scenario setups
  • ML models that scale to suit diverse demographics and verticals
  • Recording environments: Studio Quality, featuring crystal-clear audio with minimal background noise, and Natural Environments, where recordings incorporate ambient sounds to mimic real-world situations

100+

Countries

55K+

Hours of Speech Data

250+

Projects

60+

Languages (100+ Dialects)

8/16/44/48 kHz

Sampling rate

Our Expertise

Align Audio Data for Smarter NLP Models

Shaip offers end-to-end speech/audio data collection services in 100+ languages so that voice-enabled technologies can cater to diverse audiences across the globe. We can work on projects of any scope and size, from licensing existing off-the-shelf audio datasets, to managing custom audio data collection, to audio transcription and annotation. No matter how big your speech data collection project is, we can customize our audio collection services to build high-quality NLP datasets that target the dialects, tones, and languages you need. Choose from our wide range of speech datasets and audio data collection resources for voice-enabling intelligent setups.

Monologue Speech Collection

This service focuses on capturing speech from a single speaker. Scripted prompts are recorded into single-channel audio files, capturing the unique speech patterns, tones, and nuances specific to that individual.

Dialogue Speech Collection

Two-person interaction, replicating real-world conversations and dialogues with multilingual exposure via dual-channel files and transcribed resources.

Group / Multi-party Conversations

Multi-person discussions, capturing group dynamics, overlaps, and varied tones so as to accurately train speech models.

Natural Language Utterance Collection

Train AIs to identify phrases or wake words with similar meanings using diverse, rich, and authentic utterances for advanced natural language processing and understanding.

Acoustic Data Collection

We can professionally record studio-quality audio data, whether in restaurants, offices, homes, or other environments, and across languages, while covering a wide acoustic range (Comprehensive Sound Datasets).

Automatic Speech Recognition (ASR)

Improve the accuracy of your automatic speech recognition (ASR) systems with access to state-of-the-art, diversified speech/audio datasets from a wide array of demographics.

Multilingual Speech/Audio Training Data

Our skilled language professionals across the globe offer multilingual audio/speech data in various languages and dialects. This effort fosters global communication and bridges language barriers, contributing to more inclusive and effective AI solutions.

Text-to-Speech (TTS)

Build a multilingual text-to-speech (TTS) model with the help of our global workforce, who collect speech data in 150+ languages and dialects. High-quality audio data enhances your AI models for applications ranging from in-car controls to chatbots and learning solutions.

Call Center Recordings

Genuine exchanges between agents and clients, supporting numerous languages such as Spanish, German, American English, Bengali, Japanese, Chinese, and Hindi.

Success Stories

Conversational AI datasets with over 3k hours of data across 8 languages

Looking to build a multilingual platform for Indian languages, the client partnered with Shaip to collect, segment and transcribe large datasets in multiple Indian languages. This would help develop effective speech models that could power the client’s innovative new platform.

Problem: The client needed over 3,000 hours of audio data in 8 Indian languages collected, segmented, and transcribed to develop automatic speech recognition.

Solution: We provided data collection, segmentation, and transcription, and delivered JSON files with metadata. We collected 3,000 hours of audio data in 8 Indian languages at scale for the client’s speech technology project.

Reasons to choose Shaip as your Trustworthy Speech Data Collection Partner

People

Dedicated and trained teams:

  • 30,000+ collaborators for Data Creation, Labeling & QA
  • Credentialed Project Management Team
  • Experienced Product Development Team
  • Talent Pool Sourcing & Onboarding Team

Process

Highest process efficiency is assured with:

  • Robust Six Sigma Stage-Gate Process
  • A dedicated team of Six Sigma Black Belts: key process owners and quality compliance
  • Continuous Improvement & Feedback Loop

Platform

The patented platform offers these benefits:

  • Web-based end-to-end platform
  • Impeccable Quality
  • Faster turnaround time (TAT)
  • Seamless Delivery

Off-the-Shelf Speech / Audio Datasets

DetailsLanguage DatasetSample RateDataset TypeTotal Audio HoursShort DescriptionDataset DescriptionAudio ChannelRecording PlatformWER (%)Audio FormatTranscription FormatUse CaseNumber of SpeakersCTA
Speechen_US_CC_8African American VernacularAfrican American Vernacularen_US8 kHzCall-center211African American Vernacular Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 612, Male: 1242, and Unknown: 12
Speechen_US_MA_16African American VernacularAfrican American Vernacularen_US16 kHzMedia Audio154African American Vernacular Media dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 151, Male: 150, and Unknown: 10
SpeechAfrikaans_GC_8AfrikaansAfrikaansaf_ZA8 kHzGeneral Conversation368Afrikaans General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Afrikaans spoken in AfricaDualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 502, Male: 390, and Unknown: 2
SpeechAfrikaans_MA_16AfrikaansAfrikaansaf_ZA16 kHzMedia Audio658Afrikaans Media FilesLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 750, Male: 1278, and Unknown: 52
SpeechArabic_GC_8ArabicArabicar_AE8 kHzGeneral Conversation292Arabic General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Arabic from Gulf countriesDualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 171, Male: 534, and Unknown: 1
SpeechArabic_SM_48ArabicArabicar-SA48 kHzScripted Monologue1,947Arabic Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 838 Male 1209 Unknown 78
SpeechAssamese_CC_8AssameseAssamese (In Pipeline) as_INCall-Center60Assamese (In Pipeline) Call-Center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechAssamese_GCAssameseAssamese (In Pipeline) as_INGeneral Conversation100Assamese (In Pipeline) General Conversation dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechAssamese_MAAssameseAssamese (In Pipeline) as_INMedia Audio40Assamese (In Pipeline) Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechBengali_CC_8BengaliBengali (In Pipeline) bn_INCall-Center60Bengali (In Pipeline) Call-Center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechBengali_GCBengaliBengali (In Pipeline) bn_INGeneral Conversation100Bengali (In Pipeline) General Conversation dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechBengali_MABengaliBengali (In Pipeline) bn_INMedia Audio40Bengali (In Pipeline) Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechBoston_CC_8Boston EnglishBoston Englishen_US8 kHzCall-Center177Boston Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 605, Male: 711, and Unknown: 0
SpeechBoston_GC_8Boston EnglishBoston Englishen_US8 kHzGeneral Conversation32Boston General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 53, Male: 83, and Unknown: 0
SpeechBoston_MA_16Boston EnglishBoston Englishen_US16 kHzMedia Audio93Boston Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 43, Male: 181, and Unknown: 2
SpeechCanadian_SM_48Canadian FrenchCanadian Frenchfr-CA48 kHzScripted Monologue1,222Canadian FrenchSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 974 Male 631 Unknown 1
SpeechChinese_CC_8Chinese EnglishChinese Englishen_US8 kHzCall-Center169Chinese Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 1790, Male: 523 and Unknown: 13
SpeechChinese_MA_16Chinese EnglishChinese Englishen_US16 kHzMedia Audio249Chinese Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 126, Male: 346 and Unknown: 6
SpeechChinese Simplified_SM_48Chinese SimplifiedChinese Simplifiedzh-CN48 kHzScripted Monologue2,762Chinese SimplifiedSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1920 Male 1535 Unknown 270
SpeechChinese Traditional_SM_48Chinese TraditionalChinese Traditionalzh-TW48 kHzScripted Monologue1,028Chinese TraditionalSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1069 Male 262 Unknown 3
SpeechDanish_GC_8DanishDanishda_DK8 kHzGeneral Conversation372Danish General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 311, Male: 417, Unknown: 0
SpeechDanish_MA_16DanishDanishda_DK16 kHzMedia Audio664Danish Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale: 369, Male: 864, Unknown: 27
SpeechDanish_SM_48DanishDanishda-DK48 kHzScripted Monologue2,579Danish Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second range, Danish from DenmarkMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1551 Male 1233 Unknown 42
SpeechEnglish Deep South_CC_8English Deep SouthEnglish Deep Southen_US8 kHzCall-Center151English Deep South Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 221 , Male 1004 , Unknown 7
SpeechEnglish Deep South_GC_8English Deep SouthEnglish Deep Southen_US8 kHzGeneral Conversation56English Deep South General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 99, Male 31, Unknown 0
SpeechEnglish Deep South_MA_16English Deep SouthEnglish Deep Southen_US16 kHzMedia Audio266English Deep South Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 204, Male 356, Unknown 21
SpeechGerman_CC_8GermanGermande-De8 kHzCall-Center64German Call-center data Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,MonoDesktop.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 478 Male 1440 Unknown 0
SpeechGerman_IVR_8GermanGermande-De8 kHz IVR200German IVR dataHuman to Machine. An IVR type of flow where there is a TTS prompt (e.g. ”How may I help you”) followed by a spontaneous human responseMonoDesktop.wav .jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling Female 10115 Male 8750 Unknown 0
SpeechGujarati_CC_8GujaratiGujarati (In Pipeline) gu_INCall-Center60Gujarati (In Pipeline) Call-Center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechGujarati_GCGujaratiGujarati (In Pipeline) gu_INGeneral Conversation100Gujarati (In Pipeline) General Conversation dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechGujarati_MAGujaratiGujarati (In Pipeline) gu_INMedia Audio40Gujarati (In Pipeline) Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechHebrew_General Conversation_8HebrewHebrewhe_IL8 kHzGeneral Conversation399Hebrew General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Hebrew in IsraelDualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 414 , Male 399 , Unknown 1
SpeechHebrew_MA_16HebrewHebrewhe_IL16 kHzMedia Audio427Hebrew Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 361 , Male 513, Unknown 13
SpeechHindi_MA_16HindiHindihi_IN16 kHzMedia Audio219Hindi Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 83 , Male 309, Unknown 0
SpeechHindi_SM_48HindiHindihi-IN48 kHzScripted Monologue2,867Hindi Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1977 Male 1864 Unknown 147
SpeechHINGLISH_CC_8HinglishHinglishhg_IN8 kHzCall-Center208HINGLISH Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 822, Male 1262 , Unknown 0
SpeechHINGLISH_MA_16HinglishHinglishhg_IN16 kHzMedia Audio216HINGLISH Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 75, Male 380, Unknown 0
SpeechHispanic_CC_8Hispanic EnglishHispanic Englishen_US8 kHzCall-Center212Hispanic Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 822, Male 1262, Unknown 0
SpeechHispanic_MA_16Hispanic EnglishHispanic Englishen_US16 kHzMedia Audio155Hispanic Call Media audioLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 140, Male 219, Unknown 5
SpeechIndonesian_GC_8IndonesianIndonesianid_ID8 kHzGeneral Conversation496Indonesian General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Bahasa IndonesianDualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 524, Male 454, Unknown 2
SpeechIndonesian_MA_16IndonesianIndonesianid_ID16 kHzMedia Audio643Indonesian Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 746, Male 1507, Unknown 129
SpeechIrish_GC_8IrishIrishen_IE8 kHzGeneral Conversation192Irish General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 213 , Male 153 , Unknown 0
SpeechJapanese_SM_48JapaneseJapaneseja-JP48 kHzScripted Monologue2,335Japanese Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1460 Male 1221 Unknown 194
SpeechKannada_CC_8KannadaKannada (In Pipeline) kn_INCall-Center60Kannada (In Pipeline) Call-Center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechKannada_GCKannadaKannada (In Pipeline) kn_INGeneral Conversation100Kannada (In Pipeline) General Conversation dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechKannada_MAKannadaKannada (In Pipeline) kn_INMedia Audio40Kannada (In Pipeline) Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechKorean_CC_8KoreanKoreanko_KR8 kHzCall-Center107Korean Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1086, Male 210 , Unknown 4
SpeechKorean_MA_16KoreanKoreanko_KR16 kHzMedia Audio204Korean media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 70 Male 303, Unknown 25
SpeechKorean_SM_48KoreanKoreanko-KR48 kHzScripted Monologue1,955Korean Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1195 Male 1134 Unknown 122
SpeechMalay_GC_8MalayMalayms_MY8 kHzGeneral Conversation266Malay General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Malay in MalaysiaDualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 316, Male 176 , Unknown 0
SpeechMalay_MA_16MalayMalayms_MY16 kHzMedia Audio344Malay Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 236, Male 626, Unknown 47
SpeechMalayalam_CC_8MalayalamMalayalam (In Pipeline) ml_INCall-Center60Malayalam (In Pipeline) Call-Center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechMalayalam_GCMalayalamMalayalam (In Pipeline) ml_INGeneral Conversation100Malayalam (In Pipeline) General Conversation dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechMalayalam_MAMalayalamMalayalam (In Pipeline) ml_INMedia Audio40Malayalam (In Pipeline) Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechMarathi_CC_8MarathiMarathi (In Pipeline) mr_INCall-Center60Marathi (In Pipeline) Call-Center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechMarathi_GCMarathiMarathi (In Pipeline) mr_INGeneral Conversation100Marathi (In Pipeline) General Conversation dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechMarathi_MAMarathiMarathi (In Pipeline) mr_INMedia Audio40Marathi (In Pipeline) Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechMexican_SM_48Spanish (Mexico)Spanish (Mexico)es-MX48 kHzScripted Monologue1,492Mexican Spanish Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1016 Male 1069 Unknown 95
SpeechNetherlands_SM_48DutchDutchnl-NL48 kHzScripted Monologue1,205Dutch Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1285 Male 531 Unknown 3
SpeechNew York English_CC_8New York EnglishNew York Englishen_US8 kHzCall-Center103New York English Call-center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 610, Male 532, Unknow 0
SpeechNew York English_GC_8New York EnglishNew York Englishen_US8 kHzGeneral Conversation107New York English General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 118, Male 114, Unknown 0
SpeechNew York English_MA_16New York EnglishNew York Englishen_US16 kHzMedia Audio140New York English Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 66, Male 230, Unknown 11
SpeechNew Zealand_GC_8New Zealand English New Zealand English en_NZ8 kHzGeneral Conversation148New Zealand English General Conversation dataUnscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, DualDesktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 167, male 121, Unknown 4
SpeechNew Zealand_MA_16New Zealand English New Zealand English en_NZ16 kHzMedia Audio400New Zealand English Media audioLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 367, male 678, Unknown 26
SpeechOriya_CC_8OriyaOriya (In Pipeline) or_INCall-Center60Oriya (In Pipeline) Call-Center dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechOriya_GCOriyaOriya (In Pipeline) or_INGeneral Conversation100Oriya (In Pipeline) General Conversation dataUnscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes,Desktop5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechOriya_MAOriyaOriya (In Pipeline) or_INMedia Audio40Oriya (In Pipeline) Media audio dataLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
SpeechPolish_MA_16PolishPolishpl_PL16 kHzMedia Audio269Polish Media audioLicensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutesMonoWeb Sourcing5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 173 Male 354 Unknown 6
SpeechPolish Poland_SM_48Polish (Poland)Polish (Poland)pl-PL48 kHzScripted Monologue1,482Polish Poland - Scripted MonologueSingle-utterance recordings, which tend to fall in the 5 to 30 second rangeMonoMobile App5.0.wav.jsonASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language ModellingFemale 1324 Male 701 Unknown 24
Speech | Punjabi_CC_8 | Punjabi | Punjabi (In Pipeline) | Punjabi |  | Call-Center | 60 | Punjabi (In Pipeline) Call-Center data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes |  | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Punjabi_GC | Punjabi | Punjabi (In Pipeline) | Punjabi |  | General Conversation | 100 | Punjabi (In Pipeline) General Conversation data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes |  | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Punjabi_MA | Punjabi | Punjabi (In Pipeline) | Punjabi |  | Media Audio | 40 | Punjabi (In Pipeline) Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes |  | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Russian_SM_48 | Russian | Russian | ru-RU | 48 kHz | Scripted Monologue | 2,398 | Russian Scripted Monologue | Single-utterance recordings, which tend to fall in the 5 to 30 second range | Mono | Mobile App | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 1689, Male 1937, Unknown 214
Speech | Scottish_GC_8 | Scottish (English Accent) | Scottish (English Accent) | en_AB | 8 kHz | General Conversation | 292 | Scottish General Conversation data | Unscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 285, Male 260, Unknown 3
Speech | Singapore_CC_8 | Singapore English | Singapore English | en_SG | 8 kHz | Call-Center | 218 | Singapore Call-Center data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 2139, Male 884, Unknown 21
Speech | Singapore_MA_16 | Singapore English | Singapore English | en_SG | 16 kHz | Media Audio | 247 | Singapore Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes | Mono | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 160, Male 455, Unknown 37
Speech | South African English_CC_8 | South African English | South African English | en_ZA | 8 kHz | Call-Center | 261 | South African English Call-Center data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 1274, Male 935, Unknown 1
Speech | South African English_MA_16 | South African English | South African English | en_ZA | 16 kHz | Media Audio | 251 | South African English Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes | Mono | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 235, Male 432, Unknown 36
Speech | Swahili_CC_8 | Swahili | Swahili | sw_KE | 8 kHz | Call-Center | 230 | Swahili Call-Center data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 611, Male 833, Unknown 0
Speech | Swahili_MA_16 | Swahili | Swahili | sw_KE | 16 kHz | Media Audio | 265 | Swahili Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes | Mono | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 118, Male 493, Unknown 25
Speech | Swedish_CC_8 | Swedish | Swedish | sv_SE | 8 kHz | Call-Center | 250 | Swedish Call-Center data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 1581, Male 727, Unknown 2
Speech | Swedish_MA_16 | Swedish | Swedish | sv_SE | 16 kHz | Media Audio | 278 | Swedish Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes | Mono | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 195, Male 500, Unknown 21
Speech | Tamil_CC_8 | Tamil | Tamil (In Pipeline) | ta_IN |  | Call-Center | 60 | Tamil (In Pipeline) Call-Center data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes |  | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Tamil_GC | Tamil | Tamil (In Pipeline) | ta_IN |  | General Conversation | 100 | Tamil (In Pipeline) General Conversation data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes |  | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Tamil_MA | Tamil | Tamil (In Pipeline) | ta_IN |  | Media Audio | 40 | Tamil (In Pipeline) Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes |  | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Telugu_GC_8 | Telugu | Telugu | te_IN | 8 kHz | General Conversation | 553 | Telugu General Conversation data | Unscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 574, Male 564, Unknown 0
Speech | Telugu_MA_16 | Telugu | Telugu | te_IN | 16 kHz | Media Audio | 648 | Telugu Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes | Mono | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 207, Male 963, Unknown 2
Speech | Telugu_CC_8 | Telugu | Telugu (In Pipeline) | te_IN |  | Call-Center | 30 | Telugu (In Pipeline) Call-Center data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes |  | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Telugu_GC | Telugu | Telugu (In Pipeline) | te_IN |  | General Conversation | 50 | Telugu (In Pipeline) General Conversation data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes |  | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Telugu_MA | Telugu | Telugu (In Pipeline) | te_IN |  | Media Audio | 20 | Telugu (In Pipeline) Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes |  | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling
Speech | Thai_GC_8 | Thai | Thai | th_TH | 8 kHz | General Conversation | 183 | Thai General Conversation | Unscripted telephonic conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, An informal register used between friends | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 338, Male 96, Unknown 8
Speech | Thai_MA_8 | Thai | Thai | th_TH | 16 kHz | Media Audio | 173 | Thai Media audio | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes | Mono | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 143, Male 502, Unknown 26
Speech | Turkish Turkey_SM_48 | Turkish Turkey | Turkish Turkey | tr-TR | 48 kHz | Scripted Monologue | 2,027 | Turkish Turkey | Single-utterance recordings, which tend to fall in the 5 to 30 second range | Mono | Mobile App | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 1561, Male 1241, Unknown 31
Speech | Vietnamese_GC_8 | Vietnamese | Vietnamese | vi_VN | 8 kHz | General Conversation | 295 | Vietnamese General Conversation data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes, Northern (e.g., Hanoi), Central, and Southern (e.g., Ho Chi Minh City) | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 400, Male 380, Unknown 2
Speech | Vietnamese_MA_16 | Vietnamese | Vietnamese | vi_VN | 16 kHz | Media Audio | 257 | Vietnamese Media audio data | Licensable Public domain audio/video files such as interviews, podcasts etc - 1 to 5 people. Approx. Audio Duration (Range) 15-60 minutes | Mono | Web Sourcing | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 249, Male 200, Unknown 45
Speech | Welsh_GC_8 | Welsh (English Accent) | Welsh (English Accent) | en_WL | 8 kHz | General Conversation | 278 | Welsh General Conversation data | Unscripted, synthetic telephonic conversation between "agent" and "customer", Approx. Audio Duration (Range) 5-15 Minutes | Dual | Desktop | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Female 270, Male 324, Unknown 0
Speech | UK English_WW_16 | UK English | UK English | en_uk | 16 kHz | Wake Word | 200 Speakers | Wake Word UK English | Keyphrases collection of data: 200 speakers; 4 unique keyphrases per speaker; 25-30 repeated keyphrases recordings per unique keyphrase; 25-30 audio files per unique keyphrase; 120 total recorded utterances per speaker | 1 channel | Mobile App | 5.0 | .wav, .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Gender: 50% male, 50% female, +/- 10%

Services Offered

Audio data collection alone does not cover everything a comprehensive AI setup needs. At Shaip, you can also consider the following services to broaden what your models can handle:

Text Data Collection Services

The true value of Shaip's cognitive data collection services is that they give organizations the key to unlocking critical information found within unstructured data.

Image Data Collection Services

Make sure your computer vision model identifies every image accurately, so you can train next-generation AI models with confidence.

Video Data Collection Services

Focus on computer vision alongside NLP, training your models to identify objects, individuals, deterrents, and other visual elements accurately.

Want to build your own audio dataset?

Connect with our in-house speech data collection expert to set up an audio repository that best fits your requirements.

Speech data collection for an ML model is the process of gathering audio recordings of spoken language. These recordings are used to train and refine machine learning algorithms, particularly those that understand and process human voices.

When aiming to collect audio data for Automatic Speech Recognition (ASR), you should start by defining your project’s specific needs, including the desired language, accent, and type of speech. After setting these parameters, ensure you obtain all necessary permissions to respect user privacy. Then, use appropriate recording devices or software to capture clear audio samples. Each recording should be meticulously annotated with its transcription or other pertinent metadata and stored systematically for effortless access.
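As a rough illustration of the annotation and storage step, the sketch below pairs each recording with a JSON sidecar file that carries its transcription and metadata. The field names, the `RecordingMetadata` class, and the helper functions are hypothetical and are not Shaip's actual schema; the only assumptions taken from this page are that audio is delivered as .wav with accompanying .json files.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RecordingMetadata:
    """Hypothetical sidecar record for one audio file (illustrative fields only)."""
    audio_file: str         # path to the .wav file
    language_code: str      # e.g. "pl_PL"
    accent: str             # free-text accent or regional label
    speech_type: str        # e.g. "scripted monologue", "call-center"
    sample_rate_hz: int     # e.g. 8000, 16000, 48000
    consent_obtained: bool  # whether the speaker's permission is on record
    transcription: str      # verbatim text of what was said


def to_sidecar_json(meta: RecordingMetadata) -> str:
    """Serialize the metadata to JSON, refusing records that lack consent."""
    if not meta.consent_obtained:
        raise ValueError(f"no recorded consent for {meta.audio_file}")
    return json.dumps(asdict(meta), ensure_ascii=False, indent=2)


def sidecar_name(audio_file: str) -> str:
    """Name the JSON file after the audio file so the two stay together."""
    base = audio_file.rsplit(".", 1)[0]
    return base + ".json"


# Example usage with made-up values.
example = RecordingMetadata(
    audio_file="clips/utt_0001.wav",
    language_code="pl_PL",
    accent="Warsaw",
    speech_type="scripted monologue",
    sample_rate_hz=48000,
    consent_obtained=True,
    transcription="Dzień dobry.",
)
print(sidecar_name(example.audio_file))
print(to_sidecar_json(example))
```

Keeping the metadata beside the audio under the same base name is one simple way to make recordings easy to find and filter later; a database or manifest file would work just as well.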

A speech dataset in machine learning is pivotal for training, testing, and validating models tailored to recognize, transcribe, or interpret spoken language. Such datasets pave the way for a myriad of applications, from voice assistants and transcription services to voice biometrics.
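One common way to divide a speech dataset into training, validation, and test portions is to split by speaker, so that no voice appears in more than one portion. The sketch below is an assumption about how such a split could be done, not a description of how Shaip datasets are delivered; the function names and the 10% defaults are illustrative.

```python
import hashlib


def assign_split(speaker_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a speaker to 'train', 'validation', or 'test' deterministically."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_recordings(recordings):
    """Group (speaker_id, audio_file) pairs by split.

    Every file from a given speaker lands in the same split, because the
    split depends only on the speaker ID.
    """
    splits = {"train": [], "validation": [], "test": []}
    for speaker_id, audio_file in recordings:
        splits[assign_split(speaker_id)].append(audio_file)
    return splits


# Example usage with made-up speaker IDs and file names.
recordings = [
    ("spk_001", "spk_001_a.wav"),
    ("spk_001", "spk_001_b.wav"),
    ("spk_002", "spk_002_a.wav"),
    ("spk_003", "spk_003_a.wav"),
]
for name, files in split_recordings(recordings).items():
    print(name, files)
```

Hashing the speaker ID keeps the assignment stable across runs and across machines, which matters when new recordings are added to the dataset later.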

To collect accurate data across diverse languages and accents, collaborate with native speakers from each target linguistic background. Aim for a varied and representative sample that covers a broad spectrum of demographics. Use standardized recording equipment in uniform environments to keep audio quality consistent. Just as importantly, annotate each recording with a detailed transcription and metadata that identify its specific language and accent.
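A quick way to check whether a collection is representative is to count recordings per language and accent and flag groups that fall below a chosen share. The sketch below assumes each recording's metadata is a dictionary with hypothetical `language` and `accent` keys; the `coverage_report` function and the threshold are illustrative and should be set per project.

```python
from collections import Counter


def coverage_report(records, min_share=0.2):
    """Report each (language, accent) group's count, share, and whether it
    falls below min_share of all recordings."""
    counts = Counter((r["language"], r["accent"]) for r in records)
    total = sum(counts.values())
    return {
        group: {
            "count": n,
            "share": n / total,
            "under_represented": n / total < min_share,
        }
        for group, n in counts.items()
    }


# Example usage with made-up metadata.
records = [
    {"language": "vi_VN", "accent": "Northern"},
    {"language": "vi_VN", "accent": "Northern"},
    {"language": "vi_VN", "accent": "Northern"},
    {"language": "vi_VN", "accent": "Southern"},
    {"language": "vi_VN", "accent": "Southern"},
    {"language": "vi_VN", "accent": "Southern"},
    {"language": "vi_VN", "accent": "Southern"},
    {"language": "vi_VN", "accent": "Central"},
]
for group, stats in coverage_report(records).items():
    print(group, stats)
```

The same counting approach extends to other metadata such as gender or age band by adding those keys to the grouping tuple.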