Now Get 50% OFF* on Conversational AI Off-the-Shelf Datasets

Speech & Audio dataset for chatbots, voice assistants, speech-enabled devices.

*Limited Period Offer

Recording types: Call Center Conversations 8 kHz*, Generic Conversations 8 kHz*, Media & Podcasts 16 kHz*, Utterance/Scripted Monologue 16 kHz*. Each row lists hours only for the recording types available in that dataset.

Details | Off-the-shelf Language | Dataset | Hours by recording type | Total Volume in Hours | Dialects covered | Audio Format | Text Transcription Format | Use Case | Source
Speech | Afrikaans | Afrikaans Audio Dataset | 600, 900 | 1500 | Afrikaans spoken in Africa | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Arabic | Arabic Audio Dataset | 800, 1500 | 2300 | Arabic from Gulf countries | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Chinese | Chinese Audio Dataset | 2000 | 2000 | Chinese from China | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Danish | Danish Audio Dataset | 400, 600, 2000 | 3000 | Danish from Denmark | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Dutch | Dutch Audio Dataset | 2000 | 2000 | Dutch from the Netherlands | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - AAVE Accent | English - AAVE (African American Vernacular English) Audio Dataset | 500, 500 | 1000 | The vernacular variety (sometimes known as AAVE, typically spoken by the vast majority of working- and middle-class African Americans) and the more standard variety (typically spoken by middle-class African Americans in formal and public situations) but with a stronger emphasis on the vernacular. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Boston/New York Accent | English - Boston/New York Audio Dataset | 225, 225, 350 | 800 | This is a collection of several regional accents spoken in and around the cities of Boston, New York, and Philadelphia. These accents might sound similar to non-locals, but are distinct from other American accents. Despite some local vocabulary that is different from other parts of the English-speaking world, these accents are mutually intelligible with English spoken elsewhere. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Chinese Accent | English - Chinese Accented Audio Dataset | 150, 300 | 450 | Speakers who speak Chinese as their first language and who moved/immigrated to the United States as teenagers/adults and learned English as their second language. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Deep South Accent | English - Deep South Audio Dataset | 275, 275, 450 | 1000 | Speakers from (i) Texas; (ii) North Carolina, South Carolina, Georgia; (iii) New Orleans; (iv) Florida panhandle; (v) Tennessee, Arkansas, Michigan. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Hispanic Accent | English - Hispanic Accented Audio Dataset | 400, 400 | 800 | Hispanic English refers to the varieties of US English spoken by Hispanic Americans of diverse national heritage. The main focus was on Mexican Americans, with speakers of different national origins (e.g. Mexico, Puerto Rico, Dominican Republic, Ecuador, Cuba, etc.) and from different regions (e.g. California, New York, Florida) as well. Speakers included those who speak Spanish as a first language as well as speakers of Hispanic origin who speak Spanish as a heritage language. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - New Zealand Accent | English - New Zealand Audio Dataset | 250, 750 | 1000 | Speakers on both islands, including a mix of younger speakers (<40 years old) and older speakers (>40 years old) in equal proportions. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Singapore Accent | English - Singapore Audio Dataset | 400, 600 | 1000 | Both Standard Singapore English and Colloquial Singapore English. Singaporeans of different ethnic backgrounds (e.g. Chinese, Malay, Indian, etc.) and of different educational levels. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - South Africa Accent | English - South Africa Audio Dataset | 400, 600 | 1000 | Representatives from various socioeconomic classes and ethnological backgrounds (e.g. South Africans of European, African, Indian, or mixed background). | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Irish Accent | English - Irish Audio Dataset | 500 | 500 | English spoken in Ireland | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Scottish Accent | English - Scottish Audio Dataset | 800 | 800 | English spoken by Scottish speakers | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English - Welsh Accent | English - Welsh Audio Dataset | 800 | 800 | Welsh English | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | French Canadian | French Canadian Audio Dataset | 1000 | 1000 | Canadian French | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Hebrew | Hebrew Audio Dataset | 750, 750 | 1500 | Hebrew in Israel | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Indonesian | Indonesian Audio Dataset | 1000, 1000 | 2000 | Bahasa Indonesian | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Japanese | Japanese Audio Dataset | 2000 | 2000 | Japanese from Japan | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Korean | Korean Audio Dataset | 100, 200, 1500 | 1800 | Speakers spread throughout South Korea. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Malay | Malay Audio Dataset | 500, 500 | 1000 | Malay in Malaysia | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Mexican Spanish | Mexican Spanish Audio Dataset | 1250 | 1250 | Mexican from Mexico | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Polish | Polish Audio Dataset | 250, 2000 | 2250 | Polish from Poland | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Russian | Russian Audio Dataset | 2000 | 2000 | Russian from Russia | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Swahili | Swahili Audio Dataset | 350, 650 | 1000 | South African and Kenyan Swahili | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Swedish | Swedish Audio Dataset | 350, 650 | 1000 | Swedish in Sweden | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Taiwan Chinese | Taiwan Chinese Audio Dataset | 1000 | 1000 | Chinese from Taiwan | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Thai | Thai Audio Dataset | 350, 450 | 800 | An informal register used between friends | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Turkish | Turkish Audio Dataset | 2000 | 2000 | Turkish from Turkey | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Vietnamese | Vietnamese Audio Dataset | 600, 400 | 1000 | Northern (e.g., Hanoi), Central, and Southern (e.g., Ho Chi Minh City). | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Hindi | Hindi Audio Dataset | 800, 2000 | 2800 | Hindi in India, specifically in the North, East and West regions | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Hinglish | Indian English Audio Dataset | 300, 500 | 800 | Collected from urban Indian cities that are financial hubs of the country due to growing economic opportunities, such as Noida, Delhi, Dehradun, Chandigarh, Mumbai, Kolkata, Bangalore, Pune, Chennai, Hyderabad, etc. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | English | English Audio Dataset | 700 | 700 |  | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Kannada | Kannada Audio Dataset | 60, 100, 40 | 200 | Kannada from Karnataka, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Malayalam | Malayalam Audio Dataset | 60, 100, 40 | 200 | Malayalam from Kerala, Lakshadweep and Puducherry | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Oriya | Oriya Audio Dataset | 60, 100, 40 | 200 | Oriya from parts of Odisha, West Bengal, Jharkhand and Chhattisgarh | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Punjabi | Punjabi Audio Dataset | 60, 100, 40 | 200 | Punjabi from Punjab, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Tamil | Tamil Audio Dataset | 60, 100, 240 | 400 | Tamil from Tamil Nadu, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Telugu | Telugu Audio Dataset | 100, 950, 950 | 2000 | Telugu from Andhra Pradesh, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Bengali | Bengali Audio Dataset | 60, 100, 40 | 200 | Bengali from West Bengal, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Gujarati | Gujarati Audio Dataset | 60, 100, 40 | 200 | Gujarati from Gujarat, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Marathi | Marathi Audio Dataset | 60, 100, 40 | 200 | Marathi from Maharashtra, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip
Speech | Assamese | Assamese Audio Dataset | 60, 100, 40 | 200 | Assamese from Assam, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip

Deep expertise in Conversational AI

Conversational AI systems, whether chatbots or virtual/digital assistants, are only as smart as the technology and data behind them. At Shaip, we offer a broad, diverse set of audio datasets for Natural Language Processing (NLP) that mimic conversations with real people and help you bring your AI to life. With our deep understanding of the field, we help you build and localize AI-enabled speech models with precision, using rich, structured datasets in multiple languages from across the globe. We offer multilingual audio collection, audio transcription, and audio annotation services based on your requirements, and we can fully customize the desired intents, utterances, and demographic distribution.

Scripted Speech Collection

Spontaneous Speech Collection

Audio Data Transcription

Data Labeling & Annotation

Shaip lets you accurately train your Conversational AI Platform so it can:

  • Seamlessly talk, text, and chat across multiple channels.
  • Learn from existing interactions such as chats, voice transcripts, and transactions, then make suggestions and converse based on what it has learned.
  • Understand the intent behind human speech and remove ambiguity in understanding human language.
  • Interact with users one-on-one, identify them, and remember past conversations.

A World Leader in Conversational AI Training Data

Hours of audio data in 100+ languages – Sourced, Transcribed & Annotated

Speech Data Licensing

20k+ hours of speech data in 40+ languages and dialects, covering 55+ topics from different domains, e.g., call center, debates, general conversations, speeches, and podcasts.

Speech Data Collection

Collect audio & speech data (monologue, 2-person conversation, human-bot chat) in over 100 languages from across the world, customized to your AI requirement.

Speech Data Transcription

Cost-effective audio transcription and audio annotation through a strong workforce of 30,000 collaborators, with guaranteed turnaround time (TAT), accuracy, and savings.

Accelerate your Conversational AI app development with Audio Collection & Audio Annotation Services

The Shaip Advantage

Scale

We can source, scale, and deliver audio data from across the world in multiple languages and dialects based on your requirements.

Expertise

We have the expertise to deliver accurate and unbiased data collection, transcription, and gold-standard annotation.

Network

A network of 30,000+ qualified contributors who can be quickly assigned data collection tasks to build AI training models and scale up services.

Technology

We have a fully AI-based platform with proprietary tools and processes that manage workflows around the clock (24x7).

Agility

We adapt quickly to changes in customer requirements and help accelerate AI development with quality speech data 5-10x faster than the competition.

Security

We give the utmost importance to data security and privacy, and we are certified to handle highly regulated, sensitive data.

What We Do Best

Training Data

Get the highest quality labeled data in a fraction of the time. It’s gold-standard, reliable and ready to train your AI and ML models to attain the highest levels of performance.

Data Collection, Labeling & Annotation

With Shaip, you get 15+ years of proven expertise in collecting, transcribing, and annotating quality data. Our global workforce can collect data from across the globe, then provide labeling and annotation services at the level of skill and expertise your data requires.

Data Catalogs & Licensing

Our vast inventory holds millions of datasets that you can select and organize as required. We can then license that quality data for your specific AI and ML use requirements. Plus, this data is available at a fraction of what it would cost to create it yourself.

Want to build your own data set?

Contact us now to learn how we can collect a custom data set for your unique AI solution.