Now Get 50% OFF* on Conversational AI Off-the-Shelf Datasets
Speech & Audio dataset for chatbots, voice assistants, speech-enabled devices.
*Limited Period Offer
Details | Keyword | Off-the-shelf Language Dataset | Call Center Conversations 8 kHz* | Generic Conversations 8 kHz* | Media & Podcasts 16 kHz* | Utterance/Scripted Monologue 16 kHz* | Total Volume in Hours | Dialects covered | Audio Format | Text Transcription Format | Use Case | Source | CTA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afrikaans | Afrikaans Audio Dataset | 600 | 900 | 1500 | Afrikaans spoken in Africa | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Arabic | Arabic Audio Dataset | 800 | 1500 | 2300 | Arabic from Gulf countries | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Chinese | Chinese Audio Dataset | 2000 | 2000 | Chinese from China | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
Danish | Danish Audio Dataset | 400 | 600 | 2000 | 3000 | Danish from Denmark | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Dutch | Dutch Audio Dataset | 2000 | 2000 | Dutch from the Netherlands | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
English - AAVE Accent | English - AAVE (African American Vernacular English) Audio Dataset | 500 | 500 | 1000 | Covers both the vernacular variety (sometimes known as AAVE, typically spoken by the vast majority of working- and middle-class African Americans) and the more standard variety (typically spoken by middle-class African Americans in formal and public situations), with a stronger emphasis on the vernacular. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
English - Boston/New York Accent | English - Boston/New York Audio Dataset | 225 | 225 | 350 | 800 | A collection of several regional accents spoken in and around the cities of Boston, New York, and Philadelphia. These accents might sound similar to non-locals, but they are distinct from other American accents. Despite some local vocabulary that differs from other parts of the English-speaking world, these accents are mutually intelligible with English spoken elsewhere. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
English - Chinese Accent | English - Chinese Accented Audio Dataset | 150 | 300 | 450 | Speakers who speak Chinese as their first language and who moved/immigrated to the United States as teenagers/adults and learned English as their second language. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
English - Deep South Accent | English - Deep South Audio Dataset | 275 | 275 | 450 | 1000 | Speakers from (i) Texas; (ii) North Carolina, South Carolina, Georgia; (iii) New Orleans; (iv) Florida panhandle; (v) Tennessee, Arkansas, Michigan. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
English - Hispanic Accent | English - Hispanic Accented Audio Dataset | 400 | 400 | 800 | Hispanic English refers to the varieties of US English spoken by Hispanic Americans of diverse national heritage. The main focus was on Mexican Americans, but speakers of different national origins (e.g. Mexico, Puerto Rico, Dominican Republic, Ecuador, Cuba, etc.) and from different regions (e.g. California, New York, Florida) were included as well. Speakers included those who speak Spanish as a first language as well as speakers of Hispanic origin who speak Spanish as a heritage language. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
English - New Zealand Accent | English - New Zealand Audio Dataset | 250 | 750 | 1000 | Speakers on both islands, including a mix of younger speakers (<40 years old) and older speakers (>40 years old) in equal proportions. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
English - Singapore Accent | English - Singapore Audio Dataset | 400 | 600 | 1000 | Both Standard Singapore English and Colloquial Singapore English. Singaporeans of different ethnic backgrounds (e.g. Chinese, Malay, Indian, etc) and of different educational levels. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
English - South Africa Accent | English - South Africa Audio Dataset | 400 | 600 | 1000 | Representatives from various socioeconomic classes and ethnological backgrounds (e.g. South Africans of European, African, Indian, or mixed background). | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
English - Irish Accent | English - Irish Audio Dataset | 500 | 500 | English spoken in Ireland | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
English - Scottish Accent | English - Scottish Audio Dataset | 800 | 800 | English spoken in Scotland | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
English - Welsh Accent | English - Welsh Audio Dataset | 800 | 800 | Welsh English | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
French Canadian | French Canadian Audio Dataset | 1000 | 1000 | Canadian French | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
Hebrew | Hebrew Audio Dataset | 750 | 750 | 1500 | Hebrew in Israel | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Indonesian | Indonesian Audio Dataset | 1000 | 1000 | 2000 | Bahasa Indonesia | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Japanese | Japanese Audio Dataset | 2000 | 2000 | Japanese from Japan | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
Korean | Korean Audio Dataset | 100 | 200 | 1500 | 1800 | Speakers spread throughout South Korea. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Malay | Malay Audio Dataset | 500 | 500 | 1000 | Malay in Malaysia | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Mexican Spanish | Mexican Spanish Audio Dataset | 1250 | 1250 | Spanish from Mexico | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
Polish | Polish Audio Dataset | 250 | 2000 | 2250 | Polish from Poland | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Russian | Russian Audio Dataset | 2000 | 2000 | Russian from Russia | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
Swahili | Swahili Audio Dataset | 350 | 650 | 1000 | South African and Kenyan Swahili | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Swedish | Swedish Audio Dataset | 350 | 650 | 1000 | Swedish in Sweden | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Taiwan Chinese | Taiwan Chinese Audio Dataset | 1000 | 1000 | Chinese from Taiwan | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
Thai | Thai Audio Dataset | 350 | 450 | 800 | An informal register used between friends, | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Turkish | Turkish Audio Dataset | 2000 | 2000 | Turkish from Turkey | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||||
Vietnamese | Vietnamese Audio Dataset | 600 | 400 | 1000 | Northern (e.g., Hanoi), Central, and Southern (e.g., Ho Chi Minh City). | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Hindi | Hindi Audio Dataset | 800 | 2000 | 2800 | Hindi in India, specifically in the North, East, and West regions | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
Hinglish | Indian English Audio Dataset | 300 | 500 | 800 | Collected from urban Indian cities that have become financial hubs of the country owing to growing economic opportunities, such as Noida, Delhi, Dehradun, Chandigarh, Mumbai, Kolkata, Bangalore, Pune, Chennai, and Hyderabad. | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||
English | English Audio Dataset | 700 | 700 | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | |||||
Kannada | Kannada Audio Dataset | 60 | 100 | 40 | 200 | Kannada from Karnataka, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Malayalam | Malayalam Audio Dataset | 60 | 100 | 40 | 200 | Malayalam from Kerala, Lakshadweep and Puducherry | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Oriya | Oriya Audio Dataset | 60 | 100 | 40 | 200 | Oriya from parts of Odisha, West Bengal, Jharkhand and Chhattisgarh | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Punjabi | Punjabi Audio Dataset | 60 | 100 | 40 | 200 | Punjabi from Punjab, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Tamil | Tamil Audio Dataset | 60 | 100 | 240 | 400 | Tamil from Tamil Nadu, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Telugu | Telugu Audio Dataset | 100 | 950 | 950 | 2000 | Telugu from Andhra Pradesh, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Bengali | Bengali Audio Dataset | 60 | 100 | 40 | 200 | Bengali from West Bengal, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Gujarati | Gujarati Audio Dataset | 60 | 100 | 40 | 200 | Gujarati from Gujarat, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Marathi | Marathi Audio Dataset | 60 | 100 | 40 | 200 | Marathi from Maharashtra, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact | ||
Assamese | Assamese Audio Dataset | 60 | 100 | 40 | 200 | Assamese from Assam, India | .wav | .json | ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling | Shaip | Contact Contact |
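
The table lists audio as .wav at either 8 kHz or 16 kHz, so a quick sanity check on any delivered clip is to read its header and confirm the sample rate. The following is a minimal sketch using Python's standard `wave` module. It assumes the files are uncompressed PCM, which this page does not state, and the function names (`wav_sample_rate`, `matches_expected_rate`, `make_silent_wav`) are hypothetical helpers rather than part of any Shaip tooling.

```python
import io
import wave


def wav_sample_rate(source):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM .wav.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def matches_expected_rate(source, expected_hz):
    """True if the clip's header reports the expected sample rate."""
    rate, _, _ = wav_sample_rate(source)
    return rate == expected_hz


def make_silent_wav(rate_hz, seconds=1):
    """Build an in-memory mono 16-bit silent clip.

    Assumption: this is only a stand-in for a real delivered file.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * seconds)
    buf.seek(0)
    return buf
```

With a real file, `matches_expected_rate("clip.wav", 8000)` would apply the same check to a clip from one of the 8 kHz categories, where `clip.wav` is a placeholder path.
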
Deep expertise in Conversational AI
Conversational AI systems, whether chatbots or virtual/digital assistants, are only as smart as the technology and data behind them. At Shaip, we offer a broad, diverse set of audio datasets for Natural Language Processing (NLP) that mimic conversations with real people and let you bring your AI to life. With our deep understanding, we help you build and localize AI-enabled speech models with precision, using rich, structured datasets in multiple languages from across the globe. We offer multilingual audio collection, audio transcription, and audio annotation services based on your requirements, fully customizing the desired intents, utterances, and demographic distribution.
Scripted Speech Collection
Spontaneous Speech collection
Audio Data Transcription
Data Labeling & Annotation
Shaip lets you accurately train your Conversational AI Platform so it can:
- Seamlessly talk, text, and chat across multiple channels.
- Learn from existing interactions (chats, voice transcripts, transactions, etc.) and make suggestions and converse based on what it has learned.
- Understand the intent behind human speech and resolve ambiguity in human language.
- Interact with users one-on-one, identify them, and remember past conversations.
A World Leader in Conversational AI Training Data
Hours of audio data in 100+ languages – Sourced, Transcribed & Annotated
Speech Data Licensing
20k+ hours of speech data in 40+ languages and dialects, covering 55+ topics from different domains, e.g., call centers, debates, general conversations, speeches, and podcasts.
Speech Data Collection
Collect audio & speech data (monologue, 2-person conversation, human-bot chat) in over 100 languages from across the world, customized to your AI requirements.
Speech Data Transcription
Cost-effective audio transcription or audio annotation through a strong workforce of 30,000 collaborators, with guaranteed turnaround time (TAT), accuracy, and savings.
Accelerate your Conversational AI app development with Audio Collection & Audio Annotation Services
The Shaip Advantage
Scale
We can source, scale, and deliver audio data from across the world in multiple languages and dialects based on your requirements.
Expertise
We have the expertise needed for accurate and unbiased data collection, transcription, and gold-standard annotation.
Network
A network of 30,000+ qualified contributors who can be quickly assigned data collection tasks to build AI training models and scale up services.
Technology
We have a fully AI-based platform with proprietary tools and processes that supports workflow management 24/7.
Agility
We adapt quickly to changes in customer requirements and help accelerate AI development with quality speech data, 5-10x faster than the competition.
Security
We give the utmost importance to data security and privacy, and we are certified to handle highly regulated, sensitive data.
What We Do Best
Training Data
Get the highest quality labeled data in a fraction of the time. It’s gold-standard, reliable and ready to train your AI and ML models to attain the highest levels of performance.
Data Collection, Labeling & Annotation
With Shaip you get 15+ years of proven expertise in collecting, transcribing, and annotating quality data. With our global labor force we can collect data from across the globe, then provide labeling and annotation services with the level of skill and expertise your data requires.
Data Catalogs & Licensing
With our vast inventory of millions of datasets, you can collect and organize data as required. We can then license that quality data for your specific AI and ML use requirements. Plus, this data is available at a fraction of what it would cost to create it yourself.
Want to build your own data set?
Contact us now to learn how we can collect a custom data set for your unique AI solution.