Now Get 50% OFF* on Conversational AI Off-the-Shelf Datasets

Speech and audio datasets for chatbots, voice assistants, and speech-enabled devices.

*Limited Period Offer

Trusted by Industry Leaders

Keyword	Off-the-shelf Language Dataset	Call Center Conversations 8 kHz*	Generic Conversations 8 kHz*	Media & Podcasts 16 kHz*	Utterance / Scripted Monologue 16 kHz*	Total Volume in Hours	Dialects Covered	Audio Format	Text Transcription Format	Use Case	Source
Afrikaans	Afrikaans Audio Dataset		600	900		1500	Afrikaans spoken in Africa	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Arabic	Arabic Audio Dataset		800		1500	2300	Arabic from Gulf countries	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Chinese	Chinese Audio Dataset				2000	2000	Chinese from China	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Danish	Danish Audio Dataset		400	600	2000	3000	Danish from Denmark	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Dutch	Dutch Audio Dataset				2000	2000	Dutch from the Netherlands	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - AAVE Accent	English - AAVE (African American Vernacular English) Audio Dataset	500		500		1000	The vernacular variety (sometimes known as AAVE, typically spoken by the vast majority of working- and middle-class African Americans) and the more standard variety (typically spoken by middle-class African Americans in formal and public situations), with a stronger emphasis on the vernacular.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Boston/New York Accent	English - Boston/New York Audio Dataset	225	225	350		800	A collection of several regional accents spoken in and around the cities of Boston, New York, and Philadelphia. These accents might sound similar to non-locals, but they are distinct from other American accents. Despite some local vocabulary that differs from other parts of the English-speaking world, these accents are mutually intelligible with English spoken elsewhere.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Chinese Accent	English - Chinese Accented Audio Dataset	150		300		450	Speakers who speak Chinese as their first language, moved or immigrated to the United States as teenagers or adults, and learned English as their second language.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Deep South Accent	English - Deep South Audio Dataset	275	275	450		1000	Speakers from (i) Texas; (ii) North Carolina, South Carolina, Georgia; (iii) New Orleans; (iv) Florida panhandle; (v) Tennessee, Arkansas, Michigan.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Hispanic Accent	English - Hispanic Accented Audio Dataset	400		400		800	Hispanic English refers to the varieties of US English spoken by Hispanic Americans of diverse national heritage. The main focus was on Mexican Americans, but speakers of different national origins (e.g. Mexico, Puerto Rico, Dominican Republic, Ecuador, Cuba, etc.) and from different regions (e.g. California, New York, Florida) were included as well. The speakers include those who speak Spanish as a first language as well as speakers of Hispanic origin who speak Spanish as a heritage language.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - New Zealand Accent	English - New Zealand Audio Dataset		250	750		1000	Speakers on both islands, including a mix of younger speakers (<40 years old) and older speakers (>40 years old) in equal proportions.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Singapore Accent	English - Singapore Audio Dataset	400		600		1000	Both Standard Singapore English and Colloquial Singapore English. Singaporeans of different ethnic backgrounds (e.g. Chinese, Malay, Indian, etc.) and of different educational levels.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - South Africa Accent	English - South Africa Audio Dataset	400		600		1000	Representatives from various socioeconomic classes and ethnological backgrounds (e.g. South Africans of European, African, Indian, or mixed background).	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Irish Accent	English - Irish Audio Dataset		500			500	English spoken in Ireland	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Scottish Accent	English - Scottish Audio Dataset		800			800	English spoken in Scotland	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English - Welsh Accent	English - Welsh Audio Dataset		800			800	Welsh English	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
French Canadian	French Canadian Audio Dataset				1000	1000	Canadian French	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Hebrew	Hebrew Audio Dataset		750	750		1500	Hebrew in Israel	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Indonesian	Indonesian Audio Dataset		1000	1000		2000	Bahasa Indonesia	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Japanese	Japanese Audio Dataset				2000	2000	Japanese from Japan	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Korean	Korean Audio Dataset	100		200	1500	1800	Speakers spread throughout South Korea.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Malay	Malay Audio Dataset		500	500		1000	Malay in Malaysia	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Mexican Spanish	Mexican Spanish Audio Dataset				1250	1250	Mexican Spanish from Mexico	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Polish	Polish Audio Dataset			250	2000	2250	Polish from Poland	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Russian	Russian Audio Dataset				2000	2000	Russian from Russia	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Swahili	Swahili Audio Dataset	350		650		1000	South African and Kenyan Swahili	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Swedish	Swedish Audio Dataset	350		650		1000	Swedish in Sweden	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Taiwan Chinese	Taiwan Chinese Audio Dataset				1000	1000	Chinese from Taiwan	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Thai	Thai Audio Dataset		350	450		800	An informal register used between friends,	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Turkish	Turkish Audio Dataset				2000	2000	Turkish from Turkey	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Vietnamese	Vietnamese Audio Dataset		600	400		1000	Northern (e.g., Hanoi), Central, and Southern (e.g., Ho Chi Minh City).	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Hindi	Hindi Audio Dataset			800	2000	2800	Hindi in India, specifically in the North, East and West regions	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Hinglish	Indian English Audio Dataset	300		500		800	Collected from urban Indian cities that are financial hubs of the country due to growing economic opportunities, such as Noida, Delhi, Dehradun, Chandigarh, Mumbai, Kolkata, Bangalore, Pune, Chennai, Hyderabad, etc.	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
English	English Audio Dataset			700		700		.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Kannada	Kannada Audio Dataset	60	100	40		200	Kannada from Karnataka, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Malayalam	Malayalam Audio Dataset	60	100	40		200	Malayalam from Kerala, Lakshadweep and Puducherry	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Oriya	Oriya Audio Dataset	60	100	40		200	Oriya from parts of Odisha, West Bengal, Jharkhand and Chhattisgarh	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Punjabi	Punjabi Audio Dataset	60	100	40		200	Punjabi from Punjab, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Tamil	Tamil Audio Dataset	60	100	240		400	Tamil from Tamil Nadu, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Telugu	Telugu Audio Dataset	100	950	950		2000	Telugu from Andhra Pradesh, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Bengali	Bengali Audio Dataset	60	100	40		200	Bengali from West Bengal, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Gujarati	Gujarati Audio Dataset	60	100	40		200	Gujarati from Gujarat, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Marathi	Marathi Audio Dataset	60	100	40		200	Marathi from Maharashtra, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
Assamese	Assamese Audio Dataset	60	100	40		200	Assamese from Assam, India	.wav	.json	ASR, Virtual Assistant, Chatbot, Conversational AI, Speech Analytics, TTS, Language Modelling	Shaip
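
The catalog above lists audio as .wav at 8 kHz (call center and generic conversations) or 16 kHz (media, podcasts, and scripted monologues). As an illustrative sketch only, and not part of Shaip's tooling, a delivered file could be checked against its listed sample rate with Python's standard `wave` module. The function names `wav_info` and `matches_listing` are hypothetical, and the sketch assumes uncompressed PCM WAV files.

```python
import wave


def wav_info(path_or_file):
    """Return (sample_rate_hz, duration_seconds) for a PCM .wav file.

    Accepts a file path or an open binary file object.
    """
    with wave.open(path_or_file, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return rate, duration


def matches_listing(path_or_file, expected_rate_hz):
    """Check a file against the catalog's listed sample rate.

    expected_rate_hz would be 8000 or 16000 for the columns above.
    """
    rate, _ = wav_info(path_or_file)
    return rate == expected_rate_hz
```

Summing the durations returned by `wav_info` and dividing by 3600 gives a rough hour count to compare with the "Total Volume in Hours" column.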

Deep expertise in Conversational AI

Conversational AI systems, chatbots, and virtual or digital assistants are only as smart as the technology and data behind them. At Shaip, we offer a broad, diverse set of audio datasets for Natural Language Processing (NLP) that mimic conversations with real people and let you bring your AI to life. With our deep understanding of conversational AI, we help you build and localize AI-enabled speech models precisely, using rich, structured datasets in multiple languages from across the globe. We offer multilingual audio collection, audio transcription, and audio annotation services based on your requirements, with fully customized intents, utterances, and demographic distribution.

Scripted Speech Collection

Spontaneous Speech Collection

Audio Data Transcription

Data Labeling & Annotation

Shaip lets you accurately train your Conversational AI Platform so it can:

Seamlessly talk, text, and chat across multiple channels.
Learn from existing interactions such as chats, voice transcripts, and transactions, then make suggestions and converse based on what it has learned.
Understand the intent behind human speech and resolve ambiguity in human language.
Interact with users one-on-one and, with training, identify them and remember past conversations.

A World Leader in Conversational AI Training Data

Hours of audio data in 100+ languages – Sourced, Transcribed & Annotated

Speech Data Licensing

20k+ hours of speech data in 40+ languages and dialects, covering 55+ topics from different domains, e.g. call centers, debates, general conversations, speeches, and podcasts.

Speech Data Collection

Collect audio & speech data (monologue, 2-person conversation, human-bot chat) in over 100 languages from across the world, customized to your AI requirement.

Speech Data Transcription

Cost-effective audio transcription or audio annotation through a strong workforce of 30,000 collaborators, with guaranteed turnaround time (TAT), accuracy, and savings.

Accelerate your Conversational AI app development with Audio Collection & Audio Annotation Services

The Shaip Advantage

Scale

We can source, scale, and deliver audio data from across the world in multiple languages and dialects based on your requirements.

Expertise

We have the expertise needed for accurate and unbiased data collection, transcription, and gold-standard annotation.

Network

A network of 30,000+ qualified contributors who can be quickly assigned data collection tasks to build AI training models and scale up services.

Technology

We have a fully AI-based platform with proprietary tools and processes for managing workflows around the clock (24/7).

Agility

We adapt quickly to changes in customer requirements and help accelerate AI development with quality speech data, 5-10x faster than the competition.

Security

We give the utmost importance to data security and privacy and are certified to handle highly regulated, sensitive data.

What We Do Best

Training Data

Get the highest quality labeled data in a fraction of the time. It’s gold-standard, reliable and ready to train your AI and ML models to attain the highest levels of performance.

Data Collection, Labeling & Annotation

With Shaip you get 15+ years of proven expertise in collecting, transcribing, and annotating quality data. With our global labor force, we can collect data from across the globe, then provide labeling and annotation services with the skill level and expertise your data requires.

Data Catalogs & Licensing

With our vast inventory of millions of datasets, you can collect and organize data as required. We can then license that quality data for your specific AI and ML requirements. Plus, this data is available at a fraction of what it would cost to create it yourself.

Creating clinical NLP is a critical task that requires tremendous domain expertise to solve. I can clearly see that you are several years ahead of Google in this area. I want to work with you and scale you.

Google, Inc. Director

My engineering team worked with Shaip’s team for 2+ years during the development of healthcare speech APIs. We have been impressed with their work done in healthcare-specific NLP and what they are able to achieve with complex datasets.

Google, Inc. Head of Engineering
