In a country as culturally diverse and linguistically rich as India, building inclusive AI begins with collecting representative, high-quality datasets. That’s the vision behind Project Vaani—a large-scale, open-source initiative led by ARTPARK, IISc Bengaluru, and Google, aiming to give voice to every Indian language and dialect.
The ambitious goal? To collect 150,000+ hours of speech and 15,000+ hours of transcriptions from 1 million people across 773 districts of India.
As one of the key vendors for this national mission, Shaip played a pivotal role in curating spontaneous speech data, transcription, and metadata collection—laying the groundwork for equitable voice technologies that truly represent the real India.
The Vision Behind Project Vaani
Project Vaani is designed to bridge the AI inclusion gap by creating the largest multimodal, multilingual, open-source dataset in India. This data is foundational for developing accurate speech recognition, translation, and generative AI systems in native Indian languages—many of which are underrepresented in global tech ecosystems.
The long-term vision is to power impactful applications in:
- Healthcare – Voice-based telemedicine
- Education – Vernacular learning platforms
- Governance – Conversational interfaces for citizen services
- Accessibility – Voice tools for differently-abled users
- Disaster response – Real-time communication in local dialects
Shaip’s Role in Project Vaani
Shaip was entrusted with the collection of 8,000 hours of spontaneous speech and 800 hours of manually verified transcriptions. Our responsibility spanned speaker onboarding, audio capture, metadata tagging, transcription coordination, and quality control.
Our deliverables at a glance:
- 8,000 hours of spontaneous audio data
- Recordings from 400+ native speakers per district, representing diverse age groups, genders, and dialects
- 80 districts covered
- Image-based prompting to ensure natural, contextual speech
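Metadata tagging and speaker verification ran through every one of these deliverables. As a hedged illustration, assuming a simple per-recording schema (all field names and sample values below are hypothetical), the metadata attributes described in this case study might be modeled like this:

```python
from dataclasses import dataclass

# Hypothetical schema: field names are illustrative, but they mirror the
# attributes described in this case study (district, dialect, age group,
# gender, duration, and speaker verification).
@dataclass
class RecordingMetadata:
    recording_id: str
    speaker_id: str          # linked to a verified speaker profile
    district: str            # one of the 80 covered districts
    language: str
    dialect: str
    age_group: str           # e.g. "18-30"
    gender: str
    duration_seconds: float
    speaker_verified: bool   # QC checks speaker and location alignment

# Example record with made-up values:
sample = RecordingMetadata(
    recording_id="rec_000123", speaker_id="spk_04567",
    district="Patna", language="Hindi", dialect="Magahi",
    age_group="18-30", gender="female",
    duration_seconds=312.5, speaker_verified=True,
)
```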
Here’s what made our approach unique:
District-Level Diversity
We sourced recordings from 80 districts spread across states including Bihar, Uttar Pradesh, Karnataka, West Bengal, and Maharashtra. Each district contributed 100 hours of audio data, ensuring regional balance across the dataset.
Linguistic & Demographic Representation
We engaged 32,000+ verified native speakers, 400+ per district, spanning diverse age groups, genders, and dialects. This ensured representation of regional accents and dialects often overlooked in mainstream AI datasets.
Image-Prompted Speech
To elicit natural, spontaneous speech, participants were shown 45–90 images per session, ranging from cultural symbols to everyday objects, and asked to describe them in their native language. This ensured recordings reflected real-world, contextual speech, which is essential for training advanced NLP systems.
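For a concrete picture of the prompting step, here is a minimal sketch of how a session's image set could be sampled. Only the 45–90 images-per-session range comes from the project; the pool, image IDs, and sampling logic are assumptions.

```python
import random

def sample_session_prompts(image_pool: list[str], rng: random.Random) -> list[str]:
    """Draw a session's worth of image prompts without repeats."""
    session_size = rng.randint(45, 90)               # 45-90 images per session
    session_size = min(session_size, len(image_pool))
    return rng.sample(image_pool, session_size)

# Placeholder pool of image IDs; a real pool would mix cultural symbols
# and everyday objects, as described above.
pool = [f"img_{i:04d}.jpg" for i in range(500)]
prompts = sample_session_prompts(pool, random.Random(7))
print(len(prompts), prompts[:3])
```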
High-Quality Transcription Standards
Only 10% of the speech data was transcribed, amounting to 800 hours. Transcriptions were performed by local linguists based within a 20–50 km radius of the speaker, ensuring familiarity with the area's dialects and nuances. A second-layer check ensured a word error rate (WER) below 5%.
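To make the WER target concrete, here is a standard word-level WER computation based on edit distance. This is a generic sketch, not Project Vaani's actual QA tooling.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of four -> 25% WER; a file passes only below 5%.
assert wer("the cat sat here", "the dog sat here") == 0.25
```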
Strict Quality Assurance
Audio data had to meet a high bar: recordings were made in quiet, echo-free environments, free of background noise, phone vibrations, and distortions. Files underwent rigorous review against guidelines for speech clarity, noise levels, metadata accuracy, and speaker verification. Metadata tagging had to be accurate across all files, and every recording was checked for speaker and location alignment.
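As a hedged sketch of what automated pre-checks ahead of human review could look like, the snippet below flags clipping (a distortion indicator) and a high noise floor. The thresholds, frame size, and input format are illustrative assumptions, not the project's actual QC criteria.

```python
import wave
import numpy as np

def quick_audio_checks(path: str) -> dict:
    """Rough pre-screen for distortion and background noise (assumes 16-bit mono WAV)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    samples = pcm.astype(np.float32) / 32768.0

    # Distortion: fraction of samples at or near full scale.
    clipped_fraction = float(np.mean(np.abs(samples) > 0.99))

    # Noise floor: RMS of the quietest 10% of 50 ms frames approximates
    # background noise between utterances.
    frame = rate // 20  # 50 ms worth of samples
    usable = samples[: len(samples) - len(samples) % frame]
    rms = np.sqrt(np.mean(usable.reshape(-1, frame) ** 2, axis=1))
    noise_floor = float(np.percentile(rms, 10))

    return {
        "clipping_ok": clipped_fraction < 0.001,  # illustrative threshold
        "quiet_background": noise_floor < 0.01,   # illustrative threshold
    }
```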
Challenges We Solved
- Remote logistics – Managing teams across 80 districts
- Speaker diversity – Onboarding 32,000+ verified speakers in remote locations
- Cultural sensitivity – Respecting local customs and dialects
- Data integrity – Meeting quality and compliance standards
- Quality control – maintaining consistent standards across multiple linguistic and cultural contexts
Our success came down to meticulous planning, technology-driven validation, and partnerships with local teams who understood the cultural nuances of each region.
Impact and Applications
Shaip’s contribution has not only accelerated the progress of Project Vaani but also set the foundation for inclusive AI in India. The curated speech dataset is already being used to build and fine-tune AI models for:
- Vernacular voice assistants
- Regional translation engines
- Accessible communication tools for the visually impaired
- AI-driven edtech platforms for rural students
- Rural telemedicine
- Voice-based citizen services
- Real-time translation and transcription
Conclusion
Project Vaani is a bold step toward inclusive, accessible AI, and Shaip is honored to play a foundational role in it. With over 8,000 hours of speech collected and 800 hours transcribed, our work on one of India's most visionary digital inclusion projects reaffirms our commitment to building ethical, inclusive AI systems rooted in diversity and representation.
As Project Vaani continues toward its larger goal of 150,000+ hours of data, we stand ready to support the next frontier of AI innovation that speaks to—and for—every Indian.
Want to partner with us to build AI that understands the real world? Visit www.shaip.com.