Building Inclusive AI for India: Shaip’s Role in Project Vaani

In a country as culturally diverse and linguistically rich as India, building inclusive AI begins with collecting representative, high-quality datasets. That’s the vision behind Project Vaani—a large-scale, open-source initiative led by ARTPARK, IISc Bengaluru, and Google, aiming to give voice to every Indian language and dialect.

The ambitious goal? To collect 150,000+ hours of speech and 15,000+ hours of transcriptions from 1 million people across 773 districts of India.

As one of the key vendors for this national mission, Shaip played a pivotal role in curating spontaneous speech data, transcription, and metadata collection—laying the groundwork for equitable voice technologies that truly represent the real India.

The Vision Behind Project Vaani

Project Vaani is designed to bridge the AI inclusion gap by creating the largest multimodal, multilingual, open-source dataset in India. This data is foundational for developing accurate speech recognition, translation, and generative AI systems in native Indian languages—many of which are underrepresented in global tech ecosystems.

The long-term vision is to power impactful applications in education, healthcare, accessibility, and voice-based citizen services delivered in native Indian languages.

Shaip’s Role in Project Vaani

Shaip was entrusted with the collection of 8,000 hours of spontaneous speech and 800 hours of manually verified transcriptions. Our responsibility spanned speaker onboarding, audio capture, metadata tagging, transcription coordination, and quality control. Key deliverables included:

  • 8,000 hours of spontaneous audio data
  • 800 hours of high-quality manual transcriptions
  • Recordings from 400+ native speakers per district, representing diverse age groups, genders, and dialects
  • 80 districts covered
  • Image-based prompting to ensure natural, contextual speech
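To make the metadata tagging deliverable concrete, here is a minimal sketch of what a per-recording metadata record and its validation could look like. The field names, categories, and allowed values are our illustrative assumptions, not the actual Project Vaani schema:

```python
from dataclasses import dataclass

# Hypothetical metadata record for one recording; field names and
# allowed values are illustrative, not the actual Project Vaani schema.
@dataclass
class RecordingMetadata:
    recording_id: str
    speaker_id: str
    district: str        # e.g. "Patna"
    state: str           # e.g. "Bihar"
    language: str        # e.g. "Magahi"
    age_group: str       # e.g. "18-30"
    gender: str          # e.g. "female"
    duration_sec: float

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means OK."""
        problems = []
        if not self.recording_id or not self.speaker_id:
            problems.append("missing recording or speaker id")
        if self.age_group not in {"<18", "18-30", "31-50", "51+"}:
            problems.append(f"unknown age group: {self.age_group}")
        if self.duration_sec <= 0:
            problems.append("non-positive duration")
        return problems

meta = RecordingMetadata("rec_0001", "spk_042", "Patna", "Bihar",
                         "Magahi", "18-30", "female", 312.5)
assert meta.validate() == []
```

Validating records at capture time, rather than after delivery, is what makes downstream checks like speaker and location alignment tractable at this scale.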

Here’s what made our approach unique:

District-Level Diversity

We sourced recordings from 80 districts spread across states such as Bihar, Uttar Pradesh, Karnataka, West Bengal, and Maharashtra. Each district contributed roughly 100 hours of audio, which together account for the full 8,000-hour target and keep the dataset regionally balanced.

Linguistic & Demographic Representation

Within each district, we engaged 400+ native speakers spanning diverse age groups, genders, and dialects, capturing regional accents and speech patterns that are often overlooked in mainstream AI datasets.

Image-Prompted Speech

To elicit natural, spontaneous speech, participants were shown 45–90 images per session and asked to describe them in their native language. Prompts ranged from cultural symbols to everyday objects, so the recordings captured real-world, contextual vocabulary that is essential for training robust NLP systems.
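As a rough illustration of how such a session could be assembled, the sketch below samples a balanced mix of 45–90 prompt images. The category names, file names, and pool sizes are assumptions for illustration, not the project's actual prompt set:

```python
import random

# Illustrative prompt pool; the categories and filenames here are
# assumptions, not Project Vaani's actual prompt inventory.
PROMPT_POOL = {
    "cultural_symbols": [f"culture_{i:03d}.jpg" for i in range(200)],
    "everyday_objects": [f"object_{i:03d}.jpg" for i in range(200)],
    "scenes": [f"scene_{i:03d}.jpg" for i in range(200)],
}

def build_session(rng: random.Random,
                  min_images: int = 45, max_images: int = 90) -> list[str]:
    """Sample a roughly even mix of images across categories for one session."""
    n = rng.randint(min_images, max_images)
    per_cat = n // len(PROMPT_POOL)  # total may round down slightly
    session = []
    for images in PROMPT_POOL.values():
        session.extend(rng.sample(images, per_cat))
    rng.shuffle(session)  # avoid presenting categories in blocks
    return session

images = build_session(random.Random(7))
print(len(images), images[:3])
```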

High-Quality Transcription Standards

Only 10% of the speech data, amounting to 800 hours, was transcribed. Transcriptions were performed by local linguists based within 20–50 km of the speaker, ensuring familiarity with local dialects and nuances. A second layer of review kept the word error rate (WER) below 5%.
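For context, WER is the word-level edit distance between a hypothesis transcript and a verified reference, divided by the number of reference words. Below is a minimal sketch of such a check; the 5% threshold comes from the guidelines above, while the sample sentences are invented for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A transcript passes the second-layer check only if its WER is under 5%.
wer = word_error_rate("mera gaon nadi ke paas hai", "mera gaon nadi ke pas hai")
print(f"WER = {wer:.1%}, pass = {wer < 0.05}")
```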

Strict Quality Assurance

Audio data had to meet a high bar: recordings were made in quiet, echo-free environments, free of background noise, phone vibrations, and distortions. Every file underwent rigorous review against guidelines for speech clarity, noise levels, and metadata accuracy, and each recording was verified against its claimed speaker and location.
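As an illustration of what a first-pass automated audio check might look like, the sketch below estimates a recording's noise floor and flags clipping. It assumes 16-bit mono WAV input, and the thresholds are illustrative defaults rather than Shaip's actual acceptance criteria:

```python
import wave
import numpy as np

def check_audio(path: str, max_noise_floor_db: float = -50.0,
                max_clip_fraction: float = 0.001) -> list[str]:
    """Flag recordings that look noisy or clipped; empty list means pass."""
    # Assumes 16-bit mono WAV; thresholds are illustrative, not Shaip's.
    with wave.open(path, "rb") as w:
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    if len(samples) < 400:
        return ["file too short to assess"]
    x = samples.astype(np.float64) / 32768.0
    problems = []
    # Estimate the noise floor from the quietest 10% of 400-sample frames.
    frames = x[: len(x) - len(x) % 400].reshape(-1, 400)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12
    floor_db = 20 * np.log10(np.quantile(rms, 0.10))
    if floor_db > max_noise_floor_db:
        problems.append(f"noise floor too high: {floor_db:.1f} dBFS")
    # Clipping: too many samples at or near full scale.
    if (np.abs(x) > 0.999).mean() > max_clip_fraction:
        problems.append("clipping detected")
    return problems
```

Automated screening of this kind would only be a first pass; borderline files would still need the human review described above.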

Challenges We Solved

Meeting these targets at scale came down to meticulous planning, technology-driven validation, and partnerships with local teams who understood the cultural nuances of each region.

Impact and Applications

Shaip’s contribution has not only accelerated the progress of Project Vaani but also set the foundation for inclusive AI in India. The curated speech dataset is already being used to build and fine-tune AI models for:

  • Vernacular voice assistants
  • Regional translation engines
  • Accessible communication tools for the visually impaired
  • AI-driven edtech platforms for rural students
  • Rural telemedicine
  • Voice-based citizen services
  • Real-time translation and transcription

Conclusion

Project Vaani is a bold step toward inclusive, accessible AI, and Shaip is honored to have played a foundational role. With over 8,000 hours of speech collected and 800 hours transcribed, our contribution reaffirms our commitment to building ethical, inclusive AI systems rooted in diversity and representation, as part of one of India's most visionary digital inclusion projects.

As Project Vaani continues toward its larger goal of 150,000+ hours of data, we stand ready to support the next frontier of AI innovation that speaks to—and for—every Indian.

Want to partner with us to build AI that understands the real world? www.shaip.com
