Definition
Speech-to-text (STT) is the automatic conversion of spoken language into written text, typically using machine-learning models. It is closely related to automatic speech recognition (ASR), and the two terms are often used interchangeably.
Purpose
STT makes spoken content accessible and searchable. It is widely used in transcription, accessibility tools, and digital assistants.
Importance
- Supports accessibility for users who are deaf or hard of hearing.
- Provides transcripts for meetings and lectures.
- Accuracy varies with speaker accent and background noise.
- Underpins most voice-driven applications, such as voice assistants and dictation.
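The accuracy point in the list above is commonly quantified as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the STT output into a reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, assuming whitespace tokenization and case-insensitive matching; the function name `word_error_rate` is illustrative and not taken from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumptions (not from the source text): words are split on
    whitespace and compared case-insensitively; punctuation is kept.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference "turn on the light" with the output "turn off the lights" gives two substitutions over four reference words. Running the same comparison on recordings with different accents or noise levels is one way to observe how those conditions affect accuracy.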
How It Works
- Capture audio input.
- Preprocess and normalize the audio signal.
- Apply ASR models to recognize words.
- Output the text transcription.
- Review or correct the transcript with human oversight if needed.
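The steps above can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the design of any specific product: audio is assumed to arrive as a sequence of float samples, the recognizer is a hypothetical callable that returns a `(text, confidence)` pair, and the `review_threshold` value of 0.8 is an arbitrary example. `fake_recognizer` is a stand-in for a real ASR model or cloud API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical recognizer interface: normalized samples in,
# (transcript text, confidence between 0 and 1) out.
Recognizer = Callable[[Sequence[float]], Tuple[str, float]]


@dataclass
class Transcript:
    text: str
    confidence: float
    needs_review: bool  # True when a human should check the result


def normalize(samples: Sequence[float]) -> List[float]:
    """Remove the DC offset and scale so the largest magnitude is 1.0."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    peak = max(abs(s) for s in centered)
    if peak == 0:
        return centered  # constant input (silence): nothing to scale
    return [s / peak for s in centered]


def transcribe(
    samples: Sequence[float],
    recognizer: Recognizer,
    review_threshold: float = 0.8,  # assumed cutoff, tune per application
) -> Transcript:
    audio = normalize(samples)             # preprocess and normalize
    text, confidence = recognizer(audio)   # apply the ASR model
    return Transcript(
        text=text.strip(),                 # output the transcription
        confidence=confidence,
        needs_review=confidence < review_threshold,  # flag for human review
    )


def fake_recognizer(audio: Sequence[float]) -> Tuple[str, float]:
    """Stand-in for a real model; always returns the same low-confidence text."""
    return ("hello world", 0.65)


if __name__ == "__main__":
    captured = [0.0, 0.5, -0.25, 0.1]  # placeholder for captured audio
    print(transcribe(captured, fake_recognizer))
```

Keeping the recognizer behind a plain callable means a real model or service can be substituted later without changing the preprocessing or review logic.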
Examples (Real World)
- Google Cloud Speech-to-Text API.
- Microsoft Azure Speech Services.
- Otter.ai meeting transcription.
References / Further Reading
- Automatic Speech Recognition — NIST.
- ISO/IEC 15938-4: Multimedia content description interface, Part 4: Audio.
- Jurafsky & Martin. Speech and Language Processing.