Text-to-Speech (TTS)

Definition

Text-to-Speech (TTS) is technology that converts written text into synthesized speech. Modern systems typically rely on neural network models.

Purpose

TTS provides natural-sounding voice output for accessibility tools, virtual assistants, and media applications.

Importance

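  • Critical for accessibility, for example in screen readers for visually impaired users.
  • Widely used in digital assistants and interactive voice response (IVR) systems.
  • Carries a risk of misuse: synthetic voices can be used for impersonation and fraud.
  • Perceived quality depends on prosody (rhythm, stress, and intonation) and overall naturalness.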

How It Works

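  1. Input text is normalized: numbers, abbreviations, and symbols are expanded into words.
  2. The normalized text is converted into phonemes (grapheme-to-phoneme conversion).
  3. An acoustic model predicts speech features, such as mel spectrograms, from the phonemes.
  4. A vocoder synthesizes the audio waveform from those features.
  5. The output audio is delivered to the user.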
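
The five steps can be illustrated with a deliberately simplified sketch. It is not a real TTS engine. The tiny lexicon, the pitch and duration values, and every function name here are hypothetical stand-ins chosen for illustration. A lookup table stands in for the acoustic model and a sine-tone generator stands in for the vocoder, so the output is a sequence of beeps rather than intelligible speech.

```python
import math
import re
import struct
import wave

SAMPLE_RATE = 16000  # samples per second (assumed value for this toy example)

# Hypothetical mini-lexicon: word -> ARPAbet-like phoneme symbols.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "two": ["T", "UW"],
}

# Hypothetical pitch (Hz) per vowel; consonants get a default value below.
VOWEL_PITCH_HZ = {"AH": 220.0, "OW": 196.0, "ER": 208.0, "UW": 185.0}


def normalize(text):
    """Step 1: lowercase, expand the digit 2 (toy rule), strip punctuation."""
    text = re.sub(r"\b2\b", "two", text.lower())
    return re.sub(r"[^a-z\s]", "", text).split()


def to_phonemes(words):
    """Step 2: lexicon lookup; unknown words fall back to their letters."""
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON.get(word, [ch.upper() for ch in word]))
    return phonemes


def acoustic_features(phonemes):
    """Step 3: stand-in acoustic model giving (pitch_hz, duration_ms) per phoneme."""
    return [
        (VOWEL_PITCH_HZ.get(p, 120.0), 120 if p in VOWEL_PITCH_HZ else 60)
        for p in phonemes
    ]


def vocode(features):
    """Step 4: stand-in vocoder that renders each feature pair as a sine tone."""
    samples = []
    for pitch_hz, duration_ms in features:
        for i in range(SAMPLE_RATE * duration_ms // 1000):
            samples.append(0.3 * math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE))
    return samples


def write_wav(path, samples):
    """Step 5: deliver the audio as a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))


if __name__ == "__main__":
    feats = acoustic_features(to_phonemes(normalize("Hello, World 2!")))
    write_wav("out.wav", vocode(feats))
```

In a production system each stand-in would be replaced by a trained component, but the data flow from text to phonemes to features to waveform stays the same.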

Examples (Real World)

  • Google Cloud Text-to-Speech: cloud API that generates natural-sounding voices for apps.
  • Amazon Polly: cloud service that converts text into speech.
  • Apple Siri: voice assistant that speaks its responses using TTS.
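
As one illustration of how such a service is typically called, the sketch below requests audio from Amazon Polly through the boto3 SDK. It assumes boto3 is installed and that AWS credentials and a region are already configured. The helper names, the default voice "Joanna", and the output path are choices made for this example, not recommendations from the services listed above.

```python
def build_polly_request(text, voice_id="Joanna", output_format="mp3"):
    """Assemble keyword arguments for Polly's SynthesizeSpeech operation."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def synthesize_to_file(text, path="speech.mp3"):
    """Call Amazon Polly and save the returned audio stream to a file.

    Assumes boto3 is installed and AWS credentials and a region are configured.
    """
    import boto3  # imported lazily so build_polly_request works without boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(text))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return path
```

Keeping the request assembly separate from the network call makes the parameters easy to inspect or unit-test without contacting the service.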

References / Further Reading

  • Tacotron 2: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Shen et al., Google, 2018).
  • ISO/IEC 15938-4: Multimedia content description interface, Part 4: Audio (MPEG-7 Audio).
  • IEEE Signal Processing Magazine: TTS Systems.