FastSpeech
FastSpeech is a neural text-to-speech (TTS) model that accelerates speech synthesis by removing autoregressive dependencies. As a non-autoregressive model, it generates mel-spectrogram frames in parallel rather than one frame at a time, which makes inference considerably faster than in earlier autoregressive systems.
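Parallel generation requires the number of output frames to be known before decoding begins. In the published FastSpeech design this is handled by a length regulator, which repeats each phoneme-level hidden state according to a predicted duration. The following is a minimal sketch of that idea in plain Python, not the reference implementation: the function name `length_regulate`, the string stand-ins for hidden vectors, and the duration values are all illustrative assumptions.

```python
def length_regulate(phoneme_states, durations):
    """Expand phoneme-level states to frame level.

    Each state is repeated durations[i] times, so the total number of
    output frames is fixed up front and every frame can then be decoded
    in parallel. Names and values here are hypothetical.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([state] * duration)
    return frames


# Strings stand in for hidden vectors; durations are invented for illustration.
states = ["HH", "AH", "L", "OW"]
durations = [2, 3, 1, 4]
frames = length_regulate(states, durations)
print(len(frames))
print(frames)
```

Because the expanded sequence has a fixed length (the sum of the durations), a decoder can process every position at once instead of waiting for the previous frame. In a tensor library the same expansion is usually done with a single repeat operation rather than a Python loop.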
The core architecture uses a Transformer-based encoder to convert input text, typically represented as phonemes, into
FastSpeech’s design emphasizes efficiency and robustness, delivering high-quality speech with significantly faster-than-real-time generation on standard hardware
Variants and extensions: FastSpeech 2, introduced to improve naturalness and expressiveness, extends the original approach by
Applications and impact: FastSpeech has influenced subsequent non-autoregressive TTS research and is used in real-time or