FastSpeech

FastSpeech is a neural text-to-speech (TTS) model designed to accelerate speech synthesis by removing autoregressive dependencies. As a non-autoregressive model, it generates its output frames in parallel rather than one frame at a time, which makes inference considerably faster than in earlier autoregressive systems.

The core architecture uses a Transformer-based encoder to convert input text, typically represented as phonemes, into hidden representations.

FastSpeech’s design emphasizes efficiency and robustness, delivering high-quality speech with significantly faster-than-real-time generation on standard hardware.
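
Parallel generation requires the output length to be known before any frame is produced. In the published FastSpeech design this is handled by a length regulator, which repeats each phoneme's hidden vector for a predicted number of frames. The following is a minimal sketch of that expansion step in plain Python, not the authors' code: the function name `length_regulate`, the toy vectors, the durations, and the rounding rule are all illustrative assumptions.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Expand phoneme-level vectors to frame level.

    Each vector is repeated for its duration, scaled by `alpha`
    (a hypothetical speed-control factor: larger means slower speech).
    """
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        # Scale and round the duration; never repeat a negative number of times.
        n = max(0, int(round(dur * alpha)))
        frames.extend(list(vec) for _ in range(n))
    return frames


# Toy example: three phonemes with 2-dimensional hidden vectors.
phoneme_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # assumed frame counts per phoneme

frames = length_regulate(phoneme_hidden, durations)
print(len(frames))  # 2 + 1 + 3 = 6 frames
```

Because the total frame count is fixed by the durations, a decoder can process all frames at once instead of waiting for each previous frame. Explicit durations also mean every phoneme is assigned a definite span of frames, which is one reason this style of model is less prone to skipped or repeated words.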

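Variants and extensions: FastSpeech 2, introduced to improve naturalness and expressiveness, extends the original approach by predicting pitch and energy in addition to duration and by training directly on ground-truth mel-spectrograms rather than on the output of a teacher model.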

Applications and impact: FastSpeech has influenced subsequent non-autoregressive TTS research and is used in real-time or
