WaveGlow
WaveGlow is a flow-based generative model for speech synthesis developed by NVIDIA. Introduced in 2018, it functions as a neural vocoder that converts mel-spectrograms into time-domain audio waveforms. The model combines ideas from Glow, a flow-based framework using invertible 1x1 convolutions and affine coupling layers, with concepts from WaveNet to model complex audio distributions. WaveGlow enables high-quality speech synthesis without autoregressive sampling during generation.
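The two building blocks named above, the invertible 1x1 convolution and the affine coupling layer, can be illustrated with a minimal NumPy sketch. This is not NVIDIA's implementation: the tensor sizes, the function names, and in particular toy_conditioner (a small stand-in for the WaveNet-like network that predicts the scale and shift) are assumptions chosen for illustration. The sketch only shows that each step can be inverted exactly and that it yields a log-determinant term for the likelihood.

```python
import numpy as np


def conv1x1_forward(x, W):
    """Mix channels with an invertible matrix W. x has shape (channels, time)."""
    y = W @ x
    # The same W is applied at every time step, so log|det W| counts once per step.
    logdet = x.shape[1] * np.linalg.slogdet(W)[1]
    return y, logdet


def conv1x1_inverse(y, W):
    """Undo the channel mixing by solving W x = y."""
    return np.linalg.solve(W, y)


def toy_conditioner(x_a, mel, params):
    """Hypothetical stand-in for the WaveNet-like network (assumption).

    Maps the untouched half of the channels plus the mel-spectrogram
    to a log-scale and a shift for the other half.
    """
    h = np.tanh(params["Wx"] @ x_a + params["Wm"] @ mel)
    return params["Ws"] @ h, params["Wt"] @ h


def coupling_forward(x, mel, params):
    """Affine coupling: the first half passes through, the second half is scaled and shifted."""
    half = x.shape[0] // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_conditioner(x_a, mel, params)
    y_b = np.exp(log_s) * x_b + t
    return np.concatenate([x_a, y_b]), log_s.sum()


def coupling_inverse(y, mel, params):
    """Invert the coupling: the first half is unchanged, so log_s and t can be recomputed."""
    half = y.shape[0] // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_conditioner(y_a, mel, params)
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b])


def round_trip_demo(seed=0, n_ch=8, n_mel=4, hidden=16, n_steps=16):
    """Run one flow step forward and back; return (max reconstruction error, log-determinant)."""
    rng = np.random.default_rng(seed)
    # Orthogonal initial weight, obtained here from a QR decomposition.
    W, _ = np.linalg.qr(rng.normal(size=(n_ch, n_ch)))
    half = n_ch // 2
    params = {
        "Wx": 0.1 * rng.normal(size=(hidden, half)),
        "Wm": 0.1 * rng.normal(size=(hidden, n_mel)),
        "Ws": 0.1 * rng.normal(size=(half, hidden)),
        "Wt": 0.1 * rng.normal(size=(half, hidden)),
    }
    x = rng.normal(size=(n_ch, n_steps))      # toy "audio" channels
    mel = rng.normal(size=(n_mel, n_steps))   # toy conditioning features

    y, ld_conv = conv1x1_forward(x, W)
    z, ld_coup = coupling_forward(y, mel, params)

    x_rec = conv1x1_inverse(coupling_inverse(z, mel, params), W)
    return float(np.max(np.abs(x_rec - x))), float(ld_conv + ld_coup)


if __name__ == "__main__":
    err, logdet = round_trip_demo()
    print("max reconstruction error:", err)
    print("log-determinant of the step:", logdet)
```

Because the coupling layer leaves half of the channels untouched, the scale and shift can be recomputed from the output alone, which is what makes the step invertible without ever inverting the conditioning network. Generation runs these inverses on a random latent in a single pass, so no sample-by-sample autoregressive loop is needed.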
Technically, WaveGlow uses a sequence of invertible 1x1 convolutions to permute data channels and a stack of affine coupling layers.
In a text-to-speech pipeline, WaveGlow serves as the vocoder component that converts predicted mel-spectrograms into a waveform.
Limitations include substantial computational and memory requirements for training, as well as potential artifacts or noise in the generated audio.