VITS
VITS, short for Variational Inference with adversarial learning for end-to-end Text-to-Speech, is a neural text-to-speech framework designed to generate natural-sounding speech in a fully end-to-end fashion. It simplifies the traditional two-stage TTS pipeline (an acoustic model followed by a separate vocoder) by learning a direct mapping from text to raw waveform within a single model, using variational inference and adversarial training to improve realism and training stability.
The core idea of VITS is to model the acoustic front end with a variational latent representation: a posterior encoder infers latent variables from the target audio, a text-conditioned prior (refined by a normalizing flow) constrains those latents, and a decoder generates the waveform from them. Alignment between text and acoustic frames is learned during training with monotonic alignment search, a stochastic duration predictor captures variation in speaking rhythm, and a waveform discriminator provides the adversarial signal that sharpens audio quality.
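To make the variational formulation concrete, the sketch below shows a toy ELBO-style objective in PyTorch: a posterior encoder over acoustic frames, a text-conditioned prior, and a decoder, trained with a reconstruction term plus a KL term. All module and variable names here are illustrative assumptions, and the actual model additionally uses normalizing flows, learned alignment, and adversarial and feature-matching losses.

    # Toy sketch of a VITS-style variational objective (illustrative, not the reference code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyVITS(nn.Module):
        def __init__(self, text_dim=64, spec_dim=80, latent_dim=32):
            super().__init__()
            # Posterior encoder: maps target acoustic frames to latent mean / log-variance.
            self.posterior = nn.Linear(spec_dim, 2 * latent_dim)
            # Prior network: predicts latent mean / log-variance from text features.
            self.prior = nn.Linear(text_dim, 2 * latent_dim)
            # Decoder: reconstructs acoustic frames from sampled latents
            # (the real decoder generates the raw waveform).
            self.decoder = nn.Linear(latent_dim, spec_dim)

        def forward(self, text_feats, spec_frames):
            q_mu, q_logvar = self.posterior(spec_frames).chunk(2, dim=-1)
            p_mu, p_logvar = self.prior(text_feats).chunk(2, dim=-1)
            # Reparameterization trick: sample latents from the posterior.
            z = q_mu + torch.randn_like(q_mu) * torch.exp(0.5 * q_logvar)
            recon = self.decoder(z)
            # Reconstruction term (the full model uses a mel-spectrogram L1 loss on generated audio).
            recon_loss = F.l1_loss(recon, spec_frames)
            # KL divergence between the posterior q(z|audio) and the text-conditioned prior p(z|text).
            kl = 0.5 * (p_logvar - q_logvar
                        + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1).mean()
            return recon_loss + kl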
VITS supports multi-speaker synthesis and can be conditioned on a speaker identity or learned speaker embedding to control which voice is produced.
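A minimal sketch of how such speaker conditioning is commonly wired, assuming a learned embedding table indexed by speaker ID that is broadcast into the model's conditioning features (names and dimensions are hypothetical, not the reference implementation):

    import torch
    import torch.nn as nn

    class SpeakerConditioning(nn.Module):
        def __init__(self, num_speakers=100, speaker_dim=256, feat_dim=192):
            super().__init__()
            # One learned vector per speaker ID.
            self.speaker_table = nn.Embedding(num_speakers, speaker_dim)
            self.proj = nn.Linear(speaker_dim, feat_dim)

        def forward(self, text_feats, speaker_id):
            # text_feats: (batch, time, feat_dim); speaker_id: (batch,)
            g = self.proj(self.speaker_table(speaker_id))  # (batch, feat_dim)
            return text_feats + g.unsqueeze(1)             # broadcast over time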
Limitations include the need for substantial training data and computational resources, as well as potential sensitivity to text normalization, phonemization, and alignment quality, which can cause mispronunciations or unstable prosody on out-of-domain input.
See also: text-to-speech, variational autoencoder, normalizing flow, neural vocoder.