VITS
VITS, short for Variational Inference with adversarial learning for end-to-end Text-to-Speech, is a neural text-to-speech framework designed to generate natural-sounding speech in a fully end-to-end fashion. It simplifies the traditional two-stage TTS pipeline (an acoustic model followed by a separate vocoder) by learning a direct mapping from text to raw waveform within a single model, using variational inference and adversarial training to improve realism and training stability.
The core idea of VITS is to model the acoustic front end with a variational latent representation: a posterior encoder infers latent variables from the target audio, a text-conditioned prior (refined by a normalizing flow) constrains those latents, and a decoder generates the waveform from them. Alignment between text and acoustic frames is learned during training with monotonic alignment search, a stochastic duration predictor captures variation in speaking rhythm, and a waveform discriminator provides the adversarial signal that sharpens audio quality.
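To make the variational formulation concrete, the sketch below shows a toy ELBO-style objective in PyTorch: a posterior encoder over acoustic frames, a text-conditioned prior, and a decoder, trained with a reconstruction term plus a KL term. All module and variable names here are illustrative assumptions, and the actual model additionally uses normalizing flows, learned alignment, and adversarial and feature-matching losses.

    # Toy sketch of a VITS-style variational objective (illustrative, not the reference code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyVITS(nn.Module):
        def __init__(self, text_dim=64, spec_dim=80, latent_dim=32):
            super().__init__()
            # Posterior encoder: maps target acoustic frames to latent mean / log-variance.
            self.posterior = nn.Linear(spec_dim, 2 * latent_dim)
            # Prior network: predicts latent mean / log-variance from text features.
            self.prior = nn.Linear(text_dim, 2 * latent_dim)
            # Decoder: reconstructs acoustic frames from sampled latents
            # (the real decoder generates the raw waveform).
            self.decoder = nn.Linear(latent_dim, spec_dim)

        def forward(self, text_feats, spec_frames):
            q_mu, q_logvar = self.posterior(spec_frames).chunk(2, dim=-1)
            p_mu, p_logvar = self.prior(text_feats).chunk(2, dim=-1)
            # Reparameterization trick: sample latents from the posterior.
            z = q_mu + torch.randn_like(q_mu) * torch.exp(0.5 * q_logvar)
            recon = self.decoder(z)
            # Reconstruction term (the full model uses a mel-spectrogram L1 loss on generated audio).
            recon_loss = F.l1_loss(recon, spec_frames)
            # KL divergence between the posterior q(z|audio) and the text-conditioned prior p(z|text).
            kl = 0.5 * (p_logvar - q_logvar
                        + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1).mean()
            return recon_loss + kl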
VITS supports multi-speaker synthesis and can be conditioned on a speaker identity or learned speaker embedding to control which voice is produced.
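A minimal sketch of how such speaker conditioning is commonly wired, assuming a learned embedding table indexed by speaker ID that is broadcast into the model's conditioning features (names and dimensions are hypothetical, not the reference implementation):

    import torch
    import torch.nn as nn

    class SpeakerConditioning(nn.Module):
        def __init__(self, num_speakers=100, speaker_dim=256, feat_dim=192):
            super().__init__()
            # One learned vector per speaker ID.
            self.speaker_table = nn.Embedding(num_speakers, speaker_dim)
            self.proj = nn.Linear(speaker_dim, feat_dim)

        def forward(self, text_feats, speaker_id):
            # text_feats: (batch, time, feat_dim); speaker_id: (batch,)
            g = self.proj(self.speaker_table(speaker_id))  # (batch, feat_dim)
            return text_feats + g.unsqueeze(1)             # broadcast over time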
Limitations include the need for substantial training data and computational resources, as well as potential sensitivity to text normalization, phonemization, and alignment quality, which can cause mispronunciations or unstable prosody on out-of-domain input.
See also: text-to-speech, variational autoencoder, normalizing flow, neural vocoder.