VoxCeleb2

VoxCeleb2 is a large-scale audio-visual dataset for speaker recognition, developed as an extension of VoxCeleb1. It collects speech from public YouTube videos and provides speaker labels across a diverse range of environments and recording conditions. The dataset is designed to support robust speaker verification and identification research in real-world, "in the wild" scenarios.

VoxCeleb2 contains over 6,000 speakers and more than one million utterances, drawn from a large corpus of YouTube clips. The clips vary in language, accent, background noise, channel quality, and recording device. Each clip is labeled with the corresponding speaker identity, and many clips include video frames that can be used for auxiliary tasks such as face verification or lip synchronization, although the primary focus is on the auditory signal.
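
For concreteness, the following is a minimal Python sketch of how a local copy of the dataset might be indexed into (audio path, speaker label) pairs. It assumes the speaker/video/segment directory layout commonly used for VoxCeleb2 downloads; the root path below is a placeholder, and the .m4a suffix may need adjusting if the audio has been re-encoded.

```python
from pathlib import Path

def index_voxceleb2(root):
    """Pair each audio file with its speaker label, assuming the
    speaker_id/video_id/segment.m4a layout of typical VoxCeleb2
    downloads (an assumption about a local copy, not a spec)."""
    root = Path(root)
    index = []
    for audio_path in sorted(root.glob("*/*/*.m4a")):
        speaker_id = audio_path.parts[-3]  # e.g. "id00012"
        index.append((str(audio_path), speaker_id))
    return index

if __name__ == "__main__":
    # "VoxCeleb2/dev/aac" is a placeholder path for a local copy.
    pairs = index_voxceleb2("VoxCeleb2/dev/aac")
    print(f"{len(pairs)} utterances indexed")
```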

Standard experimental splits provide training, development, and test sets. The dataset is commonly used with state-of-the-art neural features such as x-vectors and deep speaker embeddings, evaluated using verification metrics like equal error rate (EER) and identification accuracy. Baseline results and open-source toolkits are widely cited in the literature, making VoxCeleb2 a common benchmark for cross-domain and cross-language speaker recognition.
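
To make the evaluation side concrete, here is a small Python sketch (using numpy and scikit-learn) that scores verification trials with cosine similarity and computes the equal error rate. The random vectors stand in for the output of an x-vector or other speaker-embedding extractor; nothing here is specific to any particular toolkit.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two fixed-length speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def equal_error_rate(scores, labels):
    """EER: the operating point where the false acceptance rate equals
    the false rejection rate. labels: 1 = same-speaker (target) trial,
    0 = impostor trial."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.nanargmin(np.abs(fnr - fpr)))
    return (fpr[idx] + fnr[idx]) / 2.0

if __name__ == "__main__":
    # Random vectors stand in for embeddings from a real extractor.
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(4, 192))
    scores = [cosine_score(embs[0], embs[1]), cosine_score(embs[2], embs[3])]
    labels = [1, 0]  # toy trial list: one target trial, one impostor trial
    print(f"EER: {equal_error_rate(scores, labels):.3f}")
```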

VoxCeleb2 was released in 2018 by researchers from the Visual Geometry Group at the University of Oxford and their collaborators, accompanying the paper "VoxCeleb2: Deep Speaker Recognition" (Chung, Nagrani, and Zisserman, Interspeech 2018).

It is distributed for research use under terms that require citation and adherence to its licensing conditions. The dataset has been adopted by numerous studies in machine learning and speech processing to assess generalization to real-world data and cross-language variation.