VoxCeleb2

VoxCeleb2 is a large-scale audio-visual dataset for speaker recognition, developed as an extension of VoxCeleb1. It collects speech from public YouTube videos and provides speaker labels across a diverse range of environments and recording conditions. The dataset is designed to support robust speaker verification and identification research in real-world, "in the wild" scenarios.

VoxCeleb2 contains over 6,000 speakers and more than one million utterances, drawn from a large corpus of YouTube clips. The clips vary in language, accent, background noise, channel quality, and recording device. Each clip is labeled with the corresponding speaker identity, and many clips include video frames that can be used for auxiliary tasks such as face verification or lip synchronization, although the primary focus is on the auditory signal.
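
For concreteness, the following is a minimal Python sketch of how a local copy of the dataset might be indexed into (audio path, speaker label) pairs. It assumes the speaker/video/segment directory layout commonly used for VoxCeleb2 downloads; the root path below is a placeholder, and the .m4a suffix may need adjusting if the audio has been re-encoded.

```python
from pathlib import Path

def index_voxceleb2(root):
    """Pair each audio file with its speaker label, assuming the
    speaker_id/video_id/segment.m4a layout of typical VoxCeleb2
    downloads (an assumption about a local copy, not a spec)."""
    root = Path(root)
    index = []
    for audio_path in sorted(root.glob("*/*/*.m4a")):
        speaker_id = audio_path.parts[-3]  # e.g. "id00012"
        index.append((str(audio_path), speaker_id))
    return index

if __name__ == "__main__":
    # "VoxCeleb2/dev/aac" is a placeholder path for a local copy.
    pairs = index_voxceleb2("VoxCeleb2/dev/aac")
    print(f"{len(pairs)} utterances indexed")
```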

Standard experimental splits provide training, development, and test sets. The dataset is commonly used with state-of-the-art neural features such as x-vectors and deep speaker embeddings, evaluated using verification metrics like equal error rate (EER) and identification accuracy. Baseline results and open-source toolkits are widely cited in the literature, making VoxCeleb2 a common benchmark for cross-domain and cross-language speaker recognition.
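
To make the evaluation side concrete, here is a small Python sketch (using numpy and scikit-learn) that scores verification trials with cosine similarity and computes the equal error rate. The random vectors stand in for the output of an x-vector or other speaker-embedding extractor; nothing here is specific to any particular toolkit.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two fixed-length speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def equal_error_rate(scores, labels):
    """EER: the operating point where the false acceptance rate equals
    the false rejection rate. labels: 1 = same-speaker (target) trial,
    0 = impostor trial."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.nanargmin(np.abs(fnr - fpr)))
    return (fpr[idx] + fnr[idx]) / 2.0

if __name__ == "__main__":
    # Random vectors stand in for embeddings from a real extractor.
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(4, 192))
    scores = [cosine_score(embs[0], embs[1]), cosine_score(embs[2], embs[3])]
    labels = [1, 0]  # toy trial list: one target trial, one impostor trial
    print(f"EER: {equal_error_rate(scores, labels):.3f}")
```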

VoxCeleb2 was released in 2018 by researchers from the Visual Geometry Group at the University of Oxford and their collaborators, accompanying the paper "VoxCeleb2: Deep Speaker Recognition" (Chung, Nagrani, and Zisserman, Interspeech 2018).

It is distributed for research use under terms that require citation and adherence to its licensing conditions. The dataset has been adopted by numerous studies in machine learning and speech processing to assess generalization to real-world data and cross-language variation.