MFCCs

MFCCs, or mel-frequency cepstral coefficients, are a widely used feature representation in speech and audio processing. They capture the spectral envelope of a sound by mapping the power spectrum onto a perceptually motivated mel scale and then decorrelating the result with a discrete cosine transform.
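The two operations named above can be written out in symbols. This is a sketch of one common formulation, not the only one: the constants 2595 and 700 belong to one widely used variant of the mel scale (others exist), and the symbols f, m, E_k, K, c_n and N are introduced here for illustration.

```latex
m(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
c_n = \sum_{k=1}^{K} \log(E_k)\,
      \cos\!\left[\frac{\pi n}{K}\left(k - \tfrac{1}{2}\right)\right],
\quad n = 0, 1, \dots, N-1
```

Here f is frequency in hertz, m(f) the corresponding mel value, E_k the energy collected by the k-th of K mel-spaced filters, and c_n the n-th of N retained cepstral coefficients (an unnormalized type-II discrete cosine transform of the log energies).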

The computation of MFCCs typically involves several steps. The audio signal is pre-emphasized and divided into short frames, usually 20–40 milliseconds in length, with overlap. Each frame is windowed, commonly with a Hamming window, and its power spectrum is estimated via the Fourier transform. The spectrum is passed through a bank of triangular filters spaced on the mel scale, the energy within each filter is summed, and the logarithm of each sum is taken. This logarithmic compression roughly approximates human loudness perception. A discrete cosine transform is then applied to the log energies to produce the cepstral coefficients. In practice, 12–13 coefficients are commonly retained per frame, with the 0th coefficient sometimes omitted or replaced by the frame’s log energy.

Extensions and variants include adding delta and delta-delta (temporal derivative) features to capture dynamics, and applying normalization or alternative perceptual scales. Robustness can be enhanced with techniques such as cepstral mean and variance normalization (CMVN).

MFCCs are standard inputs for automatic speech recognition, speaker identification, and many audio classification tasks. They offer compact, interpretable representations of spectral shape but can be sensitive to noise, channel effects, and frame parameters, and may be complemented by other features in noisy environments.
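The computation steps, delta features, and CMVN described above can be sketched in a few dozen lines. This is a minimal illustration using only the Python standard library, not a reference implementation. All function names are hypothetical. The defaults (25 ms frames, 10 ms hop, 26 filters, pre-emphasis factor 0.97) are conventional choices assumed here rather than values given in the text, and the mel formula is one common variant. The delta function uses a simple central difference, whereas production toolkits often use a regression over several frames, and the naive DFT stands in for an FFT.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points: filter j rises from point j-1 to j, falls to j+1.
    pts = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    return [[max(0.0, min((f - pts[j - 1]) / (pts[j] - pts[j - 1]),
                          (pts[j + 1] - f) / (pts[j + 1] - pts[j])))
             for f in bin_hz] for j in range(1, n_filters + 1)]


def power_spectrum(frame):
    """Power spectrum via a naive O(N^2) DFT; real code would use an FFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2 / n
            for k in range(n // 2 + 1)]


def dct2(values, n_out):
    """First n_out coefficients of an unnormalized DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values)) for k in range(n_out)]


def mfcc(signal, sample_rate, frame_ms=25.0, hop_ms=10.0,
         n_filters=26, n_ceps=13, preemph=0.97):
    """Return one list of n_ceps coefficients per frame."""
    # Pre-emphasis: first-order high-pass, y[t] = x[t] - preemph * x[t-1].
    y = [signal[0]] + [signal[t] - preemph * signal[t - 1]
                       for t in range(1, len(signal))]
    frame_len = round(sample_rate * frame_ms / 1000.0)
    hop = round(sample_rate * hop_ms / 1000.0)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]  # Hamming window
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = [y[start + t] * window[t] for t in range(frame_len)]
        power = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(filt, power)) for filt in bank]
        # Floor before the log so an empty filter cannot produce log(0).
        log_e = [math.log(max(e, 1e-10)) for e in energies]
        features.append(dct2(log_e, n_ceps))
    return features


def deltas(features):
    """Central-difference estimate of the temporal derivative, edges clamped."""
    n = len(features)
    return [[(features[min(i + 1, n - 1)][c] - features[max(i - 1, 0)][c]) / 2.0
             for c in range(len(features[0]))] for i in range(n)]


def cmvn(features):
    """Normalize each coefficient to zero mean and unit variance over frames."""
    n = len(features)
    out = [row[:] for row in features]
    for c in range(len(features[0])):
        col = [row[c] for row in features]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][c] = (col[i] - mean) / std
    return out
```

Given a list of float samples, `mfcc(samples, sample_rate)` returns one 13-element list per frame; `deltas` can be applied once for delta features and again to its own output for delta-delta features, and `cmvn` normalizes each coefficient across the frames of an utterance.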