Self-distillation

Self-distillation is a training paradigm in machine learning in which a model is trained using its own predictions as soft supervision. In its standard form, the model minimizes a combination of a conventional supervised loss against hard labels and a distillation loss that encourages its current outputs to align with softened probability distributions produced by a teacher. What distinguishes self-distillation from conventional knowledge distillation is that the teacher is the model itself or an earlier version of it, rather than a separate model.
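The combined objective can be written as a weighted sum of the two terms. Below is a minimal sketch in PyTorch-style Python; the function name, the weighting parameter alpha, and the temperature value are illustrative assumptions rather than a standard API, and the temperature scaling and KL divergence follow the common practice described in the next paragraph.

    import torch
    import torch.nn.functional as F

    def self_distillation_loss(student_logits, teacher_logits, labels,
                               alpha=0.5, temperature=2.0):
        # Conventional supervised loss against the hard labels.
        hard_loss = F.cross_entropy(student_logits, labels)

        # Soften both distributions with a temperature and match them with a
        # KL divergence; the teacher logits come from the model itself (e.g. a
        # frozen earlier snapshot), so no gradients flow through them.
        soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        distill_loss = F.kl_div(log_soft_student, soft_teacher,
                                reduction="batchmean") * temperature ** 2

        # alpha sets the relative weight of the distillation term.
        return (1.0 - alpha) * hard_loss + alpha * distill_loss

Scaling the KL term by the squared temperature keeps the gradient magnitudes of the two terms comparable as the temperature changes.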

Variants include single-model self-distillation, where predictions from the model at an earlier training stage (or with a higher temperature) guide the current model, and iterative or Born-Again Networks, where a sequence of models is trained and each serves as the teacher for the next one of the same architecture. Temperature scaling is typically used to soften the probability estimates, and the distillation loss is often a KL divergence between the teacher and student distributions.
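The iterative (Born-Again) variant can be sketched as a loop over generations in which each trained model becomes the teacher for a fresh copy of the same architecture. The make_model constructor, the data loader, the optimizer, and the epoch count below are assumptions for illustration, and the loss reuses the sketch above.

    import torch
    import torch.nn.functional as F

    def born_again_training(make_model, loader, generations=3, epochs=5, lr=1e-3):
        teacher = None
        for _ in range(generations):
            student = make_model()  # fresh model, same architecture each generation
            optimizer = torch.optim.Adam(student.parameters(), lr=lr)
            for _ in range(epochs):
                for inputs, labels in loader:
                    student_logits = student(inputs)
                    if teacher is None:
                        # First generation: hard labels only, no teacher yet.
                        loss = F.cross_entropy(student_logits, labels)
                    else:
                        with torch.no_grad():
                            teacher_logits = teacher(inputs)
                        loss = self_distillation_loss(student_logits,
                                                      teacher_logits, labels)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
            teacher = student.eval()  # this generation teaches the next one
        return teacher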

Advantages of self-distillation include improved generalization, reduced overfitting, and sometimes gains comparable to traditional knowledge distillation while avoiding a separate teacher model. It acts as a form of regularization and can stabilize training on some tasks.

Limitations include gains that are data- and model-dependent, the need for careful tuning of the temperature and the relative weight of the distillation loss, and added computational overhead from generating teacher predictions, especially in iterative setups.

Applications span image classification, natural language processing, and other supervised learning tasks. Related concepts are knowledge distillation, pseudo-labeling, and label smoothing.
