Self-distillation

Self-distillation is a training paradigm in machine learning in which a model is trained using its own predictions as soft supervision. In its standard form, the model minimizes a combination of a conventional supervised loss against hard labels and a distillation loss that encourages its current outputs to align with softened probability distributions produced by a teacher. What distinguishes self-distillation from conventional knowledge distillation is that the teacher is the model itself or an earlier version of it, rather than a separate model.
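The combined objective can be written as a weighted sum of the two terms. Below is a minimal sketch in PyTorch-style Python; the function name, the weighting parameter alpha, and the temperature value are illustrative assumptions rather than a standard API, and the temperature scaling and KL divergence follow the common practice described in the next paragraph.

    import torch
    import torch.nn.functional as F

    def self_distillation_loss(student_logits, teacher_logits, labels,
                               alpha=0.5, temperature=2.0):
        # Conventional supervised loss against the hard labels.
        hard_loss = F.cross_entropy(student_logits, labels)

        # Soften both distributions with a temperature and match them with a
        # KL divergence; the teacher logits come from the model itself (e.g. a
        # frozen earlier snapshot), so no gradients flow through them.
        soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        distill_loss = F.kl_div(log_soft_student, soft_teacher,
                                reduction="batchmean") * temperature ** 2

        # alpha sets the relative weight of the distillation term.
        return (1.0 - alpha) * hard_loss + alpha * distill_loss

Scaling the KL term by the squared temperature keeps the gradient magnitudes of the two terms comparable as the temperature changes.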

Variants include single-model self-distillation, where predictions from the model at an earlier training stage (or with a higher temperature) guide the current model, and iterative or Born-Again Networks, where a sequence of models is trained and each serves as the teacher for the next one of the same architecture. Temperature scaling is typically used to soften the probability estimates, and the distillation loss is often a KL divergence between the teacher and student distributions.
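The iterative (Born-Again) variant can be sketched as a loop over generations in which each trained model becomes the teacher for a fresh copy of the same architecture. The make_model constructor, the data loader, the optimizer, and the epoch count below are assumptions for illustration, and the loss reuses the sketch above.

    import torch
    import torch.nn.functional as F

    def born_again_training(make_model, loader, generations=3, epochs=5, lr=1e-3):
        teacher = None
        for _ in range(generations):
            student = make_model()  # fresh model, same architecture each generation
            optimizer = torch.optim.Adam(student.parameters(), lr=lr)
            for _ in range(epochs):
                for inputs, labels in loader:
                    student_logits = student(inputs)
                    if teacher is None:
                        # First generation: hard labels only, no teacher yet.
                        loss = F.cross_entropy(student_logits, labels)
                    else:
                        with torch.no_grad():
                            teacher_logits = teacher(inputs)
                        loss = self_distillation_loss(student_logits,
                                                      teacher_logits, labels)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
            teacher = student.eval()  # this generation teaches the next one
        return teacher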

Advantages of self-distillation include improved generalization, reduced overfitting, and sometimes gains comparable to traditional knowledge distillation while avoiding a separate teacher model. It acts as a form of regularization and can stabilize training on some tasks.

Limitations include gains that are data- and model-dependent, the need for careful tuning of the temperature and the relative weight of the distillation loss, and added computational overhead from generating teacher predictions, especially in iterative setups.

Applications span image classification, natural language processing, and other supervised learning tasks. Related concepts are knowledge distillation, pseudo-labeling, and label smoothing.
