Pseudolabeling

Pseudolabeling is a semi-supervised learning technique in which a classifier trained on a labeled dataset is used to assign labels to unlabeled data. The newly labeled examples, called pseudolabels, are then added to the training data to retrain the model. Pseudolabeling is a form of self-training and is widely used when labeled data are scarce but unlabeled data are plentiful.

Typical procedure: train an initial model on the labeled set; apply it to the unlabeled set; select predictions whose confidence (or predicted probability) exceeds a threshold; treat those predictions as true labels; augment the labeled dataset with these pseudolabeled examples; retrain the model; repeat. Some approaches use soft labels (the predicted probability distributions) rather than hard class assignments, or use confidence calibration to weight examples.

Assumptions and considerations: the unlabeled data should come from the same distribution as the labeled data, and the model's confident predictions should be more likely to be correct. Benefits include leveraging large unlabeled corpora to improve accuracy with limited supervision.

Challenges and limitations: pseudolabels can be incorrect and propagate errors, a phenomenon known as confirmation bias or label noise accumulation. The method is sensitive to class imbalance and domain mismatch, and it requires careful thresholding, monitoring on a validation set, and sometimes combination with other semi-supervised methods such as consistency regularization or co-training.

Applications: used in computer vision, natural language processing, and speech recognition, particularly for image classification and sequence labeling tasks. The method has variants and theoretical analyses focusing on the conditions under which it improves performance.
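The basic self-training loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the dataset, the classifier (scikit-learn's LogisticRegression), the confidence threshold of 0.95, and the fixed number of rounds are all assumed choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: a small labeled set and a larger unlabeled pool.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_lab, y_lab = X[:50], y[:50]   # scarce labeled data
X_unlab = X[50:]                # plentiful unlabeled data (labels withheld)

THRESHOLD = 0.95  # assumed confidence threshold; tune on a validation set

model = LogisticRegression(max_iter=1000)
for _ in range(3):  # a few self-training rounds (assumed count)
    # 1. Train on the current labeled set.
    model.fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    # 2. Predict on the unlabeled pool and keep high-confidence predictions.
    proba = model.predict_proba(X_unlab)
    confidence = proba.max(axis=1)
    keep = confidence >= THRESHOLD
    if not keep.any():
        break
    # 3. Treat those predictions as true (pseudo)labels and augment the
    #    labeled set, removing the pseudolabeled points from the pool.
    pseudo_y = proba.argmax(axis=1)[keep]
    X_lab = np.vstack([X_lab, X_unlab[keep]])
    y_lab = np.concatenate([y_lab, pseudo_y])
    X_unlab = X_unlab[~keep]
```

A soft-label variant would keep the full rows of `proba` as targets instead of `argmax`, and a calibrated variant would weight each pseudolabeled example by its confidence rather than applying a hard threshold.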