CLIP

CLIP, or Contrastive Language-Image Pretraining, is a multimodal neural network architecture developed by OpenAI for learning joint representations of images and text. Introduced in 2021, it enables zero-shot image classification by aligning image and text embeddings in a shared latent space.

The model comprises two encoders: an image encoder (based on a convolutional neural network such as ResNet or a Vision Transformer) and a text encoder (a Transformer). It is trained with a contrastive loss on a large dataset of image–text pairs, so that the embedding of an image is close to the embedding of the text that describes it and far from the embeddings of unrelated text.
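
The sketch below illustrates how such a symmetric contrastive objective can be computed over a batch. It assumes PyTorch and treats the image and text features as already produced by the two encoders; the temperature value and tensor shapes are illustrative rather than taken from the original model.

    # Minimal sketch of a CLIP-style symmetric contrastive loss (PyTorch assumed).
    # The image and text features are placeholders for the outputs of the
    # ResNet/ViT image encoder and the Transformer text encoder described above.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        # L2-normalise both embedding sets so dot products are cosine similarities.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
        logits = image_features @ text_features.t() / temperature

        # Matching pairs lie on the diagonal; treat each row (and each column)
        # as a classification problem over the batch and average both directions.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_image_to_text = F.cross_entropy(logits, targets)
        loss_text_to_image = F.cross_entropy(logits.t(), targets)
        return (loss_image_to_text + loss_text_to_image) / 2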

Training data consists of hundreds of millions of image–caption pairs collected from the web, with no manual labeling. The resulting embeddings can be used to perform tasks by computing similarities to text prompts rather than fine-tuning.
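
As an illustration of prompt-based zero-shot classification, the following sketch scores an image against a handful of candidate captions. It uses the Hugging Face transformers wrapper around a public CLIP checkpoint purely as an example implementation; the model name, image path, and label list are assumptions, not part of the original text.

    # Zero-shot classification by comparing an image to text prompts.
    import torch
    from transformers import CLIPModel, CLIPProcessor
    from PIL import Image

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    labels = ["cat", "dog", "car"]                       # illustrative label set
    prompts = [f"a photo of a {label}" for label in labels]
    image = Image.open("example.jpg")                    # placeholder image path

    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds image-text similarity scores; a softmax turns them
    # into a distribution over the candidate prompts without any fine-tuning.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(labels, probs[0].tolist())))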

Applications include zero-shot classification, image search, and multimodal retrieval; the embeddings can serve as a feature extractor for downstream tasks and as a component in multimodal pipelines.
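
A minimal text-to-image retrieval sketch under these assumptions: images are embedded ahead of time, a text query is embedded at search time, and results are ranked by cosine similarity. The array shapes and the random stand-in embeddings are purely illustrative.

    # Ranking precomputed image embeddings against a query embedding.
    import numpy as np

    def search(query_embedding, image_embeddings, k=5):
        # Cosine similarity: normalise both sides, then take dot products.
        q = query_embedding / np.linalg.norm(query_embedding)
        db = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
        scores = db @ q
        # Indices of the k best-matching images, highest similarity first.
        return np.argsort(-scores)[:k]

    # Example with random stand-in embeddings (1000 images, 512 dimensions).
    rng = np.random.default_rng(0)
    image_embeddings = rng.normal(size=(1000, 512))
    query_embedding = rng.normal(size=512)
    print(search(query_embedding, image_embeddings))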

Limitations include sensitivity to prompt wording (which often calls for prompt engineering), biases and safety concerns inherited from the training data, and reduced performance under distribution shift or when class labels are not well represented in text prompts. It is also computationally intensive and not guaranteed to outperform supervised models on every task.
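
One common way to reduce prompt sensitivity is to average text embeddings over several templates per class, sketched below. The template list is arbitrary, and encode_text is a hypothetical stand-in for a CLIP text encoder returning unit-norm vectors.

    # Prompt ensembling: average per-class text embeddings over templates.
    import numpy as np

    templates = [
        "a photo of a {}.",
        "a blurry photo of a {}.",
        "a close-up photo of a {}.",
    ]

    def class_embedding(label, encode_text):
        vectors = np.stack([encode_text(t.format(label)) for t in templates])
        mean = vectors.mean(axis=0)
        return mean / np.linalg.norm(mean)  # renormalise the averaged embedding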

Variants and impact: CLIP-inspired models and related vision–language research have influenced subsequent multimodal systems and open-source implementations.

See also: Vision-language models; Contrastive learning; Multimodal embeddings.
