CLIP
CLIP, or Contrastive Language-Image Pretraining, is a multimodal neural network architecture developed by OpenAI for learning joint representations of images and text. Introduced in 2021, it enables zero-shot image classification by aligning image and text embeddings in a shared latent space.
The model comprises two encoders: an image encoder (based on a convolutional neural network such as ResNet, or a Vision Transformer) and a text encoder (a Transformer). Both encoders project their inputs into a shared embedding space, and training uses a contrastive objective that pulls matching image–text pairs together while pushing apart mismatched pairs within each batch, as sketched below.
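The objective can be illustrated with a minimal PyTorch sketch of the symmetric contrastive loss; the random tensors stand in for encoder outputs, and the fixed temperature is a simplification of CLIP's learned logit scale.

```python
# Minimal sketch of CLIP's symmetric contrastive objective (not the original code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize embeddings so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_features @ text_features.t() / temperature

    # The matching image–caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the correct caption for each image
    # and the correct image for each caption.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy batch: 8 image/text embedding pairs of dimension 512 standing in for encoder outputs.
image_embeddings = torch.randn(8, 512)
text_embeddings = torch.randn(8, 512)
print(clip_contrastive_loss(image_embeddings, text_embeddings))
```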
Training data consists of hundreds of millions of image–caption pairs collected from the web, with no manual labeling of categories; the original model was trained on roughly 400 million such pairs.
Applications include zero-shot classification, image search, and multimodal retrieval; the embeddings can also serve as a feature representation for downstream tasks and as a component of larger multimodal systems such as text-to-image generators.
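As a rough illustration of zero-shot classification, the sketch below assumes the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint; the file name example.jpg is a placeholder for any RGB image.

```python
# Zero-shot classification sketch using a pretrained CLIP checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate labels are expressed as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder path for any RGB image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image–text similarity scores; softmax turns them
# into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

No cat/dog/car-specific training is involved: the class set is defined entirely by the text prompts at inference time.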
Limitations include sensitivity to prompt wording and the resulting need for prompt engineering, biases and safety concerns inherited from the uncurated web training data, and weaker performance on fine-grained or abstract tasks such as counting objects.
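One common mitigation for prompt sensitivity is prompt ensembling: embedding each class under several templates and averaging the normalized text embeddings. The sketch below makes the same transformers/checkpoint assumptions as above; the templates and class names are illustrative only.

```python
# Prompt-ensembling sketch: average text embeddings over several templates per class.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]
classes = ["cat", "dog", "car"]

class_embeddings = []
with torch.no_grad():
    for name in classes:
        prompts = [t.format(name) for t in templates]
        tokens = processor(text=prompts, return_tensors="pt", padding=True)
        feats = model.get_text_features(**tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize each prompt embedding
        class_embeddings.append(feats.mean(dim=0))        # average over templates
class_embeddings = torch.stack(class_embeddings)
print(class_embeddings.shape)  # (num_classes, embedding_dim)
```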
Variants and impact: CLIP-inspired models and related vision–language research have influenced subsequent multimodal systems and open-source reimplementations such as OpenCLIP, and CLIP components are widely reused in text-to-image generation pipelines.
See also: Vision-language models; Contrastive learning; Multimodal embeddings.