CLIP

CLIP, or Contrastive Language-Image Pretraining, is a multimodal neural network architecture developed by OpenAI for learning joint representations of images and text. Introduced in 2021, it enables zero-shot image classification by aligning image and text embeddings in a shared latent space.

The model comprises two encoders: an image encoder (based on a convolutional neural network such as ResNet or a Vision Transformer) and a text encoder (a Transformer). It is trained with a contrastive loss on a large dataset of image–text pairs, so that the embedding of an image is close to the embedding of the text that describes it and far from the embeddings of unrelated text.
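
The sketch below illustrates how such a symmetric contrastive objective can be computed over a batch. It assumes PyTorch and treats the image and text features as already produced by the two encoders; the temperature value and tensor shapes are illustrative rather than taken from the original model.

    # Minimal sketch of a CLIP-style symmetric contrastive loss (PyTorch assumed).
    # The image and text features are placeholders for the outputs of the
    # ResNet/ViT image encoder and the Transformer text encoder described above.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        # L2-normalise both embedding sets so dot products are cosine similarities.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
        logits = image_features @ text_features.t() / temperature

        # Matching pairs lie on the diagonal; treat each row (and each column)
        # as a classification problem over the batch and average both directions.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_image_to_text = F.cross_entropy(logits, targets)
        loss_text_to_image = F.cross_entropy(logits.t(), targets)
        return (loss_image_to_text + loss_text_to_image) / 2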

Training data consists of hundreds of millions of image–caption pairs collected from the web, with no manual labeling. The resulting embeddings can be used to perform tasks by computing similarities to text prompts rather than fine-tuning.
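
As an illustration of prompt-based zero-shot classification, the following sketch scores an image against a handful of candidate captions. It uses the Hugging Face transformers wrapper around a public CLIP checkpoint purely as an example implementation; the model name, image path, and label list are assumptions, not part of the original text.

    # Zero-shot classification by comparing an image to text prompts.
    import torch
    from transformers import CLIPModel, CLIPProcessor
    from PIL import Image

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    labels = ["cat", "dog", "car"]                       # illustrative label set
    prompts = [f"a photo of a {label}" for label in labels]
    image = Image.open("example.jpg")                    # placeholder image path

    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds image-text similarity scores; a softmax turns them
    # into a distribution over the candidate prompts without any fine-tuning.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(labels, probs[0].tolist())))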

Applications include zero-shot classification, image search, and multimodal retrieval; the embeddings can serve as a feature extractor for downstream tasks and as a component in multimodal pipelines.
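
A minimal text-to-image retrieval sketch under these assumptions: images are embedded ahead of time, a text query is embedded at search time, and results are ranked by cosine similarity. The array shapes and the random stand-in embeddings are purely illustrative.

    # Ranking precomputed image embeddings against a query embedding.
    import numpy as np

    def search(query_embedding, image_embeddings, k=5):
        # Cosine similarity: normalise both sides, then take dot products.
        q = query_embedding / np.linalg.norm(query_embedding)
        db = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
        scores = db @ q
        # Indices of the k best-matching images, highest similarity first.
        return np.argsort(-scores)[:k]

    # Example with random stand-in embeddings (1000 images, 512 dimensions).
    rng = np.random.default_rng(0)
    image_embeddings = rng.normal(size=(1000, 512))
    query_embedding = rng.normal(size=512)
    print(search(query_embedding, image_embeddings))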

Limitations include sensitivity to prompt wording (which often calls for prompt engineering), biases and safety concerns inherited from the training data, and reduced performance under distribution shift or when class labels are not well represented in text prompts. It is also computationally intensive and not guaranteed to outperform supervised models on every task.
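
One common way to reduce prompt sensitivity is to average text embeddings over several templates per class, sketched below. The template list is arbitrary, and encode_text is a hypothetical stand-in for a CLIP text encoder returning unit-norm vectors.

    # Prompt ensembling: average per-class text embeddings over templates.
    import numpy as np

    templates = [
        "a photo of a {}.",
        "a blurry photo of a {}.",
        "a close-up photo of a {}.",
    ]

    def class_embedding(label, encode_text):
        vectors = np.stack([encode_text(t.format(label)) for t in templates])
        mean = vectors.mean(axis=0)
        return mean / np.linalg.norm(mean)  # renormalise the averaged embedding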

Variants and impact: CLIP-inspired models and related vision–language research have influenced subsequent multimodal systems and open-source implementations.

See also: Vision-language models; Contrastive learning; Multimodal embeddings.
