
What is: Contrastive Language-Image Pre-training?

Source: Learning Transferable Visual Models From Natural Language Supervision
Data Source: CC BY-SA -

Contrastive Language-Image Pre-training (CLIP) is an efficient method for learning image representations from natural language supervision. It consists of a simplified version of ConVIRT trained from scratch. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.
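
The zero-shot step can be illustrated with a minimal sketch. It is not CLIP's actual code: `encode_text` stands in for the learned text encoder and is replaced here by a hypothetical lookup table of hand-picked vectors, the prompt template "a photo of a {name}" is an assumed way of turning class names into descriptions, and all function names are made up for illustration. The point is only that the normalized class embeddings act as the weight rows of a linear classifier over image embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_zero_shot_classifier(class_names, encode_text):
    """Embed one prompt per class; each normalized embedding is one weight row."""
    # Assumption: a simple prompt template turns a class name into a description.
    return [l2_normalize(encode_text(f"a photo of a {name}")) for name in class_names]


def classify(image_embedding, weights, class_names):
    """Return the class whose text embedding is most similar to the image embedding."""
    img = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(img, w)) for w in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores


# Hypothetical stand-in for a trained text encoder: a fixed lookup table.
toy_text_vectors = {
    "a photo of a cat": [1.0, 0.1, 0.0],
    "a photo of a dog": [0.1, 1.0, 0.0],
}
class_names = ["cat", "dog"]
weights = build_zero_shot_classifier(class_names, toy_text_vectors.__getitem__)

# Hypothetical image embedding, as if produced by the image encoder.
label, scores = classify([0.9, 0.2, 0.1], weights, class_names)
print(label)  # cat
```

No training on the target dataset is involved: changing the set of classes only means embedding a different list of names.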

For pre-training, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores.
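
The objective described above can be sketched in a few lines of dependency-free Python. This is a toy version under stated assumptions, not the reference implementation: the embeddings are taken as already computed lists of floats, the temperature is a fixed scalar (0.07 here, whereas the paper treats it as a learned parameter), and the names `clip_loss` and `cross_entropy` are chosen for illustration. Row `i` of the similarity matrix scores image `i` against every text in the batch, so the correct "class" for row `i` is index `i`, and the same holds for each column.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross entropy over the N x N cosine-similarity matrix."""
    n = len(image_emb)
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]

    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each row should pick out its own column.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should pick out its own row.
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n

    return (loss_images + loss_texts) / 2
```

With correctly paired embeddings the loss approaches zero; when every embedding in the batch is identical the model cannot tell pairs apart and the loss equals $\log N$.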

Image credit: Learning Transferable Visual Models From Natural Language Supervision