Contrastive Language–Image Pre-training (CLIP)
Contrastive Language–Image Pre-training (CLIP) is a state-of-the-art model published by OpenAI.
The key innovation of the model is its contrastive training approach, in which positive samples (matching image–text pairs) and negative samples (all other images and texts in the batch) are used to learn a scoring function that yields a shared representation of the data. The model consists of an image encoder and a text encoder. During contrastive training, labelled image–text pairs teach the encoders to score matching pairs higher than mismatched ones, and the resulting image and text embeddings can then be used to build a zero-shot classifier, as shown below.
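The snippet below is a minimal sketch of how zero-shot classification works on top of such embeddings: the image embedding is compared against embeddings of candidate text prompts via cosine similarity, and the highest-scoring prompt wins. The embeddings here are random stand-ins (a real CLIP model would produce them from its encoders), and the prompt labels are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical pre-computed embeddings: 1 image, 3 candidate text prompts
# (e.g. "a photo of a dog", "a photo of a cat", "a photo of a car").
image_embedding = torch.randn(1, 512)   # stand-in for the image encoder output
text_embeddings = torch.randn(3, 512)   # stand-ins for the text encoder outputs

# L2-normalise so the dot product equals cosine similarity.
image_embedding = F.normalize(image_embedding, dim=-1)
text_embeddings = F.normalize(text_embeddings, dim=-1)

# Score every (image, text) pair and pick the highest-scoring prompt.
logits = image_embedding @ text_embeddings.T   # shape: (1, 3)
probs = logits.softmax(dim=-1)
predicted_class = probs.argmax(dim=-1)
print(probs, predicted_class)
```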
Practical Applications: CLIP models can be used for a wide range of tasks, for example image captioning, image search and ranking, content moderation, and object tracking.
Basic Implementation: projection heads map the encoder outputs into a shared embedding space to obtain image and text embeddings, and contrastive pre-training is performed with a symmetric cross-entropy loss, as shown below.
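The following is a minimal sketch of that training step. The feature dimensions, batch size, and fixed temperature are illustrative assumptions (CLIP learns the temperature), and the encoder outputs are replaced by random tensors for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, image_dim, text_dim, embed_dim = 8, 2048, 768, 512

# Projection heads on top of the image and text encoders.
image_projection = nn.Linear(image_dim, embed_dim)
text_projection = nn.Linear(text_dim, embed_dim)

# Stand-ins for encoder outputs of a batch of paired images and captions.
image_features = torch.randn(batch_size, image_dim)
text_features = torch.randn(batch_size, text_dim)

# Project into the shared space and L2-normalise both modalities.
image_embeds = F.normalize(image_projection(image_features), dim=-1)
text_embeds = F.normalize(text_projection(text_features), dim=-1)

# Pairwise similarity matrix, scaled by a temperature (fixed here for simplicity).
temperature = 0.07
logits = image_embeds @ text_embeds.T / temperature

# The i-th image matches the i-th caption, so the targets are the diagonal.
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```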
In the paper, the authors employed ResNets and Vision Transformers as image encoders and a Transformer as the text encoder.
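As a rough sketch of this two-encoder setup, the code below wires a torchvision ResNet-50 (classification head removed) to a small PyTorch Transformer text encoder. The layer sizes, vocabulary size, and pooling (first token instead of the end-of-text token) are simplifying assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Image encoder: ResNet-50 with the classification head removed.
image_encoder = resnet50(weights=None)
image_encoder.fc = nn.Identity()          # output: 2048-d image features

# Text encoder: a small Transformer over token embeddings.
vocab_size, width, max_len = 49408, 512, 77
token_embedding = nn.Embedding(vocab_size, width)
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True),
    num_layers=6,
)

images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, vocab_size, (2, max_len))
image_features = image_encoder(images)                       # (2, 2048)
text_features = text_encoder(token_embedding(tokens))[:, 0]  # simplified pooling: first token
print(image_features.shape, text_features.shape)
```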
In addition to the model specifics, data labelling was an important factor in training these models, and pre-trained models can be used directly, for example via the Hugging Face interface, as shown below.
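For instance, a pre-trained CLIP checkpoint can be loaded through the Hugging Face transformers library and used for zero-shot classification; the checkpoint name, image URL, and candidate labels below are illustrative choices.

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP checkpoint and its processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image and candidate text prompts.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores turned into probabilities over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```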