Large labeled datasets are really hard to get. One way around this is to decrease our models' reliance on labeled data. This is the motivation behind zero-shot learning, in which a model learns to classify classes it has never seen before.

For example, in image classification, even though the training data contains no samples labeled Panda, the classifier can still classify an image of a panda as Panda when it sees one.
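To make the panda example concrete, here is a minimal zero-shot classification sketch using the openai/clip package (jumping ahead to CLIP, which this note is about). `panda.jpg` and the prompt list are placeholder inputs, and the prompt template is just one reasonable choice:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes expressed as natural-language prompts -- "panda" never
# has to appear as a training label.
prompts = ["a photo of a panda", "a photo of a dog", "a photo of a cat"]

image = preprocess(Image.open("panda.jpg")).unsqueeze(0).to(device)  # hypothetical local file
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # image-vs-text similarity scores
    probs = logits_per_image.softmax(dim=-1)   # probabilities over the candidate prompts

print(prompts[probs.argmax().item()])
```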

The input samples are not labeled images but images paired with auxiliary descriptive text.

For example:

(Figure: an example image paired with its descriptive text.)

Similar to humans: by seeing 400 million image-text pairings of different objects, the model is able to understand how certain phrases and words correspond to certain patterns in the images. Once it has this understanding, the model can use this accumulated knowledge to extrapolate to other classification tasks.

Note that the auxiliary information is indeed a form of supervision, but it is not a label! Through this auxiliary information, we are able to use information-rich unstructured data instead of having to parse the unstructured data ourselves to handcraft a single label; handcrafting labels takes time and discards potentially useful information. With CLIP's approach, we can bypass this bottleneck and maximize the amount of information the model has access to.

Contrastive learning [TBD] is used to learn the relationship between image-text pairs.

Overall, CLIP's goal is to minimize the gap between an image's encoding and the encoding of its corresponding text, i.e. to make the image encoding and the corresponding text encoding as close as possible.
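A minimal sketch of this objective, roughly following the pseudocode in the CLIP paper: for a batch of N matched pairs, the i-th image and i-th text are pulled together while every other pairing in the batch is pushed apart. The fixed temperature below is a simplification (CLIP actually learns it as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (image, text) pairs.

    image_emb, text_emb: [N, d] embeddings from the two encoders.
    The i-th image and i-th text form the positive pair; all other
    pairings in the batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct text for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_img_to_text = F.cross_entropy(logits, targets)      # pick the right caption per image
    loss_text_to_img = F.cross_entropy(logits.t(), targets)  # pick the right image per caption
    return (loss_img_to_text + loss_text_to_img) / 2
```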

CLIP's full pipeline:

(Figure: overview of the full CLIP pipeline.)
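Since the pipeline figure does not reproduce here, below is a rough sketch of one CLIP training step. The encoders, dimensions, and class name are stand-ins (`image_encoder` and `text_encoder` would be something like a ResNet/ViT and a Transformer), not the exact original implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPSketch(nn.Module):
    """Toy stand-in for the CLIP training pipeline (not the real architecture)."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ResNet or ViT backbone
        self.text_encoder = text_encoder     # e.g. a Transformer over tokenized captions
        # Linear projections into the shared embedding space.
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # Learnable temperature, initialized as log(1/0.07) as in the paper.
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, images, texts):
        # Encode each modality, project into the shared space, and normalize.
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)
        # [N, N] pairwise similarities, scaled by the learned temperature.
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: match images to texts and texts to images.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

At inference time, the same encoders produce the image and prompt embeddings used in the zero-shot example earlier; training and zero-shot classification share the one similarity computation.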