Filip Radenovic presents Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

On 2023-01-12 at 13:00 in G205, Karlovo náměstí 13, Praha 2
Vision-language models trained with contrastive learning on large-scale noisy
data are becoming increasingly popular for zero-shot recognition problems. In
this paper we improve the following three aspects of the contrastive
pre-training pipeline: dataset noise, model initialization, and the training
objective. First, we propose a straightforward filtering strategy titled
Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset
size while improving performance across zero-shot vision-language
tasks. Next, we propose Concept Distillation, an approach that leverages
strong unimodal representations for contrastive training without increasing
training complexity, while outperforming prior work. Finally, we modify the
traditional contrastive alignment objective and propose an importance-sampling
approach to up-sample the importance of hard negatives without additional
complexity. On an extensive zero-shot benchmark of 28 tasks, our Distilled and
Hard-negative Training (DiHT) approach improves on 19 tasks compared to the
baseline. Furthermore, for few-shot linear probing, we propose a novel approach
that bridges the gap between zero-shot and few-shot performance, substantially
improving over prior work.
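
For illustration, below is a minimal PyTorch sketch of how importance sampling can up-weight hard negatives inside a CLIP-style contrastive loss. It is not the exact DiHT objective from the paper; the function names and the hyper-parameter beta are illustrative assumptions, and setting beta to zero recovers the standard symmetric InfoNCE loss.

# Sketch only: an InfoNCE-style contrastive loss whose negatives are
# re-weighted by importance-sampling weights, so harder (more similar)
# negatives contribute more to the denominator. Not the authors' exact
# DiHT formulation; `beta` and all names are illustrative assumptions.
import torch


def weighted_nce(logits: torch.Tensor, beta: float) -> torch.Tensor:
    """One direction of the loss; `logits` is the (N, N) temperature-scaled
    similarity matrix with matched pairs on the diagonal."""
    n = logits.size(0)
    pos_mask = torch.eye(n, dtype=torch.bool, device=logits.device)

    # Importance weights: proportional to exp(beta * similarity), zeroed on
    # the positive pair, normalized to mean 1 over each anchor's negatives.
    with torch.no_grad():
        w = torch.exp(beta * logits).masked_fill(pos_mask, 0.0)
        w = w * (n - 1) / w.sum(dim=1, keepdim=True)

    exp_logits = torch.exp(logits)
    pos = exp_logits.diagonal()                                 # positive terms
    neg = (w * exp_logits.masked_fill(pos_mask, 0.0)).sum(dim=1)  # weighted negatives
    return -torch.log(pos / (pos + neg)).mean()


def hard_negative_contrastive_loss(img_emb, txt_emb, temperature=0.07, beta=0.5):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of matched image-text pairs."""
    logits = img_emb @ txt_emb.t() / temperature
    # Symmetric image-to-text and text-to-image terms, as in CLIP-style training.
    return 0.5 * (weighted_nce(logits, beta) + weighted_nce(logits.t(), beta))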

Paper link: https://arxiv.org/abs/2301.02280
Responsible for content: Petr Pošík