Zero-Shot Detection via Vision and Language Knowledge Distillation

In this talk, I will introduce our recent work on ViLD, a training method based on Vision and Language knowledge Distillation. We distill knowledge from a pre-trained zero-shot image classification model (e.g., CLIP) into a two-stage detector (e.g., Mask R-CNN). Our method aligns the region embeddings in the detector with the text and image embeddings inferred by the pre-trained model. We use the text embeddings, obtained by feeding category names into the pre-trained text encoder, as the detection classifier. We then minimize the distance between the region embeddings and the image embeddings, obtained by feeding region proposals into the pre-trained image encoder. During inference, we add the text embeddings of novel categories to the detection classifier for zero-shot detection. We benchmark performance on the LVIS v1.0 dataset by holding out all rare categories as novel categories. ViLD obtains 16.1 mask APr with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can directly transfer to other datasets, achieving 72.2 AP50, 36.6 AP, and 11.8 AP on PASCAL VOC, COCO, and Objects365, respectively.
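
The two alignment steps described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: random vectors stand in for the outputs of the pre-trained text and image encoders, and the cosine-similarity softmax, the `temperature` value, the L1 distance, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_regions(region_emb, text_emb, temperature=0.01):
    """Score each region against each category's text embedding.

    Assumption: cosine similarity followed by a temperature-scaled softmax.
    """
    logits = l2_normalize(region_emb) @ l2_normalize(text_emb).T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def distillation_loss(region_emb, image_emb):
    """Distance between region embeddings and teacher image embeddings.

    Assumption: mean L1 distance on normalized embeddings; the abstract only
    says "minimize the distance".
    """
    return np.abs(l2_normalize(region_emb) - l2_normalize(image_emb)).mean()

# Random stand-ins for encoder outputs (hypothetical data, not real CLIP features).
rng = np.random.default_rng(0)
dim = 8
base_text = rng.normal(size=(3, dim))    # text embeddings of 3 base categories
novel_text = rng.normal(size=(2, dim))   # text embeddings of 2 novel categories
region_emb = rng.normal(size=(4, dim))   # detector embeddings for 4 proposals
teacher_emb = rng.normal(size=(4, dim))  # image-encoder embeddings of the same crops

# Training: classify against base categories and pull regions toward the teacher.
train_probs = classify_regions(region_emb, base_text)
loss = distillation_loss(region_emb, teacher_emb)

# Inference: extend the classifier with novel-category text embeddings.
all_text = np.concatenate([base_text, novel_text], axis=0)
test_probs = classify_regions(region_emb, all_text)
```

Because the classifier is just a matrix of text embeddings, supporting a new category at inference amounts to appending one more row, with no retraining of the detector in this sketch.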

Speaker Details

Yin Cui is a Research Scientist at Google. Yin's research in learning-based computer vision focuses on label efficiency and multimodal learning. Before joining Google, he received a Ph.D. in Computer Science from Cornell University and Cornell Tech in 2019, advised by Professor Serge Belongie. Yin has co-organized the COCO Visual Recognition Workshops and the Fine-Grained Visual Categorization Workshops at major computer vision conferences.

Series: Microsoft Vision+Language Summer Talk Series
Date: July 28, 2021
Speaker: Yin Cui
Affiliation: Google

- Chunyuan Li, Principal Researcher
- Jianwei Yang, Principal Researcher
- Pengchuan Zhang, Senior Researcher
- Zhe Gan, Principal Researcher
Research Areas
- Artificial intelligence
Research Lab
- Microsoft Research Lab - Redmond
Group
- Deep Learning Group