Microsoft at CVPR 2023: Pushing the boundaries of computer vision

Published

By , Distinguished Scientist , Senior Principal Research Manager

Logo for the CVPR 2023 conference showing the Vancouver, British Columbia skyline with the conference dates, June 18–23, 2023. In the background, there is a faded photo of the city of Vancouver on a sunny day.

In the vast realm of artificial intelligence, few fields have captivated our imagination and pushed the boundaries of possibility quite like computer vision. At the core of this domain lies the ambition to build real-world vision-based systems, enabling machines to perceive and respond to visual stimuli with unparalleled precision and sophistication. Through the combination of AI, deep learning, and vast amounts of data, computer vision has made great strides in recent years, catapulting us into an era in which the seemingly impossible becomes achievable.

The Conference on Computer Vision and Pattern Recognition (CVPR) 2023, held June 18 through June 22 in Vancouver, is a widely recognized event that brings together leading experts in the field of computer vision. It serves as a platform for showcasing some of the most compelling and innovative work in this domain.

The contributions presented by Microsoft researchers and their collaborators at this year’s CVPR cover a wide spectrum of research endeavors. From generative models and network pretraining to sign language understanding and neural video codecs, these cutting-edge advancements underscore the evolving capabilities of systems to analyze and extract valuable insights from visual data.

Here are some of the highlights (see below for a list of published papers and their authors): 

Uniting vision, language, and multi-modal encoding

The paper "Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks" lies at the intersection of vision, language, and multimodal pretraining. To learn from these different forms of data, we present a general-purpose foundation model that treats images as a "foreign language." The data from different modalities are encoded with Multiway Transformers, a modular architecture that enables modality-specific encoding and deep fusion. The model is pretrained on images, text, and image-text pairs in a way that generalizes the masked language modeling approach to different modalities. By substantially scaling the model and data, we found that these advances in architecture and pretraining lead to excellent transfer performance across a variety of vision and vision-language tasks, including object detection, semantic segmentation, image classification, visual reasoning, visual question answering, image captioning, and cross-modal image retrieval.
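To make the architecture concrete, below is a minimal sketch of a Multiway Transformer block: a self-attention layer shared across modalities, followed by modality-specific feed-forward "experts" to which vision and text tokens are routed. The class names, routing by a boolean mask, and all sizes are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Sketch of a Multiway Transformer block: shared self-attention plus
# per-modality feed-forward experts. Illustrative only; names and sizes
# are assumptions, not the BEiT-3 codebase.
import torch
import torch.nn as nn


class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality; fusion happens because both
        # modalities share the same attention layer above.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in ("vision", "text")
        })

    def forward(self, tokens: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); is_vision: (batch, seq) bool, True for image tokens.
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x)
        tokens = tokens + attn_out
        x = self.norm2(tokens)
        out = torch.empty_like(x)
        out[is_vision] = self.experts["vision"](x[is_vision])
        out[~is_vision] = self.experts["text"](x[~is_vision])
        return tokens + out


# Toy usage: a sequence whose first six tokens are image patches and the rest are text.
block = MultiwayBlock()
tokens = torch.randn(2, 10, 768)
is_vision = torch.zeros(2, 10, dtype=torch.bool)
is_vision[:, :6] = True
fused = block(tokens, is_vision)  # (2, 10, 768)
```

Because every token passes through the same attention layer, image and text tokens can attend to one another (deep fusion), while the separate experts preserve modality-specific encoding.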

Scaling training data for large vision models

The strength of large language models stems from their ability to leverage unlabeled training data on a massive scale. By using this data, these models acquire a broad understanding of language, enhance their generalization abilities, and improve their performance across a wide range of language-related tasks. Inspired by this achievement, our research focuses on the possibilities of scaling training data for large vision models. In the paper “On Data Scaling in Masked Image Modeling,” we explore the effects of data scaling on large vision models that are pretrained through masked image modeling. Through extensive investigation, we discovered that masked image modeling in large vision models requires large-scale data for effective pretraining. However, unlike large language models, large vision models cannot benefit from more data in a non-overfitting scenario. These findings deepen our understanding of masked image modeling and may pave the way for future advancements in large-scale vision models.
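For readers unfamiliar with the setup the paper studies, here is a minimal sketch of one masked image modeling pretraining step: a random subset of image patches is masked out and the model regresses the raw pixels of the masked patches, with the loss computed only on masked positions. The mask ratio, patch size, and the toy linear encoder are illustrative stand-ins, not the paper's training recipe.

```python
# Minimal masked image modeling (MIM) sketch: mask random patches and
# regress their pixels. Hyperparameters and the linear "backbone" are
# placeholders for illustration only.
import torch
import torch.nn as nn


def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    # (B, 3, H, W) -> (B, N, patch*patch*3), one row per non-overlapping patch.
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)


def mim_loss(encoder: nn.Module, head: nn.Module, images: torch.Tensor,
             mask_ratio: float = 0.6, patch: int = 16) -> torch.Tensor:
    patches = patchify(images, patch)                           # (B, N, D_pix)
    b, n, _ = patches.shape
    mask = torch.rand(b, n, device=images.device) < mask_ratio  # True = masked patch
    tokens = patches.clone()
    tokens[mask] = 0.0                                          # blank out masked patches
    pred = head(encoder(tokens))                                # predict pixels for every patch
    return ((pred - patches) ** 2)[mask].mean()                 # loss only on masked positions


# Toy usage with linear stand-ins for a real vision transformer backbone.
encoder = nn.Linear(16 * 16 * 3, 256)
head = nn.Linear(256, 16 * 16 * 3)
loss = mim_loss(encoder, head, torch.randn(2, 3, 224, 224))
loss.backward()
```

A data-scaling study then varies the amount of pretraining data and the size of the vision model while holding this basic objective fixed.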

Creating 3D avatars with a diffusion network

In the world of image generation, incredible strides have been made in transforming text descriptions into stunning visuals. The rise of DALL-E and diffusion models has brought these cutting-edge tools into the hands of everyday users. In the paper "RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion," we expand on this innovation by bringing the power of diffusion to 3D avatar generation. Doing so requires transferring diffusion from 2D to 3D, a significant challenge because of the prohibitive memory and processing costs of producing high-quality results with rich detail in 3D. We overcome this problem by proposing the roll-out diffusion network (RODIN), which unrolls a 3D neural radiance field into a single 2D feature plane and performs 3D-aware diffusion on it. Supported by other technical contributions, including latent conditioning to promote global coherence and hierarchical synthesis to further enhance details, RODIN significantly accelerates the otherwise tedious 3D modeling process and opens new opportunities for 3D artists.
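The roll-out idea can be illustrated with a short sketch: a radiance field represented as tri-plane features is laid out as a single image-like tensor so that a standard 2D diffusion network can denoise it, while a 3D query samples and aggregates features from the three planes. The tensor shapes, the side-by-side layout, and the summed aggregation below are illustrative assumptions rather than RODIN's exact design.

```python
# Sketch of "rolling out" tri-plane features into one 2D plane for diffusion,
# plus a 3D-point query. Shapes and aggregation are illustrative assumptions.
import torch
import torch.nn.functional as F


def roll_out(triplane: torch.Tensor) -> torch.Tensor:
    # (B, 3, C, H, W) tri-plane -> (B, C, H, 3*W): the three planes side by side,
    # an image-like tensor that a 2D diffusion network can denoise.
    return torch.cat(list(triplane.unbind(dim=1)), dim=-1)


def query_field(triplane: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    # triplane: (B, 3, C, H, W); points: (B, N, 3) with coordinates in [-1, 1].
    # Project each 3D point onto the XY, XZ, and YZ planes and sum the
    # bilinearly sampled features.
    projections = [points[..., [0, 1]], points[..., [0, 2]], points[..., [1, 2]]]
    feats = 0.0
    for i, uv in enumerate(projections):
        grid = uv.unsqueeze(1)                                      # (B, 1, N, 2)
        sampled = F.grid_sample(triplane[:, i], grid, align_corners=True)
        feats = feats + sampled.squeeze(2).transpose(1, 2)          # (B, N, C)
    return feats


# Toy usage: roll out a tri-plane for a 2D denoiser and query it at random 3D points.
triplane = torch.randn(2, 3, 32, 64, 64)
plane_2d = roll_out(triplane)                                       # (2, 32, 64, 192)
features = query_field(triplane, torch.rand(2, 100, 3) * 2 - 1)     # (2, 100, 32)
```

Keeping the diffusion process on a 2D feature plane is what sidesteps the memory and compute cost of denoising a dense 3D volume, while the per-point query keeps the representation 3D-aware.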

Microsoft papers published at CVPR 2023 with their authors:

  1. 3D Human Mesh Estimation from Virtual Markers
    Xiaoxuan Ma, Peking University; Jiajun Su, Peking University; Chunyu Wang, Microsoft Research; Wentao Zhu, Peking University; Yizhou Wang, Peking University and National Engineering Research Center of Visual Technology
  2. 3D Line Mapping Revisited
    Shaohui Liu, ETH Zurich; Yifan Yu, ETH Zurich; Rémi Pautrat, ETH Zurich; Marc Pollefeys, ETH Zurich and Microsoft Research; Viktor Larsson, Lund University
  3. BlendFields: Few-Shot Example-Driven Facial Modeling
    Kacper Kania, Warsaw University of Technology; Stephan J. Garbin, Microsoft Research; Andrea Tagliasacchi, Simon Fraser University and Google Brain; Virginia Estellers, Microsoft Research; Kwang Moo Yi, University of British Columbia; Julien Valentin, Microsoft Research; Tomasz Trzciński, Jagiellonian University; Marek Kowalski, Microsoft Research
  4. CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
    Yiting Cheng, Fudan University; Fangyun Wei, Microsoft Research; Jianmin Bao, Microsoft Research; Dong Chen, Microsoft Research; Wenqiang Zhang, Fudan University
  5. Deep Frequency Filtering for Domain Generalization
    Shiqi Lin, University of Science and Technology of China; Zhizheng Zhang, Microsoft Research; Zhipeng Huang, University of Science and Technology of China; Yan Lu, Microsoft Research; Cuiling Lan, Microsoft Research; Peng Chu, Microsoft; Quanzeng You, Microsoft; Jiang Wang, Microsoft; Zicheng Liu, Microsoft Research; Amey Parulkar, Microsoft; Viraj Navkal, Microsoft; Zhibo Chen, University of Science and Technology of China
  6. DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients
    Rémi Pautrat, ETH Zurich; Daniel Barath, ETH Zurich; Viktor Larsson, Lund University; Martin R. Oswald, University of Amsterdam; Marc Pollefeys, ETH Zurich and Microsoft Research
  7. DETRs with Hybrid Matching
    Ding Jia, Peking University; Yuhui Yuan, Microsoft Research; Haodi He, Stanford University; Xiaopei Wu, Zhejiang University; Haojun Yu, Peking University; Weihong Lin, Microsoft Research; Lei Sun, Microsoft Research; Chao Zhang, Peking University; Han Hu, Microsoft Research
  8. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
    Xinyu Liu, Chinese University of Hong Kong; Houwen Peng, Microsoft Research; Ningxin Zheng, Microsoft Research; Yuqing Yang, Microsoft Research; Han Hu, Microsoft Research; Yixuan Yuan, Chinese University of Hong Kong
  9. Four-View Geometry with Unknown Radial Distortion
    Petr Hruby, Viktor Korotynskiy, Timothy Duff, Luke Oeding, Marc Pollefeys, ETH Zurich and Microsoft Research; Tomas Pajdla, Viktor Larsson, Lund University
  10. High-Fidelity and Freely Controllable Talking Head Video Generation
    Yue Gao, Microsoft Research; Yuan Zhou, Microsoft Research; Jinglu Wang, Microsoft Research; Xiao Li, Microsoft Research; Xiang Ming, Microsoft Research; Yan Lu, Microsoft Research
  11. Human Pose as Compositional Tokens
    Zigang Geng, University of Science and Technology of China and Microsoft Research; Chunyu Wang, Microsoft Research; Yixuan Wei, Tsinghua University and Microsoft Research; Ze Liu, University of Science and Technology of China and Microsoft Research; Houqiang Li, University of Science and Technology of China; Han Hu, Microsoft Research
  12. iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-training for Visual Recognition
    Yixuan Wei, Tsinghua University and Microsoft Research; Yue Cao, Microsoft Research; Zheng Zhang, Microsoft Research; Houwen Peng, Microsoft Research; Zhuliang Yao, Tsinghua University and Microsoft Research; Zhenda Xie, Tsinghua University and Microsoft Research; Han Hu, Microsoft Research; Baining Guo, Microsoft Research
  13. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
    Wenhui Wang, Microsoft; Hangbo Bao, Microsoft; Li Dong, Microsoft Research; Johan Bjorck, Microsoft; Zhiliang Peng, Microsoft; Qiang Liu, Microsoft; Kriti Aggarwal, Microsoft Research; Owais Khan Mohammed, Microsoft; Saksham Singhal, Microsoft Research; Subhojit Som, Microsoft; Furu Wei, Microsoft Research
  14. Iterative Proposal Refinement for Weakly-Supervised Video Grounding
    Meng Cao, Peking University; Fangyun Wei, Microsoft Research; Can Xu, Microsoft Research; Xiubo Geng, Microsoft Research; Long Chen, Hong Kong University of Science and Technology; Can Zhang, Peking University; Yuexian Zou, Peking University; Tao Shen, Microsoft; Daxin Jiang, Microsoft Research
  15. LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction 
    Zhaoyun Jiang, Xi’an Jiaotong University; Jiaqi Guo, Microsoft Research; Shizhao Sun, Microsoft Research; Huayu Deng, Shanghai Jiaotong University; Zhongkai Wu, Beihang University; Vuksan Mijovic, Microsoft; Zijiang James Yang, Xi’an Jiaotong University; Jian-Guang Lou, Microsoft Research; Dongmei Zhang, Microsoft Research
  16. Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
    Shruthi Bannur, Microsoft Research; Stephanie Hyland, Microsoft Research; Qianchu Liu, Fernando Pérez García, Microsoft Research; Maximilian Ilse, Microsoft Research; Daniel C. Castro, Microsoft Research; Benedikt Boecking, Harshita Sharma, Microsoft Research; Kenza Bouzid, Microsoft Research; Anja Thieme, Microsoft Research; Anton Schwaighofer, Microsoft Research; Maria Wetscherek, Matthew P. Lungren, Aditya Nori, Microsoft Research; Javier Alvarez-Valle, Microsoft Research; Ozan Oktay, Microsoft Research
  17. Look Before You Match: Instance Understanding Matters in Video Object Segmentation
    Junke Wang, Shanghai Collaborative Innovation Center on Intelligent Visual Computing; Dongdong Chen, Microsoft Research; Zuxuan Wu, Shanghai Collaborative Innovation Center on Intelligent Visual Computing; Chong Luo, Microsoft Research; Chuanxin Tang, Microsoft Research; Xiyang Dai, Microsoft Research; Yucheng Zhao, Microsoft Research; Yujia Xie, Microsoft Research; Lu Yuan, Microsoft Research; Yu-Gang Jiang, Shanghai Collaborative Innovation Center on Intelligent Visual Computing
  18. MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
    Xiaoyi Dong, University of Science and Technology of China; Jianmin Bao, Microsoft Research; Yinglin Zheng, Xiamen University; Ting Zhang, Microsoft Research; Dongdong Chen, Microsoft Research; Hao Yang, Microsoft Research; Ming Zeng, Xiamen University; Weiming Zhang, University of Science and Technology of China; Lu Yuan, Microsoft Research; Dong Chen, Microsoft Research; Fang Wen, Microsoft Research; Nenghai Yu, University of Science and Technology of China
  19. MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation
    Bowen Zhang, University of Science and Technology of China; Chenyang Qi, Hong Kong University of Science and Technology; Pan Zhang, University of Science and Technology of China; Bo Zhang, Microsoft Research; HsiangTao Wu, Microsoft; Dong Chen, Hong Kong University of Science and Technology; Qifeng Chen, Hong Kong University of Science and Technology; Yong Wang, University of Science and Technology of China; Fang Wen, Microsoft
  20. MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
    Ludan Ruan, Renmin University of China; Yiyang Ma, Peking University; Huan Yang, Microsoft Research; Huiguo He, Microsoft Research; Bei Liu, Microsoft Research; Jianlong Fu, Microsoft Research; Nicholas Jing Yuan, Microsoft Research; Qin Jin, Renmin University of China; Baining Guo, Microsoft Research
  21. Motion Information Propagation for Neural Video Compression
    Linfeng Qi, University of Science and Technology of China; Jiahao Li, Microsoft Research; Bin Li, Microsoft Research; Houqiang Li, University of Science and Technology of China; Yan Lu, Microsoft Research
  22. Natural Language-Assisted Sign Language Recognition
    Ronglai Zuo, Hong Kong University of Science and Technology; Fangyun Wei, Microsoft Research; Brian Mak, Hong Kong University of Science and Technology
  23. Neural Video Compression with Diverse Contexts
    Jiahao Li, Microsoft Research; Bin Li, Microsoft Research; Yan Lu, Microsoft Research
  24. On Data Scaling in Masked Image Modeling
    Zhenda Xie, Tsinghua University and Microsoft Research; Zheng Zhang, Microsoft Research; Yue Cao, Microsoft Research; Yutong Lin, Xi’an Jiaotong University and Microsoft Research; Yixuan Wei, Tsinghua University and Microsoft Research; Qi Dai, Microsoft Research; Han Hu, Microsoft Research
  25. Paint by Example: Exemplar-based Image Editing with Diffusion Models
    Binxin Yang, University of Science and Technology of China; Shuyang Gu, Microsoft Research; Bo Zhang, Microsoft Research; Ting Zhang, Microsoft Research; Xuejin Chen, University of Science and Technology of China; Xiaoyan Sun, University of Science and Technology of China; Dong Chen, Microsoft Research; Fang Wen, Microsoft Research
  26. ReCo: Region-Controlled Text-to-Image Generation
    Zhengyuan Yang, Microsoft Research; Jianfeng Wang, Microsoft; Zhe Gan, Microsoft; Linjie Li, Microsoft Research; Kevin Lin, Microsoft Research; Chenfei Wu, Microsoft Research; Nan Duan, Microsoft; Zicheng Liu, Microsoft Research; Ce Liu, Microsoft; Michael Zeng, Microsoft Research; Lijuan Wang, Microsoft Research
  27. ResFormer: Scaling ViTs with Multi-Resolution Training
    Rui Tian, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Zuxuan Wu, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Qi Dai, Microsoft Research; Han Hu, Microsoft Research; Yu Qiao, Shanghai AI Laboratory; Yu-Gang Jiang, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing
  28. Revealing the Dark Secrets of Masked Image Modeling
    Zhenda Xie, Tsinghua University and Microsoft Research; Zigang Geng, University of Science and Technology of China and Microsoft Research; Jingcheng Hu, Tsinghua University and Microsoft Research; Zheng Zhang, Microsoft Research; Han Hu, Microsoft Research; Yue Cao, Microsoft Research
  29. RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion
    Tengfei Wang, Hong Kong University of Science and Technology; Bo Zhang, Microsoft Research; Ting Zhang, Microsoft Research; Shuyang Gu, Microsoft Research; Jianmin Bao, Microsoft Research; Tadas Baltrusaitis, Microsoft Research; Jingjing Shen, Microsoft Research; Dong Chen, Microsoft Research; Fang Wen, Microsoft Research; Qifeng Chen, Hong Kong University of Science and Technology; Baining Guo, Microsoft Research
  30. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking
    Xin Chen, Dalian University of Technology; Houwen Peng, Microsoft Research; Dong Wang, Dalian University of Technology; Huchuan Lu, Dalian University of Technology and Peng Cheng Laboratory; Han Hu, Microsoft Research
  31. Side Adapter Network for Open-Vocabulary Semantic Segmentation
    Mengde Xu, Huazhong University of Science and Technology and Microsoft Research; Zheng Zhang, Huazhong University of Science and Technology and Microsoft Research; Fangyun Wei, Microsoft Research; Han Hu, Microsoft Research; Xiang Bai, Huazhong University of Science and Technology
  32. Streaming Video Model
    Yucheng Zhao, University of Science and Technology of China; Chong Luo, Microsoft Research; Chuanxin Tang, Microsoft Research; Dongdong Chen, Microsoft Research; Noel Codella, Microsoft Research; Zheng-Jun Zha, University of Science and Technology of China
  33. Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
    Mingfang Zhang, University of Tokyo and Microsoft Research; Jinglu Wang, Microsoft Research; Xiao Li, Microsoft Research; Yifei Huang, University of Tokyo; Yoichi Sato, University of Tokyo; Yan Lu, Microsoft Research
  34. SVFormer: Semi-supervised Video Transformer for Action Recognition
    Zhen Xing, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Qi Dai, Microsoft Research; Han Hu, Microsoft Research; Jingjing Chen, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Zuxuan Wu, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Yu-Gang Jiang, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing
  35. TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
    Sucheng Ren, Microsoft Research; Fangyun Wei, Microsoft Research; Zheng Zhang, Microsoft Research; Han Hu, Microsoft Research
  36. Two-shot Video Object Segmentation
    Kun Yan, Peking University; Xiao Li, Microsoft Research; Fangyun Wei, Microsoft Research; Jinglu Wang, Microsoft Research; Chenbin Zhang, Peking University; Ping Wang, Peking University; Yan Lu, Microsoft Research
  37. Unifying Layout Generation with a Decoupled Diffusion Model
    Mude Hui, Xi’an Jiaotong University; Zhizheng Zhang, Microsoft Research; Xiaoyi Zhang, Microsoft Research; Wenxuan Xie, Microsoft Research; Yuwang Wang, Tsinghua University; Yan Lu, Microsoft Research
  38. VideoTrack: Learning to Track Objects via Video Transformer
    Fei Xie, Shanghai Jiao Tong University; Lei Chu, Microsoft Research; Jiahao Li, Microsoft Research; Yan Lu, Microsoft Research; Chao Ma, Shanghai Jiao Tong University
  39. VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction
    Yufan Ren, EPFL; Fangjinhua Wang, ETH Zurich; Tong Zhang, EPFL; Marc Pollefeys, ETH Zurich and Microsoft Research; Sabine Süsstrunk, EPFL
  40. X-Avatar: Expressive Human Avatars
    Kaiyue Shen, ETH Zurich; Chen Guo, ETH Zurich; Manuel Kaufmann, ETH Zurich; Juan Jose Zarate, ETH Zurich; Julien Valentin, Microsoft Research; Jie Song, ETH Zurich; Otmar Hilliges, ETH Zurich
  41. Unifying Vision, Text, and Layout for Universal Document Processing
    Zineng Tang, University of North Carolina (UNC) Chapel Hill; Ziyi Yang, Microsoft Research; Guoxin Wang, Microsoft Research; Yuwei Fang, Microsoft Research; Yang Liu, Microsoft Research; Chenguang Zhu, Microsoft Research; Michael Zeng, Microsoft Research; Cha Zhang, Microsoft Research; Mohit Bansal, University of North Carolina (UNC) Chapel Hill
