Linjie Li

Researcher

About

Linjie Li is a Researcher in the computer vision science group at Microsoft Cloud & AI.

Before joining Microsoft, Linjie obtained her Master’s degree in computer science from Purdue University in 2018. Her current research interests include Vision-and-Language Pre-training, Self-supervised Learning and Adversarial Training.

Selected Publications

UNITER: UNiversal Image-TExt Representation Learning

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text…

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels…

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

We present HERO, a Hierarchical EncodeR for Omni-representation learning, for large-scale video+language pre-training. HERO encodes multimodal inputs in a hierarchical fashion, where local textual context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global…

Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling

The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks…