Grounded Visual Generation

Multi-modal data provides an exciting opportunity to train grounded generative models that synthesize images consistent with real-world phenomena. In this talk, I will share several of our recent efforts towards creating grounded visual generation models: (1) introducing user attention grounding for text-to-image synthesis, (2) improving text-to-image generation results with stronger language grounding, and (3) taking steps towards creating spatially grounded world models for embodied vision-and-language tasks.

Speaker Details

Jing Yu Koh is a Research Engineer at Google Research, where he works on machine learning for computer vision and natural language processing. He was previously an AI Resident at Google. His research interests include multi-modal learning, vision-and-language models, and generative models. Prior to joining Google, he completed his undergraduate studies at the Singapore University of Technology and Design in 2019.

Series: Microsoft Vision+Language Summer Talk Series
Date: July 21, 2021
Speaker: Jing Yu Koh
Affiliation: Google

- Chunyuan Li, Principal Researcher
- Jianwei Yang, Principal Researcher
- Pengchuan Zhang, Senior Researcher
- Zhe Gan, Principal Researcher

Research areas
- Artificial intelligence
Research lab
- Microsoft Research Lab - Redmond
Group
- Deep Learning Group