Grounded Visual Generation

Multi-modal data provides an exciting opportunity to train grounded generative models that synthesize images consistent with real-world phenomena. In this talk, I will share several of our recent efforts towards creating grounded visual generation models: (1) introducing user attention grounding for text-to-image synthesis, (2) improving text-to-image generation results with stronger language grounding, and (3) taking steps towards creating spatially grounded world models for embodied vision-and-language tasks.

Speaker Details

Jing Yu Koh is a Research Engineer at Google Research, where he works on machine learning for computer vision and natural language processing. He was previously an AI Resident at Google. His research interests include multi-modal learning, vision-and-language models, and generative models. Prior to joining Google, he completed his undergraduate studies at the Singapore University of Technology and Design in 2019.

Date:
Speaker:
Jing Yu Koh
Affiliation:
Google

Series: Microsoft Vision+Language Summer Talk Series