Computer Vision Group

Empowering technologies for real-world vision-based systems

News & features

Microsoft Research Blog

ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos

October 28, 2021 | Yale Song

The natural association between visual observations and their corresponding sounds has exhibited powerful self-supervision signals for learning video representations, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online…

Microsoft Research Blog

Microsoft and NVIDIA introduce parameter-efficient multimodal transformers for video representation learning

May 17, 2021 | Yale Song

Understanding video is one of the most challenging problems in AI, and an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. Recently, transformers have been successful in…