新闻与深度文章
加载中…
| Yale Song
The natural association between visual observations and their corresponding sounds has exhibited powerful self-supervision signals for learning video representations (opens in new tab), which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online…
| Yale Song
Understanding video is one of the most challenging problems in AI, and an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. Recently, transformers have been successful in…