Microsoft Research blog
Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability
| Han Hu and Baining Guo
Early last year, our research team from the Visual Computing Group introduced Swin Transformer, a Transformer-based general-purpose computer vision architecture that for the first time beat convolutional neural networks on the important…
Unlocking new dimensions in image-generation research with Manifold Matching via Metric Learning
| Mengyu Dai and Junwon Park
Generative image models offer a unique value by creating new images. Such images can be sharp super-resolution versions of existing images or even realistic-looking synthetic photographs. Generative Adversarial Networks (GANs) and their variants have demonstrated pioneering success with the framework…
ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos
| Yale Song
The natural association between visual observations and their corresponding sounds has proven to be a powerful source of self-supervision signals for learning video representations, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online…
Announcing the ORBIT dataset: Advancing real-world few-shot learning using teachable object recognition
| Daniela Massiceti, Cecily Morrison, Katja Hofmann, and Ed Cutrell
Object recognition systems have made spectacular advances in recent years, but they rely on training datasets with thousands of high-quality, labelled examples per object category. Learning new objects from only a few examples could open the door to many new…