Human eyes have a dynamic focusing system that adjusts the focal regions in order to see the surroundings at all distances. When we look far away, up close, and back again, our eyes change focus rapidly to allow us to perceive things finely and coarsely. In computer vision (CV), it remains an open question how to build a neural network that can mimic this behavior and flexibly focus on various granularities of visual inputs for different tasks.
In the past few years, Transformers and Vision Transformers have led to unprecedented AI breakthroughs in NLP and vision, respectively. For vision in particular, what makes Transformers stand out is arguably the self-attention (SA) mechanism, which enables each query token to adaptively gather information from others. It learns the dependencies across different visual tokens, which yields better generalization than the canonical convolution layer with static kernels. In the visual world, the input signal is often continuous and comes with an arbitrary granularity and scope. Nevertheless, SA is typically used to model a fixed set of predetermined tokens in a specific scope and granularity, and the interactions among individual tokens are usually dense and heavy, which limits its usability in understanding the complicated visual world.
In this blog post, we introduce our recent efforts on building neural networks with focal modulation, leading to a new architecture family: FocalNets. Highlights include:
- FocalNet achieves a new state of the art (SoTA) on one of the most challenging vision tasks, COCO object detection, with a 3x smaller model size and training data size. This marks a milestone: it is the first attention-free model in the past two years to surpass all Transformer models on the leaderboard.
- FocalNet exhibits an intriguing, interpretable learning behavior. It can discover and segment objects in an image or a video, which Transformers can hardly do. As the following example shows, the modulation focus maps gradually change from the early to the middle to the final stage of perception in an intuitively interpretable way. This suggests FocalNet is capable of different levels of image understanding.
(Left) Comparison with SoTA on COCO object detection. Circle size indicates the model size. (Right) Modulation focus maps at the early, middle, and final stages of visual perception with our FocalNet
We have also released the paper on arXiv, the PyTorch codebase on the project GitHub page, and a HuggingFace demo. Feel free to give it a try.
Eye focusing with Focal Modulation Networks
At the core of Focal Modulation Networks (FocalNets) is the focal modulation mechanism: a lightweight element-wise multiplication serves as the focusing operator, allowing the model to see or interact with the input through the proposed modulator. The modulator is computed with a focal aggregation procedure in two steps: focal contextualization, which extracts contexts from local to global ranges at different levels of granularity, and gated aggregation, which condenses the context features from all granularity levels into the modulator.
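To make the two steps concrete, here is a minimal NumPy sketch of focal modulation on a 1D token sequence. It is an illustration under stated assumptions, not the released PyTorch implementation: the helper names (`gelu`, `depthwise_conv1d`, `init_params`, `focal_modulation`), the random weights, the growing kernel sizes (3, 5, ...), and the use of a global average as the coarsest level are all choices made here for the example.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def depthwise_conv1d(z, kernel):
    """Per-channel 1D convolution with zero padding.

    z: (N, C) tokens, kernel: (k, C) with odd k. Returns (N, C).
    """
    n, k = z.shape[0], kernel.shape[0]
    padded = np.pad(z, ((k // 2, k // 2), (0, 0)))
    out = np.zeros_like(z)
    for offset in range(k):
        out += padded[offset:offset + n] * kernel[offset]
    return out


def init_params(dim, focal_levels, rng):
    # Hypothetical random weights; a trained model would learn these.
    scale = 1.0 / np.sqrt(dim)
    sizes = [2 * level + 3 for level in range(focal_levels)]  # 3, 5, 7, ...
    return {
        "w_q": rng.normal(0.0, scale, (dim, dim)),               # query projection
        "w_z": rng.normal(0.0, scale, (dim, dim)),               # context projection
        "w_g": rng.normal(0.0, scale, (dim, focal_levels + 1)),  # one gate per level
        "w_h": rng.normal(0.0, scale, (dim, dim)),               # modulator projection
        "kernels": [rng.normal(0.0, 1.0 / k, (k, dim)) for k in sizes],
    }


def focal_modulation(x, params):
    """x: (N, C) tokens. Returns modulated tokens of shape (N, C)."""
    focal_levels = len(params["kernels"])
    q = x @ params["w_q"]
    z = x @ params["w_z"]
    gates = x @ params["w_g"]  # (N, focal_levels + 1)

    aggregated = np.zeros_like(z)
    for level in range(focal_levels):
        # Step 1, focal contextualization: each level widens the context range.
        z = gelu(depthwise_conv1d(z, params["kernels"][level]))
        # Step 2, gated aggregation: each token weighs this level with its own gate.
        aggregated += z * gates[:, level:level + 1]

    # Coarsest level: one global context shared by all tokens.
    z_global = gelu(z.mean(axis=0, keepdims=True))
    aggregated += z_global * gates[:, focal_levels:]

    modulator = aggregated @ params["w_h"]
    # Focusing operator: element-wise multiplication of query and modulator.
    return q * modulator


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
params = init_params(dim=8, focal_levels=2, rng=rng)
y = focal_modulation(x, params)
print(y.shape)  # (16, 8)
```

Note that no token-to-token score matrix is ever formed: the context is built once with convolutions and pooling, and each token only multiplies its own query with its own modulator.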
Focal Modulation vs Self-Attention
Similar goals, but different focusing processes. Focal modulation and self-attention are two different ways to enable AI models to selectively focus on certain parts of their input. Self-attention starts with interaction and then performs aggregation, while focal modulation starts with aggregation and then performs interaction, which significantly eases the process by using much more lightweight operations.
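The difference in ordering can be written schematically. The symbols are introduced here for illustration and are not defined in this post: $X$ is the set of input tokens, $x_i$ the query token, $\mathcal{T}$ an interaction step, $\mathcal{M}$ an aggregation step, $q(\cdot)$ a query projection, $m(i, X)$ the modulator for token $i$, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
\text{Self-attention:} \quad & y_i = \mathcal{M}_1\big(\mathcal{T}_1(x_i, X),\, X\big) \\
\text{Focal modulation:} \quad & y_i = \mathcal{T}_2\big(\mathcal{M}_2(i, X),\, x_i\big) = q(x_i) \odot m(i, X)
\end{aligned}
```

In the first line the query must interact with every token before anything is aggregated. In the second, the context is aggregated first, so the interaction reduces to a single element-wise product per token.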
Modulation Map vs Attention Map. Both methods learn to focus, but the selected focus areas are quite different. With standard supervised training of FocalNet and Vision Transformers (ViT) on ImageNet, we visualize the modulation map of FocalNet and the attention map of ViT, respectively. We observe that our focal modulation automatically learns an interpretable representation and separates the main object from the background clutter. It learns to segment objects without any form of dedicated dense pixel-level supervision, and the selected focus areas are consistent with the human-generated annotation in the image classification task. In contrast, the selected focus areas of attention maps in ViT are less meaningful and may highlight some spuriously correlated regions.
When visualizing the modulation maps in the network for videos, we see that they correspond to coherent semantic regions of the moving objects.
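This post does not spell out how a modulation map is rendered. One plausible recipe, assumed here purely for illustration, is to take the per-location magnitude (L2 norm over channels) of the modulator and rescale it to [0, 1] for display. The function name `modulation_map` and the (H, W, C) layout are hypothetical.

```python
import numpy as np


def modulation_map(modulator):
    """Collapse an (H, W, C) modulator into an (H, W) map in [0, 1].

    Assumption: the map shows the channel-wise L2 norm at each location.
    """
    magnitude = np.linalg.norm(modulator, axis=-1)
    lo, hi = magnitude.min(), magnitude.max()
    if hi == lo:
        # Constant input: nothing stands out, so return an all-zero map.
        return np.zeros_like(magnitude)
    return (magnitude - lo) / (hi - lo)


# Toy modulator: one strongly modulated location, the rest zero.
toy = np.zeros((2, 2, 3))
toy[0, 0] = [3.0, 4.0, 0.0]
focus = modulation_map(toy)
print(focus)
```

Bright locations in such a map are those where the modulator is large, that is, where the model is focusing.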
I am excited about our new way of enabling AI to focus on the right parts of the input through focal modulation.
— Johannes Gehrke, Technical Fellow, Lab Director of Research at Redmond, and CTO and Head of Machine Learning for the Intelligent Communications and Conversations Cloud (IC3)
Dense Prediction Tasks with High-Resolution Images
FocalNet is compared against established vision backbone networks, including Vision Transformers (ViT), Swin Transformers, and ConvNeXt, on different CV tasks, including ImageNet classification, zero-shot classification on 20 datasets on ICinW, and dense prediction tasks such as object detection and segmentation. FocalNet consistently outperforms the others. The attention-free design of focal modulation particularly benefits dense visual prediction tasks with high-resolution image input, as it allows the model to see a wider scope at different granularities and avoids the heavy burden of token-to-token interaction. Importantly, it achieves a new SoTA of 64.3 (test-dev) / 64.2 (minival) on COCO object detection, outperforming the prior state-of-the-art Swin-v2 Giant and BEIT-3 models with a 3x smaller model and data size.
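A back-of-the-envelope count shows why dense token-to-token interaction becomes a burden at high resolution. The sketch below is plain arithmetic under illustrative assumptions (a patch size of 16, a hypothetical fixed context of 64 positions per token, channel dimension and constants ignored). It is not a measurement of any real model.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch_size) * (width // patch_size)


def dense_pairwise_ops(n_tokens):
    # Every token interacts with every token: quadratic in the token count.
    return n_tokens * n_tokens


def fixed_context_ops(n_tokens, context_size):
    # Every token interacts with a fixed-size context: linear in the token count.
    return n_tokens * context_size


for side in (224, 448, 896):
    n = num_tokens(side, side, patch_size=16)
    print(side, n, dense_pairwise_ops(n), fixed_context_ops(n, context_size=64))
```

Doubling the image side quadruples the token count, so the dense pairwise count grows 16x while the fixed-context count grows only 4x.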
Glad to continue to push on this state-of-the-art computer vision innovation to delight our worldwide Azure Cognitive Services customers.
— Xuedong Huang, Microsoft Technical Fellow and Chief Technology Officer of Azure AI
From the Broader View of Cognitive Science and Neuroscience
FocalNets mimic human vision. In humans, attention is critical to our ability to focus on specific aspects of environmental stimuli while filtering out irrelevant information. By definition, visual attention plays a key role in isolating the foreground from the background. Not surprisingly, an algorithm that mimics attention is critical for object recognition in computer vision. Visual attention can be roughly classified into two broad categories: feature attention and spatial attention (e.g., Hayden and Gallant, Neuron 2005; Bichot et al., Neuron 2015). Spatial attention directs the movement of the eyes to specific locations and is therefore closely linked to the gaze control system. The existing self-attention (SA) network appears more in line with the spatial attention mechanism of the brain. However, in many cases, we do not know where the object is located or where to focus, but we know it has distinct features. Feature-based attention therefore operates across the visual field and is not closely connected to the eye movement system. Its goal is to construct and maintain an internal representation of the target. Furthermore, in natural human vision, spatial attention and feature attention work together. Importantly, while most studies of visual attention focus on the cortex, it is also well recognized that the pulvinar nucleus of the thalamus interacts with the cortex and plays a critical role in selective attention. Patients with lesions of the pulvinar nucleus have difficulty filtering out distractors during attention tasks (Snow JC, et al., PNAS 2009).
The new algorithm FocalNet appears to better mimic the feature attention system, and hence it performs better in segmenting objects from background. This superb ability of FocalNet could be mimicking the dynamic interactions between pulvinar and cortex.
— Fan Wang, Professor of Brain and Cognitive Sciences, Massachusetts Institute of Technology
Focal modulation shares some structural similarities with interneurons in the nervous system. (1) One example is the spinal cord: painful information is transmitted to the spinal cord, where projection neurons are a minority. Most neurons in the dorsal horn are interneurons that process and integrate information and control whether or not painful information is transmitted to higher centers. (2) In motor control, there is the top-down command and there is the final motor neuron output, but for efficient motor control there are also “modules” formed by premotor interneurons that can generate stereotypical patterns such as rhythms and sequences. It makes sense for interneuron “modules” to specialize in certain processes, so that top-down control only needs to orchestrate these modules. (3) In the somatosensory system (the body's sensory system), while itch and pain are two distinct sensations, the peripheral sensory neurons that detect “itchy” or “painful” stimuli are not so distinct: many of these sensory neurons express “sensors” (receptors) for both itch-inducing and pain-inducing stimuli. The interneurons in the spinal cord play a key role in processing the “ambiguous” incoming information and separating it into subsequent “itch” vs “pain” pathways.
A new building block for the next-generation AI models
With FocalNets, the AI research community can build new computer vision systems for high-resolution visual inputs more efficiently. We hope that our experiments will show the community the potential of FocalNets and encourage further adoption of focal modulation.
Acknowledgment: This research was conducted by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, and Jianfeng Gao. The connections to human vision and neuroscience were drawn by Fan Wang and Jinghao Lu from MIT. Additional thanks go to the Microsoft Research Horizontal AI Team and the Microsoft Alexander Multi-modal team for providing computing resources for large-scale training. We would like to thank the DINO team from IDEA, including Lei Zhang, Hao Zhang, Feng Li, and Shilong Liu, for helpful discussions and detailed instructions on using DINO for object detection. We would like to thank Aishwarya Kamath from NYU for sharing the Object365v2 dataset. We would like to thank Lingchen Meng for helping convert contrastive denoising into regular denoising in DINO.