Language Models are Visual Reasoning Coordinators
- Liangyu Chen,
- Bo Li,
- Sheng Shen,
- Jingkang Yang,
- Chunyuan Li,
- Kurt Keutzer,
- Trevor Darrell,
- Ziwei Liu
NeurIPS 2023
Visual reasoning demands multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods such as ensembling still struggle to combine these models with the desired higher-order communication. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a language model (LM) can serve as an efficient coordinator to leverage the distinct and complementary capabilities of multiple VLMs. Extensive experiments demonstrate that our finetuning variant, Cola-FT, achieves state-of-the-art performance on outside-knowledge VQA, visual entailment, and visual spatial reasoning tasks. Through systematic ablation studies and visualizations, we validate that a coordinator LM comprehends the instruction prompts and the separate functionalities of the VLMs, and then coordinates them to enable impressive visual reasoning capabilities.
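To make the coordination idea concrete, below is a minimal sketch of the pattern the abstract describes: each VLM contributes a description and a tentative answer, and an LM reads those outputs together with the question and an instruction prompt to decide on the final answer. This is an illustration under assumptions, not the authors' implementation; the function names (`query_vlm_a`, `build_coordinator_prompt`, etc.) and the toy stand-in models are hypothetical.

```python
# Sketch of an LM coordinating multiple VLMs (illustrative only; not the Cola codebase).

def query_vlm_a(image, question):
    # Hypothetical stand-in for one VLM (e.g., a captioning/VQA model).
    return {"caption": "a man riding a horse on a beach", "answer": "horse"}

def query_vlm_b(image, question):
    # A second, complementary VLM; in practice its outputs often disagree with VLM-A.
    return {"caption": "a person with an animal near the ocean", "answer": "donkey"}

def build_coordinator_prompt(question, outputs):
    # The coordinator LM sees each VLM's description and plausible answer,
    # plus an instruction to reason over them and commit to one final answer.
    lines = [f"Question: {question}"]
    for name, out in outputs.items():
        lines.append(f"{name} says the image shows: {out['caption']}")
        lines.append(f"{name}'s answer: {out['answer']}")
    lines.append("Considering both models, the most plausible answer is:")
    return "\n".join(lines)

def coordinate(image, question, lm_generate):
    # Gather outputs from each VLM, then let the LM arbitrate.
    outputs = {
        "VLM-A": query_vlm_a(image, question),
        "VLM-B": query_vlm_b(image, question),
    }
    prompt = build_coordinator_prompt(question, outputs)
    return lm_generate(prompt)

if __name__ == "__main__":
    # A toy "LM" that returns a fixed answer so the sketch runs without model weights;
    # in the paper's setting this would be an instruction-tuned (and optionally finetuned) LM.
    toy_lm = lambda prompt: "horse"
    print(coordinate(image=None, question="What animal is in the picture?",
                     lm_generate=toy_lm))
```

In this pattern, the instruction prompt is the channel through which the LM learns each VLM's role, which is consistent with the ablations described above; the finetuned variant (Cola-FT) would additionally train the coordinator LM on such prompts.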