Don’t Just String Tokens, Stack Them! Improving Multimodal Transformers with Layer Stack
Large multimodal models (LMMs) have shown tremendous improvements over the past year in multimodal understanding and reasoning. Currently, most (if not all) existing works connect vision and language by feeding a large language model (LLM) a string of visual tokens extracted from pretrained vision encoders (\textit{e.g.}, CLIP). However, this strategy incurs considerable compute and memory overhead on the original LLM due to the extra visual tokens, which is particularly significant for high-resolution images and videos. Despite efforts to mitigate this with sophisticated token compression, such methods usually struggle to reach a good trade-off between efficacy and efficiency. In this work, we propose a new strategy for connecting vision and language transformers in LMMs. Instead of stringing visual tokens into a single sequence, we stack them into multiple layers and feed the subset at each layer into the corresponding transformer layer of the LLM, as shown in Fig.~\ref{fig:teaser}. The result is \ourmodel, a new architecture for connecting vision and language in LMMs. This simple strategy significantly unleashes the power of LLMs for modeling the dependencies across a large number of visual tokens while leaving the compute only marginally changed. Concretely, using the same recipe and the same context length as LLaVA-1.5, our \ourmodel achieves significant gains across a wide range of vision-language benchmarks. In particular, our model brings gains of 4.2, 11.0, and 4.0 points on TextVQA, DocVQA, and InfoVQA, respectively, compared to LLaVA-1.5-7B.
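To make the core idea concrete, the following PyTorch sketch illustrates one possible reading of the layer-stack strategy: the visual tokens are split into per-layer chunks, and each chunk is injected alongside the hidden states of the corresponding transformer layer and then discarded, so the per-layer context length stays fixed. This is a minimal illustration under our own assumptions (chunk size, projection, and injection point are hypothetical), not the paper's exact implementation.

\begin{verbatim}
# Minimal sketch of the "layer stack" idea (illustrative only; the split
# sizes, projection, and injection point are assumptions, not the paper's
# exact implementation).
import torch
import torch.nn as nn


class ToyLayerStackLMM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8,
                 tokens_per_layer=64):
        super().__init__()
        self.tokens_per_layer = tokens_per_layer
        # One transformer block per layer of the stacked visual tokens.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # Hypothetical projection from vision-encoder features to LLM width.
        self.vision_proj = nn.Linear(d_model, d_model)

    def forward(self, text_embeds, visual_tokens):
        # visual_tokens: (B, n_layers * tokens_per_layer, d), e.g. from CLIP.
        # Stack them into per-layer chunks instead of one long sequence.
        chunks = self.vision_proj(visual_tokens).split(
            self.tokens_per_layer, dim=1)
        h = text_embeds  # (B, T, d); per-layer context is only T + chunk size.
        for block, chunk in zip(self.blocks, chunks):
            # Feed this layer's subset of visual tokens with the hidden
            # states, then drop them so the sequence length does not grow.
            h = block(torch.cat([chunk, h], dim=1))[:, chunk.size(1):]
        return h


# Usage: 4 * 64 = 256 visual tokens, processed 64 at a time per layer.
model = ToyLayerStackLMM()
out = model(torch.randn(2, 32, 512), torch.randn(2, 256, 512))
print(out.shape)  # torch.Size([2, 32, 512])
\end{verbatim}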