A Study on Context Length and Efficient Transformers for Biomedical Image Analysis

  • Sarah Hooper,
  • Hui Xue

Machine Learning for Health (ML4H) 2024

Biomedical images are often high-resolution and multi-dimensional, presenting computational challenges for deep neural networks. These challenges are compounded when training transformers due to the self-attention operator, which scales quadratically with context length. Recent works have proposed alternatives to self-attention that scale more favorably with context length, alleviating these computational difficulties and potentially enabling more efficient application of transformers to large biomedical images. However, a systematic evaluation on this topic is lacking. In this study, we investigate the impact of context length on biomedical image analysis and evaluate the performance of recently proposed substitutes for self-attention. We first curate a suite of biomedical imaging datasets, including 2D and 3D data for segmentation, denoising, and classification tasks. We then analyze the impact of context length on network performance using the Vision Transformer and Swin Transformer. Our findings reveal a strong relationship between context length and performance, particularly for pixel-level prediction tasks. Finally, we show that recent attention-free models achieve significant improvements in efficiency while maintaining performance comparable to self-attention-based models, though we highlight where gaps remain.
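To make the quadratic scaling concrete, here is a minimal, illustrative single-head self-attention in NumPy (an assumption for exposition, not the authors' implementation; learned projections and batching are omitted). The `(n, n)` score matrix is what grows quadratically with context length `n`, so doubling the number of tokens quadruples the attention matrix:

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention over a (n, d) token sequence.

    No learned Q/K/V projections; this sketch only exposes the
    (n, n) attention matrix whose size is quadratic in n.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (n, n): quadratic in context length
    # Numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x  # (n, d): same shape as the input sequence

rng = np.random.default_rng(0)
for n in (1024, 2048):  # doubling n quadruples the n*n score matrix
    x = rng.standard_normal((n, 64))
    out = self_attention(x)
    print(f"tokens={n:5d}  output={out.shape}  attention entries={n * n:,}")
```

For a 3D biomedical volume tokenized into tens of thousands of patches, this `n × n` matrix dominates memory and compute, which is the motivation for the sub-quadratic alternatives evaluated in the paper.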