Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
- Shruthi Bannur,
- Stephanie Hyland,
- Flora Liu,
- Fernando Pérez-García,
- Maximilian Ilse,
- Daniel Coelho de Castro,
- Benedikt Boecking,
- Harshita Sharma,
- Kenza Bouzid,
- Anja Thieme,
- Anton Schwaighofer,
- Maria Teodora Wetscherek,
- Matthew Lungren,
- Aditya Nori,
- Javier Alvarez-Valle,
- Ozan Oktay
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities. Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs, even though clinical notes commonly refer to prior images. This not only leads to poor alignment between the modalities but also misses an opportunity to exploit rich self-supervision through the existing temporal content in the data. In this work, we explicitly account for prior images and reports, when available, during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model. It is designed to handle challenges that arise in practice, such as pose variations and missing input images across time. The resulting model excels on downstream tasks in both single- and multi-image setups, achieving state-of-the-art performance on (I) progression classification, (II) phrase grounding, and (III) report generation, whilst offering consistent improvements on disease classification and sentence-similarity tasks. We release a novel multi-modal temporal benchmark dataset, MS-CXR-T, to quantify the quality of vision-language representations in terms of temporal semantics. Our experimental results show the advantages of incorporating prior images and reports to make the most of the data.
Background & motivation
Promises and Challenges of VLP in Radiology: Self-supervised learning in vision-language processing (VLP) has shown great promise in various domains, including biomedicine and radiology. The success of VLP relies on paired samples sharing semantics, meaning that the text should describe the image with minimal extraneous detail. However, in biomedicine and radiology, reports often include comparisons to prior imaging studies, which introduces ambiguity and poor alignment between the text and image modalities.
Limitations of Previous VLP Approaches in Biomedical Domains: Previous VLP work in biomedical domains has mostly focused on single image-report pairs, ignoring or removing temporal information from reports during training.
Harnessing Temporal Information for Enhanced Self-supervision: Temporal information can provide a cleaner and complementary self-supervision signal by exploiting the existing structure in the data, without requiring any additional data.
Approach
In our recent work, we have introduced BioViL-T, a novel approach that:
- Explicitly accounts for prior images and reports during both training and fine-tuning, instead of treating all image-report pairs as independent.
- Utilizes a CNN-Transformer hybrid multi-image encoder trained jointly with a text model, designed to handle challenges such as pose variations and missing input images across time.
- Achieves state-of-the-art performance on a variety of tasks, both in single- and multi-image setups, including progression classification, phrase grounding, and report generation.
- Is accompanied by a new multi-modal temporal benchmark dataset, MS-CXR-T, to quantify the quality of vision-language representations in terms of temporal semantics.
By incorporating prior images and reports, we can make the most of the available data, resulting in richer representations and improved downstream performance across a wider range of tasks.
Multi-image encoder: The hybrid image encoder plays a pivotal role in extracting spatiotemporal features from medical images. The encoder is a combination of a CNN and a transformer model, which not only enhances data efficiency but also enables the fusion of temporal content without the need for image registration. The CNN, acting as a stem network, is responsible for providing visual token features of individual images, while the transformer captures patch embedding interactions across time when prior images are available (see Figure 1).
In the context of biomedical VLP applications, the hybrid image encoder offers several advantages. Its data efficiency makes it well suited to smaller-scale datasets. The encoder effectively captures both static and temporal features in images, which is crucial for tasks that require dense visual reasoning across time, such as report decoding. When a prior image is available, the encoder generates separate features for the current image and for the temporal progression information. This decomposition of static and temporal features allows the model to better handle tasks that require static-only or temporal-only features, as sketched below.
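To make the design concrete, here is a minimal PyTorch sketch of how such a hybrid encoder could be structured. The module names, layer sizes, and the choice of a ResNet-50 stem are illustrative assumptions for exposition, not the released implementation:

```python
from typing import Optional, Tuple

import torch
import torch.nn as nn
from torchvision.models import resnet50


class HybridMultiImageEncoder(nn.Module):
    """Illustrative CNN-Transformer hybrid encoder for one or two chest X-rays."""

    def __init__(self, feature_dim: int = 2048, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        # CNN stem: produces a grid of visual token (patch) features per image.
        cnn = resnet50(weights=None)
        self.stem = nn.Sequential(*list(cnn.children())[:-2])  # drop pooling and FC head
        # Transformer: models interactions between current and prior patch tokens
        # across time, avoiding the need for explicit image registration.
        layer = nn.TransformerEncoderLayer(d_model=feature_dim, nhead=num_heads, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.time_embedding = nn.Embedding(2, feature_dim)  # 0 = current image, 1 = prior image

    def forward(
        self, current: torch.Tensor, prior: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        # Static patch features of the current image from the CNN stem: (B, N, D).
        static = self.stem(current).flatten(2).transpose(1, 2)
        if prior is None:
            # Missing prior study: only static features are available.
            return static, None
        prior_tokens = self.stem(prior).flatten(2).transpose(1, 2)
        tokens = torch.cat(
            [static + self.time_embedding.weight[0], prior_tokens + self.time_embedding.weight[1]],
            dim=1,
        )
        fused = self.temporal_transformer(tokens)
        # Temporal (progression) features aligned with the current image's patches.
        temporal = fused[:, : static.shape[1]]
        return static, temporal
```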
BioViL-T: Vision-Language pre-training with temporal structure: The pre-training process of the BioViL-T model involves a multi-image encoder that extracts spatiotemporal features from image sequences and a text encoder that leverages optional cross-attention on image features. The models are trained jointly with cross-modal global and local contrastive objectives (InfoNCE). Additionally, the model uses multi-modal fused representations obtained with cross-attention for image-guided masked language modelling. By relying on these training objectives, the model can leverage both visual and textual information to disambiguate and improve language understanding, which is crucial for various downstream tasks. The overall BioViL-T pipeline is shown below:
Figure 1: The proposed self-supervised training framework, BioViL-T, which leverages pairs of radiology reports and sequences of medical images. The training scheme does not require manual expert labels, and it can scale to a large amount of radiology data to pre-train image and text models required for downstream clinical applications.
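As a rough illustration of the global contrastive term mentioned above, the sketch below implements a symmetric InfoNCE loss over projected image and report embeddings. The local contrastive and image-guided masked language modelling terms are omitted, and the function is a simplified stand-in rather than the exact training code:

```python
import torch
import torch.nn.functional as F


def global_infonce_loss(image_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between projected image and report embeddings.

    image_emb, text_emb: (batch, dim) projections of the (possibly multi-image)
    study representation and of the corresponding report representation.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(image_emb.shape[0], device=image_emb.device)
    # Matched image-report pairs lie on the diagonal; all other pairs act as negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```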
The resulting image and text models can be used for a variety of downstream applications, such as phrase grounding, zero-shot inference, and report generation. Further, the multi-image encoder can readily be adapted to tasks involving multiple images, such as temporal change classification (shown below), or to reduce hallucination of temporal change during report generation. Lastly, the model can be conditioned on prior reports, enabling it to generate reports that are contextually grounded in both the images and the preceding report.
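For instance, zero-shot temporal change classification can be phrased as a nearest-prompt search in the joint embedding space. The helper and prompts below are purely illustrative; they assume image and text embeddings have already been projected into the shared space by the pre-trained encoders:

```python
from typing import List

import torch
import torch.nn.functional as F


def zero_shot_classify(image_emb: torch.Tensor,
                       prompt_embs: torch.Tensor,
                       class_names: List[str]) -> str:
    """Return the class whose text prompt is closest to the image embedding.

    image_emb: (dim,) projected embedding of the current and prior images.
    prompt_embs: (num_classes, dim) projected embeddings of the class prompts.
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), prompt_embs, dim=-1)
    return class_names[int(sims.argmax())]


# Hypothetical prompts for classifying the progression of consolidation;
# each prompt would be embedded with the pre-trained text encoder.
progression_prompts = [
    "consolidation is improving",
    "consolidation is stable",
    "consolidation is worsening",
]
```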
Summary: In summary, the BioViL-T model offers several key benefits, including increased data efficiency, the ability to fuse temporal content without requiring image registration, and adaptability for various downstream tasks. Its unique pre-training process allows it to capture both static and temporal features, making it well-suited for a wide range of biomedical applications.
Figure 2: Example of current (left) and prior (right) chest X-ray scans. The attention maps computed within the vision transformer show (in purple) how the model interprets disease progression by focusing on these image regions. In this particular example, the airspace disease seen in the left lung lobe has improved since the prior acquisition.
MS-CXR-T Benchmark Dataset
MS-CXR-T is a multi-modal benchmark dataset designed for evaluating biomedical Vision-Language Processing (VLP) models on two distinct temporal tasks in radiology: image classification and sentence similarity.
- Temporal image classification of pairs of chest X-rays. 1326 pairs of images showing worsening, no change, or improvement of one of five pathologies.
- Temporal sentence similarity on text derived from chest X-ray reports. 361 pairs of sentences which are either paraphrases or contradictory.
The dataset, which is based on the MIMIC-CXR v2 dataset, was manually annotated and reviewed by a board-certified radiologist. MS-CXR-T aims to address the lack of publicly available multi-modal benchmark datasets and support the evaluation of both image and text models on temporal tasks in biomedical research. To access the MS-CXR-T dataset, please visit https://aka.ms/ms-cxr-t.
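As a hedged sketch of how the sentence-similarity side of the benchmark might be probed with the released text encoder: the HuggingFace identifier below comes from the Resources section, while the get_projected_text_embeddings helper follows the pattern of the related CXR-BERT release and should be checked against the model card; the example sentences are invented for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model identifier from the HuggingFace release linked under Resources
# (https://aka.ms/biovil-t-model); the helper method name below is an
# assumption based on the related CXR-BERT model card.
MODEL_NAME = "microsoft/BiomedVLP-BioViL-T"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).eval()

sentences = [
    "The pleural effusion has improved since the prior study.",
    "Interval decrease in the left pleural effusion.",   # paraphrase
    "The pleural effusion has increased in size.",        # contradiction
]

with torch.no_grad():
    tokens = tokenizer(sentences, padding=True, return_tensors="pt")
    embeddings = model.get_projected_text_embeddings(
        input_ids=tokens.input_ids, attention_mask=tokens.attention_mask
    )
    # Cosine similarity of the first sentence against the other two:
    # the paraphrase should score higher than the contradiction.
    similarities = torch.nn.functional.cosine_similarity(
        embeddings[0:1], embeddings[1:], dim=-1
    )

print(similarities)
```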
Resources
Pre-trained image and text models (HuggingFace): https://aka.ms/biovil-t-model
MS-CXR-T benchmark dataset: https://aka.ms/ms-cxr-t
HI-ML-Multimodal Toolbox / Source code: https://aka.ms/biovil-t-code
Notebook example: https://aka.ms/biovil-t-demo-notebook
Getting started
The best way to get started is by running the phrase grounding notebook. All the dependencies will be installed upon execution, so Python 3.9 and Jupyter are the only requirements to get started.
The notebook can also be run on Binder, without the need to download any code or install any libraries.
Acknowledgements
We would like to thank Hoifung Poon, Melanie Bernhardt, Melissa Bristow and Naoto Usuyama for their valuable technical feedback, and Hannah Richardson for assisting with compliance reviews.
Publication Downloads
Temporal Vision-Language Processing (BioViL-T)
March 24, 2023
BioViL-T is a Vision-Language model trained at scale on sequences of biomedical image and text data. It does not require manual annotations and can leverage historical raw clinical image acquisitions and clinical notes. The temporal extension enables the image and text encoders to become sensitive to disease-progression information present in existing datasets. See Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing - Microsoft Research for further information.