BiomedParse: A foundation model for smarter, all-in-one biomedical image analysis


In cancer diagnosis or advanced treatments like immunotherapy, every detail in a medical image counts. Radiologists and pathologists rely on these images to track tumors, understand their boundaries, and analyze how they interact with surrounding cells. This work demands pinpoint accuracy across several tasks—identifying whether a tumor is present, locating it precisely, and mapping its contours on complex CT scans or pathology slides. 

Yet these crucial steps—object recognition, detection, and segmentation—are often tackled separately, which can limit the depth of analysis. Current tools like MedSAM and SAM focus on segmentation only, missing the opportunity to blend these insights holistically and relegating object recognition to an afterthought.

In this blog, we introduce BiomedParse, a new approach to holistic image analysis that treats the object as a first-class citizen. By unifying object recognition, detection, and segmentation into a single framework, BiomedParse allows users to specify what they’re looking for through a simple, natural-language prompt. The result is a more cohesive, intelligent way of analyzing medical images that supports faster, more integrated clinical insights.

While biomedical segmentation datasets abound, there are relatively few prior works on object detection and recognition in biomedicine, let alone datasets covering all three tasks. To pretrain BiomedParse, we created the first such dataset by harnessing OpenAI’s GPT-4 for data synthesis from standard segmentation datasets.

BiomedParse is a single foundation model that can accurately segment biomedical objects across nine modalities, as seen in Figure 1, outperforming prior best methods while requiring orders of magnitude fewer user operations, as it doesn’t require an object-specific bounding box. By learning semantic representations for individual object types, BiomedParse is particularly strong in the most challenging cases with irregularly shaped objects. Through joint pretraining of object recognition, detection, and segmentation, BiomedParse opens new possibilities for holistic image analysis and image-based discovery in biomedicine.

a, The GPT-4-constructed ontology showing a hierarchy of object types used to unify semantic concepts across datasets, with bar plots showing the number of images containing each object type. b, Bar plot showing the number of image–mask–description triples for each modality in BiomedParseData. CT, computed tomography; MRI, magnetic resonance imaging; OCT, optical coherence tomography. c, Flowchart of BiomedParse. BiomedParse takes an image and a text prompt as input and outputs the segmentation masks for the objects specified in the prompt. Image-specific manual interaction such as bounding boxes or clicks is not required in our framework. To facilitate semantic learning for the image encoder, BiomedParse also incorporates a learning objective to classify the meta-object type. For online inference, GPT-4 is used to resolve the text prompt into object types using the object ontology, which also uses the meta-object type output from BiomedParse to narrow down candidate semantic labels. d, Uniform Manifold Approximation and Projection (UMAP) plots contrasting the text embeddings for different cell types derived from the BiomedParse text encoder (left) and PubMedBERT (right). e, UMAP plots contrasting the image embeddings for different cell types derived from the BiomedParse image encoder (left) and Focal (right).
Figure 1. Overview of BiomedParse and BiomedParseData.

Image parsing: a unifying framework for holistic image analysis 

Back in 2005, researchers first introduced the concept of “image parsing”—a unified approach to image analysis that jointly conducts object recognition, detection, and segmentation. Built on Bayesian networks, this early model offered a glimpse into a future of joint learning and reasoning in image analysis, though it was limited in scope and application. Fast forward to today, cutting-edge advances in generative AI have breathed new life into this vision. With our model, BiomedParse, we have created a foundation for biomedical image parsing that leverages interdependencies across the three subtasks, thus addressing key limitations in traditional methods. BiomedParse enables users to simply input a natural-language description of an object, which the model uses to predict both the object label and its segmentation mask, thus eliminating the need for a bounding box (Figure 1c). In other words, this joint learning approach lets users segment objects based on text alone.
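
To make the idea concrete, here is a minimal sketch of what a text-prompted forward pass could look like: an image encoder, a text encoder for the prompt, a mask decoder, and the auxiliary meta-object classification head mentioned in Figure 1c. The module choices, dimensions, and fusion step are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a text-prompted segmenter (illustrative, not BiomedParse's code).
import torch
import torch.nn as nn

class TextPromptedSegmenter(nn.Module):
    def __init__(self, d_model=256, num_meta_types=16):
        super().__init__()
        # Toy stand-ins for pretrained image/text encoders.
        self.image_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_encoder = nn.EmbeddingBag(30000, d_model)   # mean-pools prompt token embeddings
        self.mask_decoder = nn.Conv2d(d_model, 1, kernel_size=1)
        self.meta_head = nn.Linear(d_model, num_meta_types)   # auxiliary meta-object classifier

    def forward(self, image, prompt_token_ids):
        feats = self.image_encoder(image)                     # (B, d, H/16, W/16)
        text = self.text_encoder(prompt_token_ids)            # (B, d)
        # Condition image features on the prompt embedding (simple fusion here;
        # the actual model uses a richer text-image decoder).
        fused = feats * text[:, :, None, None]
        mask_logits = self.mask_decoder(fused)                # per-pixel mask logits
        meta_logits = self.meta_head(text)                    # semantic learning objective
        return mask_logits, meta_logits

model = TextPromptedSegmenter()
img = torch.randn(1, 3, 224, 224)
tokens = torch.randint(0, 30000, (1, 8))                      # toy prompt token ids
mask_logits, meta_logits = model(img, tokens)
print(mask_logits.shape, meta_logits.shape)                   # (1, 1, 14, 14) and (1, 16)
```

No bounding box or click enters the forward pass: the prompt embedding alone determines which object the mask decoder produces.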


Harnessing GPT-4 for large-scale data synthesis from existing datasets 

We created the first dataset for biomedical image parsing by harnessing GPT-4 for large-scale data synthesis from 45 existing biomedical segmentation datasets (Figure 1a and 1b). The key insight is to leverage readily available natural-language descriptions already in these datasets and use GPT-4 to align this often messy, unstructured text with established biomedical object taxonomies.

Specifically, we use GPT-4 to help create a unifying biomedical object taxonomy for image analysis and harmonize natural language descriptions from existing datasets with this taxonomy. We further leverage GPT-4 to synthesize additional variations of object descriptions to facilitate more robust text prompting.  
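
As a rough illustration of this step, the sketch below asks GPT-4 to map one messy source label onto a small ontology snippet and paraphrase it into prompt-ready descriptions. The prompts, ontology snippet, JSON schema, and model name are assumptions for illustration; they are not the exact pipeline used to build BiomedParseData.

```python
# Hedged sketch: harmonizing a raw dataset label with a unified ontology via GPT-4.
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

ontology_snippet = ["tumor", "liver", "kidney", "glandular structure", "neoplastic cells"]
raw_label = "lesion_liv_seg"  # messy label as found in a source segmentation dataset

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You map raw biomedical segmentation labels to a fixed ontology "
                    "and write natural-language descriptions. Answer in JSON with "
                    "keys 'object_type' and 'descriptions'."},
        {"role": "user",
         "content": f"Ontology: {ontology_snippet}\n"
                    f"Raw label: {raw_label}\n"
                    "Modality: CT, site: abdomen.\n"
                    "Pick the closest object type and write 3 paraphrased "
                    "descriptions usable as text prompts."},
    ],
)

# Assumes the model returns valid JSON, e.g.
# {"object_type": "tumor", "descriptions": ["liver tumor in abdominal CT", ...]}
record = json.loads(response.choices[0].message.content)
```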

This enables us to construct BiomedParseData, a biomedical image analysis dataset comprising over 6 million image, segmentation mask, and text description triples drawn from more than 1 million images. The dataset covers 64 major biomedical object types and 82 fine-grained subtypes across nine imaging modalities.
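
For concreteness, one way to picture a single training example in such a dataset is as a small record like the following; the field names and paths are illustrative, not the released schema.

```python
# Illustrative layout of one image-mask-description triple in the spirit of BiomedParseData.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParseTriple:
    image_path: str                # e.g., "CT/abdomen/case_0042.png" (hypothetical path)
    mask_path: str                 # binary segmentation mask aligned with the image
    description: str               # harmonized text prompt, e.g., "liver tumor in abdominal CT"
    modality: str                  # one of the nine modalities (CT, MRI, OCT, pathology, ...)
    object_type: str               # one of the 64 major object types in the ontology
    subtype: Optional[str] = None  # optional fine-grained subtype (82 in total)

example = ParseTriple(
    image_path="CT/abdomen/case_0042.png",
    mask_path="CT/abdomen/case_0042_mask.png",
    description="liver tumor in abdominal CT",
    modality="CT",
    object_type="tumor",
)
```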

a, Box plot comparing the Dice score between our method and competing methods on 102,855 test instances (image–mask–label triples) across nine modalities. MedSAM and SAM require a bounding box as input. We consider two settings: oracle bounding box (the minimum bounding box covering the gold mask) and bounding boxes generated from the text prompt by Grounding DINO, a state-of-the-art text-based grounding model. Each modality category contains multiple object types. Each object type was aggregated as the instance median to be shown in the plot. n in the plot denotes the number of test instances in the corresponding modality. b, Nine examples comparing the segmentation results by BiomedParse and the ground truth, using just the text prompt at the top. c, Box plot comparing the Dice score between our method and competing methods on a cell segmentation test set with n=42 images. BiomedParse requires only a single user operation (the text prompt ‘Glandular structure in colon pathology’). By contrast, to get competitive results, MedSAM and SAM require 430 operations (one bounding box per individual cell). d, Five examples contrasting the segmentation results by BiomedParse and MedSAM, along with text prompts used by BiomedParse and bounding boxes used by MedSAM. e, Comparison between BiomedParse and MedSAM on a benign tumor image (top) and a malignant tumor image (bottom). The improvement of BiomedParse over MedSAM is even more pronounced on abnormal cells with irregular shapes. f, Box plot comparing the two-sided K–S test P values between valid and invalid text prompts. BiomedParse learns to reject invalid text prompts describing object types not present in the image (small P value). We evaluated a total of 4,887 invalid prompts and 22,355 valid prompts. g, Plot showing the precision and recall of our method on detecting invalid text prompts across different K–S test P value cutoffs. h,i, Scatter plots comparing the area under the receiver operating characteristic curve (AUROC) (h) and F1 (i) between BiomedParse and Grounding DINO on detecting invalid descriptions.
Figure 2: Comparison on large-scale biomedical image segmentation datasets.

State-of-the-art performance across 64 major object types in nine modalities

We evaluated BiomedParse on a large held-out test set with 102,855 image–mask–label triples across 64 major object types in nine modalities. BiomedParse outperformed prior best methods such as MedSAM and SAM, even when they were given oracle per-object bounding boxes. In the more realistic setting where MedSAM and SAM used a state-of-the-art object detector (Grounding DINO) to propose bounding boxes, BiomedParse outperformed them by a wide margin, between 75 and 85 absolute points in Dice score (Figure 2a). BiomedParse also outperformed a variety of other prominent methods such as SegVol, Swin UNETR, nnU-Net, DeepLab V3+, and UniverSeg.
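
For reference, the Dice score reported throughout these comparisons measures the overlap between a predicted mask and the gold mask, 2|A∩B| / (|A| + |B|). A minimal implementation is below; the paper's evaluation code may handle edge cases differently.

```python
# Minimal reference implementation of the Dice score between two binary masks.
import numpy as np

def dice_score(pred: np.ndarray, gold: np.ndarray, eps: float = 1e-8) -> float:
    pred = pred.astype(bool)
    gold = gold.astype(bool)
    intersection = np.logical_and(pred, gold).sum()
    return float(2.0 * intersection / (pred.sum() + gold.sum() + eps))

# Example: a prediction covering half of a 10x10 gold square scores 2*50/(50+100) ≈ 0.667.
gold = np.zeros((64, 64), dtype=bool); gold[10:20, 10:20] = True
pred = np.zeros((64, 64), dtype=bool); pred[10:20, 10:15] = True
print(round(dice_score(pred, gold), 3))  # 0.667
```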

a, Attention maps of text prompts for irregular-shaped objects, suggesting that BiomedParse learns a rather faithful representation of their typical shapes. US, ultrasound. b–d, Scatter plots comparing the improvement in Dice score for BiomedParse over MedSAM with shape regularity in terms of convex ratio (b), box ratio (c) and inverse rotational inertia (d). A smaller number on the x axis means higher irregularity on average. Each dot represents an object type. e, Six examples contrasting BiomedParse and MedSAM on detecting irregular-shaped objects. Plots are ordered from the least irregular (left) to the most irregular (right). f,g, Comparison between BiomedParseData and the benchmark dataset used by MedSAM in terms of convex ratio (f) and box ratio (g). BiomedParseData is a more faithful representation of real-world challenges in terms of irregular-shaped objects. h, Box plots comparing BiomedParse and competing approaches on BiomedParseData and the benchmark dataset used by MedSAM. BiomedParse shows a larger improvement on BiomedParseData, which contains more diverse images and more irregular-shaped objects. The numbers of object types are as follows: n=50 for the MedSAM benchmark and n=112 for BiomedParseData.
Figure 3. Evaluation on detecting irregular-shaped objects.

Recognizing and segmenting irregular and complex objects

Biomedical objects often have complex and irregular shapes, which present significant challenges for segmentation, even with an oracle bounding box. Through joint learning with object recognition and detection, BiomedParse learns to model object-specific shapes, and its superiority is particularly pronounced in the most challenging cases (Figure 3). Encompassing a large collection of diverse object types in nine modalities, BiomedParseData also provides a much more realistic representation of object complexity in biomedicine.
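
As a rough guide to what "irregular" means here, the sketch below computes two of the shape-regularity measures referenced in Figure 3, convex ratio and box ratio, as I read them from the caption: the fraction of the convex hull (or bounding box) that the object actually fills. Lower values mean more irregular shapes; the exact definitions in the paper may differ.

```python
# Hedged sketch of two shape-regularity measures for a binary mask.
import numpy as np
from skimage.morphology import convex_hull_image

def convex_ratio(mask: np.ndarray) -> float:
    """Object area divided by its convex-hull area (1.0 for convex shapes)."""
    mask = mask.astype(bool)
    hull = convex_hull_image(mask)
    return float(mask.sum() / max(hull.sum(), 1))

def box_ratio(mask: np.ndarray) -> float:
    """Object area divided by its tight bounding-box area (1.0 for filled rectangles)."""
    mask = mask.astype(bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 0.0
    box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return float(mask.sum() / box_area)

# A plus-shaped object fills its bounding box far less than a solid square would.
mask = np.zeros((32, 32), dtype=bool)
mask[14:18, 4:28] = True
mask[4:28, 14:18] = True
print(round(convex_ratio(mask), 2), round(box_ratio(mask), 2))
```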

a, Six examples showing the results of object recognition by our method. Object recognition identifies and segments all objects in an image without requiring any user-provided input prompt. b–d, Scatter plots comparing the F1 (b), precision (c) and recall (d) scores between BiomedParse and Grounding DINO on identifying objects present in the image. e, Comparison between BiomedParse and Grounding DINO on object identification in terms of median F1 score across different numbers of objects in the image. f, Box plot comparing BiomedParse and MedSAM/SAM (using bounding boxes generated by Grounding DINO) on end-to-end object recognition (including segmentation) across modalities. g, Comparison between BiomedParse and MedSAM/SAM (using bounding boxes generated by Grounding DINO) on end-to-end object recognition (including segmentation) in relation to the number of distinct objects in the image.
Figure 4. Evaluation on object recognition.

Promising step toward scaling holistic biomedical image analysis

By operating through a simple text prompt, BiomedParse requires substantially less user effort than prior best methods, which typically require object-specific bounding boxes, especially when an image contains a large number of objects (Figure 2c). By modeling an object recognition threshold, BiomedParse can detect invalid prompts and reject segmentation requests when the specified object is absent from the image. BiomedParse can also recognize and segment all known objects in an image in one fell swoop (Figure 4). By scaling holistic image analysis, BiomedParse can potentially be applied to key precision health applications such as early detection, prognosis, treatment decision support, and progression monitoring.
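
Figure 2f suggests the rejection mechanism is statistical: compare the distribution of predicted mask probabilities for a new prompt against a reference distribution collected from known-valid prompts, and reject when a two-sided K–S test yields a small P value. The sketch below illustrates that idea; the reference distribution, cutoff, and exact statistic used by BiomedParse are assumptions here.

```python
# Hedged sketch of K-S-test-based rejection of invalid text prompts.
import numpy as np
from scipy.stats import ks_2samp

def is_valid_prompt(pred_probs: np.ndarray,
                    reference_probs: np.ndarray,
                    p_cutoff: float = 0.05) -> bool:
    """Return False (reject) when predictions look unlike valid-prompt outputs."""
    stat, p_value = ks_2samp(pred_probs.ravel(), reference_probs.ravel())
    return p_value >= p_cutoff

# Toy example: valid prompts yield bimodal probabilities (background near 0, object near 1);
# an invalid prompt yields uniformly low scores, so the K-S test flags the mismatch.
rng = np.random.default_rng(0)
reference = np.concatenate([rng.beta(1, 20, 5000), rng.beta(20, 1, 500)])
invalid = rng.beta(1, 30, 5000)
print(is_valid_prompt(invalid, reference))  # likely False -> reject the request
```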

Going forward, there are numerous growth opportunities. BiomedParse can be extended to handle more modalities and object types. It can be integrated into advanced multimodal frameworks such as LLaVA-Med to facilitate conversational image analysis by “talking to the data.” To facilitate research in biomedical image analysis, we have made BiomedParse open source under the Apache 2.0 license. We’ve also made it available on Azure AI for direct deployment and real-time inference. For more information, check out our demo.

BiomedParse is joint work with Providence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering, and brings together multiple teams within Microsoft*. It reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health, alongside other exciting progress such as GigaPath, BiomedCLIP, LLaVA-Rad, BiomedJourney, MAIRA, Rad-DINO, and Virchow.

*Within Microsoft, BiomedParse is a wonderful collaboration among Health Futures, MSR Deep Learning, and Nuance.

Paper co-authors: Theodore Zhao, Yu Gu, Jianwei Yang, Naoto Usuyama, Ho Hin Lee, Sid Kiblawi, Tristan Naumann, Jianfeng Gao, Angela Crabtree, Jacob Abel, Christine Moung-Wen, Brian Piening, Carlo Bifulco, Mu Wei, Hoifung Poon, Sheng Wang
