Researcher tools: code, datasets, & models

An index of datasets, SDKs, APIs, and other open-source code created by Microsoft researchers and shared with the broader academic community. We also maintain a collection highlighting some of the tools you’ll find here.

Dataset Source Code

TamGen

This is the implementation of the paper “TamGen: Target-aware Molecule Generation for Drug Design Using a Chemical Language Model”.

GitHub Publication

Download

RAD-DINO model

RAD-DINO is a vision transformer model trained to encode chest X-rays using the self-supervised learning method DINOv2. RAD-DINO is described in detail in RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision (F. Pérez-García, H. Sharma, S.…

Download Project Publication

Download

MAIRA-2 model

MAIRA-2 is a multimodal transformer designed for the generation of grounded or non-grounded radiology reports from chest X-rays. It is described in more detail in MAIRA-2: Grounded Radiology Report Generation (S. Bannur, K. Bouzid et al.,…

Download Project Publication

Dataset Source Code

RadFact: An LLM-based Evaluation Metric for AI-generated Radiology Reporting

RadFact is a framework for the evaluation of model-generated radiology reports given a ground-truth report, with or without grounding. Leveraging the logical inference capabilities of large language models, RadFact is not a single number but a suite of…

GitHub Project Publication

Dataset Source Code

TerraTrace: Spatio-Temporal Signatures for Land Use Analytics

Understanding land use over time is critical to tracking events related to climate change, like deforestation. However, satellite-based remote sensing tools used for monitoring struggle to differentiate vegetation types in farms and orchards…

GitHub Project Project

Dataset Source Code

OmniParser

OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of…

GitHub Publication

Download

FStar Data Set v2

This dataset is Version 2.0 of the FStar Data Set. Its primary objective is to train and evaluate Proof-oriented Programming with AI (PoPAI, in short). Given a specification of a program and proof…

Download Project Publication

Download

FStar Data Set v1

This dataset contains programs and proofs in the F* proof-oriented programming language. The data, proposed in Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming, is an archive of source code, build artifacts, and metadata assembled from eight…

Download Project Publication

Dataset Source Code

Eureka ML Insights

This repository contains the code for Eureka ML Insights, a framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. The framework is designed to help researchers and practitioners run reproducible evaluations…

GitHub Publication

Dataset Source Code

Data Formulator

Transform data and create rich visualizations iteratively with AI.

GitHub Video Publication