An index of datasets, SDKs, APIs, and other open-source code created by Microsoft researchers and shared with the broader academic community. We also maintain a collection highlighting some of the tools you’ll find here.
ELL Models Repository
The ELL models repository contains a number of pretrained models suited for different device footprints. Currently, there are pretrained models for image classification and audio keyword spotting.
DSTC7-Task2 dataset
Scripts to generate the knowledge-grounded conversation dataset, first used for the DSTC7-Task2 challenge.
Simple NLP tools
Scripts for dialog generation evaluation, tokenization, and data preprocessing, plus a GUI.
NAIL Agent
The NAIL agent repository provides the source code for the agent that won the 2018 Text-Based Adventure AI Competition. The repository will also be used to release further improvements as we develop better RL approaches to the…
AirSim Simulator
AirSim is a high-fidelity, extensible simulation platform for data generation, algorithm testing, and reinforcement learning in the development of autonomous agents.
Graph-based code modeling toolkit
A toolkit for reasoning about source code (tasks related to program understanding, synthesis, and verification) using graph neural networks. Developed in partnership with MSR Cambridge. Used by several ongoing projects both inside and outside MSR.
Dataset for Learning Karel Programs
A synthetic dataset of visual programs for the program synthesis task, now a common benchmark in the academic community. This webpage hosts the dataset for synthetically generated Karel programs that are used for training and…
Microsoft Program Synthesis using Examples SDK
A framework/SDK for program synthesis from input-output examples, with pre-built applications for data wrangling, Jupyter integration for data scientists, repetitive code editing, text manipulations, and extraction from webpages. Started at MSR, now developed by a…