Researcher tools: code, datasets, & models

Dataset Source Code

MSParS

MSParS is a large-scale dataset for the open-domain semantic parsing task. The whole dataset consists of 81,826 samples annotated by native English speakers. We randomly shuffle these samples and use roughly 80% of them (63,826)…
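The shuffle-and-split step described above can be sketched as follows. This is a minimal illustration, not the MSParS authors' script: the function name `shuffle_and_split`, the fixed seed, and the integer stand-in IDs are assumptions made for the example. Only the two counts (81,826 and 63,826) come from the description.

```python
import random

TOTAL_SAMPLES = 81_826  # dataset size stated in the description
TRAIN_SAMPLES = 63_826  # training-set size stated in the description


def shuffle_and_split(samples, train_size, seed=0):
    """Shuffle a copy of `samples` and return (train, held_out).

    Hypothetical helper: the seed makes the split reproducible, and the
    input list is left unmodified.
    """
    if not 0 <= train_size <= len(samples):
        raise ValueError("train_size must be between 0 and len(samples)")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Stand-in IDs; the real samples are distributed through the MSParS repository.
sample_ids = list(range(TOTAL_SAMPLES))
train, held_out = shuffle_and_split(sample_ids, TRAIN_SAMPLES)
print(len(train), len(held_out))
```

The 18,000 samples left over after the training portion are returned together as `held_out`; how they are divided further is not covered by this sketch.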

GitHub

ELL Models Repository

The ELL models repository contains a number of pretrained models suited for different device footprints. Currently, there are pretrained models for image classification and audio keyword spotting.

GitHub

DSTC7-Task2 dataset

Scripts to generate the knowledge-grounded conversation dataset first used for the DSTC7-Task2 challenge.

GitHub

Simple NLP tools

Scripts for dialog generation evaluation, tokenization, and data preprocessing, plus a GUI.

GitHub

NAIL Agent

The NAIL agent repository provides the source code for the agent that won the 2018 Text-Based Adventure AI Competition. The repository will also be used to release further improvements as we develop better RL approaches to the…

GitHub Publication

Jericho

Jericho is an environment that connects learning agents with interactive fiction games. It supports nearly 50 existing games, as well as the TextWorld environment from MSR Montreal. The hope is that Jericho can be the standard…

GitHub

AirSim Simulator

AirSim is a high-fidelity, extensible simulation platform for data generation, algorithm testing, and reinforcement learning in the development of autonomous agents.

GitHub Video Video Project

Graph-based code modeling toolkit

A toolkit for reasoning about source code (tasks related to program understanding, synthesis, and verification) using graph neural networks. Developed in partnership with MSR Cambridge. Used by several ongoing projects both inside and outside MSR.

GitHub

Dataset for Learning Karel Programs

A synthetic dataset of visual programs for the program synthesis task, now a common benchmark in the academic community. This webpage hosts the dataset of synthetically generated Karel programs that are used for training and…

GitHub

Microsoft Program Synthesis using Examples SDK

A framework/SDK for program synthesis from input-output examples, with pre-built applications for data wrangling, Jupyter integration for data scientists, repetitive code editing, text manipulations, and extraction from webpages. Started at MSR, now developed by a…

GitHub