This area of research broadly explores ways of combining machine learning with program synthesis. It is an umbrella effort that has spawned several projects, each applying this combination in a different area.
Heterogeneous data extraction framework (HDEF): This project explores the benefits of combining program synthesis with machine learning for structured information extraction. We use machine learning models (“ML models”) such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise produced by such models is hard to manage or debug. We use (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space.
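To make the idea of synthesizing from noisy labels concrete, here is a minimal sketch, not the HDEF implementation. It assumes a tiny, made-up set of candidate "programs" (named regular expressions) and a handful of hypothetical labels standing in for the output of an ML model, one of which is deliberately wrong. The synthesizer simply picks the candidate that agrees with the most labels, so the result stays readable and can be applied to every input.

```python
import re

# Hypothetical candidate programs: named regexes from a tiny, made-up DSL.
# A real synthesizer would search a much larger program space.
CANDIDATES = {
    "digits": r"\d+",
    "price": r"\$\d+(?:\.\d{2})?",
    "capitalized_word": r"[A-Z][a-z]+",
}


def run(pattern, text):
    """Apply a candidate program: return the first match, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


def synthesize(examples):
    """Pick the candidate that agrees with the most (possibly noisy) labels.

    examples: list of (text, label) pairs, where the labels are assumed to
    come from an ML model and may contain mistakes.
    Returns (name, pattern, number_of_labels_matched).
    """
    best = None
    for name, pattern in CANDIDATES.items():
        score = sum(run(pattern, text) == label for text, label in examples)
        if best is None or score > best[2]:
            best = (name, pattern, score)
    return best


# Illustrative labels; the third one is a deliberate mislabel.
noisy = [
    ("Widget costs $12.50 today", "$12.50"),
    ("Price: $7 only", "$7"),
    ("Gadget at $3.99", "Gadget"),
    ("Now $100.00 per unit", "$100.00"),
]

name, pattern, score = synthesize(noisy)
print(name, pattern, score)
# The chosen program can now be run on the input that was mislabeled.
print(run(pattern, "Gadget at $3.99"))
```

Because the chosen program is a single named pattern, a person can inspect it and see why a value was extracted, which is the interpretability benefit described above.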
Project Jigsaw: Large language models such as GPT-3 and Codex have shown promise for AI-assisted pair programming. However, it is well understood that these models do not necessarily understand code semantics and hence may generate incorrect or insecure code. In this project, we present an approach that augments large language models such as GPT-3 and Codex with post-processing steps based on program analysis and synthesis techniques that understand the syntax and semantics of programs. We show that such techniques can make use of user feedback and improve with usage. We build a tool called Jigsaw, which synthesizes code that uses the Python Pandas API from multi-modal inputs.
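The post-processing idea can be illustrated with a small sketch. This is not Jigsaw's actual pipeline: the candidate snippets below are hard-coded stand-ins for model output, the function name top2 and the helper passes_checks are hypothetical, and plain Python lists are used instead of Pandas so that the example needs only the standard library. Each candidate is first checked for syntax and then run against an input/output example, and only candidates that pass both checks are kept.

```python
import ast


def passes_checks(candidate_src, func_name, io_examples):
    """Return True if the candidate parses and satisfies every I/O example.

    Note: exec runs arbitrary code; a real system would need sandboxing.
    """
    # Syntactic check: reject candidates that do not parse.
    try:
        ast.parse(candidate_src)
    except SyntaxError:
        return False

    # Semantic check: run the candidate on the user-provided examples.
    namespace = {}
    try:
        exec(candidate_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in io_examples)
    except Exception:
        return False


# Hard-coded stand-ins for snippets a language model might produce.
candidates = [
    "def top2(xs): return sorted(xs)[:2",                  # does not parse
    "def top2(xs): return sorted(xs)[:2]",                 # parses, wrong result
    "def top2(xs): return sorted(xs, reverse=True)[:2]",   # satisfies the example
]

# One input/output example: the two largest values, in descending order.
io_examples = [(([3, 1, 4, 1, 5],), [5, 4])]

accepted = [c for c in candidates if passes_checks(c, "top2", io_examples)]
print(accepted)
```

Filtering is only the simplest form of post-processing; the project description also mentions synthesis techniques and learning from user feedback, which this sketch does not attempt.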
Our current focus in this work is to learn transformations to fix JavaScript vulnerabilities.
Project Omega: Earlier projects largely kept the program synthesis and machine learning components separate. However, in several real-world systems, machine learning models are supported by guarding rules (which can be thought of as programs) to ensure high-quality output. Maintaining these guarding rules can become tedious over time, particularly in an environment where the data distribution also shifts. Additionally, machine learning models do not utilize the knowledge present in the rules, and the rules do not know whether the machine learning models already subsume them. In this project, we aim to build a framework where machine learning models and rules can co-exist and evolve together, allowing easier management of the rules while benefiting ML models with the knowledge present in the rules.
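One way to picture the question of whether a model already subsumes a rule is the sketch below. It is an illustration under stated assumptions, not the Omega framework: the toy model, the two guarding rules, the sample inputs, and the helper rule_agreement are all hypothetical. A rule is treated as a function that returns a label when it fires and None otherwise, and it is considered subsumed on a sample if the model agrees with it every time it fires.

```python
def model(text):
    """Toy stand-in for an ML classifier (hypothetical)."""
    return "spam" if "free" in text.lower() else "ok"


def rule_wire_transfer(text):
    """Hypothetical guarding rule: fires on 'wire transfer'."""
    return "spam" if "wire transfer" in text.lower() else None


def rule_free_lunch(text):
    """Hypothetical guarding rule: fires on 'free lunch'."""
    return "spam" if "free lunch" in text.lower() else None


def rule_agreement(rule, model_fn, inputs):
    """Count how often the rule fires and how often the model agrees with it.

    Returns (times_fired, times_model_agreed).
    """
    fired = [x for x in inputs if rule(x) is not None]
    agreed = sum(model_fn(x) == rule(x) for x in fired)
    return len(fired), agreed


# Illustrative sample; in practice this would be recent production data.
inputs = [
    "FREE wire transfer now",
    "wire transfer receipt",
    "free lunch",
    "meeting at 10",
]

for rule in (rule_wire_transfer, rule_free_lunch):
    fired, agreed = rule_agreement(rule, model, inputs)
    subsumed = fired > 0 and fired == agreed
    print(rule.__name__, fired, agreed, "subsumed" if subsumed else "still needed")
```

A check like this only reflects the sample it is run on, so it would need to be repeated as the data distribution shifts.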