Natural Language to Class-level Code Generation by Iterative Tool-augmented Reasoning over Repository

International Conference on Machine Learning Workshop on Data-Centric Machine Learning Research (ICML-DMLR)

Large Language Models (LLMs) have demonstrated significant potential in code generation, achieving promising results across various benchmarks. However, most contemporary benchmarks, such as HumanEval and MBPP, consist of tasks restricted to simple units such as functions. These are not representative of real-world use cases, where source code files often contain larger, more complex structures such as classes and namespaces. Furthermore, most of these tasks are self-contained and do not require retrieving relevant information from outside the context of the task itself.

Our contributions through this work are twofold: (i) RepoClassBench, a new benchmark in which each task requires generating a class with multiple member variables and methods from a provided natural language specification, where the class has dependencies on entities outside the task's immediate context; and (ii) Retrieve-Repotools-Reflect (RRR), a novel approach that equips an LLM with the ability to invoke static analysis tools to identify and retrieve relevant information, and to use this feedback to reason about and iteratively improve the generated code (a minimal sketch of this loop is shown below).

For each task in RepoClassBench, two natural language descriptions are provided: a fully specified description that gives fine-grained details for each member of the corresponding class, and a sketchy description that gives only a coarse-grained account of each member. The sketchy evaluation is intended to more closely resemble real-world scenarios, in which software developers often provide incomplete or under-specified descriptions of the classes they want generated. It therefore incentivizes building LLMs with conversational abilities that can use human feedback to resolve ambiguities in the provided description. Lastly, because the source code files involved are long, the benchmark also tests an LLM's long-form abilities, evaluating its reasoning over larger prompts and its capacity to generate long, coherent output.

We use RepoClassBench to benchmark RRR against existing methods and showcase the improvements RRR achieves using the feedback provided by the tools. Details and code for our dataset and evaluation harness are available at https://github.com/microsoft/repoclassbench.
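To make the iterative tool-augmented loop concrete, the sketch below illustrates the general pattern of drafting code, querying repository tools, and reflecting on their feedback. This is a minimal illustration under assumed interfaces, not the actual RRR implementation: the `llm` and `tools` objects, their methods (`complete`, `lookup_symbol`, `diagnostics`), and the stopping criterion are all hypothetical placeholders.

```python
# Hypothetical sketch of an iterative tool-augmented generation loop in the
# spirit of RRR; all names and interfaces below are illustrative placeholders,
# not the actual RepoClassBench/RRR code.

def generate_class(nl_spec: str, repo_path: str, llm, tools, max_rounds: int = 5) -> str:
    """Iteratively draft a class, query repository tools, and refine the draft."""
    # Initial draft from the natural language specification alone.
    code = llm.complete(f"Write a class satisfying this specification:\n{nl_spec}")

    for _ in range(max_rounds):
        # Ask the model which repository entities it is unsure about
        # (e.g., imported symbols, base classes, helper functions).
        queries = llm.complete(
            "List unresolved symbols or APIs you need context for:\n" + code
        ).splitlines()

        # Retrieve definitions/signatures via static analysis over the repository
        # (hypothetical tool interface).
        context = "\n".join(tools.lookup_symbol(repo_path, q) for q in queries if q)

        # Collect feedback such as compiler or linter diagnostics on the draft.
        feedback = tools.diagnostics(repo_path, code)
        if not feedback:  # nothing left to fix
            break

        # Reflect: revise the draft using the retrieved context and feedback.
        code = llm.complete(
            f"Specification:\n{nl_spec}\n\nRetrieved context:\n{context}\n\n"
            f"Feedback:\n{feedback}\n\nRevise the class accordingly:\n{code}"
        )
    return code
```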