On-the-Fly Adapting Code Summarization on Trainable Cost-Effective Language Models
- Yufan Cai,
- Yun Lin,
- Chenyan Liu,
- Jinglian Wu,
- Yifan Zhang,
- Yiming Liu,
- Yeyun Gong,
- Jin Song Dong
NeurIPS 2023
Deep learning models are emerging to translate or summarize source code into natural-language comments, facilitating software engineering tasks such as code documentation and program comprehension. Given a large corpus of code-comment pairs, a deep language model is trained to perform this translation or summarization task. To encompass and generalize over an extremely diverse training corpus, mainstream industry practice keeps scaling up deep learning models, from millions to billions of neurons (e.g., GPT-3 and ChatGPT). While scaling models up to tens of billions of neurons is effective, the cost of training and maintaining them within an organization is non-trivial.

In this work, we explore a novel approach, AdaCom, to improve the performance of small- or medium-size comment generators through on-the-fly model reinforcement. This research is motivated by our observation that deep comment generators, especially small-scale ones, usually have to compromise their predictions on a subset of the samples. Specifically, given a piece of target code c, some training samples S_p can contribute more to generating the comment of c than the remaining samples S_o; however, the comment generator can be under-trained on S_p because it must also fit S_o from a global perspective. In this light, we design AdaCom to (1) detect whether the model might deliver compromised summarization performance on a sample (i.e., a piece of source code) and (2) re-adapt the model on-the-fly by retraining it on the most contributive training samples, improving its performance on that sample.

Our extensive experiments with 7 deep comment generators on 4 training datasets show that (1) AdaCom significantly boosts comment-generation performance (BLEU4 by 14.9%, METEOR by 12.2%, and ROUGE-L by 7.4%, on average), (2) the whole adaptation on an individual code sample incurs a small runtime overhead (1.46 seconds for the small models and 3.16 seconds for the base models), well acceptable for an on-the-fly solution, and (3) AdaCom generalizes well to out-of-distribution code samples. We also compare AdaCom and ChatGPT in a case study.
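To make the detect-then-adapt workflow concrete, below is a minimal, self-contained Python sketch of the two steps the abstract describes. It is an illustration under assumptions, not the AdaCom implementation: the backbone checkpoint (Salesforce/codet5-small), the confidence-based detector `is_compromised`, the overlap-based retriever `retrieve_contributive_samples`, and all hyperparameters (`k`, `steps`, `lr`, `threshold`) are hypothetical stand-ins for the paper's actual contribution analysis and adaptation procedure.

```python
# Hypothetical sketch of an AdaCom-style detect-then-adapt loop; all names,
# thresholds, and the backbone checkpoint are illustrative assumptions.
import copy
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "Salesforce/codet5-small"  # assumed small seq2seq comment generator
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_comment(m, code, max_len=64):
    """Greedy-decode a comment for `code` with model `m`."""
    ids = tokenizer(code, return_tensors="pt", truncation=True).input_ids
    out = m.generate(ids, max_length=max_len)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def is_compromised(m, code, threshold=0.7):
    """Stand-in detector: flag samples whose greedy decoding has low
    average token confidence (not necessarily AdaCom's criterion)."""
    ids = tokenizer(code, return_tensors="pt", truncation=True).input_ids
    out = m.generate(ids, max_length=64, output_scores=True,
                     return_dict_in_generate=True)
    confidences = [step.softmax(-1).max().item() for step in out.scores]
    return sum(confidences) / max(len(confidences), 1) < threshold

def retrieve_contributive_samples(code, train_pairs, k):
    """Stand-in retriever: rank (code, comment) pairs by token overlap
    with the target code, a crude proxy for contribution analysis."""
    target = set(code.split())
    return sorted(train_pairs,
                  key=lambda p: -len(target & set(p[0].split())))[:k]

def summarize_with_adaptation(code, train_pairs, k=8, steps=3, lr=5e-5):
    """Step 1: detect a likely-compromised sample; step 2: briefly
    fine-tune a throwaway copy on the k most contributive pairs."""
    if not is_compromised(model, code):
        return generate_comment(model, code)   # base model is adequate
    adapted = copy.deepcopy(model)             # leave the base model untouched
    adapted.train()
    optimizer = torch.optim.AdamW(adapted.parameters(), lr=lr)
    for _ in range(steps):
        for src, tgt in retrieve_contributive_samples(code, train_pairs, k):
            batch = tokenizer(src, return_tensors="pt", truncation=True)
            labels = tokenizer(tgt, return_tensors="pt",
                               truncation=True).input_ids
            loss = adapted(**batch, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    adapted.eval()
    return generate_comment(adapted, code)
```

Cloning the model before adapting keeps each per-sample adaptation isolated, matching the abstract's framing of on-the-fly, per-query reinforcement rather than a permanent update to the deployed generator.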