CoSQA: 20,000+ Web Queries for Code Search and Question Answering

Junjie Huang; Duyu Tang; Linjun Shou (寿林钧); Ming Gong (YIMING); Ke Xu; Daxin Jiang (姜大昕); Ming Zhou; Nan Duan

ACL-IJCNLP 2021 | May 2021


Finding code given a natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and code snippets, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to generate additional artificial training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.
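To make the idea of augmenting query-code pairs concrete, here is a minimal sketch of two generic augmentation steps: pairing each query with the other code snippets in a batch to obtain negatives, and lightly rewriting a query to obtain an extra positive. It is an illustration under stated assumptions, not the CoCLR implementation: the function names (`in_batch_negatives`, `rewrite_query`, `augment`), the word-swap rewrite, and the toy batch are all hypothetical and do not come from this page.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]          # (query, code), assumed to be a matching pair
Labeled = Tuple[str, str, int]  # (query, code, label), label 1 = match, 0 = no match


def in_batch_negatives(batch: List[Pair]) -> List[Labeled]:
    """Pair every query with every code snippet in the batch.

    Only the original pairing is labeled 1. This assumes that the other
    snippets in the batch do not answer the query, which may not always hold.
    """
    examples: List[Labeled] = []
    for i, (query, _) in enumerate(batch):
        for j, (_, code) in enumerate(batch):
            examples.append((query, code, 1 if i == j else 0))
    return examples


def rewrite_query(query: str, rng: random.Random) -> str:
    """Return a perturbed copy of the query with two word positions swapped."""
    words = query.split()
    if len(words) < 2:
        return query  # nothing to swap
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def augment(batch: List[Pair], seed: int = 0) -> List[Labeled]:
    """Combine in-batch pairings with one rewritten-query positive per pair."""
    rng = random.Random(seed)
    rewritten = [(rewrite_query(q, rng), c, 1) for q, c in batch]
    return in_batch_negatives(batch) + rewritten


if __name__ == "__main__":
    toy_batch = [
        ("sort a list", "sorted(xs)"),
        ("read a file", "open(path).read()"),
        ("reverse a string", "s[::-1]"),
    ]
    for example in augment(toy_batch):
        print(example)
```

With a batch of n matching pairs, this sketch yields n × n in-batch examples (n of them positive) plus n rewritten positives, which shows how a small labeled set can be expanded into many more training instances.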

Publication Downloads

CodeBERT

May 12, 2021


CodeXGLUE

September 28, 2020

CodeXGLUE is a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. The name stands for General Language Understanding Evaluation benchmark for CODE. It comprises 14 datasets for 10 diversified code intelligence tasks covering four scenarios: code-code, text-code, code-text, and text-text.
