∆BLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets
- Michel Galley
- Chris Brockett
- Alessandro Sordoni
- Yangfeng Ji
- Michael Auli
- Chris Quirk
- Margaret Mitchell
- Jianfeng Gao
- Bill Dolan
Proc. of ACL
We introduce Discriminative BLEU (∆BLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [−1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, ∆BLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman’s ρ and Kendall’s τ.
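To make the weighting idea concrete, here is a minimal sketch of a rating-weighted n-gram precision, assuming each reference carries a human rating in [−1, +1]. The function names (`ngrams`, `weighted_precision`), the choice to credit each hypothesis n-gram with its best-scoring reference match, and the normalization by the highest reference weight are illustrative assumptions, not the authors' exact formulation; the brevity penalty and the combination across n-gram orders are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def weighted_precision(hyp, refs, n):
    """Rating-weighted, clipped n-gram precision for one hypothesis.

    hyp  : list of tokens
    refs : list of (tokens, weight) pairs, weight in [-1, +1]
    Assumption: at least one reference has a positive weight, so the
    normalizer is positive.
    """
    best_weight = max(w for _, w in refs)
    if best_weight <= 0:
        raise ValueError("need at least one positively rated reference")

    hyp_counts = ngrams(hyp, n)
    ref_counts = [(ngrams(toks, n), w) for toks, w in refs]

    numerator = 0.0
    denominator = 0.0
    for gram, h_count in hyp_counts.items():
        # Clipped count scaled by the reference's rating; keep the best match.
        matches = [w * min(h_count, rc[gram]) for rc, w in ref_counts if gram in rc]
        if matches:
            numerator += max(matches)
        # Hypothetical normalizer: every n-gram matched in the best reference.
        denominator += best_weight * h_count

    return numerator / denominator if denominator else 0.0


refs = [("i am fine".split(), 1.0), ("go away".split(), -0.5)]
good = weighted_precision("i am fine thanks".split(), refs, 1)
bad = weighted_precision("go away".split(), refs, 1)
```

In this toy setup, a hypothesis that overlaps the positively rated reference receives a positive score, while one that only matches the negatively rated reference is penalized with a negative score. Plain multi-reference BLEU would reward both matches equally.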