Abstract
With the rapid development of text matching and pre-training models, chatbot systems are now able to yield relevant and fluent responses, but they sometimes make logical mistakes because of weak reasoning capabilities. To facilitate research in this field, we released MuTual, a reasoning-oriented testbed for multi-turn chit-chat models.
Why MuTual?
There are two types of methods for building a chatbot. Retrieval-based methods rely on text matching technology and select the response with the highest matching score among all candidates. Generation-based methods treat the chatbot as a sequence-to-sequence problem and aim to generate reasonable responses directly. Owing to their advantage in response fluency and diversity, retrieval-based methods have attracted increasing attention for chatbots.
Thanks to the great success of pre-training, BERT-based models have achieved 85.8%, 93.1%, and 98.5% on R10@1, R10@2, and R10@5, respectively, on the Ubuntu Dialogue Corpus, which is very close to human performance. However, despite the high scores on the leaderboard, the practical user experience remains poor: chatbot engines often produce responses that are logically incorrect or violate commonsense knowledge.
One important research question is how to evaluate reasoning ability in chatbots, which could help bridge the gap between high leaderboard performance and unsatisfactory practical performance. To this end, we developed an open-domain MultI-Turn dialogue reasoning dataset (MuTual) to test the reasoning capabilities of conversation models, consisting of 7,088/886/886 high-quality, human-constructed instances for training/dev/test.
MuTual follows the response selection setting, where the model selects the most appropriate response from a group of candidates. To the best of our knowledge, this is the first reasoning-oriented dataset in the chit-chat scenario. We evaluated state-of-the-art retrieval-based methods and pre-training models on MuTual and found that RoBERTa-base only yielded 71% on R4@1.
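To make the setting concrete, an instance can be thought of as a multi-turn context paired with four candidate responses and a label marking the appropriate one. The Python sketch below is purely illustrative; the field names and dialogue text are our own rather than the released file format (see the GitHub repository for the actual data files).

# A hypothetical MuTual-style instance (illustrative only).
instance = {
    "context": [
        "F: How long will it take us to drive to the airport?",
        "M: About an hour if the traffic is light.",
    ],
    "candidates": [
        "M: Then let's leave two hours before check-in, in case the roads are busy.",
        "M: Good, the airport is only a five-minute walk from here.",
        "M: I agree, taking the train is always faster than flying.",
        "M: Don't worry, we already arrived at the airport yesterday.",
    ],
    "label": 0,  # index of the only response that is appropriate given the context
}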
After we released the dataset, several well-known research teams submitted their models, most of which build on large-scale pre-trained models with task-specific modifications. The best result so far is 87%, which still falls short of the human performance of 94%.
Paper Link: http://arxiv.org/abs/2004.04494
Github Link: https://github.com/Nealcly/MuTual
Leaderboard Link: https://nealcly.github.io/MuTual-leaderboard
MuTual Spotlight
In existing chatbot benchmarks, such as Ubuntu and Douban, a model is required to select the positive response from a repository by considering the current context. While these benchmarks focus on testing the ability of a model to select a relevant response, they do not test reasoning ability directly. BERT has shown promising results on these benchmarks.
Many efforts have been made to develop benchmarks that address reasoning in language understanding. However, most reasoning-oriented benchmarks are framed as reading comprehension and rely on an external question to test a model's reasoning capability, so they cannot be used directly to help chatbots.
Following the traditional response selection setting, we modified English listening comprehension conversations to form utterance prediction tasks.
MuTual Construction
MuTual is modified from Chinese high school English listening comprehension test data, in which students are expected to select the best answer among several candidate options when given a multi-turn dialogue and a question (the left part of Figure 2). To ensure that students have fully understood the audio, most of the questions require a demonstration of reasoning capability to answer. Since chatbots are only concerned with how to respond to a context rather than answering an additional question, we further asked human annotators to rewrite the question and answer candidates as response candidates (the right part of Figure 2).
First, an annotator is required to segment the original conversation after the clues needed to answer the question have appeared. Then they construct a positive response (Response A) and negative responses (Response C and Response D) by consulting the correct choice (Choice A) and incorrect choices (Choice B and Choice C). To make MuTual more challenging, we further asked the annotator to construct one more negative response (Response B) based on the correct choice. Through these steps, MuTual not only retains the reasoning tests designed by experts but also introduces another type of reasoning into each instance. For example, Responses C and D can be excluded based on the relationship between the two speakers, while Response B is incorrect because of attitude reasoning.
All negative responses are logically correct if the context is not considered, but they are not appropriate responses if the context is taken into account. Therefore, our dataset focuses on multi-turn conversation reasoning rather than the logic of a sentence.
The reasoning involved falls into six types (Figure 3): attitude reasoning (13%), algebraic reasoning (7%), intention prediction (31%), situation reasoning (16%), multi-fact reasoning (24%), and other commonsense reasoning (9%). These six types are the ones we consider most relevant to real chatbots. For example, knowing a user's attitude enables a chatbot to make personalized recommendations, and the ability to predict intentions allows a chatbot to respond more intelligently in a long conversation session.
To evaluate whether a model is able to select a safe response when the other candidates are inappropriate, we replaced one of the candidate responses of each instance in MuTual with a safe response, yielding MuTual plus. Replacing the positive response simulates the scenario in which all the other candidates are incorrect; this is common in retrieval-based chatbots, because a limited pool of candidate responses cannot cover all cases in practice. Similarly, we can evaluate whether the model chooses the correct response instead of the safe one when a correct response exists. If the positive response is replaced, the safe response becomes the correct one; if a negative response is replaced, the original positive response is still the best choice.
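This replacement step can be sketched roughly as follows, reusing the illustrative instance layout from the earlier example (the safe response text and helper name are our own):

import random

SAFE_RESPONSE = "Could you say that again, please?"  # a hypothetical safe response

def to_mutual_plus(instance, rng=random):
    # Overwrite one randomly chosen candidate with the safe response.
    candidates = list(instance["candidates"])
    replaced = rng.randrange(len(candidates))
    candidates[replaced] = SAFE_RESPONSE
    # If the positive response was replaced, the safe response now sits at the
    # same index and becomes the correct answer; otherwise the original
    # positive response is still the best one, so the label is unchanged.
    return {"context": instance["context"],
            "candidates": candidates,
            "label": instance["label"]}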
Following the standard dialogue setting, we considered our task as a response selection task and employed traditional information retrieval evaluation methods, including recall at position 1 in 4 candidates (R@1), recall at position 2 in 4 candidates (R@2), and the Mean Reciprocal Rank (MRR).
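These metrics can be computed from per-candidate model scores with a minimal helper like the one below (our own sketch, assuming exactly one positive response among the four candidates):

def evaluate(all_scores, all_labels):
    # all_scores: one list of four model scores per instance
    # all_labels: the index of the positive response for each instance
    r1 = r2 = mrr = 0.0
    for scores, label in zip(all_scores, all_labels):
        # Rank candidates by score, highest first.
        ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rank = ranking.index(label) + 1  # 1-based rank of the positive response
        r1 += rank == 1
        r2 += rank <= 2
        mrr += 1.0 / rank
    n = len(all_labels)
    return {"R@1": r1 / n, "R@2": r2 / n, "MRR": mrr / n}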
Experiments
As shown in Table 1, TF-IDF is only slightly better than random guessing, which indicates that there are no obvious statistical clues between a context and its positive response. All models performed significantly worse than on other popular conversation datasets. Well-designed matching models such as SMN and DAM did not outperform a simple dual LSTM, and individual scoring methods and multi-choice methods showed similar results. Even the best-performing model, RoBERTa-base, still lagged behind human performance by 23 points on R@1.
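For intuition about what the TF-IDF baseline measures, the sketch below (our own, using scikit-learn rather than the exact implementation behind Table 1) simply ranks the candidates by lexical similarity to the context, which is exactly the kind of surface clue MuTual is designed not to reward:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_scores(context_utterances, candidates):
    # Score each candidate by TF-IDF cosine similarity with the whole context.
    context = " ".join(context_utterances)
    vectorizer = TfidfVectorizer().fit([context] + candidates)
    context_vec = vectorizer.transform([context])
    candidate_vecs = vectorizer.transform(candidates)
    return cosine_similarity(context_vec, candidate_vecs)[0]  # one score per candidate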
On MuTual plus, all models performed worse than on MuTual, which is consistent with our assumption. We found that multi-choice methods performed significantly better than individual scoring methods. One possible explanation is that multi-choice methods consider the candidates together, so they can distinguish whether or not the safe response is the best one.
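The difference between the two paradigms can be sketched in a few lines of PyTorch. In the snippet below, the random embeddings stand in for encoder outputs (for example, a RoBERTa representation of each context-candidate pair); the actual model heads used in our experiments may differ.

import torch
from torch import nn

hidden = 768                               # hypothetical encoder width (e.g., RoBERTa-base)
head = nn.Linear(hidden, 1)                # shared scoring head for both sketches
pair_embeddings = torch.randn(4, hidden)   # stand-ins for encoded (context, candidate) pairs

# Individual scoring: each candidate is judged on its own, as a binary
# "is this an appropriate response to the context?" decision.
individual_scores = torch.sigmoid(head(pair_embeddings)).squeeze(-1)

# Multi-choice scoring: the four candidates are normalized against each
# other, so the safe response is chosen only if it beats the alternatives.
multi_choice_scores = torch.softmax(head(pair_embeddings).squeeze(-1), dim=-1)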
Interestingly, the performance of RoBERTa does not decrease significantly as the number of turns increases, which differs from the phenomenon observed on other datasets and indicates that the reasoning problems do not become much harder when the context becomes longer. This also suggests that the difficulty of MuTual comes from reasoning rather than from complex conversation history.
Instances that involve numerical reasoning and intention prediction show poor performance; these two reasoning types depend heavily on commonsense reasoning. Take Figure 4 as an example: a simple subtraction derives the time difference (5:00 pm – 6h = 11:00 am), but this turned out to be a significant challenge for RoBERTa-MC. In the second case, RoBERTa-MC failed to infer the dialogue situation, in which the goal is to find a flat to rent.
We further verified whether MuTual requires multi-turn understanding or degenerates into a single-turn reasoning problem. We evaluated RoBERTa and RoBERTa-MC with some utterances manually removed from the context. As more utterances are removed, the performance of RoBERTa and RoBERTa-MC decreases significantly, indicating the importance of each utterance and the quality of the dataset.
Conclusion
With the great success of deep learning techniques, chatbots are able to yield relevant and fluent responses, but they still show unsatisfactory performance in practice because they lack strong reasoning skills. To evaluate reasoning ability in chatbots directly, we released MuTual, the first reasoning-based dataset for multi-turn dialogue. We hope that MuTual can facilitate future research on multi-turn conversation reasoning problems.