With the advent of AI assistants, initially built on structured databases and manually curated knowledge graphs, answers to the basic fact-based questions that come up in everyday conversation became just a few keystrokes or a verbal cue away. What film won the Academy Award for best picture in 1998? (Titanic.) What’s the weather going to be like today? (Likely around 80° and sunny if you’re on the Microsoft campus in Redmond, Washington.) What was the score of Monday’s Seattle Mariners game? (They won, 7-3.)
As machine reading comprehension (MRC) technology emerged, these question answering (QA) systems became capable of finding answers directly in passages of text without the need for curated databases and graphs, unlocking their potential to leverage the vast collection of material online, including digital books and Wikipedia articles. In fact, MRC technology is now being used in Bing to provide direct answers to some queries of this style by finding the answer in the text of the retrieved web pages. Microsoft is looking to extend that power to another class of questions: domain- and enterprise-specific queries.
We’re developing a transfer learning–based approach for quickly adapting models that have proven good at answering general interest–type questions to documents in specialized domains using only limited amounts of domain-specific example data. Our goal is to provide a platform for easily searching such documents as instruction manuals, employee handbooks, and organization guidelines. Current answer-seeking tools like tables of contents, document indices, and keyword search mechanisms can be tedious. A system that can take people directly to the exact answer they’re looking for would save them time and effort. To illustrate our approach, we’re releasing a new demonstration system that can answer questions against Welcome to Canada, the Canadian government’s guidebook for new immigrants.
‘What is?’ versus ‘What should?’
Thanks to such open datasets as the Stanford Question Answering Dataset (SQuAD), NewsQA, and Microsoft Machine Reading Comprehension (MS MARCO), existing deep learning MRC algorithms have had success answering general-knowledge questions. In January 2018, for example, the R-NET system from Microsoft became the first to achieve parity with human performance on the SQuAD task. When applied to domain-specific documents, though, these algorithms don’t perform as well because of fundamental differences between the general question-and-answer data they’re trained on and the types of questions asked of domain-specific data.
To explore this problem, we collected approximately 100,000 question-and-answer pairs against five different automobile manuals. A BERT-based QA model trained on 94,000 of these auto-domain question-and-answer pairs achieves an MRC F1 score of 0.83 on test queries for an auto manual, where the F1 score measures the average overlap between the proposed answers and the reference answers and where higher scores reflect more accurate answers. However, a model trained on 400,000 general-purpose question-and-answer pairs from the SQuAD, NewsQA, and MS MARCO datasets achieves an F1 score of only 0.65 on the same auto manual questions.
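For readers unfamiliar with the metric, token-overlap F1 is computed in essentially the same way as in the standard SQuAD evaluation script. The short sketch below is our own simplified illustration, not the exact evaluation code used in these experiments (which also normalizes punctuation and articles).

```python
from collections import Counter


def f1_overlap(predicted, reference):
    """Token-level F1 between a proposed answer and a reference answer,
    in the spirit of the standard SQuAD evaluation script."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The reported score is the average overlap across all test queries.
predictions = {"q1": "check the coolant level once a week"}
references = {"q1": "check the coolant level weekly"}
avg_f1 = sum(f1_overlap(predictions[q], references[q]) for q in references) / len(references)
```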
A review of the kinds of questions present in each dataset reveals prominent differences in their scope. Ten common question types cover more than 30 percent of the questions in the general data. These are basic fact-based questions beginning with words like who, when, or what, with answers that are generally short, such as names, dates, and numbers. This style of questioning occurs much less frequently—less than 3 percent—in our auto manual dataset. Instead, questions in the auto domain tend to start with words like what should, how do, or what happens. Questions like these, which call for longer answers such as technical descriptions or instructions, represent more than 30 percent of the auto manual queries but less than 2 percent of the general data queries.
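As a rough illustration of how such a breakdown can be produced, the sketch below buckets questions by their leading words. The actual analysis used a more careful taxonomy of question types than this simple prefix count; the example questions here are invented.

```python
from collections import Counter


def question_type(question, max_words=2):
    """Bucket a question by its leading words, e.g. 'what should', 'how do'."""
    words = question.lower().rstrip("?!. ").split()
    return " ".join(words[:max_words])


def type_distribution(questions):
    """Fraction of questions falling into each leading-word bucket."""
    counts = Counter(question_type(q) for q in questions)
    total = len(questions)
    return {bucket: n / total for bucket, n in counts.most_common()}


# Toy comparison of the two styles of data (the real analysis uses full datasets).
general = ["Who directed Titanic?", "When was the stadium built?"]
auto_manual = ["What should I do if the oil pressure light comes on?",
               "How do I reset the tire pressure monitoring system?"]
print(type_distribution(general))
print(type_distribution(auto_manual))
```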
While existing QA algorithms are more than capable of learning to answer questions in new domains, collecting tens of thousands of example question-and-answer pairs to train a model for a new domain requires resources that customers might not be willing or able to expend. We therefore turned to transfer learning to adapt existing QA models using only small amounts of training data.
Maximizing small datasets
Transfer learning uses standard back-propagation training for a small number of epochs on the adaptation data. In our system, we run five-fold cross-validation trials on the adaptation data to estimate the optimal number of training epochs, maximizing the model’s performance on the new domain without overfitting to the small example set. This process also keeps the model from forgetting generalizations previously learned from the general QA data.
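The sketch below outlines this recipe. The `train_one_epoch` and `evaluate_f1` callables are placeholders standing in for the underlying QA model’s training and evaluation routines; they are not the actual system’s code.

```python
import copy

import numpy as np
from sklearn.model_selection import KFold


def estimate_best_epochs(base_model, adaptation_data, train_one_epoch, evaluate_f1,
                         max_epochs=20, folds=5):
    """Run five-fold cross-validation on the small adaptation set to estimate how
    many fine-tuning epochs help before the model starts to overfit."""
    kf = KFold(n_splits=folds, shuffle=True, random_state=0)
    scores = np.zeros((max_epochs, folds))  # held-out F1 after each epoch, per fold
    for k, (train_idx, dev_idx) in enumerate(kf.split(adaptation_data)):
        # Start each fold from the general QA model (a large model would typically
        # be reloaded from a checkpoint rather than deep-copied).
        model = copy.deepcopy(base_model)
        train = [adaptation_data[i] for i in train_idx]
        dev = [adaptation_data[i] for i in dev_idx]
        for e in range(max_epochs):
            train_one_epoch(model, train)      # standard back-propagation
            scores[e, k] = evaluate_f1(model, dev)
    return int(scores.mean(axis=1).argmax()) + 1  # epoch count with best mean F1


def adapt(base_model, adaptation_data, train_one_epoch, evaluate_f1):
    """Fine-tune the general model on the full adaptation set for the estimated
    number of epochs, limiting both overfitting and forgetting."""
    best_epochs = estimate_best_epochs(base_model, adaptation_data,
                                       train_one_epoch, evaluate_f1)
    model = copy.deepcopy(base_model)
    for _ in range(best_epochs):
        train_one_epoch(model, adaptation_data)
    return model
```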
Our experiments have demonstrated that a general QA model can be adapted with limited amounts of domain-specific training data and achieve substantial improvements in performance. The figure below shows results using a BERT-based QA modeling approach applied to an automobile manual. The solid blue line shows an F1 score of 0.64 for the general QA model, trained on a fixed amount of data, on our auto manual test set. The dotted blue line shows the F1 performance of adapting this general QA model using adaptation data ranging from 190 examples to 19,000 examples. The red dotted line shows the model performance when training the model using varying amounts of only the training data collected for the manual. Finally, the solid red line shows an F1 performance of 0.83 when training the model using the full set of 94,000 auto domain question-and-answer pairs only.
F1 performance improves from 0.64 to more than 0.76 when adapting the general model with only 190 training examples. With just under 1,000 examples in the new domain, an F1 score of 0.80 is achieved. By comparison, a model trained using only auto manual data needs more than 10 times as much data to reach the same 0.80 F1 score. Even when a large amount of in-domain data is available, transfer learning from a general model provides clear benefits.
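For readers curious how such a learning curve is produced, the sketch below adapts the general model on increasingly large subsamples of the in-domain data and records test-set F1 at each size. The helper names follow the placeholder interface from the earlier sketch and are illustrative only.

```python
import random


def learning_curve(base_model, in_domain_data, sizes, adapt_fn, evaluate_fn, test_set, seed=0):
    """Adapt the general model with increasingly large subsamples of in-domain
    data and record test-set F1 at each size."""
    rng = random.Random(seed)
    results = {}
    for n in sorted(sizes):
        subset = rng.sample(in_domain_data, n)
        adapted = adapt_fn(base_model, subset)
        results[n] = evaluate_fn(adapted, test_set)
    return results


# e.g. with a wrapper around the earlier adapt() sketch:
# from functools import partial
# adapt_fn = partial(adapt, train_one_epoch=train_one_epoch, evaluate_f1=evaluate_f1)
# curve = learning_curve(general_model, auto_manual_pairs,
#                        sizes=[190, 1000, 10000, 19000],
#                        adapt_fn=adapt_fn, evaluate_fn=evaluate_f1,
#                        test_set=auto_manual_test)
```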
For those attending the 2019 Annual Meeting of the Association for Computational Linguistics, we’ll be giving demonstrations of the Welcome to Canada system, for which we collected only 953 question-and-answer pairs for adapting a general QA model, at the Microsoft booth throughout the conference. We look forward to seeing you there and discussing our work with you in more detail.