How to build effective human-AI interaction: Considerations for machine learning and software engineering
Besmira Nushi, Saleema Amershi, Ece Kamar, Gagan Bansal, Dan Weld, Mihaela Vorvoreanu, Eric Horvitz
To help practitioners design better user-facing AI-based systems, Microsoft recently published a set of guidelines for Human-AI Interaction based on decades of research and validated through rigorous user studies across a variety of AI products. The guidelines cover a broad range of interactions, from a user's initial introduction to an AI system through ongoing use, AI model updates and refinements, and the handling of AI failures.
While the guidelines describe what AI practitioners should create to support effective human-AI interaction, this series of posts explains how to implement the guidelines in AI-based products.
We focus on implications for machine learning and software engineering because, in most cases, a guideline cannot be implemented solely through front-end interface tweaks. For example, explaining the behavior of an algorithm (Guideline 11) is not possible if the algorithm itself is not transparent. As another example, identifying and tailoring information to the user’s current context (Guidelines 3 and 4) may not be possible if the logging infrastructure and devices do not track the right signals, or if that information is noisy.
This series of posts puts forward actions that machine learning practitioners and software engineers can take today to enable effective human-AI interaction as described by the guidelines.
- Part I discusses methods you can use to help set the right expectations for users during their initial interactions with the AI system.
- Part II will discuss how to handle and time services based on context.
- Part III will focus on current methods for bias mitigation and adjusting to users’ social norms.
- Part IV will shift the attention to handling situations when the system might be wrong.
- Part V will expand the implications to supporting interaction experiences over time.
Part I: Set the right expectations initially
In this part, we focus on two guidelines concerning setting the right expectations during initial interactions between a person and an AI system:
Guideline 1: Make clear what the system can do.
Guideline 2: Make clear how well the system can do what it can do.
Making the capabilities and limitations of a system clear is important for any user-facing software. It may be particularly important for AI-based systems because people often have unrealistic expectations about their capabilities. Traditional techniques for conveying this information include user manuals, in-built documentation, and contextual help. Unfortunately, we still do not know exactly how to produce such artifacts for AI software. Take, for instance, a question answering system such as your favorite personal assistant (e.g., Alexa, Cortana, Google Assistant, Siri). Can we precisely describe the domains in which it can and cannot answer questions? If so, do we always know how accurate it is in each domain? If the assistant can answer historical questions, does that mean it can also answer geographical ones? Can a machine learning engineer describe these facets with confidence? Inflated user expectations combined with a lack of clarity about an AI system’s capabilities can lead to dissatisfaction, distrust, or product abandonment in the best case, and to injury, unfairness, and harm in the worst.
Below, we outline current best practices and actions that engineers and model developers can take now to help make the capabilities and limitations of an AI system clear to end-users.
Go beyond aggregate, single-score performance numbers when evaluating the capabilities and limitations of an AI model.
Making the capabilities and limitations of AI systems clear starts with developing a more comprehensive understanding of an AI’s potential behavior. This requires rethinking current evaluation practices that rely on aggregate, single-score performance numbers.
In machine learning evaluation, it is common to report model performance with claims such as “Model X is 90% accurate” on a given benchmark. This aggregate number tells us very little about whether we should expect uniform performance over the whole benchmark or whether there are pockets of data for which accuracy is much lower. In practice, the latter happens much more frequently, due to bias in the training data or because some concepts are harder to learn than others. A well-known example of such behavior is the Gender Shades study, which showed that the accuracy of facial analysis algorithms at gender classification is significantly lower for women with darker skin tones than for other demographic groups. If these discrepancies are unclear to system engineers themselves, it is certainly harder to explain them to end users and customers.
Multi-faceted, in-depth error analysis can help us answer questions such as: Is the model equally accurate for all demographic groups? Are there any environmental or input contexts for which the model performs significantly better or worse? How bad is system performance at the 99th error percentile?
To conduct multi-faceted, in-depth error analysis, consider the following practices:
- Examine patterns of failure at various levels of granularity – Pandora is a methodology that can help describe failures in a generalizable way. Pandora provides a set of performance views that operate at different levels of abstraction: global views (conveying overall system performance), cluster views (covering individual pockets of the data), and instance views (explaining performance for a single data point). For each view, it is possible to see how one or more input features, alone or in combination, correlate with model performance. Switching back and forth between these views lets developers better understand failures in different contexts by slicing and dicing the data in a way that is informed by the likelihood of error. For example, results from this work showed that for systems with rich, multi-modal input spaces, performance can differ widely across regions of the input, and those differences can occur for very different reasons.
- Examine patterns of failure over different slices of the data – In the language domain, the Errudite tool allows developers to query the input data flexibly and characterize the result set by its error rate. Errudite makes data slicing easier by introducing data selection operators that carry semantic meaning. In addition, the tool supports temporary data edits that enable counterfactual analysis (i.e., what would have happened if a particular example were slightly different?). A small slicing sketch follows this list.
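To make the idea concrete, here is a minimal sketch of per-slice error analysis with pandas. The DataFrame, its feature columns ("lighting", "age_group"), and the "correct" column marking whether the model's prediction was right are all hypothetical stand-ins for your own evaluation results; the point is simply to compare the aggregate number against accuracy and support for each pocket of the data.

```python
import pandas as pd

# Hypothetical evaluation results: one row per example, a few input features,
# and whether the deployed model got that example right.
results = pd.DataFrame({
    "lighting":  ["day", "day", "night", "night", "night", "day"],
    "age_group": ["18-30", "30-50", "18-30", "50+", "50+", "50+"],
    "correct":   [True, True, False, False, True, True],
})

# Global view: a single aggregate number hides a lot.
print("Overall accuracy:", results["correct"].mean())

# Cluster view: slice by one or more features in combination and inspect
# accuracy and support (number of examples) for each pocket of the data.
by_slice = (results
            .groupby(["lighting", "age_group"])["correct"]
            .agg(accuracy="mean", n="size")
            .sort_values("accuracy"))
print(by_slice)
```

Sorting by accuracy surfaces the weakest slices first; slices with very small support deserve extra caution before drawing conclusions.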
Use multiple and realistic benchmarks for evaluation.
Evaluating the capabilities and limitations of an AI using any metric typically requires testing AI models on some data. Evaluating AI models on known public benchmarks is always a good practice: it brings a quantitative perspective on how well the system performs compared to other state-of-the-art techniques. However, it should not be the only means of evaluation, for two reasons.
First, optimizing against a single benchmark over and over, in each model improvement cycle, can lead to hidden overfitting. For example, even if the model is not trained or validated on the benchmark, the inductive bias of modeling decisions may have been steered toward benchmark improvements. This practice is unfortunately often incentivized by the way we report and reward performance in large competitions and academic articles, which raises the important question of how we can rethink these practices in the context of making real systems more reliable.
Second, the data coming from our own application may look very different from the benchmark distribution. A benchmark face detection dataset may not include images with the same variety of angles or lighting conditions that your application will be operating in. Moreover, these conditions may change over time as people use and adapt their behavior to the system.
To alleviate these problems, you can:
- Monitor the model against multiple benchmarks instead of just one (a small monitoring sketch follows this list). This way, you can check whether model improvements and adjustments generalize across different benchmarks.
- Partition the benchmarks into a diverse set of use cases and monitor performance for each of them so that when generalization fails you can map it back to the type of use case.
- Include data from your real world application in the evaluation. If you are worried about not having enough real application usage data, the good news is that for evaluation you may not need as much data as you need for training. Even small amounts of real data may be able to reveal errors that would otherwise be hidden.
- Enrich your evaluation data via data augmentation (e.g., visual transformations), testing under synthetic adversarial distributions, and red-teaming exercises with humans in the loop to identify errors that cannot be observed in known benchmarks.
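As a rough illustration of the first two points, the sketch below tracks one model across several named benchmarks, including in-house and augmented slices. The function name and the shape of the `benchmarks` mapping are our own assumptions, not a prescribed API; any model exposing a scikit-learn-style `predict` would fit.

```python
from sklearn.metrics import accuracy_score

def multi_benchmark_report(model, benchmarks):
    """Evaluate one model on several named (X, y) benchmarks.

    `benchmarks` maps a descriptive name (public set, in-house sample,
    augmented in-house sample, ...) to a (features, labels) pair.
    """
    report = {name: accuracy_score(y, model.predict(X))
              for name, (X, y) in benchmarks.items()}

    # Print the weakest benchmarks first. A regression on the in-house or
    # augmented slices is a red flag even when public benchmarks improve.
    for name, acc in sorted(report.items(), key=lambda kv: kv[1]):
        print(f"{name:24s} accuracy = {acc:.3f}")
    return report
```

Recording this report in each model improvement cycle makes it easy to spot improvements that generalize versus ones that only move a single benchmark.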
It is also worth mentioning that any kind of benchmark enrichment or evaluation requires special attention to privacy-related concerns so that the evaluation process itself does not reveal user-sensitive data.
Include human-centered evaluation metrics when examining the behavior and performance of an AI.
To better ensure our AIs behave in ways that end-users expect and that align with their values, we need to include human-centered metrics in our evaluations. One of the most commonly used evaluation metrics is model accuracy. However, accuracy might not always translate to user satisfaction and successful task performance. Investigations in metric design have shown that in domains like machine translation and image captioning there are hidden dimensions of model performance that people care about but that current metrics are not able to represent. Similarly, the way people perceive accuracy may diverge significantly from computed accuracy; this divergence depends on multiple factors related to the type of errors present and the way system accuracy is explained to end users.
Human-centered evaluation metrics for AI, which more closely reflect human notions and expectations of quality, are continuing to emerge. Such metrics are particularly important when a model is employed to assist humans, as in decision-making tasks or mixed-initiative systems. A few metrics to consider include:
- Interpretability – How well might a human understand how the model is making a decision?
- Fairness – Does the model have comparable performance on different demographic groups? Does the system allocate a comparable amount of resources to such subgroups?
- Team utility – How well do the human and the machine perform together? Is the team performance better than either alone?
- Performance explainability – Can the human anticipate ahead of time when the system will make a mistake? (more about this in the next section)
- Complementarity – Is the machine simply replacing the human or is it focusing more on examples and tasks that the human needs help with?
The exact, formal definition of these metrics depends on the domain, and there is often disagreement about which definition works best for an application. However, these discussions have led to several open-source libraries for computing, and sometimes optimizing for, such human-centered metrics: InterpretML, FairLearn, AI Explainability 360.
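As one example of putting such a library to work, here is a minimal sketch of a per-group fairness check using Fairlearn's MetricFrame. The labels, predictions, and sensitive-feature values are tiny illustrative placeholders for your own evaluation data, and the exact API details may differ across Fairlearn versions.

```python
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

# Placeholder evaluation data: ground truth, model predictions, and the
# sensitive feature (e.g., a demographic group) for each example.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
group  = ["a", "a", "b", "b", "a", "b", "a", "b"]

mf = MetricFrame(metrics=accuracy_score,
                 y_true=y_true,
                 y_pred=y_pred,
                 sensitive_features=group)

print("Overall accuracy:", mf.overall)
print("Accuracy by group:")
print(mf.by_group)              # one entry per sensitive-feature value
print("Largest gap between groups:", mf.difference())
```

The same pattern works with other metrics (precision, recall, selection rate), which helps answer the fairness questions above in terms of both performance and resource allocation.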
Keep in mind that, at the end of the day, none of these metrics can substitute for evaluation with actual people, such as user studies. If human evaluation is too resource-intensive for your situation, consider at least using human annotators on smaller partitions of the data to understand how well your proxy metric of choice aligns with human notions of quality in your application scenario.
Deploy models whose performance is easier to explain to people.
During model optimization and hyperparameter search, we might have multiple equally or similarly accurate model hypotheses. In these cases, consider performance explainability in addition to accuracy when deciding which model to deploy. Performance explainability makes a model more human-centered because it enables people to better understand and anticipate when the model might make mistakes so that the human can take over when needed. In a recent human-centered study, we showed that when a human collaborates with an ML model for decision-making, team performance is significantly affected by how well people can understand and anticipate a model’s error boundary (i.e., where the model makes mistakes vs. where it succeeds).
To identify a model with better performance explainability, consider the following:
- Choose models with high parsimony (i.e., the error boundary can be described with few, reasonably simple chunks of information) and low stochasticity (i.e., those error explanations cleanly separate error instances from success instances).
- To measure perceived parsimony and stochasticity, approximate human mental models of the error boundary (i.e., what a person has learned about where the model errs) by training simple, explainable rule-based classifiers such as decision trees or rule lists on previous interactions. The learned mental models are of course only approximations or simulations, but if they are simple enough we can make sure they do not add misleading assumptions about how people learn. A small sketch follows this list.
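Here is a minimal sketch of that approximation, under the assumption that you have an evaluation set with features a user could plausibly observe and a flag marking where the deployed model erred (both placeholders below). A shallow decision tree stands in for the human mental model: its depth and rule count are a rough proxy for parsimony, and how cleanly it separates errors from successes is a rough proxy for (low) stochasticity.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data: user-observable features and whether the deployed model
# erred on each evaluation example (1 = model was wrong).
X = [[0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 0]]
model_is_wrong = [0, 0, 1, 1, 0, 1]

# A small max_depth keeps the candidate "explanation" of the error boundary
# parsimonious, i.e., describable in a handful of simple rules.
boundary = DecisionTreeClassifier(max_depth=2).fit(X, model_is_wrong)

# How well these rules separate errors from successes is a rough proxy for
# low stochasticity; the printed rules are the explanation itself.
print("Separability:", boundary.score(X, model_is_wrong))
print(export_text(boundary, feature_names=["feature_a", "feature_b"]))
```

Comparing such surrogate trees across equally accurate candidate models can help pick the one whose error boundary people are most likely to learn and anticipate.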
In the future, we hope to see more work on enabling better model selection integrated as part of model optimization and training by either augmenting loss functions or via constrained minimization, with the goal of training models for which humans can develop informed and justified trust.
Consider the cost and risk of mistakes when tuning model parameters.
Although a given system may make many types of mistakes, they may not be equally likely, and they may be associated with different application-related costs. For instance, in many medical applications the cost of a false negative can be much higher than that of a false positive, especially if the consequences of non-treatment when the disease is present are more life-threatening for the patient than the corresponding side effects. As you might have already guessed, explicitly estimating such costs and risks is particularly necessary for high-stakes decision-making. A good practice for risk estimation is to conduct user research or pilot studies before the application is deployed, to anticipate the impact of potential mistakes.
Cost estimates can be used to inform the tuning of model parameters. However, most models today are trained generically and are not customized to the costs of a particular domain, mainly because such costs are often unknown to developers and may change from one customer to another. Therefore, most models are trained on standard approximations of the simplistic 0/1 loss, in the hope that they will serve general applications. With these estimation difficulties in mind, it is still useful to note that, at least when the whole domain shares similar and known non-uniform costs of mistakes, techniques like cost-sensitive learning or importance sampling can help capture the sensitivity of different examples. For example, if false negatives are more costly than false positives, one can assign more loss to those cases during optimization, as sketched below. Other techniques include over- (or under-) representing particular classes of instances or weighting examples close to the decision boundary differently.
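For instance, here is a minimal sketch of cost-sensitive training via class weights in scikit-learn. The data is illustrative, and the 10:1 penalty on missing the positive class is a made-up stand-in for whatever cost estimates your domain research produces.

```python
from sklearn.linear_model import LogisticRegression

# Placeholder data: a single feature and a binary label where class 1 is the
# "costly to miss" class (e.g., disease present).
X = [[0.10], [0.35], [0.40], [0.45], [0.55], [0.75], [0.80], [0.90]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Uniform 0/1 loss: every mistake counts the same.
default_model = LogisticRegression().fit(X, y)

# Cost-sensitive variant: errors on class 1 are weighted 10x, pushing the
# optimizer to trade extra false positives for fewer false negatives.
cost_sensitive = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

print("Default decision at x=0.5:        ", default_model.predict([[0.5]]))
print("Cost-sensitive decision at x=0.5: ", cost_sensitive.predict([[0.5]]))
```

Equivalent effects can be achieved by passing per-example `sample_weight` values to `fit`, which is handy when costs vary by instance rather than by class.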
When costs vary across customers, there are additional complexities related to model deployment, orchestration, and long-term maintenance. Nevertheless, with ongoing advances in cloud deployment services (e.g., MLOps on AzureML), the task has become more approachable. These services typically containerize the different model versions and serve them to customers via separate endpoints.
Calibrate and explain uncertainty.
The techniques we have presented so far are most relevant for describing and measuring system performance at a global level or over a set of instances. However, since model performance can vary from instance to instance, expressing performance on individual instances during interaction can also help set appropriate expectations for end-users (e.g., conveying model uncertainty on individual predictions). Model calibration aims to associate machine learning predictions with confidence scores that reflect the true probability of error: if a calibrated model recognizes a traffic light in an image with 95% confidence, then, over a large number of such predictions, the chance of failure is indeed 5%.
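A quick way to check whether your scores behave this way is to bucket predictions by confidence and compare each bucket's mean confidence against the observed outcome, as in this minimal sketch using scikit-learn's `calibration_curve`. The labels and probabilities are placeholders for your own validation data.

```python
from sklearn.calibration import calibration_curve

# Placeholder validation data: binary outcomes and the model's predicted
# probability of the positive class for each example.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.7, 0.4, 0.95, 0.1, 0.6, 0.85, 0.3]

# For a well-calibrated model, the observed fraction of positives in each
# confidence bucket tracks the bucket's mean predicted probability
# (e.g., ~90% of the ~0.9-confidence predictions are actually positive).
frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=5)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"predicted ~{pred:.2f}  ->  observed {obs:.2f}")
```

Plotting these pairs against the diagonal (a reliability diagram) makes systematic over- or under-confidence easy to spot.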
Many out-of-the-box machine learning algorithms today do not come with calibrated uncertainties as a default property. A few examples include Naïve Bayes models, SVMs and even Neural Networks. Some approaches you can use for uncertainty calibration include:
- Post-hoc techniques (e.g., Platt scaling or isotonic regression), which do not change how the model is trained but post-process the model’s uncertainty predictions so that the output probability better reflects errors. Fortunately, such techniques are readily available in popular ML libraries such as scikit-learn (a short sketch after this list shows post-hoc calibration combined with coarse bucketing). If you are wondering how these techniques apply to deep learning, this survey provides a comprehensive summary.
- In-built techniques (e.g., bootstrapping or dropout for uncertainty estimation), which are often tailored to particular model classes but can also be used in broader contexts. For example, dropout regularization, bootstrapping, and ensembles have been shown to improve uncertainty estimates. While some of these methods come with higher computational requirements (e.g., ensembles of deep networks), they are still worth considering, especially for high-stakes domains.
- When fine-grained uncertainty estimation is difficult and data is sparse, “rough”, coarse-grained calibration may be better than no calibration at all. After all, people will not differentiate much between a confidence of 75% and one of 76%, but their decisions may change drastically if confidence moves from 75% to 90%. Coarse-grained calibration can be combined with post-hoc techniques by mapping original model output scores to larger confidence buckets. Differentiating between these buckets still helps phrase answers differently when the system is unsure, e.g., “I do not know how to answer this question yet.” or “I am not sure, but I think the answer might be …”
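The sketch below combines the first and last points: post-hoc Platt scaling via scikit-learn's `CalibratedClassifierCV`, followed by mapping the calibrated score to coarse buckets that drive phrasing. The data, the bucket thresholds, and the wording are illustrative assumptions, not recommendations.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Placeholder training data for an uncalibrated base model.
X = [[0.1], [0.2], [0.35], [0.4], [0.7], [0.8], [0.9], [0.95]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Wrap the uncalibrated model; method="sigmoid" is Platt scaling,
# method="isotonic" would use isotonic regression instead.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2).fit(X, y)
confidence = calibrated.predict_proba([[0.5]])[0].max()

# Coarse buckets: users react to "unsure" vs. "fairly sure", not 75% vs. 76%.
if confidence < 0.5:
    print("I do not know how to answer this question yet.")
elif confidence < 0.8:
    print("I am not sure, but I think the answer might be ...")
else:
    print("The answer is ...")
```

In practice the bucket boundaries themselves deserve validation with users, since they directly shape how much people rely on the system's answers.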
Even though model confidence/uncertainty is a succinct way of expressing expected performance, it is important to be aware of two main challenges that may appear in practice:
- Uncertainty explanation – Uncertainty scores used in production may not always be easy to interpret, especially for systems with high-dimensional, rich output. Take, for instance, an image captioning system that provides scene descriptions to a visually impaired user. The system has given the caption “A group of people sitting around a table and having dinner” and is 80% confident. How should the user understand this confidence? Does it mean there may be no people in the scene at all? Or that they are not having dinner but doing something else? In such cases it is crucial to explain to users what the output score refers to. Although summarizing uncertainty for richer output is still not a well-understood problem, one alternative is to highlight high uncertainty for specific chunks of the output (when available) to guide attention in the right direction; a small sketch after this list illustrates the idea.
- Training data vs. real-world distributions – Even with the above techniques, it is important to remember that, like every other learning problem, confidence scores are only going to be as good as the training data. When there are large gaps between real-world data and what the model has actually seen during training, confidence scores may not be aligned with accuracy despite our best calibration efforts. To this end, the ML community is pushing in important directions: detecting out-of-distribution examples and calibrating under dataset shift, identifying unknown unknowns with humans in the loop, and building adversarial robustness against malicious data shifts. But this certainly remains one of the most challenging problems in learning.
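To illustrate the chunk-highlighting idea from the first bullet above, here is a tiny sketch. The caption chunks, their per-chunk confidences, and the threshold are hypothetical outputs and choices, assuming the underlying model can expose confidences at the level of phrases or tokens.

```python
# Hypothetical per-chunk confidences from a captioning model.
caption = [("a group of people", 0.96),
           ("sitting around a table", 0.91),
           ("having dinner", 0.55)]

UNCERTAIN = 0.7  # illustrative threshold for flagging a chunk

# Mark low-confidence chunks so the user's attention goes to the part of the
# output the system is least sure about.
rendered = " ".join(f"[{chunk}?]" if conf < UNCERTAIN else chunk
                    for chunk, conf in caption)
print(rendered)   # -> a group of people sitting around a table [having dinner?]
```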
Other considerations
This post presented several strategies for better understanding your AI’s performance and capabilities. But before deploying AI in your application, you should also consider how best to present information about performance and capabilities to your target users, who most likely know very little about AI. For example, as noted above, uncertainty scores used in production may not always be easy for end-users to interpret, especially for systems with high-dimensional, rich output.
In trying to set the right expectations of AI systems, most of the approaches suggested in this post focus on explanation-based techniques supported by a better understanding of an AI’s capabilities and limitations. However, documentation of any kind comes with an important shortcoming: most people don’t read documentation! In high-stakes scenarios, people may be required or incentivized to do so, but this suggests an opportunity for the community to get creative in exploring alternative ways to set expectations of AI systems. For example, recent work has examined exposing model parameters directly to end-users to allow them to experiment with different performance capabilities of an AI. This not only helps them understand how an AI will perform during regular use, but it can also give people a sense of agency in determining how they may be impacted by an AI.
Finally, making clear what a system can do and how well it can do it becomes increasingly more challenging when an AI learns and changes over time. As people interact with a system over time their expectations may change or even the definition of the task they want to solve may change. Such dynamics are hard to track with static proxy measures, and we will cover these challenges in future posts focusing on handling changes over time.
Summary
This post presented practices that machine learning and engineering practitioners can use to set the right user expectations about what an AI system can do and how well it can do it. Since it is often hard to distinguish fictional hype from actual functionality, it is responsible practice to explain the expected quality of a product as clearly as possible. While this may still be difficult for data-intensive learned systems, machine learning and engineering practices like the ones above, and hopefully others to emerge in the future, can help us convey the right message and build well-justified trust.
Do you have your own practices that you want to share with the community?
Feel free to comment below or write us at [email protected].