Define your threshold: Communicating confidence in UX scorecards

By Michael Van Waardhuizen

At many large companies like Microsoft, leaders review lots of information about how products are performing. Objectives and key results (OKRs), key performance indicators (KPIs), dashboards, and reports are just a few ways we look at performance. This type of data is what decision makers ask for to understand the status of the products we make.

User research as a discipline often responds by producing quantitative data reported through scorecards, but those scorecards aren’t always accurate or clear in what they report. So let’s take a closer look at how these tools are used and how they can be improved.

Scorecards? Yes, scorecards!

Scorecards have been around for a long time in the industry. A simple web search will reveal dozens of examples of UX scorecards, and numerous textbooks have been written on the subject. Scorecards can vary in many ways, but at the heart of them, we often find:

A traditional scorecard shows various scenarios and measurements

  • A table of data: Tasks, scenarios, or key results are displayed in rows with quantified metrics in columns
  • A set of color codes: Typically on a green to red spectrum, the colors mark a product’s status or severity, possibly with icons, sparklines, or other accessory info

Of course, many variations exist on this theme: charting values or trends over time, presenting multiple metrics as scatter plots, using color gradients or infographic elements, and including more (or less) explanatory text. The list goes on and on, and it’s as long as the charting options in Excel.

Unfortunately, despite all these variations, scorecards frequently let us down in an important way. There are many issues one could point out: the limited focus on a subset of tasks or items, which biases the overview of a product; the equal weighting those tasks or items are typically given; and the inaccessibility of the color schemes frequently used.

These are all important factors that need addressing. But the issue I’m going to address, here and now? False confidence.

In an effort to be simple, most scorecards report very precise-looking numbers that are sorted into colors or grades. Frequently, an item may land close to or right on a boundary yet appear as strong or as severe as any other score in that grade. This is deceptive because, if we’re being honest, we seldom have enough data to confidently state metrics as precisely as scorecards portray them.

A scorecard sorted by color and grades may appear strong but can be deceptive

Take the above image, for example. Task 13 has a 79 percent marked as yellow, while Task 14 has an 82 percent marked as green. However, with the sample size used for that data, no real (statistically significant) difference existed between the two.

These sorts of differences are problematic for a decision-maker who may be choosing where to invest or whether to release a product, and they undermine the reliability and reputation of UX research data. Fortunately, there is an easy, well-established answer: confidence intervals.
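
To see why the 79-versus-82 split is illusory, here’s a minimal sketch of the arithmetic. The success counts and the sample size of 38 participants are hypothetical (the post doesn’t report them), and the Wilson score interval is just one common way to compute a CI for a proportion:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives ~95% confidence."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# Hypothetical counts: 30/38 ≈ 79% success vs. 31/38 ≈ 82% success
for task, (ok, n) in {"Task 13": (30, 38), "Task 14": (31, 38)}.items():
    lo, hi = wilson_ci(ok, n)
    print(f"{task}: {ok/n:.0%} success, 95% CI [{lo:.0%}, {hi:.0%}]")
```

Both intervals run from roughly the mid-60s to around 90 percent and overlap almost entirely, so the yellow/green split reflects the cutoff, not the data.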

Confidence in sunshine; confidence in rain

I’m not going to go into how confidence intervals are calculated—there are many resources out there for that. Instead, the question for us to explore is, “How do we communicate confidence in scorecards?”

Researchers have tried myriad options, with their own strengths and limitations. Here are a few of them:

  • Tabular ranges: One could replace the means in the table with the range of the data, but ranges can be hard to read, and margins of error may be large for some data sets
  • Confidence interval (CI) footnotes: The CI may be included in small text nearby, but that’s hard to read and easy to ignore
  • Bar charts with error bars: This common approach works well but can become difficult to read as the number of metrics and/or tasks increases (see the sketch after this list); box-and-whisker plots have similar issues
  • Violin plots: Though powerful for showing relative probability, violin plots can be harder to generate and share the limitations of regular bar charts
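
As an illustration of the error-bar option, here’s a minimal matplotlib sketch. The task names, means, and margins of error are hypothetical placeholders, not data from any of the scorecards shown:

```python
import matplotlib.pyplot as plt

# Hypothetical per-task means and 95% CI half-widths (margins of error)
tasks = ["Task 1", "Task 2", "Task 3", "Task 4"]
means = [72, 85, 64, 90]   # e.g., perceived ease of use, on a 0-100 scale
margins = [9, 6, 12, 5]    # half-width of each task's 95% CI

fig, ax = plt.subplots()
ax.bar(tasks, means, yerr=margins, capsize=6,
       color="lightgray", edgecolor="black")
ax.set_ylabel("Perceived ease of use (0-100)")
plt.show()
```

With four tasks this is perfectly readable; with forty tasks and five metrics, it quickly stops being an at-a-glance view.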

One solution we’ve tried has shown some promise for improving the at-a-glance understanding of confidence. Let’s transform our data slightly to make reporting easy.

Crossing the threshold

An easy way to simplify confidence reporting is to define a threshold: a line in the sand for what a “good” score is. Thresholds can be set a few ways. For rating questions, I might focus on the label text of an option: “Somewhat satisfied” is acceptable to me, but “Neither satisfied nor dissatisfied” is not. We can also use previous studies and data to correlate metrics and see where a good threshold on one metric corresponds to a threshold on another. Once we have defined a threshold for a metric, we are able to simplify the metric:

The green line denotes the threshold, which divides the confidence into three states

The green threshold line above sorts each confidence interval into one of three states: entirely above the line (good!), entirely below the line (boo!), or straddling the line (umm?). We now have three states to report: clearly passing, clearly failing, and indeterminate.
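
In code, the classification is just two comparisons. Here’s a minimal sketch that assumes the interval endpoints have already been computed (for example, with a Wilson interval as above); the 75 percent threshold and the example intervals are hypothetical:

```python
def threshold_state(ci_low, ci_high, threshold):
    """Classify a confidence interval against a threshold."""
    if ci_low >= threshold:
        return "pass"           # entire CI above the line: clearly passing
    if ci_high < threshold:
        return "fail"           # entire CI below the line: clearly failing
    return "indeterminate"      # CI straddles the line

# Hypothetical 95% CIs against a 75% success threshold
print(threshold_state(0.78, 0.92, 0.75))  # pass
print(threshold_state(0.55, 0.70, 0.75))  # fail
print(threshold_state(0.70, 0.83, 0.75))  # indeterminate
```

Whether an interval that lands exactly on the threshold counts as passing is a judgment call; the >= here is one choice, not a rule.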

Earlier, I noted that color-coding grades is very common in scorecards. Now, we can co-opt the common color codes to communicate confidence clearly (whew!). Here I have assigned passing scores the color green, failing scores the color red (or another, more accessible color scheme), and have left indeterminate scores uncolored.

A scorecard reporting tasks, the time spent on each one, and success rates, colored by confidence state

Alternatively, if comparing scores against a competitor or previous benchmark, I use icons to communicate both confidence and direction.

A scorecard using arrow icons to show confidence and direction for each task
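
Here’s a minimal sketch of how such icons might be assigned. This is one plausible approach, a normal-approximation CI on the difference of two proportions, not necessarily the method behind the scorecard above, and the counts are hypothetical:

```python
import math

def diff_ci(ok_now, n_now, ok_then, n_then, z=1.96):
    """Approximate 95% CI for the difference of two proportions (current - benchmark)."""
    p1, p2 = ok_now / n_now, ok_then / n_then
    se = math.sqrt(p1 * (1 - p1) / n_now + p2 * (1 - p2) / n_then)
    d = p1 - p2
    return d - z * se, d + z * se

def direction_icon(ok_now, n_now, ok_then, n_then):
    lo, hi = diff_ci(ok_now, n_now, ok_then, n_then)
    if lo > 0:
        return "▲"  # confidently above the benchmark
    if hi < 0:
        return "▼"  # confidently below the benchmark
    return "◆"      # indeterminate: the CI on the difference spans zero

# Hypothetical counts: current release vs. previous benchmark, 40 users each
print(direction_icon(34, 40, 25, 40))  # ▲ clearly improved
print(direction_icon(28, 40, 27, 40))  # ◆ no clear change
```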

In both cases, there are several useful properties:

  • Colors provide at-a-glance confirmation of scores that are rooted in the statistical confidence of the metric
  • Colors and icons live happily in a table, allowing large numbers of tasks/scenarios and metrics to be reported without visually overloading the audience, which improves understanding and impact
  • The transformation of the data does not rely on complex statistical methods that are time-consuming to explain to product team stakeholders who may have less statistical training

There are many different ways to visualize evaluation results, each with its own trade-offs, and the other scorecard limitations noted earlier still apply. But it’s important for researchers to present data that communicates clearly and accurately. This visualization method fits within most existing scorecard presentations and tells your product team which insights to take most seriously and address first. Be confident in your data!

Do you use scorecards at your organization? If so, how do you measure and define your thresholds? Tweet us your thoughts @MicrosoftRI or join the conversation on Facebook and share your comments.


Michael Van Waardhuizen is a Senior Research Manager for a horizontal research team, providing qualitative and quantitative research-as-a-service to help other researchers and product teams scale up and conduct high-quality research faster. He is also a builder of tools, processes, and backends for scaling research. Previously, he conducted applied research to improve the user experience of SharePoint, Office 365, OneDrive, and Windows Mixed Reality.