By Michael Van Waardhuizen
At many large companies like Microsoft, leaders review lots of information about how products are performing. Objectives and key results (OKRs), key performance indicators (KPIs), dashboards, and reports are just a few ways we look at performance. This type of data is what decision makers ask for to understand the status of the products we make.
User research as a discipline often responds by developing quantitative data reported through scorecards, but those scorecards aren't always accurate or clear in what they report. So let's take a closer look at how these tools are used and how they can be improved.
Scorecards? Yes, scorecards!
Scorecards have been around for a long time in the industry. A simple web search will reveal dozens of examples of UX scorecards, and numerous textbooks have been written on the subject. Scorecards can vary in many ways, but at the heart of them, we often find:
- A table of data: Tasks, scenarios, or key results are displayed in rows with quantified metrics in columns
- A set of color codes: Typically on a green to red spectrum, the colors mark a product’s status or severity, possibly with icons, sparklines, or other accessory info
Of course, many variations exist on this theme: charting raw values, plotting trends over time, presenting multiple metrics as scatter plots, using color gradients, adding infographic elements, including more (or less) explanatory text. The list goes on and on, and it's about as long as the list of options in Excel.
Unfortunately, despite all these variations, scorecards frequently let us down in an important way. There are many issues one could point out: the limited focus on a subset of tasks/items, which biases the overview of a product; the equal weighting typically given to tasks/items; and the inaccessible color schemes frequently used.
These are all important factors that need addressing. But the issue I’m going to address, here and now? False confidence.
In an effort to be simple, most scorecards report very precise-looking numbers that are sorted into colors or grades. Frequently, items land close to or right on a boundary yet appear just as strong or severe as any other score in that grade. This is deceptive because, if we're being honest, we seldom have enough data to state metrics as precisely as scorecards portray them.
Take the above image, for example. Task 13 has a 79 percent marked as yellow, while Task 14 has an 82 percent marked as green. However, with the sample size used for that data, there was no real (statistically significant) difference between the two.
These sorts of differences are problematic for a decision-maker who may be choosing where to invest or whether to release a product, and they undermine the reliability and reputation of UX research data. Fortunately, there is an easy, well-established answer: confidence intervals.
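To make that concrete, here is a minimal sketch using Wilson score intervals, one common way to put a 95 percent confidence interval around a task success rate. The counts (26 and 27 successes out of a hypothetical 33 participants) are illustrative values that round to 79 percent and 82 percent; they are not the study's actual data.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# Hypothetical counts that round to the reported percentages
task_13 = wilson_interval(successes=26, n=33)  # ~79% success rate
task_14 = wilson_interval(successes=27, n=33)  # ~82% success rate
print(f"Task 13: {task_13[0]:.0%} to {task_13[1]:.0%}")
print(f"Task 14: {task_14[0]:.0%} to {task_14[1]:.0%}")
```

With a sample that size, the two intervals overlap substantially, so the yellow-versus-green distinction implies a difference the data cannot support.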
Confidence in sunshine; confidence in rain
I’m not going to go into how confidence intervals are calculated—there are many resources out there for that. Instead, the question for us to explore is, “How do we communicate confidence in scorecards?”
Researchers have tried myriad options, each with its own strengths and limitations. Here are a few of them:
One solution we’ve tried has shown some promise for improving the at-a-glance understanding of confidence. Let’s transform our data slightly to make reporting easy.
Crossing the threshold
An easy way to simplify confidence reporting is to define a threshold: a line in the sand for what a "good" score is. Thresholds can be set a few ways. For rating questions, I might focus on the label text of an option: "Somewhat satisfied" is acceptable to me, but "neither satisfied nor dissatisfied" is not. We also use previous studies and data to correlate metrics, so a well-established threshold on one metric can suggest a corresponding threshold on another. Once we have defined a threshold for a metric, we are able to simplify the metric:
The threshold line in green above divides each score's confidence interval into three possible positions: entirely above the line (good!), entirely below the line (boo!), and straddling the line (umm?). We now have three states to report: clearly passing, clearly failing, and indeterminate.
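As a rough illustration, the sketch below classifies a task's success rate against an assumed 70 percent threshold. The threshold value, the sample sizes, and the choice of a Wilson interval are all illustrative assumptions, not a prescribed method.

```python
from math import sqrt

THRESHOLD = 0.70  # assumed "good enough" success rate for this example

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

def classify(successes: int, n: int) -> str:
    low, high = wilson_interval(successes, n)
    if low >= THRESHOLD:
        return "pass"           # whole interval above the line
    if high <= THRESHOLD:
        return "fail"           # whole interval below the line
    return "indeterminate"      # interval straddles the line

print(classify(successes=28, n=30))  # clearly above 70% -> pass
print(classify(successes=12, n=30))  # clearly below 70% -> fail
print(classify(successes=22, n=30))  # straddles 70%    -> indeterminate
```

Anything the data cannot clearly place on one side of the line stays in the indeterminate state, which is exactly what we want to surface rather than hide.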
Earlier, I noted that color-coding grades is very common in scorecards. Now, we can co-opt the common color codes to communicate confidence clearly (whew!). Here I have assigned passing scores the color green, failing scores the color red (or another, more accessible color scheme), and have left indeterminate scores uncolored.
Alternatively, if comparing scores against a competitor or previous benchmark, I use icons to communicate both confidence and direction.
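For the comparative case, a sketch like the following could drive those icons, assuming a success-rate metric and a simple two-proportion z-test against a previous benchmark. The test choice, the counts, and the icon characters are illustrative assumptions, not the team's exact method.

```python
from math import sqrt

def compare(successes_now: int, n_now: int,
            successes_then: int, n_then: int, z_crit: float = 1.96) -> str:
    """Classify the current score against a benchmark via a two-proportion z-test."""
    p_now, p_then = successes_now / n_now, successes_then / n_then
    pooled = (successes_now + successes_then) / (n_now + n_then)
    se = sqrt(pooled * (1 - pooled) * (1 / n_now + 1 / n_then))
    z = (p_now - p_then) / se
    if z >= z_crit:
        return "▲ better than benchmark"
    if z <= -z_crit:
        return "▼ worse than benchmark"
    return "– no detectable difference"

print(compare(successes_now=27, n_now=30, successes_then=18, n_then=30))
print(compare(successes_now=24, n_now=30, successes_then=25, n_then=30))
```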
In both cases, there are several useful properties:
- Colors provide an at-a-glance read on scores that is rooted in the statistical confidence of the metric
- Colors and icons live happily in a table, allowing a large number of tasks/scenarios and metrics to be reported without visually overloading the audience, improving understanding and impact
- The transformation of the data does not rely on complex statistical methods that are time-consuming to explain to product team stakeholders who may have less statistical training
There are many different ways to visualize evaluation results, each with its own trade-offs, and the other scorecard limitations noted earlier still apply. But it's important for researchers to present data that communicates clearly and accurately. With this visualization method, your product team will know which insights to take most seriously and address first, and it fits within most existing scorecard presentations. Be confident in your data!
Do you use scorecards at your organization? If so, how do you measure and define your thresholds? Tweet us your thoughts @MicrosoftRI or join the conversation on Facebook and share your comments.
Michael Van Waardhuizen is a Senior Research Manager for a horizontal research team, providing qualitative and quantitative research-as-a-service to help other researchers and product teams scale up and conduct high-quality research faster. He is also a builder of tools, processes, and backends for scaling research. Previously, he conducted applied research to improve the user experience of SharePoint, Office 365, OneDrive, and Windows Mixed Reality.