Imagine for a moment that you are blind and are navigating the web using a screen reader to hear websites rather than see them. Imagine that the article you have navigated to includes images. To understand the content and significance, you are relying on your screen reader to narrate the alt text associated with each image, a textual description that should be provided by the web page author.
Now imagine sitting there and hearing the following description of an image:
“slash h 3 f s 0 x u d f 3 0 l 0 6 j f t k a h dot jpeg image”
Unfortunately, low-quality alt text (such as a file name used in place of a caption) and completely absent alt text are quite common, resulting in a poor browsing experience for people who rely on screen reader technology. A team of researchers at Microsoft Research decided to address the issue of missing and poor-quality alt text online using existing technology and a bit of out-of-the-box thinking.
The resulting innovation is Caption Crawler, a prototype browser plugin that allows screen reader users to automatically replace bad or missing descriptions of images on their favorite websites with image captions from other pages that have the same image. The researchers found that this technique can retrieve captions for about 13 percent of images that previously had no alt text at all on popular websites, with even better performance (around 25 percent coverage) on sites in categories such as e-commerce that use commonly replicated images. Caption Crawler can handle multiple captions, queueing up the results in order of quality, and these descriptions are loaded into the browser in the background in real time. The user is then able to toggle to the next reverse-searched caption in the queue using a simple keystroke.
“Technology can be used very effectively to help people but often what happens is we focus on ourselves. Sometimes people get ignored in that process. A lot of our passion is in trying to be more inclusive and in broadening the scope of who benefits from technology – and why.” – Ed Cutrell, Principal Researcher
Caption Crawler must determine how to rank alt texts in the event that multiple alternatives for the same image are discovered online. Through carefully designed questionnaires given to people with and without vision, the team discovered that for any given image the longest caption was overwhelmingly identified by both groups as the best. This allowed the team to design the plugin to queue the alternative captions according to likely quality. A user study revealed that participants who are blind valued the queue of alternatives as a way not only to learn more about an image but also to have more confidence in the accuracy of the captions.
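The longest-first ranking described above can be sketched in a few lines of Python. This is an illustration rather than the team's code: the function name `rank_captions`, the whitespace cleanup, and the case-insensitive de-duplication are assumptions added here, and only the rule that the longest caption comes first is taken from the study.

```python
def rank_captions(captions):
    """Return unique, non-empty captions ordered longest first.

    Hypothetical helper: only the "longest caption first" rule comes from
    the article; the cleanup and de-duplication steps are assumptions.
    """
    seen = set()
    unique = []
    for caption in captions:
        text = " ".join(caption.split())  # collapse runs of whitespace
        key = text.casefold()             # treat "A dog" and "a dog" as the same caption
        if text and key not in seen:
            seen.add(key)
            unique.append(text)
    # sorted() is stable, so captions of equal length keep their discovery order
    return sorted(unique, key=len, reverse=True)


if __name__ == "__main__":
    found = ["A dog", "  ", "A brown dog running on a beach", "a dog", "Dog on beach"]
    for caption in rank_captions(found):
        print(caption)
```

Sorting by length is a deliberately simple proxy for quality; it needs no model or training data, which fits the project's goal of reusing what already exists on the web.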
“What we were trying to do was find a way that would not necessarily require more effort on the part of website authors that were not doing a good job of producing high quality alt text anyway, and seeing if there are other places on the web where they have done that and where we could leverage that to backfill the experience,” explained Dr. Ed Cutrell, Principal Researcher at Microsoft Research in Redmond, Washington.
Cutrell and his fellow researchers, Dr. Meredith Ringel Morris of Microsoft Research and intern Darren Guinness (a doctoral student at the University of Colorado Boulder), share a deep appreciation for the unique challenges faced by screen reader users. All three have been working in the area of accessible technology for some time. This passion resonates personally through their shared desire to work on behalf of folks whom they feel are sometimes forgotten by tech.
Behind the Scenes
When existing captions on other sites are found for an image, they get streamed into the user’s browser extension via a web socket connection. The browser extension dynamically adds the caption to the page in the form of alt text for image elements and aria-labels for background images. Caption Crawler also extracts the alt text and image captions in the DOM while the user is browsing a page using the browser extension. This allows the system to keep improving as more pages are browsed by users. When multiple potential captions for a target image are located, the longest caption is presented first while a queue of all captions found is built; if the user is not satisfied with a caption, they press a shortcut key to access additional captions from the queue.
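To make the queue behavior concrete, here is a small Python model of it. Everything named here is hypothetical: `ImageElement` is a toy stand-in for a DOM node, not a browser API, and `CaptionQueue` is not the extension's actual class. The sketch also assumes that the queue wraps around after the last caption and that a caption arriving after the user has already advanced does not displace what is currently being read; the article does not specify either detail.

```python
from dataclasses import dataclass, field


@dataclass
class ImageElement:
    """Toy stand-in for a page element (an assumption, not a DOM API)."""
    is_background: bool = False
    attributes: dict = field(default_factory=dict)


class CaptionQueue:
    """Longest-first queue of captions for a single image."""

    def __init__(self, element):
        self.element = element
        self.captions = []
        self.position = 0

    def add(self, caption):
        """Handle one caption as it streams in from the server."""
        if not caption or caption in self.captions:
            return
        current = self.captions[self.position] if self.captions else None
        self.captions.append(caption)
        self.captions.sort(key=len, reverse=True)
        if self.position > 0:
            # The user has already advanced: keep pointing at the same caption.
            self.position = self.captions.index(current)
        self._apply()

    def next(self):
        """Shortcut-key handler: move to the next caption (wraps around)."""
        if not self.captions:
            return None
        self.position = (self.position + 1) % len(self.captions)
        self._apply()
        return self.captions[self.position]

    def _apply(self):
        # Image elements get alt text; background images get an aria-label.
        attr = "aria-label" if self.element.is_background else "alt"
        self.element.attributes[attr] = self.captions[self.position]


if __name__ == "__main__":
    img = ImageElement()
    queue = CaptionQueue(img)
    queue.add("A dog")
    queue.add("A brown dog running on a beach")
    print(img.attributes["alt"])  # longest caption found so far
    print(queue.next())           # what the shortcut key would move to
```

In a real extension the `add` calls would be driven by messages arriving on the socket connection, and `_apply` would write to the live page so that the screen reader picks up the change.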
The importance of the ability to hear multiple caption options was not clear until user testing. As part of its debugging interface, the team had created the shortcut key that would allow hearing more than one caption for a given image if Caption Crawler located additional alt text online. This ability to hear multiple descriptions of a single image in fact delighted users who are blind or have low vision, as they discovered that each caption in the queue contributed additional and different types of information and detail. The researchers also noticed how this added to the users’ confidence – for example, confidence that the captions were accurate if they tended to corroborate each other.
Caption Crawler automatically supplies captions when alt text for an image is missing entirely. In the case of poor quality alt text, the user simply presses a keyboard shortcut to request a replacement of the alt text with a Caption Crawler queued caption. The screen reader observes the change and automatically speaks the new caption.
When Caption Crawler is unable to find a pre-existing caption for an image on the web, it requests a computer-generated caption from the CaptionBot API (part of Microsoft Cognitive Services), which uses computer vision to describe an image. When the text from CaptionBot is read aloud, the screen reader first speaks the word “CaptionBot” so that the user is aware that this is not a human-authored caption.
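The decision flow in the last two paragraphs can be summarized as a short Python sketch. The function `choose_caption`, its parameters, and the exact "CaptionBot: " prefix format are assumptions made for illustration. The machine captioner is passed in as a plain callable (`describe_image`) so that the sketch does not guess at the real CaptionBot API. Whether a replacement request falls back to CaptionBot when no crawled caption exists is not stated in the article; the sketch assumes that it does.

```python
def choose_caption(existing_alt, crawled_captions, describe_image,
                   replace_requested=False):
    """Pick the text a screen reader should speak for one image.

    existing_alt      -- alt text already on the page (may be None or empty)
    crawled_captions  -- human-authored captions found on other pages
    describe_image    -- hypothetical callable returning a machine caption or None
    replace_requested -- True when the user pressed the replacement shortcut
    """
    has_alt = bool(existing_alt and existing_alt.strip())
    if has_alt and not replace_requested:
        # Existing alt text is kept until the user asks for a replacement.
        return existing_alt
    if crawled_captions:
        # Prefer human-authored captions, longest first.
        return max(crawled_captions, key=len)
    generated = describe_image()
    if generated:
        # Flag machine-generated text so the listener knows its origin.
        return f"CaptionBot: {generated}"
    # Nothing better was found; fall back to whatever the page had.
    return existing_alt if has_alt else None


if __name__ == "__main__":
    fake_captioner = lambda: "a dog standing on grass"  # stand-in for a vision service
    print(choose_caption(None, [], fake_captioner))
    print(choose_caption("h3fs0.jpeg", ["A dog", "A brown dog"], fake_captioner,
                         replace_requested=True))
```

Keeping the captioner behind a callable also makes the human-first ordering easy to see: the machine caption is only requested after the crawled captions come up empty.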
Morris points to the pervasiveness of digital imagery, with billions of images being posted daily across a variety of media, as a key motivator for research on automated techniques for creating and improving image descriptions. “Engaging with this digital imagery is part of the fabric of participation in contemporary society, including education, the professions, e-commerce, civic life, entertainment, and socializing. High-quality captions empower screen reader users to more effectively engage with this key aspect of modern life.”
Part of the innovation of this project was taking advantage of existing technology, making future implementation by any interested party fairly straightforward. But the other thing Caption Crawler accomplishes is to provide a bridge between the present and the future. Current AI visual description solutions are not yet as high-quality as human-authored descriptions; there will come a day when AI implementations are able to provide high-quality visual descriptions and write excellent captions. Cutrell points out that that time isn’t yet here. What the Caption Crawler team wanted to do is leverage high-quality, human-authored alt text until the day when AI can perform such tasks much better.
“What I love about this research is that it really exemplifies Microsoft’s mission statement of empowering every person to achieve more.” – Meredith Ringel Morris, Principal Researcher
For the most part, Caption Crawler works only for popular images. Private images (images that by definition appear in only one location, such as vacation snaps or images of items on eBay) may or may not include alt text, but they fall outside the purview of Caption Crawler’s raison d’être. But many images, for example those having to do with current events, politics, science, or celebrity movie reviews, are going to appear in multiple places, and Caption Crawler can play a valuable role in these cases.
The team points out that many people who are blind triangulate, that is, use multiple data points to get closer to understanding what they’re encountering on the web. The queue increases their confidence that the info they are getting back from the system is accurate.
“Folks who are blind or low vision are incredibly competent at making sense of the world around them,” says Cutrell. Indeed, people who are blind use all kinds of information to do this effectively. In the case of the web, they rely on textual content, contextual cues, who published or authored the site and what it’s trying to provide. The image is just one little bit of additional information. The Caption Crawler team believes that if it can provide some extra bits of information on top of what screen reader users are already using, they will have a fuller picture of what is on the screen. The team also plans to explore how to match captions for very similar (rather than only identical) images, to further improve the coverage that can be obtained by this approach.
Be sure to check out the team’s paper, to be presented this month at the CHI 2018 conference in Montreal, to see the depth and dedication this project evinces. The video also lets you see Caption Crawler in action and is absolutely worth a watch.
Related Links:
Video: Caption Crawler
Microsoft Research Ability Team