Tackling sign language data inequity

Image: a blue-to-green gradient with two rows of hands. The top row signs "ASL" and the bottom row signs "Data."

Access to information is considered a human right by many global organizations and governments. But even though at least 71 countries mandate the provision of services in sign language, most information resources (like search engines or news sites) are presented in written language only. Sign languages are the primary means of communication for about 70 million d/Deaf people worldwide, and are also used by hearing family members, friends, and colleagues.

While over 300 sign languages are in use worldwide, American Sign Language (ASL) is the primary sign language used in the United States. For many deaf people, English and other written languages are actually secondary languages. Requiring deaf signers to navigate information in a written language like English forces them to operate in a different language, one in which they may not be fluent. Adapting text resources for sign language input and output introduces significant technical challenges. Automatically recognizing or translating sign language could help expand access, but AI development has been held back by a lack of high-quality data.

To help make technical systems more accessible to people with disabilities, Danielle Bragg, a senior researcher at Microsoft Research, has been leading efforts to build systems that better support sign language. This blog post provides an update on the team's progress, with a focus on their recent paper: ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition, which introduces the first crowdsourced dataset for isolated sign language recognition. Advancing the state of the art in sign recognition, the project demonstrates that community-centered data curation is not only the right thing to do, but also advances machine learning.

Limitations of prior datasets

ASL Citizen supports machine learning methods to overcome limitations of prior Isolated Sign Language Recognition (ISLR) datasets. Model development typically requires a large, high-quality training set (i.e., one with a large vocabulary, minimal label noise, and representation of diverse signers and environments). A lack of appropriate sign language data collected with consent has been a major limitation to the development of real-world sign language systems.

Prior sign language datasets have been collected in two main ways: 1) by scraping the internet for videos or 2) by inviting people to a lab for recording. While scraping can result in large collections, the videos are typically collected without consent from video creators, and scraping violates many websites’ terms of service. On the other hand, lab collections typically come with written consent from participants, but they are generally small, limited by the human hours required to record participants and the small pool of potential contributors located nearby. Lab collections also fail to capture diverse real-world settings, and it is difficult to identify and label content in scraped videos. To enable real-world sign language AI, sign language datasets also need to capture real-world settings, include diverse people, and be accurately labeled.

Designing the ASL Citizen collection

To overcome the limitations of past datasets, the research team designed a novel sign language crowdsourcing platform. The platform was web-based and enabled people who wanted to contribute to log in, complete a consent process, and record videos. Web collection opened the project to a larger, more diverse audience, including anyone with internet access. It also enabled capturing everyday environments, which real-world systems need to learn to handle. By letting people contribute wherever and whenever they want, while still providing an explicit consent process, crowdsourcing enables scale with consent.

The platform design also solved labeling problems. Labeling challenges have limited past dataset size due to the large amount of time required to identify video contents. For example, for a dataset of sign language monologues, each video must be carefully watched, and the signed contents must be identified, marked down, and time-aligned with the video. Only highly skilled experts can annotate sign language videos, and the process often takes more time and resources than the video collection itself. To completely avoid such labeling overhead, the platform was designed to collect pre-labeled contents. It accomplishes this by prompting users with sign videos that have known contents and asking them to record their own version. This enabled the team to automatically label each user-created video with the contents of the prompt video it responded to.
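The prompt-and-record idea can be sketched in a few lines of Python. This is only an illustration under assumed names: `Prompt`, `Recording`, `LabeledRecording`, `label_recording`, and the example glosses and IDs are hypothetical and are not the platform's actual data model. Each recording inherits its label from the prompt video the contributor was responding to, so no one has to annotate the video afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    prompt_id: str
    gloss: str  # known content of the prompt video (hypothetical field)


@dataclass(frozen=True)
class Recording:
    video_id: str
    signer_id: str
    prompt_id: str  # the prompt the contributor was shown before recording


@dataclass(frozen=True)
class LabeledRecording:
    video_id: str
    signer_id: str
    gloss: str


def label_recording(recording, prompts):
    """Copy the label from the prompt that the recording responds to."""
    prompt = prompts[recording.prompt_id]  # raises KeyError for an unknown prompt
    return LabeledRecording(recording.video_id, recording.signer_id, prompt.gloss)


# Toy data, invented for the example.
prompts = {p.prompt_id: p for p in [Prompt("p1", "APPLE"), Prompt("p2", "BOOK")]}
recordings = [
    Recording("v1", "s1", "p1"),
    Recording("v2", "s2", "p2"),
    Recording("v3", "s2", "p1"),
]
labeled = [label_recording(r, prompts) for r in recordings]
```

In this sketch the label is fixed at collection time, which is why the cost of labeling does not grow with the number of contributed videos.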

The platform was community-focused in multiple ways. All website content was presented in both English and ASL; the recording prompts were in ASL; the project goals were shared explicitly; participants were able to verify one another’s contributions; and the community-sourced dataset was made available as a community resource in the form of a dictionary.

In designing the platform, the team engaged in an iterative process, incorporating feedback from community stakeholders and testers, and ran a pilot study with the platform to better understand the user experience and quality of collected data. For more about the platform design and pilot study, see: Exploring Collection of Sign Language Videos through Crowdsourcing. The team also experimented with crowdsourcing videos of complete sentences, which are required for more complete sign language modeling, as discussed in ASL Wiki: An Exploratory Interface for Crowdsourcing ASL Translations.

Improving models using ASL Citizen

A diverse group of experts helped bring ASL Citizen to life. Engineers and a designer at Microsoft helped build and scale the platform design; a well-known ASL professional recorded the prompt videos; and collaborators at Boston University’s Deaf Center provided feedback and managed participant recruitment and engagement. Deaf research team members were involved throughout. Consisting of about 84,000 videos of 2,700 distinct signs from ASL, the resulting dataset is the largest labeled ISLR dataset and the first crowdsourced ISLR dataset.

Using the new dataset, the researchers adapted previous approaches to ISLR to the real-world task of looking up signs in a dictionary, and released a set of baselines for machine learning researchers to build upon, focusing on supervised deep learning methods. To establish baseline models, the team partnered with collaborators from the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Comparison to prior datasets was difficult because each dataset covers a different vocabulary. However, in a comparison with the best prior dataset that used only the overlapping vocabulary and one baseline, the new dataset boosts accuracy from 16% to 71%. Even without algorithmic advances, training and testing on ASL Citizen improves ISLR accuracy compared to prior work, despite spanning a larger vocabulary and testing on completely unseen users.
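Two evaluation ideas in this paragraph, testing on unseen users and treating recognition as dictionary lookup, can be made concrete with a small Python sketch. It is an illustration, not the paper's evaluation code: `split_by_signer`, `top_k_accuracy`, and the toy data are hypothetical, and scoring a lookup by whether the correct sign appears among the top k ranked candidates is an assumption about how such a task is commonly measured.

```python
def split_by_signer(samples, test_signers):
    """Split (signer_id, gloss) pairs so held-out signers appear only in the test set."""
    train = [s for s in samples if s[0] not in test_signers]
    test = [s for s in samples if s[0] in test_signers]
    return train, test


def top_k_accuracy(ranked_predictions, true_glosses, k):
    """Fraction of queries whose true gloss is among the first k ranked candidates."""
    if not true_glosses:
        raise ValueError("no queries to score")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_glosses)
    )
    return hits / len(true_glosses)


# Toy data, invented for the example.
samples = [("s1", "APPLE"), ("s1", "BOOK"), ("s2", "APPLE"), ("s3", "BOOK")]
train, test = split_by_signer(samples, test_signers={"s3"})

# Each inner list is a hypothetical model's ranked dictionary results for one query video.
ranked = [["BOOK", "APPLE", "CAT"], ["CAT", "APPLE", "BOOK"]]
truth = ["BOOK", "BOOK"]
top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```

Splitting by signer rather than by video keeps a model from being rewarded for memorizing a particular person's appearance or background, which is the point of testing on unseen users.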

Building on past efforts

The ASL Citizen project is part of Microsoft’s mission of empowerment and societal impact and part of Bragg’s focus on advancing sign language technology. Doing this effectively requires human-centered, interdisciplinary work. Because sign language is central to Deaf culture and identity, developing successful sign language AI requires not only technical work, but also a deep understanding of the community and alignment of technology design with the community's perspectives.

Prior work informing the ASL Citizen dataset included a workshop on sign language AI at Microsoft in 2019, which convened diverse thought leaders from academia, industry, and the Deaf community to discuss the state of sign language computation. The resulting paper, Sign Language Recognition, Generation, and Translation: An Interdisciplinary Perspective, which won Best Paper at ASSETS 2019, outlined the state of the art, the field’s biggest challenges, and calls to action. This work helped establish data challenges as a major limitation, and highlighted the importance of Deaf community involvement.

Aware of sensitivities to sign language technology development, Bragg and colleagues also mapped out the ethics of sign language AI datasets in a 2021 paper: The FATE Landscape of Sign Language AI Datasets: An Interdisciplinary Perspective. This paper establishes the impact of data choices on models and users, and discusses other complex issues, including data ownership, data sharing, and transparency around sign language AI. To help address such issues, Bragg and collaborators experimented with disguising people’s faces in sign language videos and examined the impact on model performance in a 2020 paper: Exploring Collection of Sign Language Datasets: Privacy, Participation, and Model Performance. Designing sign language data collections to maximize benefits while minimizing harms is hard, and design decisions involve tradeoffs.

To better understand the eventual use cases for sign language AI built using datasets like ASL Citizen, Bragg and collaborators studied community perspectives on sign language AI in another 2023 paper: U.S. Deaf Community Perspectives on Automatic Sign Language Translation, which outlines a survey of Deaf community perspectives on ASL translation.

As large language models and deep learning continue to develop, Bragg expects that high-quality representative training data will become increasingly essential. She hopes that the team’s work can serve as an example of how engaging with communities can also help to advance ML.
