To help advance the state of sign language modeling, we created ASL STEM Wiki, the first continuous signing dataset focused on Science, Technology, Engineering, and Math (STEM). The corpus contains 254 Wikipedia articles on STEM topics in English, interpreted into over 300 hours of American Sign Language (ASL). Beyond its size and topic, it differs from many prior datasets in two ways: it contains videos of professional signers, including many CDIs (Certified Deaf Interpreters), and it was collected with consent from each contributor under IRB approval. Deaf research team members were involved throughout.
This dataset is released alongside our paper identifying several use cases for ASL STEM Wiki and providing baselines for one of these tasks — fingerspelling detection and identification. Because the dataset focuses on STEM, and STEM terminology often lacks standardized signs, fingerspelling of technical terms appears frequently in our dataset. To help identify fingerspellings, we provide models for fingerspelling detection and alignment, and release benchmark performance on the ASL STEM Wiki dataset for the research community to build on. Our models highlight the difficulty of the detection and alignment task, and provide the first evidence that self-supervised contrastive pretraining can improve fingerspelling detection.
Our dataset powers a small bilingual resource for students, pairing the full English text of each STEM article with its professional ASL interpretation. Students and other readers can use it to view spot-translations of selected sentences or to play through entire articles. We release this resource as well.
This project was conducted at Microsoft Research with collaborators.
- Microsoft: Danielle Bragg (PI), Hal Daumé III, Alex Lu, Vanessa Milan, Fyodor Minakov, Chinmay Singh, Cyril Zhang
- University of California, Berkeley: Kayo Yin
Dataset License: Please see the supporting tab. If you are interested in commercial use, please contact [email protected].
Dataset Download:
To download via web interface, please visit: Download ASL STEM Wiki from Official Microsoft Download Center
To download via command line, please execute: wget https://download.microsoft.com/download/4/c/f/4cfec788-7478-4e47-9a15-ace9b6a96198/ASL_STEM_Wiki.zip
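For scripted setups, the same download can be sketched in Python using only the standard library. This is a minimal sketch, not an official tool: the helper name `download_and_extract` and the `data` target directory are hypothetical, and the archive's internal layout is not described on this page, so the code extracts the archive as-is and assumes nothing about its contents.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the wget command above.
DATASET_URL = (
    "https://download.microsoft.com/download/4/c/f/"
    "4cfec788-7478-4e47-9a15-ace9b6a96198/ASL_STEM_Wiki.zip"
)


def download_and_extract(url: str, dest_dir: str) -> Path:
    """Download a zip archive into dest_dir and extract it next to the archive.

    Hypothetical helper for illustration; the name and directory layout are
    assumptions, not part of the dataset release.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / url.rsplit("/", 1)[-1]

    if not archive_path.exists():
        # Write to a temporary ".part" file first so that an interrupted
        # download is never mistaken for a complete archive.
        partial_path = archive_path.with_name(archive_path.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial_path, "wb") as out:
            # Copy in chunks so the archive is never held fully in memory.
            shutil.copyfileobj(response, out)
        partial_path.replace(archive_path)

    # Extract into a folder named after the archive (without ".zip").
    extract_dir = dest / archive_path.stem
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(extract_dir)
    return extract_dir


# Example call (starts a large download):
#     download_and_extract(DATASET_URL, "data")
```

The corpus covers over 300 hours of video, so the archive is likely to be large; check available disk space before running the example call.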
Bilingual STEM article resource: Wiki – The ASL Data Community.
Open-source Repo: Coming soon!
Citation: If you use this dataset in your work, please cite our paper.
@inproceedings{yin-etal-2024-asl,
title = "{ASL} {STEM} {W}iki: Dataset and Benchmark for Interpreting {STEM} Articles",
author = "Yin, Kayo and
Singh, Chinmay and
Minakov, Fyodor O and
Milan, Vanessa and
Daum{\'e} III, Hal and
Zhang, Cyril and
Lu, Alex Xijie and
Bragg, Danielle",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = Nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.801",
pages = "14474--14490",
abstract = "Deaf and hard-of-hearing (DHH) students face significant barriers in accessing science, technology, engineering, and mathematics (STEM) education, notably due to the scarcity of STEM resources in signed languages. To help address this, we introduce ASL STEM Wiki: a parallel corpus of 254 Wikipedia articles on STEM topics in English, interpreted into over 300 hours of American Sign Language (ASL). ASL STEM Wiki is the first continuous signing dataset focused on STEM, facilitating the development of AI resources for STEM education in ASL. We identify several use cases of ASL STEM Wiki with human-centered applications. For example, because this dataset highlights the frequent use of fingerspelling for technical concepts, which inhibits DHH students{'} ability to learn, we develop models to identify fingerspelled words{---}which can later be used to query for appropriate ASL signs to suggest to interpreters.",
}
Acknowledgements: We are deeply grateful to all community members who participated in this dataset project.