Crowdsourcing for Statistical Machine Translation

Modern approaches to machine translation are data-driven. Statistical translation models are trained using parallel text, which consists of sentences in one language paired with their translations into another language. One advantage of statistical translation models is that they are language independent, meaning that they can be applied to any language for which we have training data. Unfortunately, for most of the world's languages, we do not have sufficient amounts of training data.

In this talk, I will detail my experiments using Amazon's Mechanical Turk to create crowdsourced translations for "low resource" languages that we do not have training data for. I will discuss a variety of quality-control strategies that allow non-expert translators to produce translations approaching the level of professional translators, at a fraction of the cost. I'll analyze the impact of the quality of training data on the performance of the statistical translation model that we train from it, and ask the question: should we even bother with quality control? I'll present feasibility studies to see which low resource languages it is possible to collect data for, and volume studies to see how much data we can expect to create in a short period. Finally, I will discuss the implications of inexpensive, high-quality translations for applications including national defense, disaster response, research, and online translation systems.

Speaker Bios

Chris Callison-Burch is an associate research professor at the Center for Language and Speech Processing at Johns Hopkins University. His research group recently released Joshua, an open source decoder for statistical machine translation (see http://cs.jhu.edu/~ccb/joshua/). He co-organized the Workshop on Creating Speech and Language Data using Amazon's Mechanical Turk at the NAACL conference earlier this year. He obsessively built a 10^9-word French-English parallel corpus last year by scraping just about every bilingual site on the web. Because he really likes data. Data is awesome.

Omar F. Zaidan is a final-year PhD student in the Department of Computer Science at Johns Hopkins University. His research focuses on how to best utilize human annotators and their knowledge, and on developing models that take advantage of such knowledge, especially in the context of crowdsourced annotation tasks. He also developed Z-MERT, an open source package used by research teams around the world for MT parameter tuning, and recently created the Arabic Online Commentary dataset, consisting of over 50M words of Arabic reader commentary. He was also a member of the organizing committees for the Workshops on Machine Translation (WMT) in 2010 and 2011.

Date:
Speakers:
Chris Callison-Burch and Omar Zaidan
Affiliation:
Johns Hopkins University