Neural Machine Translation Enabling Human Parity Innovations In the Cloud
In March 2018 we announced (Hassan et al. 2018) a breakthrough result where we showed for the first time a Machine Translation system that could perform as well as human translators (in a specific scenario – Chinese-English news translation). This was an exciting breakthrough in Machine Translation research, but the system we built for this project was a complex, heavyweight research system, incorporating multiple cutting-edge techniques. While we released the output of this system on several test sets, the system itself was not suitable for deployment in a real-time machine translation cloud API.
Today we are excited to announce the availability in production of our latest generation of neural Machine Translation models. These models incorporate most of the goodness of our research system and are now available by default when you use the Microsoft Translator API. These new models are available today in Chinese, German, French, Hindi, Italian, Spanish, Japanese, Korean, and Russian, from and to English. More languages are coming soon.
Getting from Research Paper to Cloud API
Over the past year, we have been looking for ways to bring much of the quality of our human-parity system into the Microsoft Translator API, while continuing to offer low-cost real-time translation. Here are some of the steps on that journey.
Teacher-Student Training
Our first step was to switch to a “teacher-student” framework, where we train a lightweight real-time student to mimic a heavyweight teacher network (Ba and Caruana 2014). This is accomplished by training the student not on the parallel data that MT systems are usually trained on, but on translations produced by the teacher (Kim and Rush 2016). This is a simpler task than learning from raw data, and allows a shallower, simpler student to very closely follow the complex teacher. As one might expect, our initial attempts still suffered quality drops from teacher to student (no free lunch!), but we nevertheless took first place in the WNMT 2018 Shared Task on Efficient Decoding (Junczys-Dowmunt et al. 2018a). Some particularly exciting results from this effort were that Transformer (Vaswani et al. 2017) models and their modifications play well with teacher-student training and are astoundingly efficient during inference on the CPU.
Learning from these initial results, and after a lot of iteration, we discovered a recipe that allows our simple student to have almost the same quality as the complex teacher (sometimes there is a free lunch after all?). Now we were free to build large, complex teacher models to maximize quality, without worrying (too much) about real-time constraints.
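As a rough illustration of the teacher-student idea (not our production code), the sketch below builds sequence-level distillation data in the style of Kim and Rush (2016): the student learns from the teacher's translations rather than from the original references. The `toy_teacher` stand-in is purely hypothetical; in practice the teacher is a heavyweight, high-quality model.

```python
from typing import Callable, List, Tuple

def build_distillation_data(
    sources: List[str],
    teacher_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Sequence-level knowledge distillation: pair each source sentence
    with the teacher's translation of it, and train the student on
    these pairs instead of the original human references."""
    return [(src, teacher_translate(src)) for src in sources]

# Toy stand-in for the teacher model (assumption: in reality this is a
# large, slow, high-quality NMT system).
toy_teacher = lambda s: s.upper()
student_data = build_distillation_data(["watch cat videos here"], toy_teacher)
print(student_data)  # [('watch cat videos here', 'WATCH CAT VIDEOS HERE')]
```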
Real-time translation
Our decision to switch to a teacher-student framework was motivated by the great work by Kim and Rush (2016) for simple RNN-based models. At that point it was unclear if the reported benefits would manifest for Transformer models as well (see Vaswani et al. 2017 for details on this model). However, we quickly discovered that this was indeed the case.
The Transformer student could use a greatly simplified decoding algorithm (greedy search) where we just pick the single best translated word at each step, rather than the usual method (beam search), which involves searching through the huge space of possible translations. This change had minimal quality impact but led to big improvements in translation speed. By contrast, a teacher model would suffer a significant drop in quality when switching from beam search to greedy search.
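To show just how simple greedy search is, here is a minimal sketch of a greedy decoding loop; the scorer is a toy stand-in for the student decoder, and the token IDs are illustrative assumptions.

```python
import numpy as np

def greedy_decode(next_token_scores, bos: int, eos: int, max_len: int = 50):
    """Greedy search: keep only the single best token at each step,
    instead of tracking k partial hypotheses as beam search does."""
    tokens = [bos]
    for _ in range(max_len):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        best = int(np.argmax(scores))       # pick the argmax; no beam to maintain
        tokens.append(best)
        if best == eos:
            break
    return tokens

# Toy scorer over a 5-token vocabulary (assumption: the real system would
# call the student decoder here).
rng = np.random.default_rng(0)
print(greedy_decode(lambda prefix: rng.random(5), bos=0, eos=4))
```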
At the same time, we realized that rather than using the latest neural architecture (Transformer with self-attention) in the decoder, the student could be modified to use a drastically simplified and faster recurrent (RNN) architecture. This matters because while the Transformer encoder can be computed over the whole source sentence in parallel, the target sentence is generated a single word at a time, so the speed of the decoder has a big impact on the overall speed of translation. Compared to self-attention, the recurrent decoder reduces algorithmic complexity from quadratic to linear in target sentence length. Especially in the teacher-student setting, we saw no loss in quality from these modifications, in either automatic or human evaluation. Several additional improvements, such as parameter sharing, led to further reductions in complexity and increased speed.
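The shape of such a student can be sketched in a few lines of PyTorch. This is only an illustration of the encoder/decoder split described above, with made-up dimensions, a single GRU layer, and a single source-attention block; it is not the production architecture.

```python
import torch
import torch.nn as nn

class HybridStudent(nn.Module):
    """Transformer encoder (parallel over the source) paired with a
    recurrent GRU decoder (linear in the target length)."""
    def __init__(self, vocab: int = 1000, dim: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(vocab, dim)
        self.tgt_emb = nn.Embedding(vocab, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.src_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        memory = self.encoder(self.src_emb(src))         # whole source at once
        states, _ = self.decoder(self.tgt_emb(tgt))      # one step per target word
        ctx, _ = self.src_attn(states, memory, memory)   # attend over the source
        return self.out(states + ctx)

model = HybridStudent()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```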
Another advantage of the teacher-student framework that we were very excited to see is that quality improvements in the ever-growing and ever-changing teachers carry over easily to an unchanging student architecture. In the cases where we did see problems in this regard, a slight increase in student model capacity would close the gap again.
Dual Learning
The key insight behind dual learning (He et al. 2016) is the “round-trip translation” check that people sometimes use to check translation quality. Suppose we’re using an online translator to go from English to Italian. If we don’t read Italian, how do we know if it’s done a good job? Before clicking send on an email, we might choose to check the quality by translating the Italian back to English (maybe on a different web site). If the English we get back has strayed too far from the original, chances are one of the translations went off the rails.
Dual learning uses the same approach to train two systems (e.g. English->Italian and Italian->English) in parallel, using the round-trip translation from one system to score, validate and train the other system.
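A hedged sketch of that round-trip check is below; the token-overlap score is a stand-in we chose for illustration, whereas dual learning proper uses model probabilities as the training signal.

```python
from typing import Callable

def round_trip_score(src: str,
                     forward: Callable[[str], str],   # e.g. English -> Italian
                     backward: Callable[[str], str]   # e.g. Italian -> English
                     ) -> float:
    """Translate forward, translate back, and measure how much of the
    original sentence survives the round trip."""
    back = backward(forward(src))
    a, b = set(src.lower().split()), set(back.lower().split())
    return len(a & b) / max(len(a), 1)

# Toy "translators" (identity functions) just to make the sketch runnable.
print(round_trip_score("watch cat videos here", lambda s: s, lambda s: s))  # 1.0
```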
Dual learning was a major contributor to our human-parity research result. In going from the research system to our production recipe, we generalized this approach broadly. Not only did we co-train pairs of systems on each other’s output, we also used the same criterion for filtering our parallel data.
Cleaning up inaccurate data
Machine translation systems are trained on “parallel data”, i.e. pairs of documents that are translations of each other, ideally created by a human translator. As it turns out, this parallel data is often full of inaccurate translations. Sometimes the documents are not truly parallel but only loose paraphrases of each other. Human translators can choose to leave out some source material or insert additional information. The data can contain typos, spelling mistakes, and grammatical errors. Sometimes our data mining algorithms are fooled by similar but non-parallel data, or even by sentences in the wrong language. Worst of all, a lot of the web pages we see are spam, or may in fact be machine translations rather than human translations. Neural systems are very sensitive to this kind of inaccuracy in the data. We found that building neural models to automatically identify and get rid of these inaccuracies gave strong improvements in the quality of our systems. Our approach to data filtering took first place in the WMT18 parallel corpus filtering benchmark (Junczys-Dowmunt 2018a) and helped build one of the strongest English-German translation systems in the WMT18 News translation task (Junczys-Dowmunt 2018b). We used improved versions of this approach in the production systems we released today.
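In the same spirit (and simplified well beyond the actual recipe), a sentence pair can be scored with cross-entropies from two translation models, one per direction: a pair is kept only if both directions find it likely and roughly agree. The toy scorer below is an assumption for illustration; real scores come from trained NMT models.

```python
from typing import Callable, List, Tuple

def filter_parallel(
    pairs: List[Tuple[str, str]],
    ce_fwd: Callable[[str, str], float],  # per-word cross-entropy of tgt given src
    ce_bwd: Callable[[str, str], float],  # per-word cross-entropy of src given tgt
    keep_ratio: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep the pairs that both directions consider likely and on which
    the two directions agree (low average, low disagreement)."""
    def noise_score(src: str, tgt: str) -> float:
        f, b = ce_fwd(src, tgt), ce_bwd(src, tgt)
        return 0.5 * (f + b) + abs(f - b)   # lower = cleaner, more symmetric
    ranked = sorted(pairs, key=lambda p: noise_score(*p))
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

# Toy scorer (assumption: length mismatch as a crude proxy for noise).
toy_ce = lambda s, t: abs(len(s.split()) - len(t.split()))
print(filter_parallel([("a b c", "x y z"), ("a b c", "x")], toy_ce, toy_ce))
# [('a b c', 'x y z')]
```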
Factored word representations
When moving a research technology to production, several real-world challenges arise. Getting numbers, dates, times, capitalization, spacing, etc. right matters a lot more in production than in a research system.
Consider the challenge of capitalization. Suppose we’re translating the sentence “WATCH CAT VIDEOS HERE”. We know how to translate “cat”, and we would want to translate “CAT” the same way. But now consider “Watch US soccer here”: in this context we don’t want to confuse the word “us” with the acronym “US”.
To handle this, we used an approach known as factored machine translation (Koehn and Hoang 2007, Sennrich and Haddow 2016) which works as follows. Instead of a single numeric representation (“embedding”) for “cat” or “CAT”, we use multiple embeddings, known as “factors”. In this case, the primary embedding would be the same for “CAT” and “cat” but a separate factor would represent the capitalization, showing that it was all-caps in one instance but lowercase in the other. Similar factors are used on the source and the target side.
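A minimal sketch of such a factored embedding is below; summing the lemma and capitalization embeddings is one common choice (concatenation is another), and the three-way case inventory is an assumption for illustration.

```python
import torch
import torch.nn as nn

def case_factor(tok: str) -> int:
    # 0 = lowercase, 1 = Titlecase, 2 = ALL-CAPS (assumed inventory)
    if len(tok) > 1 and tok.isupper():
        return 2
    return 1 if tok[:1].isupper() else 0

class FactoredEmbedding(nn.Module):
    """'cat' and 'CAT' share one primary (lemma) embedding; a separate
    capitalization factor distinguishes them."""
    def __init__(self, lemma_vocab: dict, dim: int = 8):
        super().__init__()
        self.lemma_vocab = lemma_vocab
        self.lemma = nn.Embedding(len(lemma_vocab), dim)
        self.case = nn.Embedding(3, dim)

    def forward(self, tok: str) -> torch.Tensor:
        lemma_id = torch.tensor(self.lemma_vocab[tok.lower()])
        return self.lemma(lemma_id) + self.case(torch.tensor(case_factor(tok)))

emb = FactoredEmbedding({"watch": 0, "cat": 1, "videos": 2, "here": 3})
print(emb("cat").shape, torch.equal(emb("cat"), emb("CAT")))  # torch.Size([8]) False
```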
We use similar factors to handle word fragments and spacing between words (a complex issue in non-spacing or semi-spacing languages such as Chinese, Korean, Japanese or Thai).
Factors also dramatically improved translation of numbers, which is critical in many scenarios. Number translation is mostly an algorithmic transformation. For example, 1,234,000 can be written as 12,34,000 in Hindi, 1.234.000 in German, and 123.4万 in Chinese. Traditionally, numbers are represented like words, as groups of characters of varying length. This makes it hard for machine learning to discover the algorithm. Instead, we feed every single digit of a number separately, with factors marking beginning and end. This simple trick robustly and reliably removed nearly all number-translation errors.
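Here is a tiny sketch of that digit-level representation, with an assumed three-value position factor (begin/inside/end); the production factor set is richer.

```python
def digit_tokens(number: str):
    """Feed every digit as its own token, with a factor marking whether
    it is the first digit, the last digit, or inside the number."""
    digits = [c for c in number if c.isdigit()]
    out = []
    for i, d in enumerate(digits):
        pos = "begin" if i == 0 else "end" if i == len(digits) - 1 else "inside"
        out.append((d, pos))
    return out

print(digit_tokens("1,234,000"))
# [('1', 'begin'), ('2', 'inside'), ('3', 'inside'), ('4', 'inside'),
#  ('0', 'inside'), ('0', 'inside'), ('0', 'end')]
```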
Faster model training
When we’re training a single system towards a single goal, as we did for the human-parity research project, we expect to throw vast amounts of hardware at models that take weeks to train. When training production models for 20+ language pairs, this approach becomes untenable. Not only do we need reasonable turnaround times, but we also need to moderate our hardware demands. For this project, we made a number of performance improvements to Marian NMT (Junczys-Dowmunt et al. 2018b).
Marian NMT is the open-source Neural MT toolkit that Microsoft Translator is based on. Marian is a pure C++ neural machine translation toolkit and, as a result, extremely efficient: it does not require GPUs at runtime and is very efficient at training time.
Due to its self-contained nature, it is quite easy to optimize Marian for NMT-specific tasks, which results in one of the most efficient NMT toolkits available. Take a look at the benchmarks. If you are interested in Neural MT research and development, please join and contribute to the community on GitHub.
Our improvements to mixed-precision training and decoding, as well as to large-model training, will soon be made available in the public GitHub repository.
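For readers more familiar with Python than with Marian's C++ internals, the snippet below illustrates the general idea behind mixed-precision training (half-precision compute guarded by loss scaling); it is a generic PyTorch illustration of the concept, not Marian's implementation.

```python
import torch
import torch.nn as nn

use_amp = torch.cuda.is_available()          # mixed precision needs a GPU
device = "cuda" if use_amp else "cpu"

model = nn.Linear(256, 256).to(device)
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 256, device=device)
with torch.cuda.amp.autocast(enabled=use_amp):   # run safe ops in FP16
    loss = model(x).square().mean()
scaler.scale(loss).backward()                    # scale loss to avoid FP16 underflow
scaler.step(optimizer)                           # unscale gradients, update in FP32
scaler.update()
```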
We are excited about the future of neural machine translation. We will continue to roll out the new model architecture to the remaining languages and Custom Translator throughout this year. Our users will automatically get the significantly better-quality translations through the Translator API, our Translator app, Microsoft Office, and the Edge browser. We hope the new improvements help your personal and professional lives and look forward to your feedback.
References
- Jimmy Ba and Rich Caruana. 2014. Do Deep Nets Really Need to be Deep? Advances in Neural Information Processing Systems 27, pages 2654-2662. https://papers.nips.cc/paper/5484-do-deep-nets-really-need-to-be-deep
- Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. http://arxiv.org/abs/1803.05567
- Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, Wei-Ying Ma. 2016. Dual Learning for Machine Translation. Advances in Neural Information Processing Systems 29, pages 820-828. https://papers.nips.cc/paper/6469-dual-learning-for-machine-translation
- Marcin Junczys-Dowmunt. 2018a. Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora. Proceedings of the Third Conference on Machine Translation: Shared Task Papers. Belgium, pages 888-895. https://aclweb.org/anthology/papers/W/W18/W18-6478/
- Marcin Junczys-Dowmunt. 2018b. Microsoft’s Submission to the WMT2018 News Translation Task: How I Learned to Stop Worrying and Love the Data. Proceedings of the Third Conference on Machine Translation: Shared Task Papers. Belgium, pages 425-430. https://www.aclweb.org/anthology/W18-6415/
- Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, Anthony Aue. 2018a. Marian: Cost-effective High-Quality Neural Machine Translation in C++. Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. Melbourne, Australia, pages 129-135. https://aclweb.org/anthology/papers/W/W18/W18-2716/
- Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, Alexandra Birch. 2018b. Marian: Fast Neural Machine Translation in C++. Proceedings of ACL 2018, System Demonstrations. Melbourne, Australia, pages 116-121. https://www.aclweb.org/anthology/P18-4020/
- Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1317–1327. https://aclweb.org/anthology/papers/D/D16/D16-1139/
- Philipp Koehn, Hieu Hoang. 2007. Factored Translation Models. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic, pages 868-876. https://www.aclweb.org/anthology/D07-1091/
- Rico Sennrich, Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers. Berlin, Germany, pages 83-91. https://www.aclweb.org/anthology/W16-2209/
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. Attention Is All You Need. Advances in Neural Information Processing Systems 30, pages 5998-6008. https://papers.nips.cc/paper/7181-attention-is-all-you-need