Bilingual Data Cleaning for SMT using Graph-based Random Walk
- Lei Cui ,
- Dongdong Zhang ,
- Shujie Liu ,
- Mu Li ,
- Ming Zhou
ACL 2013
Published by ACL - Association for Computational Linguistics
The quality of bilingual data is a key factor
in Statistical Machine Translation (SMT).
Low-quality bilingual data tends to produce
incorrect translation knowledge and
also degrades translation modeling performance.
Previous work often used supervised
learning methods to filter low-quality
data, but these require a fair number of
human-labeled examples, which are
not easy to obtain. To reduce the reliance
on labeled examples, we propose
an unsupervised method to clean bilingual
data. The method leverages the mutual
reinforcement between the sentence
pairs and the extracted phrase pairs, based
on the observation that better sentence
pairs often lead to better phrase extraction
and vice versa. End-to-end experiments
show that the proposed method substantially
improves performance on large-scale
Chinese-to-English translation tasks.
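
The mutual reinforcement idea can be illustrated with a minimal sketch. It is not the paper's exact formulation: it assumes a bipartite graph whose edge weights link each sentence pair to the phrase pairs extracted from it, and it uses a simple HITS-style alternating update in place of the authors' random walk. The function name `mutual_reinforcement`, the toy `weights` matrix, and the sum-to-one normalization are all hypothetical choices made for illustration.

```python
def mutual_reinforcement(weights, iterations=100, tol=1e-12):
    """Score sentence pairs and phrase pairs on a bipartite graph.

    weights[i][j] is an assumed non-negative edge weight between
    sentence pair i and phrase pair j (0 means no edge).
    Returns (sentence_scores, phrase_scores), each summing to 1.
    """
    n_sent = len(weights)
    n_phrase = len(weights[0])
    # Start from uniform scores on both sides of the graph.
    sent = [1.0 / n_sent] * n_sent
    phrase = [1.0 / n_phrase] * n_phrase

    for _ in range(iterations):
        # A phrase pair scores highly if good sentence pairs produce it.
        phrase = [
            sum(weights[i][j] * sent[i] for i in range(n_sent))
            for j in range(n_phrase)
        ]
        total = sum(phrase)
        phrase = [p / total for p in phrase]

        # A sentence pair scores highly if it yields good phrase pairs.
        new_sent = [
            sum(weights[i][j] * phrase[j] for j in range(n_phrase))
            for i in range(n_sent)
        ]
        total = sum(new_sent)
        new_sent = [s / total for s in new_sent]

        # Stop once the sentence scores no longer change.
        delta = sum(abs(a - b) for a, b in zip(new_sent, sent))
        sent = new_sent
        if delta < tol:
            break

    return sent, phrase


# Toy graph: sentence pairs 0 and 1 share phrase pairs 0 and 1;
# sentence pair 2 mostly yields a phrase pair that nothing else supports.
weights = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.2, 1.0],
]
sent_scores, phrase_scores = mutual_reinforcement(weights)
```

In this toy graph, the sentence pair whose phrase pairs are not supported elsewhere ends up with the lowest score, so a cleaning step could drop the lowest-ranked sentence pairs before training. Any such cutoff is a further assumption not stated in the abstract.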