Word-Level Language Identification Using CRF: Code-Switching Shared Task Report of MSR India System
- Gokul Chittaranjan
- Yogarshi Vyas
- Kalika Bali
- Monojit Choudhury
Proceedings of the First Workshop on Computational Approaches to Code Switching
Published by Association for Computational Linguistics
We describe a CRF-based system for word-level language identification of code-mixed text. Our method uses lexical, contextual, character n-gram, and special character features, and can therefore be replicated easily across languages. Its performance is benchmarked against the test sets provided by the shared task on code-mixing (Solorio et al., 2014) for four language pairs, namely English-Spanish (En-Es), English-Nepali (En-Ne), English-Mandarin (En-Cn), and Standard Arabic-Arabic dialects (Ar-Ar). The experimental results show consistent performance across the language pairs.
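To make the four feature families named in the abstract concrete, the following is a minimal sketch of per-word feature extraction of the kind a CRF toolkit typically consumes. It is not the authors' implementation: the toy `LEXICONS`, the function names, the n-gram range (1 to 3), the boundary markers, and the particular special-character checks are all assumptions chosen for illustration.

```python
import string

# Hypothetical toy lexicons (assumption); a real system would load large word lists.
LEXICONS = {
    "en": {"the", "is", "very", "good"},
    "es": {"el", "es", "muy", "bueno"},
}


def char_ngrams(word, n_max=3):
    """Return character n-grams of length 1..n_max, with ^ and $ as boundary markers."""
    padded = f"^{word}$"
    return [
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def word_features(tokens, i, n_max=3):
    """Build a feature dict for tokens[i] from the four feature families."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        "word": lower,
        # Special-character features (the exact set is an assumption).
        "has_digit": any(c.isdigit() for c in word),
        "has_punct": any(c in string.punctuation for c in word),
        "is_capitalized": word[:1].isupper(),
        "starts_with_at_or_hash": word[:1] in {"@", "#"},
        # Contextual features: the neighbouring words, with sentence-edge markers.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Lexical features: membership in each language's word list.
    for lang, lexicon in LEXICONS.items():
        feats[f"in_lexicon_{lang}"] = lower in lexicon
    # Character n-gram features, one binary feature per n-gram.
    for gram in char_ngrams(lower, n_max):
        feats[f"ngram={gram}"] = True
    return feats


def sentence_features(tokens):
    """Feature dicts for every token in a sentence, in order."""
    return [word_features(tokens, i) for i in range(len(tokens))]
```

A list of such dictionaries per sentence, paired with a list of per-word language labels, is the usual input shape for a linear-chain CRF trainer. None of the features depends on a particular language beyond the word lists, which is consistent with the abstract's claim that the approach carries over across language pairs.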