Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
- Sanchit Ahuja,
- Divyanshu Aggarwal,
- Varun Gumma,
- Ishaan Watts,
- Ashutosh Sathe,
- Millicent Ochieng,
- Rishav Hada,
- Prachi Jain,
- Mohamed Ahmed,
- Kalika Bali,
- Sunayana Sitaram
Published at the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)
There has recently been a surge in LLM evaluation research aimed at understanding LLM capabilities and limitations. However, much of this work has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced in this period, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, and Llama2) by comparing them on the same set of multilingual datasets. Our benchmark comprises datasets spanning a wide range of languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA-v1.5 and GPT-4-Vision. Our experiments show that GPT-4 and PaLM2 outperform the Llama and Mistral models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
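For illustration only, the sketch below shows what a comparative multilingual evaluation loop of this general kind might look like: several models are queried on the same per-language examples and scored with a simple exact-match metric. The dataset contents, the `query_model` wrapper, and the metric are assumptions made for this sketch, not the paper's actual benchmark harness, prompts, or scoring.

```python
# Minimal sketch of a cross-model, cross-language benchmark loop.
# All names and data here are hypothetical placeholders.

from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical example data: (language, prompt, gold answer) triples per dataset.
DATASETS: Dict[str, List[Tuple[str, str, str]]] = {
    "nli-style": [
        ("sw", "Premise ... Hypothesis ... Does the premise entail the hypothesis?", "yes"),
        ("hi", "Premise ... Hypothesis ... Does the premise entail the hypothesis?", "no"),
    ],
}


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for an API or local-inference call to GPT-4, PaLM2, Llama2, etc."""
    return "yes"  # dummy response so the sketch runs without any API access


def evaluate(models: List[str]) -> Dict[str, Dict[str, float]]:
    """Exact-match accuracy per model, broken down by dataset and language."""
    scores: Dict[str, Dict[str, float]] = defaultdict(dict)
    for model in models:
        for dataset_name, examples in DATASETS.items():
            per_lang: Dict[str, List[int]] = defaultdict(list)
            for lang, prompt, gold in examples:
                pred = query_model(model, prompt).strip().lower()
                per_lang[lang].append(int(pred == gold.lower()))
            for lang, hits in per_lang.items():
                scores[model][f"{dataset_name}/{lang}"] = sum(hits) / len(hits)
    return scores


if __name__ == "__main__":
    print(evaluate(["gpt-4", "palm2", "llama2-70b"]))
```

Keeping the datasets, prompts, and metric fixed across models, as in this loop, is what makes the per-model numbers directly comparable; only the `query_model` backend changes between systems.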