Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Abstract

Reliable multilingual evaluation is difficult, and culturally appropriate evaluation is even harder to achieve. A common practice to fill this gap is to machine-translate English evaluation sets, but this introduces language bias and carries over cultural assumptions, often testing knowledge irrelevant to the target audience. In this work, we highlight the extent and impact of these biases and present a multilingual evaluation framework that aims to mitigate them through improved translation and annotation practices. Through a large-scale study involving professional and community translators and annotators, we show that state-of-the-art models excel primarily by learning Western-centric concepts. Notably, we find that model rankings on the full MMLU change when models are evaluated on the subset of questions marked as culturally sensitive. We release Global-MMLU, a multilingual extension of MMLU across 42 languages, with improved translation quality, expanded language coverage, and designated subsets labeled as culturally sensitive and culturally agnostic, enabling a more comprehensive and equitable benchmark for evaluating language models across diverse linguistic and cultural contexts. Global-MMLU: https://hf.co/datasets/CohereForAI/Global-MMLU; Global-MMLU Lite: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite.
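
As a minimal sketch (not part of the paper), the released data can be loaded with the standard Hugging Face datasets API; the assumption that language subsets are exposed as dataset configurations (e.g. "en") and that rows carry a "cultural_sensitivity_label" column with "CS"/"CA" values is ours and should be checked against the released schema.

# Minimal sketch: load one language subset of Global-MMLU and split it into the
# culturally-sensitive and culturally-agnostic subsets described in the abstract.
# Assumed (not stated in the abstract): config names are language codes and the
# label column is "cultural_sensitivity_label" with values "CS" and "CA".
from datasets import load_dataset

ds = load_dataset("CohereForAI/Global-MMLU", "en", split="test")

culturally_sensitive = ds.filter(
    lambda row: row.get("cultural_sensitivity_label") == "CS"  # assumed label value
)
culturally_agnostic = ds.filter(
    lambda row: row.get("cultural_sensitivity_label") == "CA"  # assumed label value
)

print(len(ds), len(culturally_sensitive), len(culturally_agnostic))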