KO-BENCH-2503: A Refined Extension of KO-BENCH for Korean LLM Evaluation

12Digit AI Research

1. Introduction

Evaluating the task-specific performance of large language models (LLMs) has become an indispensable part of developing and deploying LLM-based systems. Effective evaluation requires a diverse set of datasets organized by task type and difficulty level, which enables more structured and comprehensive performance analysis.

This report introduces KO-BENCH-2503, a newly constructed benchmark dataset designed for fine-grained evaluation of various LLMs. As an enhanced and expanded version of the previous Korean LLM benchmark KO-BENCH, whose questions were largely direct translations of their English originals, KO-BENCH-2503 provides a more nuanced assessment by incorporating a wider range of category-specific, Korea-customized questions. These questions are tailored to reflect the complexity and diversity required for evaluating the Korean-language capabilities of LLMs with greater precision.

2. Motivation and Goal

2-1. Limitations of KO-BENCH

KO-BENCH was built on MT-Bench and consisted mainly of simple translations of MT-Bench's original questions, with little adaptation. As such, it exhibited limitations when applied directly in the Korean context, highlighting the need for a new dataset tailored to the Korean linguistic and cultural environment.

First, KO-BENCH lacked question diversity. Due to its reliance on MT-Bench, KO-BENCH failed to offer sufficient variety in question scope and difficulty. Most items were straightforward translations of their English counterparts, with minimal consideration given to real-world scenarios, contemporary societal issues, or practical contexts that could highlight LLM applicability.

Second, KO-BENCH showed insufficient realism and localization. The questions did not closely align with the actual linguistic environment of Korean users, and they lacked cultural and contextual elements relevant to Korean society. Consequently, the dataset was limited in its ability to rigorously assess Korean-language LLMs.

2-2. Objectives of KO-BENCH-2503 Development

KO-BENCH-2503 was developed to address the shortcomings of KO-BENCH and to provide a practical benchmark for evaluating Korean-language LLMs. Its main objectives and characteristics are as follows.

  • While retaining the original categories from KO-BENCH, all questions were thoroughly revised, and subcategories were added to enable more balanced and comprehensive evaluation.
  • Questions were reconstructed to reflect real-life Korean contexts, incorporating frequently used expressions, sociocultural backgrounds, regional characteristics (e.g., dialects and local references), historical figures, and public personalities familiar to Korean users.
  • Grammar, vocabulary choices, and sentence structures were carefully refined to ensure fluency and naturalness in Korean.
  • Realistic questions that users are likely to ask LLMs in practice were included, allowing for accurate performance assessment in real-world usage scenarios.
  • Beyond factual recall, a greater proportion of high-level questions requiring reasoning, analysis, and creativity were incorporated to better test a wide range of cognitive capabilities.
  • Additionally, KO-BENCH-2503 serves as a proprietary internal dataset designed for in-house evaluation purposes. It maintains continuity with KO-BENCH while providing a more advanced framework for performance assessment.

3. Dataset Construction Methodology

KO-BENCH-2503 maintains the same category structure as KO-BENCH, while significantly refining the expressions, question types, formats, and content within each category to better suit the Korean user environment. This section outlines how each category was improved and highlights the key enhancement points that were prioritized during the revision process.
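To make this retained structure concrete, the sketch below shows one way a single benchmark item could be represented, assuming an MT-Bench-style JSONL layout with one item per line. The field names, identifier, and sample prompts are illustrative assumptions and do not reflect the actual KO-BENCH-2503 schema.

    import json

    # Illustrative layout of a single benchmark item (assumed, MT-Bench-style).
    example_item = {
        "question_id": 101,            # illustrative identifier
        "category": "roleplay",        # one of the eight retained categories
        "turns": [
            "First-turn user prompt, written in natural Korean",
            "Optional follow-up prompt for the second turn",
        ],
    }

    # Items would typically be stored one per line (JSONL) for evaluation tooling.
    with open("ko_bench_2503_sample.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(example_item, ensure_ascii=False) + "\n")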

In the Writing category, practical writing tasks such as “student council president campaign speeches” were introduced. These items were designed to assess not only basic writing skills but also the ability to exhibit creative thinking within realistic scenarios.

The Roleplay category was updated to reflect Korean cultural contexts by incorporating scenarios involving public figures, historical characters, and regional dialects. Furthermore, situations commonly encountered in specific professions were included to evaluate how appropriately and naturally LLMs can respond in realistic conversational settings.

In the Reasoning category, while the structural format of questions was retained, the content and phrasing were revised to reflect real-life situations. Everyday contexts such as visiting a hair salon, experiencing an emergency room visit, or reading a clock were used to assess reasoning and problem-solving skills in practical scenarios.

For the Math category, the scope was broadened beyond simple arithmetic to include topics such as geometry, figures, equation solving, probability, and context-based calculations. Questions were uniquely composed to avoid redundancy and included real-world applications such as reasoning about maximum or minimum values and calculating bank interest rates, thereby assessing practical mathematical proficiency.

In the Coding category, the purpose and evaluation criteria of each problem were clearly articulated to allow for more systematic assessment of code correctness and functional completeness. All items were revised to use fluent and natural Korean expressions, enhancing clarity and comprehension.

The Extraction category was restructured around task-oriented applications with practical relevance, such as sentiment classification of news articles and analysis of food reviews. Machine-translated expressions were removed and replaced with natural Korean, improving both clarity and applicability in real-world settings.

In the STEM category, topics encountered in everyday life were incorporated, including viral transmission rates, fine dust pollution, LLM technology, and network security. Emphasis was placed on recent science and technology issues relevant to Korean society to ensure topicality and realism. As with other categories, linguistic expressions were refined to ensure naturalness and clarity.

Lastly, in the Humanities category, while maintaining the original question formats, the content was reoriented to focus on Korean humanities topics such as national history and notable Korean figures. All items were rewritten in smooth, intuitive Korean to eliminate translation artifacts and enhance user engagement and evaluation accuracy.

4. Evaluation Based on Empirical Experiments

This section presents the experimental results comparing the Korean language performance of various LLMs using prompts from both the existing benchmark (KO-BENCH) and the newly constructed dataset (KO-BENCH-2503). The primary objective of this evaluation is to verify whether KO-BENCH-2503 provides a more systematic and precise framework for assessing LLMs’ Korean language capabilities.

The experiment was designed around the hypothesis that "LLMs developed in Korea are likely to exhibit superior performance in Korean compared to models developed abroad, owing to a stronger focus on Korean-language data during training." To test this hypothesis, model responses were collected and subsequently evaluated.

Evaluation was conducted using the LLM-as-a-Judge method, in which a judge LLM assigns a score from 1 to 10 to the responses generated by the models under evaluation. The full evaluation pipeline, including scoring criteria, procedures, and implementation details, is documented in the K-Judge Technical Report.
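As a rough illustration of this setup, the sketch below shows single-answer grading with a judge model through an OpenAI-compatible API. The judge prompt, the score-parsing rule, and the choice of judge model are assumptions made for exposition only; the actual rubric and pipeline are those documented in the K-Judge Technical Report.

    import re
    from openai import OpenAI

    client = OpenAI()  # assumes an OpenAI-compatible judge endpoint

    # Illustrative judge prompt; the real rubric is defined in the K-Judge report.
    JUDGE_PROMPT = (
        "You are an impartial judge. Evaluate the assistant's answer to the "
        "question below for helpfulness, accuracy, and fluency in Korean. "
        'Conclude with a rating from 1 to 10 in the form "Rating: [[N]]".\n\n'
        "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
    )

    def judge_score(question: str, answer: str, judge_model: str = "gpt-4.1") -> int | None:
        """Ask the judge model for a 1-10 score and parse it from the reply."""
        reply = client.chat.completions.create(
            model=judge_model,
            temperature=0,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }],
        )
        match = re.search(r"\[\[(\d+)\]\]", reply.choices[0].message.content)
        return int(match.group(1)) if match else None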

4-1. Model Selection for Evaluation

Before conducting the evaluation, a balanced set of LLMs was selected, including both internationally and domestically developed models. For international models, we included several high-performing variants from OpenAI's GPT family (gpt-4-0125-preview, gpt-4o-mini-2024-07-18, gpt-4.1, and gpt-4.1-mini), along with models developed outside the English-speaking world, such as Qwen2-7B-Instruct and Qwen2-72B-Instruct from Alibaba's Qwen family.

For Korean models, we included EXAONE-3.0-7.8B-Instruct and EXAONE-3.5-7.8B-Instruct from LG AI Research, kanana-1.5-8b-instruct-2505 from Kakao, and our own in-house models: ko-gemma-2-9b-it, ko-gemma-3-12b, and ko-gemma-3-27b.
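For reference, the roster below simply groups the evaluated models as listed above so that the comparison in the next subsection is easier to follow; the grouping and variable names are illustrative only.

    # Model identifiers as listed above, grouped for the comparison that follows.
    INTERNATIONAL_MODELS = [
        "gpt-4-0125-preview", "gpt-4o-mini-2024-07-18", "gpt-4.1", "gpt-4.1-mini",
        "Qwen2-7B-Instruct", "Qwen2-72B-Instruct",
    ]
    KOREAN_MODELS = [
        "EXAONE-3.0-7.8B-Instruct", "EXAONE-3.5-7.8B-Instruct",
        "kanana-1.5-8b-instruct-2505",
        "ko-gemma-2-9b-it", "ko-gemma-3-12b", "ko-gemma-3-27b",
    ]
    ALL_MODELS = INTERNATIONAL_MODELS + KOREAN_MODELS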

4-2. Evaluation Results

The chart below illustrates the average scores assigned to the selected models in response to prompts from both KO-BENCH and KO-BENCH-2503. The X-axis represents the model names, and the Y-axis denotes the average score on a 10-point scale. Grey circle markers indicate scores based on KO-BENCH, while blue square markers represent scores based on KO-BENCH-2503.
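For reproducibility, a minimal matplotlib sketch of this chart layout is given below. The function name and arguments are illustrative; the measured average scores themselves are not reproduced here.

    import matplotlib.pyplot as plt

    def plot_comparison(models, ko_bench_scores, ko_bench_2503_scores,
                        out_path="figure1_ko_bench_comparison.png"):
        """Per-model comparison: grey circles for KO-BENCH, blue squares for
        KO-BENCH-2503, average 10-point scores on the Y-axis."""
        x = range(len(models))
        fig, ax = plt.subplots(figsize=(10, 4))
        ax.plot(x, ko_bench_scores, "o", color="grey", label="KO-BENCH")
        ax.plot(x, ko_bench_2503_scores, "s", color="tab:blue", label="KO-BENCH-2503")
        ax.set_xticks(x)
        ax.set_xticklabels(models, rotation=45, ha="right")
        ax.set_ylabel("Average score (1-10)")
        ax.set_ylim(0, 10)
        ax.legend()
        fig.tight_layout()
        fig.savefig(out_path, dpi=200)

    # Usage (scores are the measured per-dataset averages, not shown here):
    # plot_comparison(ALL_MODELS, ko_bench_averages, ko_bench_2503_averages)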

The results show that most international models performed worse on KO-BENCH-2503 than on KO-BENCH, whereas the Korean models generally achieved higher scores on KO-BENCH-2503. These findings support the initial hypothesis: that LLMs developed in Korea demonstrate stronger Korean language proficiency due to more focused training on Korean-language datasets.

It should be noted, however, that changes in question composition between KO-BENCH and KO-BENCH-2503 may have introduced differences in difficulty and task characteristics, which makes direct comparison of raw scores across the two datasets less straightforward. Nonetheless, examining relative changes in performance across the same models provides valuable insight.

[Figure 1] Comparison of model performance on KO-BENCH and KO-BENCH-2503 datasets

4-3. Summary of Findings

The experimental results confirm that KO-BENCH-2503 offers a more meaningful and refined benchmark for evaluating Korean-language capabilities of LLMs. Compared to KO-BENCH, KO-BENCH-2503 presents notable improvements in the naturalness of Korean expressions, relevance to real-world contexts, and inclusion of cognitively demanding tasks. These enhancements establish KO-BENCH-2503 as a more suitable dataset for performance evaluation of Korean-specialized LLMs.

5. Conclusion

This technical report introduced KO-BENCH-2503, a newly constructed benchmark dataset designed to address the limitations of the existing Korean LLM evaluation benchmark, KO-BENCH. While KO-BENCH included a large proportion of questions derived from simple translations and thus failed to fully reflect real-world Korean usage scenarios, KO-BENCH-2503 was developed with a stronger emphasis on structured, realistic, and contextually appropriate Korean prompts.

KO-BENCH-2503 retains the original category structure but significantly improves the quality of individual items and incorporates prompts tailored to Korean linguistic and cultural contexts. This provides a more robust foundation for evaluating the Korean language capabilities of LLMs in a precise and reliable manner. Through extensive experiments conducted with various LLMs, this report demonstrates that KO-BENCH-2503 is well-suited for its intended purpose of refined Korean-language evaluation.

KO-BENCH-2503 can serve as a practical reference benchmark for domestic organizations that require LLMs capable of handling Korean-language tasks. It also provides a valuable tool for selecting the most suitable LLM for specific use cases. Ultimately, we hope that KO-BENCH-2503 will become a key standard for identifying and deploying the most effective LLMs for Korean-language applications in Korea.