KoLightEval: Korean Evaluation Datasets and Task Integration for LLMs Using Lighteval

12Digit AI Research

1. Introduction

Evaluating the performance of large language models (LLMs) is essential for enabling objective comparisons and identifying areas for improvement. However, most existing benchmarks—such as MMLU, HellaSwag, and BIG-bench—are primarily designed for English, making it difficult to assess the practical utility and limitations of LLMs in Korean. To address this gap, this report presents a case study in which we construct a Korean-specific evaluation dataset and customize Hugging Face’s Lighteval framework for efficient evaluation of Korean LLMs.

2. Challenges in Korean LLM Evaluation and the Necessity of Lighteval

2.1 Existing Challenges

  • Lack of Korean Benchmarks: There is a significant shortage of standardized benchmarks and datasets optimized for Korean. In particular, evaluation datasets covering STEM fields, including mathematics and science problems, are relatively scarce, making it difficult to comprehensively assess the capabilities of Korean LLMs.
  • Difficulty in Applying Custom Data: Most existing evaluation frameworks are designed around English-specific formats and prompts. Korean, as an agglutinative language with its own word order and honorifics, cannot be accurately evaluated by simply translating English prompts. Therefore, a custom evaluation framework tailored to the linguistic characteristics of Korean is necessary.
  • Lack of Efficient Backend and Automation Support: Evaluating large-scale datasets and the latest LLMs requires efficient use of multiple GPUs, API calls, and various backend environments. However, existing systems often lack sufficient support for such automated and scalable evaluation.

2.2 Advantages of Lighteval

  • Flexible Benchmark Extensibility: Lighteval's modular code architecture, built on Hugging Face's Datasets and Tasks abstractions, allows easy addition and modification of evaluation datasets and tasks. This enables researchers to quickly apply their own custom datasets, facilitating evaluation tailored to non-English languages such as Korean.
  • Support for Diverse Backends and Acceleration: Lighteval supports multiple execution backends, including Hugging Face Transformers, vLLM, and the Accelerate library. By leveraging parallel processing and hardware acceleration on large GPU clusters, it significantly improves evaluation speed. It can also integrate external API-based models, such as OpenAI's API, enabling flexible evaluation of the latest LLMs.
  • User-Friendly Interfaces and Automation: Both a command line interface (CLI) and a Python API are provided, allowing users to easily build and run evaluation pipelines. Detailed sample-level result storage and log management features support systematic analysis of evaluation processes and outcomes.

3. Construction of Korean Evaluation Datasets

The following Korean evaluation datasets are included:

  • ko-math-500 : A Korean-translated subset of 500 high school-level math problems from the MATH dataset, including detailed solutions with LaTeX expressions.
  • ko-gpqa : A Korean version of GPQA, containing challenging graduate-level science questions designed to test deep understanding and logical reasoning.
  • ko-ged-mc:elementary, ko-ged-mc:middle, ko-ged-mc:high : Korean GED exam questions spanning multiple subjects at the elementary, middle, and high school levels, used to evaluate language comprehension and reasoning.

[Table 1] Overview of Korean Evaluation Datasets in KoLightEval

ko-math-500

ko-math-500 is a Korean-translated subset of 500 representative problems from the widely used MATH (Mathematics Aptitude Test of Heuristics) dataset, designed to evaluate the mathematical reasoning abilities of large language models.

The original MATH dataset consists of 12,500 high school-level math problems, each with a problem statement and a detailed solution written in English, and serves as a challenging mathematical reasoning benchmark. The ko-math-500 subset is based on the standard 500-problem evaluation set used for model performance comparison in the 2023 paper Let's Verify Step by Step. The original subset is publicly available at HuggingFaceH4/MATH-500.

The Korean translation was performed using the GPT-4o model. The data is structured in JSON format and includes LaTeX math expressions. Each entry contains the following keys:

  • problem : the math question
  • solution : the detailed solution
  • answer : the final answer
  • subject : the math domain (one of seven subjects: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus)
  • level : difficulty level represented by an integer from 1 to 5, where a subject’s easiest problems are assigned a level of 1 and the hardest problems are assigned a level of 5
  • id : a unique identifier for each problem
{
  "problem": "직교 좌표 $(0,3)$를 극 좌표로 변환하시오. 답은 $r > 0$ 및 $0 \\leq \\theta < 2 \\pi$를 만족하는 형식 $(r,\\theta)$로 나타내시오.",
  "solution": "주어진 점 $(0,3)$에 대해 $r$은 다음과 같이 계산된다:\n\\[ r = \\sqrt{0^2 + 3^2} = 3. \\]  \n또한, 원점을 $(0,3)$과 연결하는 선은 양의 $x$축에 대해 $\\frac{\\pi}{2}$의 각도를 형성한다는 것을 알 수 있다.\n\n[asy]  \nunitsize(0.8 cm);  \n\ndraw((-0.5,0)--(3.5,0));  \ndraw((0,-0.5)--(0,3.5));  \ndraw(arc((0,0),3,0,90),red,Arrow(6));  \n\ndot((0,3), red);  \nlabel(\"$(0,3)$\", (0,3), W);  \ndot((3,0), red);  \n[/asy]  \n\n따라서, 극 좌표는 $\\boxed{\\left( 3, \\frac{\\pi}{2} \\right)}$이다.",
  "answer": "\\left( 3, \\frac{\\pi}{2} \\right)",
  "subject": "미적분학 준비",
  "level": 2,
  "id": "test/precalculus/807.json"
}
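
For reference, the schema can be inspected with a few lines of standard Python. The short sketch below is illustrative only: it assumes the translated entries are stored locally as JSON Lines (the file name ko_math_500.jsonl is hypothetical), verifies that each record exposes the six keys listed above, and counts problems per subject and difficulty level.

import json
from collections import Counter

# Hypothetical local path; adjust to wherever the translated split is stored.
PATH = "ko_math_500.jsonl"

subjects, levels = Counter(), Counter()
with open(PATH, encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        # Every entry is expected to expose the six keys described above.
        assert {"problem", "solution", "answer", "subject", "level", "id"} <= entry.keys()
        subjects[entry["subject"]] += 1
        levels[entry["level"]] += 1

print("problems per subject:", dict(subjects))
print("problems per level:", dict(levels))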

ko-gpqa

ko-gpqa is a Korean-translated version of the GPQA (Graduate-Level Google-Proof Q&A) benchmark, which consists of high-difficulty science questions. Introduced in the original GPQA paper, the benchmark is designed to go beyond simple fact retrieval and instead test a model's capacity for deep understanding and logical reasoning. It is particularly useful for evaluating true comprehension and inference capabilities in language models.

As specified in the original paper, the dataset is intended strictly for evaluation purposes and should not be used for training. The original dataset is available at Hugging Face: Idavidrein/gpqa.

The Korean translation was performed using GPT-4o, and the dataset is formatted in JSON structure. Each entry includes the following keys:

  • id : a unique identifier for each problem
  • Question : the question prompt
  • Correct Answer : the correct answer
  • Incorrect Answer 1, Incorrect Answer 2, Incorrect Answer 3 : three incorrect choices
{
  "id": "rec06pnAkLOr2t2mp",
  "Question": "에너지 E1과 E2를 가지는 두 양자 상태의 수명이 각각 10^-9초와 10^-8초입니다. 이 두 에너지 준위를 명확히 구분하고자 합니다. 이들이 명확히 구분될 수 있도록 하는 에너지 차이는 다음 보기 중 어느 것입니까?",
  "Correct Answer": "10^-4 eV",
  "Incorrect Answer 1": "10^-11 eV",
  "Incorrect Answer 2": "10^-8 eV",
  "Incorrect Answer 3": "10^-9 eV"
}

ko-ged-mc

ko-ged-mc is a benchmark dataset designed to evaluate the comprehension and reasoning capabilities of large language models in Korean. It comprises multiple-choice questions extracted from the Korean General Equivalency Diploma (GED) exams administered at three educational levels: elementary, middle, and high school. Specifically, ko-ged-mc contains questions from the second GED exam of 2024, and all exam items are sourced from the official archives managed by the Korean Ministry of Education.

Table 2 summarizes the subject domains covered by the ko-ged-mc dataset across the three educational tiers. For each level, Common Subjects denote the core curriculum assessed uniformly across all exams, whereas Additional Subjects represent specialized topics tested depending on the specific educational stage.

Common Subjects (all levels): Korean, English, Mathematics, Science, Social Studies, Ethics

Additional Subjects by level:
  • Elementary: Art, Physical Education, Music, Practical Arts
  • Middle: Technology & Home Economics, Information Technology
  • High: Technology & Home Economics, Korean History

[Table 2] Subject Areas Included in the ko-ged-mc Dataset by Educational Level

Each dataset is structured in JSON format and contains the following keys:

  • id : a unique positive integer identifier for each question
  • Domain : subject name
  • Question : the question prompt
  • Correct Answer : the correct answer
  • Incorrect Answer 1, Incorrect Answer 2, Incorrect Answer 3 : three incorrect choices

The datasets are separated by educational level (elementary, middle, high) and by subject.

ko-ged-mc:elementary
{
  "id": 1,
  "Domain": "영어",
  "Question": "다음 중 모두 소문자로 쓴 낱말은?",
  "Correct Answer": "jump",
  "Incorrect Answer 1": "Hat",
  "Incorrect Answer 2": "RED",
  "Incorrect Answer 3": "Grapes"
}

ko-ged-mc:middle
{
  "id": 209,
  "Domain": "도덕",
  "Question": "(가)에 들어갈 인물은?\n\n◦(가) → 평화, 비폭력, 인도 독립, 시민 불복종",
  "Correct Answer": "간디",
  "Incorrect Answer 1": "공자",
  "Incorrect Answer 2": "노자",
  "Incorrect Answer 3": "칸트"
}

ko-ged-mc:high
{
  "id": 46,
  "Domain": "수학",
  "Question": "다항식 x³ - 3³ 을 인수분해한 식이 (x-3)(x²+ax+9) 일 때, 상수 a의 값은?",
  "Correct Answer": "3",
  "Incorrect Answer 1": "1",
  "Incorrect Answer 2": "5",
  "Incorrect Answer 3": "7"
}

4. Evaluation Setup Using Lighteval

4.1 Prompt Format Design

As described in Section 3, all datasets except ko-math-500, which requires free-form LaTeX-formatted answers, follow a multiple-choice structure in which the model must select the correct option among predefined choices. Accordingly, two distinct prompt templates were designed: one for mathematical reasoning tasks and another for multiple-choice questions.

These templates were adapted from the OpenAI simple-evals repository and translated into Korean to ensure consistency with the language used in the evaluation tasks.

Prompt for Mathematical Reasoning Tasks

For math-related tasks, the prompt enforces a consistent output format to facilitate reliable parsing and automatic evaluation. The following template was used:

주어진 수학 문제를 효율적이고 명확하게 풀어보세요. 응답의 마지막 줄은 다음 형식이어야 합니다: '따라서, 최종 답은: $\\boxed{{ANSWER}}$입니다. 이 답이 맞기를 바랍니다.' (따옴표 없이) 이때 ANSWER는 문제를 푸는 최종 숫자나 식이 되어야 합니다. 답하기 전에 한 단계씩 생각해 보세요.

{Question}

This format allows the evaluation system to extract the model’s final answer from the boxed expression (e.g., $\\boxed{42}$) for automatic comparison with the reference answer.
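
As a minimal illustration of this parsing step (not Lighteval's actual extraction logic), the final answer can be recovered by locating the last \boxed{ occurrence and matching braces, which correctly handles nested groups such as \frac{\pi}{2}:

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    for j in range(i, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return text[i:j]
    return None  # unbalanced braces

# Example: recovers the reference answer of the ko-math-500 sample entry above.
print(extract_boxed("따라서, 최종 답은: $\\boxed{\\left( 3, \\frac{\\pi}{2} \\right)}$입니다."))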

Prompt for Multiple-Choice Questions

For multiple-choice tasks, the correct answer is randomly assigned to one of four options labeled A through D. The remaining three options are filled with incorrect answers. The model is instructed to select one of the options and indicate its response using a standardized format. The prompt is structured as follows:

다음의 객관식 질문에 답하세요. 답변의 마지막 줄은 'Answer: $LETTER' (without quotes) 형식이어야 하며, 여기서 LETTER는 ABCD 중 하나입니다. 답변하기 전에 단계별로 생각하세요.

{Question}

A) {A}
B) {B}
C) {C}
D) {D}

This format enables straightforward extraction of the model’s predicted answer and ensures compatibility with automatic evaluation scripts.
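
Concretely, a prompt function for the multiple-choice tasks can be sketched as follows. This is a minimal illustration rather than the actual KoLightEval implementation: it assumes Lighteval's Doc interface from lighteval.tasks.requests (field names may vary by version) and the ko-gpqa/ko-ged-mc keys introduced in Section 3, and the MC_TEMPLATE constant simply reproduces the prompt shown above.

import random

from lighteval.tasks.requests import Doc  # assumed import path; may differ by version

MC_TEMPLATE = (
    "다음의 객관식 질문에 답하세요. 답변의 마지막 줄은 'Answer: $LETTER' (without quotes) "
    "형식이어야 하며, 여기서 LETTER는 ABCD 중 하나입니다. 답변하기 전에 단계별로 생각하세요.\n\n"
    "{question}\n\nA) {a}\nB) {b}\nC) {c}\nD) {d}"
)

def ko_mc_prompt(line: dict, task_name: str = None) -> Doc:
    # Shuffle the gold answer into a random position, seeded per item for reproducibility.
    rng = random.Random(line["id"])
    choices = [
        line["Correct Answer"],
        line["Incorrect Answer 1"],
        line["Incorrect Answer 2"],
        line["Incorrect Answer 3"],
    ]
    rng.shuffle(choices)
    gold_index = choices.index(line["Correct Answer"])
    query = MC_TEMPLATE.format(
        question=line["Question"], a=choices[0], b=choices[1], c=choices[2], d=choices[3]
    )
    # Choices are the answer letters; the extractive metric compares the predicted letter index.
    return Doc(
        task_name=task_name,
        query=query,
        choices=["A", "B", "C", "D"],
        gold_index=gold_index,
    )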

4.2 Evaluation Metrics

To evaluate model performance, we used extractive match-based metrics tailored to each task type: math and multiple-choice. For math tasks, the evaluation extracts the final boxed answer from the model’s LaTeX-formatted output and compares it to the reference. This method is more robust than string matching, as it tolerates variation in mathematical notation while focusing on the actual solution. For multiple-choice questions, the model's predicted letter (A–D) is extracted and compared with the correct option. To reduce positional bias, choices are randomly shuffled, and only the final answer letter is considered during evaluation.

Both metrics use a precision of five decimal places and include a fallback mode to handle minor formatting inconsistencies in model outputs. Although the datasets and model responses are in Korean, the evaluation language is set to English to align with the LaTeX syntax and alphabet-based answer formats used in the parsing logic. This ensures the correct functioning of extraction and comparison processes. Overall, extractive matching improves reliability by focusing on the core answer elements and supports scalable, automated evaluation.
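
In code, the two metrics can be instantiated roughly as shown below. This sketch assumes Lighteval's dynamic extractive-match API (multilingual_extractive_match_metric together with the LatexExtractionConfig and IndicesExtractionConfig targets); the import paths, argument names, and accepted values follow recent Lighteval releases and may differ in other versions.

from lighteval.metrics.dynamic_metrics import (  # assumed import path
    IndicesExtractionConfig,
    LatexExtractionConfig,
    multilingual_extractive_match_metric,
)
from lighteval.utils.language import Language  # assumed import path

# Math tasks: extract the boxed LaTeX answer and compare it to the reference expression.
ko_math_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,               # parsing follows LaTeX/English conventions
    gold_extraction_target=[LatexExtractionConfig()],
    pred_extraction_target=[LatexExtractionConfig()],
    precision=5,
    fallback_mode="first_match",             # tolerate minor formatting inconsistencies
)

# Multiple-choice tasks: extract the final answer letter (A-D) and compare choice indices.
ko_mc_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    gold_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    pred_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    precision=5,
    fallback_mode="first_match",
)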

4.3 Task Implementation with LightevalTaskConfig

Building on the prompt and metric designs introduced above, task definitions were implemented using the LightevalTaskConfig class provided by Hugging Face's Lighteval framework. This class allows seamless integration of custom tasks by encapsulating dataset loading, prompt formatting, and metric configuration in a single, reusable structure.

We defined new task instances tailored for Korean-language datasets, including both LaTeX-formatted math problems and multiple-choice questions. The modular nature of LightevalTaskConfig allowed us to incorporate task-specific behaviors—such as randomizing answer positions in multiple-choice prompts—without altering the core framework. This setup promotes clarity and reproducibility while supporting extensibility for future additions.
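
A representative task definition, as it might appear in the custom_tasks.py module passed to the CLI, is sketched below. The hf_repo value is a placeholder, ko_mc_prompt and ko_mc_metric refer to the sketches in Sections 4.1 and 4.2, and the LightevalTaskConfig field names are assumptions based on recent Lighteval releases rather than the exact KoLightEval code.

from lighteval.tasks.lighteval_task import LightevalTaskConfig  # assumed import path

ko_ged_elementary = LightevalTaskConfig(
    name="ko_ged:elementary",
    prompt_function=ko_mc_prompt,      # prompt builder sketched in Section 4.1
    suite=["custom"],                  # matches the "custom|..." task spec used by the CLI
    hf_repo="your-org/ko-ged-mc",      # placeholder dataset repository
    hf_subset="elementary",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    generation_size=2048,
    metric=[ko_mc_metric],             # extractive-match metric sketched in Section 4.2
)

# Lighteval discovers custom tasks through this module-level table.
TASKS_TABLE = [ko_ged_elementary]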

KoLightEval can be run with models served via the vLLM backend. For example, the following command evaluates the model on a single Korean task:

lighteval vllm \
    "model_name=davidkim205/ko-gemma-3-12b,dtype=bfloat16" \
    "custom|ko_ged:elementary|0|0" \
    --custom-tasks path/to/custom_tasks.py \
    --use-chat-template

The task definitions are provided via the --custom-tasks option and are part of the KoLightEval framework.

By leveraging the existing Lighteval infrastructure and contributing Korean-language evaluation tasks in a self-contained way, our implementation demonstrates a practical use case for adapting multilingual benchmarks within a unified evaluation system.

5. Experimental Evaluation

5.1 Model Selection

To evaluate performance, six large language models with diverse characteristics and sizes were selected. OpenAI's GPT-4.1 represents one of the most advanced commercial models, while Google's Gemma-3 and Alibaba's Qwen-2.5 were included as publicly available multilingual models. Additionally, two Korean-specialized models developed domestically, LG AI Research's EXAONE-3.5 and Kakao's Kanana-1.5, were chosen to assess Korean-optimized performance. Finally, Ko-Gemma-3, a fine-tuned variant built on the Gemma-3 architecture, was included to examine the effect of fine-tuning within the same model family.

These models vary in size from approximately 7 billion to 12 billion parameters. Although GPT-4.1's exact size is undisclosed, it is known to be significantly larger and trained on more extensive data. Ko-Gemma-3 and Gemma-3 have around 12 billion parameters, while EXAONE and Kanana have about 7.8 to 8 billion. GPT-4.1, Qwen-2.5, and Gemma-3 are designed as multilingual models, whereas EXAONE, Kanana, and Ko-Gemma-3 are optimized primarily for Korean. This setup allows for a balanced comparison of how such design choices affect performance on Korean-language tasks.

5.2 Score Comparison

Figure 1 visualizes the performance scores of each model across benchmark datasets using radar charts, with accuracy normalized on a 0 to 10 scale to facilitate comparison among models.
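
The normalization is assumed to be a simple linear rescaling of accuracy, score = 10 × accuracy, so that, for example, an accuracy of 77.8% on ko-math-500 corresponds to the score of 7.78 reported in Figure 1 and Table A.1.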

As shown in Figure 1, scores on ko-gpqa and ko-math-500 were lower than those on ko-ged-mc across all models, reflecting the higher difficulty of these datasets. In particular, ko-gpqa consists of graduate-level science questions that demand specialized domain knowledge and multi-step reasoning, which remain challenging for general-purpose language models.

GPT-4.1 achieved the highest average scores across all benchmarks; however, on the ko-math-500 dataset, Gemma-3 and Ko-Gemma-3 models scored 0.78 points higher than GPT-4.1. These two models share the same architecture, and while Ko-Gemma-3 is a fine-tuned variant of Gemma-3, the performance difference between them was minimal. This indicates that while GPT-4.1 is generally the strongest model overall, domain-specific or fine-tuned models can outperform it on specialized tasks such as mathematical reasoning.

Interestingly, Korean-specialized models such as EXAONE and Kanana performed comparably to Gemma-3 on the ko-ged-mc dataset, demonstrating their effectiveness on Korean school-curriculum comprehension questions. However, on the ko-math-500 dataset they showed a significant performance gap relative to the Gemma-based models, suggesting limitations in mathematical reasoning or in the coverage of their training data.

[Figure 1] Performance Comparison by Benchmark Dataset

6. Discussion

Insights into the Performance of Korean LLMs

A key finding of this evaluation is the minimal performance difference observed between Ko-Gemma-3 (the fine-tuned variant) and Gemma-3. This suggests either that Gemma-3 already possesses robust Korean capabilities or that the current fine-tuning process had limited impact; further investigation is required. The outcome may also vary with the quantity and quality of the fine-tuning data and the methodology applied, indicating the need for further experiments under diverse conditions.

Although GPT-4.1 consistently demonstrated the highest overall performance, Gemma-based models slightly outperformed GPT-4.1 on more challenging mathematical tasks (e.g., ko-math-500). This result illustrates that in certain domains or task types, factors such as model architecture and training data characteristics can influence performance outcomes.

Future Improvements and Research Directions

Enhancing the realism and reliability of Korean LLM evaluations requires expanding the scope and diversity of benchmark datasets. Existing benchmarks primarily consist of short-answer tasks with well-defined answers, which fall short of capturing the complex language demands encountered in real-world user interactions—such as long-form generation, style adaptation, multi-document summarization, and context-aware question answering.

Addressing these limitations will require the inclusion of more realistic benchmarks, such as scenario-based tasks across specific domains, generative tasks, and evaluations incorporating user feedback. In parallel, evaluation metrics should evolve beyond accuracy-focused approaches to incorporate multi-dimensional indicators of fluency, contextual coherence, and informational completeness. These enhancements will help provide a clearer picture of both the practical capabilities and the limitations of current models.

7. Conclusion

We introduced the KoLightEval framework, designed to provide a more realistic and multifaceted evaluation of Korean large language models (LLMs). Utilizing this framework, we conducted a systematic comparison of Korean comprehension and generation capabilities across various domestic and international models. Our results indicate limited improvement from fine-tuning, while revealing distinct performance characteristics among models. A key observation is the minimal performance difference between Ko-Gemma-3 and Gemma-3. Moreover, Gemma-based models showed comparable or slightly superior performance to commercial large-scale models on certain challenging tasks.

In addition, this study highlighted certain limitations of existing Korean benchmark datasets and evaluation metrics, emphasizing the need for an evaluation framework that better reflects realistic and diverse user scenarios. Future work includes expanding evaluation datasets and developing composite metrics to improve Korean LLM assessment.

Appendix

Detailed Benchmark Scores by Model

Table A.1 presents detailed scores for each model across the benchmark datasets. For brevity, the ko- prefix (e.g., ko-ged-mc, ko-gpqa) is omitted from column names, and the suffixes E, M, and H denote the elementary, middle, and high school levels.

Model                         ged-mc:E   ged-mc:M   ged-mc:H   gpqa   math-500
GPT-4.1                       10.00      9.88       10.00      6.31   7.00
Gemma-3-12b-it                9.58       9.47       9.12       3.94   7.78
Ko-Gemma-3-12b (Ours)         9.65       9.67       9.24       3.84   7.78
EXAONE-3.5-7.8B-Instruct      9.58       9.55       9.20       3.64   6.26
Kanana-1.5-8b-instruct-2505   9.72       9.02       9.60       3.08   5.14
Qwen2.5-7B-Instruct           8.75       8.78       8.67       3.43   6.26

[Table A.1] Model performance scores across benchmark datasets (normalized to 0–10 scale)