This report presents ko-ged, a benchmark dataset developed to evaluate large language models (LLMs) in Korean language comprehension and reasoning. The dataset is derived from authentic exam questions of Korea’s official elementary, middle, and high school General Equivalency Diploma (GED) tests, administered by the Korean Ministry of Education.
ko-ged is constructed from authentic test items authored by official exam committees, including those responsible for national equivalency exams and public education assessments in Korea. All questions are carefully selected to align as closely as possible with the curriculum and evaluation standards of each grade level in elementary, middle, and high school.
The dataset spans a wide range of subjects, including Korean, Mathematics, Social Studies, Science, English, Art, Music, Physical Education, Ethics, Technology & Home Economics, Korean History, Information Technology, and Practical Arts.
Within each subject area, questions are distributed to ensure a balanced range of difficulty levels as well as cognitive processes, reflecting the variety of reasoning skills assessed across Korea’s elementary, middle, and high school equivalency and standardized exams.
Most items are designed as short-answer questions to support automated scoring. A subset uses generative formats to evaluate language models’ ability to produce coherent written responses. Each item is annotated with a reference answer and, when applicable, a rationale. These annotations enable flexible evaluation strategies, such as exact match, partial credit, and qualitative assessment.
The ko-ged dataset consists of two versions: the original ko-ged and the extended ko-ged-2501.
The original ko-ged is based on actual past Korean equivalency exams—similar to the GED but separately administered for elementary, middle, and high school levels. It covers five core subjects—Korean, English, Mathematics, Science, and Social Studies—across all three school levels, with 10 questions per subject per level, totaling 150 questions. A uniform sampling strategy ensures balanced coverage, making it ideal for targeted benchmarking.
In contrast, the extended version, ko-ged-2501, comprises 293 questions drawn from the 2025 first-round Korean GED exam. It expands coverage to additional subjects and offers a broader variety of question types, better reflecting real exam complexity and enabling comprehensive model evaluation.
All exam items in ko-ged are stored as JSON objects with fields such as `question_id`, `category` (indicating grade level and subject), `turns` (containing the question text), and `reference` (including reference answers and, when applicable, explanations). For example, a question might be represented as:
{ "question_id": 33, "category": "초등-수학", "turns": ["사각뿔이 있다. 이 사각뿔의 밑면과 만나는 면의 개수는?"], "reference": ["4개"] }
{ "question_id": 2, "category": "초등-국어", "turns": [ "다음 내용에서 괄호 부분에 어울리는 한국 속담을 쓰세요. \"민수 : 어제 자전거를 타다가 넘어져서 다쳤어. 수빈 : 많이 다쳤어? (자전거를 정말 잘 타는 사람도 넘어질 수가 있어.) 그러니 항상 조심해.\"" ], "reference": [ "속담으로는 \"원숭이도 나무에서 떨어질 때가 있다\"가 적절합니다. 따라서 문장은 \"원숭이도 나무에서 떨어질 때가 있어.\"로 완성될 수 있습니다." ] }
All exam items in ko-ged are sourced from the official archive maintained by the Korean Ministry of Education. These materials are publicly accessible and provided for educational use. We curated text-based questions suitable for language model evaluation and excluded items relying on images, charts, or oral instructions. The collected items were transcribed and standardized through minimal preprocessing, including text normalization, converting multiple-choice questions into short-answer formats to better fit LLM evaluation, and manual formatting verification. Quality assurance steps included manual review to correct transcription errors and ensure consistency across items.
In the ko-ged evaluation setup, a single, fixed-format prompt was used to assess all model outputs, regardless of subject, question type (short-answer or generative), or difficulty. Each prompt consisted of three main components: evaluation criteria, the original question, and the model-generated answer. The evaluation criteria were defined explicitly to guide the scoring language model, while the other two components were filled with the corresponding input and output.
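A minimal sketch of what such a three-part prompt template might look like is shown below. Only the three-component structure (criteria, question, model answer) comes from the report; the instruction wording is illustrative, loosely based on the scoring rules described later in this section.

```python
# Illustrative judge prompt template; the criteria text is an assumption,
# not the exact wording used in the ko-ged evaluation.
JUDGE_PROMPT_TEMPLATE = """[Evaluation Criteria]
Evaluate the answer for usefulness, relevance to the question, factual accuracy,
depth of content, creativity, and level of detail. If the answer is not written
in the same language as the question (except where an English-subject question
specifies otherwise), assign a score of 0. Conclude with "Score: N" where N is
an integer from 0 to 10.

[Question]
{question}

[Model Answer]
{answer}
"""

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the fixed template with one question and one model-generated answer."""
    return JUDGE_PROMPT_TEMPLATE.format(question=question, answer=answer)
```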
Using a unified prompt across all items offered several advantages. First, it ensured fairness, as all responses were judged under the same conditions, reducing potential bias toward specific formats or domains. Second, it enabled automation and scalability, allowing large-scale evaluations to be conducted efficiently using a consistent framework. Third, it promoted reproducibility, as the fixed evaluation structure made it easier to replicate the process or apply it in future experiments with minimal adjustments.
To ensure clarity, we briefly define the LLM-as-a-judge paradigm: it refers to the approach where large language models themselves are employed as evaluators to assess the quality and correctness of generated responses, enabling scalable and automated scoring without human intervention.
The ko-ged evaluation procedure follows a three-step flow: (1) generating model responses for each question from the target language model, (2) submitting these answers within a fixed evaluation prompt to GPT-4o for scoring, and (3) collecting and organizing the evaluation outputs.
The process is fully automated, with each model answer and its corresponding evaluation result (model judgment) stored separately. The model judgment includes both a qualitative assessment of the response and an integer score ranging from 0 to 10. This enables not only detailed qualitative feedback for individual answers but also quantitative comparisons across different models.
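A minimal sketch of the judging and collection steps, assuming the OpenAI Python client and the hypothetical `build_judge_prompt` helper sketched above; the `Score: N` output convention is likewise an assumption about how the judgment text is structured.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, answer: str) -> tuple[str, int]:
    """Submit one model answer to GPT-4o and return (judgment text, integer score 0-10)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_judge_prompt(question, answer)}],
        temperature=0,      # deterministic judging, as in the report
        max_tokens=2048,
    )
    judgment = response.choices[0].message.content
    match = re.search(r"Score:\s*(\d{1,2})", judgment)
    score = int(match.group(1)) if match else 0  # fall back to 0 if no score is found
    return judgment, score
```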
A fundamental scoring criterion in ko-ged is the consistency of language between the question and the response. If the main language of the response differs from that of the question, the response is assigned a score of 0. This rule ensures fairness and consistency in evaluation, and reflects the practical limitations of using LLMs that do not respond in the expected language in real Korean-language settings. An exception is made for English subject questions, where the expected response language (Korean or English) is explicitly indicated in the question.
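For illustration only, a programmatic pre-check along the following lines could enforce the same rule; the Hangul-ratio heuristic and its threshold are assumptions, not the report’s implementation (which expresses the rule through the judge prompt).

```python
def hangul_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hangul syllables or jamo."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hangul = [ch for ch in letters
              if "\uac00" <= ch <= "\ud7a3" or "\u3131" <= ch <= "\u318e"]
    return len(hangul) / len(letters)

def language_consistent(question: str, answer: str, threshold: float = 0.5) -> bool:
    """Rough check that the answer's main language matches the question's."""
    return (hangul_ratio(question) >= threshold) == (hangul_ratio(answer) >= threshold)

# A response flagged as inconsistent would receive a score of 0.
```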
Beyond simple correctness, the evaluation considers multiple qualitative factors including the usefulness, relevance to the question, factual accuracy, depth of content, creativity, and level of detail in the response. This rubric is designed to differentiate the quality of both short-answer and generative responses.
The overall performance of each model is measured by the average judgment score across all evaluated items. This approach enables quantitative comparisons between models and ensures consistent evaluation across diverse question types.
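Concretely, the overall score reduces to a simple mean over per-item judge scores; a minimal sketch, assuming a flat list of (model name, score) records:

```python
from collections import defaultdict

def average_scores(records: list[tuple[str, int]]) -> dict[str, float]:
    """Average 0-10 judge scores per model from (model_name, score) records."""
    totals = defaultdict(lambda: [0, 0])  # model -> [sum of scores, item count]
    for model, score in records:
        totals[model][0] += score
        totals[model][1] += 1
    return {model: s / n for model, (s, n) in totals.items()}

# Hypothetical usage with made-up scores:
print(average_scores([("model-a", 9), ("model-a", 10), ("model-b", 4)]))
```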
All model generations were conducted with `temperature=0.7` and `max_new_tokens=1024` to allow for varied but relevant responses. For evaluation, GPT-4o was used as the scoring model with `temperature=0` and `max_tokens=2048`, ensuring deterministic and consistent judgments across all responses.
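As a sketch, these generation settings correspond to a Hugging Face `transformers` call along the following lines; the checkpoint name is a placeholder, not a model from the report.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-model"  # placeholder for any evaluated checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "사각뿔이 있다. 이 사각뿔의 밑면과 만나는 면의 개수는?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,       # sampling so that temperature takes effect
    temperature=0.7,      # generation setting from the report
    max_new_tokens=1024,  # generation setting from the report
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```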
Category | Model Source | Models Evaluated |
---|---|---|
General-purpose LLMs | OpenAI | GPT-4.1, GPT-4o |
 | Google | Gemma 2, Gemma 3 |
 | Qwen | Qwen 2, Qwen 2.5 |
 | Meta (LLaMA) | LLaMA 3.1, LLaMA 3.2 |
Korean-developed LLMs | Kakao | kanana 1.5 |
 | LGAI-EXAONE | EXAONE 3.0, EXAONE 3.5, EXAONE Deep |
 | davidkim205 | ko-Gemma 2, ko-Gemma 3 (Ours) |
To ensure fairness and reproducibility, all models were evaluated under the same generation and scoring configuration. This uniform setup enables direct comparison across a diverse range of LLMs, including globally recognized general-purpose models, Korean-language specialized systems, and our fine-tuned variants trained on ko-ged.
All benchmarking experiments were conducted under a unified evaluation framework described in Section 4, using a fixed prompt format and GPT-4o as the scoring model. Each model was evaluated with identical generation settings to ensure fairness and reproducibility. The results reported below reflect average scores on a 0–10 scale, derived from automated LLM-as-a-judge assessments.
The performance comparison of seven representative models—GPT-4.1, Gemma-1.1-7B, Qwen-2-7B, LLaMA-3.1-8B, EXAONE-3.5, Ko-Gemma-3, and Kanana-1.5-8B—on both ko-ged and ko-ged-2501 datasets is summarized in Figure 1. Average scores were computed by aggregating results across all subjects and school levels.
Figure 1 illustrates that most models experienced a performance decrease on the extended ko-ged-2501 compared to the original ko-ged, likely due to the broader subject coverage and higher difficulty of the expanded dataset. GPT-4.1 consistently demonstrated strong performance with average scores above 9 on both datasets, substantially surpassing other models. Conversely, LLaMA-3.1-8B-Instruct scored notably lower, reflecting limited effectiveness under these evaluation conditions.
Of particular interest, Kanana-1.5-8B scored higher on ko-ged-2501 than on the original ko-ged, suggesting enhanced adaptability to the expanded range of subjects and the more authentic exam-like format of the extended dataset. This may be attributed to its training on a larger volume of Korean-specific data, which likely improved its capacity to handle the diverse Korean language tasks reflected in ko-ged-2501.
A detailed breakdown by school difficulty levels is presented in Figure 2, which compares model performances on ko-ged and ko-ged-2501.
Figure 2 reveals that on the original ko-ged dataset, model scores were generally higher on middle school questions compared to elementary level. In contrast, the extended ko-ged-2501 dataset exhibited a trend of decreasing scores with increasing difficulty across most models. Notably, the Qwen model maintained relatively stronger performance on middle and high school levels compared to elementary, warranting further study to elucidate underlying factors.
Subject-specific performance results on the ko-ged dataset are depicted in Figure 3.
Figure 3 shows that models tend to perform best in Mathematics and English, with relatively lower scores in Korean language tasks. GPT-4.1, EXAONE, and Kanana—models with either large-scale or Korean-specialized training—demonstrate superior results in Social Studies. The combination of GPT-4.1’s extensive model capacity and Kanana’s intensive Korean language exposure likely accounts for their advantage in this subject area.
For the ko-ged-2501 dataset, subject-wise performance distributions are shown in Figure 4.
Figure 4 highlights that Mathematics consistently achieves high scores, accompanied by strong performance in Information Technology (IT). Except for GPT-4.1 and Kanana, models generally exhibit weaker results in Korean History, which may reflect limited Korean-centric training data or less domain familiarity. This contrast underscores the value of Korean-specialized training in improving performance on culturally and linguistically specific subjects.
Together, these results emphasize distinct model strengths and weaknesses by subject and demonstrate the suitability of the ko-ged and ko-ged-2501 datasets for comprehensive evaluation of language models on diverse academic topics and difficulty levels.
While the ko-ged benchmark provides a valuable and authentic evaluation resource aligned with Korean educational standards, several limitations remain. First, the dataset excludes exam items relying on images, graphs, or oral components, which limits the scope of skills evaluated to text-based understanding and reasoning. Second, despite careful curation, some subject areas and question types are underrepresented, especially in the extended ko-ged-2501 version, which may affect the generalizability of evaluation results. Third, using an LLM (GPT-4o) as an automated judge, although efficient and scalable, may introduce biases or inconsistencies compared to human evaluation.
To address these limitations, the ko-ged benchmark needs to expand beyond text-based evaluation to include multimodal questions. By incorporating items with various non-textual materials such as charts, graphs, images, and audio, it can better assess multimodal comprehension and complex reasoning skills. Additionally, the automatic scoring system for generative answers should be improved through comparative analysis with human evaluators to enhance reliability. It is also necessary to supplement currently underrepresented subjects and question types to achieve a more balanced distribution across educational levels. Furthermore, diversifying the LLM evaluator models used for scoring and thoroughly analyzing potential biases and consistency issues should be actively pursued. These efforts will help ko-ged evolve into a comprehensive and reliable evaluation tool that more accurately reflects the real educational environment in Korea.
This report presented the ko-ged benchmark, a valuable resource for evaluating large language models’ understanding and reasoning in Korean based on authentic exam questions aligned with the national curriculum. By covering diverse subjects and difficulty levels, ko-ged provides a fair and systematic framework to compare both domestic and international LLMs. Evaluation results show that models like GPT-4.1 and Korean-specialized LLMs perform strongly. The unified, automated evaluation protocol enables scalable and reproducible assessments. Looking ahead, expanding ko-ged to include multimodal questions, improving the evaluation framework, and incorporating a variety of evaluator models will enhance its comprehensiveness and reliability. These efforts will further advance research at the intersection of Korean education and artificial intelligence.
Number of questions per subject and school level in the original ko-ged (150 items in total):

Subject | Elementary | Middle | High |
---|---|---|---|
Korean | 10 | 10 | 10 |
Math | 10 | 10 | 10 |
Social Studies | 10 | 10 | 10 |
Science | 10 | 10 | 10 |
English | 10 | 10 | 10 |
Number of questions per subject and school level in ko-ged-2501 (293 items in total; “-” indicates that no questions are included for that subject at that level):

Subject | Elementary | Middle | High |
---|---|---|---|
Korean | 2 | 2 | - |
Math | 15 | 16 | 20 |
Social Studies | 10 | 21 | 10 |
Science | 10 | 13 | 13 |
English | 7 | 5 | 7 |
Art | 5 | 13 | 12 |
Music | 5 | 6 | 10 |
Physical Education | 3 | 13 | 12 |
Ethics | - | 5 | 10 |
Technology & Home Economics | - | 10 | 9 |
Korean History | - | - | 17 |
Information Technology | - | 8 | - |
Practical Arts | 4 | - | - |