Ko-IFEval: A Human-Verified Instruction-Following Benchmark for Korean LLMs

12Digit AI Research

1. Introduction

Instruction-following capability is a critical factor in evaluating the usability, reliability, and practical effectiveness of large language models (LLMs). IFEval is a benchmark designed to assess this capability in a quantifiable manner by evaluating a model's compliance with various types of automatically verifiable instructions. However, since IFEval is constructed entirely in English, it poses challenges to fair and accurate evaluation of Korean LLMs. This report introduces Ko-IFEval, a new benchmark developed to address these issues. Ko-IFEval consists of a human-verified dataset in which IFEval’s instructions have been translated and adapted to reflect Korean linguistic and cultural characteristics. It also includes an automatic evaluation tool tailored for Korean text. Together, they provide a more reliable and culturally appropriate benchmark for precisely evaluating the instruction-following ability of Korean LLMs.

2. Importance and Challenges of Instruction-Following Evaluation for Korean LLMs

Evaluating instruction-following ability is essential for measuring the practical effectiveness and user trustworthiness of LLMs. How accurately a model understands and executes diverse user instructions across various contexts directly impacts its reliability and quality in real-world applications. Therefore, developing benchmarks that quantitatively evaluate instruction-following performance plays a vital role in advancing LLM research and development.

IFEval is an English-based benchmark that covers diverse types of instructions and features an automatic scoring system, enabling relatively fair evaluation of a model’s instruction-following ability. However, directly applying IFEval to Korean LLMs compromises evaluation accuracy and fairness due to linguistic structural differences, cultural mismatches, and limitations of automatic scoring tools designed around English grammar.

  • Linguistic differences: Korean sentence boundaries can be ambiguous due to particles and morphological variation, unlike English, which has clearer sentence demarcation and intuitive word counts.
  • Cultural mismatch: Direct translations often retain unfamiliar names and cultural references irrelevant to Korean contexts, reducing evaluation fairness.
  • Automated evaluation accuracy: English-based scoring tools struggle with Korean grammar and syntax, causing evaluation inconsistencies.

Currently, there is a lack of publicly available instruction-following benchmarks tailored specifically for Korean. Ko-IFEval addresses this gap by incorporating linguistic and cultural adaptations, with all data verified by human reviewers to ensure reliable evaluation.

3. Dataset Construction

IFEval is a benchmark consisting of diverse instruction types, with an automatic scoring mechanism based on predefined logic. Applying this framework to Korean LLMs requires not only accurate translation but also adaptation of evaluation logic suited to Korean language characteristics. Ko-IFEval is a Korean instruction-following benchmark constructed through this process, with all data manually verified to ensure linguistic accuracy and logical consistency. The full list of categories and instance counts is provided in Appendix Table A.1.

Ko-IFEval was constructed through the following three steps:

  1. Translation of prompts using GPT-4o
  2. Removal and modification of conditions incompatible with Korean linguistic structures
  3. Adaptation of prompts to reflect Korean cultural context

3.1 Removed Categories

We excluded categories relying on English-specific linguistic features or irrelevant for Korean evaluation:

  • English-dependent categories: change_case:capital_word_frequency, change_case:english_capital, change_case:english_lowercase
  • Korean-irrelevant category: language:response_language

3.2 Post-translation Adjustments

Several modifications were applied to ensure instructions are linguistically clear and culturally relevant for Korean:

  • Clarifying word count vs. character count: The length_constraints:number_words category was translated to refer to "어절" (space-separated word units in Korean). To support character-based constraints, a separate condition, length_constraints:number_letters, was introduced (a brief sketch of the distinction appears after this list).
  • Adjusting paragraph and quotation conditions: When both length_constraints:nth_paragraph_first_word and startend:quotation conditions co-occur, the paragraph indices were adjusted to avoid logical conflicts.
  • Refining letter frequency constraints: In keywords:letter_frequency, thresholds were calibrated to better fit the distribution of characters in Korean. When literal application would cause excessive difficulty, thresholds were adjusted. For example, a prompt requiring a high frequency of the letter "o" was adapted as follows:
    # Original
    Write a letter to your friend who recently moved away. Your entire response should be in English, and in all capital letters. The letter o should appear at least 40 times.
    
    # Modified
    최근 이사 간 친구에게 편지를 써주세요. 글자 '오'를 최소 13번 이상 포함해야 합니다.
    (Translation: Write a letter to a friend who recently moved away. The character '오' must appear at least 13 times.)
  • Restricting constraint combinations: The combination:repeat_prompt condition enforces that the response must begin with the prompt itself. Therefore, it is only paired with length_constraints:number_sentences, length_constraints:number_words, or length_constraints:number_letters constraints.
  • Adapting to Korean cultural context: Unfamiliar names and culturally irrelevant topics in prompts were replaced with localized content.
    # Original
    Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli".
    
    # Modified
    위키백과 페이지 "https://ko.wikipedia.org/wiki/이순신"의 내용을 요약하여 300자 이상으로 작성하시오.
    (Translation: Summarize the content of the Wikipedia page "https://ko.wikipedia.org/wiki/이순신" in at least 300 characters.)
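
To make the word-versus-character distinction above concrete, the snippet below is a minimal sketch of how the two quantities can be counted; the function names are hypothetical and do not correspond to the Ko-IFEval evaluation code.

    def count_eojeol(text: str) -> int:
        # 어절: whitespace-separated units, the analogue of English word count.
        return len(text.split())

    def count_letters(text: str) -> int:
        # Character count with whitespace excluded; punctuation is counted here
        # (whether punctuation counts is an assumption of this sketch).
        return len("".join(text.split()))

    sample = "최근 이사 간 친구에게 편지를 써주세요."
    print(count_eojeol(sample))   # 6 (어절)
    print(count_letters(sample))  # 17 (characters, including the final period)

For this sample prompt the 어절 count is 6 while the character count is 17, which is why the two constraint types require separate thresholds.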

4. Evaluation Framework

The IFEval framework evaluates model responses along two axes: Strict vs. Loose and Prompt-level vs. Instruction-level.

  • Strict scoring judges the model’s raw response against each instruction with no post-processing.
  • Loose scoring applies post-processing, such as removing markdown symbols and introductory or closing phrases, to avoid false alarms caused by formatting or other non-substantive content.

Strict evaluation may lead to false negatives (correct responses judged as incorrect), while Loose evaluation can cause false positives (incorrect responses judged as correct).

Additionally, a single prompt may contain multiple instructions:

  • Prompt-level evaluation considers a response correct only if all instructions are satisfied.
  • Instruction-level evaluation judges each instruction independently.

To ensure reliable and consistent evaluation, Ko-IFEval uses only Strict criteria, applied at both the prompt and instruction levels.
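
To illustrate the two aggregation levels, the sketch below computes prompt-level and instruction-level accuracy from per-instruction pass/fail flags; the data layout and function name are assumptions made for this example rather than the benchmark's actual interface.

    def aggregate(results: dict[str, list[bool]]) -> dict[str, float]:
        # results maps each prompt id to the pass/fail flag of every
        # instruction it contains.
        prompts = list(results.values())
        # Prompt-level: a prompt counts as correct only if all of its
        # instructions are satisfied.
        prompt_acc = sum(all(flags) for flags in prompts) / len(prompts)
        # Instruction-level: every instruction is judged independently.
        all_flags = [f for flags in prompts for f in flags]
        instr_acc = sum(all_flags) / len(all_flags)
        return {"prompt_level": prompt_acc, "instruction_level": instr_acc}

    # Two prompts; the second fails one of its two instructions.
    print(aggregate({"p1": [True, True], "p2": [True, False]}))
    # {'prompt_level': 0.5, 'instruction_level': 0.75}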

Sentence Counting Logic

Each evaluation rule was implemented independently for each instruction_id. Since the logic for each condition is intuitive and well-documented in the IFEval paper, we focus here only on the adjustments made to sentence counting in Korean.

The original IFEval implementation uses the nltk tokenizer to segment English sentences. However, this tool is not suitable for Korean due to its English-centric design. Although Korean-specific tokenizers exist, we opted to implement a rule-based sentence segmentation logic to maximize control over boundary conditions and eliminate dependencies on external packages.

The sentence counting procedure for model outputs is as follows:

  1. Split the text into paragraphs based on line breaks.
  2. If a paragraph ends with a comma, merge it with the following paragraph.
  3. Within each paragraph, segment sentences based on the pattern of a Korean character followed by a sentence-ending punctuation mark (., ?, or !).
    • A space following the punctuation mark is required to consider it a sentence boundary.
    • Quoted text is ignored for boundary detection.
    • Examples:
      • `그냥...왜 그럴까?` ("Just... why is that?") → 1 sentence
      • `철수는 "언제 집에 갈까?"라고 물었습니다.` ("Cheolsu asked, 'When shall we go home?'") → 1 sentence
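
A minimal sketch of this procedure is given below, assuming ASCII double quotes for quoted text and treating the end of a paragraph like a trailing space; it illustrates the rules above rather than reproducing the released evaluation code.

    import re

    def count_sentences(text: str) -> int:
        # 1. Split into paragraphs on line breaks.
        paragraphs = [p.strip() for p in text.split("\n") if p.strip()]

        # 2. Merge a paragraph ending with a comma into the following one.
        merged = []
        for para in paragraphs:
            if merged and merged[-1].endswith(","):
                merged[-1] += " " + para
            else:
                merged.append(para)

        count = 0
        for para in merged:
            # Quoted text is ignored, so punctuation inside quotes does not
            # create a boundary (assumes ASCII double quotes).
            para = re.sub(r'"[^"]*"', "", para)
            # 3. A boundary is a Korean character, then ., ? or !, followed by
            #    a space (or the end of the paragraph).
            count += len(re.findall(r"[가-힣][.?!]+(?:\s|$)", para))
        return count

    print(count_sentences("그냥...왜 그럴까?"))                         # 1
    print(count_sentences('철수는 "언제 집에 갈까?"라고 물었습니다.'))  # 1

Both examples from the list above are counted as a single sentence: the ellipsis is not followed by a space, and the question mark inside the quotation is ignored.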

5. Benchmarking Results

We evaluated the performance of the following models using Ko-IFEval: GPT-4.1, Gemma-3 (12B, 9B, 4B), Trillion-7B-Preview, and Kanana-nano-2.1B. GPT-4.1 represents a general-purpose LLM, while the Gemma-3 series allows size-wise comparison. Trillion-7B and Kanana-nano are Korean-specialized models and serve as the primary focus of this benchmark. We report both prompt-level and instruction-level accuracy results.

5.1 Comparison by Evaluation Level

Figure 1 presents prompt-level and instruction-level accuracy for each model. As expected, prompt-level scores are lower due to stricter conditions, but the ranking trend among models remains consistent across evaluation levels.

GPT-4.1 achieved the highest prompt-level accuracy, exceeding 0.88, significantly outperforming all others. The Gemma-3 series exhibited a clear size-performance trend: 12B (0.72) > 9B (0.62) > 4B (0.55), suggesting larger models better handle formatting constraints.

Korean-specialized models outperformed general models of similar size. Trillion-7B scored 0.76, and Kanana-nano-2.1B scored 0.65, both surpassing their Gemma counterparts by over 0.1. This highlights the importance of Korean linguistic and cultural alignment, which Ko-IFEval emphasizes.

[Figure 1] Accuracy of LLMs Evaluated on the Ko-IFEval Benchmark

5.2 Comparison by Instruction Group

Among the seven instruction groups, combination showed the greatest performance variance across models. Smaller models struggled with following multiple constraints simultaneously, while Trillion-7B performed comparably to GPT-4.1 in this group.

By contrast, simpler instruction groups such as startend and punctuation showed relatively small performance differences—except for Gemma-3-4B, which lagged behind.

[Figure 2] Accuracy by Instruction Group on Ko-IFEval

6. Discussion

Korean LLMs that reflect linguistic and cultural specificity demonstrate better instruction-following performance. While they respond more precisely to elements such as particles and sentence endings, they still struggle with complex instructions and numerical constraints. This indicates that instruction following goes beyond simple language understanding and also requires adherence to structural and logical requirements, underscoring the need for a more refined evaluation framework to analyze Korean LLM performance.

Ko-IFEval represents the first major adaptation of an English-based benchmark for Korean, but it still faces limitations as a rule-based evaluation system. It is difficult to quantify exceptions in sentence structure or meaning-driven responses, and the difficulty of conditions varies considerably. Additionally, since the benchmark is based on translations, it may not fully reflect the distribution of real Korean user instructions. Future improvements should focus on enhancing evaluation scripts and diversifying data sources.

Instruction-following evaluation will become increasingly important in the multilingual LLM era, requiring language-specific criteria and the ability to assess handling of multiple or conflicting constraints. Evaluation must extend beyond simple correctness to include contextual understanding and prioritization of user intent. Moreover, feedback-driven performance measurement in real-world use cases could become a future benchmark direction. Ko-IFEval lays the groundwork for such long-term developments.

7. Conclusion

This report introduces Ko-IFEval, a benchmark designed to more accurately evaluate the instruction-following ability of Korean LLMs. By adapting an English-centric framework through linguistic and cultural modifications, and implementing Korean-specific evaluation logic, Ko-IFEval addresses the limitations of existing benchmarks. Our experiments show that Korean-specialized models outperform general-purpose models, demonstrating the importance of language-tailored evaluation.

Ko-IFEval provides a foundational tool for the development and validation of Korean LLMs. With further data expansion and refinement of the evaluation framework, it can evolve into a benchmark that captures more realistic and diverse instruction scenarios. Moving forward, instruction-following evaluation should go beyond task completion to assess models’ flexible understanding and responsiveness to user intent. Ko-IFEval represents an important starting point for such multidimensional assessment.

As multilingual models continue to advance, benchmarks like Ko-IFEval will play a key role in enabling rigorous, fair, and language-specific evaluation. By offering a structured, automated, and culturally aware framework, Ko-IFEval contributes to more equitable and accurate assessment of LLM capabilities in Korean.

8. Appendix

Comparison of Instruction Groups: IFEval vs. Ko-IFEval

Instruction Group  | Instruction                 | IFEval | Ko-IFEval
-------------------|-----------------------------|--------|----------
Change Case        | Capital Word Frequency      | 25     | -
Change Case        | English Capital             | 25     | -
Change Case        | English Lowercase           | 39     | -
Combination        | Repeat Prompt               | 41     | 40
Combination        | Two Responses               | 24     | 21
Detectable Content | Number Placeholders         | 27     | 25
Detectable Content | Postscript                  | 26     | 26
Detectable Format  | Constrained Response        | 10     | 10
Detectable Format  | JSON Format                 | 17     | 18
Detectable Format  | Multiple Sections           | 14     | 14
Detectable Format  | Number Bullet Lists         | 31     | 30
Detectable Format  | Number Highlighted Sections | 48     | 48
Detectable Format  | Title                       | 37     | 27
Keywords           | Existence                   | 39     | 34
Keywords           | Forbidden Words             | 49     | 47
Keywords           | Frequency                   | 42     | 38
Keywords           | Letter Frequency            | 33     | 30
Language           | Response Language           | 31     | -
Length Constraints | n-th Paragraph First Word   | 12     | 12
Length Constraints | Number Letters              | -      | 46
Length Constraints | Number Paragraphs           | 27     | 23
Length Constraints | Number Sentences            | 52     | 50
Length Constraints | Number Words                | 52     | 6
Punctuation        | No Comma                    | 66     | 45
Startend           | End Checker                 | 26     | 26
Startend           | Quotation                   | 41     | 36
Total              |                             | 834    | 652
[Table A.1] Comparison of Instruction Groups: IFEval vs. Ko-IFEval

Note: One missing instruction ID in the original IFEval dataset was corrected in Ko-IFEval, resulting in one additional data point. Ko-IFEval includes 464 prompts, adapted from the original 541 in IFEval by removing or modifying prompts incompatible with Korean linguistic evaluation.