Instruction-following capability is a critical factor in evaluating the usability, reliability, and practical effectiveness of large language models (LLMs). IFEval is a benchmark designed to assess this capability quantitatively by evaluating a model's compliance with various types of automatically verifiable instructions. However, since IFEval is constructed entirely in English, it poses challenges for fair and accurate evaluation of Korean LLMs. This report introduces Ko-IFEval, a new benchmark developed to address these issues. Ko-IFEval consists of a human-verified dataset in which IFEval's instructions have been translated and adapted to reflect Korean linguistic and cultural characteristics. It also includes an automatic evaluation tool tailored to Korean text. Together, these components provide a more reliable and culturally appropriate benchmark for precisely evaluating the instruction-following ability of Korean LLMs.
Evaluating instruction-following ability is essential for measuring the practical effectiveness and user trustworthiness of LLMs. How accurately a model understands and executes diverse user instructions across various contexts directly impacts its reliability and quality in real-world applications. Therefore, developing benchmarks that quantitatively evaluate instruction-following performance plays a vital role in advancing LLM research and development.
IFEval is an English-based benchmark that covers diverse types of instructions and features an automatic scoring system, enabling relatively fair evaluation of a model’s instruction-following ability. However, directly applying IFEval to Korean LLMs compromises evaluation accuracy and fairness due to linguistic structural differences, cultural mismatches, and limitations of automatic scoring tools designed around English grammar.
Currently, there is a lack of publicly available instruction-following benchmarks tailored specifically for Korean. Ko-IFEval addresses this gap by incorporating linguistic and cultural adaptations, with all data verified by human reviewers to ensure reliable evaluation.
IFEval is a benchmark consisting of diverse instruction types, with an automatic scoring mechanism based on predefined logic. Applying this framework to Korean LLMs requires not only accurate translation but also adaptation of evaluation logic suited to Korean language characteristics. Ko-IFEval is a Korean instruction-following benchmark constructed through this process, with all data manually verified to ensure linguistic accuracy and logical consistency. The full list of categories and instance counts is provided in Appendix Table A.1.
Ko-IFEval was constructed through the following three steps:
We excluded categories that rely on English-specific linguistic features or are irrelevant to Korean evaluation:

- `change_case:capital_word_frequency`
- `change_case:english_capital`
- `change_case:english_lowercase`
- `language:response_language`
Several modifications were applied to ensure instructions are linguistically clear and culturally relevant for Korean:
- The `length_constraints:number_words` category was translated to refer to "어절" (space-separated word units in Korean). To support character-based constraints, a separate condition, `length_constraints:number_letters`, was introduced (see the counting sketch after this list).
- When `length_constraints:nth_paragraph_first_word` and `startend:quotation` conditions co-occur, the paragraph indices were adjusted to avoid logical conflicts.
- For `keywords:letter_frequency`, thresholds were calibrated to better fit the distribution of characters in Korean; when literal application would cause excessive difficulty, thresholds were adjusted. For example, a prompt requiring a high frequency of the letter "o" was adapted as follows:
  - Original: Write a letter to your friend who recently moved away. Your entire response should be in English, and in all capital letters. The letter o should appear at least 40 times.
  - Modified: 최근 이사 간 친구에게 편지를 써주세요. 글자 '오'를 최소 13번 이상 포함해야 합니다. (Write a letter to a friend who recently moved away. The character '오' must appear at least 13 times.)
- The `combination:repeat_prompt` condition enforces that the response must begin with the prompt itself; therefore, it is only paired with `length_constraints:number_sentences`, `length_constraints:number_words`, or `length_constraints:number_letters` constraints. For example:
  - Original: Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli".
  - Modified: 위키백과 페이지 "https://ko.wikipedia.org/wiki/이순신"의 내용을 요약하여 300자 이상으로 작성하시오. (Summarize the content of the Wikipedia page "https://ko.wikipedia.org/wiki/이순신" in 300 characters or more.)
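To make the adapted counting rules concrete, the following is a minimal Python sketch of how the word-unit, character, and letter-frequency checks could look. The function names and the exact counting conventions (e.g., whether punctuation counts toward the character total) are illustrative assumptions rather than the actual Ko-IFEval implementation.

```python
import re

def count_eojeol(text: str) -> int:
    """Count 어절 (space-separated word units) by splitting on whitespace."""
    return len(text.split())

def count_letters(text: str) -> int:
    """Count characters excluding whitespace (assumed rule for number_letters)."""
    return len(re.sub(r"\s+", "", text))

def check_letter_frequency(text: str, letter: str, min_count: int) -> bool:
    """Check that a given character appears at least min_count times,
    e.g. letter='오', min_count=13 for the adapted example above."""
    return text.count(letter) >= min_count

response = "최근 이사 간 친구에게 오랜만에 편지를 씁니다."
print(count_eojeol(response))                      # 7 어절
print(count_letters(response))                     # 20 characters
print(check_letter_frequency(response, "오", 13))  # False
```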
The IFEval framework evaluates model responses along two axes: Strict vs. Loose and Prompt-level vs. Instruction-level.
Strict evaluation may lead to false negatives (correct responses judged as incorrect), while Loose evaluation can cause false positives (incorrect responses judged as correct).
Additionally, a single prompt may contain multiple instructions: at the prompt level, a prompt counts as correct only if all of its instructions are satisfied, whereas at the instruction level each instruction is scored independently.
To ensure reliable and consistent evaluation, Ko-IFEval uses only Strict criteria, applied at both the prompt and instruction levels.
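As an illustration of these two strict levels (not the actual Ko-IFEval evaluation script), assume each instruction has already been checked and reduced to a boolean pass/fail; the two accuracies then aggregate those results differently:

```python
def strict_accuracies(results: list[list[bool]]) -> tuple[float, float]:
    """results[i][j] is True iff instruction j of prompt i passed its strict check."""
    # Prompt-level: a prompt is correct only if every one of its instructions passes.
    prompt_level = sum(all(prompt) for prompt in results) / len(results)
    # Instruction-level: every instruction is scored independently.
    total_instructions = sum(len(prompt) for prompt in results)
    instruction_level = sum(sum(prompt) for prompt in results) / total_instructions
    return prompt_level, instruction_level

# Two prompts: one satisfies only one of its two instructions, the other satisfies its single instruction.
print(strict_accuracies([[True, False], [True]]))  # (0.5, 0.666...)
```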
Each evaluation rule was implemented independently for each `instruction_id`. Since the logic for each condition is intuitive and well documented in the IFEval paper, we focus here only on the adjustments made to sentence counting in Korean.
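For reference, the per-`instruction_id` rules can be pictured as a simple dispatch table. The identifiers below follow the categories listed above, but the checker bodies and keyword-argument names are hypothetical stand-ins, not the actual Ko-IFEval code:

```python
# Hypothetical registry: each instruction_id maps to an independent checker function.
CHECKERS = {
    "punctuation:no_comma": lambda response, kwargs: "," not in response,
    "keywords:letter_frequency": lambda response, kwargs: (
        response.count(kwargs["letter"]) >= kwargs["min_count"]
    ),
    "length_constraints:number_letters": lambda response, kwargs: (
        len("".join(response.split())) >= kwargs["min_letters"]  # non-whitespace characters
    ),
}

def check_instruction(instruction_id: str, response: str, kwargs: dict) -> bool:
    """Look up the rule for this instruction_id and apply it to the response."""
    return CHECKERS[instruction_id](response, kwargs)
```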
The original IFEval implementation uses the `nltk` tokenizer to segment English sentences. However, this tool is not suitable for Korean due to its English-centric design. Although Korean-specific tokenizers exist, we opted to implement rule-based sentence segmentation logic to maximize control over boundary conditions and eliminate dependencies on external packages.
The sentence counting procedure for model outputs is as follows: the response is segmented at sentence-final punctuation marks (`.`, `?`, or `!`), and the resulting segments are counted as sentences.
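A rough sketch of such rule-based segmentation is shown below; the actual boundary-condition handling in Ko-IFEval may differ, so this is only an approximation:

```python
import re

def count_sentences(text: str) -> int:
    """Split on runs of sentence-final punctuation (., ?, !) and count non-empty segments."""
    segments = re.split(r"[.?!]+", text)
    return sum(1 for segment in segments if segment.strip())

print(count_sentences("안녕하세요. 오늘 날씨가 좋네요! 산책 가실래요?"))  # 3
```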
We evaluated the performance of the following models using Ko-IFEval: GPT-4.1, Gemma-3 (12B, 9B, 4B), Trillion-7B-Preview, and Kanana-nano-2.1B. GPT-4.1 represents a general-purpose LLM, while the Gemma-3 series allows size-wise comparison. Trillion-7B and Kanana-nano are Korean-specialized models and serve as the primary focus of this benchmark. We report both prompt-level and instruction-level accuracy results.
Figure 1 presents prompt-level and instruction-level accuracy for each model. As expected, prompt-level scores are lower due to stricter conditions, but the ranking trend among models remains consistent across evaluation levels.
GPT-4.1 achieved the highest prompt-level accuracy, exceeding 0.88, significantly outperforming all others. The Gemma-3 series exhibited a clear size-performance trend: 12B (0.72) > 9B (0.62) > 4B (0.55), suggesting larger models better handle formatting constraints.
Korean-specialized models outperformed general models of similar size. Trillion-7B scored 0.76, and Kanana-nano-2.1B scored 0.65, both surpassing their Gemma counterparts by over 0.1. This highlights the importance of Korean linguistic and cultural alignment, which Ko-IFEval emphasizes.
Among the seven instruction groups, `combination` showed the greatest performance variance across models. Smaller models struggled with following multiple constraints simultaneously, while Trillion-7B performed comparably to GPT-4.1 in this group. By contrast, simpler instruction groups such as `startend` and `punctuation` showed relatively small performance differences, except for Gemma-3-4B, which lagged behind.
Korean LLMs that reflect linguistic and cultural specificity demonstrate better instruction-following performance. While they respond more precisely to elements such as particles and sentence endings, they still struggle with complex instructions and numerical constraints. This indicates that instruction-following goes beyond simple language understanding and requires adherence to structural and logical requirements, which underscores the need for a more refined evaluation framework to better analyze Korean LLM performance.
Ko-IFEval represents the first major adaptation of an English-based benchmark for Korean, but it still faces limitations as a rule-based evaluation system. It is difficult to quantify exceptions in sentence structure or meaning-driven responses, and the difficulty of conditions varies considerably. Additionally, since the benchmark is based on translations, it may not fully reflect the distribution of real Korean user instructions. Future improvements should focus on enhancing evaluation scripts and diversifying data sources.
Instruction-following evaluation will become increasingly important in the multilingual LLM era, requiring language-specific criteria and the ability to assess handling of multiple or conflicting constraints. Evaluation must extend beyond simple correctness to include contextual understanding and prioritization of user intent. Moreover, feedback-driven performance measurement in real-world use cases could become a future benchmark direction. Ko-IFEval lays the groundwork for such long-term developments.
This report introduces Ko-IFEval, a benchmark designed to more accurately evaluate the instruction-following ability of Korean LLMs. By adapting an English-centric framework through linguistic and cultural modifications, and implementing Korean-specific evaluation logic, Ko-IFEval addresses the limitations of existing benchmarks. Our experiments show that Korean-specialized models outperform general-purpose models, demonstrating the importance of language-tailored evaluation.
Ko-IFEval provides a foundational tool for the development and validation of Korean LLMs. With further data expansion and refinement of the evaluation framework, it can evolve into a benchmark that captures more realistic and diverse instruction scenarios. Moving forward, instruction-following evaluation should go beyond task completion to assess models’ flexible understanding and responsiveness to user intent. Ko-IFEval represents an important starting point for such multidimensional assessment.
As multilingual models continue to advance, benchmarks like Ko-IFEval will play a key role in enabling rigorous, fair, and language-specific evaluation. By offering a structured, automated, and culturally aware framework, Ko-IFEval contributes to more equitable and accurate assessment of LLM capabilities in Korean.
Instruction Group | Instruction | IFEval | Ko-IFEval |
---|---|---|---|
Change Case | Capital Word Frequency | 25 | - |
Change Case | English Capital | 25 | - |
Change Case | English Lowercase | 39 | - |
Combination | Repeat Prompt | 41 | 40 |
Combination | Two Responses | 24 | 21 |
Detectable Content | Number Placeholders | 27 | 25 |
Detectable Content | Postscript | 26 | 26 |
Detectable Format | Constrained Response | 10 | 10 |
Detectable Format | JSON Format | 17 | 18 |
Detectable Format | Multiple Sections | 14 | 14 |
Detectable Format | Number Bullet Lists | 31 | 30 |
Detectable Format | Number Highlighted Sections | 48 | 48 |
Detectable Format | Title | 37 | 27 |
Keywords | Existence | 39 | 34 |
Keywords | Forbidden Words | 49 | 47 |
Keywords | Frequency | 42 | 38 |
Keywords | Letter Frequency | 33 | 30 |
Language | Response Language | 31 | - |
Length Constraints | n-th Paragraph First Word | 12 | 12 |
Length Constraints | Number Letters | - | 46 |
Length Constraints | Number Paragraphs | 27 | 23 |
Length Constraints | Number Sentences | 52 | 50 |
Length Constraints | Number Words | 52 | 6 |
Punctuation | No Comma | 66 | 45 |
Startend | End Checker | 26 | 26 |
Startend | Quotation | 41 | 36 |
Total | | 834 | 652 |
Note: One missing instruction ID in the original IFEval dataset was corrected in Ko-IFEval, resulting in one additional data point. Ko-IFEval includes 464 prompts, adapted from the original 541 in IFEval by removing or modifying prompts incompatible with Korean linguistic evaluation.