Hunminai-1.0 is a language model specifically designed to address the unique challenges of Korean natural language processing (NLP). Korean's complex grammatical structures—such as particles, verb endings, and honorifics—pose difficulties for models trained largely on English. In addition, the limited availability of large-scale, high-quality Korean training data restricts the performance of general multilingual models on Korean tasks. Beyond these linguistic factors, capturing cultural nuances and social honorifics is crucial for practical applications, which calls for models carefully tailored to the Korean linguistic and cultural context.
To meet these demands, Hunminai-1.0 was developed on the Gemma-3 architecture and aims to outperform existing models in both quantitative benchmarks and qualitative evaluations. It excels in a variety of Korean language understanding and generation tasks, including dialogue generation, question answering, and long-form text creation. The model is publicly available on Hugging Face in two sizes—12B and 27B parameters—under the Gemma license, facilitating easy access and customization by researchers and developers.
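The snippet below is a minimal sketch of how the released checkpoints can be loaded and prompted with the Hugging Face transformers library. It follows the standard chat-template pattern for Gemma-based instruction models; the generation settings are illustrative rather than recommended values.

```python
# Minimal sketch: loading a Hunminai-1.0 checkpoint from the Hugging Face Hub.
# Generation parameters are illustrative, not values prescribed by the authors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "davidkim205/Hunminai-1.0-12b"  # or "davidkim205/Hunminai-1.0-27b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "한국의 수도는 어디인가요?"}]  # "What is the capital of Korea?"
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```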
The model was fine-tuned on a carefully curated corpus of 100,000 high-quality Korean instruction examples using Supervised Fine-Tuning (SFT), followed by Direct Preference Optimization (DPO). This two-stage training approach enables better alignment with user intents in Korean and improves performance on downstream tasks such as dialogue generation, question answering, and long-form text generation. The dataset encompasses a wide variety of Korean language contexts and tasks, emphasizing alignment with user intent and natural language generation. The dataset is not currently public but is planned for future release.
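As a rough illustration of this two-stage pipeline, the sketch below chains supervised fine-tuning and DPO with the TRL library. The base checkpoint, dataset files, column formats, and hyperparameters are placeholders, since the actual Hunminai-1.0 training data and configuration have not been released.

```python
# Hypothetical sketch of a two-stage SFT -> DPO pipeline using TRL.
# Dataset paths, the base checkpoint, and hyperparameters are placeholders;
# the real Hunminai-1.0 training setup is not public.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base_model = "google/gemma-3-12b-pt"  # assumed Gemma-3 starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Stage 1: supervised fine-tuning on instruction-response pairs
# (e.g. a "messages" column in chat format).
sft_dataset = load_dataset("json", data_files="korean_instructions.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="hunminai-sft", num_train_epochs=1),
    train_dataset=sft_dataset,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: preference alignment with DPO on (prompt, chosen, rejected) triples.
dpo_dataset = load_dataset("json", data_files="korean_preferences.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="hunminai-dpo", beta=0.1),
    train_dataset=dpo_dataset,
    processing_class=tokenizer,
)
dpo_trainer.train()
```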
Hunminai-1.0 was objectively evaluated using LM-Evaluation-Harness across a variety of Korean and multilingual benchmarks to assess its overall performance. However, such quantitative benchmarks primarily focus on short-form, closed-ended questions, limiting their ability to fully capture the capabilities of modern language models in complex instruction following and multi-turn dialogue scenarios. To address this limitation, we additionally conducted qualitative evaluations based on the K-BENCH dataset. Using GPT-4o and K-Judge as evaluators, we assessed the model’s fluency, factual accuracy, and alignment with user intent across a range of Korean downstream tasks.
For a fair comparison, Hunminai-1.0 was evaluated alongside publicly available Korean-specialized models of similar scale, including SKT’s A.X-4.0-Light and A.X-3.1-Light, KT’s Midm-2.0-Base-Instruct, LG’s EXAONE-4.0-32B and EXAONE-3.5-7.8B-Instruct, Kakao’s kanana-1.5-8b-instruct-2505, and Naver’s HyperCLOVAX-SEED-Text-Instruct-1.5B and HyperCLOVAX-SEED-Think-14B.
All LM Evaluation Harness benchmarks were evaluated on an NVIDIA RTX 3090 24GB GPU. The models were loaded with 4-bit quantization (`load_in_4bit=True`) and evaluated using `bfloat16` precision. The maximum sequence length was set to 4096 tokens.
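This configuration can be reproduced approximately with the LM Evaluation Harness Python API, as sketched below. The task identifiers (kmmlu, haerae, mmlu, ifeval) are assumed to match the harness' built-in task names and may differ across harness versions.

```python
# Sketch of the reported evaluation setup via the LM Evaluation Harness
# Python API. Task names are assumptions and may vary by harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=davidkim205/Hunminai-1.0-27b,"
        "load_in_4bit=True,"   # 4-bit quantized weights
        "dtype=bfloat16,"      # bfloat16 compute precision
        "max_length=4096"      # maximum sequence length
    ),
    tasks=["kmmlu", "haerae", "mmlu", "ifeval"],
    batch_size="auto",
)
print(results["results"])
```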
For the instruction-following evaluation on the IFEval benchmark, we adopted the Strict scoring criterion, which requires exact matches to the instruction for correctness. Furthermore, evaluation was conducted at the instruction level by judging each instruction independently rather than the entire prompt. The final score was calculated as the average across all instruction groups to provide a detailed and reliable assessment of model performance.
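The toy example below illustrates this aggregation (it is not the harness' actual implementation): each instruction receives a binary strict-compliance judgement, per-group accuracies are computed, and the final score is their average.

```python
# Illustrative instruction-level Strict scoring: each instruction is judged
# independently (1 = exact compliance, 0 = otherwise), per-group accuracies
# are computed, and the final score is the mean across instruction groups.
from collections import defaultdict
from statistics import mean

# (instruction_group, followed_strictly) pairs -- toy data for illustration.
judgements = [
    ("length_constraints", 1), ("length_constraints", 0),
    ("keywords", 1), ("format", 1), ("format", 0), ("format", 1),
]

per_group = defaultdict(list)
for group, ok in judgements:
    per_group[group].append(ok)

group_accuracy = {g: mean(v) for g, v in per_group.items()}
final_score = mean(group_accuracy.values())
print(group_accuracy, round(final_score, 4))
```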
LM Evaluation Harness is a standardized framework designed to evaluate the performance of large language models across a wide range of established benchmarks. It facilitates consistent, automated, and reproducible comparisons between models. The benchmarks used to evaluate Hunminai-1.0 are listed in Table 1.
Benchmark | Language | Category | Description |
---|---|---|---|
KMMLU | ko | General Knowledge | A Korean adaptation of MMLU that evaluates models across diverse academic and professional subjects in the Korean language. |
HAE-RAE | ko | Society & Culture | A Korean language benchmark assessing knowledge and reasoning in a multiple-choice QA format, modeled after MMLU but tailored to local contexts. |
MMLU | en | General Knowledge | A benchmark testing knowledge and reasoning across 57 tasks in various domains, designed to assess multitask accuracy in English. |
IFEval | en | Instruction Following | An English benchmark focused on instruction-following ability, evaluating how well models satisfy explicit, verifiable instructions in open-ended prompts. |
The following analysis presents the results across these benchmarks, with a particular focus on language-specific performance. Table 2 presents scores across benchmarks, which were used to compute both language-specific and overall averages.
# | Model | Average | KMMLU | HAE-RAE | MMLU | IFEval |
---|---|---|---|---|---|---|
1 | davidkim205/Hunminai-1.0-27b | 70.96 | 53.79 | 72.23 | 73.89 | 83.93 |
2 | skt/A.X-4.0-Light | 68.93 | 54.20 | 73.69 | 67.60 | 80.22 |
3 | LGAI-EXAONE/EXAONE-4.0-32B | 67.30 | 50.77 | 75.62 | 74.08 | 68.71 |
4 | davidkim205/Hunminai-1.0-12b | 65.78 | 45.35 | 67.00 | 68.54 | 82.25 |
5 | skt/A.X-3.1-Light | 63.28 | 48.02 | 72.59 | 55.53 | 76.98 |
6 | LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct | 63.04 | 44.58 | 71.68 | 61.91 | 73.98 |
7 | K-intelligence/Midm-2.0-Base-Instruct | 62.77 | 42.59 | 77.45 | 67.74 | 63.31 |
8 | kakaocorp/kanana-1.5-8b-instruct-2505 | 58.77 | 40.72 | 76.08 | 60.38 | 57.91 |
9 | naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B | 43.50 | 37.63 | 50.14 | 44.76 | 41.49 |
Figure 1 illustrates the language-specific performance of Hunminai-1.0 and other Korean-specialized models. For each model, the bars represent benchmark scores in Korean and English, while the dot indicates the overall average.
While several models exhibit strong performance on Korean benchmarks, often with reduced results in English, Hunminai-1.0 maintains competitive performance in both languages. This balanced outcome underscores its robustness in multilingual contexts while retaining its strength in Korean-specific tasks.
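For reference, the language-specific averages shown in Figure 1 can be recomputed directly from the Table 2 scores by grouping KMMLU and HAE-RAE as Korean and MMLU and IFEval as English (per Table 1). The short sketch below illustrates this for the two top-ranked models; the aggregation is a plain mean, mirroring how the overall averages in Table 2 are obtained.

```python
# Sketch of the language-specific averages behind Figure 1:
# Korean = mean(KMMLU, HAE-RAE), English = mean(MMLU, IFEval),
# overall = mean of all four benchmark scores (as in Table 2).
scores = {
    "davidkim205/Hunminai-1.0-27b": {"KMMLU": 53.79, "HAE-RAE": 72.23,
                                     "MMLU": 73.89, "IFEval": 83.93},
    "skt/A.X-4.0-Light":            {"KMMLU": 54.20, "HAE-RAE": 73.69,
                                     "MMLU": 67.60, "IFEval": 80.22},
}

for model, s in scores.items():
    ko = (s["KMMLU"] + s["HAE-RAE"]) / 2
    en = (s["MMLU"] + s["IFEval"]) / 2
    overall = sum(s.values()) / len(s)
    print(f"{model}: ko={ko:.2f}, en={en:.2f}, avg={overall:.2f}")
```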
K-BENCH is a comprehensive benchmark suite designed to evaluate both the qualitative and quantitative performance of Korean language models. It encompasses a diverse range of tasks that assess models from multiple perspectives, including their real-world applicability and linguistic capabilities. The composition and evaluation protocols are summarized in Table 3 below.
Benchmark | Category | Evaluation Protocol |
---|---|---|
ko-bench | Multi-turn Dialogue | LLM-as-a-judge (GPT-4o, K-judge) scoring on a 0–10 scale based on quality and relevance of multi-turn responses. |
ko-ged | Subjective QA (Reading & Reasoning) | LLM-as-a-judge (GPT-4o, K-judge) scoring on a 0–10 scale based on response quality. |
ko-gpqa | Multiple-Choice Science Reasoning | Auto-scored by comparing the model’s selected choice with the ground truth. |
ko-math-500 | Math (Boxed Answer Extraction) | Auto-scored by extracting the boxed answer and matching it to the correct solution. |
ko-ged-mc:elementary, ko-ged-mc:middle, ko-ged-mc:high | Multiple-Choice GED | Auto-scored by matching the model's selected option to the correct answer. |
ko-ifeval | Instruction Following | Strict rule-based scoring based on correct interpretation and execution of explicit instructions. |
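Among the auto-scored tasks in Table 3, ko-math-500 relies on extracting the final boxed answer from the model output and comparing it to the reference. The sketch below shows one way such an extraction-and-match step can look; the regular expression and normalization are simplified illustrations, not the actual K-BENCH grader.

```python
# Illustrative boxed-answer scoring for ko-math-500: extract the content of
# the last \boxed{...} in the model output and compare it to the reference.
# The regex handles only simple (non-nested) boxed expressions.
import re

def extract_boxed(text: str) -> str | None:
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def score(model_output: str, reference: str) -> int:
    answer = extract_boxed(model_output)
    return int(answer is not None and answer == reference.strip())

print(score("따라서 정답은 \\boxed{12} 입니다.", "12"))  # -> 1
```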
For tasks that require qualitative evaluation, such as ko-bench and ko-ged, we employed GPT-4o and K-Judge as judge models. K-Judge is based on keval, a lightweight Korean language model optimized for offline use. keval is significantly smaller than GPT-4o, yet delivers comparable evaluation results in practice. This makes it a cost-effective and scalable alternative to GPT-4o in offline or resource-constrained environments. In this study, we used the keval-12b model for evaluation.
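A minimal sketch of this LLM-as-a-judge setup with a keval-style judge is shown below. The checkpoint name, prompt wording, and score parsing are assumptions made for illustration; the released keval-12b model may expect a different input format.

```python
# Hypothetical LLM-as-a-judge loop with a keval-style judge model.
# The model id, prompt format, and score parsing are illustrative assumptions.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

judge_id = "davidkim205/keval-12b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(judge_id)
judge = AutoModelForCausalLM.from_pretrained(
    judge_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge_response(question: str, answer: str) -> float | None:
    """Ask the judge to rate an answer on a 0-10 scale and parse the score."""
    prompt = (
        "다음 질문과 답변을 평가하고 0에서 10 사이의 점수를 매기세요.\n"  # "Rate the answer from 0 to 10."
        f"[질문]\n{question}\n[답변]\n{answer}\n점수:"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(judge.device)
    output = judge.generate(inputs, max_new_tokens=32)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else None
```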
Figure 2 visualizes the differences in evaluation scores when GPT-4o and keval are used as judge models for ko-bench, ko-ged, and their overall average. Overall, keval tends to assign slightly higher scores than GPT-4o, particularly for smaller models. Based on the average benchmark scores, model rankings remained largely consistent across both judges, with two exceptions: Hunminai-1.0-27B and EXAONE-4.0-32B swapped ranks 1 and 2, and Midm-2.0-Base-Instruct and EXAONE-3.5-7.8B-Instruct swapped ranks 5 and 6.
Accordingly, in the following sections, we report ko-bench and ko-ged results based solely on keval scores.
In the K-BENCH evaluation, Hunminai-1.0-27B and EXAONE-4.0-32B, which ranked first and third respectively in the quantitative evaluation, achieved the top two positions in the qualitative assessment as well. Hunminai-1.0-27B performed particularly well on ko-ged and ko-math, while EXAONE-4.0-32B scored highest on ko-bench, ko-gpqa, and ko-ifeval.
In contrast, A.X-4.0-Light, which ranked 2nd in the LM-Evaluation-Harness results, dropped to 3rd in the K-BENCH scores and performed on par with Hunminai-1.0-12B. The detailed scores for each benchmark and model are provided in Table 4.
These results highlight how model performance can vary across benchmark types and demonstrate the value of incorporating qualitative evaluations for a more comprehensive understanding of language model capabilities.
# | Model | Avg | ko-bench | ko-ged | ko-ged:E | ko-ged:M | ko-ged:H | ko-gpqa | ko-math-500 | ko-ifeval |
---|---|---|---|---|---|---|---|---|---|---|
1 | davidkim205/Hunminai-1.0-27b | 8.52 | 8.22 | 9.31 | 9.86 | 9.67 | 9.60 | 4.55 | 8.56 | 8.41 |
2 | LGAI-EXAONE/EXAONE-4.0-32B | 8.21 | 8.56 | 9.35 | 9.38 | 9.22 | 9.12 | 5.25 | 6.32 | 8.49 |
3 | skt/A.X-4.0-Light | 7.88 | 8.20 | 8.95 | 9.65 | 9.55 | 9.64 | 3.38 | 5.56 | 8.08 |
4 | davidkim205/Hunminai-1.0-12b | 7.88 | 8.15 | 9.03 | 9.72 | 9.63 | 9.32 | 3.18 | 5.60 | 8.37 |
5 | K-intelligence/Midm-2.0-Base-Instruct | 7.57 | 8.19 | 8.16 | 9.72 | 9.31 | 9.48 | 2.68 | 4.80 | 8.24 |
6 | LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct | 7.42 | 8.22 | 8.58 | 9.65 | 9.10 | 9.00 | 3.13 | 4.88 | 6.76 |
7 | kakaocorp/kanana-1.5-8b-instruct-2505 | 7.22 | 7.70 | 8.87 | 9.10 | 9.02 | 9.08 | 2.83 | 3.72 | 7.47 |
8 | naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B | 5.11 | 4.46 | 4.53 | 8.40 | 7.63 | 7.31 | 1.87 | 2.94 | 3.75 |
Hunminai-1.0 is a Korean-specialized language model that demonstrates strong performance in both quantitative and qualitative evaluations. Based on the Gemma-3 architecture and fine-tuned on a curated Korean instruction dataset, it delivers consistently strong results across tasks such as instruction following, dialogue generation, and open-ended reasoning. Notably, in both the LM Evaluation Harness and K-BENCH assessments, Hunminai-1.0 maintains well-balanced performance in Korean and English, outperforming existing Korean-specialized models of similar scale and highlighting its potential for multilingual applications.
However, Hunminai-1.0 has several limitations. First, the training data is not publicly available, which limits transparency and reproducibility. Second, the qualitative evaluations rely on the cost-effective keval judge model, which tends to yield slightly higher scores than GPT-4o, especially for smaller models, so absolute scores should be compared with caution.
Through high-quality instruction tuning and public release, Hunminai-1.0 plays a pivotal role in advancing the Korean LLM ecosystem. Its balanced multilingual capabilities make it suitable for a wide range of applications, from search and education to customer support. By enabling practical deployments, it also fosters broader AI adoption and drives innovation in Korean language technology.