Hunminai-1.0: A Korean-Specialized Language Model with Balanced Multilingual Performance

12Digit AI Research

Hunminai-1.0 is a language model specifically designed to address the unique challenges of Korean natural language processing (NLP). Korean’s complex grammatical structures—such as particles, verb endings, and honorifics—pose difficulties for models largely trained on English. Additionally, the limited availability of large-scale, high-quality Korean training data further restricts the performance of general multilingual models on Korean tasks. Beyond these linguistic factors, capturing cultural nuances and social honorifics is essential for practical applications, which calls for models carefully tailored to the Korean linguistic and cultural context.

To meet these demands, Hunminai-1.0 was built on the Gemma-3 architecture with the goal of surpassing existing models in both quantitative benchmarks and qualitative evaluations. It excels in a variety of Korean language understanding and generation tasks, including dialogue generation, question answering, and long-form text creation. The model is publicly available on Huggingface in two sizes—12B and 27B parameters—under the Gemma license, facilitating easy access and customization by researchers and developers.

Key Advantages of Hunminai-1.0

  1. High Performance Specialized for Korean: Hunminai-1.0 demonstrates superior performance tailored specifically to Korean natural language processing, outperforming existing models in both quantitative benchmarks and qualitative assessments.
  2. Public Availability and Accessibility: The model is openly accessible on Huggingface in two sizes—12B and 27B parameters—allowing researchers and developers to easily utilize and customize it for various applications (see the usage sketch after this list).
  3. Built on the Advanced Gemma-3 Architecture: Leveraging the latest advances in the Gemma-3 architecture, Hunminai-1.0 incorporates cutting-edge NLP techniques and optimizations, ensuring robust and efficient language understanding and generation.
  4. Broad Applicability to Korean-Specific NLP Tasks: The model is effective across a wide range of Korean language tasks, including dialogue generation, question answering, and long-form text generation, making it highly adaptable to different practical use cases.
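As a quick illustration of point 2, the released checkpoints can be loaded directly from Huggingface with the standard transformers API; the snippet below is a minimal sketch, and the prompt and generation settings are illustrative only.

```python
# Minimal usage sketch: load a released Hunminai-1.0 checkpoint from Huggingface
# and generate a short Korean reply. Prompt and generation settings are illustrative.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="davidkim205/Hunminai-1.0-12b",  # or "davidkim205/Hunminai-1.0-27b"
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "한국의 전통 명절인 추석을 간단히 소개해 주세요."}]
result = pipe(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # assistant turn of the returned chat
```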

Training

The model was fine-tuned on a carefully curated corpus of 100,000 high-quality Korean instruction examples using Supervised Fine-Tuning (SFT), followed by Direct Preference Optimization (DPO). This two-stage approach improves alignment with user intent in Korean and boosts performance on downstream tasks such as dialogue generation, question answering, and long-form text generation. The dataset spans a wide variety of Korean language contexts and tasks; it is not currently public but is planned for future release.
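A minimal sketch of this two-stage pipeline, using the Hugging Face TRL library, is shown below; the base checkpoint, dataset files, and hyperparameters are placeholders, since the actual training data and configuration are not public.

```python
# Hedged sketch of the SFT -> DPO pipeline with TRL; the base model, dataset
# paths, and hyperparameters are placeholders, not the actual Hunminai-1.0 recipe.
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig

base_model = "google/gemma-3-12b-pt"  # assumed Gemma-3 base for the 12B variant
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Stage 1: supervised fine-tuning on Korean instruction-response pairs.
sft_data = load_dataset("json", data_files="korean_instructions.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=base_model,
    args=SFTConfig(output_dir="hunminai-sft", num_train_epochs=1, bf16=True),
    train_dataset=sft_data,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: preference alignment with DPO on (prompt, chosen, rejected) triples.
dpo_data = load_dataset("json", data_files="korean_preferences.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="hunminai-dpo", beta=0.1, bf16=True),
    train_dataset=dpo_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
```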

Evaluation

Hunminai-1.0 was objectively evaluated using LM-Evaluation-Harness across a variety of Korean and multilingual benchmarks to assess its overall performance. However, such quantitative benchmarks primarily focus on short-form, closed-ended questions, limiting their ability to fully capture the capabilities of modern language models in complex instruction following and multi-turn dialogue scenarios. To address this limitation, we additionally conducted qualitative evaluations based on the K-BENCH dataset. Using GPT-4o and K-Judge as evaluators, we assessed the model’s fluency, factual accuracy, and alignment with user intent across a range of Korean downstream tasks.

For a fair comparison, Hunminai-1.0 was evaluated alongside publicly available Korean-specialized models of similar scale, including SKT’s A.X-4.0-Light and A.X-3.1-Light, KT’s Midm-2.0-Base-Instruct, LG’s EXAONE-4.0-32B and EXAONE-3.5-7.8B-Instruct, Kakao’s kanana-1.5-8b-instruct-2505, and Naver’s HyperCLOVAX-SEED-Text-Instruct-1.5B and HyperCLOVAX-SEED-Think-14B.

Evaluation Setup and Environment

All LM Evaluation Harness benchmarks were evaluated on an NVIDIA RTX 3090 24GB GPU. The models were loaded with 4-bit quantization (load_in_4bit=True) and evaluated using bfloat16 precision. The maximum sequence length was set to 4096 tokens.
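This setup roughly corresponds to the following LM Evaluation Harness call; it is a sketch, and the task identifiers and quantization arguments may vary across harness and transformers versions.

```python
# Approximate reproduction of the evaluation setup described above: 4-bit loading,
# bfloat16 precision, 4096-token context. Task names follow the harness registry
# and may differ between versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=davidkim205/Hunminai-1.0-12b,"
        "load_in_4bit=True,dtype=bfloat16,max_length=4096"
    ),
    tasks=["kmmlu", "haerae", "mmlu", "ifeval"],
    batch_size="auto",
    device="cuda:0",
)
print(results["results"])
```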

For the instruction-following evaluation on the IFEval benchmark, we adopted the Strict scoring criterion, under which a response is counted as correct only if it satisfies the instruction exactly, without the relaxations allowed by loose scoring. Furthermore, evaluation was conducted at the instruction level, judging each instruction independently rather than the prompt as a whole. The final score was calculated as the average across all instruction groups to provide a detailed and reliable assessment of model performance.
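For illustration, the sketch below shows how instruction-level strict judgements can be aggregated into a final score averaged over instruction groups; the data structure and group names are hypothetical and do not reproduce the harness's internal code.

```python
# Illustrative aggregation of instruction-level strict scores: each instruction is
# judged independently (pass/fail), accuracies are computed per instruction group,
# and the final score is the mean over groups. Data and group names are hypothetical.
from collections import defaultdict

judgements = [  # (instruction_group, passed_strict)
    ("length_constraints", True),
    ("length_constraints", False),
    ("detectable_format", True),
    ("keywords", True),
]

per_group = defaultdict(list)
for group, passed in judgements:
    per_group[group].append(passed)

group_accuracy = {g: sum(flags) / len(flags) for g, flags in per_group.items()}
final_score = sum(group_accuracy.values()) / len(group_accuracy)
print(group_accuracy, round(final_score * 100, 2))  # e.g. 83.33
```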

LM Evaluation Harness

LM Evaluation Harness is a standardized framework designed to evaluate the performance of large language models across a wide range of established benchmarks. It facilitates consistent, automated, and reproducible comparisons between models. The benchmarks used to evaluate Hunminai-1.0 are listed in Table 1.

Benchmark Language Category Description
KMMLU ko General Knowledge A Korean adaptation of MMLU that evaluates models across diverse academic and professional subjects in the Korean language.
HAE-RAE ko Society & Culture A Korean language benchmark assessing knowledge and reasoning in a multiple-choice QA format, modeled after MMLU but tailored to local contexts.
MMLU en General Knowledge A benchmark testing knowledge and reasoning across 57 tasks in various domains, designed to assess multitask accuracy in English.
IFEval en Instruction Following An English benchmark that tests instruction-following ability by checking whether model responses satisfy explicit, verifiable constraints in the prompt.

[Table 1] Benchmarks used in the LM Evaluation Harness assessment

The following analysis presents the results across these benchmarks, with a particular focus on language-specific performance. Table 2 presents scores across benchmarks, which were used to compute both language-specific and overall averages.

Model Average KMMLU HAE-RAE MMLU IFEval
1 davidkim205/Hunminai-1.0-27b 70.96 53.79 72.23 73.89 83.93
2 skt/A.X-4.0-Light 68.93 54.20 73.69 67.60 80.22
3 LGAI-EXAONE/EXAONE-4.0-32B 67.30 50.77 75.62 74.08 68.71
4 davidkim205/Hunminai-1.0-12b 65.78 45.35 67.00 68.54 82.25
5 skt/A.X-3.1-Light 63.28 48.02 72.59 55.53 76.98
6 LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct 63.04 44.58 71.68 61.91 73.98
7 K-intelligence/Midm-2.0-Base-Instruct 62.77 42.59 77.45 67.74 63.31
8 kakaocorp/kanana-1.5-8b-instruct-2505 58.77 40.72 76.08 60.38 57.91
9 naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B 43.50 37.63 50.14 44.76 41.49

[Table 2] LM Evaluation Harness Scores by Benchmark and Model
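For reference, the overall and language-specific averages can be reproduced from Table 2, assuming the Korean score averages KMMLU and HAE-RAE and the English score averages MMLU and IFEval, per the language labels in Table 1; a minimal sketch:

```python
# Sketch: derive Korean, English, and overall averages from the Table 2 scores,
# assuming KMMLU and HAE-RAE are the Korean benchmarks and MMLU and IFEval the
# English ones (per the language labels in Table 1).
scores = {
    "davidkim205/Hunminai-1.0-27b": {"KMMLU": 53.79, "HAE-RAE": 72.23, "MMLU": 73.89, "IFEval": 83.93},
    "skt/A.X-4.0-Light": {"KMMLU": 54.20, "HAE-RAE": 73.69, "MMLU": 67.60, "IFEval": 80.22},
}
for model, s in scores.items():
    ko = (s["KMMLU"] + s["HAE-RAE"]) / 2
    en = (s["MMLU"] + s["IFEval"]) / 2
    overall = sum(s.values()) / len(s)
    print(f"{model}: ko={ko:.2f} en={en:.2f} overall={overall:.2f}")
```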

Figure 1 illustrates the language-specific performance of Hunminai-1.0 and other Korean-specialized models. For each model, the bars represent benchmark scores in Korean and English, while the dot indicates the overall average.

While several models exhibit strong performance on Korean benchmarks, often with reduced results in English, Hunminai-1.0 maintains competitive performance in both languages. This balanced outcome underscores its robustness in multilingual contexts while retaining its strength in Korean-specific tasks.

[Figure 1] Language-specific benchmark scores for Hunminai-1.0 and comparable Korean-specialized models. Bars represent Korean and English scores; dots indicate overall averages.

K-BENCH

K-BENCH is a comprehensive benchmark suite designed to evaluate both the qualitative and quantitative performance of Korean language models. It encompasses a diverse range of tasks that assess models from multiple perspectives, including their real-world applicability and linguistic capabilities. The composition and evaluation protocols are summarized in Table 3 below.

Benchmark Category Evaluation Protocol
ko-bench Multi-turn Dialogue LLM-as-a-judge (GPT-4o, K-judge) scoring on a 0–10 scale based on quality and relevance of multi-turn responses.
ko-ged Subjective QA (Reading & Reasoning) LLM-as-a-judge (GPT-4o, K-judge) scoring on a 0–10 scale based on response quality.
ko-gpqa Multiple-Choice Science Reasoning Auto-scored by comparing the model’s selected choice with the ground truth.
ko-math-500 Math (Boxed Answer Extraction) Auto-scored by extracting the boxed answer and matching it to the correct solution.
ko-ged-mc:elementary / ko-ged-mc:middle / ko-ged-mc:high Multiple-Choice GED Auto-scored by matching the model’s selected option to the correct answer.
ko-ifeval Instruction Following Strict rule-based scoring based on correct interpretation and execution of explicit instructions.

[Table 3] Overview of K-BENCH Datasets and Evaluation Protocols
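As an example of the auto-scored protocols in Table 3, the sketch below illustrates boxed-answer extraction of the kind used for ko-math-500; it shows the general approach, not the exact K-BENCH scoring code.

```python
# Illustrative boxed-answer auto-scoring (ko-math-500 style): extract the last
# \boxed{...} expression from the model output and compare it, after light
# normalization, to the reference answer. Not the exact K-BENCH implementation.
import re

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in the text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)  # nested braces not handled
    return matches[-1].strip() if matches else None

def is_correct(model_output: str, reference: str) -> bool:
    answer = extract_boxed(model_output)
    if answer is None:
        return False
    normalize = lambda s: s.replace(" ", "").rstrip(".")
    return normalize(answer) == normalize(reference)

print(is_correct(r"따라서 정답은 \boxed{42}입니다.", "42"))  # True
```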

For tasks that require qualitative evaluation, such as ko-bench and ko-ged, we employed GPT-4o and K-Judge as judge models. K-Judge is based on keval, a lightweight Korean language model optimized for offline use. keval is significantly smaller than GPT-4o, yet delivers comparable evaluation results in practice. This makes it a cost-effective and scalable alternative to GPT-4o in offline or resource-constrained environments. In this study, we used the keval-12b model for evaluation.
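A minimal sketch of this judging setup with a locally loaded keval model is shown below; the repository id and judging prompt are assumptions rather than the exact K-Judge configuration.

```python
# Hedged sketch of LLM-as-a-judge scoring with a local keval checkpoint.
# The model id and judging prompt are assumptions; adapt them to the actual
# keval-12b release and the K-Judge prompt template.
import torch
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="davidkim205/keval-12b",  # assumed repository id for keval-12b
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = (
    "다음 질문과 모델 응답을 0~10점 척도로 평가하고 점수만 출력하세요.\n"
    "[질문] 한국의 수도는 어디인가요?\n"
    "[응답] 한국의 수도는 서울입니다.\n"
    "점수:"
)
output = judge(prompt, max_new_tokens=8)[0]["generated_text"]
print(output)  # parse the numeric score from the continuation after "점수:"
```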

Figure 2 visualizes the differences in evaluation scores when GPT-4o and keval are used as judge models for ko-bench, ko-ged, and their overall average. Overall, keval tends to assign slightly higher scores than GPT-4o, particularly for smaller models. Based on the average benchmark scores, model rankings remained largely consistent across both judges, with two exceptions: Hunminai-1.0-27B and EXAONE-4.0-32B swapped ranks 1 and 2, and Midm-2.0-Base-Instruct and EXAONE-3.5-7.8B-Instruct swapped ranks 5 and 6.

Accordingly, in the following sections, we report ko-bench and ko-ged results based solely on keval scores.

[Figure 2] Comparison of Evaluation Results: GPT-4o vs. keval as Judge Models

In the K-BENCH evaluation, Hunminai-1.0-27B and EXAONE-4.0-32B, which ranked first and third respectively in the quantitative evaluation, achieved the top two positions in the qualitative assessment. Hunminai-1.0-27B performed particularly well on ko-ged and ko-math-500, while EXAONE-4.0-32B scored highest on ko-bench, ko-gpqa, and ko-ifeval.

In contrast, A.X-4.0-Light, which ranked second in the LM Evaluation Harness results, fell one position in the K-BENCH ranking and performed on par with Hunminai-1.0-12B, with both models averaging 7.88. The detailed scores for each benchmark and model are provided in Table 4.

These results highlight how model performance can vary across benchmark types and demonstrate the value of incorporating qualitative evaluations for a more comprehensive understanding of language model capabilities.

Model Avg ko-bench ko-ged ko-ged:E ko-ged:M ko-ged:H ko-gpqa ko-math-500 ko-ifeval
1 davidkim205/Hunminai-1.0-27b 8.52 8.22 9.31 9.86 9.67 9.60 4.55 8.56 8.41
2 LGAI-EXAONE/EXAONE-4.0-32B 8.21 8.56 9.35 9.38 9.22 9.12 5.25 6.32 8.49
3 skt/A.X-4.0-Light 7.88 8.20 8.95 9.65 9.55 9.64 3.38 5.56 8.08
4 davidkim205/Hunminai-1.0-12b 7.88 8.15 9.03 9.72 9.63 9.32 3.18 5.60 8.37
5 K-intelligence/Midm-2.0-Base-Instruct 7.57 8.19 8.16 9.72 9.31 9.48 2.68 4.80 8.24
6 LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct 7.42 8.22 8.58 9.65 9.10 9.00 3.13 4.88 6.76
7 kakaocorp/kanana-1.5-8b-instruct-2505 7.22 7.70 8.87 9.10 9.02 9.08 2.83 3.72 7.47
8 naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B 5.11 4.46 4.53 8.40 7.63 7.31 1.87 2.94 3.75

[Table 4] K-BENCH Scores by Benchmark and Model

Conclusion

Hunminai-1.0 is a Korean-specialized language model that demonstrates strong performance in both quantitative and qualitative evaluations. Built on the Gemma-3 architecture and fine-tuned on a curated Korean instruction dataset, it delivers stable and strong results across tasks such as instruction understanding, dialogue generation, and open-ended reasoning. Notably, in both the LM Evaluation Harness and K-BENCH assessments, Hunminai-1.0 maintains well-balanced performance in Korean and English, outperforming existing Korean-specialized models and highlighting its potential for multilingual applications.

However, Hunminai-1.0 has several limitations. First, the training data is not publicly available, which limits transparency and reproducibility. Second, the qualitative evaluations rely on the cost-effective keval judge model, which tends to assign slightly higher scores, especially to smaller models, so caution is needed when comparing absolute scores across judges.

Through high-quality instruction tuning and public release, Hunminai-1.0 contributes to advancing the Korean LLM ecosystem. Its balanced multilingual capabilities make it suitable for a wide range of applications, from search and education to customer support. By enabling practical deployments, it also fosters broader AI adoption and drives innovation in Korean language technology.