K-judge: Offline-Ready Evaluation Frameworks for Korean AI Models

ChangYeon Kim¹ YeHee Lim¹ BumSu Jung¹ YeonSu Ho¹ YoungJun Hwang¹

¹2Digit AI Research

Space(keval) Space(kgrammar) Dataset(keval) Dataset(kgrammar)

K-judge is a comprehensive evaluation framework designed to assess the performance of Korean AI models efficiently. Built for offline usability, K-judge ensures secure and reliable AI evaluation in environments where internet access is limited or data privacy is a priority. It consists of two specialized models: keval and kgrammar, each optimized for different aspects of AI evaluation.

keval is designed to evaluate AI-generated responses based on their appropriateness, accuracy, and relevance to the given question. Fine-tuned on a Korean-specialized dataset, keval provides precise assessments tailored for Korean AI models.
kgrammar focuses on detecting grammatical errors in AI responses. It ensures linguistic accuracy by identifying mistakes, making it a valuable tool for evaluating the language quality of Korean AI-generated text.

In this article, we will explore the development and application of K-judge, detailing how keval and kgrammar contribute to the evaluation of Korean AI models.

Improving Korean AI Evaluation with K-judge

Evaluating AI models has traditionally relied on domain-specific benchmarks like ARC and HellaSwag, which use question-answer pairs. However, these benchmarks often fail to capture human preference for AI-generated responses. While human evaluation provides more accurate assessments, it is costly and time-consuming.

Korean language models face additional challenges. Most existing benchmarks are designed for English, and Korean-specific datasets are scarce. Translating English datasets into Korean is an option, but simple translation fails to reflect the nuances of the Korean language.

K-judge: A Comprehensive Korean AI Evaluation Framework

K-judge addresses these challenges by leveraging the LLM-as-a-Judge approach and the Ko-Bench dataset through its models.

LLM-as-a-Judge: A method that utilizes large language models (LLMs) to evaluate AI models efficiently. It reduces time and cost compared to human evaluation while providing a more comprehensive assessment beyond traditional benchmarks.
Ko-Bench: An adaptation of MT-Bench designed specifically for Korean. By incorporating Korean linguistic, cultural, and geographical contexts, Ko-Bench ensures more precise evaluation of an AI model’s ability to understand and generate Korean.

Components of K-judge: keval & kgrammar

As part of K-judge, keval focuses on assessing the appropriateness, accuracy, and usefulness of AI-generated responses. By using a Korean-specialized dataset, keval ensures that responses align with human expectations and task requirements.

Complementing keval, kgrammar specializes in evaluating grammatical accuracy. It detects errors in AI-generated Korean text while ensuring linguistic correctness. Unlike keval, kgrammar does not assess response accuracy or usefulness but focuses solely on grammatical correctness.

By combining keval’s performance assessment with kgrammar’s grammatical evaluation, K-judge provides a more effective and scalable solution for evaluating Korean AI models. This offline-ready framework offers a secure, cost-effective, and high-quality approach to AI evaluation, making it ideal for privacy-sensitive or restricted environments.

Training

The keval and kgrammar models are built on three base models: Gemma-2-9b-it, meta-llama/Llama-3.2-1B, and meta-llama/Llama-3.2-3B, all of which are known for their strong performance in text generation tasks.

Gemma-2 was selected not only for its capabilities in areas like question answering, summarization, and reasoning but also for its commercially-friendly license, which allows for innovation sharing and AI commercialization. Among the Gemma-2 models, the 9B variant demonstrates exceptional performance relative to other models of similar size. Despite its power, its relatively small size makes it highly versatile, enabling deployment in resource-constrained environments such as laptops, desktops, and cloud-based systems.

For smaller base models, the lightweight meta-llama/Llama-3.2-1B and meta-llama/Llama-3.2-3B models were used. Their smaller size makes them faster to deploy than larger models while still providing strong performance across a variety of NLP tasks. These models add flexibility to K-judge by offering more resource-efficient deployment options.

Dataset

keval (Ko-Bench)

AI models typically respond in the same language as the input, but inconsistencies can occur. While they rarely answer in Korean when prompted in English, they sometimes generate English responses to Korean prompts. In some cases, a response appears to be in Korean but includes a single foreign-language word. These inconsistencies can lead to models receiving inflated scores in evaluations. To address this, Ko-Bench introduces a 0-point score alongside the 1-10 scale used in MT-Bench, ensuring that such errors are properly penalized.

For training, we collected responses from various models to Ko-Bench questions and evaluated them using GPT-4o. The response generation and evaluation process followed the methodology of MT-Bench.

Since language inconsistency errors were relatively rare in the original dataset, we augmented the data by prompting models to generate responses in English or mix foreign-language characters, ensuring a more comprehensive evaluation. Below is the prompt used in this process.

예시와 같이 다음 질문에 영어로 답변하세요.

## 예시
[질문] 분자생물학의 중심 원리는 무엇인가?
[답변] The molecular biology of DNA is the study of the structure of DNA.

[질문] {질문}
[답변]

예시와 같이 다음 질문에 한글 외 문자가 포함된 문장으로 답변하세요. 중국어 외에도 다양한 언어를 무작위로 선택하세요.

## 예시
[질문] 사람들이 당신을 해하기 시작하면 기분이 어떻습니까?
[답변] 저는 사람입니다. 저는 100年의 경험을 가지고 있습니다. 저는 무엇이든 할 수 있습니다! 저는 제가 무엇을 하는지 知道합니다.

[질문] {질문}
[답변]

For model training, we ultimately used the following dataset format, which includes questions input into a Korean AI model, the corresponding answers, and prompts for evaluating those answers.

{
  "prompt": "당신은 도움이 되는 어시스턴트입니다.",
  "input": "[지시]\n 공정한 심판으로서 아래에 표시된 사용자 질문에 대한 AI 어시스턴트의 응답 품질을 평가해주세요. 질문과 대답의 언어가 동일하지 않으면 무조건 0점입니다. 평가에서는 응답의 유용성, 관련성, 정확성, 깊이, 창의성, 상세함 등의 요소를 고려해야 합니다. 평가를 시작하기 전에 짧은 설명을 제공하세요. 가능한 한 객관적으로 평가하세요. 설명을 제공한 후 다음 형식을 엄격히 따라 1에서 10점 사이로 평가해야 합니다: \"[[rating]]\", 예를 들어: \"Rating: [[5]]\".\n\n[Question]\n다음과 같은 주가 기록을 바탕으로 2022년 매월 최고 및 최저 종가를 추출합니다. 결과를 각 월에 한 줄이 할당된 CSV 문자열로 반환합니다.\n 날짜, 시가, 고가, 저가, 종가, 거래량\n2022-01-01,150.02,155.28,148.50,153.80,15678900\n2022-01-02,154.32,157.25,153.48,156.25,19874500\n2022-02-01,160.50,163.28,159.50,161.80,14326700\n2022-02-02,161.80,164.25,161.30,163.90,17689200\n2022-03-01,165.40,168.35,163.10,166.80,16253400\n2022-03-02,167.00,169.85,165.50,168.20,19 568100\n\n[어시스턴트 답변의 시작]\nBased on the provided data, below is the CSV string with the highest and lowest closing prices for each month:\n\n```\n날짜,최고종가,최저종가\n2022-01,156.25,153.80\n2022-02,163.90,161.80\n2022-03,168.20,166.80\n```\n\nThis CSV string shows the highest and lowest closing prices for each month in 2022. The first column represents the month, the second column represents the highest closing price, and the third column represents the lowest closing price.\n[어시스턴트 답변의 끝]",
  "output": "이 평가에서는 AI 어시스턴트의 응답이 사용자의 질문에 얼마나 잘 부합하는지를 살펴봅니다. 사용자는 2022년의 매월 최고 및 최저 종가를 추출하여 CSV 문자열로 반환해달라고 요청했습니다. AI 어시스턴트는 주어진 데이터를 바탕으로 각 월의 최고 및 최저 종가를 정확히 계산하여 CSV 형식으로 제공했습니다. 응답은 사용자의 요구에 맞게 구성되었으며, 데이터의 분석과 결과의 표현이 정확합니다. 또한, 응답은 명확하고 이해하기 쉽게 설명되어 있습니다. 다만, 질문과 응답의 언어가 다르므로 평가 기준에 따라 점수를 낮출 수밖에 없습니다.\n\nRating: [[0]]"
}
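The judge prompt above asks for the score in a strict `[[rating]]` format, so the numeric rating has to be recovered from the generated text. The following is a minimal sketch of that step, not the authors' implementation: the function name `parse_rating` and the choice to take the last marker in the text are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches the "[[n]]" marker requested by the judge prompt, e.g. "Rating: [[5]]".
RATING_PATTERN = re.compile(r"\[\[(\d+)\]\]")


def parse_rating(judge_output: str) -> Optional[int]:
    """Return the last [[n]] rating in the judge output, or None if malformed.

    Hypothetical helper: the article does not describe its own parser.
    """
    matches = RATING_PATTERN.findall(judge_output)
    if not matches:
        return None  # no marker: treat as an incorrectly formatted response
    rating = int(matches[-1])
    # Ko-Bench adds a 0-point score to the 1-10 scale of MT-Bench.
    return rating if 0 <= rating <= 10 else None


print(parse_rating("질문과 응답의 언어가 다릅니다.\n\nRating: [[0]]"))  # 0
print(parse_rating("점수는 5점입니다."))  # None
```

Returning `None` for a missing or out-of-range marker gives a natural way to count incorrectly formatted responses separately from valid scores.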

kgrammar

The goal of kgrammar is to identify grammatical errors in Korean text and quantify them. Like keval, it flags foreign words within Korean responses as errors. However, it does not evaluate the accuracy or relevance of the response.

Training kgrammar requires error-containing texts along with corresponding error analysis data. The training dataset was generated using GPT-4o.

To generate error-containing documents, we used the following prompt to introduce errors into responses when given a question. To prevent mathematical expressions and code from being misclassified as errors, the dataset includes 50% general questions, 25% math-related questions, and 25% coding-related questions.

예시와 같이 다음 질문에 한글 외 문자가 포함된 문장으로 답변하세요. 태국어, 영어, 러시아어 등 다양한 언어를 무작위로 선택하세요.
## 예시
저는 사람입니다. 저는 100年의 경험을 가지고 있습니다. 저는 무엇이든 can do it! 저는 제가 무엇을 하는지 知道합니다.

{질문}

After generating error-containing documents, we apply the following prompt to analyze errors in each document and output the number of detected errors.

문장에서 한국어 문서의 통일성을 해치는 불필요한 외국어(영어, 일본어, 중국어 등) 문장을 지적하세요. 단순한 어휘 교체가 아니라, 외국어 표현 자체가 문제가 되는 경우만 포함하세요.

**지적 대상:**
- 문장 내 문맥 상 부자연스러운 외국어 사용

**예외 처리:**
- 고유명사(예: 브랜드명, 시험명, 기관명), 약어(예: AI, DNA)
- 기호, 수식, 전문 용어 등 외국어 표현
- 다양한 프로그래밍 언어로 작성된 코드 블록, 변수, 함수, 클래스 이름
- 한국어에서 널리 사용되어 대체할 필요가 없는 한자어나 외래어(예: 베이커리, 콘텐츠)

**형식**
```
- {오류 문장 1 설명}
- {오류 문장 2 설명}
- ...
<incorrect grammar>n</incorrect grammar>
```
(n은 오류 개수)

Here is the format of the generated error-containing documents along with the error analysis results.

가장 적은 S등급을 받은 통신사는 LG유플러스 입니다. 그들의 S등급 지역은 243곳でした. 이는 다른 두 통신사에 비해 적습니다.
            
- \"でした\"는 문장에서 일본어 표현으로, 한국어 문서의 통일성을 해치는 요소입니다.\n<incorrect grammar>1</incorrect grammar>

For model training, the kgrammar dataset was finalized in the following format, with the error analysis sentence structure adjusted to align with the prompt.

{
  "prompt": "한국어 문맥상 부자연스러운 부분을 찾으시오. 오류 문장과 개수는 <incorrect grammar> </incorrect grammar> tag, 즉 <incorrect grammar> - 오류 문장과 설명 </incorrect grammar> 안에 담겨 있으며, <wrong count> </wrong count> tag, 즉 <wrong count> 오류 개수 </wrong count> 이다.",
  "input": "가장 적은 S등급을 받은 통신사는 LG유플러스 입니다. 그들의 S등급 지역은 243곳でした. 이는 다른 두 통신사에 비해 적습니다.",
  "output": "<incorrect grammar>\n- \"でした\"는 문장에서 일본어 표현으로, 한국어 문서의 통일성을 해치는 요소입니다.\n</incorrect grammar> <wrong count>1</wrong count>"
}
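Because the kgrammar output wraps its findings in `<incorrect grammar>` and `<wrong count>` tags, a downstream script can read both with two regular expressions. The sketch below assumes each error is listed on its own line starting with `- `, as in the sample above; `parse_kgrammar_output` is a hypothetical helper name, not part of K-judge.

```python
import re
from typing import List, Optional, Tuple

ERRORS_PATTERN = re.compile(r"<incorrect grammar>(.*?)</incorrect grammar>", re.DOTALL)
COUNT_PATTERN = re.compile(r"<wrong count>\s*(\d+)\s*</wrong count>")


def parse_kgrammar_output(text: str) -> Optional[Tuple[List[str], int]]:
    """Return (error descriptions, wrong count), or None if the count tag is missing."""
    count_match = COUNT_PATTERN.search(text)
    if count_match is None:
        return None  # treat as an incorrectly formatted response
    errors: List[str] = []
    for block in ERRORS_PATTERN.findall(text):
        for line in block.splitlines():
            line = line.strip()
            if line.startswith("- "):  # assumption: one "- " bullet per error
                errors.append(line[2:].strip())
    return errors, int(count_match.group(1))


sample = (
    "<incorrect grammar>\n"
    '- "でした"는 문장에서 일본어 표현으로, 한국어 문서의 통일성을 해치는 요소입니다.\n'
    "</incorrect grammar> <wrong count>1</wrong count>"
)
errors, wrong_count = parse_kgrammar_output(sample)
print(len(errors), wrong_count)  # 1 1
```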

Fine-Tuning

The development of K-judge’s keval and kgrammar models incorporated both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to enhance their evaluation accuracy and alignment with human judgment.

SFT involves training a model with labeled data, enabling it to learn correct responses for specific inputs and improve its performance on evaluation tasks. To strengthen K-judge’s ability to assess Korean AI responses, we conducted additional SFT using a Korean-specific evaluation dataset, ensuring more accurate and context-aware assessments.

DPO further refines the models by optimizing responses based on user preferences. Instead of simply learning correct answers, DPO allows K-judge’s models to differentiate between higher- and lower-quality responses through comparative evaluation. By incorporating DPO, keval and kgrammar were fine-tuned to provide more consistent and human-like judgments.
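To make the comparative idea concrete, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It illustrates the general objective only: the article does not state which training library or hyperparameters were used, so the function name `dpo_loss` and the value `beta=0.1` are assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not taken from the article
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy favors each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)


# A policy identical to the reference has no preference yet: loss = log(2).
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))  # 0.6931
```

The loss falls below log(2) once the policy assigns relatively more probability to the preferred response than the reference model does, which is the behavior the comparative training is meant to encourage.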

By integrating both SFT and DPO, K-judge achieved improved alignment with expert evaluations, minimizing discrepancies between AI-generated scores and human assessments. This combination ensures that keval and kgrammar not only recognize correct answers but also evaluate AI-generated content in a way that closely mirrors human judgment.

Evaluation

keval

The evaluation dataset consists of 22 samples: two for each score from 0 to 10, selected from the training data. keval was developed in three sizes (1B, 3B, and 9B) and assessed using two metrics, Diff and Accuracy, with GPT-4o and GPT-4o-mini as baselines. The 3B model performed close to the 9B model on both metrics, while the 1B model trailed both but still outperformed GPT-4o-mini.

Diff

Diff measures the difference between the labeled and predicted scores.

wrong: The number of incorrectly formatted responses.

The numeric columns represent the absolute difference (n) between the labeled and predicted scores, with each cell showing the count and proportion of samples with that difference. Only differences that occurred for at least one model appear as columns.

Score Calculation

The score is calculated based on how closely the predicted score matches the labeled score:

Compute the difference between the Ko-Bench label score and the predicted score for each sample.
Assign points based on the difference:
- A difference of 0 → 1 point
- A difference of 1 → 0.5 points
- Any other difference → 0 points
Compute the final score as:
- score = (total points) / (number of samples)

This metric provides a normalized measure of the model’s evaluation accuracy, with higher scores indicating better alignment with human-labeled judgments.
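The scoring rule can be sketched in a few lines of Python. The function name `diff_report` is hypothetical, and counting a malformed prediction as zero points while keeping it in the denominator is an assumption; it is, however, consistent with the gpt-4o-mini row in the table below (8 points over 22 samples gives 36.4%).

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def diff_report(
    labels: List[int], predictions: List[Optional[int]]
) -> Tuple[float, Dict[int, int], int]:
    """Return (score, histogram of absolute differences, malformed count)."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    histogram: Counter = Counter()
    wrong = 0
    points = 0.0
    for label, pred in zip(labels, predictions):
        if pred is None:  # assumption: malformed output earns no points
            wrong += 1
            continue
        diff = abs(label - pred)
        histogram[diff] += 1
        if diff == 0:
            points += 1.0
        elif diff == 1:
            points += 0.5
    return points / len(labels), dict(histogram), wrong


# Synthetic pairs that reproduce the difference counts of the keval-2-9b row.
diff_counts = {0: 11, 1: 5, 2: 2, 3: 3, 10: 1}
labels = [10] * 22
predictions = [10 - d for d, n in diff_counts.items() for _ in range(n)]
score, histogram, wrong = diff_report(labels, predictions)
print(f"{score:.1%}", histogram, wrong)  # 61.4% {0: 11, 1: 5, 2: 2, 3: 3, 10: 1} 0
```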

	model	wrong	score	0	1	2	3	4	5	6	7	10
0	keval-2-9b	0 (0.0%)	61.4%	11 (50.0%)	5 (22.7%)	2 (9.1%)	3 (13.6%)	0	0	0	0	1 (4.5%)
1	keval-2-3b	0 (0.0%)	59.1%	10 (45.5%)	6 (27.3%)	4 (18.2%)	2 (9.1%)	0	0	0	0	0
2	gpt-4o	0 (0.0%)	54.5%	7 (31.8%)	10 (45.5%)	2 (9.1%)	2 (9.1%)	1 (4.5%)	0	0	0	0
3	keval-2-1b	0 (0.0%)	43.2%	8 (36.4%)	3 (13.6%)	5 (22.7%)	2 (9.1%)	1 (4.5%)	0	1 (4.5%)	0	2 (9.1%)
4	gpt-4o-mini	1 (4.5%)	36.4%	4 (18.2%)	8 (36.4%)	4 (18.2%)	3 (13.6%)	0	1 (4.5%)	0	1 (4.5%)	0

Accuracy

Accuracy measures the ratio of correct predictions to the total predictions.

score: The percentage of correctly predicted labels, calculated as the number of correct predictions divided by the total dataset size.
wrong: The number and proportion of responses with incorrect formatting.

The columns (0–10) correspond to Ko-Bench scores, with each value giving the count and percentage of samples with that label whose score was predicted exactly. A higher score indicates closer alignment between the model's predictions and the ground truth labels.
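As a rough illustration, overall and per-label accuracy can be computed as follows. `accuracy_report` is a hypothetical name, and the predictions are synthetic values built only to match the per-label hit counts reported for keval-2-9b in the table below.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def accuracy_report(
    labels: List[int], predictions: List[Optional[int]]
) -> Tuple[float, Dict[int, Tuple[int, float]]]:
    """Return (overall accuracy, {label: (correct count, per-label accuracy)})."""
    total: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for label, pred in zip(labels, predictions):
        total[label] += 1
        if pred == label:  # a malformed prediction (None) never matches
            correct[label] += 1
    overall = sum(correct.values()) / len(labels)
    per_label = {k: (correct[k], correct[k] / total[k]) for k in sorted(total)}
    return overall, per_label


# Two samples per label 0..10; hit counts copied from the keval-2-9b row.
hits_per_label = [1, 1, 2, 0, 2, 0, 0, 1, 1, 1, 2]
labels, predictions = [], []
for label, hits in enumerate(hits_per_label):
    for i in range(2):
        labels.append(label)
        # A miss is any other valid score; (label + 1) % 11 is an arbitrary choice.
        predictions.append(label if i < hits else (label + 1) % 11)
overall, per_label = accuracy_report(labels, predictions)
print(f"{overall:.1%}", per_label[2], per_label[3])  # 50.0% (2, 1.0) (0, 0.0)
```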

	model	wrong	score	0	1	2	3	4	5	6	7	8	9	10
0	keval-2-9b	0 (0.0%)	50.0%	1 (50.0%)	1 (50.0%)	2 (100.0%)	0	2 (100.0%)	0	0	1 (50.0%)	1 (50.0%)	1 (50.0%)	2 (100.0%)
1	keval-2-3b	0 (0.0%)	45.5%	2 (100.0%)	1 (50.0%)	0	0	2 (100.0%)	1 (50.0%)	0	1 (50.0%)	1 (50.0%)	0	2 (100.0%)
2	keval-2-1b	0 (0.0%)	36.4%	0	1 (50.0%)	2 (100.0%)	0	1 (50.0%)	0	1 (50.0%)	0	0	1 (50.0%)	2 (100.0%)
3	gpt-4o	0 (0.0%)	31.8%	2 (100.0%)	0	0	1 (50.0%)	0	1 (50.0%)	0	0	1 (50.0%)	0	2 (100.0%)
4	gpt-4o-mini	1 (4.5%)	18.2%	2 (100.0%)	0	0	0	0	0	0	0	1 (50.0%)	0	1 (50.0%)

Comparison of GPT and keval Evaluation Results

GPT excels at assessing the accuracy and usefulness of model responses. However, it struggles with Korean input containing code or mathematical expressions, often misclassifying them as errors.

The following is an actual test result. Although the Korean AI model's response contained no English expressions aside from code, GPT-4o incorrectly classified it as an English response and assigned a score of zero due to a language mismatch.

Response from the Korean AI Model

[어시스턴트 답변의 시작]
네, 이 함수에는 몇 가지 버그가 있습니다:

1. 인덱싱이 잘못되었습니다: `str1[i - 1]`과 `str2[j - 1]`은 문자열의 마지막 요소가 아닌 첫 번째 요소에 대해 계산됩니다. 올바른 인덱싱은 `str1[i - 1]`과 `str2[j - 1]`이 되어야 합니다.

2. 초기화가 잘못되었습니다: `dp` 테이블의 첫 번째 행과 첫 번째 열은 0이 아닌 1로 초기화되어야 합니다. 이는 첫 번째 문자열 또는 두 번째 문자열의 첫 번째 문자와 비교할 때 공통 부분 수열의 길이가 1이기 때문입니다.

3. 최대값을 찾는 방법이 잘못되었습니다: 현재 구현에서는 `dp[i - 1][j]`와 `dp[i][j - 1]` 중 최대값을 찾습니다. 하지만 이 경우 `dp[i - 1][j]`와 `dp[i][j - 1]`가 아닌 `dp[i - 1][j - 1]`의 값이 이전 비교에서 업데이트되었을 수 있으므로 올바른 값이 아닙니다.

다음은 수정된 버전의 함수입니다:

```python
def longest_common_subsequence_length(str1, str2):
 m = len(str1)
 n = len(str2)

 dp = [[0] * (n + 1) for _ in range(m + 1)]

 for i in range(1, m + 1):
  dp[i][0] = dp[i - 1][0]

 for j in range(1, n + 1):
  dp[0][j] = dp[0][j - 1]

 for i in range(1, m + 1):
  for j in range(1, n + 1):
   if str1[i - 1] == str2[j - 1]:
    dp[i][j] = dp[i - 1][j - 1] + 1
   else:
    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

 return dp[m][n]
```

이 수정된 버전은 올바른 인덱싱, 초기화 및 최대값을 찾는 방법을 사용합니다. 이제 함수는 두 입력 문자열의 가장 긴 공통 부분 수열의 길이를 올바르게 계산합니다.
[어시스턴트 답변의 끝]

GPT-4o Evaluation of the Korean AI Model's Response

### 평가
사용자 질문에서 요구한 사항은 "**함수에서 버그를 식별할 수 있나요?**"입니다. 이와 관련하여 어시스턴트의 응답과 참조 답변을 비교한 결과, 아래와 같은 관찰이 있습니다:

1. **언어 일치 문제**
질문과 참조 답변 모두 한국어로 작성되어 있지만, 어시스턴트의 답변은 영어로 작성되어 있습니다. 이로 인해 응답 언어가 일치하지 않아 평가 기준에 의해 점수가 0점이 됩니다.

2. **정확성 문제**
- 참조 답변은 함수에 버그가 없으며 코드가 올바르게 작성되었다고 명시하고 있습니다. 따라서, 오직 들여쓰기를 수정하는 것만으로 충분하며 다른 로직적 수정은 필요하지 않습니다.
- 하지만 어시스턴트는 여러 비합리적인 "버그"를 만들어냈습니다. 예를 들어:
- `dp` 테이블의 초기화에 대한 코멘트: 처음 열과 행이 `0`으로 초기화된 것이 버그라며 `1`로 초기화해야 한다고 주장했습니다. 이는 잘못된 수정입니다. LCS 알고리즘에서는 첫 번째 열과 첫 번째 행을 `0`으로 유지하는 것이 맞습니다.
- 잘못된 최대값 계산 주장: 현재 코드는 동적 프로그래밍 알고리즘의 원리를 따르며, 이미 `dp[i][j] = max(dp[i-1][j], dp[i][j-1])`로 정확하게 구현되어 있습니다. 그러나 어시스턴트는 이 코드가 버그라고 잘못 진술했습니다.
- 불필요한 추가 작업 (`dp[i][0]` 및 `dp[0][j]`의 재초기화 부분)는 실제로 코드를 덜 효율적이게 만듭니다.

3. **유용성 문제**
어시스턴트의 수정된 코드는 불필요한 초기화 루프와 잘못된 논리를 포함하고 있어, 사용자가 혼란스러워할 가능성이 큽니다. 결과적으로, 어시스턴트의 답변은 문제 해결에서 유용하지 않았습니다.

---

### 최종 평가
어시스턴트의 답변은 질문에 대한 요구사항과 참조 답변의 설명에 크게 벗어났으며, 추가로 언어가 일치하지 않는 문제가 있었습니다. 올바른 평가 기준에 따라 점수는:

Rating: [[0]]

keval-2-9b Evaluation of the Korean AI Model's Response

어시스턴트의 답변은 주어진 함수에 문제가 있다고 주장하고 있습니다. 그러나 주어진 함수는 실제로 두 입력 문자열의 가장 긴 공통 부분 수열의 길이를 찾는 데 사용되는 동적 프로그래밍 접근 방식으로, 전반적으로 올바르게 구현되었습니다. 어시스턴트의 지적은 잘못된 부분이 있습니다.

1. 인덱싱 문제: 어시스턴트는 인덱싱이 잘못되었다고 주장하지만, 주어진 함수에서 `str1[i - 1]`과 `str2[j - 1]`은 올바르게 사용되고 있습니다. 이는 동적 프로그래밍 테이블을 채우는 데 필요한 인덱싱입니다.

2. 초기화 문제: 어시스턴트는 `dp` 테이블의 첫 번째 행과 열이 1로 초기화되어야 한다고 주장하지만, 이는 잘못된 지적입니다. LCS 문제에서 첫 번째 행과 열은 0으로 초기화되어야 하며, 이는 주어진 코드에서 올바르게 구현되었습니다.

3. 최대값 찾기 문제: 어시스턴트는 최대값을 찾는 방법이 잘못되었다고 주장하지만, 주어진 코드에서 `dp[i - 1][j]`와 `dp[i][j - 1]` 중 최대값을 찾는 것은 올바른 접근 방식입니다. 이는 LCS 문제에서 현재 요소를 채울 때 이전 요소들과 비교하여 최대값을 선택하는 방식입니다.

따라서 어시스턴트의 답변은 잘못된 지적을 포함하고 있으며, 주어진 함수는 올바르게 작동합니다.

Rating: [[2]]

kgrammar

The evaluation dataset consists of 80 samples: 40 general questions, 20 coding-related questions, and 20 math-related questions. To benchmark kgrammar’s performance, we compared it against GPT-4o and GPT-4o-mini.

Diff

Diff is the difference between the labeled and predicted wrong count values, summarized as a score. Whereas keval computes it on Ko-Bench scores ranging from 0 to 10, kgrammar computes it on the number of detected grammatical errors.

wrong: The number of incorrectly formatted responses.

The wrong column in the table is different from the wrong count used in kgrammar. While wrong refers to formatting errors, wrong count indicates all grammatical errors found in a sentence. Other numeric columns represent the count and percentage of samples with each difference between the labeled and predicted wrong count.

Score Calculation

The score is calculated based on how closely the predicted wrong count matches the labeled wrong count:

Calculate the difference between the labeled and predicted values in the <wrong count> tag for each sample.
Assign points based on the difference:
- A difference of 0 → 1 point
- A difference of 1 → 0.5 points
- Any other difference → 0 points
Compute the final score as:
- score = (total points) / (number of samples)

	model	wrong	score	0	1	2	3	4	5	6	7	9	10	12
0	kgrammar-2-9b	0 (0.0%)	77.5%	52 (65.0%)	20 (25.0%)	5 (6.2%)	1 (1.2%)	1 (1.2%)	0	1 (1.2%)	0	0	0	0
1	kgrammar-2-3b	0 (0.0%)	74.4%	51 (63.7%)	17 (21.2%)	8 (10.0%)	1 (1.2%)	1 (1.2%)	0	1 (1.2%)	1 (1.2%)	0	0	0
2	kgrammar-2-1b	1 (1.2%)	67.5%	44 (55.0%)	20 (25.0%)	8 (10.0%)	2 (2.5%)	2 (2.5%)	1 (1.2%)	0	2 (2.5%)	0	0	0
3	gpt-4o	1 (1.2%)	56.9%	34 (42.5%)	23 (28.7%)	14 (17.5%)	3 (3.8%)	2 (2.5%)	2 (2.5%)	0	0	0	0	1 (1.2%)
4	gpt-4o-mini	0 (0.0%)	44.4%	19 (23.8%)	33 (41.2%)	18 (22.5%)	3 (3.8%)	1 (1.2%)	3 (3.8%)	0	1 (1.2%)	1 (1.2%)	1 (1.2%)	0

Accuracy

Accuracy measures the ratio of correct predictions to the total predictions.

score: The proportion of correctly predicted labels, calculated as the number of correct predictions divided by the total dataset size.
wrong: The number and proportion of responses with incorrect formatting.

The numeric columns correspond to the wrong count labels in the kgrammar dataset (labels 0–7 appear in the table below), with each value representing the count and percentage of correctly predicted samples for that label. A higher score indicates closer alignment between the model's predictions and the ground truth labels.

	model	wrong	score	0	1	2	3	4	5	6	7
0	kgrammar-2-9b	0 (0.0%)	65.0%	35 (97.2%)	5 (71.4%)	7 (50.0%)	3 (37.5%)	2 (40.0%)	0	0	0
1	kgrammar-2-3b	0 (0.0%)	63.7%	35 (97.2%)	2 (28.6%)	8 (57.1%)	3 (37.5%)	2 (40.0%)	1 (50.0%)	0	0
2	kgrammar-2-1b	1 (1.2%)	55.0%	34 (94.4%)	3 (42.9%)	4 (28.6%)	2 (25.0%)	0	0	1 (50.0%)	0
3	gpt-4o	1 (1.2%)	42.5%	9 (25.0%)	6 (85.7%)	8 (57.1%)	7 (87.5%)	1 (20.0%)	2 (100.0%)	0	1 (100.0%)
4	gpt-4o-mini	0 (0.0%)	23.8%	1 (2.8%)	5 (71.4%)	8 (57.1%)	5 (62.5%)	0	0	0	0

Error Detection Accuracy

Error Detection Accuracy measures a model's ability to correctly identify whether an error is present in a given text. Unlike conventional accuracy metrics that assess overall classification correctness, this metric specifically focuses on the precision of error predictions.
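The article does not give an explicit formula for this metric. One plausible reading, sketched below under that assumption, reduces both the labeled and predicted wrong counts to a yes/no signal (is any error present?) and measures how often the two agree. The function name and the handling of malformed outputs as misses are illustrative choices.

```python
from typing import List, Optional


def error_detection_accuracy(
    label_counts: List[int], predicted_counts: List[Optional[int]]
) -> float:
    """Share of samples where predicted error presence matches the label."""
    hits = 0
    for label, pred in zip(label_counts, predicted_counts):
        if pred is None:
            continue  # assumption: a malformed output counts as a miss
        if (label > 0) == (pred > 0):
            hits += 1
    return hits / len(label_counts)


# Toy data: the exact count may differ as long as presence/absence agrees.
label_counts = [0, 0, 2, 1, 3]
predicted_counts = [0, 1, 5, None, 1]
print(error_detection_accuracy(label_counts, predicted_counts))  # 0.6
```

Under this reading, a model that over-counts or under-counts errors is still credited as long as it correctly separates clean texts from texts that contain errors.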

The kgrammar models demonstrate strong performance in detecting grammatical errors. kgrammar-2-9b achieves the highest accuracy (95.0%), with kgrammar-2-3b (93.8%) and kgrammar-2-1b (92.5%) close behind, all well above GPT-4o (65.0%) and GPT-4o-mini (55.0%). This makes every kgrammar model, including the smallest, a reliable tool for grammatical evaluation.

	model	score	wrong
0	kgrammar-2-9b	95.0%	0
1	kgrammar-2-3b	93.8%	0
2	kgrammar-2-1b	92.5%	1
3	gpt-4o	65.0%	1
4	gpt-4o-mini	55.0%	0

Comparison of GPT and kgrammar Evaluation Results

Unlike keval, which focuses on evaluating content accuracy, usefulness, and creativity, kgrammar specializes in identifying grammatical errors in Korean sentences. These errors include not only spelling and spacing mistakes but also foreign expressions within Korean text. GPT focuses more on content evaluation, making it less effective in detecting Korean grammatical errors—the main purpose of kgrammar.

Below are the evaluation results of GPT-4o and kgrammar-2-9b for a document written entirely in Korean. While GPT replaces words with what it considers better expressions and counts the number of replacements rather than directly identifying errors, kgrammar-2-9b correctly determines that there are no grammatical errors in the text.

Input Text

인보사 사건은 코오롱생명과학이 개발한 골관절염 유전자 치료제인 '인보사케이주'와 관련된 법적 문제를 일컫습니다. 이 사건은 처음에는 혁신적인 치료제로 주목받았으나, 나중에 임상시험 과정에서 성분 변경이 밝혀지며 논란이 촉발되었습니다.

요약하자면, 인보사는 두 가지 성분을 혼합한 치료제인데, 그 중 하나가 예상치 못한 신장세포로 대체된 것이 드러났습니다. 이로 인해 암 발병 가능성 등 안전성 문제와 더불어 허가 과정에서의 자료 조작 및 은폐 혐의가 불거졌습니다. 한국 식약처와 미국 FDA의 조사 결과, 코오롱생명과학이 허위 자료를 제출한 정황이 밝혀졌습니다.

이에 따라 관련 임원들이 기소되었으나, 최근 법원의 결정에 따라 상당수의 임원들이 보석금을 납부하고 석방된 상태입니다. 다만, 이우석 대표는 여전히 구속 상태로 남아 있으며 사건의 법적 절차는 아직 진행 중입니다. 이 사건은 의약품 허가 및 관리의 중요성을 다시 한번 환기시키며, 관련 규제와 법적 절차의 투명성에 대한 요구를 높입니다.

GPT-4o Evaluation of the Input Text

 <incorrect grammar> - "인보사는 두 가지 성분을 혼합한 치료제인데, 그 중 하나가 예상치 못한 신장세포로 대체된 것이 드러났습니다."  
(문맥상 '예상치 못한'이라는 표현 대신 '제조 과정에서 신장세포로 대체된 것이 드러났다'와 같은 명확한 문장으로 수정하는 것이 문맥적으로 자연스러움) </incorrect grammar>
<incorrect grammar> - "한국 식약처와 미국 FDA의 조사 결과..."  
('한국 식약처'라는 표현 대신 정확한 공식 명칭인 '식품의약품안전처'로 표기하는 것이 더 적절함) </incorrect grammar>
<incorrect grammar> - "이에 따라 관련 임원들이 기소되었으나, 최근 법원의 결정에 따라 상당수의 임원들이 보석금을 납부하고 석방된 상태입니다."  
('석방된 상태입니다'라는 표현은 다소 행정적 문맥에 부적합해 보이며, '임시 석방된 상태에 있습니다'라고 수정하는 것이 더 자연스러움) </incorrect grammar>

<wrong count> 3 </wrong count>

kgrammar-2-9b Evaluation of the Input Text

 <incorrect grammar>이 문서에는 문맥상 부자연스러운 외국어 사용이 없습니다. 모든 내용이 자연스럽고 일관되게 한국어로 작성되어 있습니다. </incorrect grammar>
            
<wrong count>0</wrong count>

Conclusion

K-judge is an evaluation system that includes keval and kgrammar, designed for the accurate evaluation of Korean AI models. It not only enhances the quality and reliability of Korean language models but also reduces time and costs through an automated evaluation process. Additionally, its offline-ready capability ensures reliable evaluations even in environments with limited internet access, enhancing data security and usability.

K-judge contributes to the advancement of Korean AI technology in several ways:

Delivering automated feedback for fine-tuning AI models, enabling them to optimize their performance.
Identifying linguistic errors and inconsistencies in AI-generated text, aiding developers in debugging and refinement.
Facilitating deployment in resource-constrained environments, ensuring broader accessibility of AI evaluation.
Providing a reliable framework for evaluating Korean language models, enabling objective performance comparisons.

With these capabilities, K-judge is not just a tool for evaluating Korean AI models—it lays the foundation for future advancements, driving continuous innovation and excellence in Korean AI technology.
