With the advancement of large language models (LLMs), extracting structured information—such as companies, people, themes, and sentiment—from financial news has become increasingly important. Financial news articles often describe multiple entities with distinct roles and mixed sentiment polarities within a single document, so understanding them requires more than single-task, document-level analysis.
This setting naturally leads to a document-level multi-aspect structured information extraction problem. However, existing financial NLP benchmarks largely focus on isolated tasks—such as named entity recognition, sentiment analysis, or question answering—often in English and at the task level. Consequently, they fail to capture the integrated information extraction challenges inherent in real Korean financial news.
To address this gap, we propose FinNewsBench, a benchmark for evaluating multi-aspect structured information extraction from Korean financial news. FinNewsBench requires models to simultaneously extract five core information types—companies, people, keywords, themes, and central-company sentiment—from a single news document.
The benchmark is constructed from Korean financial news texts used in practical financial information services, reflecting realistic content structures and information density.
Reference labels are created through a hybrid labeling pipeline that combines LLM-based automatic extraction with expert human verification, ensuring both scalability and reliability.
Model outputs are evaluated based on semantic and contextual correctness rather than strict string matching, enabling fine-grained and cost-effective comparison across models.
Existing benchmarks for evaluating large language models in the financial domain span a wide range of languages, data sources, and task formulations. Table 1 summarizes representative financial benchmarks and highlights key differences in language coverage, input data, and evaluation targets. As shown in the table, most existing benchmarks are designed either for financial knowledge assessment through question–answering or for multi-task evaluation frameworks in which individual NLP tasks are evaluated independently.
| Benchmark | Language | Input Data Structure | Primary Task Formulation & Evaluation Aspects |
|---|---|---|---|
| PIXIU (FLARE) | English | Short financial texts (news snippets, statements) | 5 financial NLP tasks (Sentiment, NER, Classification, QA) + 1 prediction task (Stock movement) |
| FinanceBench | English | Long-form financial reports (10-K, 10-Q, 8-K) and earnings disclosures | Open-book financial question answering requiring specific evidence strings and page-level citations |
| FinBen | English / Spanish | Diverse corpora (news, reports, synthetic cases, stock market data) | Holistic evaluation across 24 tasks and 8 aspects: information extraction, textual analysis, QA, text generation, risk management, forecasting, decision-making, and multilingual |
| CFinBench | Chinese | Professional finance exam questions (99,100 items) | Knowledge-oriented evaluation across Single-choice, Multiple-choice, and Judgment types |
| FinEval-KR | Chinese | Reasoning-focused financial QA with multi-layered annotations | Decoupled assessment of Knowledge (K) and Reasoning (R) scores across 22 financial subfields |
| KFinEval-Pilot | Korean | Expert-validated prompts (1,145 instances) aligned with domestic regulations | Evaluation of Financial Knowledge, Legal Reasoning, and Financial Toxicity/Safety |
| FinNewsBench (Ours) | Korean | Full-length financial news articles | Document-level multi-aspect structured information extraction |
Table 1. Comparison of Financial Benchmarks
Benchmarks such as PIXIU (FLARE) and FinBen encompass multiple financial NLP tasks, including classification, named entity recognition, sentiment analysis, and question answering. However, as indicated in Table 1, their inputs are typically short financial texts, curated task-specific samples, or heterogeneous datasets, and their evaluation frameworks treat each task as a separate objective. These benchmarks are not designed to evaluate the joint extraction and organization of multiple interrelated information elements from a single full-length financial news article.
Other benchmarks focus more narrowly on financial question answering and reasoning. FinanceBench, for example, evaluates open-book question answering over financial reports such as earnings disclosures and regulatory filings, emphasizing factual correctness and evidence grounding. While effective for assessing reasoning over structured financial documents, its task formulation—also summarized in Table 1—differs fundamentally from news-based information extraction, where multiple entities, themes, and sentiment cues co-occur within a single narrative. Benchmarks developed in non-English settings, such as CFinBench and KFinEval-Pilot, extend linguistic coverage to Chinese and Korean but similarly center on question–answering or prompt-based evaluations rather than structured information extraction from news articles.
These characteristics highlight not only a task-level gap, but also an evaluation challenge. Document-level, multi-aspect structured information extraction from financial news requires assessing partially correct outputs, contextual relevance, and the relative importance of extracted elements—properties that are difficult to capture using conventional evaluation schemes. Consequently, while existing financial benchmarks cover a broad range of tasks and evaluation paradigms, they do not sufficiently focus on this setting, nor do they provide standardized evaluation protocols tailored to document-level financial news extraction in the Korean language.
Traditional evaluation approaches have relied on human annotators or static automatic metrics such as exact match (EM) and F1 scores; however, these methods exhibit inherent limitations in scalability, cost efficiency, and evaluation consistency when applied to large-scale datasets. In particular, lexical overlap–based metrics fail to capture contextual appropriateness, judgment quality, and semantic adequacy in outputs where multiple companies, individuals, themes, and sentiment polarities co-occur within a single article. Although human evaluation can provide richer qualitative signals, it incurs substantial time and cost overhead and is subject to inter-annotator variability, making it difficult to apply consistently at scale.
To address these limitations, the LLM-as-a-Judge evaluation paradigm has been introduced, in which large language models serve as automated evaluators that directly compare and score complex model outputs. Recent studies demonstrate that LLM-based evaluation enables scalable, cost-effective, and relatively consistent assessment across diverse generation and reasoning tasks. For example, Zheng et al. report that GPT-4–level evaluators achieve over 80% agreement with human preferences on open-ended generation benchmarks such as MT-Bench and Chatbot Arena, while substantially reducing evaluation cost and latency. These findings suggest that LLM-as-a-Judge–based evaluation constitutes a practical and reliable alternative to traditional metrics, particularly for tasks requiring nuanced qualitative judgment and contextual understanding. Accordingly, we adopt this paradigm as a core evaluation strategy for FinNewsBench.
Figure 1. FinNewsBench dataset construction and annotation pipeline
FinNewsBench is designed to evaluate the ability of large language models to extract structured information from news text. For the construction of this benchmark, we utilize AI-generated Korean financial news provided by Nine Memos, a financial news service developed by 2Digit. The financial news articles from Nine Memos are generated in a format that summarizes and describes recent market trends, industry issues, and policy changes centered on a specific company. As a result, each document densely contains information related to companies, people, themes, events, and sentiment. This structure closely resembles real-world investment information consumption environments while maintaining relatively consistent formats and clearly defined information units, making it well suited as input data for an information extraction benchmark.
During dataset construction, news articles were selected to cover a wide range of companies and industries. To prevent overrepresentation of specific companies and mitigate the risk of models overfitting to company-specific expressions, the dataset includes at most one news document per company. Specifically, we adopted the single primary company tag provided by Nine Memos for each article and included at most one article per tagged company, as sketched below. This design maintains a balanced distribution of samples across companies, supports fair performance evaluation across a diverse set of firms, and facilitates efficient comparison and analysis of multiple large language models.
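A minimal sketch of this deduplication step is shown below. It assumes each article record carries a `primary_company` field holding the single company tag; the field name is an illustrative assumption, not the actual data schema.

```python
def one_article_per_company(articles: list[dict]) -> list[dict]:
    """Keep at most one article per primary company tag (first occurrence wins)."""
    seen: set[str] = set()
    selected: list[dict] = []
    for article in articles:
        company = article["primary_company"]  # single primary tag attached to the article
        if company not in seen:
            seen.add(company)
            selected.append(article)
    return selected
```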
As illustrated in Figure 2, the economic news articles within this benchmark dataset span 12 distinct industrial sectors based on the Global Industry Classification Standard (GICS). The dataset extensively represents major industry groups, particularly those with high visibility in economic reporting, such as Industrials (23.2%), Health Care (17.1%), Information Technology (12%), and Financials (11.3%). This composition closely mirrors the data distribution found in real-world market environments and provides sufficient diversity to rigorously evaluate model generalizability.
Figure 2. Distribution of Benchmark Entities by GICS Industrial Sector
The reference annotations in FinNewsBench are constructed through a Hybrid Labeling Pipeline that combines LLM-based automatic extraction with human verification. This human-in-the-loop framework is designed to efficiently identify core metadata in generative financial news while incorporating domain-specific judgment criteria relevant to the financial sector.
First, multiple LLMs—including GPT-5.2 and recent open-source models—are prompted to extract companies, people, themes, keywords, and central-company sentiment from each article. The outputs generated at this stage serve as initial annotation candidates, restricted to information explicitly mentioned in the news text. Candidate labels are first aggregated by taking the intersection of labels identified by multiple models. These candidate labels are then passed through GPT-5.2 to reassess their importance and reweight their relevance scores, producing a ranked set of salient, news-central entities. This step ensures that the final reference labels emphasize core information while remaining robust to variations and noise in initial model outputs.
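As a rough sketch of this aggregation step, candidate labels can be intersected across extractor outputs before the reweighting pass; the function below is illustrative (the category keys are assumptions) and omits the GPT-5.2 reweighting call itself.

```python
def intersect_candidates(model_outputs: list[dict[str, list[str]]],
                         categories: tuple[str, ...] = ("companies", "people", "themes", "keywords"),
                         ) -> dict[str, list[str]]:
    """Keep only labels that every extractor model produced for a given article."""
    aggregated: dict[str, list[str]] = {}
    for category in categories:
        label_sets = [set(output.get(category, [])) for output in model_outputs]
        aggregated[category] = sorted(set.intersection(*label_sets)) if label_sets else []
    # The surviving candidates are subsequently re-scored by an LLM (GPT-5.2 in our
    # pipeline) to rank news-central items; that reweighting step is not shown here.
    return aggregated
```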
Subsequently, expert human annotators review all candidate labels to correct misclassifications, remove irrelevant items, and adjust missing or excessive information. A category balancing procedure is also applied to mitigate distributional imbalances across entity types, ensuring more stable and fair model comparisons.
During this process, we acknowledge that the reference annotations are guiding signals rather than exhaustive ground truth lists. Due to annotation scope, ambiguity in financial narratives, or human oversight, some entities, themes, or keywords present in the article may not appear in the reference. Therefore, labels that are explicitly supported by the news text are considered valid, even if absent from the reference. This design principle allows the benchmark to focus on capturing news-central information while maintaining robustness to reasonable variations in candidate labels and human annotation decisions.
By integrating automatic extraction with human verification and incorporating this reference-augmented perspective, the Hybrid Labeling Pipeline produces high-quality gold-standard annotations that reflect both domain expertise and the open-ended, information-dense nature of financial news.
FinNewsBench consists of 260 Korean financial news articles with moderate variation in length. As shown in Table 2, the average article length is 920 characters, with lengths ranging from 457 to 1,826 characters. This range provides sufficient contextual information for structured information extraction while avoiding extreme cases of overly short or excessively long documents.
| | count | mean | min | max | std |
|---|---|---|---|---|---|
| length (characters) | 260 | 920.73 | 457 | 1,826 | 207.52 |
Table 2. Article Length Statistics of FinNewsBench
The density of annotated labels per article is summarized in Table 3. Each article contains a limited number of core entities, with an average of 2.2 company entities and 1.96 person entities, indicating that most news items focus on a small set of central actors. In contrast, semantic categories such as themes and keywords appear more frequently, reflecting the descriptive and multi-faceted nature of financial news. Sentiment annotations also exhibit a mixed distribution at the article level, with positive, negative, and neutral sentiment expressions co-occurring within individual articles. This highlights the nuanced and non-monolithic sentiment structure of financial news reporting. Notably, positive sentiment labels appear more frequently than negative or neutral ones in FinNewsBench. This skew can be partly attributed to the data source: the financial news articles are generated by an AI system used in the Nine Memos service, which is designed to summarize market trends, policy developments, and company-related outlooks for investment information delivery. As a result, the generated news tends to emphasize forward-looking perspectives, growth expectations, and interpretive context rather than purely adverse or incident-driven reporting, leading to a relatively higher prevalence of positive sentiment annotations.
| Category | Avg. per Article |
|---|---|
| Company | 2.20 |
| People | 1.96 |
| Theme | 3.53 |
| Keyword | 7.68 |
| Positive | 1.98 |
| Negative | 0.64 |
| Neutral | 0.67 |
Table 3. Average Annotation Density per Article
As shown in Table 4, each theme is associated with a large and diverse set of companies, and certain themes, most notably AI, link several dozen distinct firms. This reflects the fact that the benchmark draws on financial news from 2025, a period characterized by intensified research, commercialization, and investment activity in AI-related industries. As a result, AI functions as a broad market-level investment narrative that spans multiple sectors rather than a narrow, article-specific topic.
This expanded many-to-many relationship between themes and companies further underscores that themes in FinNewsBench represent market-recognized constructs, reinforcing the need to identify news-central entities within each thematic context during annotation rather than treating themes as isolated topical labels.
| Theme | AI | 바이오 (Bio) | 로봇 (Robotics) | 반도체 (Semiconductors) | 방산 (Defense) | 2차전지 (Secondary Batteries) | 우주항공 (Aerospace) | 조선 (Shipbuilding) |
|---|---|---|---|---|---|---|---|---|
| Unique Company Count | 45 | 23 | 16 | 15 | 12 | 11 | 11 | 9 |
Table 4. Number of Unique Companies per Theme
In the initial experimental phase of FinNewsBench, a traditional recall-based evaluation was employed. This approach considers a model output correct only when it exactly matches a reference label. While simple and interpretable, this criterion often underestimates model performance in tasks with high lexical variability. For example, semantically equivalent outputs may differ due to spacing, abbreviations, compound keyword formations, or domain-specific variations. Such differences can systematically penalize correct outputs, particularly in keyword and theme extraction.
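For concreteness, a minimal exact-match recall of the kind described above can be written as follows; the normalization (simple whitespace stripping) is an assumption and the actual implementation may differ.

```python
def exact_match_recall(reference: list[str], predicted: list[str]) -> float:
    """Fraction of reference labels reproduced verbatim (up to whitespace) by the model."""
    if not reference:
        return 1.0  # nothing to recover
    predicted_set = {item.strip() for item in predicted}
    hits = sum(1 for gold in reference if gold.strip() in predicted_set)
    return hits / len(reference)

# A semantically correct synonym still scores zero under this criterion:
# exact_match_recall(["보톡스"], ["보툴리눔톡신"])  # -> 0.0
```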
To address these limitations, FinNewsBench adopts an LLM-as-a-Judge framework, which evaluates model outputs based on semantic equivalence, contextual appropriateness, and information validity rather than strict string matching. We employ GPT-5-mini as the LLM-based judge due to its demonstrated capability in semantic comparison and contextual reasoning tasks, ensuring reliable and reproducible assessments. This framework enables the evaluation to recognize outputs that convey the correct meaning even when lexical forms differ from the reference.
Evaluation is conducted using a multi-aspect rubric whose category-specific criteria reflect the characteristics of each information type; the individual criteria are detailed below in the evaluation procedure.
Each category is scored on a continuous scale from 0 to 1, allowing partial correctness and incomplete extractions to be quantitatively reflected. This fine-grained scoring provides a nuanced measure of a model’s information selection capability and judgment quality, which are critical in financial news understanding tasks.
FinNewsBench evaluates the structured information extraction performance of large language models on Korean financial news by jointly considering both commercial API–based models and open-source models. A total of eight LLMs were evaluated, covering both proprietary and open models to reflect practical deployment scenarios. Detailed model specifications and individual performance comparisons are presented in the results section.
All models were evaluated using the same input news articles and an identical prompt. GPT-5 models were executed via the OpenAI API, while open-source models were deployed locally using a vLLM-based OpenAI-compatible inference server. This setup ensured a unified request–response interface across models and minimized experimental variance arising from differences in inference environments.
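The sketch below illustrates this unified request path; the local port, placeholder API key, and function name are assumptions rather than the exact configuration used in our experiments.

```python
from openai import OpenAI

# Proprietary models are queried through the hosted OpenAI endpoint, while
# open-source models are served locally by vLLM behind the same
# OpenAI-compatible interface. The local URL and dummy key are placeholders.
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
local_client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def extract(client: OpenAI, model: str, prompt: str, article: str) -> str:
    """Send one article through the shared request-response interface."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": article},
        ],
    )
    return response.choices[0].message.content
```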
All models were required to generate outputs in a unified structured format. Model outputs were standardized as a JSON-formatted NewsExtraction object consisting of five categories: companies, people, themes, keywords, and central-company sentiment. For each extracted item, models were instructed to provide the entity name, a relevance score ranging from 0 to 1, and a supporting sentence directly extracted from the original news article.
| Category | Extracted Item | Relevance Score | Supporting Evidence (Source Sentence) |
|---|---|---|---|
| Company | HMM | 0.95 | ...HMM의 1만6000톤급 대형 컨테이너선에서 사고가 발생, 선원 1명이 현장에서 숨졌다. |
| Company | KCC | 0.60 | 이날 작업은 HMM 선체에 도료 작업을 한 KCC가 하부 세척작업 전문업체에 도급을 줘서 진행했다. |
| People | 박성용 | 0.25 | 박성용 전국해상선원노동조합연맹 위원장은 “...사고가 반복되는 원인이 될 수 있다”고 강조했다. |
| Theme | 조선 (Shipbuilding) | 0.80 | 작업자들의 안전을 보장할 수 있는 방식으로 선박을 설계하지 않으면 비슷한 유형의 사고가 또 발생할 수 있다. |
| Keyword | 홋줄 사고 (mooring-line accident) | 0.90 | 사고는 선박을 부두에 정박할 때 사용하는 홋줄을 감아 들이는 과정에서 발생했다. |
| Sentiment | 부정 (Negative, HMM) | 0.90 | ...대형 컨테이너선에서 사고가 발생, 선원 1명이 현장에서 숨졌다. |
Table 5. Example of Structured Information Extraction from a News Article
This schema-based output design reduces variability due to inconsistent output formats and enables stable category-wise comparison in the subsequent LLM-as-a-Judge evaluation stage. Moreover, by requiring both relevance scores and evidence sentences, the benchmark evaluates not only the presence of extracted information but also the appropriateness of information selection and the ability to ground outputs in the source text.
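One way to enforce such a schema is with typed data models. The sketch below is a simplified assumption that mirrors the five categories and per-item fields described above; the class and field names are illustrative, not the exact schema used.

```python
from typing import Literal
from pydantic import BaseModel, Field

class ExtractedItem(BaseModel):
    name: str                                 # entity, theme, or keyword surface form
    relevance: float = Field(ge=0.0, le=1.0)  # model-assigned importance in [0, 1]
    evidence: str                             # supporting sentence copied from the article

class SentimentItem(BaseModel):
    company: str
    polarity: Literal["긍정", "부정", "중립"]   # positive / negative / neutral
    impact: float = Field(ge=0.0, le=1.0)
    evidence: str

class NewsExtraction(BaseModel):
    companies: list[ExtractedItem]
    people: list[ExtractedItem]
    themes: list[ExtractedItem]
    keywords: list[ExtractedItem]
    sentiment: list[SentimentItem]

# NewsExtraction.model_validate_json(raw_output) rejects malformed responses,
# which keeps the category-wise comparison stable across models.
```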
A single prompt was consistently applied to all models. The prompt explicitly defines each extraction category and restricts outputs to information explicitly mentioned in the news text. It further requires models to assign relevance scores for each extracted item and to provide the corresponding source sentence from the original article. Finally, models are instructed to determine the sentiment polarity (positive, negative, or neutral) and impact score for the central company.
To discourage the extraction of opinion-driven content, the prompt explicitly specifies that entities and keywords associated with securities firms, analysts, or investment opinions must always receive a relevance score of zero. This constraint helps distinguish informational content from investment commentary and reflects information selection criteria commonly required in real-world financial services.
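The full Korean prompt is not reproduced here; the condensed English paraphrase below illustrates its structure and the zero-relevance rule, and is an approximation rather than the exact wording.

```python
EXTRACTION_PROMPT = """\
You are an information extraction system for Korean financial news.
From the article, extract: companies, people, themes, keywords, and the
sentiment (positive / negative / neutral) and impact score for the central company.

Rules:
- Include only information explicitly mentioned in the article.
- For every extracted item, return its name, a relevance score in [0, 1], and the
  exact source sentence from the article that supports it.
- Entities and keywords associated with securities firms, analysts, or investment
  opinions must always receive a relevance score of 0.
- Respond with a single JSON object following the NewsExtraction schema.
"""
```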
To apply the LLM-as-a-Judge framework introduced in Section 4, we employed a multi-aspect rubric to assess the quality of extracted information from each model. Specifically, the evaluation considered:
1. Entity Accuracy: Evaluators assessed whether the extracted company and people names accurately reflected the entities mentioned in the news, focusing on semantic correctness rather than exact string matching.
2. Theme and Keyword Appropriateness: Extracted themes and keywords were scored based on their relevance to the news content. Items not present in the reference but still appearing in the news were not penalized.
3. Sentiment Correctness: For each central company, the evaluator compared the sentiment indicated in the model output with that expressed in the corresponding news sentences. Opposite polarities resulted in score reductions.
4. Grounding Quality: Evaluators checked whether the model provided evidence sentences from the original article to justify its extractions and the assigned relevance scores.
Each aspect was scored on a continuous scale from 0 to 1, allowing fine-grained comparison between models. In addition to numerical scoring, the evaluator was instructed to provide concise reasoning for each score, ensuring transparency and interpretability.
This evaluation procedure was applied consistently across all news articles and models. By combining semantic matching, contextual relevance, sentiment verification, and evidence grounding, FinNewsBench ensures a robust and reproducible assessment of structured information extraction in Korean financial news.
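A simplified sketch of how such a rubric-based judge call could be issued and parsed is shown below; the rubric wording, aspect keys, and JSON response handling are assumptions that condense the criteria above, not the exact implementation.

```python
import json
from openai import OpenAI

JUDGE_RUBRIC = """\
Score the model extraction against the reference labels and the article on four
aspects, each in [0, 1]: entity_accuracy, theme_keyword_appropriateness,
sentiment_correctness, grounding_quality. Judge semantic equivalence rather than
exact strings, and do not penalize items that are absent from the reference but
explicitly supported by the article. Return JSON:
{"scores": {...}, "reasoning": "..."}.
"""

def judge(client: OpenAI, article: str, reference: dict, prediction: dict,
          model: str = "gpt-5-mini") -> dict:
    """Ask the judge model for rubric scores plus a short written justification."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": json.dumps(
                {"article": article, "reference": reference, "prediction": prediction},
                ensure_ascii=False)},
        ],
        response_format={"type": "json_object"},  # assumes the endpoint supports JSON mode
    )
    return json.loads(response.choices[0].message.content)
```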
This subsection examines how evaluation outcomes differ between traditional recall-based metrics and the proposed LLM-as-a-Judge framework, and why such differences arise. Table A compares recall and judge scores computed from identical model outputs across extraction categories.
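The per-category gaps reported in Table A are simply the judge score minus the recall score; a trivial sketch with illustrative inputs:

```python
def score_gaps(recall_scores: dict[str, float],
               judge_scores: dict[str, float]) -> dict[str, float]:
    """Delta = LLM-as-a-Judge score minus recall score, per category."""
    return {category: round(judge_scores[category] - recall_scores[category], 2)
            for category in recall_scores}

# For openai/gpt-5.2 theme extraction in Table A: 0.84 - 0.67 = +0.17.
print(score_gaps({"theme": 0.67}, {"theme": 0.84}))  # {'theme': 0.17}
```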
As shown in Figure 3, the impact of LLM-as-a-Judge evaluation is strongly task-dependent. For open-ended extraction tasks, namely keyword and theme extraction, all evaluated models exhibit a consistently positive score gap. This indicates that recall-based evaluation systematically underestimates model performance in these categories due to its reliance on strict lexical matching. In financial news, the same concept is frequently expressed through synonyms, abbreviations, domain-specific variants, or hierarchical refinements. Outputs that are semantically correct but lexically different from the reference—such as synonym substitutions or contextually more specific expressions—are therefore penalized under recall-based metrics.
In contrast, sentiment analysis, which operates over a closed label set, exhibits a bipolar gap pattern. High-end models show little performance gain under judge-based evaluation and, in some cases, a slight decrease. This reflects a calibration effect: recall-based metrics overestimate performance by ignoring hallucinated or over-generated sentiment labels, as they only verify the presence of correct labels rather than their alignment with supporting evidence. The LLM-as-a-Judge framework corrects this bias by explicitly validating whether each predicted sentiment label is grounded in an appropriate source sentence.
Conversely, mid-tier models benefit substantially from judge-based evaluation. In these cases, models often predict the correct overall polarity but receive excessive penalties under recall-based metrics due to missing auxiliary labels or minor format mismatches. The LLM-as-a-Judge framework rescues such outputs by recognizing semantic correctness and contextual alignment that are invisible to surface-form matching metrics.
Quantitative evidence for this contrast between high-end and mid-tier models is provided in Table A in the Appendix, which reports recall-based and LLM-as-a-Judge scores and their corresponding gaps across all categories.
Figure 3. Comparison of Recall-based and LLM-as-a-Judge Evaluation Scores across Categories
Together, these results show that LLM-as-a-Judge does not uniformly inflate evaluation scores. Instead, it adaptively corrects metric bias depending on both task characteristics and model behavior, functioning as a dual mechanism for calibration and recovery. Representative cases illustrating semantic rigidity and set-based overestimation are presented in Table 6.
| Category | Reference | Model Output | Recall (Exact) | LLM-as-a-Judge |
|---|---|---|---|---|
| Theme | 보톡스 (Botox) | 보툴리눔톡신 (botulinum toxin) | Incorrect | Correct |
| Theme | AI | 생성형 AI (generative AI) | Incorrect | Correct |
| Keyword | 한미 관세 협상 (Korea-US tariff negotiations) | 관세 협상 타결 (tariff deal reached) | Incorrect | Correct |
| Keyword | 업무 협약 (business agreement) | 업무협약 체결 (business agreement signed) | Incorrect | Correct |
| Sentiment | Context: 모화공장에도 동일한 서머타임제를 적용할 계획이라고 밝혔다. Gold: 긍정 (Positive) | 중립 (Neutral) | Correct* | Incorrect |
| Sentiment | Context: 라이프스타일 분과는 ... 대표이사 등으로 구성됐습니다. Gold: 중립 (Neutral) | 긍정 (Positive) | Correct* | Incorrect |
Table 6. Representative cases of evaluation refinement by LLM-as-a-Judge: semantic recovery and logic calibration. In sentiment tasks, traditional recall often fails to penalize incorrect logic when the output falls within the predefined label set (rows marked Correct*), whereas LLM-as-a-Judge identifies such contextual inconsistencies.
Overall performance on FinNewsBench shows a clear but compact ranking structure among the top-performing models, as illustrated in Figure 4. OpenAI’s gpt-5.2 forms a distinct leading tier with an overall score of 0.90, while several models—including openai/gpt-oss-20b, Qwen/Qwen3-14B, and openai/gpt-5-mini—cluster tightly around 0.83, indicating broadly comparable aggregate performance. Models such as CohereLabs/aya-expanse-32b (0.82) and HyperCLOVAX-SEED-Think-32B (0.80) occupy the mid-range of this group, with Qwen/Qwen3-8B and google/gemma-3-27b-it (both 0.78) defining the lower bound of the top-8 set.
Despite these rank differences, the overall score range remains relatively narrow, suggesting that aggregate metrics alone obscure meaningful variation in model behavior. This motivates a closer examination of performance composition at the category level, where differences become more pronounced.
Figure 4. Overall performance scores of the top 8 models on FinNewsBench
A category-wise breakdown in Table 7 reveals that entity-centric extraction tasks, such as Company and People, are handled robustly across all evaluated models. Scores in these categories are mostly at or above 0.90, with Qwen/Qwen3-8B's people extraction (0.798) as the only score below 0.89, and otherwise show minimal dispersion, indicating that extracting well-defined named entities from financial news has largely stabilized and contributes little to differentiation among high-performing systems.
In contrast, open-ended extraction tasks, particularly Theme and Keyword, account for most of the observed performance divergence. Theme extraction scores vary widely—from 0.529 to 0.840—even among models with similar overall performance, and keyword extraction exhibits a comparable spread. These tasks require abstract concept identification and normalization beyond surface-form matching, making them inherently more challenging and more sensitive to model-specific extraction strategies.
Sentiment analysis shows moderate variability across models but does not dominate overall score differences to the same extent as theme and keyword extraction. Taken together, this breakdown indicates that aggregate FinNewsBench rankings are primarily driven by performance on semantically open information extraction tasks, rather than by closed-set entity recognition.
| Model | Overall | Company | People | Theme | Keyword | Sentiment |
|---|---|---|---|---|---|---|
| openai/gpt-5.2 | 0.900 | 0.981 | 0.964 | 0.840 | 0.868 | 0.847 |
| openai/gpt-oss-20b | 0.834 | 0.963 | 0.932 | 0.677 | 0.802 | 0.798 |
| Qwen/Qwen3-14B | 0.829 | 0.899 | 0.950 | 0.703 | 0.755 | 0.839 |
| openai/gpt-5-mini | 0.826 | 0.975 | 0.926 | 0.760 | 0.794 | 0.676 |
| CohereLabs/aya-expanse-32b | 0.820 | 0.926 | 0.895 | 0.726 | 0.741 | 0.811 |
| naver-hyperclovax/HyperCLOVAX-SEED-Think-32B | 0.799 | 0.951 | 0.904 | 0.675 | 0.640 | 0.825 |
| Qwen/Qwen3-8B | 0.781 | 0.890 | 0.798 | 0.725 | 0.735 | 0.759 |
| google/gemma-3-27b-it | 0.778 | 0.984 | 0.904 | 0.529 | 0.675 | 0.796 |
Table 7. Model Performance Comparison Across Information Extraction Categories
Further insight into these differences emerges when examining extraction failures summarized in Table 8. Because all evaluated articles were pre-filtered to ensure metadata availability, missing outputs directly reflect cases where a model failed to extract a required field. Under this criterion, google/gemma-3-27b-it exhibits a highly asymmetric failure pattern across categories.
Although gemma-3-27b-it achieves the highest company extraction score among all evaluated models (0.984), surpassing even openai/gpt-5.2, it shows a pronounced tendency to fail in Theme extraction, with 59 missing cases corresponding to 23% of applicable articles. This imbalance indicates that the model is highly reliable when extracting concrete, well-defined entities, yet frequently omits abstract or high-level thematic labels. As a result, its strong performance in company extraction does not translate into a higher overall rank, and the model appears lower in the benchmark primarily due to systematic theme extraction failures rather than broad extraction weakness.
| Model | Company | People | Theme | Keyword | Sentiment |
|---|---|---|---|---|---|
| openai/gpt-5.2 | 0 | 3 (1%) | 0 | 0 | 0 |
| openai/gpt-oss-20b | 1 (0%) | 13 (5%) | 6 (2%) | 0 | 1 (0%) |
| openai/gpt-5-mini | 1 (0%) | 14 (5%) | 0 | 0 | 4 (2%) |
| Qwen/Qwen3-14B | 0 | 3 (1%) | 23 (9%) | 0 | 0 |
| CohereLabs/aya-expanse-32b | 0 | 15 (6%) | 3 (1%) | 0 | 0 |
| naver-hyperclovax/HyperCLOVAX-SEED-Think-32B | 0 | 12 (5%) | 15 (6%) | 2 (1%) | 0 |
| google/gemma-3-27b-it | 0 | 22 (8%) | 59 (23%) | 0 | 0 |
| Qwen/Qwen3-8B | 0 | 36 (14%) | 1 (0%) | 1 (0%) | 0 |
Table 8. Number of Missing Extractions by Category across Models. Percentages indicate the proportion of articles for which the corresponding metadata field was not extracted.
Although FinNewsBench provides a systematic framework for evaluating core information extraction from Korean financial news, it has several limitations. First, due to the AI-generated nature of the data source, certain complex narrative structures or highly unstructured expressions commonly found in real-world journalism may be underrepresented. Second, sentiment analysis is conducted only for a single central company per article, which limits the benchmark’s ability to capture multi-company sentiment dynamics that frequently arise in financial news.
Another potential concern relates to the use of large language models in both the annotation pipeline and the evaluation stage, which may raise questions about self-referential bias in an LLM-as-a-Judge setting. In FinNewsBench, GPT-5–family models are involved in reference refinement and judge-based evaluation; however, this design does not imply direct preference toward specific generators. The judge model (GPT-5-mini) is distinct from the primary evaluated model (GPT-5.2) and operates under a fixed evaluation rubric that emphasizes semantic alignment, contextual appropriateness, and evidence grounding rather than stylistic similarity. Moreover, the evaluation framework explicitly penalizes hallucinated or weakly grounded outputs, including those produced by high-capacity models, as reflected in the negative score gaps observed in sentiment analysis for top-tier systems. In addition, open-source models exhibit comparable patterns of score adjustment under LLM-as-a-Judge evaluation, indicating that the framework consistently corrects metric bias across model families rather than amplifying the performance of any single provider. Taken together, these observations suggest that the LLM-as-a-Judge framework in FinNewsBench functions as a calibration mechanism for semantic and contextual correctness, rather than as a self-reinforcing evaluation loop.
Future research can further address the remaining limitations of FinNewsBench through several directions.
First, multi-aspect and inter-task analysis could be explored to examine dependencies among extracted information elements, such as relationships between entities, themes, and sentiment, or the evolution of financial issues over time.
Second, enhanced evaluation metrics, including ranking-based evaluation, relevance-weighted scoring, or partial-match measures, may further improve the ability to capture nuanced extraction quality beyond current rubric-based scoring.
These directions aim to extend the scope of FinNewsBench and advance the evaluation of document-level financial news understanding, ultimately improving the practical applicability of structured information extraction models in real-world financial analysis settings.
This study introduces FinNewsBench, a benchmark for evaluating Korean financial news understanding from a document-level, multi-aspect information extraction perspective. By jointly assessing core elements—companies, people, themes, keywords, and central-company sentiment—FinNewsBench enables systematic and reproducible comparison of large language models’ information recognition capabilities.
The benchmark contributes a practical evaluation framework that reduces experimental cost while maintaining assessment rigor. It provides a foundation for future research on financial news information extraction, model comparison, and multi-aspect understanding. Future extensions may include multi-company sentiment evaluation, inter-task reasoning, and the adoption of more sophisticated scoring metrics to further enhance the benchmark’s coverage and analytical depth.
| Model | Total | Company | People | Theme | Keyword | Sentiment |
|---|---|---|---|---|---|---|
| openai/gpt-5.2 | 0.77 / 0.90 (+0.13) | 0.96 / 0.98 (+0.02) | 0.92 / 0.96 (+0.05) | 0.67 / 0.84 (+0.17) | 0.39 / 0.87 (+0.47) | 0.92 / 0.85 (-0.07) |
| openai/gpt-oss-20b | 0.61 / 0.83 (+0.23) | 0.88 / 0.96 (+0.09) | 0.78 / 0.93 (+0.15) | 0.40 / 0.68 (+0.28) | 0.36 / 0.80 (+0.44) | 0.62 / 0.80 (+0.18) |
| Qwen/Qwen3-14B | 0.60 / 0.83 (+0.22) | 0.85 / 0.90 (+0.05) | 0.94 / 0.95 (+0.01) | 0.31 / 0.70 (+0.40) | 0.23 / 0.75 (+0.53) | 0.70 / 0.84 (+0.14) |
| openai/gpt-5-mini | 0.62 / 0.83 (+0.20) | 0.86 / 0.97 (+0.12) | 0.82 / 0.93 (+0.11) | 0.44 / 0.76 (+0.32) | 0.21 / 0.79 (+0.59) | 0.80 / 0.68 (-0.12) |
| CohereLabs/aya-expanse-32b | 0.62 / 0.82 (+0.20) | 0.87 / 0.93 (+0.06) | 0.87 / 0.89 (+0.03) | 0.29 / 0.73 (+0.44) | 0.17 / 0.74 (+0.57) | 0.91 / 0.81 (-0.10) |
| naver-hyperclovax/HyperCLOVAX-SEED-Think-32B | 0.52 / 0.80 (+0.28) | 0.83 / 0.95 (+0.13) | 0.78 / 0.90 (+0.12) | 0.28 / 0.67 (+0.40) | 0.13 / 0.64 (+0.51) | 0.60 / 0.83 (+0.23) |
| Qwen/Qwen3-8B | 0.57 / 0.78 (+0.21) | 0.82 / 0.89 (+0.07) | 0.79 / 0.80 (+0.01) | 0.34 / 0.73 (+0.38) | 0.14 / 0.73 (+0.60) | 0.77 / 0.76 (-0.01) |
| google/gemma-3-27b-it | 0.66 / 0.78 (+0.12) | 0.96 / 0.98 (+0.02) | 0.90 / 0.90 (+0.01) | 0.35 / 0.53 (+0.18) | 0.17 / 0.68 (+0.50) | 0.90 / 0.80 (-0.11) |
Table A. Comparison of Recall-based and LLM-as-a-Judge Evaluation Scores across Categories
Each cell reports the average score under Recall and LLM-as-a-Judge evaluation (Recall / LLM Judge), with the parenthesized value indicating the Δ (LLM-as-a-Judge score minus Recall). Positive Δ values indicate instances where the LLM Judge recognized correct outputs that were not captured by simple Recall.