
Korean Semantic Role Labeling with Machine Reading Comprehension

Korea Computer Congress 2024 (KCC 2024)
Kangmin Lee, Donggeon Seo, Eunrang Kwon, Junmo Song, Jeonghan Kang, Taeuk Kim

One-Line Summary

A framework that reformulates Korean semantic role labeling (SRL) as a machine reading comprehension (MRC) task, replacing abstract role labels with natural language questions tailored to Korean grammar and improving argument identification, especially for omitted and structurally complex arguments.

Background & Motivation

Semantic role labeling (SRL) identifies "who did what to whom" in a sentence by assigning predicate-argument structures. Traditional SRL systems use sequence labeling or span extraction with role-specific tags (e.g., ARG0, ARG1, ARGM-LOC). However, Korean SRL poses unique challenges that make standard approaches unreliable.

Why Korean SRL Is Especially Difficult:

  • Free word order: Korean allows highly flexible constituent ordering (SOV, OSV, etc.), so positional cues that work in English are ineffective.
  • Frequent pro-drop: Subjects and objects are routinely omitted when recoverable from context, meaning arguments may not appear as explicit spans in the sentence.
  • Complex postpositional system: Korean case markers (조사) encode grammatical relations, but their many forms (e.g., subject markers -이/-가, topic markers -은/-는) introduce ambiguity that sequence labeling struggles with.
  • Label opacity: Abstract labels like ARG0 and ARG1 carry no inherent semantic meaning, forcing models to learn role semantics purely from data without linguistic guidance.

Recent English-language research has shown that reformulating structured prediction tasks as reading comprehension -- where each label is expressed as a natural language question -- provides richer semantic signals and improves both performance and generalization. This work investigates whether this MRC-based paradigm is particularly well-suited for Korean, where the linguistic challenges of SRL are amplified.

Traditional SRL vs. MRC Reformulation

The key conceptual shift lies in how semantic roles are communicated to the model. In traditional SRL, roles are encoded as abstract labels that the model must learn to assign purely from data. In the MRC formulation, each role is expressed as a natural language question that describes the role's meaning:

| Semantic Role | Traditional Label | MRC Question (Korean) | English Gloss |
|---|---|---|---|
| Agent (행위자) | ARG0 | "[서술어]하는 행위를 수행한 주체는 누구인가?" | "Who performed the act of [predicate]?" |
| Patient (대상) | ARG1 | "[서술어]의 대상이 되는 것은 무엇인가?" | "What is affected by [predicate]?" |
| Location (장소) | ARGM-LOC | "[서술어]가 일어난 장소는 어디인가?" | "Where did [predicate] take place?" |
| Time (시간) | ARGM-TMP | "[서술어]가 일어난 시간은 언제인가?" | "When did [predicate] take place?" |

This transformation embeds linguistic knowledge directly into the input, allowing the model to leverage its pretrained language understanding rather than relying on opaque label assignments.
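
As a rough illustration, the mapping above can be expressed as a small template table. The template strings follow the paper's examples, but the dictionary and function names below are illustrative, not from the paper; filling the [서술어] slot requires Korean conjugation, sketched in the method section below.

```python
# A minimal sketch of the role-to-question mapping above. The [서술어]
# slot is written as {pred} and must be filled with an appropriately
# conjugated form of the predicate (see the conversion sketch below).
ROLE_QUESTION_TEMPLATES = {
    "ARG0":     "{pred}하는 행위를 수행한 주체는 누구인가?",  # Agent: who performed the act?
    "ARG1":     "{pred}의 대상이 되는 것은 무엇인가?",        # Patient: what is affected?
    "ARGM-LOC": "{pred}가 일어난 장소는 어디인가?",           # Location: where did it happen?
    "ARGM-TMP": "{pred}가 일어난 시간은 언제인가?",           # Time: when did it happen?
}

def question_for(role: str, predicate: str) -> str:
    """Fill the role-specific template with the target predicate."""
    return ROLE_QUESTION_TEMPLATES[role].format(pred=predicate)
```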

Proposed Method: MRC-Based Korean SRL

The core idea is to transform each semantic role assignment into a question-answering problem: given a sentence and a predicate, the model receives a role-specific question and extracts the answer span corresponding to the argument filler.

1. Role-to-Question Conversion
Each semantic role (e.g., Agent, Patient, Location, Time) is mapped to a natural language question. For example, the agent role for predicate "먹다" (to eat) becomes "먹는 행위를 수행한 주체는 누구인가?" (Who is the entity that performed the eating?). Unlike English templates, these questions are designed to incorporate Korean-specific grammatical cues such as case markers and honorific forms.
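
A simplified sketch of this conversion for the agent role follows, assuming dictionary-form verbs ending in 다; the naive stemming rule is an illustration only, and a real system would use a Korean morphological analyzer to handle irregular conjugation (활용).

```python
# Illustrative only: naive stemming covers regular verbs like '먹다' but
# not irregular conjugation; a morphological analyzer is needed in practice.
def agent_question(predicate: str) -> str:
    """Turn a dictionary-form predicate (e.g., '먹다') into the agent question."""
    if not predicate.endswith("다"):
        raise ValueError("expected a dictionary-form predicate ending in '다'")
    stem = predicate[:-1]        # '먹다' -> '먹'
    adnominal = stem + "는"      # naive present adnominal form: '먹는'
    return f"{adnominal} 행위를 수행한 주체는 누구인가?"

print(agent_question("먹다"))
# -> 먹는 행위를 수행한 주체는 누구인가? ("Who is the entity that performed the eating?")
```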
2. Korean-Specific Question Template Design
Question templates are crafted to reflect Korean grammatical structures: they use appropriate particles (조사), verb conjugation patterns, and natural Korean phrasing. Each role type has a dedicated template that encodes the semantic meaning of that role in a way that is linguistically natural for Korean speakers and informative for Korean pretrained models. Crucially, the templates incorporate Korean-specific features like the distinction between subject markers (-이/-가) and topic markers (-은/-는), which carry different information-structural implications that aid in argument identification.
3. Pretrained Korean LM as MRC Backbone
The question-sentence pairs are fed to a Korean-specific pretrained language model fine-tuned for extractive question answering. The model predicts start and end positions of the answer span within the input sentence, effectively identifying the argument for each role. Korean pretrained models (e.g., KoBERT, KoELECTRA) are used to maximize language-specific understanding. The choice of backbone model is important: Korean-specific models trained on Korean corpora provide better morphological and syntactic understanding than multilingual models.
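
A minimal sketch of this step with Hugging Face transformers is shown below. The paper names KoBERT and KoELECTRA as candidate backbones; the specific KorQuAD-finetuned checkpoint used here is an assumption for illustration, not the paper's reported setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Assumed checkpoint: a KoELECTRA model fine-tuned for Korean extractive QA.
MODEL = "monologg/koelectra-base-v3-finetuned-korquad"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)

def extract_argument(question: str, sentence: str) -> str:
    """Extract the answer span for one role-specific question."""
    inputs = tokenizer(question, sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start = outputs.start_logits.argmax()
    end = outputs.end_logits.argmax()
    span_ids = inputs["input_ids"][0][start : end + 1]
    return tokenizer.decode(span_ids, skip_special_tokens=True)
```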
4. Handling Omitted (Pro-Drop) Arguments
For Korean's frequent pro-drop cases where arguments are implicit, the system employs a special null-answer strategy: if no valid span can be extracted (confidence below a threshold), the model predicts "no explicit argument," distinguishing between truly absent arguments and those the model failed to find. This prevents hallucinating arguments that are not present in the surface text.
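
The paper does not spell out the exact thresholding rule; the sketch below shows one common way such a null-answer decision is implemented in extractive QA (SQuAD 2.0 style), where the no-answer score is read off the [CLS] position. The threshold value and function name are illustrative assumptions.

```python
# A SQuAD-2.0-style null-answer decision (an assumption, not the paper's
# exact rule): compare the best span score to the score of predicting
# position 0 ([CLS]), which stands for "no explicit argument".
def best_span_or_null(start_logits, end_logits, null_threshold: float = 0.0):
    """Return (start, end) for the best span, or None if the argument is pro-dropped."""
    null_score = start_logits[0] + end_logits[0]  # no-answer score

    best_score, best_span = float("-inf"), None
    for s in range(1, len(start_logits)):         # naive O(n^2) span search
        for e in range(s, len(end_logits)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)

    # Below the margin threshold, report absence instead of hallucinating a span.
    if best_score - null_score < null_threshold:
        return None
    return best_span
```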

Why Korean-Specific Templates Matter:

Directly translating English question templates to Korean produces unnatural phrasing that fails to leverage Korean pretrained models effectively. Korean-specific design is essential because:

  • Korean predicates undergo extensive conjugation (활용) that changes form based on tense, aspect, and mood -- templates must accommodate these variations.
  • Korean particles (조사) carry rich grammatical information that can disambiguate roles; well-designed questions encode this information naturally.
  • Korean's head-final structure means that key role-indicating elements appear at the end of phrases, requiring different question structures than English.

Experimental Results

The MRC-based Korean SRL system is evaluated against traditional sequence labeling baselines on Korean SRL benchmarks, measuring argument identification (AI) and argument classification (AC) performance.
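
For concreteness, the sketch below shows how the two metrics differ: AI credits a predicted span that matches gold regardless of its role label, while AC also requires the role to match. The (predicate, role, span) tuple format is an assumption for illustration, not the paper's exact evaluation protocol.

```python
def f1(gold: set, pred: set) -> float:
    """Set-based F1 between gold and predicted tuples."""
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def ai_ac_f1(gold_args: set, pred_args: set):
    """Each element is an assumed (predicate, role, (start, end)) tuple."""
    ai_gold = {(p, span) for p, _, span in gold_args}   # AI ignores the role label
    ai_pred = {(p, span) for p, _, span in pred_args}
    return f1(ai_gold, ai_pred), f1(gold_args, pred_args)  # (AI F1, AC F1)
```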

Key Comparison: MRC vs. Sequence Labeling

| Approach | Strength | Weakness |
|---|---|---|
| Sequence Labeling (BIO tagging) | Simple, fast inference | Struggles with free word order; no role semantics |
| Span Extraction | Handles non-contiguous spans | Still relies on opaque labels |
| MRC-Based (Proposed) | Encodes role semantics via questions | Requires a per-role inference pass |

Performance by Sentence Complexity

| Sentence Type | Sequence Labeling | MRC-Based (Proposed) | Improvement Source |
|---|---|---|---|
| Simple (1-2 arguments, canonical order) | Strong | Comparable or slightly better | Role semantics have marginal impact |
| Multiple arguments (3+) | Moderate | Notably improved | Questions disambiguate overlapping roles |
| Non-canonical word order | Weak | Substantially improved | Questions are order-independent |
| Pro-drop (omitted arguments) | Frequent false positives | Reduced via null-answer strategy | Explicit absence modeling |

Why It Matters

This work makes three key contributions to Korean natural language understanding: it reformulates Korean SRL as an extractive MRC task, it designs question templates that encode Korean-specific grammatical cues such as case markers and honorific forms, and it introduces a null-answer strategy for Korean's frequent pro-drop arguments.

Broader Implications: The success of this approach suggests that for morphologically rich, free-word-order languages, the MRC reformulation paradigm may be systematically more advantageous than it is for English. Where English SRL can rely on relatively fixed positional patterns, Korean SRL must depend on deeper semantic understanding -- exactly what MRC-style questions provide. This insight extends beyond Korean to other agglutinative languages such as Turkish, Japanese, and Finnish, opening a promising research direction for language-specific NLU.
