
Learning Strategies to Improve Table Understanding and Explanation in Korean

The 36th Annual Conference on Human and Cognitive Language Technology (HCLT 2024) Best Paper Award
Changhyeon Kim, Seunghee Kim, Taeuk Kim

One-Line Summary

By comparing four table serialization formats and designing four synthetic auxiliary subtasks, this work shows that Pandas DataFrame serialization yields a 19.6% improvement in Korean table cell description generation while reducing token costs.

Background & Motivation

Tables are one of the most common ways to organize structured data, yet automatically generating natural language descriptions of specific table cells remains a difficult task for language models — especially in Korean, where resources and benchmarks are scarce. Given a table and a highlighted target cell, the goal is to produce a fluent Korean sentence explaining what the cell value means in the context of its row, column, and the broader table topic.

Key Challenges:

  • Serialization ambiguity: Tables must be linearized into text before being fed to a language model, but different serialization formats (Markdown, HTML, DataFrame, JSON) encode structural cues very differently, and no prior work has systematically compared them for Korean.
  • Structural understanding gaps: Models often fail to correctly associate a cell with its column header or row context, leading to factually incorrect or nonsensical descriptions.
  • Token efficiency: Verbose formats like HTML consume many tokens without proportional gains in structural clarity, increasing computational costs during both training and inference.
  • Lack of Korean table data: Most table understanding research targets English; Korean-specific training data and evaluation benchmarks are extremely limited.

This paper addresses all of these challenges by conducting a controlled comparison of serialization formats and proposing targeted auxiliary training tasks with synthetic data to build table comprehension skills from the ground up.

Proposed Method

The approach has two main components: (1) identifying the optimal table-to-text serialization format, and (2) designing auxiliary subtasks with synthetic datasets that incrementally teach the model to understand table structure before tackling the full description generation task.

Overall Pipeline: The training procedure follows a two-stage curriculum. First, the model is trained on synthetic auxiliary subtasks to build foundational table comprehension skills (structural grounding). Then, the pre-trained model is fine-tuned on the target task of table cell description generation. This curriculum approach ensures that the model acquires structural understanding before attempting the more complex generation objective.
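The two-stage ordering can be sketched schematically. The stub below only illustrates the curriculum (stage 1 on auxiliary subtask data, stage 2 on the target task) with a placeholder "model" and a stand-in training step; it is an assumption for illustration, not the paper's actual trainer or model.

```python
# Schematic sketch of the two-stage curriculum (placeholder model and
# training step; the real work fine-tunes a language model).

def train(model, dataset, steps):
    for _, example in zip(range(steps), dataset):
        model["updates"] += 1               # stand-in for a gradient update
        model["seen_tasks"].add(example["task"])
    return model

# Stage 1 data: the four synthetic auxiliary subtasks.
aux_data = [{"task": t} for t in ("HPOS", "HROW", "HCOL", "CRCR")] * 10
# Stage 2 data: the target task.
target_data = [{"task": "cell_description"}] * 10

model = {"updates": 0, "seen_tasks": set()}
model = train(model, aux_data, steps=40)     # Stage 1: structural grounding
model = train(model, target_data, steps=10)  # Stage 2: target fine-tuning
```

The point of the ordering is that the model sees all four structural subtasks before any description-generation example.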

Table Serialization Formats

Four formats are compared for converting a table into a text string that a language model can process:

  • Markdown: pipe-delimited columns with a header separator row, e.g. `| Col1 | Col2 |\n|---|---|\n| val | val |`. Token efficiency: moderate.
  • HTML: full <table>/<tr>/<td> markup preserving structure, e.g. `<table><tr><td>val</td>...</tr></table>`. Token efficiency: low (verbose tags).
  • Pandas DataFrame: index-based row/column representation with aligned spacing, e.g. `Col1  Col2\n0  val   val`. Token efficiency: high.
  • JSON: nested key-value pairs per row, e.g. `[{"Col1":"val","Col2":"val"}]`. Token efficiency: moderate.

The key insight behind the DataFrame format is that its whitespace-aligned layout naturally preserves column alignment, making it easier for models to associate cell values with their corresponding headers without requiring explicit structural markup.
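As a rough illustration of the trade-off, the snippet below serializes a toy table in all four formats. The table contents are invented for the example, and character count is used as a crude proxy for token cost; both are assumptions, not the paper's measurement setup.

```python
import json
import pandas as pd

# Toy table used only for this illustration.
rows = [{"City": "Seoul", "Population": 9.4},
        {"City": "Busan", "Population": 3.3}]
df = pd.DataFrame(rows)

# Pandas DataFrame format: whitespace-aligned, index-based.
df_text = df.to_string()

# HTML format: full <table>/<tr>/<td> markup.
html_text = df.to_html(index=False)

# JSON format: one key-value object per row.
json_text = json.dumps(rows, ensure_ascii=False)

# Markdown format: pipe-delimited with a header separator row.
header = "| " + " | ".join(df.columns) + " |"
sep = "|" + "---|" * len(df.columns)
body = ["| " + " | ".join(str(v) for v in r) + " |"
        for r in df.itertuples(index=False)]
md_text = "\n".join([header, sep, *body])

# Character length as a crude stand-in for token count:
# the verbose HTML tags dominate the actual cell content.
lengths = {"dataframe": len(df_text), "markdown": len(md_text),
           "json": len(json_text), "html": len(html_text)}
```

Even on this tiny table, the HTML serialization is by far the longest, while the DataFrame text spends almost all of its characters on the cell values themselves.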

Auxiliary Subtasks

Four subtasks are designed to build specific table comprehension skills, each supported by automatically generated synthetic datasets. Together, they form a progressive curriculum that moves from localized cell-level understanding to global table-level comprehension:

1. HPOS — Cell Position Recognition
   Given a table and a target cell value, the model must identify the exact row and column position of the cell. This teaches the model to map cell values to their structural coordinates within the table. Skill focus: cell-level localization, coordinate mapping.
2. HROW — Row Information Recognition
   Given a table and a specific row index, the model must list all cell values in that row. This subtask trains the model to correctly parse and extract horizontal (row-level) information from the serialized table. Skill focus: row-level parsing, horizontal traversal.
3. HCOL — Column Information Recognition
   Given a table and a specific column header, the model must extract all values in that column. This builds the model's ability to parse vertical (column-level) relationships, which is critical for associating cells with their column headers. Skill focus: column-level parsing, header-value association.
4. CRCR — Table Structure Rearrangement
   The model is asked to convert a table from one serialization format to another (e.g., Markdown to DataFrame). This forces a deep understanding of the table's full structure, as faithful format conversion requires correctly parsing every cell, row, and column relationship. Skill focus: holistic structural understanding, format-agnostic table representation.
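As a sketch of what a single CRCR instance might look like, the toy converter below parses a Markdown-serialized table and re-emits it in the Pandas DataFrame text format. The parsing logic and the example table are assumptions for illustration; the paper does not publish its conversion code.

```python
import pandas as pd

# Toy CRCR conversion: Markdown serialization -> DataFrame serialization.
# A model trained on CRCR would produce the target string directly; this
# function just shows the input/output pair such an instance encodes.
def markdown_to_dataframe_text(md: str) -> str:
    lines = [line.strip() for line in md.strip().splitlines()]
    split = lambda line: [c.strip() for c in line.strip("|").split("|")]
    header = split(lines[0])
    rows = [split(line) for line in lines[2:]]  # skip the |---| separator row
    return pd.DataFrame(rows, columns=header).to_string()

md = "| City | Population |\n|---|---|\n| Seoul | 9.4 |\n| Busan | 3.3 |"
df_text = markdown_to_dataframe_text(md)
```

A faithful conversion has to get every header, row index, and cell right, which is exactly why CRCR exercises holistic structural understanding.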

Synthetic Data Generation: All auxiliary subtask datasets are constructed automatically from existing tables — no manual annotation is required. For each table, multiple training instances are generated by varying target cells (HPOS), row indices (HROW), column headers (HCOL), and source/target format pairs (CRCR). This makes the framework easily scalable to new domains and languages.
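The generation scheme above can be sketched as follows. The prompt wording, field names, and toy table are illustrative assumptions, not the paper's exact templates, and CRCR pairs (which additionally need a serializer per format) are omitted for brevity.

```python
# Minimal sketch of automatic auxiliary-instance generation from one table.
# Every instance is derived mechanically; no manual annotation is needed.
def make_instances(header, rows):
    instances = []
    # HPOS: given a cell value, ask for its (row, column) coordinates.
    for r, row in enumerate(rows):
        for c, val in enumerate(row):
            instances.append({
                "task": "HPOS",
                "input": f"Where is the value '{val}'?",
                "output": f"row {r}, column '{header[c]}'",
            })
    # HROW: given a row index, list every cell value in that row.
    for r, row in enumerate(rows):
        instances.append({
            "task": "HROW",
            "input": f"List all values in row {r}.",
            "output": ", ".join(map(str, row)),
        })
    # HCOL: given a column header, list every value in that column.
    for c, col_name in enumerate(header):
        instances.append({
            "task": "HCOL",
            "input": f"List all values in column '{col_name}'.",
            "output": ", ".join(str(row[c]) for row in rows),
        })
    return instances

header = ["City", "Population"]
rows = [["Seoul", 9.4], ["Busan", 3.3]]
data = make_instances(header, rows)
```

Even this 2x2 table yields eight instances (four HPOS, two HROW, two HCOL), which is what makes the framework scale without annotation.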

Experimental Setup

The experiments are designed to isolate the contributions of serialization format choice and auxiliary subtask training on the downstream task of Korean table cell description generation.

  • Target Task: Korean table cell description generation — producing a fluent Korean sentence explaining a highlighted cell in context
  • Serialization Formats: Markdown (baseline), HTML, Pandas DataFrame, JSON
  • Auxiliary Subtasks: HPOS (cell position), HROW (row info), HCOL (column info), CRCR (format conversion)
  • Training Strategy: two-stage curriculum, auxiliary subtask pre-training followed by target task fine-tuning
  • Evaluation: comparison of description quality across formats and subtask combinations

Experimental Results

Experiments evaluate the impact of serialization format choice and auxiliary subtask training on Korean table cell description generation quality.

Performance by Serialization Format

  • Markdown (baseline): baseline performance; high token count; moderate structural clarity (relies on pipe delimiters)
  • HTML: below baseline; highest token count; heavy markup overhead obscures content
  • JSON: above baseline; moderate token count; explicit key-value pairs aid header association
  • Pandas DataFrame: +19.6% over baseline; lowest token count; natural alignment preserves column structure

Impact of Auxiliary Subtasks

Subtask Contribution Analysis

  • HPOS alone: cell-to-coordinate mapping; high contribution
  • HCOL alone: column-header association; high contribution
  • HROW alone: row-level context extraction; moderate contribution
  • CRCR alone: holistic structural understanding; moderate contribution
  • All four combined: complete structural grounding; highest contribution

Why It Matters

This work provides practical, actionable guidelines for building Korean table understanding systems, an area of growing importance as structured data becomes increasingly central to business and government applications. The contributions are threefold:

  • A systematic comparison of four table serialization formats for Korean, identifying the Pandas DataFrame format as the best trade-off between token efficiency and structural clarity.
  • Four auxiliary subtasks (HPOS, HROW, HCOL, CRCR) trained as a two-stage curriculum that builds structural table comprehension before target-task fine-tuning.
  • A fully automatic synthetic data generation procedure that requires no manual annotation and scales easily to new domains and languages.

Practical Takeaway: For any practitioner building a Korean table-to-text system, this paper offers a clear recipe: serialize tables as Pandas DataFrames for the best balance of token efficiency and structural clarity, and pre-train with the HPOS/HROW/HCOL/CRCR subtask curriculum using automatically generated synthetic data before fine-tuning on the target task.

The paper received the Best Paper Award at HCLT 2024 (published in the conference proceedings, pp. 635–640), recognizing its contributions to an underexplored but practically important area of Korean NLP.
