By comparing four table serialization formats and designing four synthetic auxiliary subtasks, this work shows that Pandas DataFrame preprocessing yields a 19.6% improvement over a Markdown baseline in Korean table cell description generation while also reducing token costs.
Tables are one of the most common ways to organize structured data, yet automatically generating natural language descriptions of specific table cells remains a difficult task for language models — especially in Korean, where resources and benchmarks are scarce. Given a table and a highlighted target cell, the goal is to produce a fluent Korean sentence explaining what the cell value means in the context of its row, column, and the broader table topic.
Key Challenges:
- Korean resources and benchmarks for table-to-text generation are scarce, leaving models with little supervised signal for this task.
- The choice of serialization format determines both how well a model can grasp table structure and how many tokens each table consumes.
- Describing a cell requires grounding its value in its row, its column, and the broader table topic, a structural skill models do not acquire for free.

This paper addresses these challenges by conducting a controlled comparison of serialization formats and proposing targeted auxiliary training tasks with synthetic data that build table comprehension skills from the ground up.
The approach has two main components: (1) identifying the optimal table-to-text serialization format, and (2) designing auxiliary subtasks with synthetic datasets that incrementally teach the model to understand table structure before tackling the full description generation task.
Overall Pipeline: The training procedure follows a two-stage curriculum. First, the model is trained on synthetic auxiliary subtasks to build foundational table comprehension skills (structural grounding). Then, the pre-trained model is fine-tuned on the target task of table cell description generation. This curriculum approach ensures that the model acquires structural understanding before attempting the more complex generation objective.
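The two-stage curriculum can be sketched as follows. This is a minimal illustration of the training order only: `StubModel`, `train_curriculum`, and the batch format are stand-ins invented for this sketch, not the paper's actual model or data loaders.

```python
from dataclasses import dataclass, field

@dataclass
class StubModel:
    """Stand-in for a seq2seq LM; records the order in which batches arrive."""
    seen: list = field(default_factory=list)

    def step(self, batch):
        self.seen.append(batch["task"])

def train_curriculum(model, auxiliary_batches, target_batches):
    # Stage 1: structural grounding on the synthetic auxiliary subtasks
    for batch in auxiliary_batches:
        model.step(batch)
    # Stage 2: fine-tune on Korean table cell description generation
    for batch in target_batches:
        model.step(batch)
    return model

aux = [{"task": t} for t in ("HPOS", "HROW", "HCOL", "CRCR")]
target = [{"task": "DESC"}]
model = train_curriculum(StubModel(), aux, target)
```

The point of the ordering is that every auxiliary batch is consumed before any target-task batch, so structural skills are in place before generation is attempted.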
Four formats are compared for converting a table into a text string that a language model can process:
| Format | Characteristics | Example Encoding | Token Efficiency |
|---|---|---|---|
| Markdown | Pipe-delimited columns with header separator row | `\| Col1 \| Col2 \|\n\|---\|---\|\n\| val \| val \|` | Moderate |
| HTML | Full `<table>`/`<tr>`/`<td>` markup preserving structure | `<table><tr><td>val</td>...</tr></table>` | Low (verbose tags) |
| Pandas DataFrame | Index-based row/column representation with aligned spacing | `Col1 Col2\n0 val val` | High |
| JSON | Nested key-value pairs per row | `[{"Col1":"val","Col2":"val"}]` | Moderate |
The key insight behind the DataFrame format is that its whitespace-aligned layout naturally preserves column alignment, making it easier for models to associate cell values with their corresponding headers without requiring explicit structural markup.
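As a concrete illustration, all four serializations of a small toy table can be produced with pandas. The table contents are invented for this sketch, and the Markdown string is built by hand to avoid the optional `tabulate` dependency behind `df.to_markdown()`:

```python
import pandas as pd

# Toy table (contents invented for illustration)
df = pd.DataFrame({"Country": ["Korea", "Japan"], "Capital": ["Seoul", "Tokyo"]})

# Markdown: pipe-delimited columns with a header separator row
header = "| " + " | ".join(df.columns) + " |"
separator = "|" + "---|" * len(df.columns)
body = ["| " + " | ".join(map(str, row)) + " |"
        for row in df.itertuples(index=False)]
markdown = "\n".join([header, separator, *body])

# HTML: verbose tag markup, the most token-hungry encoding
html = df.to_html(index=False)

# Pandas DataFrame: whitespace-aligned text with a numeric row index
dataframe_text = str(df)

# JSON: one key-value record per row
json_text = df.to_json(orient="records", force_ascii=False)
```

In `dataframe_text`, each value sits directly under its header purely through spacing, which is the alignment property the paper credits for the format's effectiveness.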
Four subtasks are designed to build specific table comprehension skills, each supported by automatically generated synthetic datasets. Together, they form a progressive curriculum that moves from localized cell-level understanding to global table-level comprehension:
- HPOS: identify the position (row and column) of a highlighted cell.
- HROW: extract the information contained in a given row.
- HCOL: extract the information under a given column header.
- CRCR: convert a table from one serialization format to another.
Synthetic Data Generation: All auxiliary subtask datasets are constructed automatically from existing tables — no manual annotation is required. For each table, multiple training instances are generated by varying target cells (HPOS), row indices (HROW), column headers (HCOL), and source/target format pairs (CRCR). This makes the framework easily scalable to new domains and languages.
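A minimal sketch of this generation process, assuming DataFrame serialization as the input encoding. The `(subtask, input, query, answer)` tuple layout and the query/answer phrasing are illustrative stand-ins, not the paper's actual prompt templates:

```python
import pandas as pd

def make_aux_instances(df: pd.DataFrame):
    """Build (subtask, input_table, query, answer) tuples from one table."""
    serialized = str(df)  # DataFrame serialization of the table
    instances = []
    for r in range(len(df)):
        for col in df.columns:
            # HPOS: map a highlighted cell value to its coordinates
            instances.append(("HPOS", serialized, str(df.at[r, col]),
                              f"row {r}, column {col}"))
        # HROW: extract the contents of a given row
        instances.append(("HROW", serialized, f"row {r}",
                          ", ".join(map(str, df.iloc[r]))))
    for col in df.columns:
        # HCOL: list the values under a given column header
        instances.append(("HCOL", serialized, col,
                          ", ".join(map(str, df[col]))))
    # CRCR: convert between serialization formats (DataFrame -> JSON here)
    instances.append(("CRCR", serialized, "convert to JSON",
                      df.to_json(orient="records", force_ascii=False)))
    return instances

df = pd.DataFrame({"Country": ["Korea"], "Capital": ["Seoul"]})
examples = make_aux_instances(df)
```

Even this one-row toy table yields six training instances, which illustrates how the instance count scales with table size without any manual annotation.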
The experiments are designed to isolate the contributions of serialization format choice and auxiliary subtask training on the downstream task of Korean table cell description generation.
| Component | Details |
|---|---|
| Target Task | Korean table cell description generation — producing a fluent Korean sentence explaining a highlighted cell in context |
| Serialization Formats | Markdown (baseline), HTML, Pandas DataFrame, JSON |
| Auxiliary Subtasks | HPOS (cell position), HROW (row info), HCOL (column info), CRCR (format conversion) |
| Training Strategy | Two-stage curriculum: auxiliary subtask pre-training followed by target task fine-tuning |
| Evaluation | Comparison of description quality across formats and subtask combinations |
Experiments evaluate the impact of serialization format choice and auxiliary subtask training on Korean table cell description generation quality.
| Format | Relative Performance | Token Count | Structural Clarity |
|---|---|---|---|
| Baseline (Markdown) | Baseline | High | Moderate — relies on pipe delimiters |
| HTML | Below baseline | Highest | High markup overhead obscures content |
| JSON | Above baseline | Moderate | Explicit key-value pairs aid header association |
| Pandas DataFrame | +19.6% over baseline | Lowest | Natural alignment preserves column structure |
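The token-cost column can be sanity-checked on a toy table using character counts as a crude, tokenizer-agnostic proxy (the paper's actual token counts depend on its tokenizer, so only the extremes are asserted here; on this toy table the DataFrame encoding is the smallest and HTML the largest, consistent with the ranking above):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Korea", "Japan"], "Capital": ["Seoul", "Tokyo"]})

md_body = ["| " + " | ".join(map(str, r)) + " |"
           for r in df.itertuples(index=False)]
encodings = {
    "Markdown": "\n".join(["| " + " | ".join(df.columns) + " |",
                           "|" + "---|" * len(df.columns), *md_body]),
    "HTML": df.to_html(index=False),
    "DataFrame": str(df),
    "JSON": df.to_json(orient="records"),
}
# Character counts: a crude proxy for token cost
sizes = {name: len(text) for name, text in encodings.items()}
```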
| Subtask Combination | Primary Skill Gained | Contribution Level |
|---|---|---|
| HPOS alone | Cell-to-coordinate mapping | High |
| HCOL alone | Column-header association | High |
| HROW alone | Row-level context extraction | Moderate |
| CRCR alone | Holistic structural understanding | Moderate |
| All four combined | Complete structural grounding | Highest |
This work provides practical, actionable guidelines for building Korean table understanding systems, an area of growing importance as structured data becomes increasingly central to business and government applications. The contributions are threefold:
- A controlled comparison of four table serialization formats, identifying the Pandas DataFrame format as the best trade-off between structural clarity and token cost.
- Four auxiliary subtasks (HPOS, HROW, HCOL, CRCR) with fully automatic synthetic data generation, making the framework scalable to new domains and languages.
- A two-stage curriculum that improves Korean table cell description generation by 19.6% over a Markdown baseline.
Practical Takeaway: For any practitioner building a Korean table-to-text system, this paper offers a clear recipe: serialize tables as Pandas DataFrames for the best balance of token efficiency and structural clarity, and pre-train with the HPOS/HROW/HCOL/CRCR subtask curriculum using automatically generated synthetic data before fine-tuning on the target task.
The paper received the Best Paper Award at HCLT 2024 (published in the conference proceedings, pp. 635–640), recognizing its contributions to an underexplored but practically important area of Korean NLP.