By comparing four table serialization formats and designing four synthetic auxiliary subtasks, this work shows that Pandas DataFrame preprocessing yields a 19.6% improvement over a Markdown baseline in Korean table cell description generation while also reducing token costs.
Tables are one of the most common ways to organize structured data, yet automatically generating natural language descriptions of specific table cells remains a difficult task for language models — especially in Korean, where resources and benchmarks are scarce. Given a table and a highlighted target cell, the goal is to produce a fluent Korean sentence explaining what the cell value means in the context of its row, column, and the broader table topic.
Key Challenges:
- Korean resources and benchmarks for table-to-text generation are scarce, leaving models with little supervised signal for this task.
- The choice of serialization format determines both how well a model can grasp table structure and how many tokens each table consumes.
- Describing a cell requires grounding its value in its row, its column, and the broader table topic, a structural skill models do not acquire for free.

This paper addresses these challenges by conducting a controlled comparison of serialization formats and proposing targeted auxiliary training tasks with synthetic data that build table comprehension skills from the ground up.
The approach has two main components: (1) identifying the optimal table-to-text serialization format, and (2) designing auxiliary subtasks with synthetic datasets that incrementally teach the model to understand table structure before tackling the full description generation task.
Overall Pipeline: The training procedure follows a two-stage curriculum. First, the model is trained on synthetic auxiliary subtasks to build foundational table comprehension skills (structural grounding). Then, the pre-trained model is fine-tuned on the target task of table cell description generation. This curriculum approach ensures that the model acquires structural understanding before attempting the more complex generation objective.
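The two-stage curriculum can be sketched as follows. This is a minimal illustration of the training order only: `StubModel`, `train_curriculum`, and the batch format are stand-ins invented for this sketch, not the paper's actual model or data loaders.

```python
from dataclasses import dataclass, field

@dataclass
class StubModel:
    """Stand-in for a seq2seq LM; records the order in which batches arrive."""
    seen: list = field(default_factory=list)

    def step(self, batch):
        self.seen.append(batch["task"])

def train_curriculum(model, auxiliary_batches, target_batches):
    # Stage 1: structural grounding on the synthetic auxiliary subtasks
    for batch in auxiliary_batches:
        model.step(batch)
    # Stage 2: fine-tune on Korean table cell description generation
    for batch in target_batches:
        model.step(batch)
    return model

aux = [{"task": t} for t in ("HPOS", "HROW", "HCOL", "CRCR")]
target = [{"task": "DESC"}]
model = train_curriculum(StubModel(), aux, target)
```

The point of the ordering is that every auxiliary batch is consumed before any target-task batch, so structural skills are in place before generation is attempted.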
Four formats are compared for converting a table into a text string that a language model can process:
| Format | Characteristics | Example Encoding | Token Efficiency |
|---|---|---|---|
| Markdown | Pipe-delimited columns with header separator row | `\| Col1 \| Col2 \|\n\|---\|---\|\n\| val \| val \|` | Moderate |
| HTML | Full `<table>`/`<tr>`/`<td>` markup preserving structure | `<table><tr><td>val</td>...</tr></table>` | Low (verbose tags) |
| Pandas DataFrame | Index-based row/column representation with aligned spacing | `Col1 Col2\n0 val val` | High |
| JSON | Nested key-value pairs per row | `[{"Col1":"val","Col2":"val"}]` | Moderate |
The key insight behind the DataFrame format is that its whitespace-aligned layout naturally preserves column alignment, making it easier for models to associate cell values with their corresponding headers without requiring explicit structural markup.
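As a concrete illustration, all four serializations of a small toy table can be produced with pandas. The table contents are invented for this sketch, and the Markdown string is built by hand to avoid the optional `tabulate` dependency behind `df.to_markdown()`:

```python
import pandas as pd

# Toy table (contents invented for illustration)
df = pd.DataFrame({"Country": ["Korea", "Japan"], "Capital": ["Seoul", "Tokyo"]})

# Markdown: pipe-delimited columns with a header separator row
header = "| " + " | ".join(df.columns) + " |"
separator = "|" + "---|" * len(df.columns)
body = ["| " + " | ".join(map(str, row)) + " |"
        for row in df.itertuples(index=False)]
markdown = "\n".join([header, separator, *body])

# HTML: verbose tag markup, the most token-hungry encoding
html = df.to_html(index=False)

# Pandas DataFrame: whitespace-aligned text with a numeric row index
dataframe_text = str(df)

# JSON: one key-value record per row
json_text = df.to_json(orient="records", force_ascii=False)
```

In `dataframe_text`, each value sits directly under its header purely through spacing, which is the alignment property the paper credits for the format's effectiveness.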
Four subtasks are designed to build specific table comprehension skills, each supported by automatically generated synthetic datasets. Together, they form a progressive curriculum that moves from localized cell-level understanding to global table-level comprehension:
- HPOS: identify the position (row and column) of a highlighted cell.
- HROW: extract the information contained in a given row.
- HCOL: extract the information under a given column header.
- CRCR: convert a table from one serialization format to another.
Synthetic Data Generation: All auxiliary subtask datasets are constructed automatically from existing tables — no manual annotation is required. For each table, multiple training instances are generated by varying target cells (HPOS), row indices (HROW), column headers (HCOL), and source/target format pairs (CRCR). This makes the framework easily scalable to new domains and languages.
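A minimal sketch of this generation process, assuming DataFrame serialization as the input encoding. The `(subtask, input, query, answer)` tuple layout and the query/answer phrasing are illustrative stand-ins, not the paper's actual prompt templates:

```python
import pandas as pd

def make_aux_instances(df: pd.DataFrame):
    """Build (subtask, input_table, query, answer) tuples from one table."""
    serialized = str(df)  # DataFrame serialization of the table
    instances = []
    for r in range(len(df)):
        for col in df.columns:
            # HPOS: map a highlighted cell value to its coordinates
            instances.append(("HPOS", serialized, str(df.at[r, col]),
                              f"row {r}, column {col}"))
        # HROW: extract the contents of a given row
        instances.append(("HROW", serialized, f"row {r}",
                          ", ".join(map(str, df.iloc[r]))))
    for col in df.columns:
        # HCOL: list the values under a given column header
        instances.append(("HCOL", serialized, col,
                          ", ".join(map(str, df[col]))))
    # CRCR: convert between serialization formats (DataFrame -> JSON here)
    instances.append(("CRCR", serialized, "convert to JSON",
                      df.to_json(orient="records", force_ascii=False)))
    return instances

df = pd.DataFrame({"Country": ["Korea"], "Capital": ["Seoul"]})
examples = make_aux_instances(df)
```

Even this one-row toy table yields six training instances, which illustrates how the instance count scales with table size without any manual annotation.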
The experiments are designed to isolate the contributions of serialization format choice and auxiliary subtask training on the downstream task of Korean table cell description generation.
| Component | Details |
|---|---|
| Target Task | Korean table cell description generation — producing a fluent Korean sentence explaining a highlighted cell in context |
| Serialization Formats | Markdown (baseline), HTML, Pandas DataFrame, JSON |
| Auxiliary Subtasks | HPOS (cell position), HROW (row info), HCOL (column info), CRCR (format conversion) |
| Training Strategy | Two-stage curriculum: auxiliary subtask pre-training followed by target task fine-tuning |
| Evaluation | Comparison of description quality across formats and subtask combinations |
Experiments evaluate the impact of serialization format choice and auxiliary subtask training on Korean table cell description generation quality.
| Format | Relative Performance | Token Count | Structural Clarity |
|---|---|---|---|
| Baseline (Markdown) | Baseline | High | Moderate — relies on pipe delimiters |
| HTML | Below baseline | Highest | High markup overhead obscures content |
| JSON | Above baseline | Moderate | Explicit key-value pairs aid header association |
| Pandas DataFrame | +19.6% over baseline | Lowest | Natural alignment preserves column structure |
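The token-cost column can be sanity-checked on a toy table using character counts as a crude, tokenizer-agnostic proxy (the paper's actual token counts depend on its tokenizer, so only the extremes are asserted here; on this toy table the DataFrame encoding is the smallest and HTML the largest, consistent with the ranking above):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Korea", "Japan"], "Capital": ["Seoul", "Tokyo"]})

md_body = ["| " + " | ".join(map(str, r)) + " |"
           for r in df.itertuples(index=False)]
encodings = {
    "Markdown": "\n".join(["| " + " | ".join(df.columns) + " |",
                           "|" + "---|" * len(df.columns), *md_body]),
    "HTML": df.to_html(index=False),
    "DataFrame": str(df),
    "JSON": df.to_json(orient="records"),
}
# Character counts: a crude proxy for token cost
sizes = {name: len(text) for name, text in encodings.items()}
```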
| Subtask Combination | Primary Skill Gained | Contribution Level |
|---|---|---|
| HPOS alone | Cell-to-coordinate mapping | High |
| HCOL alone | Column-header association | High |
| HROW alone | Row-level context extraction | Moderate |
| CRCR alone | Holistic structural understanding | Moderate |
| All four combined | Complete structural grounding | Highest |
This work provides practical, actionable guidelines for building Korean table understanding systems, an area of growing importance as structured data becomes increasingly central to business and government applications. The contributions are threefold:
- A controlled comparison of four table serialization formats, identifying the Pandas DataFrame format as the best trade-off between structural clarity and token cost.
- Four auxiliary subtasks (HPOS, HROW, HCOL, CRCR) with fully automatic synthetic data generation, making the framework scalable to new domains and languages.
- A two-stage curriculum that improves Korean table cell description generation by 19.6% over a Markdown baseline.
Practical Takeaway: For any practitioner building a Korean table-to-text system, this paper offers a clear recipe: serialize tables as Pandas DataFrames for the best balance of token efficiency and structural clarity, and pre-train with the HPOS/HROW/HCOL/CRCR subtask curriculum using automatically generated synthetic data before fine-tuning on the target task.
The paper received the Best Paper Award at HCLT 2024 (published in the conference proceedings, pp. 635–640), recognizing its contributions to an underexplored but practically important area of Korean NLP.