
Analysis of Language Models in Korean Program Synthesis Based on the KR-HumanEval Benchmark

The 36th Annual Conference on Human and Cognitive Language Technology (HCLT 2024) Best Paper Award
Deokyeong Kang, Taeuk Kim

One-Line Summary

Introduces KR-HumanEval, a Korean-translated code generation benchmark, and reveals that few-shot prompting with English-language examples paradoxically yields the best Korean program synthesis performance, exposing a fundamental language bias in current LLMs.

Background & Motivation

Program synthesis — automatically generating code from natural language specifications — has seen dramatic progress with large language models. Benchmarks like OpenAI's HumanEval (164 hand-crafted Python problems) have become the de facto standard for evaluating code generation, but they are exclusively English-based.

Key Problem: As LLM-powered coding assistants (GitHub Copilot, ChatGPT, etc.) are adopted globally, non-English speakers increasingly write task descriptions in their native language. Yet there is no established benchmark for evaluating whether models can correctly synthesize code from Korean natural language descriptions. This gap makes it impossible to assess or improve Korean code generation capabilities systematically.

Prior work on multilingual code generation has focused primarily on high-resource language pairs or simple translation of function signatures, without a comprehensive evaluation framework that accounts for the unique challenges of Korean — such as agglutinative morphology, SOV word order, and the frequent mixing of Korean text with English technical terms in real programming contexts. While HumanEval-XL explored cross-lingual code generation benchmarks, no publicly available Korean program synthesis dataset existed at the time of this work.

Proposed Method

The authors construct the KR-HumanEval benchmark by translating HumanEval's English docstrings into Korean, and then systematically evaluate multiple LLMs under various prompting strategies to understand how language choice in prompts affects code generation quality.

1. KR-HumanEval Benchmark Construction
The original HumanEval's 164 English docstrings are translated into Korean via a multi-stage pipeline: first, DeepL machine translation produces initial drafts; then, human reviewers manually verify each translation and score quality on a 1–3 scale. GPT-4o-mini is subsequently used for quality inspection and refinement. The final dataset contains 155 problems rated score 3 (94.5%), 9 at score 2 (5.5%), and 0 at score 1, indicating high translation quality. Function signatures, test cases, and code structure remain identical to the original, enabling controlled comparison.
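The construction pipeline can be pictured as a short script. Below is a minimal, hypothetical sketch of the machine translation and LLM inspection stages, assuming the official deepl and openai Python clients and the Hugging Face copy of HumanEval; it is not the authors' released code, and the human review stage is only indicated by a comment.

```python
import deepl                       # DeepL machine translation client
from datasets import load_dataset  # Hugging Face copy of HumanEval
from openai import OpenAI          # GPT-4o-mini is used for quality inspection

translator = deepl.Translator("DEEPL_AUTH_KEY")   # placeholder credential
client = OpenAI()

# HumanEval's "prompt" field holds the function signature plus the English
# docstring; only the docstring portion is translated in the paper.
problems = load_dataset("openai_humaneval", split="test")   # 164 problems

def draft_translation(docstring: str) -> str:
    """Stage 1: DeepL produces an initial Korean draft."""
    result = translator.translate_text(docstring, source_lang="EN", target_lang="KO")
    return result.text

def llm_inspection(english: str, korean: str) -> str:
    """Stage 3: GPT-4o-mini inspects and refines the human-reviewed translation."""
    prompt = (
        "Review this Korean translation of a Python docstring and fix any errors, "
        "keeping identifiers and example I/O unchanged.\n\n"
        f"English:\n{english}\n\nKorean:\n{korean}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for task in problems:
    draft = draft_translation(task["prompt"])
    # Stage 2 (manual): human reviewers correct the draft and rate it on a 1-3 scale.
    refined = llm_inspection(task["prompt"], draft)
```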
2. Four Types of Few-Shot Examples
The study defines four distinct few-shot example types that vary in language composition: [KO, PY] — Korean description with standard Python code; [KO, PY(KO var)] — Korean description with Korean variable names in Python; [KO, PY(KO doc)] — Korean description with Korean comments/annotations added to Python code; and [EN, PY] — English description with standard Python code. Each type is generated from templates and conversion instructions to ensure consistency across examples.
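To make the four types concrete, the snippet below shows hypothetical instances for a toy problem; the actual examples in the paper were generated by Gemini from templates, so the wording and problems differ.

```python
# Hypothetical instances of the four few-shot example types for a toy problem
# ("return the sum of two integers"); the paper's examples were Gemini-written.
EXAMPLE_TYPES = {
    # Korean description + standard Python code
    "[KO, PY]": (
        "두 정수의 합을 반환하세요.",   # "Return the sum of the two integers."
        "def add(a, b):\n    return a + b",
    ),
    # Korean description + Python code with Korean variable names
    "[KO, PY(KO var)]": (
        "두 정수의 합을 반환하세요.",
        "def add(첫째, 둘째):\n    return 첫째 + 둘째",
    ),
    # Korean description + Python code with Korean comments added
    "[KO, PY(KO doc)]": (
        "두 정수의 합을 반환하세요.",
        "def add(a, b):\n    # 두 수를 더해서 반환한다\n    return a + b",
    ),
    # English description + standard Python code
    "[EN, PY]": (
        "Return the sum of the two integers.",
        "def add(a, b):\n    return a + b",
    ),
}
```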
3. Scaling from 1-Shot to 5-Shot
For each of the four example types, experiments are run with 1 to 5 few-shot examples per prompt, yielding 20 few-shot configurations per model plus a zero-shot baseline. The few-shot examples themselves are drawn from new problems written by an LLM (Gemini) rather than from HumanEval, avoiding data contamination. Instruction-tuned models (GPT-4o-mini, EXAONE-3.0-7.8B-Instruct) use a chat template format, while the base model (DeepSeek-Coder-V2-16B-Base) uses a standard completion template.
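A minimal sketch of how the k-shot prompts might be assembled is shown below; build_chat_messages and build_completion_prompt are illustrative names and formats, not code from the paper.

```python
# Illustrative k-shot prompt assembly (k = 1..5). Function names are assumptions.
def build_chat_messages(examples, task_description):
    """Chat-format prompt for instruction-tuned models (GPT-4o-mini, EXAONE-3.0)."""
    messages = []
    for description, code in examples:                 # 1 to 5 few-shot pairs
        messages.append({"role": "user", "content": description})
        messages.append({"role": "assistant", "content": code})
    messages.append({"role": "user", "content": task_description})
    return messages

def build_completion_prompt(examples, task_description):
    """Plain completion prompt for the base model (DeepSeek-Coder-V2-16B-Base)."""
    parts = [f'"""{description}"""\n{code}' for description, code in examples]
    parts.append(f'"""{task_description}"""')          # model continues with code
    return "\n\n".join(parts)

# Example: a 2-shot [EN, PY] configuration for a Korean task description.
shots = [("Return the sum of the two integers.", "def add(a, b):\n    return a + b")] * 2
korean_task = "리스트에서 가장 큰 값을 반환하는 함수를 작성하세요."  # "Write a function returning the largest value in a list."
messages = build_chat_messages(shots, korean_task)
```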

Models Evaluated

Three models are compared: GPT-4o-mini, DeepSeek-Coder-V2-16B-Base, and EXAONE-3.0-7.8B-Instruct. The evaluation metric is pass@1 with greedy decoding, the standard functional correctness metric that checks whether a single generated code sample passes all unit tests.
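As a rough sketch, pass@1 with greedy decoding reduces to plain accuracy over the 164 problems; generate and run_unit_tests below are hypothetical stand-ins for model inference and sandboxed test execution, not APIs from the paper.

```python
# Minimal sketch of pass@1 under greedy decoding: one sample per problem,
# counted as correct only if it passes every unit test.
def pass_at_1(problems, generate, run_unit_tests) -> float:
    passed = 0
    for task in problems:
        completion = generate(task["prompt"])      # single greedy (temperature-0) sample
        if run_unit_tests(task, completion):       # all assertions must pass
            passed += 1
    return 100.0 * passed / len(problems)          # reported as a percentage
```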

Experimental Results

Three models are evaluated on KR-HumanEval across zero-shot and four types of few-shot configurations (1–5 shots each). The results reveal clear patterns about how language choice and example type in prompts impact Korean code generation.

Zero-Shot vs. Average Few-Shot Performance (pass@1)

| Model | Zero-Shot | [KO, PY] | [KO, PY(KO var)] | [KO, PY(KO doc)] | [EN, PY] |
|---|---|---|---|---|---|
| GPT-4o-mini | 72.8 | 73.6 | 69.0 | 72.4 | **75.0** |
| DeepSeek-Coder-V2-16B-Base | 37.2 | **50.6** | 48.4 | 46.5 | 49.7 |
| EXAONE-3.0-7.8B-Instruct | 53.6 | 52.3 | 50.2 | **54.4** | 51.5 |

Values shown for the few-shot columns are the average pass@1 across 1–5 shots for each example type. Bold indicates the highest score per model across all settings.

Detailed Few-Shot Results by Number of Shots (pass@1)

| Model | # Shots | [KO, PY] | [KO, PY(KO var)] | [KO, PY(KO doc)] | [EN, PY] |
|---|---|---|---|---|---|
| GPT-4o-mini | 1 | 73.2 | 76.2 | 71.3 | 75.6 |
| | 2 | 76.8 | 70.1 | 76.8 | **78.7** |
| | 3 | 77.4 | 68.3 | 75.0 | 75.6 |
| | 4 | 70.7 | 64.0 | 70.1 | 69.5 |
| | 5 | 70.1 | 66.4 | 68.9 | 75.6 |
| DeepSeek-Coder-V2-16B-Base | 1 | 45.7 | 47.0 | 43.3 | 44.5 |
| | 2 | 48.8 | 50.6 | 47.0 | 51.2 |
| | 3 | 53.7 | 51.2 | 48.2 | **54.3** |
| | 4 | 52.4 | 46.3 | 44.5 | 48.2 |
| | 5 | 52.4 | 47.0 | 49.4 | 50.6 |
| EXAONE-3.0-7.8B-Instruct | 1 | 56.1 | 56.7 | 52.4 | **57.3** |
| | 2 | 56.1 | 51.8 | 54.3 | 56.7 |
| | 3 | 51.2 | 48.2 | 55.5 | 50.6 |
| | 4 | 50.0 | 46.3 | 53.0 | 47.0 |
| | 5 | 48.1 | 48.1 | 56.7 | 45.7 |

Bold indicates the single best score across all few-shot configurations for each model.

Key Findings

Code Analysis: Korean Variable Adoption

When [KO, PY(KO var)] examples are used, models differ dramatically in their adoption of Korean variables. With 5-shot examples, EXAONE-3.0-7.8B-Instruct generates code with Korean variable names for 108 of 164 problems, far more than GPT-4o-mini or DeepSeek-Coder-V2-16B-Base. Notably, while GPT-4o-mini and DeepSeek either avoid Korean variables or produce incorrect code when using them, EXAONE successfully generates correct code with Korean variables — highlighting the value of Korean-centric pre-training for code generation.
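For illustration only, the snippet below shows what a correct solution in the [KO, PY(KO var)] style can look like: Python 3 accepts Hangul in identifiers, so a model can keep the original function name while using Korean variable names. This is a hand-written example, not an actual model output from the paper.

```python
# Hand-written illustration (not a model output): a HumanEval-style solution
# that uses Korean variable names, which Python 3 accepts as identifiers.
def has_close_elements(숫자들: list, 기준값: float) -> bool:
    """Return True if any two numbers in the list are closer than the threshold."""
    for 첫째_인덱스, 첫째_값 in enumerate(숫자들):
        for 둘째_인덱스, 둘째_값 in enumerate(숫자들):
            if 첫째_인덱스 != 둘째_인덱스 and abs(첫째_값 - 둘째_값) < 기준값:
                return True
    return False
```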

Implications for Prompt Design

Why It Matters

This work makes four important contributions to multilingual code generation research:

The paper received the Best Paper Award at HCLT 2024, recognizing its significance for the Korean NLP community. Future directions include expanding to other programming languages and exploring Korean data augmentation and cross-lingual transfer techniques for further improvement.

Links
