One-Line Summary
Introduces KR-HumanEval, a Korean-translated code generation benchmark, and reveals that few-shot prompting with English descriptions paradoxically yields the best Korean program synthesis performance — exposing a fundamental language bias in current LLMs.
Background & Motivation
Program synthesis — automatically generating code from natural language specifications — has seen dramatic progress with large language models. Benchmarks like OpenAI's HumanEval (164 hand-crafted Python problems) have become the de facto standard for evaluating code generation, but they are exclusively English-based.
Key Problem: As LLM-powered coding assistants (GitHub Copilot, ChatGPT, etc.) are adopted globally, non-English speakers increasingly write task descriptions in their native language. Yet there is no established benchmark for evaluating whether models can correctly synthesize code from Korean natural language descriptions. This gap makes it impossible to assess or improve Korean code generation capabilities systematically.
Prior work on multilingual code generation has focused primarily on high-resource language pairs or simple translation of function signatures, without a comprehensive evaluation framework that accounts for the unique challenges of Korean — such as agglutinative morphology, SOV word order, and the frequent mixing of Korean text with English technical terms in real programming contexts. While HumanEval-XL explored cross-lingual code generation benchmarks, no publicly available Korean program synthesis dataset existed at the time of this work.
Proposed Method
The authors construct the KR-HumanEval benchmark by translating HumanEval's English docstrings into Korean, and then systematically evaluate multiple LLMs under various prompting strategies to understand how language choice in prompts affects code generation quality.
1. KR-HumanEval Benchmark Construction
The original HumanEval's 164 English docstrings are translated into Korean via a multi-stage pipeline: first, DeepL machine translation produces initial drafts; then, human reviewers manually verify each translation and score quality on a 1–3 scale. GPT-4o-mini is subsequently used for quality inspection and refinement. The final dataset contains 155 problems rated score 3 (94.5%), 9 at score 2 (5.5%), and 0 at score 1, indicating high translation quality. Function signatures, test cases, and code structure remain identical to the original, enabling controlled comparison.
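The released records presumably follow the original HumanEval JSONL schema, with only the docstring translated. Below is a minimal sketch of what one such record might look like; the task ID, Korean wording, and truncated test are illustrative assumptions rather than an actual dataset entry.

```python
# Minimal sketch of a KR-HumanEval record, assuming it keeps the original HumanEval
# JSONL fields (task_id, prompt, entry_point, canonical_solution, test) and only
# replaces the English docstring with Korean. The ID, Korean wording, and truncated
# test below are illustrative, not an actual dataset entry.
example_record = {
    "task_id": "KR-HumanEval/0",  # hypothetical ID
    "prompt": (
        "from typing import List\n\n\n"
        "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
        '    """주어진 숫자 리스트에서 서로의 차이가 주어진 임계값보다 작은\n'
        "    두 숫자가 있는지 확인합니다.\n"
        "    (Check if any two numbers in the list are closer than the threshold.)\n"
        '    """\n'
    ),
    "entry_point": "has_close_elements",
    # Signature, tests, and reference solution stay identical to the English original,
    # so pass@1 scores remain directly comparable.
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n"
        "    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n"
    ),
}
```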
2. Four Types of Few-Shot Examples
The study defines four distinct few-shot example types that vary in language composition: [KO, PY] — Korean description with standard Python code; [KO, PY(KO var)] — Korean description with Korean variable names in Python; [KO, PY(KO doc)] — Korean description with Korean comments/annotations added to Python code; and [EN, PY] — English description with standard Python code. Each type is generated from templates and conversion instructions to ensure consistency across examples.
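To make the four types concrete, the sketch below renders one toy problem in each format. The problem, Korean wording, and comment placement are hypothetical; only the four-type taxonomy itself comes from the paper.

```python
# Hedged sketch: one toy problem rendered as each of the four few-shot example types.
# The problem, Korean wording, and comment placement are hypothetical; only the
# taxonomy [KO, PY], [KO, PY(KO var)], [KO, PY(KO doc)], [EN, PY] is from the paper.

KO_PY = '''# 설명: 리스트에 있는 모든 수의 합을 반환하세요. (Return the sum of all numbers in the list.)
def sum_list(numbers):
    total = 0
    for n in numbers:
        total += n
    return total
'''

KO_PY_KO_VAR = '''# 설명: 리스트에 있는 모든 수의 합을 반환하세요.
def sum_list(숫자들):        # Korean variable names (숫자들 = numbers) in the Python body
    합계 = 0                 # 합계 = total
    for 수 in 숫자들:        # 수 = number
        합계 += 수
    return 합계
'''

KO_PY_KO_DOC = '''# 설명: 리스트에 있는 모든 수의 합을 반환하세요.
def sum_list(numbers):
    total = 0              # 합계를 0으로 초기화 (initialize the running sum)
    for n in numbers:      # 각 숫자를 순회 (iterate over every number)
        total += n
    return total
'''

EN_PY = '''# Description: Return the sum of all numbers in the list.
def sum_list(numbers):
    total = 0
    for n in numbers:
        total += n
    return total
'''
```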
3. Scaling from 1-Shot to 5-Shot
For each of the four example types, experiments are run with 1 to 5 few-shot examples per prompt, yielding 20 few-shot configurations per model plus a zero-shot baseline. The examples themselves are new problems crafted by an LLM (Gemini) rather than drawn from HumanEval, avoiding data contamination. Instruction-tuned models (GPT-4o-mini, EXAONE-3.0-7.8B-Instruct) are prompted with a chat template, while the base model (DeepSeek-Coder-V2-16B-Base) uses a standard completion template.
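A minimal sketch of how such prompts could be assembled is shown below, assuming a chat-style message list for the instruction-tuned models and a plain concatenated string for the base model; the roles, system message, and joining format are illustrative, not the paper's exact templates.

```python
# Minimal sketch of the two prompt formats: a chat-style message list for the
# instruction-tuned models and a plain completion string for the base model.
# Roles, system message, and separators are illustrative assumptions.

def build_chat_prompt(examples, target_prompt, n_shots):
    """examples: list of (description, solution) pairs of a single example type."""
    messages = [{"role": "system", "content": "You are a helpful coding assistant."}]
    for description, solution in examples[:n_shots]:
        messages.append({"role": "user", "content": description})
        messages.append({"role": "assistant", "content": solution})
    messages.append({"role": "user", "content": target_prompt})
    return messages  # fed to a chat model such as GPT-4o-mini or EXAONE-3.0-7.8B-Instruct

def build_completion_prompt(examples, target_prompt, n_shots):
    """Plain concatenation for a base model such as DeepSeek-Coder-V2-16B-Base."""
    parts = [f"{description}\n{solution}" for description, solution in examples[:n_shots]]
    parts.append(target_prompt)
    return "\n\n".join(parts)
```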
Models Evaluated
- GPT-4o-mini (closed-source): Strong general-purpose model with broad multilingual and coding capabilities.
- DeepSeek-Coder-V2-16B-Base (open-source): Pre-trained on large code corpora, achieving high scores on code generation benchmarks.
- EXAONE-3.0-7.8B-Instruct (open-source): A Korean-focused instruction-tuned model, used to analyze the impact of Korean-centric pre-training.
The evaluation metric is pass@1 with greedy decoding, the standard functional correctness metric that checks whether a single generated code sample passes all unit tests.
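With greedy decoding, pass@1 reduces to the fraction of problems whose single generated completion passes every unit test. The sketch below illustrates that computation; real harnesses such as OpenAI's human-eval run each sample in an isolated, time-limited sandbox rather than calling exec() directly.

```python
# Simplified sketch of pass@1 with greedy decoding: one generated sample per problem,
# counted as correct only if it passes all unit tests. exec() on untrusted model
# output is shown here only to illustrate the metric; use a sandbox in practice.

def passes_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    program = prompt + completion + "\n" + test_code + f"\ncheck({entry_point})\n"
    env = {}
    try:
        exec(program, env)   # unsafe outside a sandbox; illustration only
        return True
    except Exception:
        return False

def pass_at_1(problems, generations) -> float:
    """problems: HumanEval-style dicts; generations: one greedy completion per problem."""
    correct = sum(
        passes_tests(p["prompt"], g, p["test"], p["entry_point"])
        for p, g in zip(problems, generations)
    )
    return 100.0 * correct / len(problems)
```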
Experimental Results
Three models are evaluated on KR-HumanEval across zero-shot and four types of few-shot configurations (1–5 shots each). The results reveal clear patterns about how language choice and example type in prompts impact Korean code generation.
Zero-Shot vs. Average Few-Shot Performance (pass@1)
| Model | Zero-Shot | [KO, PY] | [KO, PY(KO var)] | [KO, PY(KO doc)] | [EN, PY] |
| --- | --- | --- | --- | --- | --- |
| GPT-4o-mini | 72.8 | 73.6 | 69.0 | 72.4 | **75.0** |
| DeepSeek-Coder-V2-16B-Base | 37.2 | **50.6** | 48.4 | 46.5 | 49.7 |
| EXAONE-3.0-7.8B-Instruct | 53.6 | 52.3 | 50.2 | **54.4** | 51.5 |
Few-shot values are the average pass@1 across 1–5 shots for each example type. Bold indicates the highest score per model across all settings.
Detailed Few-Shot Results by Number of Shots (pass@1)
| Model | # Shots | [KO, PY] | [KO, PY(KO var)] | [KO, PY(KO doc)] | [EN, PY] |
| --- | --- | --- | --- | --- | --- |
| GPT-4o-mini | 1 | 73.2 | 76.2 | 71.3 | 75.6 |
| | 2 | 76.8 | 70.1 | 76.8 | **78.7** |
| | 3 | 77.4 | 68.3 | 75.0 | 75.6 |
| | 4 | 70.7 | 64.0 | 70.1 | 69.5 |
| | 5 | 70.1 | 66.4 | 68.9 | 75.6 |
| DeepSeek-Coder-V2-16B-Base | 1 | 45.7 | 47.0 | 43.3 | 44.5 |
| | 2 | 48.8 | 50.6 | 47.0 | 51.2 |
| | 3 | 53.7 | 51.2 | 48.2 | **54.3** |
| | 4 | 52.4 | 46.3 | 44.5 | 48.2 |
| | 5 | 52.4 | 47.0 | 49.4 | 50.6 |
| EXAONE-3.0-7.8B-Instruct | 1 | 56.1 | 56.7 | 52.4 | **57.3** |
| | 2 | 56.1 | 51.8 | 54.3 | 56.7 |
| | 3 | 51.2 | 48.2 | 55.5 | 50.6 |
| | 4 | 50.0 | 46.3 | 53.0 | 47.0 |
| | 5 | 48.1 | 48.1 | 56.7 | 45.7 |
Bold indicates the single best score across all few-shot configurations for each model.
Key Findings
- Few-shot prompting can beat zero-shot for every model: each model's best few-shot configuration outperforms its zero-shot baseline. DeepSeek-Coder-V2-16B-Base sees the largest gain, jumping from 37.2 (zero-shot) to 54.3 (3-shot [EN, PY]), a 17.1-point improvement.
- [EN, PY] achieves the highest peak scores: For both GPT-4o-mini (78.7 at 2-shot) and DeepSeek-Coder-V2-16B-Base (54.3 at 3-shot), English-description examples yield the best results, suggesting cross-lingual transfer from the models' stronger English-code alignment during pre-training.
- Korean-focused model benefits from Korean annotations: EXAONE-3.0-7.8B-Instruct, the Korean-centric model, is the only model for which [KO, PY(KO doc)] examples are the most effective type, reaching 56.7 at 5-shot and the highest few-shot average (54.4, versus 53.6 zero-shot). This suggests that Korean-focused pre-training helps the model exploit Korean comments.
- Korean variables can hurt performance: GPT-4o-mini shows performance degradation with [KO, PY(KO var)] examples, indicating that the model may be sensitive to perturbations in variable naming conventions. However, EXAONE-3.0-7.8B-Instruct generates Korean variables more naturally, producing Korean-named variables in 108 out of 164 problems at 5-shot.
- More shots are not always better: GPT-4o-mini peaks at 2–3 shots and degrades at 4–5 shots, while DeepSeek-Coder-V2-16B-Base generally improves up to 3 shots. This suggests diminishing returns and potential interference from excessive examples in the prompt.
Code Analysis: Korean Variable Adoption
When [KO, PY(KO var)] examples are used, models differ dramatically in their adoption of Korean variables. With 5-shot examples, EXAONE-3.0-7.8B-Instruct generates code with Korean variable names for 108 of 164 problems, far more than GPT-4o-mini or DeepSeek-Coder-V2-16B-Base. Notably, while GPT-4o-mini and DeepSeek either avoid Korean variables or produce incorrect code when using them, EXAONE successfully generates correct code with Korean variables — highlighting the value of Korean-centric pre-training for code generation.
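One plausible way to perform such a count is to parse each generated solution and check identifiers for Hangul characters, as in the sketch below; the paper's exact counting procedure is not reproduced here.

```python
# Hedged sketch of counting solutions that use Korean variable names: parse the code
# with ast and look for Hangul characters in identifiers. This is one possible
# procedure, not necessarily the one used in the paper.
import ast

def has_hangul(name: str) -> bool:
    # Hangul syllables U+AC00-U+D7A3 plus the Jamo block U+1100-U+11FF.
    return any("\uac00" <= ch <= "\ud7a3" or "\u1100" <= ch <= "\u11ff" for ch in name)

def uses_korean_identifiers(code: str) -> bool:
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and has_hangul(node.id):
            return True
        if isinstance(node, ast.arg) and has_hangul(node.arg):
            return True
    return False

# e.g. korean_count = sum(uses_korean_identifiers(sol) for sol in generated_solutions)
```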
Implications for Prompt Design
- For general-purpose models, a hybrid prompting strategy that pairs English-description [EN, PY] examples with 2–3 shots for Korean target tasks appears optimal (see the configuration sketch after this list).
- For Korean-specialized models, [KO, PY(KO doc)] examples with Korean annotations provide the best performance, leveraging the model's Korean language understanding.
- The finding that increasing [EN, PY] shots can degrade performance suggests the need for research into efficient prompting techniques for cross-lingual code generation.
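A minimal configuration sketch encoding this guidance follows; the model-name check and default shot counts are illustrative assumptions, not part of the paper.

```python
# Minimal sketch encoding the prompt-design guidance above. The model-name check
# and defaults are illustrative assumptions.

def choose_fewshot_config(model_name: str) -> dict:
    korean_specialized = "exaone" in model_name.lower()
    if korean_specialized:
        # Korean-centric models benefited most from Korean-annotated code examples.
        return {"example_type": "[KO, PY(KO doc)]", "n_shots": 5}
    # General-purpose models peaked with English descriptions and 2-3 shots.
    return {"example_type": "[EN, PY]", "n_shots": 2}

print(choose_fewshot_config("EXAONE-3.0-7.8B-Instruct"))
# {'example_type': '[KO, PY(KO doc)]', 'n_shots': 5}
print(choose_fewshot_config("gpt-4o-mini"))
# {'example_type': '[EN, PY]', 'n_shots': 2}
```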
Why It Matters
This work makes four important contributions to multilingual code generation research:
- First publicly available Korean code generation benchmark: KR-HumanEval fills a critical evaluation gap with 164 high-quality Korean-translated programming problems (94.5% rated highest quality), providing the community with a standardized tool for measuring and improving Korean program synthesis capabilities. The authors plan to publicly release the dataset.
- Reveals language bias in code LLMs: The finding that English few-shot descriptions outperform Korean ones — even for Korean target tasks — exposes a fundamental bias in current models toward English-centric code generation, pointing to concrete directions for more balanced multilingual training.
- Korean-focused models show different behavior: The divergent behavior of EXAONE (Korean-centric) vs. GPT-4o-mini (general-purpose) on Korean annotations and variables demonstrates that pre-training language composition materially affects how models handle non-English code generation, motivating further research into Korean-specialized code models.
- Practical prompting guidelines: The systematic 60-configuration analysis (3 models × 4 example types × 5 shot counts) provides actionable guidance: use [EN, PY] with 2–3 shots for general models, or [KO, PY(KO doc)] for Korean-specialized models. Careful prompt engineering can narrow the cross-lingual performance gap by up to 17 points.
The paper received the Best Paper Award at HCLT 2024, recognizing its significance for the Korean NLP community. Future directions include expanding to other programming languages and exploring Korean data augmentation and cross-lingual transfer techniques for further improvement.
Benchmark
Multilingual