Development and Evaluation of a Dual-Expertise, Utterance-Level Framework for LLM-Based Science Classroom Discourse Analysis

One-Line Summary

A dual-expertise, utterance-level coding framework for LLM-based science classroom discourse analysis — co-developed by science-education faculty and in-service middle-school science teachers, operationalized as a 137-term glossary across 20 thematic categories, validated on 68 authentic Korean middle-school science lessons (8,651 utterances), and benchmarked across prompting and fine-tuning paradigms (best weighted F1 84.59% for rating, 52.01% for theme).

Background & Motivation

Existing automated classroom-discourse analyses (CLASS, MQI, NCTE) were designed for global, lesson-level scoring rather than the local, utterance-level decisions teachers need for actionable feedback. Prior LLM applications also concentrate on mathematics or English-language arts — science discourse, which is interdisciplinary by nature (a single teacher often instructs outside their major), has received much less attention.

Two pressures motivate the framework. First, fine-grained, predicate-level utterances are a better fit for LLM input than 7–15 minute fixed segments, but no LLM-ready coding scheme yet covers all instructional dimensions of authentic K–12 science discourse. Second, ratings and themes derived from raw transcripts must remain pedagogically meaningful, which requires both academic theory (science-education faculty) and classroom realism (experienced teachers).

Research questions. (RQ1) How can domain knowledge in science education be systematically incorporated into an LLM-based coding framework? (RQ2) How do different LLM architectures and configurations compare in predicting rating and theme categories derived from authentic science classroom data?

Dataset

The corpus comprises 68 instructional sessions from 12 in-service middle-school science teachers in South Korea (Fall 2024). Each teacher recorded at least three ~45-minute lectures in both their major and a non-major science discipline.

Subject	Sessions	Coded Utterances	Mean Rating (SD)
Physics	15 (22.1%)	2,001	3.025 (0.263)
Chemistry	15 (22.1%)	2,636	2.989 (0.408)
Life Science	21 (30.9%)	1,500	3.130 (0.363)
Earth Science	17 (25.0%)	2,514	3.144 (0.444)
Total	68	8,651 unique utterances	—

After disaggregating utterances coded under multiple themes (769 dual- and 8 triple-labeled), the analytic dataset contains 9,436 theme–utterance instances, split 8:1:1 into 7,549 / 943 / 944 train/val/test, stratified jointly by theme and rating. 45.59% of sessions were taught in a teacher's major and 54.41% outside it — reflecting the interdisciplinary realities of Korean middle-school science instruction.
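The instance count and split sizes above follow from simple arithmetic, sketched below. The counts come from this section; the rounding rule inside `split_sizes` is an assumption that happens to reproduce the reported sizes, not a documented detail of the authors' pipeline.

```python
# Counts as reported in the Dataset section.
UNIQUE_UTTERANCES = 8651
DUAL_LABELED = 769     # utterances coded under two themes
TRIPLE_LABELED = 8     # utterances coded under three themes


def count_instances(unique, dual, triple):
    """One instance per utterance, plus one extra per dual and two extra per triple label."""
    return unique + dual + 2 * triple


def split_sizes(n):
    """8:1:1 split. The rounding rule is an assumption chosen to match the reported sizes."""
    train = round(n * 8 / 10)
    val = n // 10
    test = n - train - val
    return train, val, test


instances = count_instances(UNIQUE_UTTERANCES, DUAL_LABELED, TRIPLE_LABELED)
print(instances)               # 9436
print(split_sizes(instances))  # (7549, 943, 944)
```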

Dual-Expertise Coding Framework

1. Predicate-Level Chunking

After comparing sentence-level and paragraph-level units, the team adopted predicate-level chunking as the analytic unit because it best supports fine-grained, LLM-friendly modeling and accommodates the variability in teacher speech (extended utterances vs. brief phrases) and STT segmentation noise. Science-education faculty perform segmentation; coders take responsibility for coding the chunked utterances.

2. 137-Term Glossary, 20 Thematic Categories

A bottom-up glossary was developed from 678 initial terms and consolidated to 137 instructional terms through multiple rounds of refinement involving four science-education professors and one professor of educational evaluation. Theme 1 (Teaching Practices) contains 123 terms across 17 categories (e.g., tentativeness of scientific knowledge, explanation of the development of science and technology); Theme 2 (Classroom Management) contains 14 terms across 3 categories (e.g., task presentation, reading textbooks, lesson summary). Development was informed by RTOP and ISIOP observation protocols.

3. 1–5 Quality Rating

Each chunked utterance receives a 1–5 rating: 1 = serious content/management errors leading to misconception, 2 = minor errors, 3 = ordinary, 4 = effective, 5 = exemplary. Ratings are heavily mid-scale (86.7% are 3; 1 and 5 each <0.5%), reflecting the high baseline of practice in Korean classrooms but also creating meaningful class imbalance for modeling.

4. Dual-Expertise Coding Process

Each subject team consists of one science-education faculty member plus two experienced teacher-coders. Faculty segment the transcripts; two coders independently rate and code each utterance with reference to the latest glossary, then reconcile via discussion. Eight teacher-raters (6–22 years of teaching experience) and four science-education faculty participated. Where consensus could not be reached, the faculty member's vote carried the same weight as the coders', so that experiential expertise was duly respected.

5. Rater Consistency

For 1–5 ratings, weighted accuracy, precision, recall, and F1 all exceeded 0.9 across subjects. For the 20 thematic categories, agreement was lowest in life science (0.709–0.713), spanned 0.808–0.973 in physics, and exceeded 0.90 in chemistry and earth science. Overall, these values compare favorably with the moderate-to-high range typical of prior NCTE studies (κ ≈ 0.3–0.4 to ICC ≈ 0.8–0.9).
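As a rough illustration of the support-weighted metrics named above, the sketch below scores one coder's labels against another's. Treating coder A as the reference, and the toy label lists themselves, are assumptions made for illustration only; the source does not describe how the agreement scores were computed.

```python
from collections import Counter


def weighted_prf(reference, other):
    """Support-weighted precision, recall, and F1, weighting each label by its reference count."""
    support = Counter(reference)
    n = len(reference)
    w_prec = w_rec = w_f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for r, o in zip(reference, other) if r == label and o == label)
        predicted = sum(1 for o in other if o == label)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / n
        w_prec += weight * prec
        w_rec += weight * rec
        w_f1 += weight * f1
    return w_prec, w_rec, w_f1


# Hypothetical ratings from two coders for ten utterances.
coder_a = [3, 3, 3, 3, 4, 2, 3, 3, 4, 3]
coder_b = [3, 3, 3, 4, 4, 3, 3, 3, 4, 3]

accuracy = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
w_prec, w_rec, w_f1 = weighted_prf(coder_a, coder_b)
print(accuracy, w_prec, w_rec, w_f1)
```

Support-weighted recall reduces to plain accuracy, because each label's recall is multiplied by that label's share of the data. This is why the weighted-recall and accuracy columns in the results tables carry identical values.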

LLM Methodology

Two tasks are evaluated: rating prediction (1–5 ordinal) and theme prediction (20-way classification).

Prompting (GPT-5-mini). Five strategies: Default, (a) Contextual Augmentation (5 preceding utterances for rating, 10 for theme), (b) Demonstration Selection (retrieval-augmented few-shot via Qwen3-Embedding-8B; K=7 ratings, K=12 themes), (c) Chain-of-Thought, (d) Subcategory Prompting (definitions + lower-level labels for each theme), and combinations.
Encoder Fine-Tuning. klue/roberta-large with a 2-layer MLP head over [CLS], trained end-to-end with AdamW (lr=2e−5, batch=256, ≤30 epochs). Loss is KL divergence against soft-label distributions: ordinal smoothing (mass at gold ±1) for ratings, label-smoothed one-hot for themes.
Decoder Fine-Tuning. google/gemma-3-12b-it, all parameters tuned for ≤6 epochs (batch=8–16, lr∈{1e−5, 1e−6, 1e−7}). Four imbalance strategies compared: cross-entropy, weighted CE, focal loss, and undersampling the majority class to 500 (rating) / 700 (theme) samples.
Bidirectional Decoder. gte-Qwen2-7B-instruct — an encoder-converted decoder — with the same four loss configurations (batch=128, weight decay=0.01, ≤30 epochs).
Hardware. Two NVIDIA H200 GPUs (141 GB VRAM each).
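The soft-label objective used for encoder fine-tuning can be pictured with the following minimal sketch in plain Python. The smoothing amounts (`neighbor_mass`, `eps`) are hypothetical values, since the source does not state them, and a real training loop would use a deep-learning framework rather than these helper functions.

```python
import math


def ordinal_soft_label(gold, num_classes=5, neighbor_mass=0.1):
    """Keep most mass on the gold rating and put `neighbor_mass` on each adjacent rating."""
    dist = [0.0] * num_classes
    idx = gold - 1  # ratings run 1..5, list indices run 0..4
    neighbors = [j for j in (idx - 1, idx + 1) if 0 <= j < num_classes]
    for j in neighbors:
        dist[j] = neighbor_mass
    dist[idx] = 1.0 - neighbor_mass * len(neighbors)
    return dist


def smoothed_one_hot(gold_idx, num_classes=20, eps=0.1):
    """Standard label smoothing: (1 - eps) on the gold theme plus eps spread uniformly."""
    return [(1.0 - eps) * (1.0 if j == gold_idx else 0.0) + eps / num_classes
            for j in range(num_classes)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(target, predicted):
    """KL(target || predicted); entries with zero target mass contribute nothing."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


target = ordinal_soft_label(3)            # mass on ratings 2, 3, and 4
near_miss = softmax([0, 0, 0, 3, 0])      # model favors rating 4
far_miss = softmax([0, 0, 0, 0, 3])       # model favors rating 5
print(kl_divergence(target, near_miss), kl_divergence(target, far_miss))
```

Under these assumed settings, a prediction one step from the gold rating incurs a smaller loss than one two steps away, which is the ordinal behavior a plain one-hot target would not provide.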

Results: Rating Prediction

Fine-tuning clearly outperforms prompting on the ordinal 1–5 rating task. Bidirectional Decoders with cross-entropy attain the strongest weighted F1, while Encoders with KL divergence achieve the highest accuracy.

Paradigm	Method	Accuracy (%)	Weighted Precision (%)	Weighted Recall (%)	Weighted F1 (%)
Prompting	Default	60.06	81.36	60.06	68.34
Prompting	Demonstration Selection (b)	69.28	81.98	69.28	74.54
Prompting	Chain-of-Thought (c)	64.62	83.75	64.62	71.15
Prompting	a + b + c	47.78	84.20	47.78	57.49
Fine-tuning	Encoder (KL-div)	88.24	77.95	88.24	82.78
Fine-tuning	Decoder (CE)	88.12	77.83	88.12	82.66
Fine-tuning	Bidirectional Decoder (CE)	87.92	83.50	87.92	84.59

Demonstration Selection is the single most useful prompting move, lifting rating accuracy from 60.06% to 69.28% — semantically similar few-shot exemplars help more than CoT or contextual history.
Naive context hurts: raw preceding turns (Contextual Augmentation alone) drop accuracy to 36.86%, suggesting unguided history adds noise rather than signal.
Cross-entropy beats imbalance-aware losses for rating fine-tuning: weighted CE, focal loss, and undersampling all underperform plain CE on Decoders and Bidirectional Decoders, because the dominant rating-3 class genuinely reflects authentic instructional quality.
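To make the compared losses concrete, the sketch below computes each one for a single example, given the probability the model assigns to the gold class. The inverse-frequency weighting formula, `gamma=2.0`, and the per-1,000 rating counts are assumptions for illustration (the counts are only chosen to be consistent with the reported 86.7% share of rating 3); the source does not specify these settings.

```python
import math


def cross_entropy(p_gold):
    """Plain CE for one example: negative log-probability of the gold class."""
    return -math.log(p_gold)


def weighted_cross_entropy(p_gold, class_weight):
    """CE scaled by a per-class weight, so rare classes count for more."""
    return -class_weight * math.log(p_gold)


def focal_loss(p_gold, gamma=2.0):
    """Focal loss: CE scaled by (1 - p)^gamma, shrinking the loss on easy examples."""
    return -((1.0 - p_gold) ** gamma) * math.log(p_gold)


def inverse_frequency_weights(counts):
    """One common convention: weight = n / (num_classes * class_count)."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical rating counts per 1,000 utterances.
counts = {1: 4, 2: 50, 3: 867, 4: 75, 5: 4}
weights = inverse_frequency_weights(counts)

p = 0.9  # a confidently correct prediction
print(cross_entropy(p), weighted_cross_entropy(p, weights[3]), focal_loss(p))
```

Both alternatives shrink the contribution of confidently predicted majority-class examples. That helps when the majority class is a nuisance, but it may cost accuracy when, as reported here, the majority rating is itself the signal.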

Results: Theme Prediction

Theme prediction across 20 expert-defined categories is harder, and the prompting–fine-tuning gap narrows: the best prompting configuration (a + b + c + d) reaches a weighted F1 of 52.01%, comparable to the best fine-tuned models.

Paradigm	Method	Accuracy (%)	Weighted Precision (%)	Weighted Recall (%)	Weighted F1 (%)
Prompting	Default	27.22	33.42	27.22	28.92
Prompting	Chain-of-Thought (c)	47.99	50.64	47.99	48.00
Prompting	a + b + c + d	52.22	54.65	52.22	52.01
Fine-tuning	Encoder (KL-div)	53.18	51.58	53.18	51.63
Fine-tuning	Decoder (Undersampling)	49.79	55.66	49.79	50.37
Fine-tuning	Bidirectional Decoder (Weighted CE)	53.92	52.47	53.92	51.93

Each prompting strategy adds value cumulatively (a + b + c + d > b + c + d > b + d > CoT alone), unlike rating prediction where naive stacking hurt.
Subcategory prompting helps: feeding the model lower-level labels per theme sharpens decision boundaries on the 20-way space.
Best-prompted GPT-5-mini ties fine-tuned models on weighted F1 (52.01% vs. 51.63% encoder, 51.93% bidirectional decoder), but fine-tuning yields more stable performance overall — especially on minority themes.
Five categories — problem identification, drawing conclusions, use of media, instructional strategies, and student communication — each account for <1% of inputs, exposing where additional teacher annotation is most valuable.
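The stacking of Contextual Augmentation (a) and Demonstration Selection (b) can be sketched roughly as below. The two-dimensional vectors stand in for real embeddings, and the prompt template, field names, and example utterances are all hypothetical; the study itself reports Qwen3-Embedding-8B embeddings with K=12 demonstrations and 10 preceding utterances for theme prediction.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_demonstrations(query_vec, pool, k):
    """Return the k labeled examples whose embeddings are most similar to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(utterance, history, demos, context_size):
    """Assemble a prompt from preceding utterances (a) and retrieved demonstrations (b)."""
    context = history[-context_size:] if context_size > 0 else []
    lines = ["Preceding utterances:"]
    lines += [f"- {h}" for h in context]
    lines.append("Labeled examples:")
    lines += [f"- {d['text']} => {d['label']}" for d in demos]
    lines.append(f"Utterance to code: {utterance}")
    return "\n".join(lines)


# Toy pool; the labels reuse Classroom Management categories named in the glossary section.
pool = [
    {"text": "Let's summarize today's lesson.", "label": "lesson summary", "vec": [1.0, 0.0]},
    {"text": "Open your textbook to the next page.", "label": "reading textbooks", "vec": [0.0, 1.0]},
    {"text": "Copy this task into your notebook.", "label": "task presentation", "vec": [0.7, 0.7]},
]

demos = select_demonstrations([0.9, 0.1], pool, k=2)
history = ["We measured the current.", "Then we plotted the data."]
print(build_prompt("So, to wrap up what we learned.", history, demos, context_size=1))
```

Chain-of-Thought (c) and Subcategory Prompting (d) would add further instruction text and category definitions to the same prompt; they are omitted here to keep the sketch short.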

Why It Matters

The study contributes the first utterance-level LLM coding framework purpose-built for science classroom discourse, grounded in established observation protocols (RTOP, ISIOP) yet aligned with the predicate-level granularity that LLMs handle best. Methodologically, it shows that LLMs can engage meaningfully with fine-grained, expert-curated instructional categories when supported by a theoretically informed glossary — and that integrating science-education faculty with experienced teachers strengthens both interpretability and pedagogical relevance.

Domain coverage: the first framework spanning all four middle-school science domains (physics, chemistry, life, earth) with a unified 20-category, 137-term scheme.
Fine-tuning still wins for ordinal ratings (Bidirectional Decoder CE: 84.59% W-F1), but well-designed prompting can match fine-tuning on theme classification (52.01% vs. 51.93%) — a practical option when labeled data is scarce.
Toward formative feedback for teacher PD: the utterance-level focus aligns with how RTOP/ISIOP-style instructional moves are operationalized, enabling actionable, theory-grounded feedback for teacher self-reflection.
Availability: the 137-term glossary is available upon request, supporting replication and cross-context validation in other languages and educational systems.

Links

ACM DL DOI