
Development and Evaluation of a Dual-Expertise, Utterance-Level Framework for LLM-Based Science Classroom Discourse Analysis

LAK 2026
Jin Eun Yoo, Nam-Hwa Kang, Suna Ryu, Jun-ki Lee, Youngsun Kwak, Taeuk Kim, Hyeong Gwan Kim, Youngwoo Shin, Uiji Hwang

One-Line Summary

A dual-expertise, utterance-level coding framework for LLM-based science classroom discourse analysis — co-developed by science-education faculty and in-service middle-school science teachers, operationalized as a 137-term glossary across 20 thematic categories, validated on 68 authentic Korean middle-school science lessons (8,651 utterances), and benchmarked across prompting and fine-tuning paradigms (best weighted F1 84.59% for rating, 52.01% for theme).

Background & Motivation

Existing automated classroom-discourse analyses (CLASS, MQI, NCTE) were designed for global, lesson-level scoring rather than the local, utterance-level decisions teachers need for actionable feedback. Prior LLM applications also concentrate on mathematics or English-language arts — science discourse, which is interdisciplinary by nature (a single teacher often instructs outside their major), has received much less attention.

Two pressures motivate the framework. First, fine-grained, predicate-level utterances are a better fit for LLM input than 7–15 minute fixed segments, but no LLM-ready coding scheme yet covers all instructional dimensions of authentic K–12 science discourse. Second, ratings and themes derived from raw transcripts must remain pedagogically meaningful, which requires both academic theory (science-education faculty) and classroom realism (experienced teachers).

Research questions. (RQ1) How can domain knowledge in science education be systematically incorporated into an LLM-based coding framework? (RQ2) How do different LLM architectures and configurations compare in predicting rating and theme categories derived from authentic science classroom data?

Dataset

The corpus comprises 68 instructional sessions from 12 in-service middle-school science teachers in South Korea (Fall 2024). Each teacher recorded at least three ~45-minute lectures in both their major and a non-major science discipline.

| Subject | Sessions | Coded Utterances | Mean Rating (SD) |
| --- | --- | --- | --- |
| Physics | 15 (22.1%) | 2,001 | 3.025 (0.263) |
| Chemistry | 15 (22.1%) | 2,636 | 2.989 (0.408) |
| Life Science | 21 (30.9%) | 1,500 | 3.130 (0.363) |
| Earth Science | 17 (25.0%) | 2,514 | 3.144 (0.444) |
| Total | 68 | 8,651 unique utterances | |

After disaggregating utterances coded under multiple themes (769 dual- and 8 triple-labeled), the analytic dataset contains 9,436 theme–utterance instances, split 8:1:1 into 7,549 / 943 / 944 train/val/test, stratified jointly by theme and rating. 45.59% of sessions were taught in a teacher's major and 54.41% outside it — reflecting the interdisciplinary realities of Korean middle-school science instruction.
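The instance count follows from the labeling figures: each dual-labeled utterance adds one extra instance and each triple-labeled utterance adds two, so 8,651 + 769 + 2 × 8 = 9,436. The sketch below illustrates the disaggregation step and one possible jointly stratified 8:1:1 split. The record fields (`utt_id`, `themes`, `rating`), the function names, and the per-stratum rounding rule are assumptions made for illustration and are not taken from the authors' pipeline, so the split sizes it yields need not match the reported 7,549 / 943 / 944.

```python
import random
from collections import defaultdict


def disaggregate(utterances):
    """Expand each utterance into one (utterance, theme) instance per theme label."""
    return [
        {"utt_id": u["utt_id"], "theme": theme, "rating": u["rating"]}
        for u in utterances
        for theme in u["themes"]
    ]


def stratified_split(instances, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split instances into train/val/test within each (theme, rating) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for inst in instances:
        strata[(inst["theme"], inst["rating"])].append(inst)

    train, val, test = [], [], []
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Instance count implied by the figures reported above.
n_unique, n_dual, n_triple = 8651, 769, 8
n_instances = n_unique + n_dual * 1 + n_triple * 2  # 9436
```

Very small strata (for example, the rare 1 and 5 ratings) may end up entirely in the training split under this rounding rule; a real pipeline would need an explicit policy for them.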

Dual-Expertise Coding Framework

1. Predicate-Level Chunking
After comparing sentence-level and paragraph-level units, the team adopted predicate-level chunking as the analytic unit because it best supports fine-grained, LLM-friendly modeling and accommodates the variability in teacher speech (extended utterances vs. brief phrases) and STT segmentation noise. Science-education faculty perform segmentation; coders take responsibility for coding the chunked utterances.
2. 137-Term Glossary, 20 Thematic Categories
A bottom-up glossary was developed from 678 initial terms and consolidated to 137 instructional terms through multiple rounds of refinement involving four science-education professors and one professor of educational evaluation. Theme 1 (Teaching Practices) contains 123 terms across 17 categories (e.g., tentativeness of scientific knowledge, explanation of the development of science and technology); Theme 2 (Classroom Management) contains 14 terms across 3 categories (e.g., task presentation, reading textbooks, lesson summary). Development was informed by RTOP and ISIOP observation protocols.
3. 1–5 Quality Rating
Each chunked utterance receives a 1–5 rating: 1 = serious content/management errors leading to misconception, 2 = minor errors, 3 = ordinary, 4 = effective, 5 = exemplary. Ratings are heavily mid-scale (86.7% are 3; 1 and 5 each <0.5%), reflecting the high baseline of practice in Korean classrooms but also creating meaningful class imbalance for modeling.
4. Dual-Expertise Coding Process
Each subject team consists of one science-education faculty member plus two experienced teacher-coders. Faculty segment the transcripts; two coders independently rate and code each utterance with reference to the latest glossary, then reconcile via discussion. 8 teacher-raters (6–22 years of teaching experience) and 4 science-education faculty participated. Where consensus could not be reached, the faculty member's vote carried equal weight to the coders' — ensuring experiential expertise is duly respected.
5. Rater Consistency
For 1–5 ratings, weighted accuracy, precision, recall, and F1 all exceeded 0.9 across subjects. For the 20 thematic categories, agreement was lowest in life science (0.709–0.713) and spanned 0.808–0.973 in physics, while chemistry and earth science exceeded 0.90. These values are substantially stronger than the moderate-to-high range typical of prior NCTE studies (κ ≈ 0.3–0.4 to ICC ≈ 0.8–0.9).
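The consistency figures above are reported as weighted accuracy, precision, recall, and F1, whereas κ is cited only for prior studies. For readers who want a chance-corrected statistic on the same kind of paired codes, the following is a minimal sketch of Cohen's κ for two coders. The function name and the toy ratings are hypothetical, and nothing here reproduces the authors' own computation.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of utterances")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: probability that both coders pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 ratings from two coders on six utterances.
ratings_a = [3, 3, 3, 4, 2, 3]
ratings_b = [3, 3, 4, 4, 2, 3]
kappa = cohens_kappa(ratings_a, ratings_b)  # 5/7, roughly 0.714
```

With a distribution as concentrated as the one described above (86.7% of ratings are 3), chance agreement is high, so κ can be noticeably lower than raw agreement on the same data.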

LLM Methodology

Two tasks are evaluated: rating prediction (1–5 ordinal) and theme prediction (20-way classification).

Results: Rating Prediction

Fine-tuning clearly outperforms prompting on the ordinal 1–5 rating task. Bidirectional Decoders with cross-entropy attain the strongest weighted F1, while Encoders with KL divergence achieve the highest accuracy.

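| Paradigm | Method | Accuracy (%) | Weighted Precision (%) | Weighted Recall (%) | Weighted F1 (%) |
| --- | --- | --- | --- | --- | --- |
| Prompting | Default | 60.06 | 81.36 | 60.06 | 68.34 |
| Prompting | Demonstration Selection (b) | 69.28 | 81.98 | 69.28 | 74.54 |
| Prompting | Chain-of-Thought (c) | 64.62 | 83.75 | 64.62 | 71.15 |
| Prompting | a + b + c | 47.78 | 84.20 | 47.78 | 57.49 |
| Fine-tuning | Encoder (KL-div) | 88.24 | 77.95 | 88.24 | 82.78 |
| Fine-tuning | Decoder (CE) | 88.12 | 77.83 | 88.12 | 82.66 |
| Fine-tuning | Bidirectional Decoder (CE) | 87.92 | 83.50 | 87.92 | 84.59 |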
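The accuracy and weighted-recall columns are identical in every row. This is expected: weighting each class's recall by its share of the true labels yields the overall fraction of correct predictions. The sketch below computes support-weighted metrics from scratch on a toy example. The function name and the toy labels are illustrative assumptions, and the numbers are not drawn from the study's data.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)       # true-label counts per class
    predicted = Counter(y_pred)     # predicted-label counts per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n          # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy case: a model that always predicts the majority rating (3).
y_true = [3] * 8 + [2, 4]
y_pred = [3] * 10
metrics = weighted_prf(y_true, y_pred)
# accuracy and weighted recall are both 0.8; weighted precision is 0.64
```

The toy case also shows why accuracy can be high while weighted precision lags on an imbalanced scale: classes that are never predicted contribute zero precision.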

Results: Theme Prediction

Theme prediction across 20 expert-defined categories is harder, and the prompting–fine-tuning gap narrows: the best prompting configuration (a + b + c + d) reaches a weighted F1 of 52.01%, comparable to the best fine-tuned models.

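| Paradigm | Method | Accuracy (%) | Weighted Precision (%) | Weighted Recall (%) | Weighted F1 (%) |
| --- | --- | --- | --- | --- | --- |
| Prompting | Default | 27.22 | 33.42 | 27.22 | 28.92 |
| Prompting | Chain-of-Thought (c) | 47.99 | 50.64 | 47.99 | 48.00 |
| Prompting | a + b + c + d | 52.22 | 54.65 | 52.22 | 52.01 |
| Fine-tuning | Encoder (KL-div) | 53.18 | 51.58 | 53.18 | 51.63 |
| Fine-tuning | Decoder (Undersampling) | 49.79 | 55.66 | 49.79 | 50.37 |
| Fine-tuning | Bidirectional Decoder (Weighted CE) | 53.92 | 52.47 | 53.92 | 51.93 |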
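The table names a "Weighted CE" configuration without stating how the class weights were chosen. As a rough illustration of the general idea, the sketch below uses inverse-frequency weights, one common choice, and computes a weighted cross-entropy normalized by the total weight of the batch. The weighting scheme, function names, and toy inputs are all assumptions and should not be read as the authors' configuration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true label), normalized by the sum of the weights.

    probs: one dict per example mapping class label -> predicted probability.
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Toy example: four utterances of a frequent theme, one of a rare theme.
train_labels = ["explanation", "explanation", "explanation", "explanation", "summary"]
class_weights = inverse_frequency_weights(train_labels)
# explanation -> 0.625, summary -> 2.5

batch_probs = [
    {"explanation": 0.9, "summary": 0.1},
    {"explanation": 0.6, "summary": 0.4},
]
batch_labels = ["explanation", "summary"]
loss = weighted_cross_entropy(batch_probs, batch_labels, class_weights)
```

In this toy batch the error on the rare "summary" example dominates the loss because its weight is four times that of the frequent class, which is the effect weighting is meant to have on under-represented themes.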

Why It Matters

The study contributes the first utterance-level LLM coding framework purpose-built for science classroom discourse, grounded in established observation protocols (RTOP, ISIOP) yet aligned with the predicate-level granularity that LLMs handle best. Methodologically, it shows that LLMs can engage meaningfully with fine-grained, expert-curated instructional categories when supported by a theoretically informed glossary — and that integrating science-education faculty with experienced teachers strengthens both interpretability and pedagogical relevance.
