
RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

The 5th Workshop on Mathematical Reasoning and AI (MATH-AI) at NeurIPS 2025
Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee

One-Line Summary

RAISE is a three-stage retrieval-augmented framework that decomposes scientific problems into subquestions, generates logic-enriched queries capturing reasoning intent, and retrieves step-relevant documents from open-domain corpora—achieving an average 13% improvement over baselines on graduate-level science benchmarks (GPQA, SuperGPQA, MMLU).

Figure 2. Examples comparing query generation methods (Step-back, HyDE, and RAISE) for the same subquestion, illustrating how RAISE produces logic-enriched queries that capture reasoning intent beyond surface-level domain similarity.

Background & Motivation

Scientific reasoning tasks require LLMs to handle long-chain reasoning processes alongside domain-specific terminology and up-to-date knowledge. Two common strategies exist: step-wise reasoning (e.g., Chain-of-Thought decomposition) and retrieval-augmented generation (RAG). Recent work combines them, but typically targets simpler multi-hop QA or assumes curated, task-specific corpora.

Conventional RAG approaches retrieve documents using a single query derived from the full problem, which often returns vague or superficially related content that fails to support the multi-step logic needed for graduate-level science questions. The fundamental challenge is deciding what to search for, and how to retrieve the appropriate external knowledge, at each step of a complex scientific reasoning task.

Key Problem: Existing retrieval methods (e.g., standard RAG, HyDE, Step-Back) match documents based on surface similarity rather than logical relevance. Retrieved passages share domain keywords but lack the essential scientific mechanisms—such as reaction mechanisms, mathematical derivations, or physical principles—needed to actually solve each reasoning step. Neither initial search queries (which lack reasoning context) nor subquestions alone (which can be noisy or overly specific) are sufficient for effective retrieval.

Proposed Method: RAISE

RAISE (Retrieval-Augmented framework for Improving Scientific rEasoning) uses Dense Passage Retrieval (DPR) over approximately 21 million Wikipedia passages (each ~100 words) as its retrieval backbone, requiring no task-specific corpus. It operates through three sequential phases:

1. Problem Decomposition. The LLM breaks down the original question into n subquestions (r1, ..., rn) paired with corresponding initial search queries (q1, ..., qn). These initial queries serve as input for the next stage rather than being used directly for retrieval. The decomposition structures the reasoning pathway and ensures that different reasoning steps can access distinct pieces of information.

2. Logical Query Generation. For each subquestion, the model generates a logically enriched query (qi*) by combining the initial query qi with its subquestion ri through a reformulation prompt. A key insight is that even if the reformulated query contains factual inaccuracies, it tends to retrieve passages that are logically relevant and supportive of the reasoning step, because it captures the underlying reasoning intent rather than just surface keywords.

3. Logical Retrieval & Answer Composition. For each subquestion, the top-10 documents are retrieved using DPR with inner-product similarity on L2-normalized embeddings. A similarity threshold (T = 0.84 for GPQA/SuperGPQA/MMLU-Pro; T = 0.80 for MMLU-STEM) filters out low-relevance passages. The model then generates a subanswer from the filtered documents, the original question, and all previous subanswers. Finally, all subanswers are composed to produce the final answer. (A minimal code sketch of the full loop follows below.)
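To make the three phases concrete, here is a minimal Python sketch of the RAISE loop. It is an illustration under stated assumptions, not the paper's implementation: `llm` and `embed` are hypothetical stand-ins for an instruction-tuned model and a DPR question encoder, the decomposition prompt and its "subquestion ||| query" output format are invented placeholders, and the passage matrix `doc_vecs` is assumed to hold pre-normalized embeddings.

```python
"""Minimal sketch of the RAISE loop (hypothetical interfaces, not the
paper's code). `llm` and `embed` stand in for an instruction-tuned model
and a DPR question encoder; prompts and parsing are placeholders."""

from typing import Callable, List, Tuple

import numpy as np

LLM = Callable[[str], str]            # prompt -> completion
Embed = Callable[[str], np.ndarray]   # text -> embedding vector


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Normalize so that inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def decompose(question: str, llm: LLM) -> List[Tuple[str, str]]:
    """Stage 1: subquestions r_i paired with initial search queries q_i.

    The 'subquestion ||| query' line format is an invented convention
    for easy parsing, not the paper's actual prompt.
    """
    raw = llm(
        "Decompose the problem into subquestions, each with an initial "
        f"search query, one per line as 'subquestion ||| query'.\n{question}"
    )
    return [
        tuple(line.split(" ||| ", 1))
        for line in raw.splitlines()
        if " ||| " in line
    ]


def reformulate(r_i: str, q_i: str, llm: LLM) -> str:
    """Stage 2: combine q_i with r_i into a logic-enriched query q_i*."""
    return llm(
        "Rewrite the search query so it expresses the reasoning intent of "
        f"the subquestion.\nSubquestion: {r_i}\nInitial query: {q_i}"
    )


def retrieve(q_star: str, embed: Embed, doc_vecs: np.ndarray,
             docs: List[str], top_k: int = 10,
             threshold: float = 0.84) -> List[str]:
    """Stage 3a: top-k DPR retrieval plus the similarity-threshold filter.

    doc_vecs is assumed to hold L2-normalized passage embeddings, so the
    inner product below is a cosine similarity and T acts as a cutoff.
    """
    sims = doc_vecs @ l2_normalize(embed(q_star))
    top = np.argsort(-sims)[:top_k]
    return [docs[i] for i in top if sims[i] >= threshold]


def raise_answer(question: str, llm: LLM, embed: Embed,
                 docs: List[str], doc_vecs: np.ndarray) -> str:
    """Run all three stages and compose the subanswers."""
    subanswers: List[str] = []
    for r_i, q_i in decompose(question, llm):
        passages = retrieve(reformulate(r_i, q_i, llm), embed, doc_vecs, docs)
        # Stage 3b: answer the subquestion from the filtered passages,
        # the original question, and all previous subanswers.
        subanswers.append(llm(
            f"Question: {question}\nSubquestion: {r_i}\n"
            f"Passages: {passages}\nPrevious subanswers: {subanswers}\nAnswer:"
        ))
    return llm(f"Question: {question}\nSubanswers: {subanswers}\nFinal answer:")
```

One consequence of the threshold design is visible in this sketch: if every candidate passage for a step falls below T, the subquestion is answered without retrieved context rather than with superficially related passages.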

Experimental Results

RAISE is evaluated on GPQA Diamond (198 expert-written graduate-level questions in physics, biology, and chemistry), SuperGPQA (graduate-level questions across science and engineering at multiple difficulty levels), and MMLU (college/professional-level chemistry and biology). The primary models are Mistral Small 3.1-Instruct-2503 (24B) for GPQA and LLaMA 3.1-8B for SuperGPQA/MMLU. Seven baselines are compared: CoT, CoT+RAG, Least-to-Most, Step-Back, Least-to-Most+RAG, Step-Back+RAG, and HyDE.

Main results (accuracy, %):

| Method | GPQA Diamond | SuperGPQA Sci-Hard | SuperGPQA Sci-Mid | SuperGPQA Eng-Hard | MMLU College Chem |
| --- | --- | --- | --- | --- | --- |
| CoT | 42.42 | 4.52 | 15.08 | 6.53 | 49.50 |
| CoT + RAG | 45.96 | 7.54 | 12.56 | 7.54 | 43.00 |
| Least-to-Most | 44.95 | 6.03 | 14.57 | 10.05 | 45.40 |
| Step-Back | 44.44 | 5.03 | 15.08 | 6.03 | 43.00 |
| Least-to-Most + RAG | 45.95 | 6.03 | 14.57 | 8.04 | 46.00 |
| Step-Back + RAG | 43.43 | 5.53 | 15.58 | 9.05 | 43.00 |
| HyDE | 46.46 | 7.54 | 13.07 | 7.04 | 49.00 |
| RAISE | 51.01 | 10.05 | 19.60 | 10.55 | 51.00 |

Cross-model generalization on GPQA Diamond:

| Model | CoT | HyDE | RAISE | Gain |
| --- | --- | --- | --- | --- |
| LLaMA 3.1-8B | 22.22 | 25.75 | 30.30 | +7.1% |
| GPT-4o mini | 40.91 | 38.89 | 47.98 | +5.6% |
| Mistral-24B | 42.42 | 46.46 | 51.01 | +9.8% |

Why It Matters

Scientific reasoning is one of the most challenging frontiers for LLMs, requiring both domain knowledge and multi-step logical deduction. RAISE demonstrates that step-wise decomposition combined with logic-enriched retrieval can substantially improve performance on graduate-level science questions, achieving an average 13% improvement across benchmarks—without requiring additional model training or curated domain-specific corpora.

The framework is practical and broadly applicable: it uses only open-domain Wikipedia as the retrieval corpus, works across different model scales (8B to 24B parameters), and generalizes across physics, chemistry, biology, and engineering. The key insight—that retrieval queries should capture reasoning intent rather than surface-level domain similarity—opens a promising direction for improving LLM performance on complex reasoning tasks in science, mathematics, and beyond.
