RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

MATH-AI @ NeurIPS 2025
Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee

One-Line Summary

RAISE decomposes scientific problems into sub-questions, generates a logic-enriched retrieval query for each step, and retrieves logically relevant documents from open-domain corpora (e.g., Wikipedia), achieving an average 13% relative improvement over the best baselines on graduate-level science benchmarks.

Figure 1. Overview of RAISE. The framework consists of three steps: (1) Problem Decomposition into subquestions and search queries, (2) Logical Query Generation producing logic-enriched contexts, and (3) Logical Retrieval of relevant documents from an in-the-wild corpus for each subquestion to support step-by-step reasoning on scientific problems.

Background & Motivation

Scientific reasoning in LLMs requires both long-chain logical inference and domain-specific knowledge. Existing approaches fall into two camps -- step-wise reasoning (e.g., Chain-of-Thought), which structures multi-step inference, and retrieval-augmented generation (RAG), which grounds answers in external evidence. Combining the two remains challenging, however: standard RAG retrieves documents by surface-level similarity, which often misses the logically relevant knowledge needed at each reasoning step.

The core insight behind RAISE is that different intermediate reasoning steps require distinct pieces of information that cannot be jointly retrieved with a single query. For example, solving a graduate-level chemistry problem may require one step about reaction mechanisms and another about thermodynamic principles -- each needing separate, targeted retrieval. Moreover, merely retrieving domain-similar documents is insufficient; the retrieved content must contain the specific logical connections (e.g., scientific mechanisms, equations, principles) needed to advance each reasoning step.

Existing decomposition-based RAG methods each fall short: Least-to-Most prompting uses raw sub-question text as queries, Step-Back prompting abstracts to overly general queries, and HyDE generates hypothetical answers that can introduce hallucinations. RAISE addresses these limitations by reformulating queries to capture both the reasoning intent and the logical structure of each sub-problem, enabling retrieval from open-domain corpora like Wikipedia rather than curated, task-specific databases.

Proposed Method

1. Problem Decomposition
The LLM breaks down the original scientific question into a sequence of subquestions and corresponding initial search queries, {(r_i, q_i)}_{i=1}^n. Each subquestion r_i captures a specific reasoning step, while its initial query q_i provides a targeted retrieval handle. These serve as structured inputs for the next stage rather than as direct retrieval queries, ensuring the decomposition guides the entire downstream process.
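As a concrete illustration, a minimal sketch of this step might look like the following, assuming a generic `llm(prompt) -> str` callable and a simple `subquestion ||| query` output format; the paper's actual decomposition prompt and output parsing are not reproduced here.

```python
# Step 1 sketch (assumptions: a generic llm(prompt) -> str callable and a
# "subquestion ||| query" line format; not the paper's actual prompt).
from dataclasses import dataclass

@dataclass
class SubStep:
    subquestion: str  # r_i: one reasoning step of the original problem
    init_query: str   # q_i: its initial retrieval handle

DECOMPOSE_PROMPT = (
    "Decompose the problem into subquestions, one per line, each paired\n"
    "with a short search query in the form: subquestion ||| query\n\n"
    "Problem: {question}"
)

def decompose(question: str, llm) -> list[SubStep]:
    """Produce the sequence {(r_i, q_i)} by parsing the LLM's output."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    steps = []
    for line in raw.splitlines():
        if "|||" in line:
            r, q = line.split("|||", 1)
            steps.append(SubStep(r.strip(), q.strip()))
    return steps
```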
2. Logical Query Generation
Initial queries q_i alone lack reasoning context, while raw subquestions can be noisy. RAISE therefore reformulates each initial query into a logic-enriched query q_i* by combining both components through the LLM. The reformulated query captures the reasoning intent and encodes the logical structure needed to retrieve relevant knowledge. A key finding is that even when reformulated queries contain factual inaccuracies, they still tend to retrieve passages that are logically relevant and support the correct reasoning path.
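Continuing the sketch above, the reformulation can be pictured as a single LLM call that fuses r_i and q_i into q_i*; the prompt wording here is illustrative, not the paper's.

```python
# Step 2 sketch (continues the Step 1 sketch; the prompt wording is an
# illustrative placeholder, not the paper's reformulation prompt).
REFORMULATE_PROMPT = (
    "Rewrite the search query so that it states the scientific mechanism,\n"
    "equation, or principle needed to answer the subquestion.\n\n"
    "Subquestion: {subquestion}\nInitial query: {init_query}\n"
    "Logic-enriched query:"
)

def reformulate(step: SubStep, llm) -> str:
    """Fuse r_i and q_i into a logic-enriched query q_i*."""
    return llm(REFORMULATE_PROMPT.format(
        subquestion=step.subquestion,
        init_query=step.init_query,
    )).strip()
```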
3. Logical Retrieval & Step-by-Step Answering
For each subquestion, external knowledge D_i is retrieved from an in-the-wild corpus (21M Wikipedia passages) using DPR, with a similarity threshold T filtering out irrelevant documents. The model then generates a subanswer a_i conditioned on D_i, the original question, and all previous subanswers. Once every subquestion is answered, the LLM synthesizes the subanswers into a final answer. Four dedicated prompts control decomposition, query reformulation, subanswer generation, and final composition.
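Putting the pieces together, the loop below sketches retrieval, subanswer generation, and final composition. It assumes a `retriever.search(query, k)` method returning (passage, score) pairs ranked by DPR similarity; the threshold T, the prompts, and the interfaces are placeholders rather than the paper's exact implementation.

```python
# Step 3 sketch (continues the sketches above; retriever.search, the
# threshold T, and all prompts are placeholder assumptions).
def solve(question: str, llm, retriever, T: float = 0.0, top_k: int = 10) -> str:
    subanswers: list[str] = []
    for step in decompose(question, llm):
        q_star = reformulate(step, llm)
        # D_i: keep only passages whose similarity score clears T.
        docs = [p for p, score in retriever.search(q_star, k=top_k)
                if score >= T]
        # a_i: conditioned on D_i, the original question, and prior subanswers.
        a_i = llm(
            f"Question: {question}\n"
            f"Previous subanswers: {' '.join(subanswers)}\n"
            f"Evidence: {' '.join(docs)}\n"
            f"Subquestion: {step.subquestion}\nSubanswer:"
        )
        subanswers.append(a_i.strip())
    # Final composition: synthesize all subanswers into one answer.
    return llm(
        f"Question: {question}\nStep answers: {' '.join(subanswers)}\n"
        f"Final answer:"
    ).strip()
```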

Experimental Results

RAISE is evaluated on three challenging scientific reasoning benchmark suites (GPQA, SuperGPQA, and MMLU) using open-domain Wikipedia retrieval (21M passages, DPR retriever, top-10 documents per query). Baselines include direct prompting, Chain-of-Thought (CoT), Least-to-Most (L2M), and Step-Back prompting, their RAG variants (e.g., CoT+RAG), and HyDE.
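For reference, DPR similarity scoring can be reproduced approximately with the public Hugging Face checkpoints, as in the sketch below; the facebook/dpr-*-single-nq-base models and the toy passages are assumptions, since the paper's exact index and checkpoints are not specified here.

```python
# DPR scoring sketch (assumes the public facebook/dpr-* NQ checkpoints;
# the paper's exact retriever index is not reproduced).
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_enc = DPRQuestionEncoder.from_pretrained(q_name)
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
c_enc = DPRContextEncoder.from_pretrained(c_name)
c_tok = DPRContextEncoderTokenizer.from_pretrained(c_name)

query = "relation between Gibbs free energy and the equilibrium constant"
passages = [
    "The standard Gibbs free energy change satisfies dG = -RT ln K.",
    "Photosynthesis converts light energy into chemical energy.",
]

with torch.no_grad():
    q_emb = q_enc(**q_tok(query, return_tensors="pt")).pooler_output
    p_emb = c_enc(**c_tok(passages, return_tensors="pt", padding=True,
                          truncation=True)).pooler_output

# DPR ranks passages by inner-product similarity; a threshold T would
# filter low-scoring passages before they reach the reasoning step.
scores = (q_emb @ p_emb.T).squeeze(0)
print(scores)  # the thermodynamics passage should score higher
```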

Benchmark               | Best Baseline     | RAISE | Improvement
GPQA Diamond (198 Qs)   | 46.46 (HyDE)      | 51.01 | +9.8%
SuperGPQA Science-Hard  | 7.54 (CoT+RAG)    | 10.05 | +33.3%
SuperGPQA Science-Mid   | 15.58 (Step-Back) | 19.60 | +25.8%
SuperGPQA Eng-Hard      | 10.05 (L2M)       | 10.55 | +5.0%
MMLU Prof. Chemistry    | 25.44 (Direct)    | 28.36 | +11.5%
MMLU Prof. Biology      | 58.02 (L2M+RAG)   | 59.27 | +2.2%
MMLU College Chemistry  | 49.50 (Direct)    | 51.00 | +3.0%

Scores are accuracy (%); Improvement is RAISE's relative gain over the best baseline.

Why It Matters

RAISE demonstrates that different reasoning steps in scientific problems require distinct pieces of external knowledge, and that decomposing a problem and then retrieving logic-enriched evidence for each step is fundamentally more effective than single-query retrieval. Unlike prior methods that rely on curated or task-specific corpora, RAISE works with open-domain sources like Wikipedia, making it broadly applicable. The framework generalizes across model scales (8B to 24B parameters), difficulty levels (undergraduate to graduate), and scientific domains (physics, chemistry, biology, engineering). This points toward a practical paradigm for building science-capable AI systems in education, research assistance, and automated discovery -- wherever precise multi-step inference over specialized knowledge is required.
