RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

MATH-AI @ NeurIPS 2025
Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee

One-Line Summary

RAISE decomposes scientific problems into sub-questions, generates a logic-enriched retrieval query for each step, and retrieves logically relevant documents from open-domain corpora (e.g., Wikipedia), achieving an average 13% relative improvement over the best baselines on graduate-level science benchmarks.

Figure 1. Overview of RAISE. The framework consists of three steps: (1) Problem Decomposition into subquestions and search queries, (2) Logical Query Generation producing logic-enriched contexts, and (3) Logical Retrieval of relevant documents from an in-the-wild corpus for each subquestion to support step-by-step reasoning on scientific problems.

Background & Motivation

Scientific reasoning in LLMs requires both long-chain logical inference and domain-specific knowledge. Existing approaches fall into two camps -- step-wise reasoning (e.g., Chain-of-Thought), which structures multi-step inference, and retrieval-augmented generation (RAG), which grounds answers in external evidence. Combining the two remains challenging, however: standard RAG retrieves documents by surface-level similarity, which often misses the logically relevant knowledge needed at each reasoning step.

The core insight behind RAISE is that different intermediate reasoning steps require distinct pieces of information that cannot be jointly retrieved with a single query. For example, solving a graduate-level chemistry problem may require one step about reaction mechanisms and another about thermodynamic principles -- each needing separate, targeted retrieval. Moreover, merely retrieving domain-similar documents is insufficient; the retrieved content must contain the specific logical connections (e.g., scientific mechanisms, equations, principles) needed to advance each reasoning step.

Existing decomposition-based RAG methods each fall short: Least-to-Most prompting uses raw sub-question text as queries, Step-Back prompting abstracts to overly general queries, and HyDE generates hypothetical answers that can introduce hallucinations. RAISE addresses these limitations by reformulating queries to capture both the reasoning intent and the logical structure of each sub-problem, enabling retrieval from open-domain corpora like Wikipedia rather than curated, task-specific databases.

Proposed Method

1. Problem Decomposition
The LLM breaks down the original scientific question into a sequence of subquestions and corresponding initial search queries, {(r_i, q_i)}_{i=1}^n. Each subquestion r_i captures a specific reasoning step, while its initial query q_i provides a targeted retrieval handle. These serve as structured inputs for the next stage rather than as direct retrieval queries, ensuring the decomposition guides the entire downstream process.
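As a concrete illustration, a minimal sketch of this step might look like the following, assuming a generic `llm(prompt) -> str` callable and a simple `subquestion ||| query` output format; the paper's actual decomposition prompt and output parsing are not reproduced here.

```python
# Step 1 sketch (assumptions: a generic llm(prompt) -> str callable and a
# "subquestion ||| query" line format; not the paper's actual prompt).
from dataclasses import dataclass

@dataclass
class SubStep:
    subquestion: str  # r_i: one reasoning step of the original problem
    init_query: str   # q_i: its initial retrieval handle

DECOMPOSE_PROMPT = (
    "Decompose the problem into subquestions, one per line, each paired\n"
    "with a short search query in the form: subquestion ||| query\n\n"
    "Problem: {question}"
)

def decompose(question: str, llm) -> list[SubStep]:
    """Produce the sequence {(r_i, q_i)} by parsing the LLM's output."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    steps = []
    for line in raw.splitlines():
        if "|||" in line:
            r, q = line.split("|||", 1)
            steps.append(SubStep(r.strip(), q.strip()))
    return steps
```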
2. Logical Query Generation
Initial queries q_i alone lack reasoning context, while raw subquestions can be noisy. RAISE therefore reformulates each initial query into a logic-enriched query q_i* by combining both components through the LLM. The reformulated query captures the reasoning intent and encodes the logical structure needed to retrieve relevant knowledge. A key finding is that even when reformulated queries contain factual inaccuracies, they still tend to retrieve passages that are logically relevant and support the correct reasoning path.
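Continuing the sketch above, the reformulation can be pictured as a single LLM call that fuses r_i and q_i into q_i*; the prompt wording here is illustrative, not the paper's.

```python
# Step 2 sketch (continues the Step 1 sketch; the prompt wording is an
# illustrative placeholder, not the paper's reformulation prompt).
REFORMULATE_PROMPT = (
    "Rewrite the search query so that it states the scientific mechanism,\n"
    "equation, or principle needed to answer the subquestion.\n\n"
    "Subquestion: {subquestion}\nInitial query: {init_query}\n"
    "Logic-enriched query:"
)

def reformulate(step: SubStep, llm) -> str:
    """Fuse r_i and q_i into a logic-enriched query q_i*."""
    return llm(REFORMULATE_PROMPT.format(
        subquestion=step.subquestion,
        init_query=step.init_query,
    )).strip()
```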
3. Logical Retrieval & Step-by-Step Answering
For each subquestion, external knowledge D_i is retrieved from an in-the-wild corpus (21M Wikipedia passages) using DPR, with a similarity threshold T filtering out irrelevant documents. The model then generates a subanswer a_i conditioned on D_i, the original question, and all previous subanswers. Once every subquestion is answered, the LLM synthesizes the subanswers into a final answer. Four dedicated prompts control decomposition, query reformulation, subanswer generation, and final composition.
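Putting the pieces together, the loop below sketches retrieval, subanswer generation, and final composition. It assumes a `retriever.search(query, k)` method returning (passage, score) pairs ranked by DPR similarity; the threshold T, the prompts, and the interfaces are placeholders rather than the paper's exact implementation.

```python
# Step 3 sketch (continues the sketches above; retriever.search, the
# threshold T, and all prompts are placeholder assumptions).
def solve(question: str, llm, retriever, T: float = 0.0, top_k: int = 10) -> str:
    subanswers: list[str] = []
    for step in decompose(question, llm):
        q_star = reformulate(step, llm)
        # D_i: keep only passages whose similarity score clears T.
        docs = [p for p, score in retriever.search(q_star, k=top_k)
                if score >= T]
        # a_i: conditioned on D_i, the original question, and prior subanswers.
        a_i = llm(
            f"Question: {question}\n"
            f"Previous subanswers: {' '.join(subanswers)}\n"
            f"Evidence: {' '.join(docs)}\n"
            f"Subquestion: {step.subquestion}\nSubanswer:"
        )
        subanswers.append(a_i.strip())
    # Final composition: synthesize all subanswers into one answer.
    return llm(
        f"Question: {question}\nStep answers: {' '.join(subanswers)}\n"
        f"Final answer:"
    ).strip()
```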

Experimental Results

RAISE is evaluated on three challenging scientific reasoning benchmark suites (GPQA, SuperGPQA, and MMLU) using open-domain Wikipedia retrieval (21M passages, DPR retriever, top-10 documents per query). Baselines include direct prompting, Chain-of-Thought (CoT), Least-to-Most (L2M), and Step-Back prompting, their RAG variants (e.g., CoT+RAG), and HyDE.
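For reference, DPR similarity scoring can be reproduced approximately with the public Hugging Face checkpoints, as in the sketch below; the facebook/dpr-*-single-nq-base models and the toy passages are assumptions, since the paper's exact index and checkpoints are not specified here.

```python
# DPR scoring sketch (assumes the public facebook/dpr-* NQ checkpoints;
# the paper's exact retriever index is not reproduced).
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_enc = DPRQuestionEncoder.from_pretrained(q_name)
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
c_enc = DPRContextEncoder.from_pretrained(c_name)
c_tok = DPRContextEncoderTokenizer.from_pretrained(c_name)

query = "relation between Gibbs free energy and the equilibrium constant"
passages = [
    "The standard Gibbs free energy change satisfies dG = -RT ln K.",
    "Photosynthesis converts light energy into chemical energy.",
]

with torch.no_grad():
    q_emb = q_enc(**q_tok(query, return_tensors="pt")).pooler_output
    p_emb = c_enc(**c_tok(passages, return_tensors="pt", padding=True,
                          truncation=True)).pooler_output

# DPR ranks passages by inner-product similarity; a threshold T would
# filter low-scoring passages before they reach the reasoning step.
scores = (q_emb @ p_emb.T).squeeze(0)
print(scores)  # the thermodynamics passage should score higher
```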

Benchmark               | Best Baseline     | RAISE | Improvement
GPQA Diamond (198 Qs)   | 46.46 (HyDE)      | 51.01 | +9.8%
SuperGPQA Science-Hard  | 7.54 (CoT+RAG)    | 10.05 | +33.3%
SuperGPQA Science-Mid   | 15.58 (Step-Back) | 19.60 | +25.8%
SuperGPQA Eng-Hard      | 10.05 (L2M)       | 10.55 | +5.0%
MMLU Prof. Chemistry    | 25.44 (Direct)    | 28.36 | +11.5%
MMLU Prof. Biology      | 58.02 (L2M+RAG)   | 59.27 | +2.2%
MMLU College Chemistry  | 49.50 (Direct)    | 51.00 | +3.0%

Scores are accuracy (%); Improvement is RAISE's relative gain over the best baseline.

Why It Matters

RAISE demonstrates that different reasoning steps in scientific problems require distinct pieces of external knowledge, and that decomposing a problem and then retrieving logic-enriched evidence for each step is fundamentally more effective than single-query retrieval. Unlike prior methods that rely on curated or task-specific corpora, RAISE works with open-domain sources like Wikipedia, making it broadly applicable. The framework generalizes across model scales (8B to 24B parameters), difficulty levels (undergraduate to graduate), and scientific domains (physics, chemistry, biology, engineering). This points toward a practical paradigm for building science-capable AI systems in education, research assistance, and automated discovery -- wherever precise multi-step inference over specialized knowledge is required.
