RAISE is a three-stage retrieval-augmented framework that decomposes scientific problems into subquestions, generates logic-enriched queries capturing the reasoning intent of each step, and retrieves step-relevant documents from an open-domain corpus. On graduate- and college-level science benchmarks (GPQA, SuperGPQA, MMLU), it achieves an average 13% improvement over baselines.
Scientific reasoning tasks require LLMs to handle long-chain reasoning processes alongside domain-specific terminology and up-to-date knowledge. Two common strategies exist: step-wise reasoning (e.g., Chain-of-Thought decomposition) and retrieval-augmented generation (RAG). Recent work combines them, but typically targets simpler multi-hop QA or assumes curated, task-specific corpora.
Conventional RAG approaches retrieve documents using a single query derived from the full problem, which often returns vague or superficially related content that fails to support the multi-step logic needed for graduate-level science questions. The fundamental challenge is determining what to search for, and how to retrieve the appropriate external knowledge, at each step of a complex scientific reasoning task.
Key Problem: Existing retrieval methods (e.g., standard RAG, HyDE, Step-Back) match documents based on surface similarity rather than logical relevance. Retrieved passages share domain keywords but lack the essential scientific mechanisms—such as reaction mechanisms, mathematical derivations, or physical principles—needed to actually solve each reasoning step. Neither initial search queries (which lack reasoning context) nor subquestions alone (which can be noisy or overly specific) are sufficient for effective retrieval.
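To make this failure mode concrete, below is a minimal sketch of the single-query dense retrieval that conventional RAG performs, using a sentence-transformers encoder and a FAISS inner-product index as stand-ins for the DPR setup described later. The toy passages, the encoder checkpoint, and the `retrieve` helper are illustrative assumptions, not artifacts from the paper.

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Dense retrieval over a passage corpus. Conventional RAG sends the whole
# problem statement as one query, so matches are driven by shared surface
# vocabulary rather than by the principle a specific reasoning step needs.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a DPR encoder

passages = [
    "Benzene undergoes electrophilic aromatic substitution with nitronium ions.",
    "The Arrhenius equation relates the rate constant to temperature.",
    # ... ~21 million Wikipedia passages of ~100 words each in the actual setup
]
passage_vecs = encoder.encode(passages, normalize_embeddings=True)

index = faiss.IndexFlatIP(passage_vecs.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(passage_vecs, dtype=np.float32))

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the top-k passages most similar to the query embedding."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype=np.float32), k)
    return [passages[i] for i in ids[0] if i >= 0]

# Conventional RAG: one query built from the full question text.
docs = retrieve("What is the major product when toluene reacts under acidic nitration conditions?")
```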
RAISE (Retrieval-Augmented framework for Improving Scientific rEasoning) operates through three sequential phases: (1) decomposition of the problem into ordered subquestions; (2) generation of logic-enriched queries that capture the reasoning intent of each step; and (3) retrieval of step-relevant documents from the open-domain corpus. The retrieval backbone is Dense Passage Retrieval (DPR) over approximately 21 million Wikipedia passages (each ~100 words), with no task-specific corpus required.
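A minimal sketch of how these three phases could be chained is shown below, assuming an OpenAI-compatible chat client and the `retrieve` helper from the previous snippet. The prompts, the function name `raise_style_answer`, and the model identifier are illustrative paraphrases, not the paper's actual prompts or code.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the paper uses Mistral Small 3.1 24B / LLaMA 3.1 8B
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def raise_style_answer(problem: str, retrieve, k: int = 3) -> str:
    # Phase 1: decompose the problem into ordered subquestions.
    sub_questions = ask(
        "Decompose the following problem into the ordered subquestions needed to solve it, "
        f"one per line:\n{problem}"
    ).splitlines()

    evidence = []
    for sq in (s.strip() for s in sub_questions):
        if not sq:
            continue
        # Phase 2: rewrite the subquestion as a logic-enriched query that names
        # the principle, mechanism, or derivation this step depends on.
        logic_query = ask(
            f"Problem: {problem}\nSubquestion: {sq}\n"
            "Write a retrieval query describing the scientific principle or derivation needed for this step."
        )
        # Phase 3: retrieve step-relevant passages from the open-domain corpus.
        evidence.extend(retrieve(logic_query, k=k))

    context = "\n".join(dict.fromkeys(evidence))  # de-duplicate while keeping order
    return ask(
        "Use the following passages as reference and solve the problem step by step.\n"
        f"{context}\n\nProblem: {problem}"
    )
```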
RAISE is evaluated on GPQA Diamond (198 expert-written graduate-level questions in physics, biology, and chemistry), SuperGPQA (graduate-level questions across science and engineering at multiple difficulty levels), and MMLU (college/professional-level chemistry and biology). The primary LLM is Mistral Small 3.1-Instruct-2503 (24B) for GPQA and LLaMA 3.1-8B for SuperGPQA/MMLU. Seven baselines are compared: CoT, CoT+RAG, Least-to-Most, Step-Back, Least-to-Most+RAG, Step-Back+RAG, and HyDE.
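As a rough illustration of how accuracy on these multiple-choice benchmarks can be scored, the sketch below prompts for a letter answer and compares it with the answer key. The prompt wording, dataset field names, and answer-extraction regex are assumptions, not the paper's evaluation harness.

```python
import re

def multiple_choice_accuracy(dataset, answer_fn) -> float:
    """dataset: iterable of dicts with 'question', 'choices' (list of str), 'answer' (a letter)."""
    correct, total = 0, 0
    for ex in dataset:
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(ex["choices"]))
        reply = answer_fn(f"{ex['question']}\n{options}\nGive the final answer as a single letter.")
        # Take the last standalone option letter in the reply as the prediction.
        matches = re.findall(r"\b([A-J])\b", reply)
        if matches and matches[-1] == ex["answer"]:
            correct += 1
        total += 1
    return 100.0 * correct / max(total, 1)

# e.g. multiple_choice_accuracy(gpqa_diamond, lambda q: raise_style_answer(q, retrieve))
# where gpqa_diamond is a hypothetical list of examples in the format above.
```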
Main results (accuracy, %):
| Method | GPQA Diamond | SuperGPQA Sci-Hard | SuperGPQA Sci-Mid | SuperGPQA Eng-Hard | MMLU College Chem |
|---|---|---|---|---|---|
| CoT | 42.42 | 4.52 | 15.08 | 6.53 | 49.50 |
| CoT + RAG | 45.96 | 7.54 | 12.56 | 7.54 | 43.00 |
| Least-to-Most | 44.95 | 6.03 | 14.57 | 10.05 | 45.40 |
| Step-Back | 44.44 | 5.03 | 15.08 | 6.03 | 43.00 |
| Least-to-Most + RAG | 45.95 | 6.03 | 14.57 | 8.04 | 46.00 |
| Step-Back + RAG | 43.43 | 5.53 | 15.58 | 9.05 | 43.00 |
| HyDE | 46.46 | 7.54 | 13.07 | 7.04 | 49.00 |
| RAISE | 51.01 | 10.05 | 19.60 | 10.55 | 51.00 |
Cross-model generalization on GPQA Diamond (accuracy, %):
| Model | CoT | HyDE | RAISE | Gain |
|---|---|---|---|---|
| LLaMA 3.1-8B | 22.22 | 25.75 | 30.30 | +7.1% |
| GPT-4o mini | 40.91 | 38.89 | 47.98 | +5.6% |
| Mistral-24B | 42.42 | 46.46 | 51.01 | +9.8% |
Scientific reasoning is one of the most challenging frontiers for LLMs, requiring both domain knowledge and multi-step logical deduction. RAISE demonstrates that step-wise decomposition combined with logic-enriched retrieval can substantially improve performance on graduate-level science questions, achieving an average 13% improvement across benchmarks—without requiring additional model training or curated domain-specific corpora.
The framework is practical and broadly applicable: it uses only open-domain Wikipedia as the retrieval corpus, works across different model scales (8B to 24B parameters), and generalizes across physics, chemistry, biology, and engineering. The key insight—that retrieval queries should capture reasoning intent rather than surface-level domain similarity—opens a promising direction for improving LLM performance on complex reasoning tasks in science, mathematics, and beyond.