RAISE is a three-stage retrieval-augmented framework that decomposes scientific problems into subquestions, generates logic-enriched queries capturing the reasoning intent of each step, and retrieves step-relevant documents from an open-domain corpus. On graduate- and college-level science benchmarks (GPQA, SuperGPQA, MMLU), it achieves an average 13% improvement over baselines.
Scientific reasoning tasks require LLMs to handle long-chain reasoning processes alongside domain-specific terminology and up-to-date knowledge. Two common strategies exist: step-wise reasoning (e.g., Chain-of-Thought decomposition) and retrieval-augmented generation (RAG). Recent work combines them, but typically targets simpler multi-hop QA or assumes curated, task-specific corpora.
Conventional RAG approaches retrieve documents using a single query derived from the full problem, which often returns vague or superficially related content that fails to support the multi-step logic needed for graduate-level science questions. The fundamental challenge is determining what to search for, and how to retrieve the appropriate external knowledge, at each step of a complex scientific reasoning task.
Key Problem: Existing retrieval methods (e.g., standard RAG, HyDE, Step-Back) match documents based on surface similarity rather than logical relevance. Retrieved passages share domain keywords but lack the essential scientific mechanisms—such as reaction mechanisms, mathematical derivations, or physical principles—needed to actually solve each reasoning step. Neither initial search queries (which lack reasoning context) nor subquestions alone (which can be noisy or overly specific) are sufficient for effective retrieval.
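To make this failure mode concrete, below is a minimal sketch of the single-query dense retrieval that conventional RAG performs, using a sentence-transformers encoder and a FAISS inner-product index as stand-ins for the DPR setup described later. The toy passages, the encoder checkpoint, and the `retrieve` helper are illustrative assumptions, not artifacts from the paper.

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Dense retrieval over a passage corpus. Conventional RAG sends the whole
# problem statement as one query, so matches are driven by shared surface
# vocabulary rather than by the principle a specific reasoning step needs.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a DPR encoder

passages = [
    "Benzene undergoes electrophilic aromatic substitution with nitronium ions.",
    "The Arrhenius equation relates the rate constant to temperature.",
    # ... ~21 million Wikipedia passages of ~100 words each in the actual setup
]
passage_vecs = encoder.encode(passages, normalize_embeddings=True)

index = faiss.IndexFlatIP(passage_vecs.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(passage_vecs, dtype=np.float32))

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the top-k passages most similar to the query embedding."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype=np.float32), k)
    return [passages[i] for i in ids[0] if i >= 0]

# Conventional RAG: one query built from the full question text.
docs = retrieve("What is the major product when toluene reacts under acidic nitration conditions?")
```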
RAISE (Retrieval-Augmented framework for Improving Scientific rEasoning) operates through three sequential phases: (1) decomposition of the problem into ordered subquestions; (2) generation of logic-enriched queries that capture the reasoning intent of each step; and (3) retrieval of step-relevant documents from the open-domain corpus. The retrieval backbone is Dense Passage Retrieval (DPR) over approximately 21 million Wikipedia passages (each ~100 words), with no task-specific corpus required.
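A minimal sketch of how these three phases could be chained is shown below, assuming an OpenAI-compatible chat client and the `retrieve` helper from the previous snippet. The prompts, the function name `raise_style_answer`, and the model identifier are illustrative paraphrases, not the paper's actual prompts or code.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the paper uses Mistral Small 3.1 24B / LLaMA 3.1 8B
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def raise_style_answer(problem: str, retrieve, k: int = 3) -> str:
    # Phase 1: decompose the problem into ordered subquestions.
    sub_questions = ask(
        "Decompose the following problem into the ordered subquestions needed to solve it, "
        f"one per line:\n{problem}"
    ).splitlines()

    evidence = []
    for sq in (s.strip() for s in sub_questions):
        if not sq:
            continue
        # Phase 2: rewrite the subquestion as a logic-enriched query that names
        # the principle, mechanism, or derivation this step depends on.
        logic_query = ask(
            f"Problem: {problem}\nSubquestion: {sq}\n"
            "Write a retrieval query describing the scientific principle or derivation needed for this step."
        )
        # Phase 3: retrieve step-relevant passages from the open-domain corpus.
        evidence.extend(retrieve(logic_query, k=k))

    context = "\n".join(dict.fromkeys(evidence))  # de-duplicate while keeping order
    return ask(
        "Use the following passages as reference and solve the problem step by step.\n"
        f"{context}\n\nProblem: {problem}"
    )
```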
RAISE is evaluated on GPQA Diamond (198 expert-written graduate-level questions in physics, biology, and chemistry), SuperGPQA (graduate-level questions across science and engineering at multiple difficulty levels), and MMLU (college/professional-level chemistry and biology). The primary LLM is Mistral Small 3.1-Instruct-2503 (24B) for GPQA and LLaMA 3.1-8B for SuperGPQA/MMLU. Seven baselines are compared: CoT, CoT+RAG, Least-to-Most, Step-Back, Least-to-Most+RAG, Step-Back+RAG, and HyDE.
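As a rough illustration of how accuracy on these multiple-choice benchmarks can be scored, the sketch below prompts for a letter answer and compares it with the answer key. The prompt wording, dataset field names, and answer-extraction regex are assumptions, not the paper's evaluation harness.

```python
import re

def multiple_choice_accuracy(dataset, answer_fn) -> float:
    """dataset: iterable of dicts with 'question', 'choices' (list of str), 'answer' (a letter)."""
    correct, total = 0, 0
    for ex in dataset:
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(ex["choices"]))
        reply = answer_fn(f"{ex['question']}\n{options}\nGive the final answer as a single letter.")
        # Take the last standalone option letter in the reply as the prediction.
        matches = re.findall(r"\b([A-J])\b", reply)
        if matches and matches[-1] == ex["answer"]:
            correct += 1
        total += 1
    return 100.0 * correct / max(total, 1)

# e.g. multiple_choice_accuracy(gpqa_diamond, lambda q: raise_style_answer(q, retrieve))
# where gpqa_diamond is a hypothetical list of examples in the format above.
```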
Main results (accuracy, %):
| Method | GPQA Diamond | SuperGPQA Sci-Hard | SuperGPQA Sci-Mid | SuperGPQA Eng-Hard | MMLU College Chem |
|---|---|---|---|---|---|
| CoT | 42.42 | 4.52 | 15.08 | 6.53 | 49.50 |
| CoT + RAG | 45.96 | 7.54 | 12.56 | 7.54 | 43.00 |
| Least-to-Most | 44.95 | 6.03 | 14.57 | 10.05 | 45.40 |
| Step-Back | 44.44 | 5.03 | 15.08 | 6.03 | 43.00 |
| Least-to-Most + RAG | 45.95 | 6.03 | 14.57 | 8.04 | 46.00 |
| Step-Back + RAG | 43.43 | 5.53 | 15.58 | 9.05 | 43.00 |
| HyDE | 46.46 | 7.54 | 13.07 | 7.04 | 49.00 |
| RAISE | 51.01 | 10.05 | 19.60 | 10.55 | 51.00 |
Cross-model generalization on GPQA Diamond (accuracy, %):
| Model | CoT | HyDE | RAISE | Gain |
|---|---|---|---|---|
| LLaMA 3.1-8B | 22.22 | 25.75 | 30.30 | +7.1% |
| GPT-4o mini | 40.91 | 38.89 | 47.98 | +5.6% |
| Mistral-24B | 42.42 | 46.46 | 51.01 | +9.8% |
Scientific reasoning is one of the most challenging frontiers for LLMs, requiring both domain knowledge and multi-step logical deduction. RAISE demonstrates that step-wise decomposition combined with logic-enriched retrieval can substantially improve performance on graduate-level science questions, achieving an average 13% improvement across benchmarks—without requiring additional model training or curated domain-specific corpora.
The framework is practical and broadly applicable: it uses only open-domain Wikipedia as the retrieval corpus, works across different model scales (8B to 24B parameters), and generalizes across physics, chemistry, biology, and engineering. The key insight—that retrieval queries should capture reasoning intent rather than surface-level domain similarity—opens a promising direction for improving LLM performance on complex reasoning tasks in science, mathematics, and beyond.