
Latent Preference Modeling for Cross-Session Personalized Tool Calling

arXiv 2026
Yejin Yoon*, Minseo Kim*, Taeuk Kim (*: equal contribution)

One-Line Summary

The paper introduces MPT, a 265-dialogue / 2,020-session benchmark for cross-session personalized tool calling across Preference Recall, Induction, and Transfer, and proposes PRefine, a generate–verify–refine memory that stores latent preferences as revisable hypotheses and consumes only 1.24% of full-history tokens while improving tool-calling accuracy.

Figure 1. PRefine maintains an abstract preference memory that is grounded to API arguments at inference time, enabling preference-driven tool calls across sessions and domains.

Background & Motivation

LLM-based agents increasingly rely on external tools that demand fully specified arguments, yet real users habitually under-specify their requests. Completing such incomplete requests requires more than retrieving similar past actions — it requires abstracting implicit, persistent constraints that guide decision-making across contexts and sessions.

Prior personalization work assumes preferences are either directly available (as static profiles or predefined instructions) or emerge through explicit repeated actions inside a narrow domain. Modern agents, however, operate over diverse task spaces without predefined preference specifications. While LLMs can surface similar past actions easily, they struggle to abstract cross-domain regularities into generalizable preference hypotheses that transfer to new contexts.

Motivating example: A user who consistently picks budget-friendly options across flights, restaurants, and hotels reveals a latent preference for economy-class travel — even without ever stating it. This constraint is distributed across sessions and domains, emerges only from accumulated evidence, and must be refined as new behaviors are observed.

This motivates two open questions that drive the paper: (i) how should such latent preferences be represented and maintained efficiently, and (ii) can they transfer across domains and even across differing API schemas?

Proposed Method: PRefine

PRefine treats latent preferences not as static facts to retrieve but as revisable hypotheses of preference constraints. It maintains a single compact memory entry — the best current abstraction of behavioral regularities — and updates it through each new session via a generate–verify–refine loop.

1. Generate: At session T+1, the generator proposes candidate preference hypotheses from the current dialogue, the executed API calls, and the prior memory M_T. Candidates span multiple abstraction levels, from slot-level reuses to cross-domain behavioral signals.
2. Verify: Each hypothesis is judged against four criteria: Evidence Support (grounded in multiple consistent interactions), Abstraction Quality (generalizes beyond a single event or slot restatement), Actionability (can meaningfully bias future API argument selection), and Temporal Consistency (compatible with recent stable behavior).
3. Refine: Rejected hypotheses return to the generator with verifier feedback for revision. The loop iterates up to three times (experiments show additional iterations yield no consistent gain), and the accepted hypothesis becomes the new memory M_{T+1}.
4. Schema-Agnostic Memory: Crucially, memory stores abstract preference constraints (e.g., "budget-conscious interaction style") rather than schema-specific API signatures. Grounding to concrete arguments happens at inference time, so memory built under one API schema remains applicable when slot names or argument inventories differ at test time.
5. Inference: At session T+1, the inference LLM conditions on the current query q, the retained memory M_T, and the target-domain schema. Abstract preferences are grounded into concrete API arguments through LLM reasoning, jointly handling preference expression and schema adaptation.
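Schema-agnostic grounding can be illustrated with a small sketch. In PRefine this step is performed by LLM reasoning at inference time; the lookup table and the slot names below (`price_range`, `flight_class`, `car_type`) are hypothetical stand-ins chosen to show how one abstract memory entry realizes differently under different schemas.

```python
# Abstract preference constraint retained in memory (schema-free).
MEMORY = "economical and simple options across domains"

# Toy per-schema realizations of the same "budget" preference.
BUDGET_VALUES = {
    "price_range": "Cheap",      # restaurant-style schema
    "flight_class": "Economy",   # flight-style schema
    "car_type": "Compact",       # rental-car-style schema
}

def ground(memory, query_args, schema_slots):
    """Fill slots the user left unspecified, biased by the preference."""
    call = dict(query_args)
    for slot in schema_slots:
        if slot not in call and "econom" in memory and slot in BUDGET_VALUES:
            call[slot] = BUDGET_VALUES[slot]
    return call

# The same memory grounds to different arguments under differing schemas:
print(ground(MEMORY, {"destination": "Seoul"},
             ["destination", "flight_class"]))
# -> {'destination': 'Seoul', 'flight_class': 'Economy'}
```

Because the memory entry never names a slot, it survives the schema shifts that the Preference Transfer challenge below is designed to test.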

Worked example. Session 1 observes GetMovies(average_rating=6); the initial "prefers moderately rated movies" hypothesis fails verification as over-specific and is refined to "minimal interest in movies". Later sessions with GetRentalCars(car_type=Standard) and GetRestaurants(price_range=Cheap) lead to the cross-domain abstraction "economical and simple options across domains," which is retained as memory and later grounded when booking flights.

MPT Benchmark

MPT (Multi-Session Personalized Tool Calling) targets cross-session personalization directly. It is built on top of Schema-Guided Dialogue (SGD) with multi-session grouping and manual preference annotations.

| Statistic | Value |
|---|---|
| Multi-session dialogues | 265 |
| Total sessions | 2,020 |
| Total turns | 39,884 |
| Avg. sessions per dialogue | 7.6 |
| Avg. turns per session | 19.7 |
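The two averages follow directly from the raw counts, which a quick check confirms:

```python
# Consistency check on the reported MPT statistics.
dialogues, sessions, turns = 265, 2020, 39884

assert round(sessions / dialogues, 1) == 7.6   # avg. sessions per dialogue
assert round(turns / sessions, 1) == 19.7      # avg. turns per session
print("statistics are internally consistent")
```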

MPT defines three evaluation challenges that isolate distinct personalization abilities:

1. Preference Recall (332 instances): The missing argument can be filled by direct reuse of previous identical choices within the same domain (e.g., repeatedly choosing flight_class=Economy).
2. Preference Induction (293 instances): Direct reuse is insufficient; the model must aggregate behavioral evidence spanning different tasks and domains and instantiate a latent preference as concrete argument values.
3. Preference Transfer (472 instances): No in-domain evidence exists for the target argument; a preference observed in other domains (e.g., budget-conscious restaurant and hotel choices) must be transferred to guide selection in a new domain (e.g., flights).

Each instance is evaluated under two query types: context-guided queries include in-session dialogue providing partial explicit constraints, while context-free queries omit context to isolate pure preference modeling capability. Preferences are manually grouped over 58 API–argument pairs (e.g., Budget: price_range=cheap, car_type=Compact, flight_class=Economy; Travel group size: passengers=1 vs. passengers=2-4). Nineteen human annotators achieved 89.7% agreement on budget groupings and 97.4% on travel groupings.
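A hypothetical fragment of such a grouping is shown below, using the example argument values named above. The API names (`GetRestaurants`, `GetRentalCars`, `SearchFlights`) are illustrative assumptions, and the full 58-pair grouping is not reproduced here.

```python
# Hypothetical fragment of the manual preference grouping: concrete
# (API, argument) -> value assignments mapped to the latent preference
# group they express.
PREFERENCE_GROUPS = {
    "Budget": {
        ("GetRestaurants", "price_range"): "Cheap",
        ("GetRentalCars", "car_type"): "Compact",
        ("SearchFlights", "flight_class"): "Economy",
    },
    "Travel group size (solo)": {
        ("SearchFlights", "passengers"): "1",
    },
}

def group_of(api, slot, value):
    """Return the preference group a concrete choice belongs to, if any."""
    for group, members in PREFERENCE_GROUPS.items():
        if members.get((api, slot)) == value:
            return group
    return None

print(group_of("GetRentalCars", "car_type", "Compact"))  # -> Budget
```

Grouping choices in different APIs under one latent label is what lets the benchmark score a flight-class prediction as consistent with earlier budget-conscious restaurant choices.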

Experimental Results

Eight inference LLMs were evaluated (CodeGemma-7B, Gemma-3-12B, R1-Distill-Llama-8B, R1-Distill-Qwen-7B, GPT-4o-mini, GPT-5-mini, GPT-5, Gemini-3-Flash), each paired with three memory baselines (RAG, Mem0, LangMem) and with PRefine.

Context-Guided Queries — Average Gains over Base Prompting

| Challenge | Base P-EM | Base OA-F1 | PRefine ΔP-EM | PRefine ΔOA-F1 |
|---|---|---|---|---|
| Preference Recall | 33.51% | 54.94% | +13.11 | +11.99 |
| Preference Induction | 16.89% | 51.46% | +6.88 | +10.52 |
| Preference Transfer | 7.81% | 44.81% | +2.87 | +9.27 |

Context-Free Queries — Average F1 Gains

| Challenge | Base F1 | PRefine ΔF1 |
|---|---|---|
| Preference Recall | 33.62% | +9.82 |
| Preference Induction | 13.41% | +5.20 |
| Preference Transfer | 22.25% | +3.38 |
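As a rough guide to how such scores are computed, the sketch below implements exact match and F1 over predicted argument sets as these metrics are commonly defined for tool calling; the paper's precise definitions of P-EM and OA-F1 may differ in details, so treat this as an assumed reading.

```python
def p_em(pred: dict, gold: dict) -> float:
    """Parameter exact match: 1.0 only if every argument matches."""
    return float(pred == gold)

def oa_f1(pred: dict, gold: dict) -> float:
    """F1 over individual (name, value) argument pairs."""
    tp = len(set(pred.items()) & set(gold.items()))
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {"destination": "Seoul", "flight_class": "Economy"}
pred = {"destination": "Seoul", "flight_class": "Business"}
print(p_em(pred, gold))   # -> 0.0
print(oa_f1(pred, gold))  # -> 0.5
```

The example shows why the OA-F1 columns sit well above the P-EM columns: a call that gets one of two arguments right earns partial F1 credit but zero exact-match credit.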

Memory Design Comparison

| Method | Memory Type | Update Mechanism | Actionable | Latent-Preference-Aware |
|---|---|---|---|---|
| RAG | Raw utterances | Static index | ✗ | ✗ |
| Mem0 | Extracted facts | Append / overwrite | ✓ | ✗ |
| LangMem | Structured facts | LLM rewrite | ✓ | ✗ |
| PRefine | Latent constraints | Generate–verify–refine | ✓ | ✓ |

Why It Matters

Personal agents will increasingly be judged not by isolated tool calls but by how well they accumulate understanding of a user across many sessions. This paper argues that effective personalization depends on capturing the reasons behind user choices, not just the choices themselves, and shows that a compact, revisable hypothesis is a far more scalable representation than raw history.
